Understanding Failover

Chassis cluster employs a number of highly efficient failover mechanisms that promote high availability to increase your system's overall reliability and productivity.

Before You Begin

For background information, read:

About Redundancy Group Failover

A redundancy group is a collection of objects that fail over as a group. Each redundancy group monitors a set of objects (physical interfaces), and each monitored object is assigned a weight. Each redundancy group starts with a threshold of 255. When a monitored object fails, its weight is subtracted from the redundancy group's threshold. When the threshold reaches zero, the redundancy group fails over to the other node, and all the objects associated with the redundancy group fail over with it. Graceful restart of the routing protocols enables the SRX Series device to minimize traffic disruption during a failover. For more information, see Redundancy Group Interface Monitoring.
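
The threshold arithmetic described above can be sketched in a few lines of Python. This is an illustrative model only, not device code: the class name, the interface names, and the weights of 130 are hypothetical, and flooring the threshold at zero is an assumption made so that weights summing to more than 255 still read as "reached zero".

```python
from dataclasses import dataclass, field

INITIAL_THRESHOLD = 255  # initial threshold stated in the text


@dataclass
class RedundancyGroup:
    """Toy model of one redundancy group's monitoring threshold."""
    weights: dict[str, int]                        # monitored interface -> weight
    failed: set[str] = field(default_factory=set)  # interfaces currently down

    def threshold(self) -> int:
        # Subtract the weight of every failed object (assumption: floor at zero).
        lost = sum(self.weights[name] for name in self.failed)
        return max(INITIAL_THRESHOLD - lost, 0)

    def should_fail_over(self) -> bool:
        # The group fails over once the threshold reaches zero.
        return self.threshold() == 0


# Hypothetical interfaces and weights, chosen so that one failure is not enough.
rg1 = RedundancyGroup(weights={"ge-0/0/1": 130, "ge-0/0/2": 130})
rg1.failed.add("ge-0/0/1")
print(rg1.threshold(), rg1.should_fail_over())  # 125 False
rg1.failed.add("ge-0/0/2")
print(rg1.threshold(), rg1.should_fail_over())  # 0 True
```

With these weights, a single interface failure leaves the group on its current node, and only the second failure triggers a failover. Assigning a weight of 255 to one interface would make that interface alone sufficient.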

Because back-to-back redundancy group 0 failovers that occur too quickly can cause a cluster to exhibit unpredictable behavior, a dampening time between failovers is needed. On a failover, the previous primary node moves to the secondary-hold state and stays there until the hold-down interval expires, after which it moves to the secondary state.
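
As a rough illustration of the hold-down behavior just described, the state of the previous primary node can be written as a function of the time elapsed since the failover. The function name and parameters are hypothetical, and treating the interval as expired at exactly `hold_down_interval` seconds is an assumption.

```python
def previous_primary_state(seconds_since_failover: float,
                           hold_down_interval: float) -> str:
    """Return the state of the node that was primary before the failover."""
    # The node waits in secondary-hold until the hold-down interval expires.
    if seconds_since_failover < hold_down_interval:
        return "secondary-hold"
    # After expiry it moves to the ordinary secondary state.
    return "secondary"


# Using the 300-second interval as an example value.
for elapsed in (0, 120, 300, 301):
    print(elapsed, previous_primary_state(elapsed, 300))
```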

The default dampening time for redundancy group 0 is 300 seconds (5 minutes) and is configurable up to 1800 seconds with the hold-down-interval statement. Redundancy groups x (redundancy groups numbered 1 through 128) have a default dampening time of 1 second, with a range of 0 through 1800 seconds. The hold-down interval applies to manual failovers as well as to automatic failovers triggered by monitoring failures.
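
The defaults and ranges above can be captured in a small validation helper. This is a sketch under stated assumptions: the function and constant names are hypothetical, and the 300-second lower bound for redundancy group 0 is taken from the "minimum of 300 seconds" stated in this topic rather than from any CLI reference.

```python
RG0_DEFAULT = 300    # seconds, default for redundancy group 0
RG0_MIN = 300        # seconds, minimum for group 0 as stated in this topic
RGX_DEFAULT = 1      # seconds, default for redundancy groups 1 through 128
MAX_INTERVAL = 1800  # seconds, upper bound for every group
MAX_GROUP = 128      # highest redundancy group number


def default_hold_down(group: int) -> int:
    """Return the default hold-down interval for a redundancy group number."""
    if group == 0:
        return RG0_DEFAULT
    if 1 <= group <= MAX_GROUP:
        return RGX_DEFAULT
    raise ValueError(f"redundancy group {group} is outside 0..{MAX_GROUP}")


def check_hold_down(group: int, seconds: int) -> int:
    """Return seconds unchanged if it is in range for the group, else raise."""
    default_hold_down(group)  # reuse the group-number check
    low = RG0_MIN if group == 0 else 0
    if not low <= seconds <= MAX_INTERVAL:
        raise ValueError(
            f"hold-down interval {seconds}s is outside {low}..{MAX_INTERVAL}s "
            f"for redundancy group {group}"
        )
    return seconds


print(default_hold_down(0), default_hold_down(1))
print(check_hold_down(1, 0), check_hold_down(0, 600))
```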

About Manual Failover

You can initiate a redundancy group x failover manually. A manual failover applies until a failback event occurs.

For example, suppose that you manually fail over redundancy group 1 from node 0 to node 1. An interface that redundancy group 1 is monitoring then fails, dropping the threshold value of redundancy group 1 on the new primary node to zero. This event is considered a failback event, and the system returns control of the redundancy group to the original node (node 0).
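
The example scenario can be traced with a toy two-node model. Everything here is hypothetical (the class, its fields, and its method names), and the model assumes that a failback simply hands the group to the other node and clears the manual failover.

```python
from dataclasses import dataclass


@dataclass
class GroupState:
    """Toy record of which node is primary for one redundancy group."""
    primary: int          # node that is currently primary (0 or 1)
    manual: bool = False  # True while a manual failover is in effect

    def manual_failover(self, target_node: int) -> None:
        # An operator moves the group to the target node.
        self.primary = target_node
        self.manual = True

    def threshold_reached_zero(self) -> None:
        # A monitoring failure on the current primary acts as a failback
        # event: the other node takes over and the manual failover ends.
        self.primary = 1 - self.primary
        self.manual = False


rg1 = GroupState(primary=0)
rg1.manual_failover(target_node=1)
print(rg1)  # primary is node 1, manual failover in effect
rg1.threshold_reached_zero()
print(rg1)  # primary is node 0 again, manual failover cleared
```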

You can also initiate a redundancy group 0 failover manually if you want to change the primary node for redundancy group 0. You cannot enable preemption for redundancy group 0.

When you do a manual failover for redundancy group 0, the node in the primary state transitions to the secondary-hold state. The node will stay in the secondary-hold state for the default or configured time (a minimum of 300 seconds) and then transition to the secondary state.

State transitions in cases where one node is in the secondary-hold state and the other node reboots, or the control link connection or fabric link connection is lost to that node, are described as follows:

Keep in mind that during an in-service software upgrade (ISSU), the transitions described above do not happen. Instead, the other (primary) node transitions directly to the secondary state because Junos OS releases earlier than 10.0 do not interpret the secondary-hold state. When you start an ISSU, if one of the nodes has one or more redundancy groups in the secondary-hold state, you must wait for them to move to the secondary state before you can perform manual failovers to make all the redundancy groups primary on one node. For more information about ISSUs, see Low-Impact ISSU Chassis Cluster Upgrades.

Caution: Be cautious and judicious in your use of redundancy group 0 manual failovers. A redundancy group 0 failover implies a Routing Engine failover, in which case all processes running on the primary node are killed and then spawned on the new master Routing Engine. This failover could result in loss of state, such as routing state, and degrade performance by introducing system churn.

