Configuring Cluster Failover Parameters
SRX Series devices in a chassis cluster use heartbeat transmissions to determine the “health” of the control link. If the number of missed heartbeats reaches the configured threshold, the system assesses whether a failure condition exists. For more information, see the following topics:
Understanding Chassis Cluster Control Link Heartbeats, Failure, and Recovery
- Understanding Chassis Cluster Control Link Heartbeats
- Understanding Chassis Cluster Control Link Failure and Recovery
Understanding Chassis Cluster Control Link Heartbeats
You specify the heartbeat threshold and heartbeat interval when you configure the chassis cluster.
The system monitors the control link's status by default.
For dual control links, which are supported on SRX5600 and SRX5800 lines, the Juniper Services Redundancy Protocol process (jsrpd) sends and receives the control heartbeat messages on both control links. As long as heartbeats are received on one of the control links, Junos OS considers the other node to be alive.
The product of the heartbeat-threshold option and the heartbeat-interval option defines the wait time before failover is triggered. The default values of these options produce a wait time of 3 seconds. A heartbeat-threshold of 5 and a heartbeat-interval of 1000 milliseconds would yield a wait time of 5 seconds. Setting the heartbeat-threshold to 4 and the heartbeat-interval to 1250 milliseconds would also yield a wait time of 5 seconds.
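The wait-time arithmetic can be sketched in a few lines of Python. This is only an illustration of the product described above; the function name failover_wait_seconds is hypothetical and is not part of Junos OS.

```python
def failover_wait_seconds(heartbeat_threshold: int, heartbeat_interval_ms: int) -> float:
    """Return the wait time before failover is triggered, in seconds.

    The wait time is the product of the heartbeat threshold (a count)
    and the heartbeat interval (in milliseconds).
    """
    if heartbeat_threshold <= 0 or heartbeat_interval_ms <= 0:
        raise ValueError("threshold and interval must be positive")
    return heartbeat_threshold * heartbeat_interval_ms / 1000


# Both examples from the text give a 5-second wait time.
print(failover_wait_seconds(5, 1000))  # 5.0
print(failover_wait_seconds(4, 1250))  # 5.0
```

The same wait time can be reached with different combinations, so the choice between a higher threshold and a longer interval is a tuning decision rather than a change in the total time before failover.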
In a chassis cluster environment that uses more than 1000 logical interfaces, we recommend that you increase the cluster heartbeat timers from the default wait time of 3 seconds. At maximum capacity on an SRX4600, SRX5400, SRX5600, or SRX5800 device, we recommend that you increase the configured time before failover to at least 5 seconds.
Understanding Chassis Cluster Control Link Failure and Recovery
If the control link fails, Junos OS changes the operating state of the secondary node to ineligible for a 180-second countdown. If the fabric link also fails during the 180 seconds, Junos OS changes the secondary node to primary; otherwise, after 180 seconds the secondary node state changes to disabled.
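The countdown behavior described above can be modeled as a small function. This is a simplified sketch, not the actual jsrpd logic: the function name, the parameter names, and the handling of the exact 180-second boundary are assumptions, and the model does not cover the control link being restored during the countdown.

```python
from typing import Optional

COUNTDOWN_SECONDS = 180  # countdown length stated in the text


def secondary_node_state(elapsed_seconds: float,
                         fabric_failed_at: Optional[float] = None) -> str:
    """Return the secondary node state after a control link failure.

    elapsed_seconds:  time since the control link failed.
    fabric_failed_at: time (since the control link failure) at which the
                      fabric link also failed, or None if it stayed up.
    """
    fabric_failed_in_countdown = (
        fabric_failed_at is not None
        and fabric_failed_at < COUNTDOWN_SECONDS  # boundary is an assumption
        and elapsed_seconds >= fabric_failed_at
    )
    if fabric_failed_in_countdown:
        return "primary"      # both links lost: the other node is presumed dead
    if elapsed_seconds < COUNTDOWN_SECONDS:
        return "ineligible"   # countdown still running
    return "disabled"         # countdown expired with the fabric link up


print(secondary_node_state(60))        # ineligible
print(secondary_node_state(200))       # disabled
print(secondary_node_state(100, 90))   # primary
```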
When the control link is down, a system log message is generated.
A control link failure is defined as not receiving heartbeats over the control link while heartbeats are still being received over the fabric link.
In the event of a legitimate control link failure, redundancy group 0 remains primary on the node on which it is currently primary, inactive redundancy groups x on the primary node become active, and the secondary node enters a disabled state.
When the secondary node is disabled, you can still log in to the management port and run diagnostics.
To determine if a legitimate control link failure has occurred, the system relies on redundant liveliness signals sent across both the control link and the fabric link.
The system periodically transmits probes over the fabric link and heartbeat signals over the control link. Probes and heartbeat signals share a common sequence number that maps them to a unique time event. Junos OS identifies a legitimate control link failure if the following two conditions exist:
The threshold number of heartbeats were lost.
At least one probe with a sequence number corresponding to that of a missing heartbeat signal was received on the fabric link.
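The two conditions above can be expressed as a short check. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, sequence numbers are modeled as plain integers, and the caller is assumed to pass only the heartbeats missed in the current run. It does not reproduce the internal jsrpd implementation.

```python
def is_legitimate_control_link_failure(missed_heartbeats,
                                       received_probes,
                                       heartbeat_threshold):
    """Apply the two conditions for a legitimate control link failure.

    missed_heartbeats:   sequence numbers of heartbeats not received
                         on the control link.
    received_probes:     sequence numbers of probes received on the
                         fabric link.
    heartbeat_threshold: configured number of missed heartbeats.
    """
    missed = set(missed_heartbeats)
    # Condition 1: the threshold number of heartbeats were lost.
    threshold_reached = len(missed) >= heartbeat_threshold
    # Condition 2: at least one received probe shares a sequence number
    # with a missing heartbeat, so the peer was alive at that time event.
    probe_matches = bool(missed & set(received_probes))
    return threshold_reached and probe_matches


# Peer still answers on the fabric link: the control link itself failed.
print(is_legitimate_control_link_failure([101, 102, 103], [100, 101, 102], 3))  # True
# No matching probes: the peer may be down, not just the control link.
print(is_legitimate_control_link_failure([101, 102, 103], [], 3))               # False
```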
If the control link fails, the 180-second countdown begins and the secondary node state is ineligible. If the fabric link fails before the 180-second countdown reaches zero, the secondary node becomes primary because the system interprets the loss of both links to mean that the other node is dead. Because concurrent loss of both control and fabric links means that the nodes are neither synchronizing states nor comparing priorities, both nodes might temporarily become primary, which is not a stable operating state. However, once the control link is reestablished, the node with the higher priority value automatically becomes primary, the other node becomes secondary, and the cluster returns to normal operation.
When a legitimate control link failure occurs, the following conditions apply:
Redundancy group 0 remains primary on the node on which it is currently primary (and thus its Routing Engine remains active), and all redundancy groups x on the node become primary.
If the system cannot determine which Routing Engine is primary, the node with the higher priority value for redundancy group 0 is primary and its Routing Engine is active. (You configure the priority for each node when you configure the redundancy-group statement for redundancy group 0.)
The system disables the secondary node.
To recover a device from the disabled mode, you must reboot the device. When you reboot the disabled node, the node synchronizes its dynamic state with the primary node.
If you make any changes to the configuration while the secondary node is disabled, execute the commit command to synchronize the configuration after you reboot the node. If you did not make configuration changes, the configuration file remains synchronized with that of the primary node.
You cannot enable preemption for redundancy group 0. If you want to change the primary node for redundancy group 0, you must do a manual failover.
When you use dual control links (supported on SRX5600 and SRX5800 devices), note the following conditions:
Host inbound or outbound traffic can be impacted for up to 3 seconds during a control link failure. For example, consider a case where redundancy group 0 is primary on node 0 and there is a Telnet session to the Routing Engine through a network interface port on node 1. If the currently active control link fails, the Telnet session will lose packets for 3 seconds, until this failure is detected.
A control link failure that occurs while the commit process is running across two nodes might lead to commit failure. In this situation, run the commit command again after 3 seconds.
For SRX5600 and SRX5800 devices, dual control links require a second Routing Engine on each node of the chassis cluster.
You can specify that control link recovery be done automatically by the system by setting the control-link-recovery statement. In this case, once the system determines that the control link is healthy, it issues an automatic reboot on the disabled node. When the disabled node reboots, the node joins the cluster again.
Example: Configuring Chassis Cluster Control Link Recovery
This example shows how to enable control link recovery, which allows the system to automatically take over after the control link recovers from a failure.
Requirements
Before you begin:
Understand chassis cluster control links. See Understanding Chassis Cluster Control Plane and Control Links.
Understand chassis cluster dual control links. See Understanding Chassis Cluster Dual Control Links.
Connect dual control links in a chassis cluster. See Dual Control Link Connections for SRX Series Firewalls in a Chassis Cluster.
Overview
You can enable the system to perform control link recovery automatically. After the control link recovers, the system takes the following actions:
It checks whether it receives at least three consecutive heartbeats on the control link or, in the case of dual control links (SRX5600 and SRX5800 devices only), on either control link. This is to ensure that the control link is not flapping and is healthy.
After it determines that the control link is healthy, the system issues an automatic reboot irrespective of the state of the node (ineligible or disabled) when the control link failed. When the node reboots, it can rejoin the cluster. There is no need for any manual intervention.
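The health check in the first action can be sketched as follows. This is an illustration only: the function names and the link labels link0 and link1 are hypothetical, and heartbeat reception is modeled as a list of booleans, oldest first.

```python
def control_link_is_healthy(heartbeats, required=3):
    """Return True once `required` consecutive heartbeats are received.

    heartbeats: iterable of booleans, oldest first; True means the
                heartbeat for that interval was received.
    """
    run = 0
    for received in heartbeats:
        run = run + 1 if received else 0  # a miss resets the run
        if run >= required:
            return True
    return False


def should_reboot_node(links, required=3):
    """With dual control links, a healthy run on either link is enough."""
    return any(control_link_is_healthy(h, required) for h in links.values())


# A flapping link never reaches three heartbeats in a row.
print(control_link_is_healthy([True, False, True, True]))   # False
print(control_link_is_healthy([False, True, True, True]))   # True
# Dual control links: link1 is healthy even though link0 is flapping.
print(should_reboot_node({"link0": [True, False, True],
                          "link1": [True, True, True]}))    # True
```

Requiring a consecutive run, rather than any three heartbeats, is what filters out a flapping link before the automatic reboot is issued.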
In this example, you enable chassis cluster control link recovery.
Configuration
Procedure
Step-by-Step Procedure
To enable chassis cluster control-link-recovery:
Enable control link recovery.
{primary:node0}[edit]
user@host# set chassis cluster control-link-recovery
If you are done configuring the device, commit the configuration.
{primary:node0}[edit]
user@host# commit