Monitoring of Global-Level Objects in a Chassis Cluster
Use Feature Explorer to confirm platform and release support for specific features.
Review the Platform-Specific Monitoring Objects Behavior section for notes related to your platform.
There are various types of objects to monitor as you work with devices configured as chassis clusters, including global-level objects and objects that are specific to redundancy groups. This section describes the monitoring of global-level objects.
Understanding SPU Monitoring
SPU monitoring tracks the health of the SPUs and of the central point (CP). The chassis manager on each SPC monitors the SPUs and the central point, and also maintains the heartbeat with the Routing Engine chassisd. In this hierarchical monitoring system, chassisd is the center for hardware failure detection. SPU monitoring is enabled by default.
Persistent SPU and central point failure on a node is deemed a catastrophic Packet Forwarding Engine (PFE) failure. In this case, the node's PFE is disabled in the cluster by reducing the priorities of redundancy groups x to 0.
A central point failure triggers failover to the secondary node. The failed node's PFE, which includes all SPCs and all I/O cards (IOCs), is automatically restarted. If the secondary central point has failed as well, the cluster is unable to come up because there is no primary device. Only the data plane (redundancy group x) is failed over.
A single failed SPU causes failover of redundancy group x to the secondary node, and all IOCs and SPCs on the failed node are restarted. Failover to the secondary node is automatic and requires no user intervention. When the failing component on the failed (former) primary node is restored, failback is determined by the preempt configuration for redundancy group x. The interval for dead SPU detection is 30 seconds.
This event triggers an alarm, indicating that a new field-replaceable unit (FRU) is needed.
Understanding flowd Monitoring
Flowd monitoring tracks the health of the flowd process. Flowd monitoring is enabled by default.
Persistent flowd failure on a node is deemed a catastrophic Packet Forwarding Engine (PFE) failure. In this case, the node's PFE is disabled in the cluster by reducing the priorities of redundancy groups x to 0.
A failed flowd process causes failover of redundancy group x to the secondary node. Failover to the secondary node is automatic without the need for user intervention. When the failed (former) primary node has its failing component restored, failback is determined by the preempt configuration for the redundancy group x.
During SPC and flowd monitoring failures on a local node, the data plane redundancy group RG1+ fails over to the other node that is in a good state. However, the control plane RG0 does not fail over and remains primary on the same node as it was before the failure.
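The failover behavior described above can be sketched as a small model. This is an illustrative simplification, not the jsrpd election logic: the `Node` class, the `primary_for` helper, and the priority values 200 and 100 are hypothetical, and preempt, interface monitoring, and manual failover are ignored. It shows only that a data plane failure drops the priority of redundancy groups 1+ to 0 while RG0 keeps its priority.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    configured: dict  # redundancy group ID -> configured priority (hypothetical values)
    data_plane_ok: bool = True  # False models a persistent SPU or flowd failure

    def effective_priority(self, rg: int) -> int:
        # Only the data plane groups (RG1+) are reduced to 0 on an SPU or
        # flowd failure; the control plane group (RG0) is unaffected.
        if rg >= 1 and not self.data_plane_ok:
            return 0
        return self.configured[rg]


def primary_for(rg: int, current_primary: Node, peer: Node) -> Node:
    # Toy rule: the current primary keeps the role unless its effective
    # priority is 0 and the peer is still healthy for this group.
    if current_primary.effective_priority(rg) == 0 and peer.effective_priority(rg) > 0:
        return peer
    return current_primary


node0 = Node("node0", {0: 200, 1: 200})
node1 = Node("node1", {0: 100, 1: 100})
node0.data_plane_ok = False  # for example, a failed SPU or flowd on node 0

rg0_primary = primary_for(0, node0, node1)
rg1_primary = primary_for(1, node0, node1)
print(rg0_primary.name, rg1_primary.name)  # node0 node1
```

In this sketch RG1 moves to node 1 while RG0 stays on node 0, which matches the statement that only the data plane fails over.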
Understanding Cold-Sync Monitoring
The process of synchronizing the data plane runtime objects (RTOs) on the startup of the SPUs or flowd is called cold sync. When all the RTOs are synchronized, the cold-sync process is complete, and the SPU or flowd on the node is ready to take over for the primary node, if needed. The process of monitoring the cold-sync state of all the SPUs or flowd on a node is called cold-sync monitoring. Keep in mind that when preempt is enabled, cold-sync monitoring prevents the node from taking over the primary role until the cold-sync process is completed for the SPUs or flowd on the node. Cold-sync monitoring is enabled by default.
When the node is rebooted, or when the SPUs or flowd come back up from failure, the priority for all the redundancy groups 1+ is 0. When an SPU or flowd comes up, it tries to start the cold-sync process with its mirror SPU or flowd on the other node.
If this is the only node in the cluster, the priorities for all the redundancy groups 1+ stay at 0 until a new node joins the cluster. Although the priority is at 0, the device can still receive and send traffic over its interfaces. A priority of 0 implies that it cannot fail over in case of a failure. When a new node joins the cluster, all the SPUs or flowd, as they come up, will start the cold-sync process with the mirror SPUs or flowd of the existing node.
When the SPU or flowd of a node that is already up detects the cold-sync request from the SPU or flowd of the peer node, it posts a message to the system indicating that the cold-sync process is complete. The SPUs or flowd of the newly joined node post a similar message, but only after all the RTOs are learned and cold sync is complete. On receipt of completion messages from all the SPUs or flowd, the priority for redundancy groups 1+ moves to the configured priority on each node if there are no other failures of monitored components, such as interfaces. This ensures that the existing primary node for redundancy groups 1+ always moves to its configured priority first. The node that joins the cluster later moves to its configured priorities only after all its SPUs or flowd have completed their cold-sync process, which in turn guarantees that the newly added node is ready with all the RTOs before it takes over the primary role.
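The priority handling during cold sync can be illustrated with a minimal sketch. The `ColdSyncMonitor` class, the SPU names, and the configured priority of 100 are assumptions made for illustration; the model ignores other monitored components such as interfaces and does not represent how the system actually exchanges completion messages.

```python
class ColdSyncMonitor:
    """Toy model: hold the RG1+ priority at 0 until every SPU reports cold sync done."""

    def __init__(self, spu_ids, configured_priority):
        self.pending = set(spu_ids)  # SPUs (or flowd) that have not finished cold sync
        self.configured_priority = configured_priority

    def report_complete(self, spu_id):
        # Called when one SPU (or flowd) posts its cold-sync completion message.
        self.pending.discard(spu_id)

    def priority(self):
        # The priority stays at 0 while any SPU is still synchronizing.
        return 0 if self.pending else self.configured_priority


monitor = ColdSyncMonitor(spu_ids=["spu0", "spu1"], configured_priority=100)
before = monitor.priority()
monitor.report_complete("spu0")
partial = monitor.priority()
monitor.report_complete("spu1")
after = monitor.priority()
print(before, partial, after)  # 0 0 100
```

The priority moves to the configured value only after the last completion message arrives, so a node in this model cannot become primary for redundancy groups 1+ with partially learned RTOs.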
Understanding Cold-Sync Monitoring with SPU Replacement or Expansion
If your SRX5600 or SRX5800 Firewall is part of a chassis cluster and you replace a Services Processing Card (SPC) with an SPC2 or an SPC3 on the device, you must fail over all redundancy groups to one node.
The following events take place during this scenario:
When the SPC2 is installed on a node (for example, on node 1, the secondary node), node 1 is shut down so the SPC2 can be installed.
Once node 1 is powered up and rejoins the cluster, the number of SPUs on node 1 will be higher than the number of SPUs on node 0, the primary node. Now, one node (node 0) still has an old SPC while the other node has the new SPC2; SPC2s have four SPUs per card, and the older SPCs have two SPUs per card.
The cold-sync process is based on the total number of SPUs on node 0. Once the SPUs on node 1 that correspond to node 0 SPUs have completed cold sync, node 1 declares cold sync complete. Because the additional SPUs on node 1 have no corresponding SPUs on node 0, there is nothing to synchronize for them, and failover from node 0 to node 1 does not cause any issue.
The SPU monitoring function monitors all SPUs and reports any SPU failure.
For example, assume that both nodes originally have two SPCs and that you have replaced both SPCs with SPC2s on node 1. Node 0 now has 4 SPUs and node 1 has 8 SPUs, and the SPU monitoring function monitors the 4 SPUs on node 0 and the 8 SPUs on node 1. If any of the 8 SPUs on node 1 fails, SPU monitoring still reports the SPU failure to the Juniper Services Redundancy Protocol (jsrpd) process. The jsrpd process controls chassis clustering.
Once node 1 is ready to fail over, you can manually fail over all redundancy groups to node 1. Node 0 is then shut down so that its SPC can be replaced with the SPC2. After the replacement, node 0 and node 1 have exactly the same hardware setup.
Once node 0 is powered up and rejoins the cluster, the system will operate as a normal chassis cluster.
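The SPU arithmetic and the cold-sync completion rule in the steps above can be sketched as follows. The per-card SPU counts come from the text; numbering SPUs with integer indexes and the `cold_sync_complete` helper are assumptions made for illustration, not the device's internal representation.

```python
SPUS_PER_CARD = {"SPC": 2, "SPC2": 4}  # per-card SPU counts stated in the text


def spu_count(cards):
    # Total SPUs on a node, given the list of installed card types.
    return sum(SPUS_PER_CARD[card] for card in cards)


def cold_sync_complete(synced_spus, peer_spu_total):
    # Assumption: SPUs are indexed 0..n-1, and only the SPUs that have a
    # counterpart on the peer node need to finish cold sync.
    required = set(range(peer_spu_total))
    return required <= set(synced_spus)


node0_spus = spu_count(["SPC", "SPC"])    # old cards on node 0
node1_spus = spu_count(["SPC2", "SPC2"])  # new cards on node 1
extra_spus = node1_spus - node0_spus      # SPUs on node 1 with no counterpart

done = cold_sync_complete({0, 1, 2, 3}, node0_spus)
not_done = cold_sync_complete({0, 1, 2}, node0_spus)
print(node0_spus, node1_spus, extra_spus, done, not_done)  # 4 8 4 True False
```

In this sketch node 1 declares cold sync complete once its first four SPUs are synchronized; the four additional SPUs have nothing to synchronize, although SPU monitoring still covers all eight.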
If the control link goes down while the cold-sync process is still in progress on an SRX Series Firewall in a chassis cluster, expect a delay of 30 seconds before the node transitions from the secondary state to the primary state.
Platform-Specific Monitoring Objects Behavior
Use the following table to review platform-specific behaviors on your platform.
Platform | Difference |
---|---|
SRX Series | |