Troubleshooting an SRX Chassis Cluster with One Node in the Primary State and the Other Node in the Lost State

28-Nov-23

Problem

Description

The nodes of the SRX chassis cluster are in primary and lost states.

Environment

SRX chassis cluster

Symptoms

One node of the cluster is in the primary state and the other node is in the lost state. Run the show chassis cluster status command on each node to view the status of the node. Here is a sample output:

```
{primary:node0}
root@primary-srx> show chassis cluster status
Cluster ID: 1
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 1
    node0                   100         primary        no       no
    node1                   0           lost           no       no

Redundancy group: 1 , Failover count: 1
    node0                   100         primary        no       no
    node1                   0           lost           no       no
```
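If you collect this output in a script, the per-node status can be pulled out with a small parser. The following is a minimal Python sketch, not a Juniper tool: the function names `parse_cluster_status` and `lost_nodes` are hypothetical, and the regular expressions assume the column layout of the sample output above.

```python
import re

# Sample text copied from the output shown above (prompt lines omitted).
SAMPLE_OUTPUT = """\
Cluster ID: 1
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 1
    node0                   100         primary        no       no
    node1                   0           lost           no       no

Redundancy group: 1 , Failover count: 1
    node0                   100         primary        no       no
    node1                   0           lost           no       no
"""

# Assumed layout: a "Redundancy group: N" header followed by one row per node.
GROUP_RE = re.compile(r"^Redundancy group:\s*(\d+)")
NODE_RE = re.compile(r"^\s*(node\d+)\s+(\d+)\s+(\S+)")


def parse_cluster_status(text):
    """Return {redundancy_group: {node_name: status}} from the CLI text."""
    groups = {}
    current = None
    for line in text.splitlines():
        group_match = GROUP_RE.match(line)
        if group_match:
            current = int(group_match.group(1))
            groups[current] = {}
            continue
        node_match = NODE_RE.match(line)
        if node_match and current is not None:
            groups[current][node_match.group(1)] = node_match.group(3)
    return groups


def lost_nodes(groups):
    """Return the sorted names of nodes reported as lost in any group."""
    return sorted(
        {
            node
            for nodes in groups.values()
            for node, status in nodes.items()
            if status == "lost"
        }
    )


status = parse_cluster_status(SAMPLE_OUTPUT)
print(lost_nodes(status))
```

Run against the sample above, the sketch reports node1 as the lost node. Output from other Junos releases may be laid out differently, so check the patterns against your own output before relying on them.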

Diagnosis

1. Is the node that is in the lost state powered on?
- Yes: Are you able to access the node that is in the lost state through a console port? Do not use Telnet or SSH to access the node.
  - If you are able to access the node, proceed to Step 3.
  - If you are unable to access the node and the device is at a remote location, arrange console access to the node for further troubleshooting. If you have console access but do not see any output, this might indicate a hardware issue. Open a case with your technical support representative for further troubleshooting. See Data Collection for Customer Support.
- No: Power on the node and proceed to Step 2.
2. After both nodes are powered on, run the show chassis cluster status command again. Is the node still in the lost state?
- Yes: Are you able to access the node that is in the lost state through a console port? Do not use Telnet or SSH to access the node.
  - If you are able to access the node, proceed to Step 3.
  - If you are unable to access the node and the node is at a remote location, arrange console access to the node for further troubleshooting. If you have console access but do not see any output, this might indicate a hardware issue. Open a case with your technical support representative for further troubleshooting. See Data Collection for Customer Support.
- No: Powering on the node resolved the issue.
3. Connect a console to the primary node, and run the show chassis cluster status command. Does the output show this node as primary and the other node as lost?
- Yes: This might indicate a split-brain scenario, in which each node shows itself as primary and the other node as lost. Run the following commands to verify which node is processing the traffic:
  - show security monitoring
  - show security flow session summary
  - monitor interface traffic
  Isolate the node that is not processing the traffic from the network by removing all of its cables except the control and fabric links. Proceed to Step 4.
- No: Proceed to Step 4.
4. Verify that all the FPCs are online on the node that is in the lost state by running the show chassis fpc pic-status command. Are all the FPCs online?
- Yes: Proceed to Step 5.
- No: Open a case with your technical support representative for further troubleshooting. See Data Collection for Customer Support.
5. Are the nodes connected through a switch?
- Yes: See Troubleshooting a Fabric Link Failure in an SRX Chassis Cluster and Troubleshooting a Control Link Failure in an SRX Chassis Cluster.
- No: Proceed to Step 6.
6. Create a backup of the configuration from the node that is currently primary:
```
{primary:node0} 
root@primary-srx> show configuration | save /var/tmp/cfg-bkp.txt
```
Copy the configuration to the node that is in the lost state, and load the configuration:
```
root@lost-srx# load override <terminal or filename>
```
Note:
If you use the terminal option, paste the complete configuration into the window. Make sure that you use Ctrl+D at the end of the configuration.

If you use the filename option, provide the path to the configuration file that you copied to this node (for example: /var/tmp/cfg-bkp.txt), and press Enter.

When you connect to the node in the lost state through a console, you might see the state as either primary or hold/disabled. If the node is in the hold/disabled state, a fabric link failure might have occurred before the device went into the lost state. To troubleshoot this issue, follow the steps in Troubleshooting a Fabric Link Failure in an SRX Chassis Cluster.

Commit the changes after the configuration is loaded. If the problem persists, then replace the existing control and fabric links on this device with new cables and reboot the node:
```
{primary:node1}
root@lost-srx> request system reboot
```
Is the issue resolved?
- No: Open a case with your technical support representative for further troubleshooting. See Data Collection for Customer Support.
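
The branching in the diagnosis above can be summarized as a short decision function. The following is a minimal Python sketch for illustration only: the `Observations` fields and the `diagnosis_plan` function are hypothetical names, the answers are assumed to be gathered manually by the operator, and the returned strings paraphrase the steps rather than replace them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observations:
    """Answers an operator collects while working through the diagnosis."""
    lost_node_powered_on: bool
    console_shows_output: bool = True       # console to the lost node responds
    both_claim_primary: bool = False        # each node shows itself as primary
    all_fpcs_online: bool = True            # from show chassis fpc pic-status
    connected_through_switch: bool = False  # control/fabric links via a switch


def diagnosis_plan(obs: Observations) -> List[str]:
    """Return the ordered actions suggested by the diagnosis steps."""
    if not obs.lost_node_powered_on:
        return ["Power on the node and rerun show chassis cluster status"]
    if not obs.console_shows_output:
        return ["Possible hardware issue: open a support case"]

    plan = []
    if obs.both_claim_primary:
        # Split-brain does not end the procedure; it adds an isolation step.
        plan.append("Possible split-brain: isolate the node that is not processing traffic")
    if not obs.all_fpcs_online:
        plan.append("FPCs offline on the lost node: open a support case")
        return plan
    if obs.connected_through_switch:
        plan.append("Follow the fabric link and control link failure procedures")
        return plan
    plan.append("Back up the primary configuration, load it on the lost node, and commit")
    plan.append("If the problem persists, replace the control and fabric cables and reboot the node")
    return plan


for action in diagnosis_plan(
    Observations(lost_node_powered_on=True, both_claim_primary=True)
):
    print(action)
```

The function returns a list rather than a single action because a suspected split-brain adds an isolation step and then continues with the FPC and cabling checks.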


Chassis Cluster User Guide for SRX Series Devices
