Troubleshooting an SRX Chassis Cluster with One Node in the Primary State and the Other Node in the Lost State
Problem
Description
The nodes of the SRX chassis cluster are in primary and lost states.
Environment
SRX chassis cluster
Symptoms
One node of the cluster is in the primary state and the other node is in the lost state. Run the show chassis cluster status command on each node to view the status of the node. Here is a sample output:
{primary:node0}
root@primary-srx> show chassis cluster status
Cluster ID: 1
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 1
    node0                   100         primary        no       no
    node1                   0           lost           no       no

Redundancy group: 1 , Failover count: 1
    node0                   100         primary        no       no
    node1                   0           lost           no       no
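If you collect this output from several clusters, checking it by eye becomes tedious. The following is a minimal sketch, not a Juniper tool, of how the text could be scanned for nodes in the lost state. It assumes the output has been saved as plain text with one line per node, as in the sample above; the names parse_cluster_status and lost_nodes are hypothetical.

```python
import re

# Sample text copied from the output shown above.
SAMPLE = """\
{primary:node0}
root@primary-srx> show chassis cluster status
Cluster ID: 1
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 1
    node0                   100         primary        no       no
    node1                   0           lost           no       no

Redundancy group: 1 , Failover count: 1
    node0                   100         primary        no       no
    node1                   0           lost           no       no
"""

# Assumed line shapes: a "Redundancy group: N" header followed by
# one "nodeN <priority> <status> ..." line per node.
GROUP_RE = re.compile(r"^Redundancy group:\s*(\d+)")
NODE_RE = re.compile(r"^\s*(node\d+)\s+(\d+)\s+(\S+)")


def parse_cluster_status(text):
    """Return {redundancy_group: {node: status}} parsed from the CLI text."""
    groups = {}
    current = None
    for line in text.splitlines():
        match = GROUP_RE.match(line)
        if match:
            current = int(match.group(1))
            groups[current] = {}
            continue
        match = NODE_RE.match(line)
        if match and current is not None:
            node, _priority, status = match.groups()
            groups[current][node] = status
    return groups


def lost_nodes(groups):
    """Return the sorted names of nodes reported as lost in any redundancy group."""
    return sorted({
        node
        for nodes in groups.values()
        for node, status in nodes.items()
        if status == "lost"
    })


if __name__ == "__main__":
    status = parse_cluster_status(SAMPLE)
    print(status)
    print("Lost nodes:", lost_nodes(status))
```

A non-empty result from lost_nodes corresponds to the symptom described in this topic.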
Diagnosis
1. Is the node that is in the lost state powered on?
   - Yes: Are you able to access the node that is in the lost state through a console port? Do not use Telnet or SSH to access the node.
     - If you are able to access the node, proceed to Step 3.
     - If you are unable to access the node and if the device is at a remote location, access the node through a console for further troubleshooting. If you have console access, but do not see any output, it might indicate a hardware issue. Open a case with your technical support representative for further troubleshooting. See Data Collection for Customer Support.
   - No: Power on the node and proceed to Step 2.
2. After both nodes are powered on, run the show chassis cluster status command again. Is the node still in the lost state?
   - Yes: Are you able to access the node that is in the lost state through a console port? Do not use Telnet or SSH to access the node.
     - If you are able to access the node, proceed to Step 3.
     - If you are unable to access the node and if the node is at a remote location, access the node through a console for further troubleshooting. If you have console access, but do not see any output, it might indicate a hardware issue. Open a case with your technical support representative for further troubleshooting. See Data Collection for Customer Support.
   - No: Powering on the device has resolved the issue.
3. Connect a console to the primary node, and run the show chassis cluster status command. Does the output show this node as primary and the other node as lost?
   - Yes: This might indicate a split-brain scenario. Each node would show itself as primary and the other node as lost. Run the following commands to verify which node is processing the traffic:
     - show security monitoring
     - show security flow session summary
     - monitor interface traffic
     Isolate the node that is not processing the traffic. You can isolate the node from the network by removing all the cables except the control and fabric links. Proceed to Step 4.
   - No: Proceed to Step 4.
4. Verify that all the FPCs are online on the node that is in the lost state by running the show chassis fpc pic-status command. Are all the FPCs online?
   - Yes: Proceed to Step 5.
   - No: Open a case with your technical support representative for further troubleshooting. See Data Collection for Customer Support.
5. Are the nodes connected through a switch?
   - Yes: See Troubleshooting a Fabric Link Failure in an SRX Chassis Cluster and Troubleshooting a Control Link Failure in an SRX Chassis Cluster.
   - No: Proceed to Step 6.
6. Create a backup of the configuration from the node that is currently primary. Run the command from operational mode:
   {primary:node0}
   root@primary-srx> show configuration | save /var/tmp/cfg-bkp.txt
   Copy the configuration to the node that is in the lost state, and load the configuration from configuration mode:
   root@lost-srx# load override <terminal or filename>
   Note: If you use the terminal option, paste the complete configuration into the window. Make sure that you use Ctrl+D at the end of the configuration. If you use the filename option, provide the path to the configuration file (for example: /var/tmp/Primary_saved.conf), and press Enter.
   When you connect to the node in the lost state through a console, you might see the state as either primary or hold/disabled. If the node is in the hold/disabled state, a fabric link failure might have occurred before the device went into the lost state. To troubleshoot this issue, follow the steps in Troubleshooting a Fabric Link Failure in an SRX Chassis Cluster.
   Commit the changes after the configuration is loaded. If the problem persists, replace the existing control and fabric links on this device with new cables and reboot the node:
   {primary:node1}[edit]
   root@lost-srx# run request system reboot
   Is the issue resolved?
   - No: Open a case with your technical support representative for further troubleshooting. See Data Collection for Customer Support.
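The split-brain condition described in Step 3 (each node shows itself as primary and the other node as lost) can be expressed as a small check over the status seen from each console. This is an illustrative sketch only: the function names are hypothetical, and it assumes you have already recorded, for one redundancy group, the node-to-status mapping as displayed on each node's console. A positive result only suggests a split-brain scenario; you still need to confirm which node is processing the traffic with the commands listed in Step 3.

```python
def reports_peer_lost(local_node, statuses):
    """True if local_node's console shows itself as primary and every peer as lost."""
    if statuses.get(local_node) != "primary":
        return False
    peers = [status for node, status in statuses.items() if node != local_node]
    return bool(peers) and all(status == "lost" for status in peers)


def is_possible_split_brain(view_a, view_b):
    """Each view is (local_node_name, {node: status}) taken from that node's console."""
    return reports_peer_lost(*view_a) and reports_peer_lost(*view_b)


if __name__ == "__main__":
    # Status for the same redundancy group as seen from each node's console.
    from_node0 = ("node0", {"node0": "primary", "node1": "lost"})
    from_node1 = ("node1", {"node0": "lost", "node1": "primary"})
    print(is_possible_split_brain(from_node0, from_node1))  # True

    # The second console agrees that node0 is primary: not a split brain.
    agreeing = ("node1", {"node0": "primary", "node1": "secondary"})
    print(is_possible_split_brain(from_node0, agreeing))  # False
```

Comparing the views per redundancy group keeps the check simple; if the redundancy groups disagree with each other, treat that as a reason to inspect the control and fabric links rather than as a definitive diagnosis.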