Upgrading a Chassis Cluster Using In-Service Software Upgrade

Understanding ISSU for a Chassis Cluster

In-service software upgrade (ISSU) enables a software upgrade from one Junos OS version to a later Junos OS version with little or no downtime. ISSU can be performed only when the devices are operating in chassis cluster mode.

The chassis cluster ISSU feature enables both devices in a cluster to be upgraded from supported Junos OS versions with a minimal disruption in traffic and without a disruption in service.

ISSU provides the following benefits:

Eliminates network downtime during software image upgrades
Reduces operating costs, while delivering higher service levels
Allows fast implementation of new features

ISSU has the following limitations:

ISSU is available only for Junos OS Release 10.4R4 or later.
ISSU does not support software downgrades.
If you upgrade from a Junos OS version that supports only IPv4 to a version that supports both IPv4 and IPv6, the IPv4 traffic continues to work during the upgrade process. If you upgrade from a Junos OS version that supports both IPv4 and IPv6 to a version that supports both IPv4 and IPv6, both the IPv4 and IPv6 traffic continue to work during the upgrade process. Junos OS Release 10.2 and later releases support flow-based processing for IPv6 traffic.
During an ISSU, you cannot bring any PICs online. You cannot perform operations such as commit, restart, or halt.
During an ISSU, operations like fabric monitoring, control link recovery, and RGX preempt are suspended.
During an ISSU, you cannot commit any configurations.

For details about ISSU support status, see knowledge base article KB17946.

The following process occurs during an ISSU for devices in a chassis cluster. The sequences given below are applicable when RG-0 is node 0 (primary node). Note that you must initiate an ISSU from RG-0 primary. If you initiate the upgrade on node 1 (RG-0 secondary), an error message is displayed.

At the beginning of a chassis cluster ISSU, the system automatically fails over all RG-1+ redundancy groups that are not primary on the node from which the ISSU is initiated. This action ensures that all the redundancy groups are active on only the RG-0 primary node.

The automatic failover of all RG-1+ redundancy groups is available in Junos OS Release 12.1 and later. If you are using Junos OS Release 11.4 or earlier, before starting the ISSU, ensure that all the redundancy groups are active on only the RG-0 primary node.

After the system fails over all RG-1+ redundancy groups, it sets the manual failover bit and changes all RG-1+ primary node priorities to 255, regardless of whether the redundancy group failed over to the RG-0 primary node.
The primary node (node 0) validates the device configuration to ensure that it can be committed using the new software version. Checks are made for disk space availability for the /var file system on both nodes, unsupported configurations, and unsupported Physical Interface Cards (PICs).

If the disk space available on either of the Routing Engines is insufficient, the ISSU process fails and returns an error message. However, unsupported PICs do not prevent the ISSU. The software issues a warning to indicate that these PICs will restart during the upgrade. Similarly, an unsupported protocol configuration does not prevent the ISSU. However, the software issues a warning that packet loss might occur for the protocol during the upgrade.
When the validation succeeds, the kernel state synchronization daemon (ksyncd) synchronizes the kernel on the secondary node (node 1) with node 0.
Node 1 is upgraded with the new software image. Before being upgraded, node 1 gets the configuration file from node 0 and validates the configuration to ensure that it can be committed using the new software version. After being upgraded, it is resynchronized with node 0.
The chassis cluster process (chassisd) on node 0 prepares other software processes for the ISSU. When all the processes are ready, chassisd sends a message to the PICs installed in the device.
The Packet Forwarding Engine on each Flexible PIC Concentrator (FPC) saves its state and downloads the new software image from node 1. Next, each Packet Forwarding Engine sends a message (unified-ISSU ready) to the chassisd.
After receiving the message (unified-ISSU ready) from a Packet Forwarding Engine, the chassisd sends a reboot message to the FPC on which the Packet Forwarding Engine resides. The FPC reboots with the new software image. After the FPC is rebooted, the Packet Forwarding Engine restores the FPC state and a high-speed internal link is established with node 1 running the new software. The chassisd is also reestablished with node 0.
After all Packet Forwarding Engines have sent a ready message using the chassisd on node 0, other software processes are prepared for a node switchover. The system is ready for a switchover at this point.
Node switchover occurs and node 1 becomes the new primary node (hitherto secondary node 1).
The new secondary node (hitherto primary node 0) is now upgraded to the new software image.

When both nodes are successfully upgraded, the ISSU is complete.
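
The sequence described above can be summarized as an ordered list of phases. The following Python sketch is illustrative only: the `Phase` names and the `run_issu` helper are hypothetical labels for the steps in this topic, not Junos OS identifiers, and the model covers only two of the documented stop conditions (initiating from the wrong node, and a failed validation).

```python
from enum import Enum, auto


class Phase(Enum):
    """Hypothetical labels for the ISSU steps described in this topic."""
    FAIL_OVER_RGS = auto()        # RG-1+ groups move to the RG-0 primary node
    VALIDATE_CONFIG = auto()      # node 0 checks configuration, /var space, PICs
    SYNC_KERNEL = auto()          # ksyncd synchronizes node 1 with node 0
    UPGRADE_SECONDARY = auto()    # node 1 installs the new image
    PREPARE_PROCESSES = auto()    # chassisd readies processes and notifies PICs
    UPGRADE_PFES = auto()         # PFEs save state, FPCs reboot on the new image
    SWITCHOVER = auto()           # node 1 becomes the new primary node
    UPGRADE_OLD_PRIMARY = auto()  # node 0 installs the new image


def run_issu(initiated_on_rg0_primary: bool, validation_ok: bool) -> list[Phase]:
    """Return the phases that complete, stopping at a failed validation."""
    if not initiated_on_rg0_primary:
        # The topic states that initiating on the RG-0 secondary is an error.
        raise ValueError("ISSU must be initiated from the RG-0 primary node")
    completed = []
    for phase in Phase:  # Enum members iterate in definition order
        if phase is Phase.VALIDATE_CONFIG and not validation_ok:
            break
        completed.append(phase)
    return completed
```

For example, `run_issu(True, True)` walks through all eight phases, while `run_issu(True, False)` stops after the redundancy group failover because validation did not succeed.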

When upgrading a cluster from a version that does not support encryption to a version that supports encryption, upgrade the first node to the new version. As long as encryption is not configured and enabled, the two nodes can still communicate with each other while running different versions, and service is not broken. After upgrading the first node, upgrade the second node to the new version. You can decide whether to turn on the encryption feature after completing the upgrade. Before downgrading to a version that does not support encryption, you must deactivate encryption. This ensures that communication between the node running the encryption-capable version and the downgraded node does not break, because neither node is encrypting traffic.

Note:

The policies in the Routing Engine and Packet Forwarding Engine must be in sync for the configuration to be committed. When the policy configurations are modified and the policies are out of sync, the system displays an error message.

As a workaround, if you notice that security policies are out of sync after an upgrade, use the request security policies resync command to synchronize the security policy configuration in the Routing Engine and Packet Forwarding Engine.

ISSU System Requirements

You can use ISSU to upgrade from an ISSU-capable software release to a later release.

To perform an ISSU, your device must be running a Junos OS release that supports ISSU for the specific platform. See Table 1 for platform support.

Table 1: ISSU Platform Support
Device	Junos OS Release
SRX5800 and SRX5600	10.4R4 or later
SRX5400	12.1X46-D20 or later
SRX1500	15.1X49-D70 or later
SRX1600, SRX2300, and SRX4120	23.4R1 or later
SRX4100 and SRX4200	15.1X49-D80 or later
SRX4300	24.2R1 or later
SRX4600	17.4R1 or later

For additional details on ISSU support and limitations, see ISSU/ICU Upgrade Limitations on SRX Series Devices.
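
As a rough illustration of how Table 1 might be consulted programmatically, the following Python sketch stores the minimum releases from the table and compares a running release against them. The `release_key` and `meets_minimum` helpers are hypothetical, and the ordering they apply (major, minor, release-type letter, build number, then the `-D` number) is a simplifying assumption, not the official Junos OS release ordering.

```python
import re

# Minimum ISSU-capable releases, copied from Table 1.
MIN_RELEASE = {
    "SRX5800": "10.4R4", "SRX5600": "10.4R4",
    "SRX5400": "12.1X46-D20",
    "SRX1500": "15.1X49-D70",
    "SRX1600": "23.4R1", "SRX2300": "23.4R1", "SRX4120": "23.4R1",
    "SRX4100": "15.1X49-D80", "SRX4200": "15.1X49-D80",
    "SRX4300": "24.2R1",
    "SRX4600": "17.4R1",
}

# Matches strings such as "10.4R4" or "15.1X49-D70".
_RELEASE = re.compile(r"^(\d+)\.(\d+)([A-Z])(\d+)(?:-D(\d+))?")


def release_key(release: str) -> tuple:
    """Split a release string into a tuple that can be compared.

    Assumption: comparing these fields left to right is a simplification.
    """
    match = _RELEASE.match(release)
    if match is None:
        raise ValueError(f"unrecognized release string: {release!r}")
    major, minor, kind, number, drop = match.groups()
    return (int(major), int(minor), kind, int(number), int(drop or 0))


def meets_minimum(platform: str, running: str) -> bool:
    """True if the running release is at or above the Table 1 minimum."""
    return release_key(running) >= release_key(MIN_RELEASE[platform])
```

With these assumptions, `meets_minimum("SRX5800", "10.4R3")` is false and `meets_minimum("SRX1500", "15.1X49-D75")` is true. Treat the result as a first filter only; the support status references in this topic remain the authority.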

Note the following limitations related to an ISSU:

The ISSU process is terminated if the Junos OS version specified for installation is a version earlier than the one currently running on the device.
The ISSU process is terminated if the specified upgrade conflicts with the current configuration, the components supported, and so forth.
ISSU does not support the extension application packages developed using the Junos OS SDK.
ISSU does not support version downgrading on any supported SRX Series Firewall.
ISSU occasionally fails under heavy CPU load.

To downgrade from an ISSU-capable release to an earlier release (ISSU-capable or not), use the request system software add command. Unlike an upgrade using the ISSU process, a downgrade using the request system software add command might cause network disruptions and loss of data.

We strongly recommend that you perform ISSU under the following conditions:

When both the primary and secondary nodes are healthy
During system maintenance period
During the lowest possible traffic period
When the Routing Engine CPU usage is less than 40 percent

In cases where ISSU is not supported or not recommended but downtime during the system upgrade must still be minimized, you can use the minimal downtime procedure. See knowledge base article KB17947.

Upgrading Both Devices in a Chassis Cluster Using ISSU

Before you begin the ISSU for upgrading both the devices, note the following guidelines:

Ensure the following ISSU pre-check requirements are met:
- The priority of every redundancy group is greater than 0
- Every redundancy group is in either the primary or the secondary state
- The /var/tmp directory has enough free space (double the image size)
- CPU usage is under 80 percent over a 5-second period
If the pre-check requirements are not met, the ISSU terminates at the beginning.
Back up the system software to the device’s hard disk by using the request system snapshot command on each Routing Engine.
If you are using Junos OS Release 11.4 or earlier, before starting the ISSU, set the failover for all redundancy groups so that they are all active on only one node (primary). See Initiating a Chassis Cluster Manual Redundancy Group Failover.

If you are using Junos OS Release 12.1 or later, Junos OS automatically fails over all RGs to the RG0 primary.
We recommend that you enable graceful restart for routing protocols before you start an ISSU.
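
The ISSU pre-check requirements listed in these guidelines can be expressed as a small checklist function. This Python sketch is a hedged illustration: the `RedundancyGroup` type, the `issu_precheck` helper, and the failure strings are hypothetical, and the inputs are assumed to have been collected by other means (for example, from CLI output).

```python
from dataclasses import dataclass


@dataclass
class RedundancyGroup:
    group_id: int
    priority: int
    state: str  # for example "primary" or "secondary"


def issu_precheck(groups, image_bytes, var_tmp_free_bytes, cpu_percent):
    """Return descriptions of failed pre-checks; an empty list means pass."""
    failures = []
    for rg in groups:
        if rg.priority <= 0:
            failures.append(f"RG{rg.group_id}: priority must be greater than 0")
        if rg.state not in ("primary", "secondary"):
            failures.append(f"RG{rg.group_id}: state is {rg.state!r}")
    # The guideline asks for double the image size in /var/tmp.
    if var_tmp_free_bytes < 2 * image_bytes:
        failures.append("/var/tmp: less than double the image size is free")
    # The guideline asks for CPU usage under 80 percent over a 5-second period.
    if cpu_percent >= 80:
        failures.append("CPU usage is not under 80 percent")
    return failures
```

Returning a list of failures rather than a single Boolean makes it easier to report every unmet requirement at once before attempting the upgrade.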

On all supported SRX Series Firewalls, the earliest release from which ISSU is recommended is Junos OS Release 10.4R4.

The chassis cluster ISSU feature enables both devices in a cluster to be upgraded from supported Junos OS versions with a traffic impact similar to that of redundancy group failovers.

To perform an ISSU from the CLI on Routing Engine2:

Download the software package from the Juniper Networks Support website: https://www.juniper.net/support/downloads/
Copy the package to the primary node of the cluster. We recommend that you copy the package to the /var/tmp directory, which is a large file system on the hard disk. Note that the node from which you initiate the ISSU must have the software image.
user@host> file copy ftp://username:prompt@ftp.hostname.net/filename /var/tmp/filename
Verify the current software version running on both nodes by issuing the show version command on the primary node.
Start the ISSU from the node that is primary for all the redundancy groups by entering the following command:
user@host> request system software in-service-upgrade image-name-with-full-path
Wait for both nodes to complete the upgrade, after which you are logged out of the device.
Wait a few minutes, and then log in to the device again. Verify by using the show version command that both devices in the cluster are running the new Junos OS release.
Verify that all policies, zones, redundancy groups, and other real-time objects (RTOs) return to their correct states.
Make node 0 the primary node again by issuing the request chassis cluster failover node node-number redundancy-group group-number command.
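
The CLI steps in this procedure can be collected into an ordered command list, which is convenient for a runbook or change record. The Python sketch below only builds strings and does not connect to a device. The `issu_cli_steps` helper is hypothetical, and the ISSU start command uses the `request system software in-service-upgrade` form that this topic gives for Routing Engine2.

```python
def issu_cli_steps(image_url: str, filename: str, groups=(0,)) -> list[str]:
    """Build the operational-mode commands for the procedure, in order."""
    local_path = f"/var/tmp/{filename}"
    steps = [
        f"file copy {image_url} {local_path}",   # copy image to primary node
        "show version",                          # record the current release
        f"request system software in-service-upgrade {local_path}",
        "show version",                          # confirm the new release
    ]
    # Make node 0 primary again, one command per redundancy group.
    for group in groups:
        steps.append(
            f"request chassis cluster failover node 0 redundancy-group {group}"
        )
    return steps


if __name__ == "__main__":
    url = "ftp://username:prompt@ftp.hostname.net/filename"
    for command in issu_cli_steps(url, "filename", groups=(0, 1)):
        print(command)
```

Keeping the two `show version` entries separate mirrors the procedure: one check before the upgrade and one after you log back in.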

If you want redundancy groups to automatically return to node 0 as the primary after an in-service software upgrade (ISSU), you must set the redundancy group priority such that node 0 is primary and enable the preempt option. Note that this method works for all redundancy groups except redundancy group 0. You must manually set the failover for redundancy group 0.

To set the redundancy group priority and enable the preempt option, see Example: Configuring Chassis Cluster Redundancy Groups.

To manually set the failover for a redundancy group, see Initiating a Chassis Cluster Manual Redundancy Group Failover.

During the upgrade, both devices might experience redundancy group failovers, but traffic is not disrupted. Each device validates the package and checks version compatibility before beginning the upgrade. If the system finds that the new package version is not compatible with the currently installed version, the device refuses the upgrade or prompts you to take corrective action. Sometimes a single feature is not compatible, in which case, the upgrade software prompts you to either terminate the upgrade or turn off the feature before beginning the upgrade.

If you want to operate the SRX Series Firewall as a standalone device again or remove a node from a chassis cluster, ensure that you have terminated the ISSU procedure on both nodes (if an ISSU procedure was initiated).

To start ISSU process on SRX5K devices with Routing Engine3 and on SRX1600, SRX2300, SRX4120, and SRX4300 devices:

Run the following command to start ISSU:

Rolling Back Devices in a Chassis Cluster After an ISSU

If an ISSU fails to complete and only one device in the cluster is upgraded, you can roll back to the previous configuration on the upgraded device alone by issuing one of the following commands on the upgraded device:

request chassis cluster in-service-upgrade abort
request system software rollback node node-id reboot
request system reboot

Enabling an Automatic Chassis Cluster Node Failback After an ISSU

If you want redundancy groups to automatically return to node 0 as the primary after an in-service software upgrade (ISSU), you must set the redundancy group priority such that node 0 is primary and enable the preempt option. Note that this method works for all redundancy groups except redundancy group 0. You must manually set the failover for redundancy group 0. To set the redundancy group priority and enable the preempt option, see Example: Configuring Chassis Cluster Redundancy Groups. To manually set the failover for a redundancy group, see Initiating a Chassis Cluster Manual Redundancy Group Failover.
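
The failback rule described in this section can be modeled in a few lines. The Python sketch below is a simplified illustration under stated assumptions: the `primary_after_issu` helper is hypothetical, and it ignores the manual failover bit, interface monitoring, and any other factor that influences primacy on a real cluster.

```python
def primary_after_issu(group_id: int, current_primary: int,
                       priorities: dict, preempt: bool) -> int:
    """Return the node expected to be primary for a redundancy group.

    priorities maps node ID (0 or 1) to the configured priority.
    """
    if group_id == 0:
        # Redundancy group 0 does not fail back automatically;
        # it needs a manual failover.
        return current_primary
    if not preempt:
        return current_primary
    best = max(priorities, key=priorities.get)
    # Preempt only when another node has a strictly higher priority.
    if priorities[best] > priorities[current_primary]:
        return best
    return current_primary
```

For example, with node 0 at priority 200, node 1 at priority 100, and preempt enabled, redundancy group 1 is expected to return to node 0, while redundancy group 0 stays on node 1 until you fail it over manually.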

To upgrade node 0 and make it available in the chassis cluster, manually reboot node 0. Node 0 does not reboot automatically.

Log Error Messages used for Troubleshooting ISSU-Related Problems

The following problems might occur during an ISSU upgrade. You can identify the errors by using the details in the logs. For detailed information about specific system log messages, see System Log Explorer.

Chassisd Process Errors
Understanding Common Error Handling for ISSU
ISSU Support-Related Errors
Initial Validation Checks Failure
Installation-Related Errors
Redundancy Group Failover Errors
Kernel State Synchronization Errors

Chassisd Process Errors

Problem

Description

Errors related to chassisd.

Solution

Use the error messages to understand the issues related to chassisd.

When ISSU starts, a request is sent to chassisd to check whether there are any problems related to the ISSU from a chassis perspective. If there is a problem, a log message is created.

Understanding Common Error Handling for ISSU

Problem

Description

You might encounter some problems in the course of an ISSU. This section provides details on how to handle them.

Solution

Any errors encountered during an ISSU result in the creation of log messages, and ISSU continues to function without impact to traffic. If reverting to previous versions is required, the event is either logged or the ISSU is halted, so as not to create any mismatched versions on both nodes of the chassis cluster. Table 2 provides some of the common error conditions and the workarounds for them. The sample messages used in Table 2 are from the SRX1500 device and are also applicable to all supported SRX Series Firewalls.

Table 2: ISSU-Related Errors and Solutions
Error Conditions	Solutions
Attempt to initiate an ISSU when a previous instance of an ISSU is already in progress	The following message is displayed: `warning: ISSU in progress` You can abort the current ISSU process by using the `request chassis cluster in-service-upgrade abort` command, and then initiate the ISSU again.
Reboot failure on the secondary node	No service downtime occurs, because the primary node continues to provide required services. Detailed console messages are displayed requesting that you manually clear existing ISSU states and restore the chassis cluster:
error: [Oct 6 12:30:16]: Reboot secondary node failed (error-code: 4.1)
error: [Oct 6 12:30:16]: ISSU Aborted! Backup node maybe in inconsistent state, Please restore backup node
[Oct 6 12:30:16]: ISSU aborted. But, both nodes are in ISSU window.
Please do the following:
1. Rollback the node with the newer image using rollback command
   Note: use the 'node' option in the rollback command
   otherwise, images on both nodes will be rolled back
2. Make sure that both nodes (will) have the same image
3. Ensure the node with older image is primary for all RGs
4. Abort ISSU on both nodes
5. Reboot the rolled back node
Secondary node failed to complete the cold synchronization	The primary node times out if the secondary node fails to complete the cold synchronization. Detailed console messages are displayed requesting that you manually clear existing ISSU states and restore the chassis cluster. No service downtime occurs in this scenario:
[Oct 3 14:00:46]: timeout waiting for secondary node node1 to sync(error-code: 6.1)
Chassis control process started, pid 36707
error: [Oct 3 14:00:46]: ISSU Aborted! Backup node has been upgraded, Please restore backup node
[Oct 3 14:00:46]: ISSU aborted. But, both nodes are in ISSU window.
Please do the following:
1. Rollback the node with the newer image using rollback command
   Note: use the 'node' option in the rollback command
   otherwise, images on both nodes will be rolled back
2. Make sure that both nodes (will) have the same image
3. Ensure the node with older image is primary for all RGs
4. Abort ISSU on both nodes
5. Reboot the rolled back node
Failover of newly upgraded secondary failed	No service downtime occurs, because the primary node continues to provide required services. Detailed console messages are displayed requesting that you manually clear existing ISSU states and restore the chassis cluster:
[Aug 27 15:28:17]: Secondary node0 ready for failover.
[Aug 27 15:28:17]: Failing over all redundancy-groups to node0
ISSU: Preparing for Switchover
error: remote rg1 priority zero, abort failover.
[Aug 27 15:28:17]: failover all RGs to node node0 failed (error-code: 7.1)
error: [Aug 27 15:28:17]: ISSU Aborted!
[Aug 27 15:28:17]: ISSU aborted. But, both nodes are in ISSU window.
Please do the following:
1. Rollback the node with the newer image using rollback command
   Note: use the 'node' option in the rollback command
   otherwise, images on both nodes will be rolled back
2. Make sure that both nodes (will) have the same image
3. Ensure the node with older image is primary for all RGs
4. Abort ISSU on both nodes
5. Reboot the rolled back node
{primary:node1}
Upgrade failure on primary	No service downtime occurs, because the secondary node fails over as primary and continues to provide required services.
Reboot failure on primary node	Before the primary node reboots, the devices are out of the ISSU setup, so no ISSU-related error messages are displayed. If any other failure is detected, a reboot error message is displayed.
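
The sample console messages in Table 2 each carry an error code in parentheses, which makes them easy to pick out of a captured console log. The following Python sketch is one possible way to do that. The `classify` helper is hypothetical, and the code-to-description mapping covers only the three codes that appear in the sample messages (4.1, 6.1, and 7.1), so it should not be read as a complete list.

```python
import re

# Error codes and meanings taken from the sample messages in Table 2.
ERROR_CODES = {
    "4.1": "Reboot of the secondary node failed",
    "6.1": "Timed out waiting for the secondary node to synchronize",
    "7.1": "Failover of all redundancy groups failed",
}

# Matches "(error-code: 4.1)" with or without a space after the colon.
_CODE = re.compile(r"\(error-code:\s*(\d+\.\d+)\)")


def classify(console_line: str):
    """Return (code, description) for a console line, or None if no code."""
    match = _CODE.search(console_line)
    if match is None:
        return None
    code = match.group(1)
    return code, ERROR_CODES.get(code, "unknown error code")


if __name__ == "__main__":
    line = "error: [Oct 6 12:30:16]: Reboot secondary node failed (error-code: 4.1)"
    print(classify(line))
```

Lines without an error code, such as the numbered recovery instructions, return `None` and can be skipped.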

ISSU Support-Related Errors

Problem

Description

Installation failure occurs because of unsupported software and unsupported feature configuration.

Solution

Use the following error messages to understand the compatibility-related problems:

Initial Validation Checks Failure

Problem

Description

The initial validation checks fail.

Solution

The validation checks fail if the image is not present or if the image file is corrupt. When the image is not present, the following error messages are displayed and the ISSU is aborted:

When Image Is Not Present

When Image File Is Corrupted

If the image file is corrupted, the following output displays:

The primary node validates the device configuration to ensure that it can be committed using the new software version. If anything goes wrong, the ISSU aborts and error messages are displayed.

Installation-Related Errors

Problem

Description

The install image file does not exist or the remote site is inaccessible.

Solution

Use the following error messages to understand the installation-related problems:

ISSU downloads the install image as specified in the ISSU command as an argument. The image file can be a local file or located at a remote site. If the file does not exist or the remote site is inaccessible, an error is reported.
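
Before starting the upgrade, it can help to check the image argument yourself. The Python sketch below distinguishes a remote location from a local file and reports a missing local file. It is an illustration only: the `image_location` and `check_local_image` helpers, the list of URL schemes, and the returned error text are assumptions made for this example and do not reproduce Junos OS behavior or messages.

```python
import os
from urllib.parse import urlparse

# Assumption: illustrative schemes only; this topic shows ftp in its example.
REMOTE_SCHEMES = {"ftp", "http", "https", "scp"}


def image_location(image_arg: str) -> str:
    """Classify the image argument as 'remote' or 'local'."""
    scheme = urlparse(image_arg).scheme
    return "remote" if scheme in REMOTE_SCHEMES else "local"


def check_local_image(image_arg: str):
    """Return an error string if a local image file is missing, else None."""
    if image_location(image_arg) == "local" and not os.path.isfile(image_arg):
        return f"image file not found: {image_arg}"
    return None
```

A remote location still has to be reachable when the ISSU runs, which this sketch does not test.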

Redundancy Group Failover Errors

Problem

Description

Problem with automatic redundancy group (RG) failover.

Solution

Use the following error messages to understand the problem:

Kernel State Synchronization Errors

Problem

Description

Errors related to ksyncd.

Solution

Use the following error messages to understand the issues related to ksyncd:

ISSU checks whether there are any ksyncd errors on the secondary node (node 1). If there are any problems, it displays an error message and aborts the upgrade.

Managing Chassis Cluster ISSU-Related Problems

This topic includes the following sections:

Viewing ISSU Progress
Stopping ISSU Process if it Halts During an Upgrade
Recovering the Node in Case of a Failed ISSU

Viewing ISSU Progress

Problem

Description

Rather than wait for an ISSU failure, you can display the progress of the ISSU as it occurs, noting any message indicating that the ISSU was unsuccessful. Providing such messages to JTAC can help with resolving the issue.

Solution

After starting an ISSU, issue the show chassis cluster information issu command. Output similar to the following is displayed indicating the progress of the ISSU for all Services Processing Units (SPUs).

Stopping ISSU Process if it Halts During an Upgrade

Problem

Description

The ISSU process halts in the middle of an upgrade.

Solution

If the ISSU fails to complete and only one device in the cluster is upgraded, you can roll back to the previous configuration on the upgraded device alone by issuing one of the following commands on the upgraded device:

request chassis cluster in-service-upgrade abort to terminate the ISSU on both nodes.
request system software rollback node node-id reboot to roll back the image.
request system reboot to reboot the rolled back node.

Recovering the Node in Case of a Failed ISSU

Problem

Description

The ISSU procedure stops progressing.

Solution

Open a new session on the primary device and issue the request chassis cluster in-service-upgrade abort command.

This step terminates an in-progress ISSU. Issue this command from a session other than the one in which you issued the request system software in-service-upgrade command that launched the ISSU. If the node is being upgraded, this command cancels the upgrade. The command is also helpful in recovering the node in case of a failed ISSU.

When an ISSU encounters an unexpected situation that necessitates a termination, the system message provides you with detailed information about when and why the upgrade stopped along with recommendations for the next steps to take.

For example, the following message is issued when a node fails to become RG-0 secondary when it boots up:

If you attempt to upgrade a device pair running a Junos OS release earlier than Release 9.6, ISSU fails without changing anything on either device in the cluster. Devices running Junos OS releases earlier than Release 9.6 must be upgraded separately using individual device upgrade procedures.

If the secondary device experiences a power-off condition before it boots up using the new image specified when the ISSU was initiated, the newly upgraded device will still be waiting to end the ISSU after power is restored. To end the ISSU process on Routing Engine3 and Routing Engine2 on both the nodes, issue the request chassis cluster in-service-upgrade abort command.

Platform-Specific In-Service Software Upgrade Behavior

Use Feature Explorer to confirm platform and release support for specific features.

Use the following table to review platform-specific behaviors for your platform.

Platform	Difference
SRX Series	SRX1500, SRX4100, and SRX4200 Firewalls support upgrading from Junos OS 17.4 to successive 17.4 releases, but cannot be upgraded to 17.4 releases from earlier Junos OS releases. SRX5400, SRX5600, and SRX5800 Firewalls support upgrading from Junos OS 17.3 to successive 17.3 releases, but cannot be upgraded to 17.3 and later releases from earlier Junos OS releases. SRX1500, SRX1600, SRX2300, SRX4120, SRX4100, SRX4200, SRX4300, and SRX4600 Firewalls do not support the `request system snapshot` command. SRX1500, SRX4100, and SRX4200 Firewalls that support ISSU allow you to remove the original image file. To do so, include the `unlink` option in the command: `user@host> request system software in-service-upgrade image-name-with-full-path unlink`.
