Configuring FPC Error Levels and Actions
Example: Configuring FPC Error Detection and Self-Healing on T Series Core Routers
Managing FPC Errors
Powering Off Packet Forwarding Engines
Configuring Sanity Polling
Configuring the Junos OS to Make a Flexible PIC Concentrator Stay Offline
Configuring an SFM to Stay Offline
Resynchronizing FPC Sequence Numbers with Active FPCs when an FPC Comes Online
Enabling a Routing Engine to Reboot on Hard Disk Errors
Handling Thermal Health Events Using Thermal Health Check and PSM Watchdog

Managing Errors

Configuring FPC Error Levels and Actions

Starting with Junos OS Release 13.3, or Release 14.2 for M320 routers, you can configure Packet Forwarding Engine (PFE)-related error levels on FPCs, and the actions to perform when a specified threshold is reached, on MX Series, PTX Series, and T Series routers. In Junos OS Release 13.2 and earlier, Packet Forwarding Engine errors disabled the FPC. When you use the error command, Packet Forwarding Engine errors can be isolated, which reduces the need for field replacement. Using the error command, you can classify errors according to severity, set an automatic recovery action for each severity, and configure the threshold at which each action is performed. This command is available at the [edit chassis fpc slot-number] and [edit chassis] hierarchy levels.

To configure Packet Forwarding Engine error levels and actions for an FPC:

1. (Optional) Configure the fatal error level threshold and action. A fatal error is an error that blocks a considerable amount of traffic across modules.
  If the severity level of the error is fatal, the action is carried out when the total number of errors reaches the threshold value. After the threshold is crossed, the action is carried out again for every subsequent occurrence of the error.
2. (Optional) Configure the major error level threshold and action. A major error is an error that results in continuing loss of packet traffic but does not affect other modules.
  If the severity level of the error is major, the action is carried out when the total number of errors reaches the threshold value. After the threshold is crossed, the action is carried out again for every subsequent occurrence of the error.
3. (Optional) Configure the minor error level threshold and action. A minor error is an error that results in the loss of a single packet but is fully recoverable.
  If the severity level of the error is minor, the action is carried out only once, when the total number of errors reaches the threshold value.
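
As an illustration of the two hierarchy levels at which the error command is available, the following sketch sets a chassis-wide fatal threshold and a minor threshold for a single FPC. The slot number, threshold values, and actions are example values chosen for illustration, not recommendations.

```
[edit]
user@host# set chassis error fatal threshold 1 action reset
user@host# set chassis fpc 2 error minor threshold 10 action log
```

With this configuration, a fatal error triggers a reset on its first occurrence and on every later occurrence, while the minor-error action for FPC 2 is carried out once, when the tenth minor error is counted.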

Starting with Junos OS Release 18.1R3, MX Series routers support configuration of error thresholds and actions at the error scope and error category levels. Use the command set chassis fpc fpc-slot error scope error-scope category category (fatal | major | minor) threshold error-threshold action (alarm | disable-pfe | get-state | offline | log | reset | trap | online-pfe | reset-pfe) to configure a threshold and action for a particular error scope and category at the FPC level. You can also configure these features at the chassis level (at the [edit chassis] hierarchy level). However, a threshold and action configured at the [edit chassis fpc] hierarchy level override the same configuration at the [edit chassis] hierarchy level.
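
The following sketch shows how the chassis-level and FPC-level forms might be combined. The scope value pfe and the category value functional are assumptions used for illustration; check the values that your release accepts with the CLI help (?) before using them. The slot number, thresholds, and actions are also example values.

```
[edit]
user@host# set chassis error scope pfe category functional major threshold 5 action log
user@host# set chassis fpc 1 error scope pfe category functional major threshold 2 action alarm
```

Under these assumptions, FPC 1 would use a threshold of 2 with the alarm action for this scope and category, because the FPC-level configuration overrides the chassis-level configuration. Other FPCs would use the chassis-level threshold of 5 with the log action.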

You can use the command show chassis fpc errors to view the error information at the error scope and category level.

For Junos OS Evolved, you can use the following show commands to view the error information:

show system errors count: Displays system-wide errors and their counts.
show system errors active: Displays the currently active errors in the system.
show system errors active fpc <slot number>: Displays active errors for the specified FPC.
show system errors fru detail: Displays detailed FRU-specific error information.
show system errors fru detail fpc <slot number>: Displays detailed information about errors detected on the specified FPC.

If you have configured the action log against a particular error threshold, the system logs the event when the error count breaches the set threshold. The following sample syslog messages indicate an error threshold breach and the resultant action being taken:

The offline, reset, disable-pfe, offline-pfe, and reset-pfe actions are mutually exclusive with respect to configuration. The specified PFE is disabled automatically if offline-pfe or reset-pfe is configured.

Note: A default FPC major alarm action is added for MPC6E. The disable-pfe option is available in Junos OS Release 17.4 and later.

The following table provides details about PFE error mapping actions and the system response:

Table 1: PFE Error Mapping Action and Response
Action	Response
`disable-pfe`	Disables all PFE interfaces, alarms and logs.
`offline`	Takes the FPC offline, disables the alarms and logs.
`reset`	Takes the FPC offline and resets to online, enables the alarms and logs.
`reset-pfe`	Powers off the PFE, disables the alarms and logs, then powers on the PFE and enables the alarms and logs.
`offline-pfe`	Powers off the PFE, disables the alarms and logs.

Example: Configuring FPC Error Detection and Self-Healing on T Series Core Routers

This example shows how to configure error detection and self-healing on a Juniper Networks T Series Core Router with Type 5 FPC.

Requirements

This example uses the following hardware and software components:

Juniper Networks T4000 Core Router with Type 5 FPCs.
Junos OS Release 13.3 or later.

Before you proceed, ensure that the required connections are complete and the interfaces are functional.

Overview

FPC error detection and self-healing involves configuring a set of actions to be performed on each FPC when the number of errors of a particular severity exceeds a user-configured threshold. The error severity is categorized as fatal, major, or minor. Recovery actions include raising an alarm, generating log entries, getting the current state of the FPC, restarting the FPC, taking the FPC offline, and resetting the FPC. For a particular FPC and error severity, you can configure the error threshold to any value within the allowed limits and map the threshold to an action. In this example, you configure thresholds and actions for all three severities on FPC 0 in a Juniper Networks T4000 Core Router.

Configuration

To configure the error detection and self-healing, you need to set the error severity, threshold values corresponding to each error severity, and actions to be performed when the threshold value is crossed.

CLI Quick Configuration

To quickly configure this example, copy the following commands, paste them into a text file, remove any line breaks, change any details necessary to match your network configuration, and then copy and paste the commands into the CLI at the [edit] hierarchy level.
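
The commands for this example, assembled here from the final command of each severity in the step-by-step procedure on this page, would be:

```
set chassis fpc 0 error fatal threshold 1 action reset
set chassis fpc 0 error major threshold 1 action alarm
set chassis fpc 0 error minor threshold 10 action log
```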

Configuring the Error Detection and Self-Healing

Step-by-Step Procedure

The following example requires you to navigate various levels in the configuration hierarchy. For information about navigating the CLI, see Using the CLI Editor in Configuration Mode in the CLI User Guide.

Configure the threshold value and associated action for fatal errors.
1. Set the error severity to fatal.
  
  [edit]
  user@host# set chassis fpc 0 error fatal
2. Set the threshold value for fatal errors.
  
  [edit]
  user@host# set chassis fpc 0 error fatal threshold 1
3. Set the associated action for fatal errors.
  
  [edit]
  user@host# set chassis fpc 0 error fatal threshold 1 action reset

Configure the threshold value and associated action for major errors.
1. Set the error severity to major.
  
  [edit]
  user@host# set chassis fpc 0 error major
2. Set the threshold value for major errors.
  
  [edit]
  user@host# set chassis fpc 0 error major threshold 1
3. Set the associated action for major errors.
  
  [edit]
  user@host# set chassis fpc 0 error major threshold 1 action alarm

Configure the threshold value and associated action for minor errors.
1. Set the error severity to minor.
  
  [edit]
  user@host# set chassis fpc 0 error minor
2. Set the threshold value for minor errors.
  
  [edit]
  user@host# set chassis fpc 0 error minor threshold 10
3. Set the associated action for minor errors.
  
  [edit]
  user@host# set chassis fpc 0 error minor threshold 10 action log

Results

The following is the result of the configuration for the fatal severity level.

If you are done configuring the device, enter commit from configuration mode.

Verification

To verify that the configuration is successful and that the router is configured with the correct action, use the show chassis fpc errors command.

Verifying the Configured Actions Related to Fatal Severity of FPC Error

Purpose

Make sure that the threshold value and the associated action are set for fatal errors.

Action

Meaning

The sample output shows Fatal error at FPC 0 with 0 error Occurred (no previous occurrences), 0 error Cleared (no previous occurrences) with Threshold value set to 1 and Action-Taken set to RESET.

Managing FPC Errors

On PTX Series routers, you can disable an FPC error or modify the severity of the error at the error-id level. See FPC self-healing for details on PTX platforms that support this feature.

The error-id, which uniquely identifies an FPC error, is represented in the uniform resource identifier (URI) format and is composed of a module identifier and an error identifier. If an error occurs, you can find the error-id in the system log messages.

Modifying Severity of an Error

Though you cannot configure a new error severity, you can modify the existing severity of an error. For example, if you do not want to treat a particular error (identified by an error-id) as fatal anymore, you can modify its severity to major or minor as required.

Note:

You cannot modify the error severity at a group (for example, category) level.

To modify the severity of an error, use the following command:

See the following example:

In the above example, you modified the severity of the error ID “/cpu/0/memory/0/memory-uncorrected-error” in FPC 3 to minor.

Disabling an Error

To configure the system to stop reporting an error, identify the error-id and disable it. You can find the error-id in the system log messages. To disable an error, use the following command:

See the following example:

In the above example, you disabled the error “/cpu/0/memory/0/memory-uncorrected-error” in FPC 3.

Powering Off Packet Forwarding Engines

You can power on or power off the Packet Forwarding Engines in a running system, or keep a Packet Forwarding Engine powered off when the FPC comes online. This feature is useful in scenarios such as the following:

The Packet Forwarding Engine ASIC is malfunctioning.
The deployment does not require the full capacity of the system, and you want to conserve power.

To power off a Packet Forwarding Engine, use the following steps:

To power on a Packet Forwarding Engine, use the following steps:

Note:

You need to apply this configuration to both the Packet Forwarding Engines in an ASIC to be able to commit the configuration.

Note:

On MX Series routers with MPC10E-15C-MRATE, you can power off or power on only Packet Forwarding Engine 2. Packet Forwarding Engines 0 and 1 do not support this command. On the MPC10E-15C-MRATE, operating Packet Forwarding Engine 2 requires Packet Forwarding Engines 0 and 1 to be functional. You can use the command show chassis fpc fpc-slot detail to view the Packet Forwarding Engine power on/off status and bandwidth for the individual Packet Forwarding Engines in the MPC10E-15C-MRATE.

You can use the show chassis fpc fpc-slot detail command to view the Packet Forwarding Engine power on/off configuration status. See an example below:

Configuring Sanity Polling

You can configure the sanity-poll statement for a particular FPC or FEB or CFEB to start a periodic sanity check for that FPC or FEB or CFEB. The periodic sanity check includes checking for error conditions such as “register sanity issues,” “high temperature,” “hardware failure,” and so on. If you do not configure the sanity-poll statement, then sanity polling is disabled.

Note:

Currently, periodic sanity check is performed only on the routing chip register.

Sanity polling periodically checks for an error condition in an FPC or FEB or CFEB and performs the appropriate actions in case of an error.

To configure sanity polling for an FPC on T Series routers and M320 routers, include the sanity-poll statement and its substatements at the [edit chassis fpc slot-number] hierarchy level:

To configure sanity polling for a FEB on the M120 router, include the sanity-poll statement and its substatements at the [edit chassis feb slot-number] hierarchy level:

To configure sanity polling for a CFEB on M7i and M10 routers, include the sanity-poll statement and its substatements at the [edit chassis cfeb slot-number] hierarchy level:

Note:

On a TX Matrix or TX Matrix Plus router, you can configure the sanity-poll statement at the [edit chassis lcc number fpc number] hierarchy level.

The sanity-poll statement comprises the following substatements:

The retry-count statement specifies the number of rechecks to be performed after the occurrence of a particular error condition. If an error exists in all the periodic checks, then sanity polling reports an error and proceeds to perform the appropriate actions (described as options of the on-error statement).

For example, if the periodic sanity check detects an error in the FPC or FEB or CFEB and if you configure the retry count number to 15, sanity polling does not report the error immediately. Sanity polling checks 15 times for the same error condition. If an error persists in all 15 rechecks, then it reports an error and takes the appropriate actions.

If you do not configure the retry-count statement, then by default, the sanity-poll statement rechecks the detected error 10 times before reporting an error condition.

If sanity polling detects an error condition, the on-error statement performs the appropriate actions to eliminate the error.

The following actions are common to all kinds of error conditions:
- To generate a chassis alarm, configure the raise-alarm statement. The chassis alarm is displayed on the front panel of the chassis.
- To reboot the FPC or FEB or CFEB after generating a core file, configure the power cycle statement. This statement is useful for temporary software errors that are eliminated after reboot.
- To halt the FPC or FEB or CFEB, configure the power off statement. This statement is useful in case of permanent hardware failure.
  
  CAUTION:
  The power off statement halts the FPC. Ensure that you have backup paths through a different FPC or FEB or CFEB to avoid service outage.
  
  Note:
  The power cycle and power off statements are mutually exclusive. You can configure either the power cycle or the power off action for an error, but not both.
- To generate a core file, configure the write-coredump statement.

You can configure multiple actions for a given FPC or FEB or CFEB. If you do not configure any actions, the sanity-poll statement generates only FPC or FEB or CFEB system log messages.
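
As a sketch of how the substatements described above fit together, the following commands configure sanity polling for an FPC with a retry count of 15 and three error actions. The slot number 0 and the choice of actions are assumptions made for illustration; on a FEB or CFEB, the same substatements would be entered under the corresponding feb or cfeb hierarchy level.

```
[edit]
user@host# set chassis fpc 0 sanity-poll retry-count 15
user@host# set chassis fpc 0 sanity-poll on-error raise-alarm
user@host# set chassis fpc 0 sanity-poll on-error power cycle
user@host# set chassis fpc 0 sanity-poll on-error write-coredump
```

Because power cycle and power off are mutually exclusive, only one of the two appears in the sketch.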

Configuring the Junos OS to Make a Flexible PIC Concentrator Stay Offline

By default, a Flexible PIC Concentrator (FPC) is configured to restart after a system reboot. You can use the request chassis fpc operational mode command to take an FPC offline, but on Junos OS the FPC attempts to restart when you enter a commit CLI command. To configure an FPC to stay offline and prevent it from restarting, include the power off statement at the [edit chassis fpc slot-number] hierarchy level:

To bring online an FPC that is configured to stay offline, and to configure it to stay online, include the power on statement at the [edit chassis fpc slot-number] hierarchy level:
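
As a sketch in set-command form, assuming FPC slot 2 (a slot number chosen only for illustration), the power off statement might be entered as follows:

```
[edit]
user@host# set chassis fpc 2 power off
user@host# commit
```

The corresponding sketch for the power on statement, under the same assumption, is:

```
[edit]
user@host# set chassis fpc 2 power on
user@host# commit
```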

Configuring an SFM to Stay Offline

By default, if you use the request chassis sfm CLI command to take a Switching and Forwarding Module (SFM) offline, the SFM attempts to restart when you enter a commit CLI command. To prevent a restart, you can configure an SFM to stay offline. This feature is useful for repair situations.

To configure an SFM to stay offline, include the sfm statement at the [edit chassis] hierarchy level:

slot number—Slot number in which the SFM is installed.
power off—Take the SFM offline and configure it to remain offline.

For example, the following statement takes an SFM in slot 3 offline:
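
In set-command form, this might be entered as follows. This is a sketch derived from the slot number and power off options described above.

```
[edit]
user@host# set chassis sfm 3 power off
user@host# commit
```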

Use the show chassis sfm CLI command to confirm the offline status:

To bring the SFM back online, delete the sfm statement at the [edit chassis] hierarchy level and then commit the configuration.

Resynchronizing FPC Sequence Numbers with Active FPCs when an FPC Comes Online

On M320, T320, T640, T1600, T4000, TX Matrix, and TX Matrix Plus routers, when you bring a Flexible PIC Concentrator (FPC) online, the sequence number on the FPC may not be synchronized with the other active FPCs in the router, which may result in the loss of a small amount of initial traffic.

To avoid any traffic loss, include the fpc-resync statement at the [edit chassis] hierarchy level. This ensures that the sequence numbers of the FPC that is brought online are resynchronized with the other active FPCs in the router.
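
A minimal sketch of enabling the statement in set-command form:

```
[edit]
user@host# set chassis fpc-resync
user@host# commit
```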

Note:

To prevent null-route filtering, the fpc-resync statement has no effect if a single LMNR-based FPC and one or more I-chip FPCs exist in the same chassis.

Enabling a Routing Engine to Reboot on Hard Disk Errors

When a hard disk error occurs, a Routing Engine might enter a state in which it responds to local pings and interfaces remain up, but no other processes are responding.

To recover from this situation, you can configure a single Routing Engine to reboot automatically when a hard disk error occurs. To enable this feature, include the on-disk-failure reboot statement at the [edit chassis routing-engine] hierarchy level.

For dual Routing Engine environments, you can configure a backup Routing Engine to assume the primary role automatically if it detects a hard disk error on the primary Routing Engine. To enable this feature, include the on-disk-failure statement at the [edit chassis redundancy failover] hierarchy level. For information about this statement, see the Junos OS High Availability User Guide.

You can configure the Routing Engine to halt (instead of rebooting) when the hard disk fails on the Routing Engine. To configure this feature, include the disk-failure-action (halt | reboot) statement at the [edit chassis routing-engine on-disk-failure] hierarchy level:

Use the halt option to configure the Routing Engine to halt when the hard disk fails. Use the reboot option to configure the Routing Engine to reboot when the hard disk fails.
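
As a sketch in set-command form, built from the statement and hierarchy level named above, configuring the Routing Engine to halt on a hard disk failure might look like this:

```
[edit]
user@host# set chassis routing-engine on-disk-failure disk-failure-action halt
user@host# commit
```

Replacing halt with reboot in the same command selects the reboot behavior instead.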

Handling Thermal Health Events Using Thermal Health Check and PSM Watchdog

You can use the thermal health check feature to configure an action to be taken when a thermal health event, such as power leakage, is detected. The feature monitors the power supply module (PSM) power output and the FRU power consumption. If the PSM power output exceeds the FRU power consumption by a user-defined threshold, the feature assumes that a thermal health event has occurred and takes an action based on your configuration. You can configure actions such as automatic shutdown or alarms.

An example of the configuration is as follows: set chassis thermal-health-check action-onfail auto-shutdown shutdown-timer 10 power-threshold 700. With this configuration, the software detects a thermal health event if the power leak exceeds 700 W, and shuts down the system 10 seconds after the thermal health failure is detected.

The thermal health check feature works only if the following conditions are met:

- The router has the high capacity AC or DC power distribution units (PDUs) installed in both slots, and each PDU has an equal number of PSMs. Both AC PSMs and DC PSMs are supported.
  
  The supported PSMs and PDUs are listed below:
  - High Capacity AC PSM (model: PSM2-PTX-AC; firmware: 0210 or later; hardware revision: 06 or later)
  - High Capacity 60A DC PSM (model: PSM2-PTX-DC; firmware: 0315 or later; hardware revision: 09 or later)
  - High Capacity 60A DC PDU (model: PDU2-PTX-DC; use the firmware version 0404 or later with hardware revision 07; use the firmware version 0503 or later with hardware revision 08)
  - High Capacity AC Delta PDU (model: PDU2-PTX-AC-D; firmware: 0305 or later; hardware revision: 04 or later)
  - High Capacity AC Wye PDU (model: PDU2-PTX-AC-W; firmware: 0305 or later; hardware revision: 03 or later)
  - High Capacity Single Phase AC PDU (model: PDU2-PTX-AC-SP; firmware: 0102 or later; hardware revision: 03 or later)
- Each PDU has at least three PSMs that are online, and each online PSM is consuming above 60A current (in case of an AC PSM) or above 100A current (in case of a DC PSM).
- None of the FRUs (RE, SIB, and FPC) is in the 'Present' state.

On the router, you can also configure the PSM watchdog feature at the [edit chassis] hierarchy. If a thermal health event causes Junos to go down, the PSM watchdog feature detects it and shuts down the router. In the watchdog configuration, you can specify the watchdog timer in seconds. After the specified duration, the watchdog expires. You can also specify the frequency (in minutes) at which Junos resets the watchdog counter. If the watchdog counter doesn’t get reset because of reasons such as Routing Engine crash, the PSM turns off the output power on watchdog timer expiry and thereby shuts down the router.

Example configurations are as follows:

Use set chassis psm watchdog timeout 600 pat-frequency 2. This command enables the PSM watchdog with the watchdog timer set to 600 seconds and the counter reset every 2 minutes.
Use set chassis thermal-health-check fet-failure-check action-onfail auto-shutdown shutdown-timer 10. This command enables the thermal health check and shuts down the system 10 seconds after a FET failure is detected.

Note:

The PSM watchdog feature works only if all the online PSMs in the router support this feature.

In short, if the Routing Engine software is running when a thermal event occurs, the thermal health check feature detects the thermal event and takes an action. However, if the Routing Engine software goes down in a thermal health event, it is the PSM watchdog timer that detects this issue and brings down the system.

Change History Table

Feature support is determined by the platform and release you are using. Use Feature Explorer to determine if a feature is supported on your platform.

Release

Description

13.3
