Managing Errors
Configuring FPC Error Levels and Actions
Starting with Junos OS Release 13.3, or Release 14.2 for M320 routers, you can configure Packet Forwarding Engine (PFE)-related error levels on FPCs on MX Series, PTX Series, and T Series routers, along with the actions to perform when a specified threshold is reached. In Junos OS Release 13.2 and earlier, Packet Forwarding Engine errors would disable the FPC. When you use the error command, Packet Forwarding Engine errors can be isolated, which reduces the need for a field replacement. Using the error command, you can classify errors according to severity, set an automatic recovery action for each severity, and configure the actions to perform when a specified threshold is reached. This command is available at the [edit chassis fpc slot-number] and [edit chassis] hierarchies.
To configure Packet Forwarding Engine error levels and actions for an FPC:
- (Optional) Configure the fatal error level threshold and action. A fatal error is an error that blocks a considerable amount of traffic across modules.
[edit chassis fpc fpc-number error]
user@host# set fatal action action
user@host# set fatal threshold threshold-level
If the severity level of the error is fatal, the action is carried out when the total number of errors reaches the threshold value. After the threshold value is crossed, the action is carried out for every occurrence of the error.
- (Optional) Configure the major error level threshold and action. A major error is an error that results in continuing loss of packet traffic but does not affect other modules.
[edit chassis fpc fpc-number error]
user@host# set major action action
user@host# set major threshold threshold-level
If the severity level of the error is major, the action is carried out when the total number of errors reaches the threshold value. After the threshold value is crossed, the action is carried out for every occurrence of the error.
- (Optional) Configure the minor error level threshold and action. A minor error is an error that results in the loss of a single packet but is fully recoverable.
[edit chassis fpc fpc-number error]
user@host# set minor action action
user@host# set minor threshold threshold-level
If the severity level is minor, the action is carried out only once, when the total number of errors reaches the threshold value.
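The threshold behavior described in the steps above can be modeled in a few lines of Python. This is an illustrative sketch only, not Junos code: the function name action_fires is hypothetical, and it assumes the error count includes the current occurrence and that the threshold is at least 1.

```python
def action_fires(severity: str, error_count: int, threshold: int) -> bool:
    """Return True if the configured action runs for this occurrence.

    Illustrative model of the documented behavior, not verified on a device.
    error_count is the total number of errors so far, including this one.
    """
    if severity in ("fatal", "major"):
        # Action runs when the threshold is reached and on every later occurrence.
        return error_count >= threshold
    if severity == "minor":
        # Action runs only once, when the count reaches the threshold.
        return error_count == threshold
    raise ValueError(f"unknown severity: {severity}")


# With a threshold of 3, list the occurrences (1 through 5) that trigger the action.
fatal_hits = [n for n in range(1, 6) if action_fires("fatal", n, 3)]
minor_hits = [n for n in range(1, 6) if action_fires("minor", n, 3)]
print(fatal_hits)  # [3, 4, 5]
print(minor_hits)  # [3]
```

The difference between the two branches is the point of the sketch: fatal and major actions repeat after the threshold is crossed, while a minor action is a one-time event.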
Starting with Junos OS Release 18.1R3, MX Series routers support configuration of error thresholds and actions at the error scope and error category levels. Use the command set chassis fpc fpc-slot error scope error-scope category category (fatal | major | minor) threshold error-threshold action (alarm | disable-pfe | get-state | offline | log | reset | trap | online-pfe | reset-pfe) to configure a threshold and action for a particular error scope and category at the FPC level. You can also configure these features at the chassis level (at the [edit chassis] hierarchy). However, a threshold and action configured at the [edit chassis fpc] hierarchy override the same configuration at the [edit chassis] hierarchy.
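The override rule above can be pictured as a simple two-level lookup. The following Python sketch is a model of that rule only; the function name, the dictionary layout, and the example threshold and action values are all hypothetical, and the scope and category names are borrowed from the sample log message on this page.

```python
from typing import Dict, Optional, Tuple

# Key: (scope, category, severity); value: (threshold, action).
ErrorKey = Tuple[str, str, str]
ErrorSetting = Tuple[int, str]


def effective_setting(
    key: ErrorKey,
    chassis_level: Dict[ErrorKey, ErrorSetting],
    fpc_level: Dict[ErrorKey, ErrorSetting],
) -> Optional[ErrorSetting]:
    """Pick the setting that applies to one FPC for the given key.

    A setting at the FPC level overrides the same setting at the chassis level.
    Returns None when neither level is configured (device defaults are not modeled).
    """
    if key in fpc_level:
        return fpc_level[key]
    return chassis_level.get(key)


# Hypothetical example values.
chassis_cfg = {
    ("pfe", "functional", "minor"): (10, "log"),
    ("pfe", "functional", "major"): (1, "alarm"),
}
fpc0_cfg = {("pfe", "functional", "minor"): (5, "alarm")}

print(effective_setting(("pfe", "functional", "minor"), chassis_cfg, fpc0_cfg))  # (5, 'alarm')
print(effective_setting(("pfe", "functional", "major"), chassis_cfg, fpc0_cfg))  # (1, 'alarm')
```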
You can use the command show chassis fpc errors to view the error information at the error scope and category level.
For Junos OS Evolved, you can use the following show commands to view the error information:
- show system errors count: Displays system-wide errors and their counts.
- show system errors active: Displays current active errors in the system.
- show system errors active fpc <slot number>: Displays active errors for the specified FPC.
- show system errors fru detail: Displays detailed FRU-specific errors.
- show system errors fru detail fpc <slot number>: Displays information about detected errors based on the FRU.
If you have configured the action log against a particular error threshold, the system logs the event when the error count breaches the set threshold. The following sample syslog messages indicate an error threshold breach and the resultant action being taken:
Sep 17 23:12:10 sw-s3-u8-03 fpc0 Error: /fpc/0/pfe/0/cm/0/PE_Chip/1/PECHIP_CMERROR_OQB_INT_REG_RD_ADDR_ERR (0x21078b), scope: pfe, category: functional, severity: minor, module: PE Chip, type: Description for PECHIP_CMERROR_OQB_INT_REG_RD_ADDR_ERR
Sep 17 23:12:10 sw-s3-u8-03 fpc0 Performing action log for error /fpc/0/pfe/0/cm/0/PE_Chip/1/PECHIP_CMERROR_OQB_INT_REG_RD_ADDR_ERR (0x21078b) in module: PE Chip with scope: pfe category: functional level: minor
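If you want to pull the error-id, scope, category, and severity out of such messages with a script, a regular expression is one option. The Python sketch below is written against the first sample message above only; the pattern and the function name parse_error_line are assumptions, and the message layout may differ on other platforms or releases.

```python
import re
from typing import Dict, Optional

# Pattern derived from the sample "Error:" message on this page (an assumption,
# not a documented log format).
ERROR_LINE = re.compile(
    r"Error: (?P<error_id>\S+) \((?P<code>0x[0-9a-fA-F]+)\), "
    r"scope: (?P<scope>[^,]+), category: (?P<category>[^,]+), "
    r"severity: (?P<severity>[^,]+), module: (?P<module>[^,]+), "
    r"type: (?P<type>.*)"
)


def parse_error_line(line: str) -> Optional[Dict[str, str]]:
    """Return the named fields of an 'Error:' log line, or None if it does not match."""
    match = ERROR_LINE.search(line)
    return match.groupdict() if match else None


sample = (
    "Sep 17 23:12:10 sw-s3-u8-03 fpc0 Error: "
    "/fpc/0/pfe/0/cm/0/PE_Chip/1/PECHIP_CMERROR_OQB_INT_REG_RD_ADDR_ERR (0x21078b), "
    "scope: pfe, category: functional, severity: minor, module: PE Chip, "
    "type: Description for PECHIP_CMERROR_OQB_INT_REG_RD_ADDR_ERR"
)
fields = parse_error_line(sample)
print(fields["error_id"])
print(fields["scope"], fields["category"], fields["severity"])  # pfe functional minor
```

The second sample message ("Performing action ...") uses a different layout and is deliberately not matched by this pattern.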
The offline, reset, disable-pfe, offline-pfe, and reset-pfe actions are mutually exclusive with respect to configuration. The specified PFE is disabled automatically if offline-pfe or reset-pfe is configured. The disable-pfe action is available in Junos OS Release 17.4 and later.
The following table provides details about PFE error mapping actions and the system response:
Action | Response |
---|---|
disable-pfe | Disables all PFE interfaces, alarms and logs. |
offline | Takes the FPC offline, disables the alarms and logs. |
reset | Takes the FPC offline and resets to online, enables the alarms and logs. |
reset-pfe | Powers off the PFE, disables the alarms and logs, then powers on the PFE and enables the alarms and logs. |
offline-pfe | Powers off the PFE, disables the alarms and logs. |
Example: Configuring FPC Error Detection and Self-Healing on T Series Core Routers
This example shows how to configure error detection and self-healing on a Juniper Networks T Series Core Router with Type 5 FPC.
Requirements
This example uses the following hardware and software components:
Juniper Networks T4000 Core Router with Type 5 FPCs.
Junos OS Release 13.3 or later.
Before you proceed, ensure that the required connections are complete and the interfaces are functional.
Overview
FPC error detection and self-healing involves configuring a set of actions to be performed on each FPC when the number of errors for a particular severity increases beyond a user-configured threshold. The error severity is categorized into fatal, major, and minor. Recovery actions include raising an alarm, generating log entries, getting the current state of the FPC, restarting the FPC, taking the FPC offline, and resetting the FPC. For a particular FPC and error severity, you can configure the error threshold to any value within the allowed limits and map the threshold to an action. In this example, you configure these thresholds and actions on FPC 0 of a Juniper Networks T4000 Core Router.
Configuration
To configure the error detection and self-healing, you need to set the error severity, threshold values corresponding to each error severity, and actions to be performed when the threshold value is crossed.
CLI Quick Configuration
To quickly configure this example, copy the following commands, paste them into a text file, remove any line breaks, change any details necessary to match your network configuration, and then copy and paste the commands into the CLI at the [edit] hierarchy level.
set chassis fpc 0 error fatal threshold 1 action reset
set chassis fpc 0 error major threshold 1 action alarm
set chassis fpc 0 error minor threshold 10 action log
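When the same settings must be prepared for several FPC slots, the commands can be generated from a small table. The Python sketch below builds strings in the set chassis fpc slot error severity threshold value action action form used in the step-by-step procedure of this example; the function name and the dictionary layout are hypothetical, and the output is plain text to paste into the CLI, not an API call to the router.

```python
from typing import Dict, List, Tuple


def fpc_error_commands(slot: int, settings: Dict[str, Tuple[int, str]]) -> List[str]:
    """Build one 'set' command per configured severity for the given FPC slot.

    settings maps a severity name to a (threshold, action) pair.
    """
    commands = []
    for severity in ("fatal", "major", "minor"):
        if severity in settings:
            threshold, action = settings[severity]
            commands.append(
                f"set chassis fpc {slot} error {severity} "
                f"threshold {threshold} action {action}"
            )
    return commands


# Values taken from this example: FPC 0 with reset, alarm, and log actions.
example = fpc_error_commands(
    0, {"fatal": (1, "reset"), "major": (1, "alarm"), "minor": (10, "log")}
)
for command in example:
    print(command)
```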
Configuring the Error Detection and Self-Healing
Step-by-Step Procedure
The following example requires you to navigate various levels in the configuration hierarchy. For information about navigating the CLI, see Using the CLI Editor in Configuration Mode in the CLI User Guide.
Configure the threshold value and associated action for fatal errors.
Set the error severity to fatal.
[edit]
user@host# set chassis fpc 0 error fatal
Set the threshold value for fatal errors.
[edit]
user@host# set chassis fpc 0 error fatal threshold 1
Set the associated action for fatal errors.
[edit]
user@host# set chassis fpc 0 error fatal threshold 1 action reset
Configure the threshold value and associated action for major errors.
Set the error severity to major.
[edit]
user@host# set chassis fpc 0 error major
Set the threshold value for major errors.
[edit]
user@host# set chassis fpc 0 error major threshold 1
Set the associated action for major errors.
[edit]
user@host# set chassis fpc 0 error major threshold 1 action alarm
Configure the threshold value and associated action for minor errors.
Set the error severity to minor.
[edit]
user@host# set chassis fpc 0 error minor
Set the threshold value for minor errors.
[edit]
user@host# set chassis fpc 0 error minor threshold 10
Set the associated action for minor errors.
[edit]
user@host# set chassis fpc 0 error minor threshold 10 action log
Results
The following is the result of the configuration for the fatal severity level.
user@host# set chassis fpc 0 error ?
Possible completions:
+ apply-groups         Groups from which to inherit configuration data
+ apply-groups-except  Don't inherit configuration data from these groups
> fatal                FPC Fatal errors (default threshold = 1)
> major                FPC Major Level errors (default threshold = 1)
> minor                FPC Minor Level errors (default threshold = 10)
user@host# set chassis fpc 0 error fatal action ?
Possible completions:
  alarm                Raise FPC alarm
  get-state            Retreive FPC state for debugging
  log                  Log occurence to system log file
  offline              Offline FPC
  offline-pic          Offline PICs associated with PFE on FPC
  reset                Reset FPC
user@host# set chassis fpc 0 error fatal action reset
user@host# set chassis fpc 0 error fatal threshold ?
Possible completions:
  <threshold>          Error count at which to take the action (0..4294967295)
user@host# set chassis fpc 0 error fatal threshold 1
If you are done configuring the device, enter commit from configuration mode.
Verification
To verify that the configuration is successful and the router is configured with the correct action, use the show chassis fpc errors command.
Verifying the Configured Actions Related to Fatal Severity of FPC Error
Purpose
Make sure that the threshold value and the associated action are set for fatal errors.
Action
user@host> show chassis fpc errors
FPC Level Occurred Cleared Threshold Action-Taken Action
0 Fatal 0 0 1 RESET
Pfe-State: pfe-0 -ENABLED | pfe-1 -ENABLED | pfe-2 -ENABLED | pfe-3 -ENABLED | pfe-4 -ENABLED | pfe-5 -ENABLED | pfe-6 -ENABLED | pfe-7 -ENABLED |
Meaning
The sample output shows the Fatal error level for FPC 0 with 0 errors Occurred and 0 errors Cleared (no previous occurrences), the Threshold value set to 1, and Action-Taken set to RESET.
Managing FPC Errors
On PTX Series routers, you can disable an FPC error or modify the severity of the error at the error-id level. See FPC self-healing for details on PTX platforms that support this feature.
The error-id, which uniquely identifies an FPC error, is represented in the uniform resource identifier (URI) format and is composed of a module identifier and an error identifier. If an error occurs, you can find the error-id in the system log messages.
Modifying Severity of an Error
Though you cannot configure a new error severity, you can modify the existing severity of an error. For example, if you do not want to treat a particular error (identified by an error-id) as fatal anymore, you can modify its severity to major or minor as required.
You cannot modify the error severity at a group (for example, category) level.
To modify the severity of an error, use the following command:
user@host# set chassis fpc fpc-slot error error-id severity new-severity
See the following example:
user@host# set chassis fpc 3 error "/cpu/0/memory/0/ECC_CORRECTED_ERROR" severity minor
In the above example, you modified the severity of the error ID "/cpu/0/memory/0/ECC_CORRECTED_ERROR" in FPC 3 to minor.
Disabling an Error
To configure the system to stop reporting an error, identify the error-id and disable it. You can find the error-id in the system log messages. To disable an error, use the following command:
user@host# set chassis fpc fpc-slot error error-id state disable
See the following example:
user@host# set chassis fpc 3 error "/cpu/0/memory/0/ECC_CORRECTED_ERROR" state disable
In the above example, you disabled the error "/cpu/0/memory/0/ECC_CORRECTED_ERROR" in FPC 3.
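Because an error-id is a URI-style string that has to be quoted on the command line, it can be convenient to build these commands with a helper when you take the error-id from a log message. The Python sketch below only formats the two command forms shown in this section; the function name error_id_command is hypothetical, and it does not validate that the error-id exists on the device.

```python
def error_id_command(slot: int, error_id: str, statement: str, value: str) -> str:
    """Format a per-error-id command in the form shown in this section.

    statement is "severity" (for example, value "minor") or "state" (value "disable").
    """
    if statement not in ("severity", "state"):
        raise ValueError(f"unsupported statement: {statement}")
    # The error-id is wrapped in double quotes, as in the examples above.
    return f'set chassis fpc {slot} error "{error_id}" {statement} {value}'


error_id = "/cpu/0/memory/0/ECC_CORRECTED_ERROR"
print(error_id_command(3, error_id, "severity", "minor"))
print(error_id_command(3, error_id, "state", "disable"))
```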
Powering Off Packet Forwarding Engines
You can power on or power off the Packet Forwarding Engines in a running system, or keep a Packet Forwarding Engine powered off when the FPC comes online. The following are a couple of scenarios in which this feature is used.
When the Packet Forwarding Engine ASIC is malfunctioning.
To conserve power in case the deployment does not require the full capacity of the system.
To power off a Packet Forwarding Engine, use the following steps:
user@host# set chassis fpc slot-number pfe pfe-id power off
user@host# commit
To power on a Packet Forwarding Engine, use the following steps:
user@host# set chassis fpc slot-number pfe pfe-id power on
user@host# commit
You need to apply this configuration to both the Packet Forwarding Engines in an ASIC to be able to commit the configuration.
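Since the configuration commits only when both Packet Forwarding Engines of an ASIC are covered, it helps to generate the commands per ASIC and not per Packet Forwarding Engine. The Python sketch below does that for a caller-supplied list of PFE IDs; the function name is hypothetical, and the pairing of PFE 2 and PFE 3 in the usage lines is an assumption made for illustration, because the PFE-to-ASIC mapping depends on the line card.

```python
from typing import List, Sequence


def pfe_power_commands(slot: int, asic_pfes: Sequence[int], state: str) -> List[str]:
    """Build power on/off commands for every PFE that belongs to one ASIC.

    asic_pfes must list all PFE IDs on the ASIC; the mapping is platform-specific
    and is supplied by the caller (it is not derived here).
    """
    if state not in ("on", "off"):
        raise ValueError("state must be 'on' or 'off'")
    return [f"set chassis fpc {slot} pfe {pfe} power {state}" for pfe in asic_pfes]


# Hypothetical pairing: PFEs 2 and 3 share an ASIC on the FPC in slot 0.
for command in pfe_power_commands(0, (2, 3), "off"):
    print(command)
print("commit")
```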
On MX Series routers with MPC10E-15C-MRATE, you can power off or power on only Packet Forwarding Engine 2. Packet Forwarding Engines 0 and 1 do not support this command. On the MPC10E-15C-MRATE, operating Packet Forwarding Engine 2 requires Packet Forwarding Engines 0 and 1 to be functional. You can use the command show chassis fpc fpc-slot detail to view the Packet Forwarding Engine power ON/OFF status and bandwidth for the individual Packet Forwarding Engines in the MPC10E-15C-MRATE.
You can use the show chassis fpc fpc-slot detail command to view the Packet Forwarding Engine power on/off configuration status. See an example below:
user@router> show chassis fpc 0 detail
Slot 0 information:
  State                               Online
  Temperature                         41 degrees C / 105 degrees F (PFE_24-HBM)
  Temperature                         44 degrees C / 111 degrees F (PFE_25-HBM)
  Temperature                         43 degrees C / 109 degrees F (PFE_26-HBM)
  Temperature                         41 degrees C / 105 degrees F (PFE_27-HBM)
  Temperature                         40 degrees C / 104 degrees F (PFE_28-HBM)
  Temperature                         40 degrees C / 104 degrees F (PFE_29-HBM)
  Temperature                         38 degrees C / 100 degrees F (PFE_30-HBM)
  Temperature                         39 degrees C / 102 degrees F (PFE_31-HBM)
  Start time                          2020-10-28 00:46:17 PDT
  Uptime                              1 day, 1 hour, 34 minutes, 48 seconds
  Max power consumption               825 Watts
  PFE Information:
    PFE  Power ON/OFF  Bandwidth  SLC
    0    On            500
    1    On            500
    2    On            500
    3    On            500
    4    On            500
    5    On            500
    6    On            500
    7    On            500
Configuring Sanity Polling
You can configure the sanity-poll statement for a particular FPC or FEB or CFEB to start a periodic sanity check for that FPC or FEB or CFEB. The periodic sanity check includes checking for error conditions such as “register sanity issues,” “high temperature,” “hardware failure,” and so on. If you do not configure the sanity-poll statement, then sanity polling is disabled.
Currently, periodic sanity check is performed only on the routing chip register.
Sanity polling periodically checks for an error condition in an FPC or FEB or CFEB and performs the appropriate actions in case of an error.
To configure sanity polling for an FPC on T Series routers and M320 routers, include the sanity-poll statement and its substatements at the [edit chassis fpc slot-number] hierarchy level:
[edit chassis]
fpc slot-number {
    sanity-poll {
        retry-count number;
        on-error {
            raise-alarm;
            power (cycle | off);
            write-coredump;
        }
    }
}
To configure sanity polling for a FEB on the M120 router, include the sanity-poll statement and its substatements at the [edit chassis feb slot-number] hierarchy level:
[edit chassis]
feb slot-number {
    sanity-poll {
        retry-count number;
        on-error {
            raise-alarm;
            power (cycle | off);
            write-coredump;
        }
    }
}
To configure sanity polling for a CFEB on M7i and M10 routers, include the sanity-poll statement and its substatements at the [edit chassis cfeb slot-number] hierarchy level:
[edit chassis]
cfeb slot-number {
    sanity-poll {
        retry-count number;
        on-error {
            raise-alarm;
            power (cycle | off);
            write-coredump;
        }
    }
}
On a TX Matrix or TX Matrix Plus router, you can configure the sanity-poll statement at the [edit chassis lcc number fpc number] hierarchy level.
The sanity-poll statement comprises the following substatements:
- The retry-count statement specifies the number of rechecks to be performed after the occurrence of a particular error condition. If an error exists in all the periodic checks, then sanity polling reports an error and proceeds to perform the appropriate actions (described as options of the on-error statement).
  For example, if the periodic sanity check detects an error in the FPC or FEB or CFEB and if you configure the retry-count number to 15, sanity polling does not report the error immediately. Sanity polling checks 15 times for the same error condition. If an error persists in all 15 rechecks, then it reports an error and takes the appropriate actions.
  If you do not configure the retry-count statement, then by default, the sanity-poll statement rechecks the detected error 10 times before reporting an error condition.
- If sanity polling detects an error condition, the on-error statement performs the appropriate actions to eliminate the error.
  The following actions are common to all kinds of error conditions:
  - To generate a chassis alarm, configure the raise-alarm statement. The chassis alarm is displayed on the front panel of the chassis.
  - To reboot the FPC or FEB or CFEB after generating a core file, configure the power cycle statement. This statement is useful for temporary software errors that are eliminated after reboot.
  - To halt the FPC or FEB or CFEB, configure the power off statement. This statement is useful in case of permanent hardware failure.
    CAUTION: The power off statement halts the FPC. Ensure that you have backup paths through a different FPC or FEB or CFEB to avoid service outage.
    Note: The power cycle and power off statements are mutually exclusive: You can configure either the power cycle or the power off action for an error.
  - To trigger the core file, configure the write-coredump statement.
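The retry-count behavior described above amounts to "report only if every recheck still sees the error." The Python sketch below models that rule; it is not Junos code, the function name error_confirmed and the callable check_has_error are hypothetical, and the default of 10 is the one stated in the text.

```python
from typing import Callable

DEFAULT_RETRY_COUNT = 10  # default number of rechecks stated in the text


def error_confirmed(
    check_has_error: Callable[[], bool],
    retry_count: int = DEFAULT_RETRY_COUNT,
) -> bool:
    """Return True only if the error is seen in every one of retry_count rechecks.

    check_has_error stands in for one periodic sanity check and returns True
    while the error condition is still present.
    """
    for _ in range(retry_count):
        if not check_has_error():
            # The error cleared during rechecking, so nothing is reported.
            return False
    return True


# A persistent error is reported after 15 rechecks.
print(error_confirmed(lambda: True, retry_count=15))  # True

# A transient error that clears on the third recheck is not reported.
transient = iter([True, True, False])
print(error_confirmed(lambda: next(transient), retry_count=15))  # False
```

Only when the function returns True would the on-error actions (raise-alarm, power cycle or power off, write-coredump) come into play.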
You can configure multiple actions for a given FPC or FEB or CFEB. If you do not configure any actions, the sanity-poll statement generates only FPC or FEB or CFEB system log messages.
Configuring the Junos OS to Make a Flexible PIC Concentrator Stay Offline
By default, a Flexible PIC Concentrator (FPC) is configured to restart after a system reboot. You can use the request chassis fpc operational mode command to take an FPC offline, but on Junos OS the FPC attempts to restart when you enter a commit CLI command. To configure an FPC to stay offline and prevent it from restarting, include the power off statement at the [edit chassis fpc slot-number] hierarchy level:
[edit chassis fpc slot-number]
power off;
To bring an FPC online that is configured to stay offline and configure it to stay online, include the power on statement at the [edit chassis fpc slot-number] hierarchy level:
[edit chassis fpc slot-number]
power on;
Configuring an SFM to Stay Offline
By default, if you use the request chassis sfm CLI command to take a Switching and Forwarding Module (SFM) offline, the SFM attempts to restart when you enter a commit CLI command. To prevent a restart, you can configure an SFM to stay offline. This feature is useful for repair situations.
To configure an SFM to stay offline, include the sfm statement at the [edit chassis] hierarchy level:
[edit chassis]
sfm slot-number {
    power off;
}
slot-number: Slot number in which the SFM is installed.
power off: Take the SFM offline and configure it to remain offline.
For example, the following statement takes an SFM in slot 3 offline:
[edit chassis]
sfm 3 power off;
Use the show chassis sfm CLI command to confirm the offline status:
user@host> show chassis sfm
            Temp  CPU Utilization (%)   Memory     Utilization (%)
Slot State  (C)   Total  Interrupt      DRAM (MB)  Heap  Buffer
0    Online   34      2          0          64       16      47
1    Online   38      2          0          64       16      47
2    Online   42      2          0          64       16      47
3    Offline  --- Configured power off ---
To bring the SFM back online, delete the sfm statement at the [edit chassis] hierarchy level and then commit the configuration.
Resynchronizing FPC Sequence Numbers with Active FPCs when an FPC Comes Online
On M320, T320, T640, T1600, T4000, TX Matrix, and TX Matrix Plus routers, when you bring a Flexible PIC Concentrator (FPC) online, the sequence number on the FPC may not be synchronized with the other active FPCs in the router, which may result in the loss of a small amount of initial traffic.
To avoid any traffic loss, include the fpc-resync statement at the [edit chassis] hierarchy level. This ensures that the sequence numbers of the FPC that is brought online are resynchronized with the other active FPCs in the router.
[edit chassis]
fpc-resync;
To prevent null-route filtering, the fpc-resync statement has no effect if a single LMNR-based FPC and one or more I-chip FPCs exist in the same chassis.
Enabling a Routing Engine to Reboot on Hard Disk Errors
When a hard disk error occurs, a Routing Engine might enter a state in which it responds to local pings and interfaces remain up, but no other processes are responding.
To recover from this situation, you can configure a single Routing Engine to reboot automatically when a hard disk error occurs. To enable this feature, include the on-disk-failure reboot statement at the [edit chassis routing-engine] hierarchy level.
[edit chassis routing-engine]
on-disk-failure {
    disk-failure-action (halt | reboot);
}
For dual Routing Engine environments, you can configure a backup Routing Engine to assume primary role automatically, if it detects a hard disk error on the primary Routing Engine. To enable this feature, include the on-disk-failure statement at the [edit chassis redundancy failover] hierarchy level.
For information about this statement, see the Junos OS High Availability User Guide.
You can configure the Routing Engine to halt (instead of rebooting) when the hard disk fails on the Routing Engine. To configure this feature, include the disk-failure-action (halt | reboot) statement at the [edit chassis routing-engine on-disk-failure] hierarchy level:
[edit chassis routing-engine]
on-disk-failure {
    disk-failure-action (halt | reboot);
}
Use the halt option to configure the Routing Engine to halt when the hard disk fails. Use the reboot option to configure the Routing Engine to reboot when the hard disk fails.
Handling Thermal Health Events Using Thermal Health Check and PSM Watchdog
You can use the thermal health check feature to configure an action to be taken on detection of a thermal health event such as power leakage. The thermal check feature monitors the power supply module (PSM) power output and FRU power consumption, and if it detects that the PSM power output exceeds the FRU power consumption by a user-defined threshold, it assumes that there is a thermal health event and takes an action based on user configuration. You can configure actions such as auto shutdown or alarms to be initiated on detection of a thermal health event. An example of the configuration is as follows:
set chassis thermal-health-check action-onfail auto-shutdown shutdown-timer 10 power-threshold 700
This example configuration enables the software to detect a thermal health event if the power leak exceeds 700W, and shuts down the system 10 seconds after the thermal health failure is detected.
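The detection rule in the paragraph above is a single comparison between PSM output and FRU consumption. The Python sketch below models it; the function name and the wattage readings in the usage lines are hypothetical, and only the 700 W threshold comes from the example configuration.

```python
def thermal_event_detected(
    psm_output_watts: float,
    fru_consumption_watts: float,
    power_threshold_watts: float,
) -> bool:
    """Return True when PSM output exceeds FRU consumption by more than the threshold.

    Illustrative model of the documented rule; how the device samples and
    averages these readings is not modeled.
    """
    power_leak = psm_output_watts - fru_consumption_watts
    return power_leak > power_threshold_watts


# Hypothetical readings against the 700 W threshold from the example configuration.
print(thermal_event_detected(5000, 4200, 700))  # True: the leak is 800 W
print(thermal_event_detected(5000, 4400, 700))  # False: the leak is 600 W
```

In the example configuration, a detected event would then start the 10-second shutdown timer.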
The thermal health check feature works only if:
- The router has the high capacity AC or DC power distribution units (PDU) installed in both the slots, and each PDU has an equal number of PSMs. Both AC PSM and DC PSM are supported.
  The supported PSMs and PDUs are listed below:
  - High Capacity AC PSM (model: PSM2-PTX-AC; firmware: 0210 or later; hardware revision: 06 or later)
  - High Capacity 60A DC PSM (model: PSM2-PTX-DC; firmware: 0315 or later; hardware revision: 09 or later)
  - High Capacity 60A DC PDU (model: PDU2-PTX-DC; use the firmware version 0404 or later with hardware revision 07; use the firmware version 0503 or later with hardware revision 08)
  - High Capacity AC Delta PDU (model: PDU2-PTX-AC-D; firmware: 0305 or later; hardware revision: 04 or later)
  - High Capacity AC Wye PDU (model: PDU2-PTX-AC-W; firmware: 0305 or later; hardware revision: 03 or later)
  - High Capacity Single Phase AC PDU (model: PDU2-PTX-AC-SP; firmware: 0102 or later; hardware revision: 03 or later)
- Each PDU has at least three PSMs that are online, and each online PSM is consuming above 60A current (in case of an AC PSM) or above 100A current (in case of a DC PSM).
- None of the FRUs (RE, SIB, and FPC) is in the ’Present’ state.
On the router, you can also configure the PSM watchdog feature at the [edit chassis] hierarchy. If a thermal health event causes Junos to go down, the PSM watchdog feature detects it and shuts down the router. In the watchdog configuration, you can specify the watchdog timer in seconds. After the specified duration, the watchdog expires. You can also specify the frequency (in minutes) at which Junos resets the watchdog counter. If the watchdog counter doesn’t get reset because of reasons such as Routing Engine crash, the PSM turns off the output power on watchdog timer expiry and thereby shuts down the router.
Example configurations are as follows:
- Use set chassis psm watchdog timeout 600 pat-frequency 2. This command enables the PSM watchdog with the watchdog timer set to 600 seconds and the counter reset every 2 minutes.
- Use set chassis thermal-health-check fet-failure-check action-onfail auto-shutdown shutdown-timer 10. This command enables thermal health check and shuts down the system 10 seconds after FET failure is detected.
The PSM watchdog feature works only if all the online PSMs in the router support this feature.
In short, if the Routing Engine software is running when a thermal event occurs, the thermal health check feature detects the thermal event and takes an action. However, if the Routing Engine software goes down in a thermal health event, it is the PSM watchdog timer that detects this issue and brings down the system.
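The watchdog mechanism can be pictured as a timer that Junos keeps resetting ("patting") while it is healthy. The Python sketch below is a simplified model under stated assumptions: the class and method names are hypothetical, time is passed in explicitly as seconds, and treating the watchdog as expired once the full timeout has elapsed since the last reset is an assumption about the exact boundary.

```python
class PsmWatchdogModel:
    """Toy model of a watchdog timer that expires unless it is reset in time."""

    def __init__(self, timeout_seconds: int) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_pat = 0  # time of the most recent reset, in seconds

    def pat(self, now: int) -> None:
        """Record that the software reset the watchdog counter at time 'now'."""
        self.last_pat = now

    def expired(self, now: int) -> bool:
        """Return True if the timeout has elapsed since the last reset."""
        return now - self.last_pat >= self.timeout_seconds


# Values from the example: 600-second timeout, reset every 2 minutes (120 s).
watchdog = PsmWatchdogModel(timeout_seconds=600)
for t in range(0, 481, 120):  # resets at 0, 120, 240, 360, and 480 seconds
    watchdog.pat(t)

# Suppose the Routing Engine stops resetting the counter after t = 480.
print(watchdog.expired(900))   # False: 420 s since the last reset
print(watchdog.expired(1080))  # True: 600 s since the last reset
```

In the real feature, expiry is the point at which the PSM turns off the output power and the router shuts down.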