Fabric Fault Handling Overview

The T4000 router uses a Switch Interface Board (SIB) whose fabric bandwidth is double that of the T1600 router. Fabric fault management on the T4000 router is similar to that on T1600 routers. This topic describes fabric fault handling on T4000 routers.

Fabric fault management monitors all high-speed links connected to the fabric, as well as the links within the fabric core, for link failures and link errors.

Action is taken based on the fault and its location. The actions include:

  • Reporting link errors in system log files and sending this information to the Routing Engine.
  • Reporting link failures at the Flexible Port Concentrator (FPC) or at the SIB and sending this information to the Routing Engine.
  • Marking a SIB in Check state.
  • Moving a SIB into Fault state.

The SIBs in T4000 routers form the core of the fabric with 4:1 redundancy. The redundant SIB becomes active when an active SIB becomes nonfunctional, is deactivated, or is removed. Junos OS monitors the following high-level indications of fabric faults:

  • An SNMP trap is generated whenever a SIB is reported as Check or Fault.
  • show chassis alarms—Indicates that a SIB is in Check or Fault state.
  • show chassis sibs—Indicates that a SIB is in Check or Fault state or that a SIB is in Offline state when the SIB initializes (this occurs when the SIB does not power on fully).
  • show chassis fabric fpcs—Indicates whether any fabric links are in error on the FPCs’ side.
  • show chassis fabric sibs—Indicates whether any fabric links are in error on the SIBs’ side.
  • The /var/log/messages system log file on the Routing Engine contains error messages with the prefix CHASSISD_FM_ERROR.
  • The SIBs display the FAIL LED.
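
As an illustration of the system log indication above, the following minimal Python sketch filters log lines for the CHASSISD_FM_ERROR prefix. Only the prefix and the /var/log/messages path come from this topic. The function name and the sample lines are hypothetical, and the sample lines do not reproduce the real Junos OS message format.

```python
from typing import Iterable, List

# Prefix named in this topic for fabric-management error messages.
FM_ERROR_PREFIX = "CHASSISD_FM_ERROR"


def fabric_error_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that contain the fabric-management error prefix."""
    return [line.rstrip("\n") for line in lines if FM_ERROR_PREFIX in line]


# Hypothetical placeholder lines, not real Junos OS log output.
sample = [
    "placeholder one: CHASSISD_FM_ERROR made-up text\n",
    "placeholder two: unrelated message\n",
]
matches = fabric_error_lines(sample)
print(len(matches))
```

On a system where a copy of the log file is available, the same function could be applied to an open file handle, for example `fabric_error_lines(open("/var/log/messages"))`, assuming the file is readable as text.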

The fabric planes in the chassis determine whether the chassis is a T640 router, a T1600 router, or a T4000 router. Power entry modules (PEMs), FPCs, and fan trays do not determine chassis personality. Alarms are raised if old PEMs or fan trays are present in a T4000 chassis. You can identify a router based on its fabric planes:

  • If all planes present are F16-based SIBs, the chassis is a T640 chassis.
  • If all planes present are SF-based SIBs, the chassis is a T1600 chassis.
  • If all planes present are XF-based SIBs, the chassis is a T4000 chassis.
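
The rules above can be sketched as a small lookup. This is an illustrative Python sketch, not Junos OS code: the function name and the short type labels ("F16", "SF", "XF") are assumptions chosen to mirror the list, and returning None for an empty or mixed set of planes is a design choice of the sketch.

```python
from typing import Iterable, Optional

# Mapping taken from the list above: SIB type -> chassis personality.
PERSONALITY_BY_SIB_TYPE = {
    "F16": "T640",
    "SF": "T1600",
    "XF": "T4000",
}


def chassis_personality(plane_types: Iterable[str]) -> Optional[str]:
    """Return the personality when all planes present share one SIB type.

    Returns None when no planes are present, when the planes are mixed,
    or when the SIB type is not one of the three listed above.
    """
    kinds = set(plane_types)
    if len(kinds) != 1:
        return None
    return PERSONALITY_BY_SIB_TYPE.get(kinds.pop())


print(chassis_personality(["XF", "XF", "XF", "XF", "XF"]))
print(chassis_personality(["SF", "XF"]))
```

The first call models a chassis in which every plane is XF-based. The second models a mixed set of planes, for which the sketch reports no personality.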

Note: Mixing fabric planes is not a supported configuration except during an upgrade. You can change the personality of a chassis without a reboot by changing all the fabric planes and by issuing the set chassis fabric upgrade-mode CLI command to check the personality. If you do not issue the set chassis fabric upgrade-mode CLI command, the personality does not change until the next boot.

The following types of faults can occur in T4000 routers:

  • Board-level faults—These faults occur during initialization or during runtime. Examples include power failure during board initialization, high-speed link transmit errors, and polled I/O errors during runtime.
  • Link-level faults—These faults occur during initialization or during runtime. Examples include link training failure at initialization time (the data plane links between an FPC and a SIB fail to train when the FPC or SIB is initialized), an error detected on the channel between the SIB and a Packet Forwarding Engine, cyclic redundancy check (CRC) errors detected at runtime, and Packet Forwarding Engine destination errors.
  • Faults based on environmental conditions—These faults occur during runtime. Sudden removal of an FPC or a SIB might result in an operator error. When a SIB becomes too hot or when SIB voltages are beyond thresholds, the resulting errors are classified as environmental errors.

You can implement one of the following options to handle the faults:

  • Log the error and raise an alarm.
  • Switch over to the spare plane, if available.
  • Continue with a reduced number of parts of a plane.
  • Continue with a reduced number of usable planes.
  • Use polling-based fault handling.
  • Monitor high-speed link errors and manually bring the link down to a suitable threshold.

The polled I/O errors and the link errors are monitored every 500 milliseconds, and the board exhaust temperature and board voltages are monitored every 10 seconds.
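
To make the two monitoring intervals concrete, the following Python sketch reports which checks would be due at a given elapsed time. Only the 500-millisecond and 10-second intervals come from this topic. The function and the check names are hypothetical, and the sketch assumes both timers start together at time zero, which this topic does not state.

```python
from typing import List

LINK_POLL_MS = 500      # polled I/O errors and link errors
ENV_POLL_MS = 10_000    # board exhaust temperature and board voltages


def checks_due(elapsed_ms: int) -> List[str]:
    """Return the checks that fall due at elapsed_ms (timers start at 0)."""
    due = []
    if elapsed_ms % LINK_POLL_MS == 0:
        due.append("polled_io_and_link_errors")
    if elapsed_ms % ENV_POLL_MS == 0:
        due.append("exhaust_temperature_and_voltages")
    return due


# Number of link polls per environmental poll: 10_000 / 500 = 20.
print(ENV_POLL_MS // LINK_POLL_MS)
print(checks_due(500))
print(checks_due(10_000))
```

Under these assumptions, the link and polled I/O checks run 20 times for every temperature and voltage check.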

Published: 2013-03-07