Related Documentation
- Detection and Corrective Actions of FPCs with Degraded Fabric on MX Series Routers
- Detection and Recovery of Fabric-Related Failures Caused by Traffic Black Holes on MX Series Routers
- redundancy-mode
- show chassis fabric redundancy-mode
- Configuring Redundancy Fabric Mode for Active Control Boards on MX Series Routers
Corrective Actions for Fabric Failures on MX Series Routers
This topic contains the following sections that describe different fabric failure scenarios, the detection methods used, and the corrective actions for the faults:
- Traffic Black Hole Healing
- FPCs with Degraded Fabric
- Complete Black Hole Towards a Single Destination Only
- Redundancy Fabric Mode on Active Control Boards
Traffic Black Hole Healing
Packet Forwarding Engine destinations can become unreachable for the following reasons:
- The control boards go offline as a result of a CLI command or a pressed physical button.
- The fabric control boards are turned offline because of high temperature.
- The SPMB detects voltage or polled I/O errors in the SIBs.
- All Packet Forwarding Engines receive destination errors on all planes from remote Packet Forwarding Engines, even when the SIBs are online.
- Complete fabric loss caused by destination timeouts, even when the SIBs are online.
When the system detects any unreachable Packet Forwarding Engine destinations, healing from a traffic black hole is attempted. If the healing fails, the system turns off the interfaces, thereby stopping the traffic black hole.
The recovery process consists of the following phases:
- Fabric plane restart phase: Healing is attempted by restarting the fabric planes one by one. This phase does not start if the fabric plane is functioning properly and a single Flexible PIC Concentrator (FPC) is bad. An error message is generated to specify that a black hole is the reason for the fabric plane being turned offline. This phase is performed for fabric plane errors only.
- Fabric plane and FPC restart phase: The system waits for the first phase to be completed before examining the system state again. If the black hole condition persists after the first phase is performed or if the problem occurs again within 10 minutes, healing is attempted by restarting both the fabric planes and the FPCs. If you configured the action-fpc-restart-disable statement at the [edit chassis fabric degraded] hierarchy level to disable restart of the FPCs when a recovery is attempted, an alarm is triggered to indicate that a traffic black hole has occurred.
In this second phase, three steps are taken:
- All the FPCs that have destination errors on a Packet Forwarding Engine are turned offline.
- The fabric planes are turned offline and brought back online, one by one, starting with the spare plane.
- The FPCs that were turned offline are brought back online.
- FPC offline phase: The system waits for the second phase to be completed before examining the system state again. This phase is performed if restarting the FPCs does not resolve the problem or if the problem recurs within 10 minutes after the FPCs are restarted. Because the previous recovery attempts have failed, the traffic black hole is limited by turning the FPCs offline and turning off the interfaces.
The three phases are controlled by timers. During these phases, if an event (such as taking FPCs or fabric planes offline or bringing them online) times out, the phase skips that event and proceeds to the next event. The timer control has a timeout value of 10 minutes. If the first fabric error occurs in a system with two or more FPCs, the fabric planes are restarted. If another fabric error occurs within the next 10 minutes, the fabric planes and FPCs are restarted. However, if the second fabric error occurs outside the 10-minute timeout period, the first phase is performed again, which restarts only the fabric planes.
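The escalation rule described above can be sketched in a few lines of Python. This is an illustrative model only, not the router's implementation: the names `Phase`, `ESCALATION_WINDOW_S`, and `phases_for_errors` are hypothetical, and the sketch assumes that the 10-minute window is measured from the previous fabric error rather than from the completion of the previous phase.

```python
from enum import Enum

# 10-minute timer value taken from the text above.
ESCALATION_WINDOW_S = 10 * 60


class Phase(Enum):
    PLANE_RESTART = 1          # restart the fabric planes one by one
    PLANE_AND_FPC_RESTART = 2  # restart the fabric planes and the FPCs
    FPC_OFFLINE = 3            # turn the FPCs offline and the interfaces off


def phases_for_errors(error_times_s):
    """Return the recovery phase chosen for each fabric error.

    error_times_s: error timestamps in seconds, in ascending order.
    Assumption: the window is measured between consecutive errors.
    """
    phases = []
    previous_time = None
    previous_phase = None
    for t in error_times_s:
        within_window = (
            previous_time is not None
            and t - previous_time <= ESCALATION_WINDOW_S
        )
        if not within_window:
            # First error, or an error outside the window: start over.
            phase = Phase.PLANE_RESTART
        elif previous_phase is Phase.FPC_OFFLINE:
            # Already at the last phase; there is nothing further to escalate to.
            phase = Phase.FPC_OFFLINE
        else:
            # Error recurred inside the window: escalate by one phase.
            phase = Phase(previous_phase.value + 1)
        phases.append(phase)
        previous_time, previous_phase = t, phase
    return phases
```

For example, errors at 0, 300, and 500 seconds escalate through all three phases, whereas errors at 0 and 900 seconds each trigger only the fabric plane restart.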
In cases where all the destination timeouts are traced to a single bad FPC (for example, one source FPC or one destination FPC), only that FPC is turned offline and then brought back online. The fabric planes are not restarted. If another fabric fault occurs within 10 minutes, the FPC is turned offline.
By default, the system limits black-hole time by detecting severely degraded fabric. No user interaction is necessary.
FPCs with Degraded Fabric
You can configure an FPC with degraded fabric to be moved to the offline state. On an MX960, MX480, or MX240 router, you can configure this action to be taken for link errors or bad fabric planes. This configuration is particularly useful in partial black hole scenarios, where bringing the FPC offline results in faster rerouting. To configure this option on an FPC, use the offline-on-fabric-bandwidth-reduction statement at the [edit chassis fpc slot-number] hierarchy level. For more information, see Detection and Corrective Actions of FPCs with Degraded Fabric on MX Series Routers.
Complete Black Hole Towards a Single Destination Only
In certain deployments, an FPC indicates a complete black hole towards a single destination only, but it functions properly for other destinations. Such cases are identified and the affected FPC is recovered. Consider a sample scenario in which the active planes are 0, 1, 2, and 3 and the spare planes are 4, 5, 6, and 7 in the connection between FPC 0 and FPC 1. If FPC 0 has single link failures for planes 0 and 1 and FPC 1 has single link failures for planes 2 and 3, a complete black hole occurs between the two FPCs. Both FPC 0 and FPC 1 undergo a phased mode of recovery, and fabric healing takes place.
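The sample scenario can be checked with simple set arithmetic. The following Python sketch is illustrative only. It assumes that an active plane carries traffic between two FPCs only when the links at both ends are healthy, and it considers the active planes alone, as the scenario does. The function names `usable_planes` and `is_complete_black_hole` are hypothetical.

```python
def usable_planes(active_planes, failed_on_src, failed_on_dst):
    """Return the active planes whose links are healthy on both FPCs."""
    return set(active_planes) - set(failed_on_src) - set(failed_on_dst)


def is_complete_black_hole(active_planes, failed_on_src, failed_on_dst):
    """True when no active plane can carry traffic between the two FPCs."""
    return not usable_planes(active_planes, failed_on_src, failed_on_dst)


# Scenario from the text: active planes 0-3, FPC 0 has link failures
# on planes 0 and 1, and FPC 1 has link failures on planes 2 and 3.
active = {0, 1, 2, 3}
fpc0_failed = {0, 1}
fpc1_failed = {2, 3}
black_hole = is_complete_black_hole(active, fpc0_failed, fpc1_failed)
```

Neither FPC has lost all of its own links, so each still reaches other destinations, yet the two sets of failures together leave no common plane between FPC 0 and FPC 1.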
Redundancy Fabric Mode on Active Control Boards
You can configure the active control board to be in redundancy mode or in increased fabric bandwidth mode. To configure redundancy mode for the active control board, use the redundancy-mode redundant statement at the [edit chassis fabric] hierarchy level. In redundancy mode, all the FPCs use 4 fabric planes as active planes, regardless of the type of the FPC.
You can enable increased fabric bandwidth of active control boards for optimal and efficient performance and traffic handling. On an MX960, MX480, or MX240 router, use the redundancy-mode increased-bandwidth statement at the [edit chassis fabric] hierarchy level to enable increased fabric bandwidth mode for the active control board, which causes all the available fabric planes to be used. In this mode, the maximum number of available fabric planes is used for MX routers with Trio chips and the MPC3E. On MX960 routers with active control boards, 6 active planes are used, and on MX240 and MX480 routers with active control boards, 8 active planes are used.
Increased fabric bandwidth mode is enabled by default on MX routers with the Switch Control Board (SCB). On MX routers that contain the enhanced SCB with Trio chips and the MPC3E, redundancy mode is enabled by default. For more information, see Configuring Redundancy Fabric Mode for Active Control Boards on MX Series Routers.
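The active plane counts stated in this section can be summarized as a small lookup. The Python sketch below only restates the numbers given in the text. The function name `active_plane_count` is hypothetical, and the mode strings simply reuse the keywords of the redundancy-mode statement.

```python
# Active plane counts in increased fabric bandwidth mode, as stated in the text.
INCREASED_BANDWIDTH_PLANES = {"MX960": 6, "MX480": 8, "MX240": 8}

# In redundancy mode, all FPCs use 4 active planes regardless of FPC type.
REDUNDANT_PLANES = 4


def active_plane_count(model, mode):
    """Return the number of active fabric planes for a router model and mode."""
    if mode == "redundant":
        return REDUNDANT_PLANES
    if mode == "increased-bandwidth":
        if model not in INCREASED_BANDWIDTH_PLANES:
            raise ValueError(f"no plane count given for model: {model}")
        return INCREASED_BANDWIDTH_PLANES[model]
    raise ValueError(f"unknown redundancy mode: {mode}")
```

For example, `active_plane_count("MX960", "increased-bandwidth")` returns 6, and any model in `redundant` mode returns 4.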
Published: 2013-01-24