Detection and Recovery of Fabric-Related Failures Caused by Traffic Black Holes on MX Series Routers

A traffic black hole occurs when a router is unable to transmit data packets to neighboring routers even though the interfaces on that router remain in the active state. As a result, the neighboring routers continue to forward traffic to the affected router, which drops the arriving packets without notifying the other routers.

When a Packet Forwarding Engine in a router is unable to send traffic to other Packet Forwarding Engines over the data plane within the same router, the router cannot transmit any packets to a neighboring router, even though the interfaces are advertised as active on the control plane. Fabric failure is one possible cause of a traffic black hole.

The following fabric failure scenarios can occur:

  • Removal of the control board
  • High-speed link 2 (HSL2) training failures
  • Single link failure on a Flexible PIC Concentrator (FPC)
  • Multiple link failures on the same FPC or the same fabric plane
  • Multiple link failures randomly on an FPC or a fabric plane
  • Intermittent cyclic redundancy check (CRC) errors
  • A total traffic black hole for only one destination, while other destinations remain reachable

When an FPC cannot forward traffic to other FPCs within the device for some reason, the control protocol on the Routing Engine is unable to detect this condition. Traffic is not diverted to the functional, active FPCs; instead, packets continue to be sent to the affected FPC and are dropped at that point. The following might be the causes of an FPC being unable to forward traffic:

  • All the planes in the system are in the Offline or Fault state.
  • All the Packet Forwarding Engines on the Dense Port Concentrator (DPC) might have disabled the fabric streams because of destination errors.

If all the Switch Control Boards (SCBs) lose connectivity to the DPCs, then all the interfaces are brought down. If a Packet Forwarding Engine of a DPC loses complete connectivity to or from the fabric, then that DPC is brought down.
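
You can verify the state of the fabric planes and the fabric connectivity of each FPC from the CLI. The operational mode commands below are a minimal sketch of such a check; they are not described in this topic, and their availability and output format depend on your Junos OS release:

  user@router> show chassis fabric summary
  user@router> show chassis fabric fpcs

The first command lists each fabric plane and its state (for example, Online, Spare, or Offline), and the second lists the fabric plane state as seen by the Packet Forwarding Engines on each FPC.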

System hardware failures can be of the following types:

  • A single occurrence or a rare failure that lasts for a brief period (such as an environmental spike). This type of failure is healed without manual intervention by restarting the fabric plane and, if necessary, the FPCs (see the CLI sketch after this list).
  • Repeated failures that occur frequently.
  • A permanent failure.
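
If manual intervention does become necessary, for example after repeated or permanent failures, a fabric plane or an FPC can be restarted from the CLI. The following is a sketch only; the plane and slot numbers are illustrative, and you should confirm the exact command syntax and availability for your Junos OS release:

  user@router> request chassis fabric plane 0 offline
  user@router> request chassis fabric plane 0 online
  user@router> request chassis fpc slot 2 restart

Taking a plane offline and bringing it back online restarts that plane, and restarting the FPC reinitializes its Packet Forwarding Engines and their fabric links.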

Recovery from a case of reduced throughput, such as multiple Packet Forwarding Engine destination timeouts on multiple planes, is not attempted. Recovery from a traffic black hole is attempted only when all the planes are in the Offline or Fault state or when the destinations are unreachable on all active planes.

If a black hole occurs because of a single bad FPC, which is either the common source or the common destination of the destination timeouts, and you have configured the action-fpc-restart-disable statement at the [edit chassis fabric degraded] hierarchy level, no recovery action is taken. You can use the show chassis fabric reachability command output to verify the status of the fabric and the FPC. An alarm is triggered to indicate that the particular FPC is causing a traffic black hole.
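
As a minimal configuration sketch based on the statement and hierarchy named above (the prompt and hostname are illustrative):

  [edit]
  user@router# set chassis fabric degraded action-fpc-restart-disable
  user@router# commit

After the configuration is committed, you can issue the show chassis fabric reachability command in operational mode to confirm which FPC, if any, is reported as unreachable.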

Fabric-Failure Detection Methods on MX Series Routers

The chassis daemon (chassisd) process detects the removal of a control board. The removal of a control board causes all the active planes that reside on that board to be disabled, and a switchover is performed. If the active Routing Engine is unplugged along with the control board, detection of the control board removal is delayed until the Routing Engine switchover occurs and the primary and backup Routing Engines reconnect. If the control board is taken offline gracefully, either by issuing the request chassis cb slot slot-number offline command or by pressing the physical button for a graceful shutdown, a fabric failure does not occur, even though the control board is moved to the offline state.
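
For example, to take the control board in slot 0 offline gracefully so that no fabric failure is declared (the slot number is illustrative), use the command named above:

  user@router> request chassis cb slot 0 offline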

If active fabric planes are removed because the control board on the master Routing Engine is removed, the DPC takes the local action of disabling the removed planes. If spare planes are available, the DPC initiates a switchover to the spare planes. If an active control board on a backup Routing Engine is removed, the master Routing Engine performs the switchover. The software attempts to minimize the duration of the traffic black hole by disabling all removed planes. The spare planes are transitioned to the online state one by one.

Fabric self-ping is a mechanism to detect issues in the fabric data path. Each Packet Forwarding Engine forwards fabric data cells that are destined to itself over all active fabric planes. To transmit the data cell, the source Packet Forwarding Engine sends a request cell over an active plane and waits for a grant. The destination Packet Forwarding Engine sends a grant cell over the same plane on which the request cell was received. When the grant cell is received, the source Packet Forwarding Engine sends the data cell.

The Packet Forwarding Engine fabric has the capability to detect grant delays. If grants are not received within a certain period of time, a destination timeout is declared. A destination timeout on a particular plane reported by Packet Forwarding Engines on two or more FPCs is considered an indication of a plane failure. Even if only one Packet Forwarding Engine on an FPC reports an error, the entire FPC is considered to be in error. Destination timeouts are noticed only while the Packet Forwarding Engine is actively sending traffic, because requests are sent only for valid data cells. The software takes an appropriate action based on the destination timeout. For self-ping, the data cell is destined to the source Packet Forwarding Engine itself.

Fabric ping failure messages are sent to the fabric manager on the Routing Engine, which collates all of the errors reported by all the DPCs and takes corrective action. For example, a ping failure on all links of the same DPC might indicate a problem on that DPC. A ping failure on multiple DPCs for the same fabric plane might denote a problem with the fabric.

If the Routing Engine determines that a fabric plane is down, based on the error information it receives from the DPCs or the Packet Forwarding Engines over a period of 5 seconds, it indicates a fabric failure. The 5-second duration is the period for which the Routing Engine collates the errors from all of the DPCs.

Fabric self-ping packets are sent periodically to check the sanity of the fabric links. Self-pings are sent at intervals of 500 ms, and the destination timeout is also checked at intervals of 500 ms. If two timeouts occur in succession, a self-ping failure is detected. When a destination timeout occurs, the Packet Forwarding Engine fabric stops sending packets to the fabric. To examine the link condition again, the software resets the credits to ensure that new requests are sent. When a self-ping failure occurs, the DPC removes the affected plane from use for sending data to all destinations. This ensures that self-ping is not attempted again on the defective plane.
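
After a plane has been removed from use in this way, you can check the per-plane link state from the CLI. The following operational mode command is a sketch only; its availability and output format depend on your Junos OS release:

  user@router> show chassis fabric plane

The output typically lists each fabric plane along with the state of its links toward each FPC and Packet Forwarding Engine, which can help confirm which plane the self-ping failure affected.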

The following guidelines apply to the self-ping capability:

  • By default, self-pings are not sent on spare fabric planes because spare planes do not carry traffic.
  • The size of self-ping packets is large enough for the cells to be loaded over all the active fabric planes (a maximum of 8 for MX Series routers).
  • Detection of received self-ping packets is not performed.
  • A high-priority queue is used so that self-pings can be sent even in oversubscription cases.

Published: 2013-03-07