Data Center Quantized Congestion Notification (DCQCN)

Remote Direct Memory Access (RDMA) provides the high throughput, ultra-low latency, and low CPU overhead that modern data center applications require. RDMA is deployed using the RoCEv2 protocol, which relies on Priority-based Flow Control (PFC) to enable a drop-free network. Data Center Quantized Congestion Notification (DCQCN) is an end-to-end congestion control scheme for RoCEv2. Junos supports DCQCN by combining Explicit Congestion Notification (ECN) with PFC, which overcomes the limitations of PFC alone and supports end-to-end lossless Ethernet.

Understanding DCQCN

Priority-based Flow Control (PFC) is a lossless transport and congestion relief feature that works by providing granular link-level flow control for each IEEE 802.1p code point (priority) on a full-duplex Ethernet link. When the receive buffer on a switch interface fills to a threshold, the switch transmits a pause frame to the sender (the connected peer) to temporarily stop the sender from transmitting more frames. The buffer threshold must be low enough so that the sender has time to stop transmitting frames and the receiver can accept the frames already on the wire before the buffer overflows. The switch automatically sets queue buffer thresholds to prevent frame loss.

When congestion forces one priority on a link to pause, all of the other priorities on the link continue to send frames. Only frames of the paused priority are not transmitted. When the receive buffer empties below another threshold, the switch sends a message that starts the flow again. However, depending on the amount of traffic on a link or assigned to a priority, pausing traffic can cause ingress port congestion and spread congestion through the network.
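The per-priority pause and resume behavior described above can be sketched as a small model. This is an illustrative sketch, not switch code: the class name, the `xoff_bytes` and `xon_bytes` parameters, and the byte values are hypothetical, and a real switch derives its thresholds automatically.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PfcIngressPort:
    """Receive buffers for the eight 802.1p priorities on one port."""
    xoff_bytes: int  # hypothetical pause threshold
    xon_bytes: int   # hypothetical resume threshold, lower than xoff_bytes
    fill: dict = field(default_factory=lambda: {p: 0 for p in range(8)})
    paused: set = field(default_factory=set)

    def receive(self, priority: int, nbytes: int) -> Optional[str]:
        """Buffer a frame; return "PAUSE" when this priority reaches XOFF."""
        self.fill[priority] += nbytes
        if self.fill[priority] >= self.xoff_bytes and priority not in self.paused:
            self.paused.add(priority)
            return "PAUSE"
        return None

    def drain(self, priority: int, nbytes: int) -> Optional[str]:
        """Forward buffered bytes; return "RESUME" once fill drops below XON."""
        self.fill[priority] = max(0, self.fill[priority] - nbytes)
        if self.fill[priority] < self.xon_bytes and priority in self.paused:
            self.paused.remove(priority)
            return "RESUME"
        return None
```

Because each priority keeps its own counter and its own paused state, pausing priority 3 in this model has no effect on frames received for priority 0, which mirrors the behavior of PFC on a shared link.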

Explicit congestion notification (ECN) enables end-to-end congestion notification between two endpoints on TCP/IP-based networks. The two endpoints are an ECN-enabled sender and an ECN-enabled receiver. You must enable ECN on both endpoints and on all of the intermediate devices between the endpoints for ECN to work properly. Any device in the transmission path that does not support ECN breaks the end-to-end ECN functionality.

ECN notifies networks about congestion with the goal of reducing packet loss and delay by making the sending device decrease the transmission rate until the congestion clears, without dropping packets. RFC 3168, The Addition of Explicit Congestion Notification (ECN) to IP, defines ECN.
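As a minimal sketch of the marking that RFC 3168 defines, the following Python models the two-bit ECN field in the low-order bits of the IPv4 ToS (or IPv6 Traffic Class) byte. The codepoint values come from RFC 3168; the function names are hypothetical, and the code only illustrates the marking rule, not any Junos implementation.

```python
# ECN codepoints from RFC 3168 (two low-order bits of the ToS byte).
NOT_ECT = 0b00  # sender is not ECN-capable
ECT_1 = 0b01    # ECN-capable transport, ECT(1)
ECT_0 = 0b10    # ECN-capable transport, ECT(0)
CE = 0b11       # congestion experienced


def ecn_bits(tos: int) -> int:
    """Extract the ECN field from a ToS byte."""
    return tos & 0b11


def mark_congestion(tos: int) -> int:
    """Return the ToS byte a congested ECN-enabled device would forward.

    Packets from ECN-capable senders are rewritten to CE and the DSCP bits
    are left untouched. Packets marked Not-ECT cannot carry the signal, so
    they are returned unchanged here (a real queue would apply its drop
    policy to them instead).
    """
    if ecn_bits(tos) == NOT_ECT:
        return tos
    return tos | CE
```

The receiver sees the CE codepoint and signals the sender to reduce its rate, which is why every device in the path must preserve and honor these two bits.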

DCQCN combines ECN and PFC to support end-to-end lossless Ethernet. ECN helps overcome the limitations of PFC. The idea behind DCQCN is to let ECN perform flow control by decreasing the transmission rate when congestion starts, thereby minimizing the time during which PFC is triggered and the flow stops altogether.

The correct operation of DCQCN requires balancing two conflicting requirements:

  1. Ensuring PFC does not trigger too early, that is, before giving ECN a chance to send congestion feedback to slow the flow.

  2. Ensuring PFC does not trigger too late, thereby causing packet loss due to buffer overflow.

To meet these requirements, calculate and properly configure the following three parameters:

  1. Headroom Buffers: A PAUSE message sent to an upstream device takes some time to arrive and take effect. To avoid packet drops, the PAUSE sender must reserve enough buffer to process any packets it might receive during this time. This includes packets that were in flight when the PAUSE was sent as well as the packets sent by the upstream device while it is processing the PAUSE message. You allocate headroom buffers on a per-port, per-priority basis out of the global shared buffer. You can control the amount of headroom buffer allocated for each port and priority using the MRU and cable length parameters in the congestion notification profile. If you see minor ingress drops even after PFC is triggered, you can eliminate those drops by increasing the headroom buffers for that port and priority combination.

  2. PFC Threshold: This is an ingress threshold. It is the maximum size an ingress priority group can grow to before a PAUSE message is sent to the upstream device. Each PFC priority gets its own priority group at each ingress port, and PFC thresholds are set per priority group at each ingress port. The PFC threshold has two components: the PG MIN threshold and the PG shared threshold. Once both the PG MIN and PG shared thresholds are reached for a priority group, the switch generates PFC for the corresponding priority. The switch sends a RESUME message when the queue falls below the PFC thresholds.

  3. ECN Threshold: This is an egress threshold. The ECN threshold is equal to the WRED start-fill-level value. Once an egress queue exceeds this threshold, the switch starts ECN marking for packets on that queue. For DCQCN to be effective, this threshold must be lower than the ingress PFC threshold to ensure PFC does not trigger before the switch has a chance to mark packets with ECN. Setting a very low WRED fill level increases the ECN marking probability. For example, with the default shared buffer settings, a WRED start-fill-level of 10 percent ensures lossless packets are ECN marked. With a higher fill level, the probability of ECN marking is lower. For example, with two ingress ports sending lossless traffic to the same egress port and a WRED start-fill-level of 50 percent, no ECN marking occurs, because the ingress PFC thresholds are met first.
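The three parameters above can be sketched numerically, one function per parameter. This is a simplified model under stated assumptions, not the calculation the switch performs: the propagation delay of about 5 ns per meter, the allowance of two maximum-size frames, the linear interpolation between WRED fill levels, and all function and parameter names are assumptions made for illustration.

```python
import math

PROPAGATION_NS_PER_M = 5.0  # assumed cable delay, about 5 ns per meter


def estimate_headroom_bytes(link_gbps: float, cable_m: float, mru_bytes: int,
                            peer_response_ns: float = 0.0) -> int:
    """Rough headroom for one port and priority, in bytes."""
    bytes_per_ns = link_gbps / 8.0  # 1 Gbit/s is 1 bit per nanosecond
    round_trip_ns = 2 * cable_m * PROPAGATION_NS_PER_M
    in_flight = (round_trip_ns + peer_response_ns) * bytes_per_ns
    # Assumption: one maximum-size frame can delay the PAUSE on the local
    # link, and the peer can finish one maximum-size frame it already started.
    return math.ceil(in_flight + 2 * mru_bytes)


def pfc_should_pause(pg_bytes: int, pg_min_bytes: int, pg_shared_bytes: int) -> bool:
    """PAUSE once the priority group has used its PG MIN and PG shared space."""
    return pg_bytes >= pg_min_bytes + pg_shared_bytes


def ecn_mark_probability(fill_pct: float, start_pct: float, end_pct: float,
                         max_prob_pct: float = 100.0) -> float:
    """Marking probability (0.0 to 1.0) for an egress queue fill level.

    Assumes marking starts above the WRED start-fill-level and rises
    linearly to max_prob_pct at end_pct.
    """
    if end_pct <= start_pct:
        raise ValueError("end_pct must be greater than start_pct")
    if fill_pct <= start_pct:
        return 0.0
    if fill_pct >= end_pct:
        return max_prob_pct / 100
    return (fill_pct - start_pct) / (end_pct - start_pct) * max_prob_pct / 100
```

In this model, a longer cable or a larger MRU increases the headroom estimate, and lowering `start_pct` makes marking begin at a shallower queue depth, which is the lever that lets ECN act before the ingress PFC threshold is reached.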

Configuring DCQCN (Junos OS)

To enable DCQCN, configure both ECN and PFC for a traffic flow.

  1. Configure classifiers for RoCEv2 traffic and for Congestion Notification Packets (CNP).
  2. Configure ECN on the egress port for a lossless flow.
  3. Configure PFC on the ingress port for the same lossless flow.
  4. Configure the shared buffers.
    Note:

    You must follow these rules to commit the configuration on platforms running Junos OS:

    • You must configure all three or none of the ingress partitions.

    • You must configure all three or none of the egress partitions.

    • The sum of the ingress shared buffer configuration for all partitions must be 100 percent.

    • The sum of the egress shared buffer configuration for all partitions must be 100 percent.

  5. Configure forwarding classes and assign queues.
  6. Verify your configuration.
  7. Commit your configuration.
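The commit rules in the note for step 4 can be checked before you commit with a short script. The following is a minimal sketch, assuming the shared-buffer percentages are available as dictionaries keyed by partition name. The partition names (`lossless`, `lossless-headroom`, `lossy`, `multicast`) and the function names are assumptions for illustration; this is not a Junos tool.

```python
# Assumed partition names for each direction.
INGRESS_PARTITIONS = ("lossless", "lossless-headroom", "lossy")
EGRESS_PARTITIONS = ("lossless", "lossy", "multicast")


def check_direction(direction, configured, required):
    """Return a list of rule violations for one direction."""
    if not configured:
        return []  # configuring none of the partitions is allowed
    errors = []
    missing = [name for name in required if name not in configured]
    if missing:
        errors.append(f"{direction}: configure all three partitions or none "
                      f"(missing: {', '.join(missing)})")
    total = sum(configured.values())
    if total != 100:
        errors.append(f"{direction}: percentages sum to {total}, expected 100")
    return errors


def check_junos_shared_buffer(ingress, egress):
    """Apply the Junos OS commit rules to both directions."""
    return (check_direction("ingress", ingress, INGRESS_PARTITIONS)
            + check_direction("egress", egress, EGRESS_PARTITIONS))
```

An empty result means the percentages satisfy the four rules in the note. Each returned string names the direction and the rule that the configuration violates.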

Configuring DCQCN (Junos OS Evolved)

To enable DCQCN, configure both ECN and PFC for a traffic flow.

  1. Configure classifiers for RoCEv2 traffic and for Congestion Notification Packets (CNP).
  2. Configure ECN on the egress port for a lossless flow.
  3. Configure PFC on the ingress port for the same lossless flow.
  4. Configure the shared buffers.
    Note:

    You must follow these rules to commit the configuration on platforms running Junos OS Evolved:

    • You must configure all three of the ingress partitions.

    • The sum of the ingress shared buffer configuration for all partitions must be 100 percent.

    • For the lossy and lossless buffer partitions, the ingress and egress buffer-partition percentages should be equal.

    • QFX5000 switches running Junos OS Evolved do not have a dedicated service pool for multicast traffic due to hardware limitations, so multicast traffic uses lossy service pool shared buffers.

    Setting dynamic-threshold for the lossless ingress buffer partition is optional. ECN uses this option for the threshold calculation on lossless queues. If you don't configure this option, dynamic-threshold uses its default value of 7.

  5. Configure forwarding classes and assign queues.
  6. Verify your configuration.
  7. Commit your configuration.
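The Junos OS Evolved rules in the note for step 4 differ from the Junos OS rules, and can be sketched as a separate check. As before, this is an illustrative sketch rather than a Junos tool: the partition names and the function name are assumptions, and the strict comparison of ingress and egress percentages is one possible reading of the rule that the two should be equal.

```python
DEFAULT_DYNAMIC_THRESHOLD = 7  # default value stated in the note

# Assumed ingress partition names.
INGRESS_PARTITIONS = ("lossless", "lossless-headroom", "lossy")


def check_evolved_shared_buffer(ingress, egress, dynamic_threshold=None):
    """Return (violations, effective dynamic-threshold)."""
    errors = []
    missing = [name for name in INGRESS_PARTITIONS if name not in ingress]
    if missing:
        errors.append(f"ingress: all three partitions are required "
                      f"(missing: {', '.join(missing)})")
    total = sum(ingress.values())
    if total != 100:
        errors.append(f"ingress: percentages sum to {total}, expected 100")
    # Strict reading: lossy and lossless percentages must match exactly
    # between the ingress and egress directions.
    for name in ("lossless", "lossy"):
        if ingress.get(name) != egress.get(name):
            errors.append(f"{name}: ingress {ingress.get(name)} percent does "
                          f"not match egress {egress.get(name)} percent")
    if dynamic_threshold is None:
        dynamic_threshold = DEFAULT_DYNAMIC_THRESHOLD
    return errors, dynamic_threshold
```

The second element of the result shows the dynamic-threshold value that applies to the lossless ingress partition, which falls back to 7 when you leave the option unconfigured.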