Data Center Quantized Congestion Notification (DCQCN)
Remote Direct Memory Access (RDMA) provides high throughput and ultra-low latency, with low CPU overhead, necessary for modern data center applications. RDMA is deployed using the RoCEv2 protocol, which relies on Priority-based Flow Control (PFC) to enable a drop-free network. Data Center Quantized Congestion Notification (DCQCN) is an end-to-end congestion control scheme for RoCEv2. Junos supports DCQCN by combining ECN and PFC to overcome the limitations of PFC to support end-to-end lossless Ethernet.
Understanding DCQCN
Priority-based Flow Control (PFC) is a lossless transport and congestion relief feature that works by providing granular link-level flow control for each IEEE 802.1p code point (priority) on a full-duplex Ethernet link. When the receive buffer on a switch interface fills to a threshold, the switch transmits a pause frame to the sender (the connected peer) to temporarily stop the sender from transmitting more frames. The buffer threshold must be low enough so that the sender has time to stop transmitting frames and the receiver can accept the frames already on the wire before the buffer overflows. The switch automatically sets queue buffer thresholds to prevent frame loss.
When congestion forces one priority on a link to pause, all of the other priorities on the link continue to send frames. Only frames of the paused priority are not transmitted. When the receive buffer empties below another threshold, the switch sends a message that starts the flow again. However, depending on the amount of traffic on a link or assigned to a priority, pausing traffic can cause ingress port congestion and spread congestion through the network.
Explicit congestion notification (ECN) enables end-to-end congestion notification between two endpoints on TCP/IP based networks. The two endpoints are an ECN-enabled sender and an ECN-enabled receiver. You must enable ECN on both endpoints and on all of the intermediate devices between the endpoints for ECN to work properly. Any device in the transmission path that does not support ECN breaks the end-to-end ECN functionality.
ECN notifies networks about congestion with the goal of reducing packet loss and delay by making the sending device decrease the transmission rate until the congestion clears, without dropping packets. RFC 3168, The Addition of Explicit Congestion Notification (ECN) to IP, defines ECN.
DCQCN is a combination of ECN and PFC to support end-to-end lossless Ethernet. ECN helps overcome the limitations of PFC to achieve lossless Ethernet. The idea behind DCQCN is to allow ECN to do flow control by decreasing the transmission rate when congestion starts, thereby minimizing the time PFC is triggered, which stops the flow altogether.
The correct operation of DCQCN requires balancing two conflicting requirements:
Ensuring PFC does not trigger too early, that is, before giving ECN a chance to send congestion feedback to slow the flow.
Ensuring PFC does not trigger too late, thereby causing packet loss due to buffer overflow.
To achieve the above key requirements, calculate and configure properly the following three important paramaters:
Headroom Buffers—A PAUSE message sent to an upstream device takes some time to arrive and take effect. To avoid packet drops, the PAUSE sender must reserve enough buffer to process any packets it might receive during this time. This includes packets that were in flight when the PAUSE was sent as well as the packets sent by the upstream device while it is processing the PAUSE message. You allocate headroom buffers on a per port per priority basis out of the global shared buffer. You can control the amount of headroom buffers that are allocated for each port and priority using the MRU and cable length parameters in the CNP. If you see minor ingress drops even after PFC is triggered, you can eliminate those drops by increasing the headroom buffers for that port and priority combination.
PFC Threshold—This is an ingress threshold. This is the maximum size an ingress priority group can grow to before a PAUSE message is sent to the upstream device. Each PFC priority gets its own priority group at each ingress port. PFC thresholds are set per priority group at each ingress port. There are two components in the PFC threshold—the
PG MIN
threshold and thePG shared
threshold. OncePG MIN
andPG shared
thresholds are reached for a priority group, PFC is generated for that corresponding priority. The switch sends a RESUME message when the queue falls below the PFC thresholds.ECN Threshold—This is an egress threshold. The ECN threshold is equal to the WRED start-fill-level value. Once an egress queue exceeds this threshold, the switch starts ECN marking for packets on that queue. For DCQCN to be effective, this threshold must be lower than the ingress PFC threshold to ensure PFC does not trigger before the switch has a chance to mark packets with ECN. Setting a very low WRED fill level increases ECN marking probability. For example with default shared buffer setting, a WRED start-fill-level of 10 percent ensures lossless packets are ECN marked. But with a higher fill level, the probability of ECN marking is less. For example, with two ingress port with lossless traffic to the same egress port and a WRED start-fill-level of 50 percent, no ECN marking will occur, because ingress PFC thresholds will be met first.
Configuring DCQCN (Junos OS)
To enable DCQCN, configure both ECN and PFC for a traffic flow.
Configuring DCQCN (Junos OS Evolved)
To enable DCQCN, configure both ECN and PFC for a traffic flow.