Understanding CoS Explicit Congestion Notification
Explicit congestion notification (ECN) enables end-to-end congestion notification between two endpoints on TCP/IP based networks. The two endpoints are an ECN-enabled sender and an ECN-enabled receiver. ECN must be enabled on both endpoints and on all of the intermediate devices between the endpoints for ECN to work properly. Any device in the transmission path that does not support ECN breaks the end-to-end ECN functionality.
ECN notifies networks about congestion with the goal of reducing packet loss and delay by making the sending device decrease the transmission rate until the congestion clears, without dropping packets. RFC 3168, The Addition of Explicit Congestion Notification (ECN) to IP, defines ECN.
ECN is disabled by default. Normally, you enable ECN only on queues that handle best-effort traffic because other traffic types use different methods of congestion notification—lossless traffic uses priority-based flow control (PFC) and strict-high priority traffic receives all of the port bandwidth it requires up to the point of a configured maximum rate.
You enable ECN on individual output queues (as represented by forwarding classes) by enabling ECN in the queue scheduler configuration, mapping the scheduler to forwarding classes (queues), and then applying the scheduler to interfaces.
For ECN to work on a queue, you must also apply a weighted random early detection (WRED) packet drop profile to the queue.
How ECN Works
Without ECN, switches respond to network congestion by dropping TCP/IP packets. Dropped packets signal the network that congestion is occurring. Devices on the IP network respond to TCP packet drops by reducing the packet transmission rate to allow the congestion to clear. However, the packet drop method of congestion notification and management has some disadvantages. For example, packets are dropped and must be retransmitted. Also, bursty traffic can cause the network to reduce the transmission rate too much, resulting in inefficient bandwidth utilization.
Instead of dropping packets to signal network congestion, ECN marks packets to signal network congestion, without dropping the packets. For ECN to work, all of the switches in the path between two ECN-enabled endpoints must have ECN enabled. ECN is negotiated during the establishment of the TCP connection between the endpoints.
ECN-enabled switches determine the queue congestion state based on the WRED packet drop profile configuration applied to the queue, so each ECN-enabled queue must also have a WRED drop profile. If a queue fills to the level at which the WRED drop profile has a packet drop probability greater than zero (0), the switch might mark a packet as experiencing congestion. The probability that the switch marks a packet as experiencing congestion equals the drop probability of the queue at that fill level.
ECN communicates whether or not congestion is experienced by marking the two least-significant bits in the differentiated services (DiffServ) field in the IP header. The most significant six bits in the DiffServ field contain the Differentiated Services Code Point (DSCP) bits. The state of the two ECN bits signals whether or not the packet is an ECN-capable packet and whether or not congestion has been experienced.
ECN-capable senders mark packets as ECN-capable. If a sender is not ECN-capable, it marks packets as not ECN-capable. If an ECN-capable packet experiences congestion at the egress queue of a switch, the switch marks the packet as experiencing congestion. When the packet reaches the ECN-capable receiver (destination endpoint), the receiver echoes the congestion indicator to the sender (source endpoint) by sending a packet marked to indicate congestion.
After receiving the congestion indicator from the receiver, the source endpoint reduces the transmission rate to relieve the congestion. This is similar to the result of TCP congestion notification and management, but instead of dropping the packet to signal network congestion, ECN marks the packet and the receiver echoes the congestion notification to the sender. Because the packet is not dropped, the packet does not need to be retransmitted.
ECN Bits in the DiffServ Field
The two ECN bits in the DiffServ field provide four codes that determine if a packet is marked as an ECN-capable transport (ECT) packet, meaning that both endpoints of the transport protocol are ECN-capable, and if there is congestion experienced (CE), as shown in Table 1:
ECN Bits (Code) | Meaning |
---|---|
00 | Non-ECT: Packet is marked as not ECN-capable |
01 | ECT(1): Endpoints of the transport protocol are ECN-capable |
10 | ECT(0): Endpoints of the transport protocol are ECN-capable |
11 | CE: Congestion experienced |
Codes 01 and 10 have the same meaning: the sending and receiving endpoints of the transport protocol are ECN-capable. There is no difference between these codes.
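As a rough illustration of the bit layout described above, the following Python sketch splits an 8-bit DiffServ field into its DSCP value and ECN codepoint. The function and constant names are hypothetical and chosen only for this example.

```python
from typing import Tuple

# Illustrative names only; they do not come from any library or from Junos.
ECN_CODEPOINTS = {
    0b00: "Non-ECT",
    0b01: "ECT(1)",
    0b10: "ECT(0)",
    0b11: "CE",
}


def split_ds_field(ds_byte: int) -> Tuple[int, str]:
    """Split the 8-bit DiffServ field into (DSCP value, ECN codepoint name)."""
    if not 0 <= ds_byte <= 0xFF:
        raise ValueError("the DiffServ field is a single byte")
    dscp = ds_byte >> 2      # six most significant bits
    ecn = ds_byte & 0b11     # two least significant bits
    return dscp, ECN_CODEPOINTS[ecn]


def is_ect(ds_byte: int) -> bool:
    """True if the packet is marked ECT(0) or ECT(1)."""
    return (ds_byte & 0b11) in (0b01, 0b10)
```

For example, a DiffServ byte of 0xBA carries DSCP 46 in the upper six bits and ECT(0) in the lower two.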
End-to-End ECN Behavior
After the sending and receiving endpoints negotiate ECN, the sending endpoint marks packets as ECN-capable by setting the DiffServ ECN field to ECT(1) (01) or ECT(0) (10). Every intermediate switch between the endpoints must have ECN enabled, or ECN does not work.
When a packet traverses a switch and experiences congestion at an output queue that uses the WRED packet drop mechanism, the switch marks the packet as experiencing congestion by setting the DiffServ ECN field to CE (11). Instead of dropping the packet (as with TCP congestion notification), the switch forwards the packet.
At the egress queue, the WRED algorithm determines whether or not a packet is drop eligible based on the queue fill level (how full the queue is). If a packet is drop eligible and marked as ECN-capable, the packet can be marked CE and forwarded. If a packet is drop eligible and is not marked as ECN-capable, it might be dropped. See WRED Drop Profile Control of ECN Thresholds for more information about the WRED algorithm.
When the packet reaches the receiver endpoint, the CE mark tells the receiver that there is network congestion. The receiver then sends (echoes) a message to the sender that indicates there is congestion on the network. The sender acknowledges the congestion notification message and reduces its transmission rate. Figure 1 summarizes how ECN works to mitigate network congestion:
End-to-end ECN behavior includes:
- The ECN-capable sender and receiver negotiate ECN capability during the establishment of their connection.
- After successful negotiation of ECN capability, the ECN-capable sender sends the receiver IP packets with the ECT field set.
  Note: You must enable ECN on all of the intermediate devices in the path between the sender and the receiver.
- If the WRED algorithm on a device egress queue determines that the queue is experiencing congestion and the packet is drop eligible, the device can mark the packet as “congestion experienced” (CE) to indicate to the receiver that there is congestion on the network. If the packet has already been marked CE (congestion has already been experienced at the egress of another switch), the switch forwards the packet with CE marked.
  If there is no congestion at the egress queue, the device forwards the packet and does not change the ECT-enabled marking of the ECN bits, so the packet is still marked as ECN-capable but not as experiencing congestion.
- The receiver receives a packet marked CE, which indicates that congestion was experienced along the path.
- The receiver echoes (sends) a packet back to the sender with the ECE bit (bit 9) marked in the flag field of the TCP header. The ECE bit is the ECN echo flag bit, which notifies the sender that there is congestion on the network.
- The sender reduces the data transmission rate and sends a packet to the receiver with the CWR bit (bit 8) marked in the flag field of the TCP header. The CWR bit is the congestion window reduced flag bit, which acknowledges to the receiver that the congestion experienced notification was received.
- When the receiver receives the CWR flag, the receiver stops setting the ECE bit in replies to the sender.
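The echo and acknowledgment exchange in the list above can be sketched as two small state machines. This is a simplified model, not a TCP implementation. The class names are hypothetical, and halving the congestion window is an assumption made for illustration (the text only says that the sender reduces its transmission rate). The flag masks are the positions of ECE and CWR within the TCP flags byte.

```python
TCP_FLAG_ECE = 0x40  # ECN echo flag in the TCP flags byte
TCP_FLAG_CWR = 0x80  # congestion window reduced flag in the TCP flags byte
ECN_CE = 0b11        # IP-level "congestion experienced" codepoint


class EcnReceiver:
    """Echoes congestion to the sender until the sender acknowledges with CWR."""

    def __init__(self):
        self.echoing = False

    def on_data(self, ip_ecn_bits, tcp_flags):
        """Process one data packet; return the TCP flags to set on the reply."""
        if tcp_flags & TCP_FLAG_CWR:
            self.echoing = False      # sender acknowledged the notification
        if ip_ecn_bits == ECN_CE:
            self.echoing = True       # a switch marked this packet CE
        return TCP_FLAG_ECE if self.echoing else 0


class EcnSender:
    """Reduces its window on ECE and signals CWR on the next data packet."""

    def __init__(self, cwnd):
        self.cwnd = cwnd
        self.pending_cwr = False

    def on_ack(self, tcp_flags):
        if tcp_flags & TCP_FLAG_ECE and not self.pending_cwr:
            self.cwnd = max(1, self.cwnd // 2)  # assumed reduction policy
            self.pending_cwr = True

    def next_data_flags(self):
        """Return the TCP flags for the next outgoing data packet."""
        if self.pending_cwr:
            self.pending_cwr = False
            return TCP_FLAG_CWR
        return 0
```

In this model the receiver keeps setting ECE on every reply until a packet carrying CWR arrives, which matches the last step in the list.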
Table 2 summarizes the behavior of traffic on ECN-enabled queues.
Incoming IP Packet Marking of ECN Bits | ECN Configuration on the Output Queue | Action if WRED Algorithm Determines Packet Is Drop Eligible | Outgoing Packet Marking of ECN Bits |
---|---|---|---|
Non-ECT (00) | Does not matter | Drop | No ECN bits marked |
ECT (10 or 01) | ECN disabled | Drop | Packet dropped; no ECN bits marked |
ECT (10 or 01) | ECN enabled | Do not drop. Mark packet as experiencing congestion (CE, bits 11). | Packet marked CE (11) to indicate congestion |
CE (11) | ECN disabled | Drop | Packet dropped; no ECN bits marked |
CE (11) | ECN enabled | Do not drop. Packet is already marked as experiencing congestion; forward packet without changing the ECN marking. | Packet marked CE (11) to indicate congestion |
When an output queue is not experiencing congestion as defined by the WRED drop profile mapped to the queue, all packets are forwarded, and no packets are dropped.
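The rows of Table 2 can be read as a small decision function. The sketch below is only a restatement of the table in Python. The function name and return format are hypothetical, and the `drop_eligible` argument stands in for the outcome of the WRED algorithm.

```python
def egress_action(ecn_bits, queue_ecn_enabled, drop_eligible):
    """Return (action, outgoing_ecn_bits) for a packet at an output queue.

    ecn_bits is the incoming two-bit ECN code; outgoing_ecn_bits is None
    when the packet is dropped.
    """
    if not drop_eligible:
        # No congestion per the WRED profile: forward with marking unchanged.
        return "forward", ecn_bits
    if ecn_bits == 0b00 or not queue_ecn_enabled:
        # Non-ECT packets, or any packet on a queue without ECN, are dropped.
        return "drop", None
    # ECT(0), ECT(1), or CE on an ECN-enabled queue: forward marked CE.
    return "forward", 0b11
```

A packet that arrives already marked CE on an ECN-enabled queue leaves with the same CE marking, so the function does not need a separate branch for it.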
ECN Compared to PFC and Ethernet PAUSE
ECN is an end-to-end network congestion notification mechanism for IP traffic. Priority-based flow control (PFC) (IEEE 802.1Qbb) and Ethernet PAUSE (IEEE 802.3X) are different types of congestion management mechanisms.
ECN requires that an output queue must also have an associated WRED packet drop profile. Output queues used for traffic on which PFC is enabled should not have an associated WRED drop profile. Interfaces on which Ethernet PAUSE is enabled should not have an associated WRED drop profile.
PFC is a peer-to-peer flow control mechanism to support lossless traffic. PFC enables connected peer devices to pause flow transmission during periods of congestion. PFC enables you to pause traffic on a specified type of flow on a link instead of on all traffic on a link. For example, you can (and should) enable PFC on lossless traffic classes such as the fcoe forwarding class. Ethernet PAUSE is also a peer-to-peer flow control mechanism, but instead of pausing only specified traffic flows, Ethernet PAUSE pauses all traffic on a physical link.
With PFC and Ethernet PAUSE, the sending and receiving endpoints of a flow do not communicate congestion information to each other across the intermediate switches. Instead, PFC controls flows between two PFC-enabled peer devices (for example, switches) that support data center bridging (DCB) standards. PFC works by sending a pause message to the connected peer when the flow output queue becomes congested. Ethernet PAUSE simply pauses all traffic on a link during periods of congestion and does not require DCB.
PFC works this way: if a switch output queue fills to a certain threshold, the switch sends a PFC pause message to the connected peer device that is transmitting data. The pause message tells the transmitting switch to pause transmission of the flow. When the congestion clears, the switch sends another PFC message to tell the connected peer to resume transmission. (If the output queue of the transmitting switch also reaches a certain threshold, that switch can in turn send a PFC pause message to the connected peer that is transmitting to it. In this way, PFC can propagate a transmission pause back through the network.)
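The pause and resume behavior described above resembles a two-threshold (hysteresis) check on queue fill level. The following sketch is a conceptual model only. The class name, the use of percentage fill levels, and the separate resume threshold are assumptions for illustration and do not describe how any particular switch implements PFC.

```python
class PfcQueueMonitor:
    """Decides when a congested queue would ask its peer to pause or resume."""

    def __init__(self, pause_threshold, resume_threshold):
        if resume_threshold >= pause_threshold:
            raise ValueError("resume threshold must be below pause threshold")
        self.pause_threshold = pause_threshold    # hypothetical, in percent
        self.resume_threshold = resume_threshold  # hypothetical, in percent
        self.peer_paused = False

    def update(self, fill_level):
        """Return 'pause', 'resume', or None for the current fill level."""
        if not self.peer_paused and fill_level >= self.pause_threshold:
            self.peer_paused = True
            return "pause"
        if self.peer_paused and fill_level <= self.resume_threshold:
            self.peer_paused = False
            return "resume"
        return None
```

Unlike ECN, nothing in this exchange reaches the endpoints of the flow: the message goes only to the directly connected peer.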
See Understanding CoS Flow Control (Ethernet PAUSE and PFC) for more information. You can also refer to Understanding PFC Functionality Across Layer 3 Interfaces.
WRED Drop Profile Control of ECN Thresholds
You apply WRED drop profiles to forwarding classes (which are mapped to output queues) to control how the switch marks ECN-capable packets. A scheduler map associates a drop profile with a scheduler and a forwarding class, and then you apply the scheduler map to interfaces to implement the scheduling properties for the forwarding class on those interfaces.
Drop profiles define queue fill level (the percentage of queue fullness) and drop probability (the percentage probability that a packet is dropped) pairs. When a queue fills to a specified level, traffic that matches the drop profile has the drop probability paired with that fill level. When you configure a drop profile, you configure pairs of fill levels and drop probabilities to control how packets drop at different levels of queue fullness.
The first fill level and drop probability pair is the drop start point. Until the queue reaches the first fill level, packets are not dropped. When the queue reaches the first fill level, packets that exceed the fill level have a probability of being dropped that equals the drop probability paired with the fill level.
The last fill level and drop probability pair is the drop end point. When the queue reaches the last fill level, all packets are dropped unless they are configured for ECN.
Lossless queues (forwarding class configured with the no-loss packet drop attribute) and strict-high priority queues do not use drop profiles. Lossless queues use PFC to control the flow of traffic. Strict-high priority queues receive all of the port bandwidth they require up to the configured maximum bandwidth limit.
Different switches support different numbers of fill level/drop probability pairs in drop profiles.
Do not configure the last fill level as 100 percent.
The drop profile configuration affects ECN packets as follows:
- Drop start point: ECN-capable packets might be marked as congestion experienced (CE).
- Drop end point: ECN-capable packets are always marked CE.
As a queue fills from the drop start point to the drop end point, the probability that an ECN packet is marked CE is the same as the probability that a non-ECN packet is dropped if you apply the drop profile to best-effort traffic. As the queue fills, the probability of an ECN packet being marked CE increases, just as the probability of a non-ECN packet being dropped increases when you apply the drop profile to best-effort traffic.
At the drop end point, all ECN packets are marked CE, but the ECN packets are not dropped. When the queue fill level exceeds the drop end point, all ECN packets are marked CE. At this point, all non-ECN packets are dropped. ECN packets (and all other packets) are tail-dropped if the queue fills completely.
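The relationship between fill level, drop probability, and CE marking can be sketched as follows. This is a minimal model under stated assumptions: the probability is interpolated linearly between configured pairs, the queue is ECN-enabled, and the profile is a list of (fill level, probability) pairs in percent, sorted by strictly increasing fill level. The function names and the example profile are hypothetical.

```python
import random


def wred_probability(fill_level, profile):
    """Return the drop (or CE mark) probability in percent at a fill level."""
    start_fill = profile[0][0]
    end_fill = profile[-1][0]
    if fill_level < start_fill:
        return 0.0       # below the drop start point: nothing dropped or marked
    if fill_level >= end_fill:
        return 100.0     # at or past the drop end point: always drop or mark
    for (f0, p0), (f1, p1) in zip(profile, profile[1:]):
        if f0 <= fill_level < f1:
            # Assumed linear interpolation between neighboring pairs.
            return p0 + (p1 - p0) * (fill_level - f0) / (f1 - f0)


def handle_packet(fill_level, profile, ect, rng=random.random):
    """Return 'forward', 'mark CE', 'drop', or 'tail drop' for one packet."""
    if fill_level >= 100:
        return "tail drop"   # queue completely full: ECN packets drop too
    if rng() * 100.0 >= wred_probability(fill_level, profile):
        return "forward"
    # Same probability either way; only the outcome differs.
    return "mark CE" if ect else "drop"
```

With a profile of `[(30, 0), (70, 80)]`, a queue that is 50 percent full gives a probability of 40 percent, which an ECN-capable packet experiences as a chance of being marked CE and a non-ECN packet experiences as a chance of being dropped.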
To configure a WRED packet drop profile and apply it to an output queue (using hierarchical scheduling on switches that support ETS):
1. Configure a drop profile using the statement
   set class-of-service drop-profiles profile-name interpolate fill-level drop-start-point fill-level drop-end-point drop-probability 0 drop-probability percentage
2. Map the drop profile to a queue scheduler using the statement
   set class-of-service schedulers scheduler-name drop-profile-map loss-priority (low | medium-high | high) protocol any drop-profile profile-name
   The name of the drop-profile is the name of the WRED profile configured in Step 1.
3. Map the scheduler, which Step 2 associates with the drop profile, to the output queue using the statement
   set class-of-service scheduler-maps map-name forwarding-class forwarding-class-name scheduler scheduler-name
   The forwarding class identifies the output queue. Forwarding classes are mapped to output queues by default, and can be remapped to different queues by explicit user configuration. The scheduler name is the scheduler configured in Step 2.
4. Associate the scheduler map with a traffic control profile using the statement
   set class-of-service traffic-control-profiles tcp-name scheduler-map map-name
   The scheduler map name is the name configured in Step 3.
5. Associate the traffic control profile with an interface using the statement
   set class-of-service interfaces interface-name forwarding-class-set forwarding-class-set-name output-traffic-control-profile tcp-name
   The output traffic control profile name is the name of the traffic control profile configured in Step 4.
The interface uses the scheduler map in the traffic control profile to apply the drop profile (and other attributes, including the enable ECN attribute) to the output queue (forwarding class) on that interface. Because you can use different traffic control profiles to map different schedulers to different interfaces, the same queue number on different interfaces can handle traffic in different ways.
To configure a WRED packet drop profile and apply it to an output queue on switches that support port scheduling (ETS hierarchical scheduling is either not supported or not used):
1. Configure a drop profile using the statement
   set class-of-service drop-profiles profile-name interpolate fill-level level1 level2 ... level32 drop-probability probability1 probability2 ... probability32
   You can specify as few as two fill level/drop probability pairs or as many as 32 pairs.
2. Map the drop profile to a queue scheduler using the statement
   set class-of-service schedulers scheduler-name drop-profile-map loss-priority (low | medium-high | high) drop-profile profile-name
   The name of the drop-profile is the name of the WRED profile configured in Step 1.
3. Map the scheduler, which Step 2 associates with the drop profile, to the output queue using the statement
   set class-of-service scheduler-maps map-name forwarding-class forwarding-class-name scheduler scheduler-name
   The forwarding class identifies the output queue. Forwarding classes are mapped to output queues by default, and can be remapped to different queues by explicit user configuration. The scheduler name is the scheduler configured in Step 2.
4. Associate the scheduler map with an interface using the statement
   set class-of-service interfaces interface-name scheduler-map scheduler-map-name
The interface uses the scheduler map to apply the drop profile (and other attributes) to the output queue mapped to the forwarding class on that interface. Because you can use different scheduler maps on different interfaces, the same queue number on different interfaces can handle traffic in different ways.
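To see how the names in the port scheduling steps chain together, the following Python sketch fills in the statement templates given above with concrete values. It only substitutes strings into those templates. The function name and every example value (profile, scheduler, map, forwarding class, and interface names) are hypothetical, and the resulting statements should be checked against the CLI on the target platform before use.

```python
def port_scheduling_statements(profile_name, pairs, scheduler, loss_priority,
                               map_name, forwarding_class, interface):
    """Render the four port scheduling statements as a list of strings.

    pairs is a list of (fill level, drop probability) tuples in percent.
    """
    if not 2 <= len(pairs) <= 32:
        raise ValueError("a drop profile takes between 2 and 32 pairs")
    fills = " ".join(str(fill) for fill, _ in pairs)
    probs = " ".join(str(prob) for _, prob in pairs)
    return [
        # Step 1: the drop profile
        f"set class-of-service drop-profiles {profile_name} interpolate "
        f"fill-level {fills} drop-probability {probs}",
        # Step 2: drop profile -> scheduler
        f"set class-of-service schedulers {scheduler} drop-profile-map "
        f"loss-priority {loss_priority} drop-profile {profile_name}",
        # Step 3: scheduler -> forwarding class (output queue)
        f"set class-of-service scheduler-maps {map_name} "
        f"forwarding-class {forwarding_class} scheduler {scheduler}",
        # Step 4: scheduler map -> interface
        f"set class-of-service interfaces {interface} scheduler-map {map_name}",
    ]


if __name__ == "__main__":
    for line in port_scheduling_statements(
            "be-ecn-dp", [(30, 0), (70, 80)], "be-sched", "low",
            "be-map", "best-effort", "xe-0/0/1"):
        print(line)
```

Each statement reuses a name introduced by the previous one, which is the dependency that the step descriptions spell out.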
Support, Limitations, and Notes
If the WRED algorithm that is mapped to a queue does not find a packet drop eligible, then the ECN configuration and ECN bits marking does not matter. The packet transport behavior is the same as when ECN is not enabled.
ECN is disabled by default. Normally, you enable ECN only on queues that handle best-effort traffic, and you do not enable ECN on queues that handle lossless traffic or strict-high priority traffic.
ECN supports the following:
- IPv4 and IPv6 packets
- Untagged, single-tagged, and double-tagged packets
- The outer IP header of IP tunneled packets (but not the inner IP header)
ECN does not support the following:
- IP packets with MPLS encapsulation
- The inner IP header of IP tunneled packets (however, ECN works on the outer IP header)
- Multicast, broadcast, and destination lookup fail (DLF) traffic
- Non-IP traffic
To apply a WRED drop profile to non-ECT traffic, configure a multifield (MF) classifier to assign non-ECT traffic to a different output queue that is not ECN-enabled, and then apply the WRED drop profile to that queue.
Platform-Specific Behavior
Use Feature Explorer to confirm platform and release support for ECN.
Use the following table to review platform-specific behaviors for this feature.
Platform | Difference |
---|---|
QFX5000 Series | On QFX5K platforms, ECN functionality is tightly integrated with WRED thresholds. WRED thresholds are static, so ECN also works based on static calculations of buffer thresholds. However, actual shared buffer usage of queues is dynamic. The ECN marking thresholds are calculated at the time of ECN configuration. (The threshold formulas and their parameter definitions are missing from this copy.) With certain shared buffer and WRED fill level configurations, packet tail drops due to shared buffer exhaustion can occur even before ECN marking on ECN-enabled lossy queues. For lossless queues, due to the above limitation, PFC can start from an ingress port before ECN marking, as the PFC XOFF threshold is dynamic, unlike the static ECN threshold. You can determine proper ECN marking thresholds by monitoring the peak buffer usage of congested queues and fine-tuning the ECN/WRED thresholds accordingly. |
QFX10000 Series | |