Solution Architecture
The three fabrics described in the previous section (Frontend, GPU Backend, and Storage Backend) are interconnected in the overall AI JVD solution architecture as shown in Figure 2.
Figure 2: AI JVD Solution Architecture
The number and switch type of the leaf and spine nodes, as well as the number and speed of the links between them, are determined by the type of fabric (Frontend, GPU Backend, or Storage Backend), as each presents different requirements. More details are included in the respective fabric description sections. In the case of the GPU Backend fabric, the number of GPU servers and the number of GPUs per server are also factors determining the number and switch type of the leaf and spine nodes.
Frontend Fabric
The Frontend Fabric provides the infrastructure for users to interact with the AI systems to orchestrate training and inference workflows using tools such as SLURM. These interactions do not generate heavy data flows, nor do they have stringent latency or packet-loss requirements; thus, they do not impose rigorous demands on the fabric.
The Frontend Fabric design described in this JVD follows a traditional 3-stage IP Fabric architecture without HA, as shown in Figure 3. This architecture provides a simple and effective solution for the connectivity required in the Frontend. However, any fabric architecture, including EVPN/VXLAN, could be used. If an HA-capable Frontend Fabric is required, we recommend following the 3-Stage with Juniper Apstra JVD.
Figure 3: Frontend Fabric Architecture
The devices included in the Frontend fabric, and the connections between them, are summarized in the following tables:
Table 1: Frontend devices
Nvidia DGX GPU Servers | Weka Storage Servers | Headend Servers | Frontend Leaf Nodes switch model (frontend-gpu-leaf & frontend-weka-leaf) | Frontend Spine Nodes switch model (frontend-spine#) |
---|---|---|---|---|
A100 x 8, H100 x 4 | Weka Storage Server x 8 | Headend-SVR x 3 | QFX5130-32CD x 2 | QFX5130-32CD x 2 |
Table 2: Connections between servers, leaf and spine nodes per cluster and stripe in the Frontend
GPU Servers <=> Frontend Leaf Nodes | Weka Storage Servers <=> Frontend Leaf Nodes | Headend Servers <=> Frontend Leaf Nodes | Frontend Spine Nodes <=> Frontend Leaf Nodes |
---|---|---|---|
1 x 100GE link between each GPU server (A100-01 to A100-08 and H100-01 to H100-04) and the frontend-gpu-leaf switch. | 1 x 100GE link between each storage server (weka-1 to weka-8) and the frontend-weka-leaf switch. | 1 x 10GE link between each headend server (Headend-SVR-01 to Headend-SVR-03) and the frontend-weka-leaf switch. | 2 x 400GE links between each leaf node and each spine node.
This fabric is a pure L3 IP fabric using EBGP for route advertisement. The IP addressing and EBGP configuration details are described in the networking section of this document.
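As a purely illustrative sketch of what such an underlay plan can look like, the snippet below derives a private ASN per device and a /31 per leaf-spine adjacency for the Frontend devices listed above. The ASN and prefix values are hypothetical placeholders, not the validated addressing plan described in the networking section.

```python
# Hypothetical EBGP underlay plan for the Frontend fabric: one private ASN per
# device and one /31 per leaf-spine adjacency. Values are placeholders only.
import ipaddress
from itertools import count

spines = ["frontend-spine1", "frontend-spine2"]
leaves = ["frontend-gpu-leaf", "frontend-weka-leaf"]

asn = count(65001)
device_asn = {name: next(asn) for name in spines + leaves}

# Carve /31 point-to-point subnets from a placeholder block.
p2p_blocks = ipaddress.ip_network("192.168.100.0/24").subnets(new_prefix=31)

for leaf in leaves:
    for spine in spines:
        net = next(p2p_blocks)
        spine_ip, leaf_ip = net.hosts()  # a /31 yields exactly two usable addresses
        print(f"{leaf} (AS{device_asn[leaf]}) {leaf_ip} <-> {spine_ip} (AS{device_asn[spine]}) {spine}")
```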
GPU Backend Fabric
The GPU Backend fabric provides the infrastructure for GPUs to communicate with each other within a cluster, using RDMA over Converged Ethernet version 2 (RoCEv2). RoCEv2 boosts data center efficiency, reduces overall complexity, and increases data delivery performance by enabling the GPUs to communicate as they would with the InfiniBand protocol.
Packet loss can significantly impact job completion times and should be avoided. Therefore, when designing the compute network infrastructure to support RoCEv2 for an AI cluster, one of the key objectives is to provide a lossless fabric, while also achieving maximum throughput, minimal latency, and minimal network interference for the AI traffic flows. RoCEv2 is more efficient over lossless networks, resulting in optimum job completion times.
The GPU Backend fabric in this JVD was designed with these goals in mind and follows a 3-stage IP Clos architecture combined with NVIDIA's Backend GPU Rail Optimized Stripe Architecture (discussed in the next section), as shown in Figure 4.
Figure 4: GPU Backend Fabric Architecture
We built two different clusters in the AI lab, as shown in Figure 5, which share the same Frontend fabric and Storage Backend fabric but have separate GPU Backend fabrics. Each cluster comprises two stripes following the Rail Optimized Stripe Architecture described later in this section, but the clusters use different switch models as leaf and spine nodes, as well as different Nvidia server models.
These two clusters are not yet connected to each other and were tested separately. We plan to connect them using Juniper PTX devices as spine nodes in future JVD releases. Details for the two clusters are included in this section.
Figure 5: AI JVD Lab Clusters
The GPU Backend in Cluster 1 consists of Juniper QFX5220 and QFX5230 switches as leaf nodes, and either QFX5230 switches or PTX10008 routers as spine nodes. We tested the QFX5230s and the PTX10008 as spine nodes separately, while keeping the same leaf nodes.
To facilitate switching between the setup using QFX5230s as spine nodes and the setup using the PTX10008 as spine, both configurations of the GPU Backend blueprint were saved in Apstra, and either one can be deployed at any time.
The GPU Backend in Cluster 2 consists of Juniper QFX5240 switches acting as both leaf nodes and spine nodes.
The GPU Backend devices included in this fabric, and the connections between them, are summarized in the following tables:
Table 3: GPU Backend devices per cluster and stripe
Cluster | Stripe | Nvidia DGX GPU Servers | GPU Backend Leaf Nodes switch model (gpu-backend-leaf#) | GPU Backend Spine Nodes switch model (gpu-backend-spine#) |
---|---|---|---|---|
1 | 1 | A100-01 to A100-04 | QFX5230-64CD x 8 | QFX5230-64CD x 2 OR PTX10008 w/ JNP10K-LC1201 |
1 | 2 | A100-05 to A100-08 | QFX5220-32CD x 8 | |
2 | 1 | H100-01 to H100-02 | QFX5240-64OD x 8 | QFX5240-64OD x 4
2 | 2 | H100-03 to H100-04 | QFX5240-64OD x 8 | |
Table 4: Connections between servers, leaf and spine nodes per cluster and stripe in the GPU Backend
Cluster | Stripe | GPU Servers <=> GPU Backend Leaf Nodes | GPU Backend Spine Nodes <=> GPU Backend Leaf Nodes |
---|---|---|---|
1 | 1 | 1 x 200GE link between each A100 server and each leaf node (200GE x 8 links per server) | 2 x 400GE links between each leaf node and each spine node (2 x 400GE x 2 links per leaf node)
1 | 2 | 1 x 200GE link between each A100 server and each leaf node (200GE x 8 links per server) | 2 x 400GE links between each leaf node and each spine node (2 x 400GE x 2 links per leaf node)
2 | 1 | 1 x 400GE link between each H100 server and each leaf node (400GE x 8 links per server) | 2 x 400GE links between each leaf node and each spine node (2 x 400GE x 4 links per leaf node)
2 | 2 | 1 x 400GE link between each H100 server and each leaf node (400GE x 8 links per server) | 2 x 400GE links between each leaf node and each spine node (2 x 400GE x 4 links per leaf node)
- The Nvidia A100 servers in the lab are connected to the fabric using 200GE interfaces, while the H100 servers use 400GE interfaces.
- This fabric is a pure L3 IP fabric that uses EBGP for route advertisement (described in the networking section).
- Connectivity between the servers and the leaf nodes is L2 VLAN-based, with an IRB on the leaf nodes acting as the default gateway for the servers (described in the networking section).
The speed and number of links between the GPU servers and leaf nodes, and between the leaf and spine nodes, determine the oversubscription factor. As an example, consider the number of GPU servers available in the lab and how they are connected to the GPU backend fabric, as described above.
Table 5: Server to Leaf Bandwidth per stripe (per Cluster)
Cluster | AI Systems (server type) | Servers per Stripe | Server <=> Leaf Links per Server | Bandwidth of Server <=> Leaf Links [Gbps] | Total Bandwidth Servers <=> Leaf per stripe [Tbps]
---|---|---|---|---|---|
1 | A100 | 4 | 8 | 200 | 4 x 8 x 200/1000 = 6.4 |
2 | H100 | 2 | 8 | 400 | 2 x 8 x 400/1000 = 6.4 |
Table 6: Leaf to Spine Bandwidth per stripe
Leaf <=> Spine Links Per Spine Node & Per Stripe | Speed Of Leaf <=> Spine Links [Gbps] | Number of Spine Nodes | Total Bandwidth Leaf <=> Spine Per Stripe [Tbps] |
---|---|---|---|
8 | 2 x 400 | 2 | 12.8 |
The (over)subscription rate is simply calculated by comparing the numbers from the two tables above:
In cluster 1, the bandwidth between the servers and the leaf nodes is 6.4 Tbps per stripe, while the bandwidth available between the leaf and spine nodes is 12.8 Tbps per stripe. This means that the fabric has enough capacity to process all traffic between the GPUs even if this traffic were 100% inter-stripe, while still having extra capacity to accommodate additional servers without becoming oversubscribed.
Figure 6: Extra Capacity Example
We also tested connecting the H100 GPU servers alongside the A100 servers to the stripes in Cluster 1 as follows:
Figure 7: 1:1 Subscription Example
Table 7: Server to Leaf Bandwidth per stripe per cluster with all servers connected to same cluster
Cluster | AI Systems | Servers per Stripe | Server <=> Leaf Links per Server | Server <=> Leaf Links Bandwidth [Gbps] | Total Servers <=> Leaf Links Bandwidth per stripe [Tbps]
---|---|---|---|---|---|
1 | A100 | 4 | 8 | 200 | 4 x 8 x 200/1000 = 6.4 |
H100 | 2 | 8 | 400 | 2 x 8 x 400/1000 = 6.4 | |
Total Bandwidth of Server <=> Leaf Links | 12.8 |
The bandwidth between the servers and the leaf nodes is now 12.8 Tbps per stripe, while the bandwidth available between the leaf and spine nodes is also 12.8 Tbps per stripe (as shown in the table above). This means that the fabric has enough capacity to process all traffic between the GPUs even if this traffic were 100% inter-stripe, but there is no extra capacity to accommodate additional servers. The subscription factor in this case is 1:1 (no oversubscription).
To run oversubscription testing, we disabled some of the interfaces between the leaf and spine nodes to reduce the available bandwidth, as shown in the example in Figure 8:
Figure 8: 2:1 Oversubscription Example
The total Servers to Leaf Links bandwidth per stripe has not changed. It is still 12.8 Tbps, as shown in Table 7 in the previous scenario.
However, the bandwidth available between the leaf and spine nodes is now only 6.4 Tbps per stripe.
Table 8: Leaf to Spine Bandwidth per Stripe
Leaf <=> Spine Links Per Spine Node & Per Stripe | Speed Of Leaf <=> Spine Links [Gbps] | Number of Spine Nodes | Total Bandwidth Leaf <=> Spine Per Stripe [Tbps] |
---|---|---|---|
8 | 1 x 400 | 2 | 6.4 |
This means that the fabric no longer has enough capacity to process all traffic between the GPUs if this traffic is 100% inter-stripe, potentially causing congestion and traffic loss. The oversubscription factor in this case is 2:1.
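The subscription arithmetic in the three scenarios above can be reproduced with a short script. The sketch below is illustrative only; it simply encodes the server, link, and spine counts from Tables 5 through 8.

```python
# Illustrative sketch: reproduces the per-stripe subscription arithmetic from
# Tables 5-8. Values are the lab counts described above; adjust for other designs.

def stripe_bandwidth_tbps(servers, links_per_server, link_speed_gbps):
    """Total server<->leaf bandwidth per stripe, in Tbps."""
    return servers * links_per_server * link_speed_gbps / 1000

def fabric_bandwidth_tbps(leaves, spines, links_per_leaf_spine, link_speed_gbps):
    """Total leaf<->spine bandwidth per stripe, in Tbps."""
    return leaves * spines * links_per_leaf_spine * link_speed_gbps / 1000

def subscription_ratio(server_leaf_tbps, leaf_spine_tbps):
    """Ratio of server-facing capacity to fabric-facing capacity."""
    return server_leaf_tbps / leaf_spine_tbps

# Scenario 1: Cluster 1 baseline - 4 x A100 per stripe at 8 x 200GE each.
ingress = stripe_bandwidth_tbps(servers=4, links_per_server=8, link_speed_gbps=200)
egress = fabric_bandwidth_tbps(leaves=8, spines=2, links_per_leaf_spine=2, link_speed_gbps=400)
print(ingress, egress, subscription_ratio(ingress, egress))  # 6.4 12.8 0.5 -> spare capacity

# Scenario 2: A100s and H100s on the same stripe -> 12.8 Tbps of server traffic.
ingress = ingress + stripe_bandwidth_tbps(servers=2, links_per_server=8, link_speed_gbps=400)
print(subscription_ratio(ingress, egress))                   # 1.0 -> 1:1, no oversubscription

# Scenario 3: half of the leaf-spine links disabled -> 6.4 Tbps of fabric capacity.
egress = fabric_bandwidth_tbps(leaves=8, spines=2, links_per_leaf_spine=1, link_speed_gbps=400)
print(subscription_ratio(ingress, egress))                   # 2.0 -> 2:1 oversubscription
```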
Backend GPU Rail Optimized Stripe Architecture
A Rail Optimized Stripe Architecture provides efficient data transfer between GPUs, especially during computationally intensive tasks such as AI Large Language Models (LLM) training workloads, where seamless data transfer is necessary to complete the tasks within a reasonable timeframe. A Rail Optimized topology aims to maximize performance by providing minimal bandwidth contention, minimal latency, and minimal network interference, ensuring that data can be transmitted efficiently and reliably across the network.
In a Rail Optimized Stripe Architecture, a stripe refers to a design module or building block that can be replicated to scale up the AI cluster, as shown in Figure 9.
Figure 9: Rail Optimized Stripe
The number of leaf switches in a single stripe is determined by the number of GPUs per server. In this JVD design, there are 8 leaf switches because each NVIDIA DGX H100 GPU server has 8 NVIDIA H100 Tensor core GPUs.
The maximum number of servers supported in a single stripe (N1) is determined by the leaf node switch model. This is because, to provide 1:1 subscription, the number of interfaces connecting the GPU servers to the leaf nodes should be equal to the number of interfaces between the leaf and spine nodes.
Table 9: Maximum number of GPUs supported per stripe
Leaf Node QFX Model | Maximum number of 400 GE interfaces per switch | Maximum number of supported servers per stripe (N1) | Maximum number of GPUs supported per stripe |
---|---|---|---|
QFX5220-32CD | 32 | 16 | 16 x 8 = 128 |
QFX5230-64CD | 64 | 32 | 32 x 8 = 256 |
QFX5240-64OD | 64 | 32 | 32 x 8 = 256 |
- QFX5220-32CD switches provide 32 x 400 GE ports (16 can be used to connect to the servers and 16 will be used to connect to the spine nodes)
- QFX5230-64CD and QFX5240-64OD switches provide 64 x 400 GE ports (32 can be used to connect to the servers and 32 will be used to connect to the spine nodes)
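As a rough sketch of this sizing rule, the snippet below splits each leaf switch's 400GE port count evenly between server-facing and spine-facing ports and derives the per-stripe server and GPU capacity shown in Table 9. It assumes 8 GPUs per server and therefore 8 leaf nodes per stripe.

```python
# Illustrative sizing sketch for a rail-optimized stripe at 1:1 subscription:
# half of each leaf's 400GE ports face the servers, half face the spines.

GPUS_PER_SERVER = 8  # one leaf per rail, so 8 leaf nodes per stripe

def stripe_capacity(ports_400ge_per_leaf):
    server_facing = ports_400ge_per_leaf // 2  # the other half goes to the spines
    max_servers = server_facing                # one rail link per server per leaf
    max_gpus = max_servers * GPUS_PER_SERVER
    return max_servers, max_gpus

for model, ports in [("QFX5220-32CD", 32), ("QFX5230-64CD", 64), ("QFX5240-64OD", 64)]:
    servers, gpus = stripe_capacity(ports)
    print(f"{model}: up to {servers} servers / {gpus} GPUs per stripe")
# QFX5220-32CD: up to 16 servers / 128 GPUs per stripe
# QFX5230-64CD: up to 32 servers / 256 GPUs per stripe
# QFX5240-64OD: up to 32 servers / 256 GPUs per stripe
```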
To achieve larger scales, multiple stripes can be connected across Spine switches as shown in Figure 10.
Figure 10: Spines-connected Stripes
For example, assume that the desired number of GPUs is 16,000 and the fabric is using either QFX5230-64CD or QFX5240-64OD:
- The number of servers per stripe (N1) = 32, so the maximum number of GPUs supported per stripe = 256.
- N2 = 16,000 / 256 ≈ 63 stripes.
- With N2 = 64 stripes and N1 = 32 servers, the cluster can provide 16,384 GPUs.
- With N2 = 72 stripes and N1 = 32 servers, the cluster can provide 18,432 GPUs.
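The same arithmetic extends to multi-stripe scaling. The short sketch below, again purely illustrative, computes how many stripes are needed for a target GPU count and what a given stripe count provides.

```python
# Illustrative multi-stripe scaling sketch: how many stripes (N2) are needed
# for a target GPU count, given the per-stripe capacity derived above.
import math

GPUS_PER_STRIPE = 256  # 32 servers x 8 GPUs (QFX5230-64CD / QFX5240-64OD leaves)

def stripes_needed(target_gpus):
    return math.ceil(target_gpus / GPUS_PER_STRIPE)

def cluster_capacity(stripes):
    return stripes * GPUS_PER_STRIPE

print(stripes_needed(16000))   # 63 stripes to reach 16,000 GPUs
print(cluster_capacity(64))    # 16,384 GPUs with 64 stripes
print(cluster_capacity(72))    # 18,432 GPUs with 72 stripes
```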
The stripes in the AI JVD setup each consist of 8 Juniper QFX5220-32CD, QFX5230-64CD, or QFX5240-64OD switches, depending on the cluster and stripe. The number of GPUs supported on each cluster/stripe is shown in Table 10.
Table 10: Maximum number of GPUs supported per cluster
Cluster | Stripe | Leaf Node QFX model | Maximum number of GPUs supported per stripe |
---|---|---|---|
1 | 1 | QFX5230-64CD | 32 x 8 = 256
1 | 2 | QFX5220-32CD | 16 x 8 = 128
Total number of GPUs supported by the cluster | = 384 | ||
2 | 1 | QFX5240-64OD | 32 x 8 = 256 |
2 | 2 | QFX5240-64OD | 32 x 8 = 256 |
Total number of GPUs supported by the cluster | = 512 |
What is Rail Optimized?
The GPUs on each server are numbered 1-8, where the number represents the GPU’s position in the server, as shown in Figure 11.
Figure 11: Rail Optimized Connections Between GPUs and Leaf Nodes
Communication between GPUs in the same server happens internally via high-throughput NVLink channels attached to internal NVSwitches, while communication between GPUs in different servers happens across the QFX fabric, which provides 400Gbps GPU-to-GPU bandwidth. Communication across the fabric occurs between GPUs on the same rail, which is the basis of the rail-optimized architecture: rails connect GPUs of the same order across one of the leaf nodes; that is, rail N connects GPUs in position N in all the servers across leaf switch N.
Figure 12 represents a topology with one stripe and 8 rails connecting GPUs 1-8 across leaf switches 1-8 respectively.
The example shows that communication between GPU 7 and GPU 8 in Server 1 happens internally across Nvidia's NVLink/NVSwitch (not shown), while communication between GPU 1 in Server 1 and GPU 1 in Server N1 happens across Leaf switch 1 (within the same rail).
Notice that if communication between GPUs on different rails and in different servers is required (e.g., GPU 4 in Server 1 communicating with GPU 5 in Server N1), data is first moved to a GPU interface in the same rail as the destination GPU, thus sending data to the destination GPU without crossing rails.
Following this design, data between GPUs on different servers (but in the same stripe) is always moved on the same rail and across one single switch, which guarantees GPUs are 1 hop away from each other and creates separate independent high-bandwidth channels, which minimize contention and maximize performance.
Notice that this example presumes Nvidia's PXN feature is enabled. PXN can be easily enabled or disabled before a training or inference job is initiated.
Figure 12: GPU to GPU Communication Between Two Servers with PXN Enabled
For reference, Figure 13 shows an example with PXN disabled.
Figure 13: GPU to GPU Communication Between Two Servers Without PXN Enabled
The example shows that communication between GPU 4 in Server 1 and GPU 5 in Server N1 goes across Leaf switch 4, the Spine nodes, and Leaf switch 5 (between two different rails).
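The rail and PXN forwarding behavior illustrated in Figures 12 and 13 can be summarized in a small conceptual model. The function below is not NCCL code; it simply assumes the mapping described above (the GPU in position N attaches to leaf N) and classifies which path a GPU-to-GPU transfer takes.

```python
# Conceptual model of rail-optimized forwarding within one stripe:
# the GPU in position N on every server attaches to leaf switch N (rail N).

def gpu_path(src_server, src_gpu, dst_server, dst_gpu, pxn_enabled=True):
    """Classify the path a transfer takes between two GPUs in the same stripe."""
    if src_server == dst_server:
        return "NVLink/NVSwitch inside the server"
    if src_gpu == dst_gpu:
        return f"same rail: single hop through leaf {src_gpu}"
    if pxn_enabled:
        # PXN first moves the data over NVLink to the local GPU sitting on the
        # destination rail, then sends it across that single leaf switch.
        return f"NVLink to local GPU {dst_gpu}, then single hop through leaf {dst_gpu}"
    return f"leaf {src_gpu} -> spine -> leaf {dst_gpu} (crosses rails)"

print(gpu_path(1, 7, 1, 8))                    # intra-server: stays on NVLink
print(gpu_path(1, 1, 4, 1))                    # same rail: leaf 1 only
print(gpu_path(1, 4, 4, 5))                    # PXN rebound: stays on rail 5
print(gpu_path(1, 4, 4, 5, pxn_enabled=False)) # leaf 4 -> spine -> leaf 5
```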
Storage Backend Fabric
The Storage Backend fabric provides the connectivity infrastructure for storage devices to be accessible from the GPU servers.
The performance of the storage infrastructure significantly impacts the efficiency of AI workflows. A storage system that provides quick access to data can significantly reduce the amount of time for training AI models. Similarly, a storage system that supports efficient data querying and indexing can minimize the completion time of preprocessing and feature extraction in an AI workflow.
The Storage Backend fabric design in the JVD also follows a 3-stage IP Clos architecture, as shown in Figure 16. There is no concept of rail optimization in a storage cluster: each GPU server has a single connection to the leaf nodes, instead of 8.
Figure 16: Storage Backend Fabric Architecture
The Storage Backend devices included in this fabric, and the connections between them, are summarized in the following tables:
Table 16: Storage Backend devices
Nvidia DGX GPU Servers | Weka Storage Servers | Storage Backend Leaf Nodes switch model (storage-backend-gpu-leaf & storage-backend-weka-leaf) | Storage Backend Spine Nodes switch model (storage-backend-spine#) |
---|---|---|---|
A100 x 8, H100 x 4 | Weka storage server x 8 | QFX5130-32CD x 4 (2 storage-backend-gpu-leaf nodes, and 2 storage-backend-weka-leaf nodes) | QFX5130-32CD x 2
Table 17: Connections between servers, leaf and spine nodes in the Storage Backend
GPU Servers <=> Storage Backend GPU Leaf Nodes | Weka Storage Servers <=> Storage Backend Weka Leaf Nodes | Storage Backend Spine Nodes <=> Storage Backend Leaf nodes |
---|---|---|
1 x 100GE link between each H100 server and the storage-backend-gpu-leaf switch; 1 x 200GE link between each A100 server and the storage-backend-gpu-leaf switch | 1 x 100GE link between each storage server (weka-1 to weka-8) and the storage-backend-weka-leaf switch | 2 x 400GE links between each spine node and each storage-backend-weka-leaf switch; 3 x 400GE links between each spine node and each storage-backend-gpu-leaf switch
The NVIDIA servers hosting the GPUs have dedicated storage network adapters (NVIDIA ConnectX) that support both the Ethernet and InfiniBand protocols and provide connectivity to external storage arrays.
Communications between GPUs and the storage devices leverage the WEKA distributed POSIX client which enables multiple data paths for transfer of stored data from the WEKA nodes to the GPU client servers. The WEKA client leverages the Data Plane Development Kit (DPDK) to offload TCP packet processing from the Operating System Kernel to achieve higher throughput.
This communication is supported by the Storage Backend fabric described in the previous section and exemplified in Figure 17.
Figure 17: GPU Backend to Storage Backend Communication
WEKA Storage Solution
In small clusters, it may be sufficient to use the local storage on each GPU server, or to aggregate this storage together using open-source or commercial software. In larger clusters with heavier workloads, an external dedicated storage system is required to provide dataset staging for ingest, and for cluster checkpointing during training. This JVD describes the infrastructure for dedicated storage using WEKA storage.
WEKA is a distributed data platform that provides high-performance, concurrent access and allows all GPU servers in the cluster to efficiently utilize a shared storage resource. With extreme I/O capabilities, the WEKA system can service the needs of all servers and scale to support hundreds or even thousands of GPUs.
Toward the end of this document, you can find more details on the WEKA storage system, including configuration settings, driver details, and more.
Scaling
The size of an AI cluster varies significantly depending on the specific requirements of the workload. The number of nodes in an AI cluster is influenced by factors such as the complexity of the machine learning models, the size of the datasets, the desired training speed, and the available budget. The number varies from a small cluster with fewer than 100 nodes to a data center-wide cluster comprising tens of thousands of compute, storage, and networking nodes. A minimum of 4 spines must always be deployed for path diversity and reduction of PFC failure paths.
Table 18: Fabric Scaling - Devices and Positioning
Small | Medium | Large |
---|---|---|
64 – 2048 GPU | 2048 – 8192 GPU | 8192 – 32768 GPU |
With support for up to 2048 GPUs, the Juniper QFX5240-64CDs or QFX5230-64CD can be used as Spine and leaf devices to support single or dual-stripe applications. To follow best practice recommendations, a minimum of 4 Spines should be deployed, even in a single-stripe fabric. | With support for 2048 – 8192 GPUs, the Juniper QFX5240-64CDs can be used as Spine and leaf devices to achieve appropriate scale. This 3-stage, rail-based fabric design provides physical connectivity to 16 Stripes from 64 Spines and 1024 leaf nodes, maintaining a 1:1 subscription throughput model. | For infrastructures supporting more than 8192 GPUs, the Juniper PTX1000x Chassis spine and QFX5240 leaf nodes can support up to 32768 GPUs. This 3-stage, rail-based fabric design provides physical connectivity to 64 Stripes from 64 Spines and 4096 leaf nodes, maintaining a 1:1 subscription throughput model. |
Juniper continues its rapid innovation for increased scalability and low Job Completion Times in AI network fabrics with our recently introduced QFX5240 TH5 switch, delivering 64 high-density 800GbE ports in a 2U fixed form factor with software that provides advanced network services tuned to the specific needs of AI workloads. These advanced services include Selective Load Balancing, Global Load Balancing, ISSU Fast Boot, Reactive Path Balancing, and more.
Juniper Hardware and Software Components
For this solution design, the Juniper products and software versions are below. The design documented in this JVD is considered the baseline representation for the validated solution. As part of a complete solutions suite, we routinely swap hardware devices with other models during iterative use case testing. Each switch platform validated in this document goes through the same rigorous role-based testing using specified versions of Junos OS and Apstra management software.
Juniper Hardware Components
The following table summarizes the switches tested and validated by role for the AI Data Center Network with Juniper Apstra JVD.
Table 19: Validated Devices and Positioning
Solution | Leaf Switches | Spine Switches |
---|---|---|
Frontend Fabric | QFX5130-32CD | QFX5130-32CD |
GPU Backend Fabric | QFX5230-64CD (CLUSTER 1, STRIPE 1), QFX5220-32CD (CLUSTER 1, STRIPE 2), QFX5240-64OD (CLUSTER 2) | QFX5230-64CD (CLUSTER 1), PTX10008 w/ JNP10K-LC1201 (CLUSTER 1), QFX5240-64OD (CLUSTER 2)
Storage Backend Fabric | QFX5220-32CD | QFX5220-32CD |
Juniper Software Components
The following table summarizes the software versions tested and validated by role.
Table 20: Platform Recommended Release
Platform | Role | Junos OS Release |
---|---|---|
QFX5130-32CD | Frontend Leaf | 23.4R2-S3
QFX5130-32CD | Frontend Spine | 23.4R2-S3
QFX5220-32CD | Storage Backend Leaf | 23.4X100-D20 |
QFX5220-32CD | Storage Backend Spine | 23.4X100-D20 |
QFX5220-32CD | GPU Backend Leaf | 23.4X100-D20 |
QFX5230-64CD | GPU Backend Leaf | 23.4X100-D20 |
QFX5230-64CD | GPU Backend Spine | 23.4X100-D20 |
QFX5240-64OD | GPU Backend Leaf | 23.4X100-D20
QFX5240-64OD | GPU Backend Spine | 23.4X100-D20
PTX10008 with LC1201 | GPU Backend Spine | 23.4R2-S3 |
IP Services for AI Networks
As described in the next few sections, various strategies can be employed to handle traffic congestion in the AI network.
Congestion Management
AI clusters pose unique demands on network infrastructure due to their high-density and low-entropy traffic patterns, characterized by frequent elephant flows with minimal flow variation. Additionally, most AI models require uninterrupted packet flow with no packet loss for training jobs to be completed.
For these reasons, when designing a network infrastructure for AI traffic flows, the key objectives include maximum throughput, minimal latency, and minimal network interference over a lossless fabric, resulting in the need to configure effective congestion control methods.
Data Center Quantized Congestion Notification (DCQCN) has become the industry standard for end-to-end congestion control for RDMA over Converged Ethernet (RoCEv2) traffic. DCQCN congestion control methods offer techniques to strike a balance between reducing traffic rates and stopping traffic altogether to alleviate congestion, without resorting to packet drops.
DCQCN combines two different mechanisms for flow and congestion control:
- Priority-Based Flow Control (PFC), and
- Explicit Congestion Notification (ECN).
Priority-Based Flow Control (PFC) helps relieve congestion by halting traffic flow for individual traffic priorities (IEEE 802.1p or DSCP markings) mapped to specific queues or ports. The goal of PFC is to stop a neighbor from sending traffic for an amount of time (PAUSE time), or until the congestion clears. This process consists of sending PAUSE control frames upstream requesting the sender to halt transmission of all traffic for a specific class or priority while congestion is ongoing. The sender completely stops sending traffic to the requesting device for the specific priority.
While PFC mitigates data loss and allows the receiver to catch up processing packets already in the queue, it impacts performance of applications using the assigned queues during the congestion period. Additionally, resuming traffic transmission post-congestion often triggers a surge, potentially exacerbating or reinstating the congestion scenario.
We recommend configuring PFC only on the QFX devices acting as spine nodes.
Explicit Congestion Notification (ECN), on the other hand, curtails transmit rates during congestion while enabling traffic to persist, albeit at reduced rates, until congestion subsides. The goal of ECN is to reduce packet loss and delay by making the traffic source decrease the transmission rate until the congestion clears. This process entails marking packets at congestion points by setting the ECN bits to 11 in the IP header. The presence of this ECN marking prompts receivers to generate Congestion Notification Packets (CNPs) sent back to the source, which signal the source to throttle traffic rates.
Combining PFC and ECN offers the most effective congestion relief in a lossless IP fabric supporting RoCEv2, while safeguarding against packet loss. To achieve this, when implementing PFC and ECN together, their parameters should be carefully selected so that ECN is triggered before PFC.
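One practical consequence of this ordering requirement is that the ECN marking thresholds on the RoCE queue must sit below the buffer fill level that triggers a PFC pause. The snippet below is a conceptual check only; the threshold names and values are hypothetical placeholders rather than the validated Junos parameters, which are covered in the Configuration Walkthrough.

```python
# Conceptual DCQCN sanity check: ECN should start marking packets before the
# buffer fill level that triggers a PFC pause toward the sender.
# All values are hypothetical placeholders expressed as percent of queue depth.
from dataclasses import dataclass

@dataclass
class RoceQueueProfile:
    ecn_min_fill: int   # fill level (%) where ECN marking begins
    ecn_max_fill: int   # fill level (%) where ECN marking probability peaks
    pfc_xoff_fill: int  # fill level (%) where a PFC pause frame is sent upstream

def ecn_triggers_before_pfc(p: RoceQueueProfile) -> bool:
    """True when the profile lets ECN throttle senders before PFC halts them."""
    return p.ecn_min_fill < p.ecn_max_fill <= p.pfc_xoff_fill

good = RoceQueueProfile(ecn_min_fill=30, ecn_max_fill=60, pfc_xoff_fill=80)
bad = RoceQueueProfile(ecn_min_fill=70, ecn_max_fill=95, pfc_xoff_fill=60)

print(ecn_triggers_before_pfc(good))  # True  - ECN reduces rates first, PFC is the backstop
print(ecn_triggers_before_pfc(bad))   # False - PFC would pause traffic before ECN can react
```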
Load Balancing
The fabric architecture used in this JVD for both the Frontend and Backend follows a 3-stage Clos design, with every leaf node connected to all the available spine nodes over multiple interfaces. As a result, multiple paths are available between the leaf and spine nodes to reach other devices.
AI traffic characteristics may impede optimal link utilization when implementing traditional Equal Cost Multiple Path (ECMP) Static Load Balancing (SLB) over these paths. This is because the hashing algorithm, which looks at specific fields in the packet headers, can map multiple similar flows to the same link. Consequently, certain links will be favored, and their high utilization may impede the transmission of smaller low-bandwidth flows, leading to potential collisions, congestion, and packet drops. To improve the distribution of traffic across all the available paths, either Dynamic Load Balancing (DLB) or Global Load Balancing (GLB) can be implemented instead.
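To see why static hashing struggles with low-entropy AI traffic, consider the toy example below: a handful of elephant flows whose 5-tuples differ only slightly can easily land on the same ECMP member, leaving other links underused. This is a simplified stand-in, not the actual QFX hashing algorithm.

```python
# Toy illustration of static ECMP hashing with low-entropy traffic: a few large
# flows with nearly identical 5-tuples may collapse onto the same uplink.
import hashlib
from collections import Counter

UPLINKS = 4

def ecmp_member(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    """Stand-in static hash: map a 5-tuple to one of the ECMP member links."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % UPLINKS

# Eight RoCEv2-style elephant flows: same endpoints and destination port (4791),
# only the source port varies, so there is very little entropy to hash on.
flows = [("10.1.1.1", "10.1.2.1", 1000 + i, 4791) for i in range(8)]
placement = Counter(ecmp_member(*f) for f in flows)
print(placement)  # with so few (large) flows the split is rarely even: some
                  # uplinks carry several elephants while others sit underused
```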
For this JVD, Dynamic Load Balancing in flowlet mode was implemented on all the QFX leaf and spine nodes. Additional testing was conducted on the QFX5240-64OD in GPU Backend Fabric cluster 2 to evaluate the benefits of Selective Dynamic Load Balancing, Reactive Path Rebalancing, and Global Load Balancing.
These load balancing mechanisms are only available on the QFX devices.
Dynamic Load Balancing (DLB)
DLB ensures that all paths are utilized more fairly, by not only looking at the packet headers, but also considering real-time link quality based on port load (link utilization) and port queue depth, when selecting a path. This method provides better results when multiple long-lived flows moving large amounts of data need to be load balanced.
DLB can be configured in two different modes:
- Per packet mode: packets from the same flow are sprayed across link members of an IP ECMP group, which can cause packets to arrive out of order.
- Flowlet Mode: packets from the same flow are sent across a link member of an IP ECMP group. A flowlet is defined as a burst of packets from the same flow separated from other bursts by periods of inactivity. If a flow pauses for longer than the configured inactivity interval, the quality of the link members can be reevaluated and the flow can be reassigned to a different link, as illustrated in the sketch after this list.
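The sketch below models the flowlet idea at a high level: a flow keeps its assigned link while packets keep arriving, and is only re-evaluated against link load after a gap longer than the inactivity interval. It is a conceptual model, not the ASIC implementation, and the interval value is an arbitrary placeholder.

```python
# Conceptual flowlet-mode DLB model: a flow is re-evaluated only when a gap in
# its packet arrivals exceeds the inactivity interval; otherwise it stays on
# its current link, preserving packet order. Not the ASIC implementation.

INACTIVITY_INTERVAL_US = 64  # arbitrary placeholder value

class FlowletBalancer:
    def __init__(self, num_links):
        self.link_load = [0] * num_links  # stand-in for real-time link quality
        self.assignment = {}              # flow id -> (link, last_seen_us)

    def forward(self, flow_id, now_us, pkt_bytes):
        link, last_seen = self.assignment.get(flow_id, (None, None))
        if link is None or now_us - last_seen > INACTIVITY_INTERVAL_US:
            # New flowlet: pick the currently least-loaded member link.
            link = min(range(len(self.link_load)), key=self.link_load.__getitem__)
        self.assignment[flow_id] = (link, now_us)
        self.link_load[link] += pkt_bytes
        return link

lb = FlowletBalancer(num_links=4)
print(lb.forward("qp-17", now_us=0, pkt_bytes=4096))    # initial assignment
print(lb.forward("qp-17", now_us=10, pkt_bytes=4096))   # same flowlet, same link
print(lb.forward("qp-17", now_us=500, pkt_bytes=4096))  # long gap -> may move links
```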
Some enhancements have been introduced for the QFX5230s and QFX5240s in recent versions of Junos OS.
- Selective Dynamic Load Balancing (SDLB): allows applying DLB only to certain traffic. This feature is only supported on the QFX5230-64CD, QFX5240-64OD, and QFX5240-64QD, starting in Junos OS Evolved Release 23.4R2, at the time of this document's publication.
- Reactive path rebalancing: allows a flow to be reassigned to a different (better) link when the current link's quality deteriorates, even if no pause in the traffic flow has exceeded the configured inactivity interval. This feature is only supported on the QFX5240-64OD and QFX5240-64QD, starting in Junos OS Evolved Release 23.4R2, at the time of this document's publication.
Global Load Balancing (GLB)
GLB is an improvement on DLB, which only considers local link bandwidth utilization. GLB, on the other hand, has visibility into the bandwidth utilization of links at the next-to-next-hop (NNH) level. As a result, GLB can reroute traffic flows to avoid traffic congestion farther out in the network than DLB can detect.
Each language model has a different traffic profile and characteristics; therefore, class of service will need to be tuned to the specific model or models in use. Introduction to Congestion Control in Juniper AI Networks explores how to build a lossless fabric for AI workloads using DCQCN (ECN and PFC) congestion control methods and DLB. That document uses the DLRM training model as a reference and demonstrates how different congestion indicators, such as ECN and PFC counters, input drops, and tail drops, can be monitored to adjust configuration and build a lossless fabric infrastructure for RoCEv2 traffic. Load Balancing in the Data Center provides a comprehensive deep dive into the various load-balancing mechanisms and their evolution to suit the needs of the data center.