Solution Architecture
The three fabrics described in the previous section (Frontend, GPU Backend, and Storage Backend) are interconnected in the overall AI JVD solution architecture as shown in Figure 2.
Figure 2: AI JVD Solution Architecture
Note: The number and switch type of the leaf and spine nodes, as well as the number and speed of the links between them, are determined by the type of fabric (Frontend, GPU Backend, or Storage Backend), as each presents different requirements. More details are included in the respective fabric description sections.
In the case of the GPU Backend fabric, the number of GPU servers, as well as the number of GPUs per server, are also factors determining the number and switch type of the leaf and spine nodes.
Frontend Fabric
The Frontend Fabric provides the infrastructure for users to interact with the AI systems and orchestrate training and inference workflows using tools such as SLURM. These interactions do not generate heavy data flows nor have stringent requirements regarding latency or packet drops; thus, they do not impose rigorous demands on the fabric.
The Frontend Fabric design described in this JVD follows a traditional 3-stage IP Fabric architecture without HA, as shown in Figure 3. This architecture provides a simple and effective solution for the connectivity required in the Frontend. However, any fabric architecture, including EVPN/VXLAN, could be used. If an HA-capable Frontend Fabric is required, we recommend following the 3-Stage with Juniper Apstra JVD.
Figure 3: Frontend Fabric Architecture
The Frontend devices included in this fabric, and the connections between them, can be summarized as follows:
| Nvidia DGX GPU Servers | Weka Storage Servers | Headend Servers | Frontend Leaf Nodes switch model (frontend-gpu-leaf & frontend-weka-leaf) | Frontend Spine Nodes switch model (frontend-spine#) |
|---|---|---|---|---|
| A100 x 8, H100 x 4 | Weka Storage Server x 8 | Headend-SVR x 3 | QFX5130-32CD x 2 | QFX5130-32CD x 2 |
| GPU Servers <=> Frontend Leaf Nodes | Weka Storage Servers <=> Frontend Leaf Nodes | Headend Servers <=> Frontend Leaf Nodes | Frontend Spine Nodes <=> Frontend Leaf Nodes |
|---|---|---|---|
| 1 x 100GE link between each GPU server (A100-01 to A100-08 and H100-01 to H100-04) and the frontend-gpu-leaf switch | 1 x 100GE link between each storage server (weka-1 to weka-8) and the frontend-weka-leaf switch | 1 x 10GE link between each headend server (Headend-SVR-01 to Headend-SVR-03) and the frontend-weka-leaf switch | 2 x 400GE links between each leaf node and each spine node |
This fabric is a pure L3 IP fabric that uses EBGP for route advertisement. The IP addressing and EBGP configuration details are described in the networking section of this document.
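The underlay details are covered in the networking section; purely as an illustration of how an EBGP IP-fabric underlay is typically laid out, the hypothetical sketch below assigns one private ASN per device and carves a pool into /31 point-to-point links. The ASN base and prefix pool are illustrative assumptions, not the JVD's actual plan.

```python
# Hypothetical sketch of an EBGP IP-fabric underlay plan: one private ASN per
# device and a /31 per leaf<->spine link. The ASN base and prefix pool are
# illustrative assumptions; the JVD's actual values are in the networking section.
import ipaddress

SPINES = ["frontend-spine1", "frontend-spine2"]
LEAVES = ["frontend-gpu-leaf", "frontend-weka-leaf"]
BASE_ASN = 65000                                   # assumed private ASN base
P2P_POOL = ipaddress.ip_network("10.0.0.0/24")     # assumed point-to-point pool

# One unique ASN per device so EBGP can run on every fabric link.
asn = {dev: BASE_ASN + i for i, dev in enumerate(SPINES + LEAVES)}

# Carve the pool into /31s, one per leaf<->spine adjacency (a single link per
# adjacency is shown here; the lab uses 2 x 400GE per leaf-spine pair).
subnets = P2P_POOL.subnets(new_prefix=31)
for spine in SPINES:
    for leaf in LEAVES:
        p2p = next(subnets)
        spine_ip, leaf_ip = list(p2p.hosts())
        print(f"{spine} (AS{asn[spine]}) {spine_ip} <-> {leaf_ip} {leaf} (AS{asn[leaf]})")
```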
GPU Backend Fabric
The GPU Backend fabric provides the infrastructure for GPUs to communicate with each other within a cluster, using RDMA over Converged Ethernet (RoCEv2). RoCEv2 boosts data center efficiency, reduces overall complexity, and increases data delivery performance by enabling the GPUs to communicate as they would with the InfiniBand protocol.
Packet loss can significantly impact job completion times and should therefore be avoided. When designing the compute network infrastructure to support RoCEv2 for an AI cluster, one of the key objectives is to provide a lossless fabric while also achieving maximum throughput, minimal latency, and minimal network interference for the AI traffic flows. RoCEv2 is more efficient over lossless networks, resulting in optimum job completion times.
The GPU Backend fabric in this JVD was designed with these goals in mind and follows a 3-stage IP Clos architecture combined with NVIDIA's Backend GPU Rail Optimized Stripe Architecture (discussed in the next section), as shown in Figure 4.
Figure 4: GPU Backend Fabric Architecture
We have built two different clusters in the AI lab, with different combinations of QFX switch models as leaf and spine nodes and two different Nvidia server models, as shown in Figure 5.
Figure 5: AI JVD Lab Clusters
The two clusters share the same Frontend fabric and Storage Backend fabric but have their own GPU Backend fabric. Each cluster comprises two stripes following the Backend GPU Rail Optimized Stripe Architecture. Within each cluster, the two stripes are interconnected through the spine nodes, and each stripe includes a different set of GPU servers connected to its leaf nodes.
The backend devices included in the fabric on each cluster/stripe, and the connections between them, can be summarized as follows:
Backend devices per cluster and stripe

| Cluster | Stripe | Nvidia DGX GPU Servers | GPU Backend Leaf Nodes switch model (gpu-backend-leaf#) | GPU Backend Spine Nodes switch model (gpu-backend-spine#) |
|---|---|---|---|---|
| 1 | 1 | A100-01 to A100-04 | QFX5230-64CD x 8 | QFX5230-64CD x 2 |
| 1 | 2 | A100-05 to A100-08 | QFX5220-32CD x 8 | |
| 2 | 1 | H100-01 to H100-02 | QFX5240-64OD x 8 | |
| 2 | 2 | H100-03 to H100-04 | QFX5240-64OD x 8 | |
Connections between servers, leaf and spine nodes per cluster and stripe

| Cluster | Stripe | GPU Servers <=> GPU Backend Leaf Nodes | GPU Backend Spine Nodes <=> GPU Backend Leaf Nodes |
|---|---|---|---|
| 1 | 1 | 1 x 200GE link between each A100 server and each leaf node (200GE x 8 links per server) | 2 x 400GE links between each leaf node and each spine node (2 x 400GE x 2 links per leaf node) |
| 1 | 2 | 1 x 200GE link between each A100 server and each leaf node (200GE x 8 links per server) | 2 x 400GE links between each leaf node and each spine node (2 x 400GE x 2 links per leaf node) |
| 2 | 1 | 1 x 400GE link between each H100 server and each leaf node (400GE x 8 links per server) | 2 x 400GE links between each leaf node and each spine node (2 x 400GE x 4 links per leaf node) |
| 2 | 2 | 1 x 400GE link between each H100 server and each leaf node (400GE x 8 links per server) | 2 x 400GE links between each leaf node and each spine node (2 x 400GE x 4 links per leaf node) |
- Nvidia A100 servers in the lab are connected to the fabric using 200GE interfaces, while the H100 servers use 400GE interfaces.
- This fabric is a pure L3 IP fabric that uses EBGP for route advertisement (described in the networking section).
- Connectivity between the servers and the leaf nodes is L2 VLAN-based, with an IRB on the leaf nodes acting as the default gateway for the servers (described in the networking section).
The speed and number of links between the GPU servers and leaf nodes, and between the leaf and spine nodes, determine the oversubscription factor. As an example, consider the number of GPU servers available in the lab and how they are connected to the GPU backend fabric as described above.
Per cluster, per stripe Server to Leaf Bandwidth

| Cluster | AI Systems (server type) | Servers per Stripe | Server <=> Leaf Links per Server | Bandwidth of Server <=> Leaf Links [Gbps] | Total Bandwidth Servers <=> Leaf per Stripe [Tbps] |
|---|---|---|---|---|---|
| 1 | A100 | 4 | 8 | 200 | 4 x 8 x 200/1000 = 6.4 |
| 2 | H100 | 2 | 8 | 400 | 2 x 8 x 400/1000 = 6.4 |
Per cluster, per stripe Leaf to Spine Bandwidth

| Leaf <=> Spine Links per Spine Node & per Stripe | Speed of Leaf <=> Spine Links [Gbps] | Number of Spine Nodes | Total Bandwidth Leaf <=> Spine per Stripe [Tbps] |
|---|---|---|---|
| 8 | 2 x 400 | 2 | 12.8 |
The (over)subscription ratio is calculated by comparing the numbers from the two tables above:
In Cluster 1, the bandwidth between the servers and the leaf nodes is 6.4 Tbps per stripe, while the bandwidth available between the leaf and spine nodes is 12.8 Tbps per stripe. This means that the fabric has enough capacity to process all traffic between the GPUs even if this traffic were 100% inter-stripe, while still having extra capacity to accommodate additional servers without becoming oversubscribed.
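The arithmetic behind this comparison can be expressed as a minimal sketch (values taken from the Cluster 1 tables above; Python is used purely for illustration):

```python
# Minimal sketch of the per-stripe subscription arithmetic used above.
def stripe_bandwidth_tbps(num_devices, links_per_device, link_speed_gbps):
    """Aggregate bandwidth on one side of a stripe, in Tbps."""
    return num_devices * links_per_device * link_speed_gbps / 1000

# Server <=> leaf: 4 A100 servers per stripe, 8 x 200GE links each.
server_to_leaf = stripe_bandwidth_tbps(4, 8, 200)       # 6.4 Tbps
# Leaf <=> spine: 8 leaf nodes per stripe, 2 spines x 2 x 400GE links each.
leaf_to_spine = stripe_bandwidth_tbps(8, 2 * 2, 400)    # 12.8 Tbps

print(f"{server_to_leaf} Tbps toward servers vs {leaf_to_spine} Tbps toward spines "
      f"-> subscription {server_to_leaf / leaf_to_spine:.2f}:1")   # 0.50:1
```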
Figure 6: Extra Capacity Example
We also tested connecting the H100 GPU servers alongside the A100 servers to the stripes in Cluster 1 as follows:
Figure 7: 1:1 Subscription Example
Per cluster, per stripe Server to Leaf Bandwidth with all servers connected to the same cluster

| Cluster | AI Systems | Servers per Stripe | Server <=> Leaf Links per Server | Server <=> Leaf Links Bandwidth [Gbps] | Total Servers <=> Leaf Links Bandwidth per Stripe [Tbps] |
|---|---|---|---|---|---|
| 1 | A100 | 4 | 8 | 200 | 4 x 8 x 200/1000 = 6.4 |
| | H100 | 2 | 8 | 400 | 2 x 8 x 400/1000 = 6.4 |
| | | | | Total Bandwidth of Server <=> Leaf Links | 12.8 |
The bandwidth between the servers and the leaf nodes is now 12.8 Tbps per stripe, while the bandwidth available between the leaf and spine nodes is also 12.8 Tbps per stripe (as shown in table 2 above). This means that the fabric has enough capacity to process all traffic between the GPUs even if this traffic were 100% inter-stripe, but there is now no extra capacity to accommodate additional servers. The subscription factor in this case is 1:1 (no oversubscription).
To run oversubscription testing, we disabled some of the interfaces between the leaf and spine nodes to reduce the available bandwidth, as shown in the example in Figure 8:
Figure 8: 2:1 Oversubscription Example
The total Servers to Leaf Links bandwidth per stripe has not changed. It is still 12.8 Tbps as shown in table 3 in the previous scenario.
However, the bandwidth available between the leaf and spine nodes is now only 6.4 Tbps per stripe.
Leaf to Spine Bandwidth per Stripe

| Leaf <=> Spine Links per Spine Node & per Stripe | Speed of Leaf <=> Spine Links [Gbps] | Number of Spine Nodes | Total Bandwidth Leaf <=> Spine per Stripe [Tbps] |
|---|---|---|---|
| 8 | 1 x 400 | 2 | 6.4 |
This means that the fabric no longer has enough capacity to process all traffic between the GPUs if this traffic were 100% inter-stripe, potentially causing congestion and traffic loss. The oversubscription factor in this case is 2:1.
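The same arithmetic, applied to the three scenarios above, gives the subscription factors directly (a small sketch for illustration):

```python
# Per-stripe bandwidth pairs (server<=>leaf Tbps, leaf<=>spine Tbps) for the
# three scenarios described above.
scenarios = {
    "Extra capacity (A100 only, Cluster 1)":    (6.4, 12.8),
    "1:1 (A100 + H100 on Cluster 1)":           (12.8, 12.8),
    "2:1 (half the leaf-spine links disabled)": (12.8, 6.4),
}
for name, (toward_servers, toward_spines) in scenarios.items():
    print(f"{name}: {toward_servers / toward_spines:g}:1")
```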
Backend GPU Rail Optimized Stripe Architecture
A Rail Optimized Stripe Architecture provides efficient data transfer between GPUs, especially during computationally intensive tasks such as AI Large Language Models (LLM) training workloads, where seamless data transfer is necessary to complete the tasks within a reasonable timeframe. A Rail Optimized topology aims to maximize performance by providing minimal bandwidth contention, minimal latency, and minimal network interference, ensuring that data can be transmitted efficiently and reliably across the network.
In a Rail Optimized Stripe Architecture, a stripe refers to a design module or building block that can be replicated to scale up the AI cluster, as shown in Figure 9.
Figure 9: Rail Optimized Stripe
The number of leaf switches in a single stripe is always 8 and is determined by the number of GPUs per server (each NVIDIA DGX H100 GPU server includes 8 NVIDIA H100 Tensor Core GPUs).
The maximum number of servers supported in a single stripe (N1) is determined by the leaf node switch model. This is because, to provide a 1:1 subscription ratio, the number of interfaces connecting the GPU servers to the leaf nodes must equal the number of interfaces between the leaf and spine nodes.
Maximum number of GPUs supported per stripe
Leaf Node QFX Model | Maximum number of 400 GE interfaces per switch | Maximum number of supported servers per stripe (N1) | Maximum number of GPUs supported per stripe |
---|---|---|---|
QFX5220-32CD | 32 | 16 | 16 x 8 = 128 |
QFX5230-64CD | 64 | 32 | 32 x 8 = 256 |
QFX5240-64OD | 64 | 32 | 32 x 8 = 256 |
- QFX5220-32CD switches provide 32 x 400GE ports (16 can be used to connect to the servers and 16 to connect to the spine nodes).
- QFX5230-64CD and QFX5240-64OD switches provide 64 x 400GE ports (32 can be used to connect to the servers and 32 to connect to the spine nodes).
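The table above follows directly from these port counts; a short sketch of the rule (assuming the 1:1 subscription split of leaf ports described above):

```python
# Per-stripe scaling rule sketched above: with 1:1 subscription, half of each
# leaf's 400GE ports face the servers and half face the spines.
GPUS_PER_SERVER = 8      # also the number of leaf nodes (rails) per stripe

leaf_models = {          # usable 400GE ports per leaf switch model
    "QFX5220-32CD": 32,
    "QFX5230-64CD": 64,
    "QFX5240-64OD": 64,
}
for model, ports in leaf_models.items():
    n1 = ports // 2                         # max servers per stripe (N1)
    print(f"{model}: N1 = {n1}, GPUs per stripe = {n1 * GPUS_PER_SERVER}")
```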
To achieve larger scales, multiple stripes can be connected across Spine switches as shown in Figure 10.
Figure 10: Spines-connected Stripes
For example, assume that the desired number of GPUs is 16,000 and the fabric is using either QFX5230-64CD or QFX5240-64OD:
- The number of servers per stripe (N1) = 32, so the maximum number of GPUs supported per stripe = 256.
- N2 = 16,000 / 256 = 62.5, which rounds up to 63 stripes.
- With N2 = 64 stripes and N1 = 32 servers, the cluster can provide 16,384 GPUs.
- With N2 = 72 stripes and N1 = 32 servers, the cluster can provide 18,432 GPUs.
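The same calculation can be written as a short sketch (assuming 64-port leaf nodes, so 256 GPUs per stripe):

```python
# Stripe-count arithmetic for a target GPU count, assuming 64-port leaf nodes
# (N1 = 32 servers per stripe, 32 x 8 = 256 GPUs per stripe).
import math

GPUS_PER_STRIPE = 32 * 8                                        # 256

target = 16_000
print(f"{target} GPUs -> N2 = {math.ceil(target / GPUS_PER_STRIPE)} stripes")  # 63

for n2 in (64, 72):
    print(f"N2 = {n2} stripes -> {n2 * GPUS_PER_STRIPE} GPUs")  # 16384, 18432
```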
The stripes in the AI JVD setup each consist of 8 Juniper QFX5220-32CD, QFX5230-64CD, or QFX5240-64OD leaf nodes, depending on the cluster and stripe. The number of GPUs supported on each cluster/stripe is shown in the following table.
Maximum number of GPUs supported per cluster

| Cluster | Stripe | Leaf Node QFX model | Maximum number of GPUs supported per stripe |
|---|---|---|---|
| 1 | 1 | QFX5230-64CD | 32 x 8 = 256 |
| 1 | 2 | QFX5220-32CD | 16 x 8 = 128 |
| | | Total number of GPUs supported by the cluster | = 384 |
| 2 | 1 | QFX5240-64OD | 32 x 8 = 256 |
| 2 | 2 | QFX5240-64OD | 32 x 8 = 256 |
| | | Total number of GPUs supported by the cluster | = 512 |
What is Rail Optimized?
The GPUs on each server are numbered 1-8, where the number represents the GPU’s position in the server, as shown in Figure 11.
Figure 11: Rail Optimized Connections Between GPUs and Leaf Nodes
Communication between GPUs in the same server happens internally via high-throughput NVLink channels attached to internal NVSwitches, while communication between GPUs in different servers happens across the QFX fabric, which provides 400 Gbps GPU-to-GPU bandwidth. Communication across the fabric occurs between GPUs on the same rail, which is the basis of the rail-optimized architecture: rails connect GPUs of the same order across one of the leaf nodes; that is, rail N connects the GPUs in position N in all the servers across leaf switch N.
Figure 12 represents a topology with one stripe and 8 rails connecting GPUs 1-8 across leaf switches 1-8 respectively.
The example shows that communication between GPU 7 and GPU 8 in Server 1 happens internally across Nvidia's NVLink/NVSwitch (not shown), while communication between GPU 1 in Server 1 and GPU 1 in Server N1 happens across Leaf switch 1 (within the same rail).
Notice that if communication is required between GPUs on different rails and in different servers (e.g., GPU 4 in Server 1 communicating with GPU 5 in Server N1), data is first moved to a GPU interface on the same rail as the destination GPU, thus sending data to the destination GPU without crossing rails.
Following this design, data between GPUs on different servers (but in the same stripe) always moves on the same rail and across a single switch. This guarantees that GPUs are one hop away from each other and creates separate, independent high-bandwidth channels that minimize contention and maximize performance.
Notice that this example presumes Nvidia's PXN feature is enabled. PXN can easily be enabled or disabled before a training or inference job is initiated.
Figure 12: GPU to GPU Communication Between Two Servers with PXN Enabled
For reference, Figure 13 shows an example with PXN disabled.
Figure 13: GPU to GPU Communication Between Two Servers Without PXN Enabled
The example shows that communication between GPU 4 in Server 1 and GPU 5 in Server N1 goes across Leaf switch 4, the Spine nodes, and Leaf switch 5 (between two different rails).
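The path-selection behavior illustrated in Figures 12 and 13 can be summarized in a small sketch (illustrative pseudologic only, not NCCL or PXN code; server and GPU labels follow the figures):

```python
# Illustrative sketch of the rail-optimized path selection shown in Figures 12-13.
# Rail N (leaf switch N) connects GPU position N in every server of the stripe.
def fabric_path(src_server, src_gpu, dst_server, dst_gpu, pxn_enabled=True):
    """Return a rough description of how a GPU-to-GPU transfer is carried."""
    if src_server == dst_server:
        return "internal NVLink/NVSwitch (never enters the fabric)"
    if src_gpu == dst_gpu:                    # same rail -> same leaf switch
        return f"one hop across leaf {src_gpu} (rail {src_gpu})"
    if pxn_enabled:
        # Data first moves over NVLink to the local GPU on the destination
        # rail, then crosses that single leaf switch without changing rails.
        return f"NVLink to local GPU {dst_gpu}, then one hop across leaf {dst_gpu}"
    # Without PXN the flow crosses rails: leaf -> spine -> leaf.
    return f"leaf {src_gpu} -> spine -> leaf {dst_gpu}"

print(fabric_path("Server 1", 7, "Server 1", 8))                       # intra-server
print(fabric_path("Server 1", 1, "Server N1", 1))                      # same rail
print(fabric_path("Server 1", 4, "Server N1", 5))                      # PXN enabled
print(fabric_path("Server 1", 4, "Server N1", 5, pxn_enabled=False))   # PXN disabled
```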
Storage Backend Fabric
The Storage Backend fabric provides the connectivity infrastructure for storage devices to be accessible from the GPU servers.
The performance of the storage infrastructure significantly impacts the efficiency of AI workflows. A storage system that provides quick access to data can significantly reduce the amount of time for training AI models. Similarly, a storage system that supports efficient data querying and indexing can minimize the completion time of preprocessing and feature extraction in an AI workflow.
The Storage Backend fabric design in this JVD also follows a 3-stage IP Clos architecture, as shown in Figure 14. There is no concept of rail optimization in a storage cluster: each GPU server has a single connection to the leaf nodes, instead of eight.
Figure 14: Storage Backend Fabric Architecture
The Storage Backend devices included in this fabric, and the connections between them, can be summarized as follows:
| Nvidia DGX GPU Servers | Weka Storage Servers | Storage Backend Leaf Nodes switch model (storage-backend-gpu-leaf & storage-backend-weka-leaf) | Storage Backend Spine Nodes switch model (storage-backend-spine#) |
|---|---|---|---|
| A100 x 8, H100 x 4 | Weka storage server x 8 | QFX5130-32CD x 4 (2 storage-backend-gpu-leaf nodes and 2 storage-backend-weka-leaf nodes) | QFX5130-32CD x 2 |
| GPU Servers <=> Storage Backend GPU Leaf Nodes | Weka Storage Servers <=> Storage Backend Weka Leaf Nodes | Storage Backend Spine Nodes <=> Storage Backend Leaf Nodes |
|---|---|---|
| 1 x 100GE link between each H100 server and the storage-backend-gpu-leaf switch; 1 x 200GE link between each A100 server and the storage-backend-gpu-leaf switch | 1 x 100GE link between each storage server (weka-1 to weka-8) and the storage-backend-weka-leaf switch | 2 x 400GE links between each spine node and each storage-backend-weka-leaf switch; 3 x 400GE links between each spine node and each storage-backend-gpu-leaf switch |
The NVIDIA servers hosting the GPUs have dedicated storage network adapters (NVIDIA ConnectX) that support both the Ethernet and InfiniBand protocols and provide connectivity to external storage arrays.
Communications between GPUs and the storage devices leverage the WEKA distributed POSIX client which enables multiple data paths for transfer of stored data from the WEKA nodes to the GPU client servers. The WEKA client leverages the Data Plane Development Kit (DPDK) to offload TCP packet processing from the Operating System Kernel to achieve higher throughput.
This communication is supported by the Storage Backend fabric described in the previous section and exemplified in Figure 15.
Figure 15: GPU Backend to Storage Backend Communication
WEKA Storage Solution
In small clusters, it may be sufficient to use the local storage on each GPU server, or to aggregate this storage together using open-source or commercial software. In larger clusters with heavier workloads, an external dedicated storage system is required to provide dataset staging for ingest, and for cluster checkpointing during training. This JVD describes the infrastructure for dedicated storage using WEKA storage.
WEKA is a distributed data platform that provides high-performance, concurrent access and allows all GPU servers in the cluster to efficiently utilize a shared storage resource. With extreme I/O capabilities, the WEKA system can service the needs of all servers and scale to support hundreds or even thousands of GPUs.
Toward the end of this document, you can find more details on the WEKA storage system, including configuration settings, driver details, and more.
Scaling
The size of an AI cluster varies significantly depending on the specific requirements of the workload. The number of nodes in an AI cluster is influenced by factors such as the complexity of the machine learning models, the size of the datasets, the desired training speed, and the available budget. The number varies from a small cluster with fewer than 100 nodes to a data center-wide cluster comprising tens of thousands of compute, storage, and networking nodes. A minimum of 4 spines must always be deployed for path diversity and reduction of PFC failure paths.
Fabric Scaling - Devices and Positioning
Fabric Scaling Table

| Small (64 – 2048 GPUs) | Medium (2048 – 8192 GPUs) | Large (8192 – 32768 GPUs) |
|---|---|---|
| With support for up to 2048 GPUs, the Juniper QFX5240-64CD or QFX5230-64CD can be used as spine and leaf devices to support single or dual-stripe applications. To follow best practice recommendations, a minimum of 4 spines should be deployed, even in a single-stripe fabric. | With support for 2048 – 8192 GPUs, the Juniper QFX5240-64CD can be used as spine and leaf devices to achieve appropriate scale. This 3-stage, rail-based fabric design provides physical connectivity to 16 stripes from 64 spines and 1024 leaf nodes, maintaining a 1:1 subscription throughput model. | For infrastructures supporting more than 8192 GPUs, the Juniper PTX1000x chassis spine and QFX5240 leaf nodes can support up to 32768 GPUs. This 3-stage, rail-based fabric design provides physical connectivity to 64 stripes from 64 spines and 4096 leaf nodes, maintaining a 1:1 subscription throughput model. |
Juniper continues its rapid innovation for increased scalability and low job completion times in AI network fabrics with the recently introduced QFX5240 TH5 switch, delivering 64 high-density 800GbE ports in a 2U fixed form factor, with software providing advanced network services tuned to the specific needs of AI workloads. These advanced services include Selective Load Balancing, Global Load Balancing, ISSU Fast Boot, Reactive Path Balancing, and more.
Juniper Hardware and Software Components
For this particular solution design, the Juniper products and software versions are below. The design documented in this JVD is considered the baseline representation for the validated solution. As part of a complete solutions suite, we routinely swap hardware devices with other models during iterative use case testing. Each switch platform validated in this document goes through the same rigorous role-based testing using specified versions of Junos OS and Apstra management software.
Juniper Hardware Components
The following table summarizes the switches tested and validated by role for the AI Data Center Network with Juniper Apstra JVD.
Validated Devices and Positioning
| Solution | Leaf Switches | Spine Switches |
|---|---|---|
| Frontend Fabric | QFX5130-32CD | QFX5130-32CD |
| GPU Backend Fabric | QFX5230-64CD (Cluster 1, Stripe 1); QFX5220-32CD (Cluster 1, Stripe 2); QFX5240-64OD (Cluster 2) | QFX5230-64CD (Cluster 1); QFX5240-64CD (Cluster 2) |
| Storage Backend Fabric | QFX5220-32CD | QFX5220-32CD |
Juniper Software Components
The following table summarizes the software versions tested and validated by role.
Platform Recommended Release
Platform | Role | Version |
---|---|---|
Juniper Apstra | Management Platform | 5.0.0-a-12 |
QFX5130-32CD | Frontend Leaf | 22.2R3-S4 |
QFX5130-32CD | Frontend Spine | 22.2R3-S4 |
QFX5220-32CD | Storage Backend Leaf | 23.4R2-S1.4-EVO |
QFX5220-32CD | Storage Backend Spine | 23.4R2-S1.4-EVO |
QFX5220-32CD | GPU Backend Leaf - Cluster 1 | 23.4X100-D20 * |
QFX5230-64CD | GPU Backend Leaf - Cluster 1 | 23.4X100-D20 * |
QFX5230-64CD | GPU Backend Spine - Cluster 1 | 23.4X100-D20 * |
QFX5240-64CD | GPU Backend Leaf - Cluster 2 | 23.4X100-D20 * |
QFX5240-64CD | GPU Backend Spine - Cluster 2 | 23.4X100-D20 * |
* Note: 23.4X100-D20 is available through your Juniper Account Team or Product Line Managers. Please reach out to your account team for information on how to obtain this Junos-EVO release.
Congestion Management
AI clusters pose unique demands on network infrastructure due to their high-density, low-entropy traffic patterns, characterized by frequent elephant flows with minimal flow variation. Additionally, most AI models require uninterrupted packet flow with no packet loss for training jobs to be completed.
For these reasons, when designing a network infrastructure for AI traffic flows, the key objectives include maximum throughput, minimal latency, and minimal network interference over a lossless fabric, resulting in the need to configure effective congestion control methods.
Data Center Quantized Congestion Notification (DCQCN) has become the industry standard for end-to-end congestion control for RDMA over Converged Ethernet (RoCEv2) traffic. DCQCN congestion control methods offer techniques to strike a balance between reducing traffic rates and stopping traffic altogether to alleviate congestion, without resorting to packet drops.
DCQCN combines two different mechanisms for flow and congestion control:
- Priority-Based Flow Control (PFC), and
- Explicit Congestion Notification (ECN).
Priority-Based Flow Control (PFC) helps relieve congestion by halting traffic flow for individual traffic priorities (IEEE 802.1p or DSCP markings) mapped to specific queues or ports. The goal of PFC is to stop a neighbor from sending traffic for an amount of time (PAUSE time), or until the congestion clears. This process consists of sending PAUSE control frames upstream requesting the sender to halt transmission of all traffic for a specific class or priority while congestion is ongoing. The sender completely stops sending traffic to the requesting device for the specific priority.
While PFC mitigates data loss and allows the receiver to catch up processing packets already in the queue, it impacts performance of applications using the assigned queues during the congestion period. Additionally, resuming traffic transmission post-congestion often triggers a surge, potentially exacerbating or reinstating the congestion scenario.
Explicit Congestion Notification (ECN), on the other hand, curtails transmit rates during congestion while enabling traffic to persist, albeit at reduced rates, until congestion subsides. The goal of ECN is to reduce packet loss and delay by making the traffic source decrease the transmission rate until the congestion clears. This process entails marking packets at congestion points by setting the ECN bits to 11 in the IP header. The presence of this ECN marking prompts receivers to generate Congestion Notification Packets (CNPs) sent back to the source, which signal the source to throttle traffic rates.
Combining PFC and ECN offers the most effective congestion relief in a lossless IP fabric supporting RoCEv2, while safeguarding against packet loss. To achieve this, when implementing PFC and ECN together, their parameters should be carefully selected so that ECN is triggered before PFC.
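A minimal sketch of that ordering rule follows; the threshold names and values are illustrative assumptions, not the validated settings of this JVD (actual PFC/ECN buffer thresholds are platform- and profile-specific):

```python
# Sketch of the "ECN before PFC" rule of thumb: the queue depth at which ECN
# starts marking should sit below the depth at which PFC pauses the upstream
# sender, so rate reduction is attempted before traffic is halted.
def ecn_triggers_before_pfc(ecn_min_threshold_kb, pfc_xoff_threshold_kb):
    """True if ECN marking starts before the PFC pause (XOFF) point."""
    return ecn_min_threshold_kb < pfc_xoff_threshold_kb

# Hypothetical values: start ECN marking at 300 KB of queue, pause peers at 900 KB.
print(ecn_triggers_before_pfc(ecn_min_threshold_kb=300,
                              pfc_xoff_threshold_kb=900))   # True
```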
Load Balancing
The fabric architecture used in this JVD for both the Frontend and Backend fabrics follows a 3-stage Clos design, with every leaf node connected to all the available spine nodes via multiple interfaces. As a result, multiple paths are available between the leaf and spine nodes to reach other devices.
AI traffic characteristics may impede optimal link utilization when implementing traditional Equal-Cost Multipath (ECMP) static load balancing over these paths. This is because the hashing algorithm, which looks at specific fields in the packet headers, can map multiple flows to the same link due to their similarities. Consequently, certain links are favored, and highly utilized links may impede the transmission of smaller, low-bandwidth flows, leading to potential collisions, congestion, and packet drops. To improve the distribution of traffic across all the available paths, Dynamic Load Balancing (DLB) should be implemented on the leaf and spine nodes instead of traditional ECMP.
Dynamic Load Balancing (DLB) helps ensure that all paths are utilized evenly by not only looking at the packet headers to select a path for a given flow, but also by considering real-time link quality based on port load (link utilization) and port queue depth. This method provides better results when multiple long-lived flows moving large amounts of data need to be load balanced.
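As a toy illustration of the difference (not Junos DLB code; the interface names, load figures, and hash are made up), static ECMP pins a flow to whatever link its headers hash to, while a DLB-style choice also weighs current link utilization and queue depth:

```python
# Toy comparison of static ECMP hashing vs a DLB-style selection that considers
# real-time link quality. Interface names and load figures are hypothetical.
import zlib

LINKS = ["et-0/0/0", "et-0/0/1", "et-0/0/2", "et-0/0/3"]   # assumed uplinks

def ecmp_pick(flow_5tuple):
    """Static ECMP: the same header fields always hash to the same uplink."""
    return LINKS[zlib.crc32(repr(flow_5tuple).encode()) % len(LINKS)]

def dlb_pick(utilization, queue_depth):
    """DLB-style choice: prefer the uplink that is least loaded right now."""
    return min(LINKS, key=lambda link: (utilization[link], queue_depth[link]))

flow = ("10.1.1.1", "10.2.2.2", 49152, 4791, "UDP")        # RoCEv2 rides UDP/4791
print("ECMP:", ecmp_pick(flow))
print("DLB :", dlb_pick(
    utilization={"et-0/0/0": 0.9, "et-0/0/1": 0.2, "et-0/0/2": 0.6, "et-0/0/3": 0.4},
    queue_depth={"et-0/0/0": 800, "et-0/0/1": 10, "et-0/0/2": 300, "et-0/0/3": 120},
))
```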
Each language model has a different traffic profile and characteristics; therefore, class of service needs to be tuned to the specific model or models in use. Introduction to Congestion Control in Juniper AI Networks explores how to build a lossless fabric for AI workloads using DCQCN (ECN and PFC) congestion control methods and DLB. The document uses the DLRM training model as a reference and demonstrates how different congestion parameters such as ECN and PFC counters, input drops, and tail drops can be monitored to adjust configuration and build a lossless fabric infrastructure for RoCEv2 traffic. Load Balancing in the Data Center provides a comprehensive deep dive into the various load-balancing mechanisms and their evolution to suit the needs of the data center.