Solution Architecture

The three fabrics described in the previous section (Frontend, GPU Backend, and Storage Backend) are interconnected in the overall AI JVD solution architecture as shown in Figure 2.

Figure 2: AI JVD Solution Architecture

Note: The number and switch type of the leaf and spine nodes, as well as the number and speed of the links between them, are determined by the type of fabric (Frontend, GPU Backend, or Storage Backend), as each presents different requirements. More details are included in the respective fabric description sections.

In the case of the GPU Backend fabric, the number of GPU servers and the number of GPUs per server are also factors in determining the number and switch type of the leaf and spine nodes.

Frontend Fabric

The Frontend Fabric provides the infrastructure for users to interact with the AI systems and orchestrate training and inference workflows using tools such as SLURM. These interactions do not generate heavy data flows, nor do they have stringent latency or packet-loss requirements; thus, they do not impose rigorous demands on the fabric.

The Frontend Fabric design described in this JVD follows a traditional 3-stage IP fabric architecture without HA, as shown in Figure 3. This architecture provides a simple and effective solution for the connectivity required in the Frontend. However, any fabric architecture, including EVPN/VXLAN, could be used. If an HA-capable Frontend Fabric is required, we recommend following the 3-Stage with Juniper Apstra JVD.

Figure 3: Frontend Fabric Architecture

The Frontend devices included in this fabric, and the connections between them, can be summarized as follows:

  • Nvidia DGX GPU Servers: A100 x 8, H100 x 4
  • Weka Storage Servers: Weka Storage Server x 8
  • Headend Servers: Headend-SVR x 3
  • Frontend Leaf Nodes switch model (frontend-gpu-leaf & frontend-weka-leaf): QFX5130-32CD x 2
  • Frontend Spine Nodes switch model (frontend-spine#): QFX5130-32CD x 2
  • GPU Servers <=> Frontend Leaf Nodes: 1 x 100GE link between each GPU server (A100-01 to A100-08 and H100-01 to H100-04) and the frontend-gpu-leaf switch.
  • Weka Storage Servers <=> Frontend Leaf Nodes: 1 x 100GE link between each storage server (weka-1 to weka-8) and the frontend-weka-leaf switch.
  • Headend Servers <=> Frontend Leaf Nodes: 1 x 10GE link between each headend server (Headend-SVR-01 to Headend-SVR-03) and the frontend-weka-leaf switch.
  • Frontend Spine Nodes <=> Frontend Leaf Nodes: 2 x 400GE links between each leaf node and each spine node.

This fabric is a pure L3 IP fabric that uses EBGP for route advertisement. The IP addressing and EBGP configuration details are described in the networking section of this document.

GPU Backend Fabric

The GPU Backend fabric provides the infrastructure for GPUs to communicate with each other within a cluster using RDMA over Converged Ethernet version 2 (RoCEv2). RoCEv2 boosts data center efficiency, reduces overall complexity, and increases data delivery performance by enabling the GPUs to communicate as they would with the InfiniBand protocol.

Packet loss can significantly impact job completion times and should therefore be avoided. When designing the compute network infrastructure to support RoCEv2 for an AI cluster, one of the key objectives is to provide a lossless fabric while also achieving maximum throughput, minimal latency, and minimal network interference for the AI traffic flows. RoCEv2 is more efficient over lossless networks, resulting in optimal job completion times.

The GPU Backend fabric in this JVD was designed with these goals in mind and follows a 3-stage IP clos architecture combined with NVIDIA’s Backend GPU Rail Optimized Stripe Architecture (discussed in the next section), as shown in Figure 4.

Figure 4: GPU Backend Fabric Architecture

We have built two different clusters in the AI lab, with different combinations of QFX switch models as leaf and spine nodes and two different NVIDIA server models, as shown in Figure 5.

Figure 5: AI JVD Lab Clusters

The two clusters share the same Frontend fabric and Storage Backend fabric, but each has its own GPU Backend fabric. Each cluster comprises two stripes following the Backend GPU Rail Optimized Stripe Architecture. Within each cluster, the two stripes are interconnected through the spine nodes, and each stripe includes a different set of GPU servers connected to its leaf nodes.

The backend devices included in the fabric on each cluster/stripe, and the connections between them, can be summarized as follows:

Backend devices per cluster and stripe

Cluster Stripe Nvidia DGX GPU Servers GPU Backend Leaf Nodes switch model (gpu-backend-leaf#) GPU Backend Spine Nodes switch model (gpu-backend-spine#)
1 1 A100-01 to A100-04 QFX5230-64CD x 8 QFX5230-64CD x 2
1 2 A100-05 to A100-08 QFX5220-32CD x 8
2 1 H100-01 to H100-02 QFX5240-64OD x 8 QFX5240-64CD x 2
2 2 H100-03 to H100-04 QFX5240-64OD x 8

Connections between servers, leaf, and spine nodes per cluster and stripe

  • Cluster 1, Stripe 1:
    GPU Servers <=> GPU Backend Leaf Nodes: 1 x 200GE link between each A100 server and each leaf node (200GE x 8 links per server)
    GPU Backend Spine Nodes <=> GPU Backend Leaf Nodes: 2 x 400GE links between each leaf node and each spine node (2 x 400GE x 2 links per leaf node)
  • Cluster 1, Stripe 2:
    GPU Servers <=> GPU Backend Leaf Nodes: 1 x 200GE link between each A100 server and each leaf node (200GE x 8 links per server)
    GPU Backend Spine Nodes <=> GPU Backend Leaf Nodes: 2 x 400GE links between each leaf node and each spine node (2 x 400GE x 2 links per leaf node)
  • Cluster 2, Stripe 1:
    GPU Servers <=> GPU Backend Leaf Nodes: 1 x 400GE link between each H100 server and each leaf node (400GE x 8 links per server)
    GPU Backend Spine Nodes <=> GPU Backend Leaf Nodes: 2 x 400GE links between each leaf node and each spine node (2 x 400GE x 4 links per leaf node)
  • Cluster 2, Stripe 2:
    GPU Servers <=> GPU Backend Leaf Nodes: 1 x 400GE link between each H100 server and each leaf node (400GE x 8 links per server)
    GPU Backend Spine Nodes <=> GPU Backend Leaf Nodes: 2 x 400GE links between each leaf node and each spine node (2 x 400GE x 4 links per leaf node)

  • NVIDIA A100 servers in the lab are connected to the fabric using 200GE interfaces, while the H100 servers use 400GE interfaces.
  • This fabric is a pure L3 IP fabric that uses EBGP for route advertisement (described in the networking section).
  • Connectivity between the servers and the leaf nodes is L2 VLAN-based, with an IRB interface on the leaf nodes acting as the default gateway for the servers (described in the networking section).

The speed and number of links between the GPU servers and the leaf nodes, and between the leaf and spine nodes, determine the oversubscription factor. As an example, consider the number of GPU servers available in the lab and how they are connected to the GPU Backend fabric as described above.

Server to Leaf Bandwidth per Stripe (per Cluster)

Cluster AI Systems (server type) Servers per Stripe Server <=> Leaf Links per Server Bandwidth of Server <=> Leaf Links [Gbps] Total Bandwidth Servers <=> Leaf per Stripe [Tbps]
1 A100 4 8 200 4 x 8 x 200/1000 = 6.4
2 H100 2 8 400 2 x 8 x 400/1000 = 6.4

Leaf to Spine Bandwidth per Stripe

Leaf <=> Spine Links per Spine Node & per Stripe Speed of Leaf <=> Spine Links [Gbps] Number of Spine Nodes Total Bandwidth Leaf <=> Spine per Stripe [Tbps]
8 2 x 400 2 12.8

The (over)subscription rate is simply calculated by comparing the numbers from the two tables above:

In cluster 1, the bandwidth between the servers and the leaf nodes is 6.4 Tbps per stripe, while the bandwidth available between the leaf and spine nodes is 12.8 Tbps per stripe. This means that the fabric has enough capacity to carry all traffic between the GPUs even if this traffic were 100% inter-stripe, while still having spare capacity to accommodate additional servers without becoming oversubscribed.

Figure 6: Extra Capacity Example

We also tested connecting the H100 GPU servers alongside the A100 servers to the stripes in Cluster 1, as follows:

Figure 7: 1:1 Subscription Example

Server to Leaf Bandwidth per Stripe with All Servers Connected to the Same Cluster

Cluster AI Systems Servers per Stripe Server <=> Leaf Links per Server Server <=> Leaf Links Bandwidth [Gbps] Total Servers <=> Leaf Links Bandwidth per Stripe [Tbps]
1 A100 4 8 200 4 x 8 x 200/1000 = 6.4
  H100 2 8 400 2 x 8 x 400/1000 = 6.4
Total Bandwidth of Server <=> Leaf Links: 12.8

The bandwidth between the servers and the leaf nodes is now 12.8 Tbps per stripe, while the bandwidth available between the leaf and spine nodes is also 12.8 Tbps per stripe (as shown in table 2 above). This means that the fabric has enough capacity to carry all traffic between the GPUs even if this traffic were 100% inter-stripe, but there is no longer extra capacity to accommodate additional servers. The subscription factor in this case is 1:1 (no oversubscription).

To run oversubscription testing, we disabled some of the interfaces between the leaf and spine nodes to reduce the available bandwidth, as shown in the example in Figure 8:

Figure 8: 2:1 Oversubscription Example

The total Servers to Leaf Links bandwidth per stripe has not changed. It is still 12.8 Tbps as shown in table 3 in the previous scenario.

However, the bandwidth available between the leaf and spine nodes is now only 6.4 Tbps per stripe.

Leaf to Spine Bandwidth per Stripe

Leaf <=> Spine Links per Spine Node & per Stripe Speed of Leaf <=> Spine Links [Gbps] Number of Spine Nodes Total Bandwidth Leaf <=> Spine per Stripe [Tbps]
8 1 x 400 2 6.4

This means that the fabric no longer has enough capacity to carry all traffic between the GPUs if this traffic were 100% inter-stripe, potentially causing congestion and traffic loss. The oversubscription factor in this case is 2:1.
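The calculation above can be generalized. The short Python sketch below is illustrative only; the server counts, link speeds, and link counts are the lab values quoted in the tables above. It reproduces the three scenarios: spare capacity, 1:1 subscription, and 2:1 oversubscription.

# Illustrative oversubscription calculator for a rail-optimized stripe.
# Values below mirror the lab examples in this section; adjust as needed.

def stripe_bandwidth_tbps(servers, links_per_server, link_speed_gbps):
    """Total server<->leaf bandwidth per stripe, in Tbps."""
    return servers * links_per_server * link_speed_gbps / 1000

def fabric_bandwidth_tbps(leaf_nodes, links_per_spine, link_speed_gbps, spine_nodes):
    """Total leaf<->spine bandwidth per stripe, in Tbps."""
    return leaf_nodes * links_per_spine * link_speed_gbps * spine_nodes / 1000

def subscription_ratio(server_bw, fabric_bw):
    """Returns the (over)subscription factor, e.g. 0.5, 1.0, or 2.0."""
    return server_bw / fabric_bw

# Cluster 1, A100 only: 4 servers x 8 x 200GE vs 8 leaves x 2 x 400GE x 2 spines
a100_bw = stripe_bandwidth_tbps(4, 8, 200)          # 6.4 Tbps
fabric_bw = fabric_bandwidth_tbps(8, 2, 400, 2)     # 12.8 Tbps
print(subscription_ratio(a100_bw, fabric_bw))       # 0.5 -> spare capacity

# A100 + H100 on the same stripes: 6.4 + 6.4 = 12.8 Tbps on the server side
mixed_bw = a100_bw + stripe_bandwidth_tbps(2, 8, 400)
print(subscription_ratio(mixed_bw, fabric_bw))      # 1.0 -> 1:1 subscription

# Half of the leaf-spine links disabled: 1 x 400GE per leaf toward each spine
reduced_bw = fabric_bandwidth_tbps(8, 1, 400, 2)    # 6.4 Tbps
print(subscription_ratio(mixed_bw, reduced_bw))     # 2.0 -> 2:1 oversubscription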

Backend GPU Rail Optimized Stripe Architecture

A Rail Optimized Stripe Architecture provides efficient data transfer between GPUs, especially during computationally intensive tasks such as AI Large Language Model (LLM) training workloads, where seamless data transfer is necessary to complete the tasks within a reasonable timeframe. A Rail Optimized topology aims to maximize performance by providing minimal bandwidth contention, minimal latency, and minimal network interference, ensuring that data can be transmitted efficiently and reliably across the network.

In a Rail Optimized Stripe Architecture, a stripe refers to a design module, or building block, that can be replicated to scale up the AI cluster as shown in Figure 9.

Figure 9: Rail Optimized Stripe

The number of leaf switches in a single stripe is always 8, which is determined by the number of GPUs per server (each NVIDIA DGX H100 GPU server includes 8 NVIDIA H100 Tensor Core GPUs).

The maximum number of servers supported in a single stripe (N1) is determined by the leaf node switch model. This is because, to provide 1:1 subscription, the number of interfaces connecting the GPU servers to the leaf nodes should equal the number of interfaces between the leaf and spine nodes.

Maximum number of GPUs supported per stripe

Leaf Node QFX Model Maximum number of 400 GE interfaces per switch Maximum number of supported servers per stripe (N1) Maximum number of GPUs supported per stripe
QFX5220-32CD 32 16 16 x 8 = 128
QFX5230-64CD 64 32 32 x 8 = 256
QFX5240-64OD 64 32 32 x 8 = 256
  • QFX5220-32CD switches provide 32 x 400GE ports (16 can be used to connect to the servers and 16 to connect to the spine nodes)
  • QFX5230-64CD and QFX5240-64OD switches provide 64 x 400GE ports (32 can be used to connect to the servers and 32 to connect to the spine nodes)

To achieve larger scales, multiple stripes can be connected across Spine switches as shown in Figure 10.

Figure 10: Spines-connected Stripes

For example, assume that the desired number of GPUs is 16,000 and the fabric is using either QFX5230-64CD or QFX5240-64OD:

  • the number of servers per stripe (N1) = 32 => the maximum number of GPUs supported per stripe = 256

N2 = 16,000 / 256 = 62.5 ≈ 63 stripes

  • with N2 = 64 stripes and N1 = 32 servers, the cluster can provide 16,384 GPUs.
  • with N2 = 72 stripes and N1 = 32 servers, the cluster can provide 18,432 GPUs.
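The same sizing arithmetic can be expressed as a short sketch. The Python fragment below is illustrative only; the per-switch port count comes from the table above, and the target GPU count is the 16,000-GPU example.

import math

# Illustrative stripe-sizing sketch based on the sizing rules in this section.
GPUS_PER_SERVER = 8          # leaf switches per stripe equals GPUs per server

def max_servers_per_stripe(ports_400ge_per_leaf):
    # Half the ports face the servers, half face the spines (1:1 subscription).
    return ports_400ge_per_leaf // 2

def stripes_needed(target_gpus, ports_400ge_per_leaf):
    n1 = max_servers_per_stripe(ports_400ge_per_leaf)
    gpus_per_stripe = n1 * GPUS_PER_SERVER
    n2 = math.ceil(target_gpus / gpus_per_stripe)
    return n1, gpus_per_stripe, n2

# QFX5230-64CD / QFX5240-64OD class leaf: 64 x 400GE ports
n1, per_stripe, n2 = stripes_needed(16_000, 64)
print(n1, per_stripe, n2)                # 32 servers, 256 GPUs per stripe, 63 stripes
print(64 * per_stripe, 72 * per_stripe)  # 16,384 and 18,432 GPUs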

The stripes in the AI JVD setup consist of 8 Juniper QFX5220-32CD, QFX5230-64CD, or QFX5240-64OD switches, depending on the cluster and stripe. The number of GPUs supported on each cluster/stripe is shown in the table below.

Maximum number of GPUs supported per cluster

Cluster Stripe Leaf Node QFX model Maximum number of GPUs supported per stripe
1 1 QFX5230-64CD 32 x 8 = 256
1 2 QFX5220-32CD 16 x 8 = 128
Total number of GPUs supported by the cluster = 384
2 1 QFX5240-64OD 32 x 8 = 256
2 2 QFX5240-64OD 32 x 8 = 256
Total number of GPUs supported by the cluster = 512

What is Rail Optimized?

The GPUs on each server are numbered 1-8, where the number represents the GPU’s position in the server, as shown in Figure 11.

Figure 11: Rail Optimized Connections Between GPUs and Leaf Nodes

Communication between GPUs in the same server happens internally via high-throughput NVLink channels attached to internal NVSwitches, while communication between GPUs in different servers happens across the QFX fabric, which provides 400 Gbps of GPU-to-GPU bandwidth. Communication across the fabric occurs between GPUs on the same rail, which is the basis of the rail-optimized architecture: a rail connects the GPUs in the same position across servers through one of the leaf nodes; that is, rail N connects the GPUs in position N of all the servers through leaf switch N.

Figure 12 represents a topology with one stripe and 8 rails connecting GPUs 1-8 across leaf switches 1-8 respectively.

The example shows that communication between GPU 7 and GPU 8 in Server 1 happens internally across NVIDIA's NVLink/NVSwitch (not shown), while communication between GPU 1 in Server 1 and GPU 1 in Server N1 happens across Leaf switch 1 (within the same rail).

Notice that if communication between GPUs on different rails and in different servers is required (for example, GPU 4 in Server 1 communicating with GPU 5 in Server N1), data is first moved internally to a GPU interface on the same rail as the destination GPU, and is then sent to the destination GPU without crossing rails.

Following this design, data between GPUs on different servers (but in the same stripe) always moves along the same rail and across a single switch, which guarantees that GPUs are one hop away from each other and creates separate, independent high-bandwidth channels that minimize contention and maximize performance.

Notice that this example presumes NVIDIA's PXN feature is enabled. PXN can be easily enabled or disabled before a training or inference job is initiated.

Figure 12: GPU to GPU Communication Between Two Servers with PXN Enabled

For reference, Figure 13 shows an example with PXN disabled.

Figure 13: GPU to GPU Communication Between Two Servers Without PXN Enabled

The example shows that communication between GPU 4 in Server 1 and GPU 5 in Server N1 goes across Leaf switch 4, the spine nodes, and Leaf switch 5 (between two different rails).
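The rail selection behavior described above can be sketched in a few lines of Python. This is a simplified illustration only, not NCCL or PXN code; the helper name and the stripe layout (8 rails, with leaf N serving rail N) are assumptions taken from this section.

# Simplified model of rail-optimized forwarding within one stripe.
# Rail N connects the GPUs in position N of every server through leaf switch N.

RAILS = 8  # one rail (and one leaf switch) per GPU position

def path(src_server, src_gpu, dst_server, dst_gpu, pxn_enabled=True):
    """Return a human-readable path for GPU-to-GPU traffic in one stripe."""
    if src_server == dst_server:
        return "internal NVLink/NVSwitch"                # never leaves the server
    if src_gpu == dst_gpu:
        return f"leaf {src_gpu} (same rail)"             # one hop, same rail
    if pxn_enabled:
        # Data is first moved over NVLink to the local GPU/NIC on the
        # destination rail, then crosses a single leaf switch.
        return f"NVLink to rail {dst_gpu}, then leaf {dst_gpu} (same rail)"
    # Without PXN the flow crosses rails via the spine layer.
    return f"leaf {src_gpu} -> spine -> leaf {dst_gpu} (inter-rail)"

print(path(1, 7, 1, 8))                     # internal NVLink/NVSwitch
print(path(1, 1, 4, 1))                     # leaf 1 (same rail)
print(path(1, 4, 4, 5, pxn_enabled=True))   # NVLink to rail 5, then leaf 5
print(path(1, 4, 4, 5, pxn_enabled=False))  # leaf 4 -> spine -> leaf 5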

Storage Backend Fabric

The Storage Backend fabric provides the connectivity infrastructure for storage devices to be accessible from the GPU servers.

The performance of the storage infrastructure significantly impacts the efficiency of AI workflows. A storage system that provides quick access to data can significantly reduce the amount of time for training AI models. Similarly, a storage system that supports efficient data querying and indexing can minimize the completion time of preprocessing and feature extraction in an AI workflow.

The Storage Backend fabric design in this JVD also follows a 3-stage IP Clos architecture, as shown in Figure 14. There is no concept of rail optimization in a storage cluster: each GPU server has a single connection to the leaf nodes, instead of 8.

Figure 14: Storage Backend Fabric Architecture

The Storage Backend devices included in this fabric, and the connections between them, can be summarized as follows:

  • Nvidia DGX GPU Servers: A100 x 8, H100 x 4
  • Weka Storage Servers: Weka storage server x 8
  • Storage Backend Leaf Nodes switch model (storage-backend-gpu-leaf & storage-backend-weka-leaf): QFX5130-32CD x 4 (2 storage-backend-gpu-leaf nodes and 2 storage-backend-weka-leaf nodes)
  • Storage Backend Spine Nodes switch model (storage-backend-spine#): QFX5130-32CD x 2

  • GPU Servers <=> Storage Backend GPU Leaf Nodes: 1 x 100GE link between each H100 server and the storage-backend-gpu-leaf switch; 1 x 200GE link between each A100 server and the storage-backend-gpu-leaf switch.
  • Weka Storage Servers <=> Storage Backend Weka Leaf Nodes: 1 x 100GE link between each storage server (weka-1 to weka-8) and the storage-backend-weka-leaf switch.
  • Storage Backend Spine Nodes <=> Storage Backend Leaf Nodes: 2 x 400GE links between each spine node and each storage-backend-weka-leaf node; 3 x 400GE links between each spine node and each storage-backend-gpu-leaf node.

The NVIDIA servers hosting the GPUs have dedicated storage network adapters (NVIDIA ConnectX) that support both the Ethernet and InfiniBand protocols and provide connectivity to external storage arrays.

Communication between the GPU servers and the storage devices leverages the WEKA distributed POSIX client, which enables multiple data paths for transferring stored data from the WEKA nodes to the GPU client servers. The WEKA client leverages the Data Plane Development Kit (DPDK) to offload TCP packet processing from the operating system kernel and achieve higher throughput.

This communication is supported by the Storage Backend fabric described in the previous section and exemplified in Figure 15.

Figure 15: GPU Backend to Storage Backend Communication

WEKA Storage Solution

In small clusters, it may be sufficient to use the local storage on each GPU server, or to aggregate this storage together using open-source or commercial software. In larger clusters with heavier workloads, an external dedicated storage system is required to provide dataset staging for ingest, and for cluster checkpointing during training. This JVD describes the infrastructure for dedicated storage using WEKA storage.

WEKA is a distributed data platform that allows high performance and concurrent access and allows all GPU Servers in the cluster to efficiently utilize a shared storage resource. With extreme I/O capabilities, the WEKA system can service the needs of all servers and scale to support hundreds or even thousands of GPUs.

Toward the end of this document, you can find more details on the WEKA storage system, including configuration settings, driver details, and more.

Scaling

The size of an AI cluster varies significantly depending on the specific requirements of the workload. The number of nodes in an AI cluster is influenced by factors such as the complexity of the machine learning models, the size of the datasets, the desired training speed, and the available budget. The number varies from a small cluster with fewer than 100 nodes to a data center-wide cluster comprising tens of thousands of compute, storage, and networking nodes. A minimum of 4 spines must always be deployed for path diversity and to reduce PFC failure paths.

Fabric Scaling - Devices and Positioning

Fabric Scaling Table

Small (64 – 2048 GPUs): The Juniper QFX5240-64CD or QFX5230-64CD can be used as spine and leaf devices to support single- or dual-stripe applications. To follow best practice recommendations, a minimum of 4 spines should be deployed, even in a single-stripe fabric.

Medium (2048 – 8192 GPUs): The Juniper QFX5240-64CD can be used as spine and leaf devices to achieve the appropriate scale. This 3-stage, rail-based fabric design provides physical connectivity to 16 stripes from 64 spines and 1024 leaf nodes, maintaining a 1:1 subscription throughput model.

Large (8192 – 32768 GPUs): For infrastructures supporting more than 8192 GPUs, the Juniper PTX1000x chassis spine and QFX5240 leaf nodes can support up to 32768 GPUs. This 3-stage, rail-based fabric design provides physical connectivity to 64 stripes from 64 spines and 4096 leaf nodes, maintaining a 1:1 subscription throughput model.

Juniper continues its rapid innovation for increased scalability and low job completion times in AI network fabrics with the recently introduced QFX5240 TH5 switch, delivering 64 high-density 800GbE ports in a 2U fixed form factor, with software that provides advanced network services tuned to the specific needs of AI workloads. These advanced services include Selective Load Balancing, Global Load Balancing, ISSU Fast Boot, Reactive Path Balancing, and more.

Juniper Hardware and Software Components

For this particular solution design, the Juniper products and software versions are listed below. The design documented in this JVD is considered the baseline representation for the validated solution. As part of a complete solutions suite, we routinely swap hardware devices with other models during iterative use case testing. Each switch platform validated in this document goes through the same rigorous role-based testing using specified versions of Junos OS and Apstra management software.

Juniper Hardware Components

The following table summarizes the switches tested and validated by role for the AI Data Center Network with Juniper Apstra JVD.

Validated Devices and Positioning

Solution Leaf Switches Spine Switches
Frontend Fabric QFX5130-32CD QFX5130-32CD
GPU Backend Fabric QFX5230-64CD (Cluster 1, Stripe 1), QFX5220-32CD (Cluster 1, Stripe 2), QFX5240-64OD (Cluster 2) QFX5230-64CD (Cluster 1), QFX5240-64CD (Cluster 2)
Storage Backend Fabric QFX5220-32CD QFX5220-32CD

Juniper Software Components

The following table summarizes the software versions tested and validated by role.

Platform Recommended Release

Platform Role Version
Juniper Apstra Management Platform 5.0.0-a-12
QFX5130-32CD Frontend Leaf 22.2R3-S4
QFX5130-32CD Frontend Spine 22.2R3-S4
QFX5220-32CD Storage Backend Leaf 23.4R2-S1.4-EVO
QFX5220-32CD Storage Backend Spine 23.4R2-S1.4-EVO
QFX5220-32CD GPU Backend Leaf - Cluster 1 23.4X100-D20 *
QFX5230-64CD GPU Backend Leaf - Cluster 1 23.4X100-D20 *
QFX5230-64CD GPU Backend Spine - Cluster 1 23.4X100-D20 *
QFX5240-64CD GPU Backend Leaf - Cluster 2 23.4X100-D20 *
QFX5240-64CD GPU Backend Spine - Cluster 2 23.4X100-D20 *

* Note: 23.4X100-D20 is available through your Juniper Account Team or Product Line Managers. Please reach out to your account team for information on how to obtain this Junos-EVO release.

Congestion Management

AI clusters pose unique demands on network infrastructure due to their high-density, low-entropy traffic patterns, characterized by frequent elephant flows with minimal flow variation. Additionally, most AI models require uninterrupted packet flow with no packet loss for training jobs to complete.

For these reasons, when designing a network infrastructure for AI traffic flows, the key objectives include maximum throughput, minimal latency, and minimal network interference over a lossless fabric, resulting in the need to configure effective congestion control methods.

Data Center Quantized Congestion Notification (DCQCN) has become the industry standard for end-to-end congestion control for RDMA over Converged Ethernet (RoCEv2) traffic. DCQCN congestion control offers techniques to strike a balance between reducing traffic rates and stopping traffic altogether to alleviate congestion, without resorting to packet drops.

DCQCN combines two different mechanisms for flow and congestion control:

  • Priority-Based Flow Control (PFC), and
  • Explicit Congestion Notification (ECN).

Priority-Based Flow Control (PFC) helps relieve congestion by halting traffic flow for individual traffic priorities (IEEE 802.1p or DSCP markings) mapped to specific queues or ports. The goal of PFC is to stop a neighbor from sending traffic for an amount of time (PAUSE time), or until the congestion clears. This process consists of sending PAUSE control frames upstream requesting the sender to halt transmission of all traffic for a specific class or priority while congestion is ongoing. The sender completely stops sending traffic to the requesting device for the specific priority.

While PFC mitigates data loss and allows the receiver to catch up processing packets already in the queue, it impacts performance of applications using the assigned queues during the congestion period. Additionally, resuming traffic transmission post-congestion often triggers a surge, potentially exacerbating or reinstating the congestion scenario.

Explicit Congestion Notification (ECN), on the other hand, curtails transmit rates during congestion while enabling traffic to persist, albeit at reduced rates, until congestion subsides. The goal of ECN is to reduce packet loss and delay by making the traffic source decrease the transmission rate until the congestion clears. This process entails marking packets with ECN bits at congestion points by setting the ECN bits to 11 in the IP header. The presence of this ECN marking prompts receivers to generate Congestion Notification Packets (CNPs) sent back to the source, which signal the source to throttle traffic rates.

Combining PFC and ECN offers the most effective congestion relief in a lossless IP fabric supporting RoCEv2, while safeguarding against packet loss. To achieve this, when implementing PFC and ECN together, their parameters should be carefully selected so that ECN is triggered before PFC.
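As a simple illustration of that ordering requirement, the sketch below checks that the ECN marking thresholds on a queue sit below the PFC XOFF threshold, so that ECN marking starts before PFC pause frames are generated. The threshold names and byte values are hypothetical placeholders, not recommended settings or Junos configuration.

# Hypothetical queue thresholds (in KB of buffer occupancy) used only to
# illustrate the "ECN before PFC" ordering discussed above.

ECN_MIN_THRESHOLD_KB = 300    # start probabilistically marking ECN here
ECN_MAX_THRESHOLD_KB = 600    # mark all packets above this fill level
PFC_XOFF_THRESHOLD_KB = 900   # send PFC pause frames above this fill level

def ecn_triggers_before_pfc(ecn_min, ecn_max, pfc_xoff):
    """Return True if ECN marking begins (and saturates) below the PFC
    pause point, so rate reduction is attempted before traffic is halted."""
    return ecn_min < ecn_max < pfc_xoff

assert ecn_triggers_before_pfc(
    ECN_MIN_THRESHOLD_KB, ECN_MAX_THRESHOLD_KB, PFC_XOFF_THRESHOLD_KB
), "ECN thresholds should be reached before the PFC XOFF threshold"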

Load Balancing

The fabric architectures used in this JVD for both the Frontend and the Backend follow the 3-stage Clos design described earlier, with every leaf node connected to all the available spine nodes via multiple interfaces. As a result, multiple paths are available between the leaf and spine nodes to reach other devices.

AI traffic characteristics may impede optimal link utilization when implementing traditional Equal Cost Multipath (ECMP) static load balancing over these paths. This is because the hashing algorithm, which looks at specific fields in the packet headers, can map multiple flows to the same link due to their similarities. Consequently, certain links will be favored, and highly utilized links may impede the transmission of smaller, low-bandwidth flows, leading to potential collisions, congestion, and packet drops. To improve the distribution of traffic across all the available paths, Dynamic Load Balancing (DLB) should be implemented on the leaf and spine nodes instead of traditional ECMP.

Dynamic Load Balancing (DLB) ensures that all paths are utilized more evenly by not only looking at the packet headers to select a path for a given flow, but also considering real-time link quality based on port load (link utilization) and port queue depth. This method provides better results when multiple long-lived flows moving large amounts of data need to be load balanced.
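To make the distinction concrete, the sketch below contrasts a static hash-based pick with a quality-aware pick over the same set of uplinks. It is a toy model of the behavior described above, not an implementation of the QFX DLB feature; the link-quality formula, weights, and interface names are assumptions.

import hashlib

# Toy model contrasting static ECMP hashing with a dynamic, quality-aware choice.

LINKS = ["et-0/0/0", "et-0/0/1", "et-0/0/2", "et-0/0/3"]

def ecmp_pick(five_tuple, links):
    """Static ECMP: hash the 5-tuple; similar flows can land on the same link."""
    digest = hashlib.sha256("|".join(map(str, five_tuple)).encode()).hexdigest()
    return links[int(digest, 16) % len(links)]

def dlb_pick(links, utilization, queue_depth):
    """DLB-style choice: prefer the member with the best real-time quality,
    combining link utilization and queue depth (weights are arbitrary here)."""
    def quality(link):
        return 0.5 * utilization[link] + 0.5 * queue_depth[link]
    return min(links, key=quality)

flow = ("10.0.1.1", "10.0.2.1", 4791, 4791, "UDP")   # RoCEv2 uses UDP port 4791
print(ecmp_pick(flow, LINKS))

utilization = {"et-0/0/0": 0.9, "et-0/0/1": 0.2, "et-0/0/2": 0.7, "et-0/0/3": 0.4}
queue_depth = {"et-0/0/0": 0.8, "et-0/0/1": 0.1, "et-0/0/2": 0.6, "et-0/0/3": 0.3}
print(dlb_pick(LINKS, utilization, queue_depth))     # picks the least-loaded link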

Note:

Each language model has a different traffic profile and characteristics; therefore, class of service will need to be tuned to the specific model or models in use.

Introduction to Congestion Control in Juniper AI Networks explores how to build a lossless fabric for AI workloads using DCQCN (ECN and PFC) congestion control methods and DLB. The document uses the DLRM training model as a reference and demonstrates how different congestion parameters, such as ECN and PFC counters, input drops, and tail drops, can be monitored to adjust the configuration and build a lossless fabric infrastructure for RoCEv2 traffic.

Load Balancing in the Data Center provides a comprehensive deep dive into the various load-balancing mechanisms and their evolution to suit the needs of the data center.