Optimizing AI/ML Infrastructure with Adaptive Load Balancing

AI & ML | 400G & 800G

In this video, Alok Pandey, Product Line Manager at Juniper Networks, introduces a new technology, "Adaptive Load Balancing" (ALB) for ECMP Next Hops. ALB for ECMP is designed to optimize the distribution of traffic flows in AI/ML infrastructures built on Juniper PTX routers and Express ASICs. ALB dynamically adjusts how data flows are routed, preventing link overload and maximizing efficiency. The Express ASICs provide built-in intelligence for real-time traffic analysis and adjustment, ensuring smooth data flow even under demanding workloads.

How ALB works:

  • Traffic analysis: Identifies flows and measures bandwidth usage.
  • Dynamic adjustment: Distributes traffic to prevent overload.

 

ALB acts as a smart traffic management system, preventing bottlenecks and ensuring smooth data flow for AI/ML workloads.
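The sketch below is a minimal, purely illustrative Python view of those two steps: measure per-link load, then steer a flow away from an overloaded ECMP member. The member names, rates, and the 80 percent threshold are assumptions made for illustration only; the actual logic runs in the PTX forwarding hardware, not in software like this.

```python
# Illustrative sketch only, not Juniper code. Member names, rates, and the
# 80% utilization threshold are invented to show the two ALB steps.

LINK_CAPACITY_GBPS = 800

def rebalance(flows, link_load):
    """flows: {flow_id: [member, rate_gbps]}, link_load: {member: rate_gbps}."""
    for flow_id, (member, rate) in flows.items():
        # Step 1 -- traffic analysis: how busy is the link this flow uses?
        if link_load[member] / LINK_CAPACITY_GBPS <= 0.8:
            continue
        # Step 2 -- dynamic adjustment: move the flow to the least-loaded member.
        target = min(link_load, key=link_load.get)
        if target != member:
            link_load[member] -= rate
            link_load[target] += rate
            flows[flow_id] = [target, rate]
    return flows

links = {"member-0": 760.0, "member-1": 120.0, "member-2": 150.0, "member-3": 90.0}
flows = {"flow-a": ["member-0", 400.0], "flow-b": ["member-0", 360.0]}
print(rebalance(flows, links))
```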

This feature is introduced in Junos 24.4R1, and a techpost article detailing the technology will be published later.


You’ll learn

  • The most common high-level design for a network optimized for machine learning

  • The key benefits and aspects of AI/ML infrastructure design

  • The role of Express ASICs (Application-Specific Integrated Circuits) in achieving adaptive load balancing

Who is this for?

Network Professionals

Transcript

0:00 Hello friends! Is your AI/ML infrastructure struggling to keep up? Discover the ML

0:06 infrastructure and the power of adaptive load balancing in this short video. Let's understand

0:16 the most common high-level design for a network optimized for machine learning. We have at the

0:21 very outset a front-end fabric. This is where users and applications interact with the ML

0:27 system. It could include web servers, APIs or any other interface. This is the typical spine/leaf

0:34 architecture with border leaf switches. They connect the front-end fabric to the rest of the

0:38 network, handling the routing and filtering of traffic. Spine switches act as the backbone of

0:45 the network and leaf switches connect to the actual computing resources used for ML tasks,

0:50 such as GPUs or specialized processors. Now the second part is the back-end fabric. This is where

0:56 the heavy lifting of ML happens. This is where you have your servers with GPUs. These are powerful

1:02 servers equipped with graphics processing units specifically designed for the parallel

1:08 processing needs of ML training and inference. And also we have infrastructure dedicated to

1:14 training ML models. This could include specialized hardware and software for distributed training,

1:20 data preprocessing or model optimization. And then we have the storage and inference network.

1:27 We all know data is king. Storage is a critical component of any ML infrastructure. The network

1:33 design should prioritize fast and reliable access to data. Inference at scale: the network should

1:40 be able to handle the demands of inference workloads which can vary significantly in

1:45 terms of volume and latency requirements. We must have flexibility: the architecture should

1:52 be flexible enough to support different types of storage and inference solutions and this

1:57 can allow you to choose the best option that suits your needs. So what are the key benefits

2:03 and aspects of AI/ML infra? The first and foremost is uniform link utilization in a spine/leaf network.

2:10 It is very important to ensure the traffic is distributed evenly across the links between

2:15 spine and leaf switches. This is where features like ECMP help. We must have lossless behavior:

2:22 the network should be designed to prevent packet loss even under high traffic loads. Features like

2:28 ECN and PFC come to the rescue. Here factors like the number of switches, the distance between them,

2:36 and the required data rates will influence the choice of topology and optics. This also matters

2:42 for AI/ML infra. The network should be designed to support multi-tenancy,

2:48 which provides isolation between different tenants to ensure security and performance. Last but not

2:55 least, we must have provisioning and automation. Automated provisioning of network resources

3:01 can simplify management and reduce the risk of errors. Now, ML training requires iteration-level

3:08 atomicity. For ML training across GPUs, a job is divided into multiple iterations and each

3:15 iteration involves several steps. The first step is initialization. This is where GPUs prepare data

3:22 and resources for computation. Now, computation: this is where GPUs perform computations on their

3:29 assigned portion of data. And then, we have data transfer. Intermediate results or gradients are

3:34 then transferred between GPUs, which is shown by the green and orange lines. This might involve direct

3:40 GPU-to-GPU communication or a transfer to the host CPU. We also have aggregation within the GPU. Each

3:46 GPU processes the received data, potentially performing operations like reduce-scatter, which

3:51 is reducing data and scattering it across multiple GPUs, or all-gather, which is gathering data from

3:57 all GPUs to each GPU, as shown here. But the most important aspect is the sequence, which

4:03 is initialization / computation / data transfer / aggregation, represents a single iteration. And

4:10 this is repeated multiple times until the entire training job is complete.
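As a rough mental model of that sequence, the toy Python sketch below uses plain lists in place of GPUs and simple arithmetic in place of the real collectives (reduce-scatter and all-gather). It is not real training code, and every name in it is invented; it only illustrates that an iteration completes as one unit and is repeated if it fails.

```python
# Toy illustration of iteration-level atomicity; lists stand in for GPUs and
# simple arithmetic stands in for the real collective operations.

def reduce_scatter(per_gpu_grads):
    # Each "GPU" i ends up owning the element-wise sum of shard i.
    n = len(per_gpu_grads)
    return [sum(grads[i] for grads in per_gpu_grads) for i in range(n)]

def all_gather(shards):
    # Every "GPU" receives all reduced shards, i.e. the full gradient.
    return list(shards)

def run_iteration(per_gpu_data):
    per_gpu_grads = [[x * 0.1 for x in data] for data in per_gpu_data]  # computation
    shards = reduce_scatter(per_gpu_grads)   # data transfer + reduce across GPUs
    return all_gather(shards)                # aggregation on every GPU

def run_job(per_gpu_data, iterations=3):
    done, grads = 0, None
    while done < iterations:
        try:
            grads = run_iteration(per_gpu_data)  # computation, transfer, aggregation
            done += 1                            # the iteration completed as one unit
        except RuntimeError:
            continue                             # any failure means repeating the iteration
    return grads

print(run_job([[1.0, 2.0], [3.0, 4.0]]))
```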

4:16 Crucially, if any iteration fails, for example due to an error in computation or data transfer,

4:22 the job halts and the iteration is repeated. Today's powerful GPUs can handle massive amounts of data,

4:30 we are talking about 800Gbps. This can easily saturate network connections. However, traditional

4:36 load-balancing methods can inadvertently create bottlenecks by unevenly distributing the traffic

4:43 across network links. This means: some connections get overloaded while others sit idle. This can

4:50 ultimately slow down the training of your complete AI workflow. Now, this condition can happen at

4:56 leaf or spine levels: the traffic from several leaf switches to the spine may

5:04 be directed to the same leaf, which has finite capacity and possibly a single spine-to-leaf downlink;

5:11 flows can oversubscribe the interface, resulting in buffer exhaustion, congestion, or drops. This

5:17 is where features like ECMP help you load balance your traffic flow. Now let's understand why your

5:24 ECMP might be failing you at times. We know that the distribution of a given traffic flow

5:30 across one of the ECMP legs is based on a hash function. The hash function generates a hash based

5:37 on an n-tuple, primarily comprising several fields of the L2/L3/L4 packet headers. The many-to-one

5:44 mapping of hash values to legs can sometimes cause unbalanced traffic distribution, and this

5:52 unbalanced traffic distribution would create an overload condition for child legs. Not to mention,

5:58 if the imbalance on a child link is persistent, the only recovery may be to add or remove legs of an

6:05 ECMP path. Here is router R1, which is using a hash function to distribute traffic across

6:11 multiple next hops: R2, R3, R4, and R5. The hash function assigns each packet to a specific

6:17 hash bucket which then determines the next hop for that particular packet. In the ideal case,

6:24 which is equal flows and equal traffic rate, the hash function distributes traffic evenly across

6:29 all the next hops. This results in optimal load balancing. But what happens if the hash function

6:36 is not well-designed? It can lead to uneven traffic distribution where some links become

6:42 overloaded while others remain underutilized. This can surely cause performance bottlenecks

6:48 and network congestion. If we take another example: a single large flow or a "fat" flow

6:53 can also disrupt load-balancing. If a significant proportion of traffic belongs to a single flow,

6:59 it might all be hashed to the same next hop. This would overload that link and cause uneven

7:05 distribution. This is where Adaptive Load Balancing comes into the picture. It helps prevent network

7:11 congestion in an AI/ML infrastructure. Instead of just relying on simple hashing to distribute

7:17 traffic, ALB intelligently monitors network links and dynamically adjusts how data flows are routed.

7:26 This ensures that no single link gets overloaded, maximizing efficiency, and preventing slowdowns of

7:32 your AI jobs. Even with very demanding workloads or unexpected spikes in traffic, ALB can keep your

7:41 network running smoothly, and it will help you avoid any congestion in AI/ML workload training.
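To make the hashing discussion above concrete, here is a toy Python model of static ECMP next-hop selection. The header fields, bucket count, and next-hop names are invented for illustration; real routers compute this hash in the forwarding ASIC, not in software.

```python
import hashlib

NEXT_HOPS = ["R2", "R3", "R4", "R5"]   # the ECMP members from the R1 example
NUM_BUCKETS = 64                       # bucket count chosen arbitrarily here

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto):
    # Hash an n-tuple of L3/L4 header fields into a bucket, then map the
    # bucket to one member: a static, many-to-one mapping.
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % NUM_BUCKETS
    return NEXT_HOPS[bucket % len(NEXT_HOPS)]

# A single "fat" flow always produces the same hash, so every one of its
# packets lands on the same next hop no matter how much bandwidth it uses.
print(ecmp_next_hop("10.0.0.1", "10.0.1.1", 49152, 4791, "udp"))
```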

7:48 Now with Express ASICs, we achieve Adaptive Load Balancing in our networking hardware. These chips

7:54 have built-in intelligence to constantly monitor network traffic and make real-time adjustments

8:00 to ensure smooth and efficient data flow. Let's understand how it works. The first and foremost

8:06 part is traffic analysis. The Express ASIC analyzes incoming data packets,

8:12 identifying different flows, and also measuring their bandwidth usage. Based upon this analysis,

8:18 the ASIC automatically adjusts how traffic is distributed across network links,

8:23 which also prevents any single link from becoming overloaded, ensuring that data can move quickly

8:30 and efficiently throughout the network. The selected buckets can be maintained at a per-ASIC

8:35 or per-line-card level. With the Express ASIC, we achieve Adaptive Load Balancing in our networking

8:40 hardware. These chips have built-in intelligence to constantly monitor network traffic and make

8:46 real-time adjustments to ensure smooth and efficient data flow. Now think of it as a

8:53 smart traffic management system for your complete network which prevents bottlenecks and keeps the

8:58 data flow smooth, especially important for your demanding AI/ML workloads. Thank you.
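Pulling the two halves of the video together, the conceptual Python sketch below shows the hash-bucket indirection: measured per-bucket rates are aggregated per next hop, and when a member runs hot, only its busiest bucket is repointed at the least-loaded member, so flows in untouched buckets keep their existing path. The threshold, rates, and data structures are assumptions for illustration; this is not the Express ASIC implementation.

```python
# Conceptual illustration only: ALB-style adjustment of a hash-bucket table.
# Thresholds, rates, and structure are invented; the real logic is in hardware.

LINK_CAPACITY_GBPS = 800

def adjust_buckets(bucket_to_nh, bucket_rate_gbps):
    # Traffic analysis: aggregate measured bucket rates per next hop.
    load = {nh: 0.0 for nh in bucket_to_nh.values()}
    for bucket, nh in bucket_to_nh.items():
        load[nh] += bucket_rate_gbps.get(bucket, 0.0)

    hot = max(load, key=load.get)
    cold = min(load, key=load.get)
    if hot == cold or load[hot] <= 0.8 * LINK_CAPACITY_GBPS:
        return bucket_to_nh  # no member is overloaded; leave the table alone

    # Dynamic adjustment: repoint the busiest bucket of the hot member at the
    # least-loaded member. Flows in untouched buckets keep their current path.
    busiest = max((b for b, nh in bucket_to_nh.items() if nh == hot),
                  key=lambda b: bucket_rate_gbps.get(b, 0.0))
    bucket_to_nh[busiest] = cold
    return bucket_to_nh

table = {0: "R2", 1: "R2", 2: "R3", 3: "R4", 4: "R5"}
rates = {0: 500.0, 1: 300.0, 2: 100.0, 3: 120.0, 4: 80.0}
print(adjust_buckets(table, rates))
```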
