Praful Lalchandani, Head of AI Data Center Business, Juniper Networks

Networks Myths to Solutions - Juniper’s Approach to AI Data Centers


Cloud Field Day 20: Networks Myths to Solutions - Juniper’s Approach to AI Data Centers

Misconceptions and misunderstandings of the rapidly evolving AI infrastructure space often cloud the decision-making process for IT professionals. This presentation debunks myths surrounding AI data centers, shows how Juniper’s AI data center solution can optimize these environments, and demonstrates the key role networking plays in maximizing the ROI for expensive GPU assets.


You’ll learn

  • The role of the network in minimizing job completion time (JCT)

  • Common AI training and inference network bottlenecks

  • The economic advantages of Ethernet over InfiniBand

Who is this for?

Network Professionals, Business Leaders

Host

Praful Lalchandani
Head of AI Data Center Business, Juniper Networks

Transcript

0:10 So good morning, everyone. My name is Praful Lalchandani, and today I'm excited to take you all on a journey from myths to

0:17 solutions in the world of AI data centers. Now, when it comes to networking for AI, there are more than a few myths

0:25 that need to be debunked, and a lot of self-serving arguments out there from vendors.

0:30 So in this session today we're going to demonstrate how we are debunking some of the biggest

0:36 misconceptions out there, and show you how Juniper Networks is taking an evidence-based approach to demonstrate

0:43 what truly matters in the context of networking for AI. So let me start with Juniper's

0:49 mission for AI data centers. We know that GPUs are expensive. While GPUs are

0:55 the currency of the AI revolution, building even a small cluster of just eight to sixteen nodes can lead to a capex

1:01 investment of easily $5 to $10 million. Now, we believe that while networking is a smaller percentage of

1:07 that overall capex investment, networking plays an outsized role in maximizing the return on investment on your GPU assets.

1:15 As Mansour mentioned, if GPUs are sitting idle, it's a sad situation. So our mission for AI data

1:22 centers is to unlock the full potential of our customers' AI infrastructure and AI projects by, one, delivering unparalleled

1:29 network performance at 400 gig and 800 gig; two, ease of operations using Apstra; and

1:35 three, a lower total cost of ownership versus other proprietary technologies out there.

1:42 Now, this is a level-setting slide to show you the life cycle of

1:48 building an AI application. Nothing new here that Juniper is sharing, but the way it starts is: first you

1:54 gather data, you pre-process your data, you curate your data and eliminate biases, and that data is then fed into a training model,

2:01 which is really a neural network or a machine learning model that is learning to recognize patterns from the data that

2:07 has been fed to it. And then the rubber hits the road when the model is packaged into an application and deployed

2:13 for inference, where it starts to make predictions based on new user

2:18 inputs. The reason for level-setting is that Juniper has solutions for both training as well as inference clusters.

2:26 Now let's look at the requirements — network requirements, more specifically — for both training as well as

2:31 inference clusters, starting with training. Training can be foundational model training, which is really training a

2:37 model from scratch — examples of this would be GPT-3, GPT-4, Llama 2, Llama 3 — where

2:42 you're taking an uninitialized model and training its weights and

2:47 biases based on the data that has been fed to it. Foundational model training can

2:53 require thousands and thousands of GPUs, probably the highest class of GPU out there. Versus

3:00 fine-tuning: fine-tuning is really the task of taking a pre-trained model and

3:05 adapting it for a specific task. So this may be for a very specific use case in automotive or finance or whatever the

3:12 use case might be, but the idea is that you take a smaller, more specific data set and you train the model for a specific

3:19 task. Now, fine-tuning requires much lower scale than the kind of scale you need

3:24 for foundational model training — this could be 64 GPUs, maybe thousands of GPUs, but not the tens of thousands of GPUs

3:29 required for foundational model training. But whether it is foundational model training or fine-tuning, what we

3:36 see is that the elements of the training cluster remain the same. You have a GPU cluster, which is your set of

3:42 coordinated GPUs. You have dedicated storage, which is high-performance storage — this is where the data sets

3:49 that are being used to train the model reside; this is where intermediate checkpoints, gradients, all of those

3:54 things live. And then you have shared storage nodes, where everything else — all your other artifacts,

4:00 your logs, everything else — is stored. Now, for this session, the critical thing we

4:07 want to talk about is the network. There are three separate networks that participate in this AI training cluster:

4:13 you have the backend GPU training network, which is GPUs talking to GPUs at very, very high speed; you have a backend

4:19 dedicated storage network, which is GPUs talking to your backend dedicated storage; and then you have your front-end

4:25 network, which is your connectivity to the outside world. It's where client requests come in, and it is also almost always where

4:30 your orchestration resides — this could be Slurm, it could be

4:36 Kubernetes — which is orchestrating jobs across your cluster. As Mansour mentioned,

4:43 the key metric for training jobs is job completion time, which is how much time it takes to train a particular job

4:50 with a specific data set and a target accuracy in mind.

4:58 Can you talk about job completion time a little bit in an AI context? Because I don't understand

5:03 that — I haven't really seen "job" used for AI before. I mean, it's very common in HPC, but especially for foundational models, a lot

5:10 of training isn't so much jobs as continuous feedback —

5:16 not really feedback, that's the wrong term, but continuous feedback loops. Not really jobs, unless you say a job

5:23 lasts two months for a foundational model. Inference is real time and it is continuous, but

5:31 is training continuous, or is it one long, one-time run? Unless I'm misunderstanding it totally. Training

5:39 is actually not — sometimes it can be continuous, but very often when we deploy a training job, a model

5:46 that is trained for inference has been trained with prior data already in place. I'll talk about RAG, where

5:52 live data comes in for inferencing, but typically for any job — like a BERT, or a

5:58 Llama 2 or Llama 3 — you're training it with your data, maybe over a period of a week, and then deploying

6:03 it. Now, obviously you gather more data, you retrain it, and then deploy it again, but essentially it's not always

6:10 continuous: you train it, then you deploy it for inference. So that training of a week is what you're calling job

6:15 completion time? Yes — an HPC job might be a few hours, and here you have to wait a week, but it's still a job. Okay, thanks.
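To pin down the "job" framing from this exchange, here is a minimal sketch — my own illustration with made-up numbers, not anything shown in the session — of what job completion time means in practice: the wall-clock time from launching a training run until the model reaches a target accuracy on a fixed data set.

```python
import random
import time

# Hypothetical illustration of job completion time (JCT): wall-clock time from
# job launch until a target accuracy is reached on a fixed data set.
TARGET_ACCURACY = 0.90

def train_one_epoch(epoch: int) -> float:
    """Stand-in for a real training epoch; returns a simulated validation accuracy."""
    time.sleep(0.1)                                   # placeholder for hours of GPU work
    return min(0.99, 0.50 + 0.05 * epoch + random.uniform(0.0, 0.02))

start = time.monotonic()
epoch, accuracy = 0, 0.0
while accuracy < TARGET_ACCURACY:
    epoch += 1
    accuracy = train_one_epoch(epoch)

jct = time.monotonic() - start                        # the metric discussed above
print(f"target accuracy {TARGET_ACCURACY} reached after {epoch} epochs; JCT = {jct:.1f}s")
```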

6:22 And does Juniper play any role in the backend network?

6:27 Absolutely we do — that's what this session is about: showing you how we play. Okay.

6:34 All right. So again, just to give you a little bit more appreciation for why the

6:40 network is important for job completion time and the economics of job completion time: the reason is that training

6:46 doesn't happen on one GPU — it's a cluster of GPUs coordinating to train that job over a period of hours

6:53 and weeks. And this is an example of data parallelism, where you have a very

6:58 large data set that is sliced into chunks and federated to the GPUs, again over the data center fabric, and the GPUs

7:05 all do their individual computations — they're basically calculating the weights and biases, the gradients, for the neural

7:12 network. But before — and this is the critical piece — before the next set

7:18 of data can be fed to the GPUs, all the GPUs need to merge their computations.

7:24 That merge is when a burst of traffic hits the backend GPU network, and this is where network efficiency becomes

7:30 critical. Because if there is any bottleneck — let's say GPU 8 is not able to talk to GPU 10 effectively or

7:36 at very high speed — all GPUs in the cluster are sitting idle, not just GPU 8 and GPU 10, which lost packets; every

7:43 GPU in the cluster is sitting idle, and that's lost time that you don't want in your infrastructure.

7:49 Another example of this could be model parallelism, where the model is too

7:56 big for all the layers of the neural network to fit into one GPU, so each layer or set of layers is fed to

8:02 different GPUs, which leads to even more complex traffic patterns over your backend GPU network. So hopefully

8:08 that explains why eliminating those network bottlenecks is super critical in these infrastructures. So that merging

8:14 is done through the backend network and not through the front-end network? The GPU computations are definitely

8:19 happening over the backend network. The front-end network in training is really just for job orchestration — there's not a whole lot of traffic

8:25 happening on the front-end network. Accessing data is your backend

8:30 storage network, and GPUs merging their computations is the backend GPU network.
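As a concrete, heavily simplified picture of that merge step, here is a PyTorch sketch — my own illustration, not Juniper's or NVIDIA's code — of data-parallel gradient averaging: each rank computes gradients on its own data slice with no fabric traffic, then a single all-reduce sums the gradients across every GPU, and that collective is the burst that hits the backend GPU network. It assumes the standard torchrun launcher environment.

```python
import os
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """The 'merge': sum each gradient tensor across all ranks over the fabric, then average."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

if __name__ == "__main__":
    # NCCL is the backend that actually drives RoCE/InfiniBand traffic between GPUs;
    # gloo lets the same sketch run as a CPU-only dry run.
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")

    device = torch.device("cuda" if use_cuda else "cpu")
    model = torch.nn.Linear(1024, 1024).to(device)
    data = torch.randn(32, 1024, device=device)

    loss = model(data).square().mean()
    loss.backward()            # computation phase: local to each GPU, fabric is quiet
    average_gradients(model)   # merge phase: the all-reduce burst on the backend network

    dist.destroy_process_group()
```

Launched with, for example, `torchrun --nproc_per_node=8 script.py`, every rank blocks in the all-reduce until the slowest one arrives — which is exactly why a single congested GPU-to-GPU path leaves the whole cluster idle.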

8:36 Okay. And Juniper does play a role there, and I'm going to show you how. So now we're moving on to inference.

8:43 Now, inference can take on different cluster types. The simplest form is single-node inferencing: your model

8:49 is not very big, it can fit into the memory of a single GPU, and a single client

8:57 request can be handled by a single GPU — sometimes even a CPU, if massive parallel processing is not necessary.

9:03 A little more complex situation is multi-node inferencing, where your model — like Llama 3 or GPT-4 —

9:11 may be too large to fit into the memory of a single GPU, in which case,

9:16 even for inferencing, a cluster of GPUs across multiple server nodes

9:21 comes into the picture to handle the request, even for a single client. In that case you have a front-end

9:27 network from where client requests come in and inference responses go out, but

9:33 you also have a backend GPU inference network. Now, this is not the most common situation — the most common situation is

9:38 single-node inferencing — but we have seen with our customers who are doing things with Llama 3 and onwards,

9:44 and larger-sized models, that they even have a backend GPU inference network.

9:51 The last situation is retrieval-augmented generation, or RAG, which is becoming a buzzword these days. The idea

9:58 behind it — going back to the question from before — is that you cannot be continuously

10:03 training your model, but there is live data, and maybe you want the responsiveness of your model to be based

10:10 on that live data that is available out there. So the idea behind RAG is that you

10:15 supplement your inferencing with external or internal sources of data.

10:20 Now, we don't have time to go into RAG in a lot of detail,

10:26 but I'll just talk about the network context: RAG generally has a vector database involved, and that vector database

10:33 is stored in shared storage nodes and requires high-performance, low-latency access into those vector databases.
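For a rough picture of what that retrieval path looks like, here is a toy Python sketch — my own illustration, with a made-up bag-of-words embedding and an in-memory matrix standing in for a real embedding model and vector database — of the lookup RAG performs before the model answers; in production that index lives on the shared storage nodes, which is why low-latency access to it matters.

```python
import numpy as np

documents = [
    "Q2 revenue grew 12 percent year over year.",
    "The backend GPU fabric runs at 400 gig per server.",
    "RAG supplements a frozen model with live data at inference time.",
]
vocab = sorted({w for d in documents for w in d.lower().split()})

def embed(text: str) -> np.ndarray:
    """Stand-in embedding (bag of words); a real system would call an embedding model."""
    words = text.lower().split()
    vec = np.array([float(words.count(w)) for w in vocab])
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

index = np.stack([embed(d) for d in documents])     # the "vector database"

query = "how does RAG use live data at inference"
scores = index @ embed(query)                       # cosine-similarity lookup
context = documents[int(np.argmax(scores))]         # this fetch is the latency-sensitive step

prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
print(prompt)                                       # augmented prompt handed to the model
```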

10:40 Now, like I said, for training the key metric is job completion time; with inference, the key metrics are throughput

10:46 and time to first token. Throughput is how many client requests you can handle per second; time to first token is a

10:53 measure of interactivity. So when you go to ChatGPT and ask a question, the amount of time it takes for the

10:59 first word to start getting typed is loosely reflected in time to first token. And again, the network plays a key role in

11:06 accelerating both throughput as well as time to first token.
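Here is a small sketch of how those two inference metrics are typically measured — purely illustrative timings against a fake streaming endpoint, not Juniper's tooling: time to first token is the delay before the first streamed token arrives, and throughput comes from how many requests complete per second.

```python
import time

def fake_llm_stream(prompt: str):
    """Stand-in for a streaming inference endpoint."""
    time.sleep(0.25)              # prefill phase: dominates time to first token
    for token in ["The", " answer", " is", " 42", "."]:
        time.sleep(0.02)          # decode phase: per-token latency limits throughput
        yield token

start = time.monotonic()
first_token_at = None
for _ in fake_llm_stream("hello"):
    if first_token_at is None:
        first_token_at = time.monotonic()
total = time.monotonic() - start

print(f"time to first token: {first_token_at - start:.3f}s")
print(f"request time: {total:.3f}s -> roughly {1.0 / total:.1f} requests/s per worker")
```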

11:13 All right, so summarizing the network requirements: you have the front-end fabric, which we've typically

11:19 seen at 100 gig to the host, and it's always Ethernet. It is where you have your

11:25 access to the outside world — so you have your leaf-spine fabric, but you also have your gateway, which is the connectivity to the outside world. You

11:32 typically require multi-tenancy in that fabric, because that's the fabric that is connected to almost everything: it's

11:37 connected to your general-purpose infrastructure, it is connected to your AI infrastructure, it is connected to your headend servers where Slurm or

11:43 Kubernetes is located — it's connected to pretty much everything, plus it's connected to the outside world. The backend GPU fabric is 400 gig: the

11:51 current generation of NVIDIA H100 servers has 400-gig connectivity to each

11:57 server, and you typically require lossless networking — we'll touch upon lossless networking in a little more

12:03 detail, but we've heard Mansour mention as well that you don't want to lose packets too extensively in that

12:10 backend GPU fabric. And then backend storage we typically see at 200 gig today — some of our customers are

12:15 deploying at 400 gig — but again, you need low-latency access to storage there as well. Now, for the rest of this session

12:22 we're going to focus on that backend GPU fabric and see what the requirements are and how Juniper is serving those

12:28 requirements.

12:28 requirements all right so there are three different

12:34 Technologies at play When You're Building backend GPU Fabrics the first one is infin band I think everybody is

12:41 familiar with infin band especially coming from the HPC world uh you know where this was the gold standard uh it

12:48 kind of translated into AI training as well so Nvidia is the only vendor out there that is positioning switches based

12:55 on infin band Fabrics the second is an open ethernet fabric solution with the keyword being open uh and there are

13:02 multiple different vendors that this ecosystem is really big you know there many many different vendors put you know

13:07 selling ethernet switches Juniper being one of them and the last is a new class of

13:13 products that is emerging which is called scheduled Fabric architectures and they seem to have found a sweet spot

13:19 or at least they claim to have found a sweet spot with AI training what scheduled fabric is is or also referred

13:25 to a distributed disaggregated chassis it's a modular chassis that is is split into its components or disaggregated

13:31 into its components so the line card in a model chassis becomes a leaf uh Fabric in a model chassis becomes a spine and

13:38 they somewhat act independently um so that's not where Juniper is but there are other vendors

13:45 out there that are pitching that architecture for AI training uh backend GPU clusters so where would Ultra

13:51 e Ultra ethernet is an extension of open ethernet uh so by the way I just want to

13:57 kind of maybe say and then this question came in earlier as well we believe the solutions exist and we're going to show

14:03 you some solutions that meet the demands of AI training networks today Ultra ethernet is only going to make it better

14:09 Juniper is completely participating in U but it is only going to make it better but we believe that we can meet the

14:15 performance of infin band today and that's what we're going to show you could you explain scheduled fabric again

14:20 I I don't okay so scheduled fabric if you know what a model chassis is uh you have line cards and you have fabrics and

14:27 traffic comes into the line card the packet is chopped into cells and sprayed across the fabric what a scheduled

14:34 fabric architecture is trying to do is disaggregate that that modular architecture into a leaf spine fabric

14:41 but what's happening under inside the leaf spine fabric is still that cell-based or packet spring architecture

14:46 so it's a modular chassis that is sort of disaggregated ATM will never

14:52 die true true so all of these Technologies

14:58 they're really trying to do only two things and I'm I'm I know I'm oversimplifying this but they're really trying to do only two things one is

15:03 congestion avoidance by better utilization of the available Leaf spine capacity in your Fabric and different

15:10 Technologies do it differently infin band has a centralized God uh called a subnet manager which looks at all the

15:17 flows coming in all the paths to your Fabric and Maps the flows to the paths it's not a decentralized distributed

15:23 architecture like ethernet but it's like a got central control scheduled fabric like I said the way they try to do

15:29 better load balancing or better utilization of available capacity is cell-based or packet spraying again

15:35 packet is chopped into cells by the Ingress line card or the Ingress Leaf goes over the fabric the destination

15:41 line card or destination Leaf has to reassemble them into packets and send it on its way open ethernet Fabrics like

15:48 from from juniper use more intelligent load balancing techniques which I'm going to double click on and a second

15:54 thing that all Fabrics are trying to do is congestion control the whole objective of control is to not lose any

16:01 packets during your you know AI training uh run infin band and scheduled Fabrics do

16:08 it using a request Grant mechanism uh not exactly the same but I don't want to get into details of what how infin band

16:14 works and scheduled fabric works but the general principle of this is a sender before sending a packet to a receiver

16:19 will send a request to the receiver request receiver says okay I have the buffers to accept this packet I'm going

16:26 to send a grant back to the sender and and only then the sender sends a packet through the to the fabric open ethernet

16:34 uh Solutions use techniques called DC qcn again the middle column I'm going to explain a little bit more in detail

16:39 because those are Juniper Solutions but uh different techniques used by different

16:47 I said this was going to be a mythbusting session, so after that little bit of academic

16:53 setup, I'm going to jump straight into it. The first myth that we want to bust: Ethernet does not match InfiniBand's

17:00 performance for AI training. Now, we being Juniper, I think you know what the answer is going to be — but

17:07 we want to show you why and how. Now, we can write a lot of blogs and

17:14 white papers explaining the theory of why Ethernet solutions work as well as InfiniBand,

17:19 but you're not going to believe us. So, as Mansour said, we built out this AI

17:25 Innovation Lab at Juniper, because we wanted to run real training jobs and real inference jobs on a cluster of GPUs and

17:33 accelerators and prove that our job completion times on this Ethernet network match those of InfiniBand. So

17:40 let me geek out a little bit and explain what this infrastructure actually looks like. We built out a cluster: what you

17:46 see in the middle over there is NVIDIA GPUs — 64 A100 GPUs and 32 H100

17:53 GPUs. We have the dedicated storage over here — high-performance storage used for training,

17:59 used to store the data sets that are used for training the job. Like I mentioned, we have our shared

18:06 storage, which is used for everything else, like logs, other artifacts, et cetera; it's also

18:11 where the headend servers are located. We are using Slurm for orchestration of our jobs in this case, but other customers in

18:17 production could be using Kubernetes, OpenShift, and so on. Now, talking about the various

18:24 fabrics: like I said, the backend GPU fabric is the focus of this discussion, so I'm going to talk about it

18:29 a little bit more. We have the A100 servers connected at 200 gig to the

18:35 fabric, and we have the H100 servers connected at 400 gig into the fabric, and these are essentially

18:42 connected in what is referred to as a rail-optimized design. Rail-optimized design is something NVIDIA has as part

18:47 of its SuperPOD reference designs; what we have done is mimic that design, except we have

18:54 replaced InfiniBand switches with Ethernet switches. The idea behind a rail-optimized design is that GPU 1 from

19:03 each server connects to leaf 1, GPU 2 connects to leaf 2, and —

19:09 by the way, each server has eight GPUs — GPU 8 from each server connects to leaf 8. Juniper refers to this

19:16 collection of eight leaves as a stripe, and in this design over here we have two

19:22 stripes that are then interconnected by a spine layer. Now, the reason for this rail-optimized design is to minimize the

19:28 number of hops through your fabric. Each server itself has an NVSwitch — a

19:34 switch inside the server itself — so all GPUs that communicate within the server don't need

19:42 to come out to the fabric; they will just communicate within the high-bandwidth NVSwitch in the server itself. All GPUs

19:48 within the same stripe communicate only at the leaf layer — no traffic hits the spine. It's only when your GPUs need to

19:55 talk across stripes, when you have a large enough cluster, that the traffic actually hits the spine. Each

20:02 stripe can support up to 256 or 512 GPUs in that design,

20:07 depending on the size of your leaf.
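Here is a small sketch — with assumed sizes, not the exact lab or JVD numbers — of the wiring logic just described: GPU k of every server cables to leaf k of its stripe, so traffic either stays inside the server on NVSwitch, stays at the leaf layer of its stripe, or crosses the spine only when it leaves the stripe.

```python
GPUS_PER_SERVER = 8        # one rail (leaf) per GPU position in the server
SERVERS_PER_STRIPE = 32    # assumption: 32 servers x 8 GPUs = 256 GPUs per stripe

def locate(gpu_id: int):
    server, rail = divmod(gpu_id, GPUS_PER_SERVER)
    stripe = server // SERVERS_PER_STRIPE
    return stripe, server, rail            # rail k means "cabled to leaf k of the stripe"

def path(src: int, dst: int) -> str:
    s_stripe, s_server, _ = locate(src)
    d_stripe, d_server, _ = locate(dst)
    if s_server == d_server:
        return "stays inside the server over NVSwitch; never touches the fabric"
    if s_stripe == d_stripe:
        return "handled at the leaf layer of the stripe; spine untouched"
    return "leaf -> spine -> leaf (cross-stripe traffic)"

for src, dst in [(0, 5), (8, 100), (8, 300)]:
    print(f"GPU {src} -> GPU {dst}: {path(src, dst)}")
```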

20:14 Can I ask a question about these 5200s? The leaf-layer 5200s at the back end and the

20:21 5200s for the dedicated storage — are these physically the same 5200s, or

20:26 physically different switches? Physically different switches. Okay, thank you. And the 5200s are essentially a

20:32 Broadcom Tomahawk series of products. Thank you. That's the backend GPU fabric. Then we

20:38 have the backend storage fabric — again, a leaf-spine design — and then we have a front-end fabric, which is also

20:45 another leaf-spine design, connecting to the shared storage nodes, connecting to Slurm;

20:50 it's also connecting to the outside world for client requests to come in. The challenge for something like this is

20:56 that it's orders of magnitude smaller than what large language models are actually

21:02 trained on. So we have trained GPT-3, we have trained Llama 2 on this cluster,

21:07 but with smaller data sets. Now, the proof — I mean, in our lab we're not going to be able to build a thousand GPUs, but our

21:14 proof point is our customers: we have customers who are running clusters of tens of thousands of

21:22 GPUs. So yes, in the lab we can only scale to a certain extent, but the proof point is our customers.

21:29 Are you going to go through some data — like how long did it take —

21:34 will you go through that? Yeah, right here. Okay, cool.

21:34 will go for that okay cool yeah right here okay so uh I mean that was the whole

21:41 goal of the setup uh right again we said we don't want to put theories out there we want to put it in practice so we ran

21:47 ml Commons jobs now ml Commons for those who are not aware uh is a body which produces standardized data set

21:53 standardized or I should say yeah reference data sets with reference models so all vendors can do an Apples

21:59 to Apples comparison so we ran Bert large we ran drrm we ran llama 2 uh we ran gpt3 we ran other things as well but

22:07 if you look at a job completion time for 64 a100 gpus with Bert large we achieved

22:12 2.6 minutes of job completion time versus infin band metrics posted out there uh you know out there in meml

22:18 common we didn't test infin band these are again these are standardized benchmarks so we measuring against what's posted in ml Commons range from

22:24 2.5 to 3.3 minutes llama 2 if the number number is catching your eye why is Juniper at 7.9 minutes versus Nvidia is

22:32 at five or infin band is at 5.1 minutes the reason for that is that we have only 32 h100 gpus and the closest Benchmark

22:40 posted has 64 h100 gpus right but we feel very good about this number generally not perfectly linearly but we

22:48 expect job completion time to improve somewhat linearly uh with doubling of uh

22:53 gpus so while we have half the number of gpus our job completion time is is not

22:59 double that of U you know infin band posted numbers out there so we feel very very good about these numbers it's not

23:05 And it's not just these benchmarks: customers have come to us during POCs, they have brought their own models, and they said, you know, we don't trust

23:10 MLPerf to be reflective of our own production traffic — we're going to bring our own models, run them in your lab, and see

23:16 what happens. And they have come back with the same conclusions: they get

23:22 similar results to what they see with InfiniBand in their own labs. So MLPerf and things of that nature depend a lot

23:29 on the hardware configuration — it's not just the GPUs specifically, it's the CPUs, the cores, the storage, and all of that.

23:36 Storage not as much for these sorts of activities, because it's all direct storage, but — so you've matched the

23:43 hardware configuration in the BERT-Large case, for instance? When you say hardware — we matched the GPUs. So the

23:50 CPUs, and the cores, and RAM, and all that other stuff

23:55 needs to also be matched, right? Yeah — in fact, the software stack matters, so we're also matching the

24:00 software stack as well. Yeah. Now, we are assuming that any vendor who puts numbers out there

24:07 is putting their best foot forward as well. I think what matters the most, obviously, is the

24:13 class of the GPU — whether it's the A100 or the H100 — and how many GPUs you have; that makes the most difference. We've seen storage actually makes a difference:

24:20 if you tune it right or you don't tune it right, that can be the bottleneck. The NCCL versions can matter

24:26 sometimes. There are a lot of things that matter, but again, because these are standardized benchmarks, it is coordinated

24:32 to the extent possible. Are you submitting these benchmarks, so they're actually publicly available numbers?

24:38 Correct. Yeah — coincidentally, today is the day they get published.

24:45 So the answer to that mythbusting question is: false.

24:51 We believe that Ethernet performs, and that raises the question for our customers: is the cost and

24:56 complexity of InfiniBand worth it? To my earlier point, it's not just us seeing

25:03 it — our customers are saying it, and some of the largest cloud providers in the world are saying it. Meta has

25:08 publicly stated that they are using both RoCEv2 and InfiniBand and they get

25:14 similar network performance using both technologies. NVIDIA,

25:22 in their last earnings release — I should say, more specifically, Jensen on his last earnings call — said "we are all in on Ethernet." So I think the

25:29 world is slowly coming to the conclusion that Ethernet can perform, and we are doing our part in proving

25:36 that Ethernet can perform — and specifically that Juniper's AI-optimized Ethernet can perform as well. So isn't

25:41 one of the benefits of InfiniBand over Ethernet that you can

25:46 run lots of parallel trainings simultaneously? That's no different over here as well. Oh, okay, so you've

25:53 done that? Because the numbers you were showing — that was kind of one training

25:58 at a time. Yeah, all benchmark results on MLCommons are one job at a time, but that's not stopping you:

26:05 you can carve out the cluster; we have Slurm running as an orchestrator. Very often in

26:11 our lab we have sliced this infrastructure for different experimentation purposes, and we are

26:16 running different jobs at the same time — sometimes inference, sometimes training, multiple training jobs — over Ethernet. So

26:22 we're doing it. And a lot of cloud providers are running on Ethernet as well —

26:28 specifically, infrastructure-as-a-service cloud providers are running over Ethernet, and for them, their customers

26:34 could be doing anything — they don't even know what they're running — all at the same time.

26:39 All right, we talked about performance. If you look at the economics, ACG Research did an

26:45 independent study comparing the economics of InfiniBand versus Ethernet over a three-year time horizon, and they

26:51 came to the conclusion that InfiniBand is 122% — and I should be explicit, not 22%, 122% — more

26:59 expensive than Ethernet. The drivers were simple; I think we've talked about

27:04 them: InfiniBand is a vendor-locked-in solution, it is more costly, and

27:09 InfiniBand has traditionally been more operationally difficult to manage — with Apstra, as you will see in a few

27:16 demos following me, we've made it really, really simple. Skill sets and toolkits

27:21 for InfiniBand are hard to come by; they're not standard out there in the industry. We all know how to manage Ethernet

27:28 networks — those skill sets are easily available, the toolkits are easily available — and that makes it

27:33 operationally simple to manage Ethernet. Other benefits that have not been factored into this report are that Ethernet is

27:40 going to win out in the future and that Ethernet is GPU-agnostic. While today the

27:45 market is locked up by NVIDIA, AMD is following right after, Intel is right there, and there are other accelerators — like

27:51 SambaNova and Cerebras, vendors we are actually currently working with to build reference designs —

27:57 so the market ecosystem will open up, and you want to bet on something that has better investment protection.

28:04 Praful, where do SmartNICs and things of that nature fit into this?

28:10 Let me try to cover that in the next mythbusting — I'll touch upon it a little bit. So the next myth that I

28:18 want to bust is: packet spraying is required to maximize performance in GPU

28:23 backend networks.

28:29 I'm going to go into a little theory again. Like I said, every technology needs to do two things: one is

28:34 congestion avoidance and the other is congestion control. Congestion avoidance is really the task of better utilization of

28:40 available leaf-spine capacity, because the problem that exists in Ethernet networks is that even if you throw bandwidth at the problem — like 400 gig

28:47 and 800 gig — you can still hit collisions, you can still hit congestion points, because of the way static ECMP works.

28:53 If you look at the picture on the right-hand side, multiple 400-gig flows coming in on a leaf could get mapped, using

29:00 static ECMP, to the same leaf-spine uplink, and you could hit collisions on the uplink. Similarly, you can have

29:05 collisions on the downlink from the spine down to the leaf. And that problem is actually worse in AI training

29:11 networks, because you have large elephant flows: versus general-purpose networks, where you would have many

29:17 smaller flows, here you have fewer, larger flows, and that means less entropy in your network. So we are using a

29:24 combination of techniques — dynamic load balancing, global load balancing, and packet spraying in conjunction with

29:30 SmartNICs — to alleviate those problems. One of those technologies

29:36 that I want to talk about is dynamic load balancing, which is an alternative to static ECMP, where we are looking at

29:42 the quality of a link in order to make a decision on where a packet or a flow goes. The way we

29:50 do this: for example, in the picture on the right-hand side, GPU 2 is trying to send traffic to GPU 6, and there are three available

29:56 paths. Static ECMP would have just picked one of the paths based on a random hash; here we take a look at

30:02 the quality of the links and realize that at that moment in time, when that packet arrives, the path through spine 2 is the best-

30:09 quality path, and we send the traffic there. Now, you asked the question of where NICs play a role over

30:15 here — this is exactly where they play a role. When you employ this in packet-spraying mode, or I should say per-packet

30:23 mode, DLB is making a decision on the quality of a link on a per-packet basis,

30:28 so each packet could go in a different direction. Now, that gives you good entropy

30:33 across your network, but what you need, in conjunction, is

30:39 a SmartNIC that can handle those out-of-order packets and place them in the right place in memory. That's where

30:46 we have actually tested with ConnectX-6, ConnectX-7, and BlueField-3 NICs — we've tested the solution with all the NICs

30:51 out there, and we'll continue to test it with other NICs as they emerge in the market. But that's

30:58 the downside of packet spraying: you need those SmartNICs on the other end. So we have something called flowlet mode.

31:04 A flowlet is essentially a burst in a flow, and as I said, in AI training jobs

31:09 you have a computation phase, when the network is actually sitting idle, and then you have a merge phase, where bursts

31:15 hit the network. So what we do is split those flows into flowlets — a flowlet is essentially

31:23 a burst in a flow — and we make the DLB decision on a per-flowlet basis, not on a per-packet basis.
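To see why a quality- or load-aware choice beats a static hash, here is a toy simulation — my own simplification, not Juniper's DLB implementation — that places a handful of elephant flows onto four uplinks first with a static hash of the flow identifier and then by always picking the least-loaded uplink.

```python
import zlib
from collections import defaultdict

UPLINKS = 4
# Eight 400G elephant flows, identified by a 5-tuple-ish string (illustrative only).
flows = [f"10.0.0.{i}->10.0.1.{i}:4791" for i in range(8)]

# Static ECMP: uplink = hash(flow) % uplinks; the hash takes no notice of load,
# so whether elephants collide on one uplink is a matter of luck.
ecmp_load = defaultdict(int)
for f in flows:
    ecmp_load[zlib.crc32(f.encode()) % UPLINKS] += 400

# DLB-style choice: put each new flow(let) on whichever uplink is least loaded right now.
dlb_load = defaultdict(int)
for f in flows:
    best = min(range(UPLINKS), key=lambda u: dlb_load[u])
    dlb_load[best] += 400

print("static ECMP Gbps per uplink:", [ecmp_load[u] for u in range(UPLINKS)])
print("load-aware  Gbps per uplink:", [dlb_load[u] for u in range(UPLINKS)])
```

Real DLB evaluates link quality per flowlet in the switch data plane; the point of the toy is only that a hash can stack elephant flows on one uplink while another sits idle, whereas a load-aware choice stays even by construction.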

31:29 And when we put those things to the test — again, this is the DLRM job training time; the

31:36 numbers you're seeing over here are job completion times, and we tested static ECMP, DLB flowlet, and per-packet spraying —

31:44 what you'll see over here is — and the various columns are various load conditions, so the first column is

31:50 zero background traffic and the last column is 37.5% background traffic generated by Ixia RDMA

31:57 traffic — what you see is that static ECMP, as I said, gives you the worst performance, and

32:03 in fact it degrades pretty badly once you have congestion in your network. DLB flowlet is very, very

32:12 deterministic: going from 0% to 37.5% background

32:17 traffic, it gives you fairly deterministic job completion time

32:22 for DLRM. Per-packet spraying gives you the

32:27 best results, but only marginally so — it's only 8% better than DLB flowlet. Now, another

32:35 way to look at it is this: on

32:41 the left-hand side you see static ECMP, and the various lines are the utilization of the leaf-spine links. So

32:48 you can see that with static ECMP every link is being utilized at a different rate — at the lowest it's

32:55 about 150 gig, at the highest it's about 350 gig — so you're not getting even utilization of your leaf-spine

33:01 capacity, which was the goal of congestion avoidance. But DLB flowlet gives you amazing uniformity: every

33:08 link in the leaf-spine fabric that is being measured out there has pretty much the same utilization over

33:14 that entire training run, and like I said, that was the goal of congestion avoidance, and it's

33:20 really well achieved with DLB flowlet. Sorry — for packet spraying, do you have a slide

33:28 that shows what packet spraying would look like on these links? It's

33:33 almost indistinguishable from DLB flowlet. Okay. So the conclusion that we're

33:41 driving at over here is: yes, it is theoretically true that packet spraying gives the best results, but not practically —

33:48 like I said, it's only 8% better than DLB flowlet, but it does require you to

33:53 have more expensive NICs. So yes, our solutions from Juniper have the optionality to work NIC-agnostic, and we

34:01 have tested our solutions in conjunction with the SmartNICs out there, but is

34:06 the incremental value that you get from packet spraying worth spending more on the NIC

34:13 side? The other thing I would mention is that for the scheduled fabric architectures I talked about, one of the core elements

34:19 of scheduled fabric is packet spraying or cell-based spraying. So again, that leads to the question: is the

34:25 additional cost and complexity of a scheduled fabric

34:30 worth it, given that we have shown you that DLB flowlet with standard Ethernet can give you almost the same performance

34:36 as packet spraying?

34:41 Okay, so the next one is actually the most popular question that we get

34:47 from customers: 1:1 non-blocking performance is required to maximize performance in your

34:54 infrastructure. Now let me explain what the question even means. If you look at a given

35:01 leaf, non-blocking means that your total downstream capacity to the hosts is equal

35:06 to your upstream capacity to the spine — so for every 400 gig I have down, I have 400 gig up. 2:1 oversubscription

35:14 would mean that I have 2x the capacity down to the servers relative to

35:20 the capacity I have up to the spine — so for every two ports of 400 gig down to the host, I have only one port of 400 gig up to the spine.

35:33 one is to one non-blocking or should I save some cost with 2 is to one over subscription because I have less Optics less number of spines less number of you

35:39 know cables less complexity now lean knowing that you know where my last two

35:45 myths have gone uh where do you guys stand on this who who who who believes

35:50 this is true who raise show off hands who believes this is true nobody who

35:55 believes it's false nobody prove it baby prove it all right uh so

36:04 far everything has gone false right so uh so here the results are right here so

36:10 what we saw is that we had I'm going to start with the one on the right hand side because the most drastic uh we had a customer come in uh they you know

36:17 again customers are coming into to us in our lab they're running their models uh because they don't trust ml Pro they're

36:22 running their models and they found that they had a 90% impact to job completion

36:27 time when they went from 1 is to one non-blocking to 2 is to one over subscription they clearly walked out

36:33 with the conclusion we're going to build with one is to one non-blocking but when we put that to the test with other

36:40 models uh the other most extreme situation that we saw was with drrm where we saw only 1% uh degradation in

36:47 job completion time so this kind of leads us to the quick conclusion that yes it is true that one

36:53 is to I think in general we would recommend customers to go with one is to one non talking the reason for that is

36:59 most of our customers don't know what their researchers are going to be running and let alone today uh one year

37:04 down in the future you know if you're a cloud customer you have no idea what your customers are going to be running

37:09 uh at all even even if you're an internal it shop you may not have predictability into what your researchers your data scientists your

37:15 various business units are going to be running if you have no visibility into what your model characteristics are and they change on a periodic basis uh

37:23 Juniper validated designs will recommend that you go with a one is to one non-blocking architect

37:28 um you know simply it gives you the safest solution out there if you really know what your model behaves and you

37:34 have predictability into uh you know what it can you know how is how it's going to be what's going to be running

37:41 over a duration of time you can put possibly try it out with 2 is to one or higher over subscription but in general

37:47 we recommend one is to one non-blocking even if you go with 2 is to one you may want to think about whether the cost

37:53 savings that you're getting is worth losing the optional of maybe having to run different things

38:00 in the future so but Juniper's AI lab if you're a customer Juniper's AI lab can definitely help you characterize your

38:06 model for not just you know testing whether you know you want to do non-blocking or not but with many other

38:11 things as well all right um the next one is

38:20 lossless networking is always required in the backend GPU fabric even though over the past 30 minutes I've said that

38:27 losing packets is bad in AI training clusters but the question over here is is you do you need 100% lossless

38:34 networking 100% of the time in AI training fabrics and how does it impact your job completion time now we hear

38:41 this a lot from our customers networking people researchers you know data scientists because there are some scars

38:48 from the HPC world where losing a packet was really really bad but and they have

38:53 carried those cars into the AI world but Lo is losing a packet in the AI training

38:58 realm as bad as it was in the HPC world let's look at that so now again a little bit of theory

39:06 uh ethernet does congestion control differently from you know the way infin band does it we use a combination of

39:12 techniques collectively that are referred to as DC qcn or DC data center quantized congestion notification this

39:19 is a combination of explicit congestion notification and pause frames explicit congestion notification the idea is

39:25 switch C's congestion it starts marking packets as having seen congestion and eventually an alert goes to the source

39:32 who slows down pause frames is a measure of Last Resort we call it an insurance

39:38 where if a receiver starts getting his buffer utilization at around 80% 90% it

39:44 basically tells the sender stop stop sending me packets until I clear these buffers and then send it after that the

39:50 collectively this is referred to as DC qcn but the question is yes we can deliver lossless with ethernet but is is

39:56 lossless 100% lossless the goal so again the same customer we put

40:02 their model to the test two scenarios scenario one where we had ecn only and

40:07 as expected you lose some packets and in scenario two we had both ecn and the you

40:13 know the insurance which is pause frames turned on and as expected we get lossless capability with scenario number

40:19 two but surprisingly what this customer found was that first of all a lot of

40:25 researchers and others people have said they think that a job will completely fail if you lose a packet job doesn't

40:31 fail there are some retransmissions that can happen maybe in the HPC World there was some different Behavior but here a

40:36 job doesn't fail but we also saw that they got better performance with scenario one where you actually lost a

40:43 few packets and actually got better performance then not losing any packets and uh and they got worse

40:50 performance and SP more explicitly they got a 8% degradation in performance so if that makes sense we

40:57 actually surprised by the results ourselves so after the customer left we

41:02 ran all our mlpf models against the same scenario s sorry to jumping in on the previous slide so the pocket loss that's

41:09 interesting so like do you have roughly an idea like what was the pocket loss like you know what is that like sort of like like line how far you can go in

41:17 terms of the pocket loss yeah so I mean you cannot I mean I don't have the exact number right here with me but you cannot

41:23 like lose 20% packet loss okay but here they basically what they did was allowed their applications to breathe a little

41:29 bit otherwise you know what we what happens with PFC is that you you are throttling all flows in your entire

41:35 fabric you're telling them slow down when they might not need to but here they saw that with you know by allowing

41:40 allowing their applications to breathe a little bit and maybe losing some packets and then retransmitting and I know I

41:45 need to quantify some for you but uh but you know you cannot lose a whole lot of packets but you can allow say like up to

41:51 five% maybe yeah possibly yeah it depend on the model but dep on a Model okay and

41:57 And then which models are less touchy about packet

42:03 loss? So, this one was the most touchy of all the ones that we tested:

42:08 when we put DLRM to the test against the same scenarios, we saw that it couldn't handle packet loss —

42:15 we saw the opposite result of the customer's model: this one couldn't handle packet loss, we

42:20 saw 22% degradation, and we realized that for this model we always had to have PFC

42:25 turned on, to have the lossless behavior, in order to get the best results. So, leading into the

42:34 conclusions: one conclusion is that the battle scars from the HPC world don't

42:41 always translate the same way into the AI training world. 100% lossless does not always

42:48 yield the best result; a job definitely doesn't fail with a little packet loss. And the theory behind this is that sometimes,

42:53 when you have a high-bandwidth 400-gig or 800-gig network, losing a few packets and retransmitting is probably better

43:01 than having everybody back off across the whole infrastructure. That's the theory behind the

43:06 empirical evidence that we're seeing over here.

43:15 So how do you work with the AI engineers, then, to understand which side their model fits on — the one that's very sensitive to packet loss, or the one that's sensitive

43:22 to being throttled? So, we have customers who actually do bring their models to our lab to test them out. There's

43:31 no standard they can measure their model against theoretically to know that — you actually have to test it in practice. One of the

43:37 exercises that Juniper is running right now is trying to characterize the different types of models and come up with some standard

43:43 solutions. I don't think we're there yet — not just as Juniper, but even as an industry — to be able to say

43:48 your model is type A and this is what you need, or your model is type B and

43:54 this is what you need. But that's exactly the goal of our experimentation in our lab: to continue to get to that point as

44:00 an industry as a whole. Is there no correlation to data parallelism versus model parallelism versus no parallelism

44:07 in this? There could be — I don't think we have run a broad enough range of different models to arrive at a

44:14 conclusion there. So I'll be humble: these

44:22 last four months since we built this lab, we've learned a lot, and I think there's a lot more to learn. We're going to keep

44:27 experimenting along with customers — and that was our next slide. Real quick before you

44:33 go on — did I hear you correctly that if somebody is having

44:38 challenges, they should bring their model to your lab? They can — they don't have to. Is there a way you

44:44 can leverage Apstra and the telemetry, get access to the data, and look at what's going on locally? Yes —

44:55 yes. Okay, thank you. And can I just clarify something about that packet loss — where was the bottleneck? Was

45:01 the demarcation line the networking card on the server — was that the bottleneck — or was it the

45:07 switches that were losing packets? And did you compare whether there was a difference in

45:14 performance depending on where the loss was happening — whether that was over to the supplemental storage, or whether

45:19 that was at the NIC, or whether that was at the switch? This loss — more explicitly, in this particular case —

45:25 was happening on the spine. So you were filling up the buffer on the spine — so

45:31 there was overflow? Yeah — again, we didn't have the insurance of pause frames in scenario

45:37 number one, so you do expect to see some packet loss in that environment.

45:48 Interesting. Yeah, we had a feeling that would be the most interesting, controversial

45:54 one. So anyway, the lab is open to customers, and we will continue to experiment —

46:00 I don't think we're at the end of our journey — but this lab has been great; we're learning a lot, and those best

46:07 practices and learnings feed into our AI JVDs that Mansour talked about. These are Juniper Validated Designs, and

46:15 one of the artifacts of the Juniper Validated Designs is design docs that are published online, accessible for

46:20 everybody to see. But the other artifact of the JVD is Terraform configs, and what

46:26 customers can do is, on their own Terraform instance, download or import these Terraform

46:31 configs and then deploy them into AI fabrics via Apstra. So

46:38 whether you need a 64-GPU cluster or a 4,000-GPU cluster, all the best practices feed in via this

46:45 overall linkage into Apstra, so that you get our learnings from the lab into designs that you can deploy for AI

46:53 clusters. So that's not just a PDF — those are the actual configs that you can import directly into Apstra? That's very cool.

46:59 We publish configs, we publish Terraform templates, and those Terraform templates can then be loaded into

47:07 Apstra — and that's the first demo that we're going to show you. We're going to lean forward on operations for the

47:13 next three demos; they're all Apstra-centric operations. Nick and Raj are going to talk about exactly that: how do

47:20 we take those Terraform templates and load them into Apstra, so that when

47:26 you design your infrastructure, you already have your best practices loaded. Jay Wilson is going to talk about day-

47:31 two operations: how do I visualize congestion, and how do I react to it. And then, because one size may not fit all, Vikram

47:39 and Raj are going to talk about how you can use Apstra to fine-tune your AI data centers to get the maximum

47:45 performance out of your AI. And look at that — somebody snuck in

47:51 a bonus mythbuster, just for you guys, because we love you guys: do we need more acronyms and terms in the AI world?

47:59 I'm going to leave that answer to you — it really depends on you. Personally, I lost count about ten acronyms ago, but

48:06 our next presenter, Nick Davy — he calls himself Smart NIC Davy; he is super, super smart — says he can

48:12 handle four more.
