Rajagopalan Subrahmanian and Vikram Singh

Automated Congestion Management in the AI Data Center with Juniper Networks


Cloud Field Day 20: Automated Congestion Management in the AI Data Center

To maximize throughput and minimize packet loss, Ethernet fabrics use the DCQCN congestion management scheme, but DCQCN introduces significant operational complexity for human operators. Learn how Juniper Apstra takes this new challenge in stride, automatically optimizing for throughput and the "right amount" of packet loss.


You’ll learn

  • The importance of adjusting parameters based on real-time data rather than static, manual settings

  • How Apstra’s DCQCN auto-tune app monitors network KPIs and adjusts configs in real time

Who is this for?

Network Professionals, Business Leaders

Hosts

Rajagopalan Subrahmanian
Product Manager, Juniper Networks
Vikram Singh
Sr. Product Manager, Juniper Networks

Transcript

0:10 Hi, my name is Vikram Singh. I'm a PLM on the AI data center team, and I'm going to talk about automated congestion management in AI/ML back-end fabrics.

0:21 If I may draw an analogy for what we're going to present today: we're all familiar with metering lights, which are primarily used to manage congestion on our freeways during rush hour. The way this works is that cameras or sensors detect how much traffic is flowing, or how much congestion there is, on the main freeway, and then the meters change the duration of time the light stays red, which in practice controls how much traffic you're letting in. If you allow too much, it adds to the congestion and just worsens the situation; if the lights stay red for longer, you're creating congestion at a different point and not utilizing all the bandwidth available. So what we're trying to do is give you that flexibility of fine-tuning, depending on when congestion happens.

1:16 Just to describe the challenge, let's put ourselves in the shoes of a network admin. You're managing this back-end cluster, which Praful will talk about, and there are two types of complexity. One is monitoring complexity: there are so many entities you have to monitor. For example, here you see 256 GPUs. Even in a small cluster like that, you may have 12 switches with 64 ports each, so that's 768 ports; each port can have six queues, so very quickly you're at 5,000-plus entities to monitor, and you may need historical views, as Jay was showing.
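As a quick sanity check on that monitoring math, the arithmetic below just multiplies out the figures quoted in the talk:

```python
# Entity count for the small cluster described in the talk:
# 256 GPUs served by 12 switches with 64 ports each, 6 queues per port.
switches = 12
ports_per_switch = 64
queues_per_port = 6

ports = switches * ports_per_switch       # 768 ports
queues = ports * queues_per_port          # 4,608 queues
print(f"{ports} ports, {queues} queues")  # roughly the "5,000 plus" entities quoted
```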

2:00 Then there's provisioning complexity. There are only a few dials available to tune congestion (they're listed on the bottom right), but today it's a manual process: it's tedious, it's error-prone, and somehow the network admin is supposed to know the optimal values for this network without knowing anything about what traffic will come from these machine learning workloads. Sometimes you may need to tweak certain things for a particular model. More importantly, in a multi-tenant training network, where multiple jobs are running and their traffic flows through the same fabric, by the time you get a ticket, by the time somebody tells you "hey, my machine learning job is slow," you may not be able to recreate the other job that was interfering at that moment, so you don't know what to tweak. These are some of the challenges today, and that's what we're trying to solve through automation. So let's see how, with Apstra and all the things you've seen (Terraform and the probes), we can build something to solve it.

3:07 Here's what we've done: we've written this DCQCN auto-tune app. It leverages everything Jay showed you with the probes, so we are continuously monitoring the key KPIs that tell you when congestion happens, the performance KPIs (we'll do a demo soon, so you'll see them). Once we detect that the intent, the intended config, is not doing what it's supposed to do, we make a decision, and then we do closed-loop automation: we use Terraform to tweak that configuration throughout your fabric, and we continue to monitor the effect of that change, until you have an optimal set of these congestion configurations.
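Schematically, that monitor-decide-apply loop might look like the sketch below. The helper functions are stand-ins rather than Apstra or Terraform APIs; they only mark where the probe reads and the config pushes would go:

```python
import time

def read_congestion_kpis(fabric):
    """Stand-in for reading PFC/ECN/tail-drop counters from the Apstra probes."""
    return {"pfc_frames": 0, "tail_drops": 0}

def apply_dcqcn_config(fabric, kmin, kmax):
    """Stand-in for pushing new ECN thresholds to the fabric, e.g. via Terraform."""
    print(f"{fabric}: set Kmin={kmin}% Kmax={kmax}%")

def autotune_loop(fabric, kmin=60, kmax=90, step=10, interval_s=30):
    """Closed-loop sketch: monitor -> decide -> apply -> monitor again."""
    while True:
        kpis = read_congestion_kpis(fabric)
        if kpis["pfc_frames"] or kpis["tail_drops"]:
            # PFC frames or tail drops mean ECN is reacting too late:
            # shift the marking window left so marking starts sooner.
            kmin, kmax = max(0, kmin - step), max(0, kmax - step)
            apply_dcqcn_config(fabric, kmin, kmax)
        time.sleep(interval_s)  # let the fabric settle before re-evaluating
```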

3:51 There was a question about how we're creating this. In this demo, this is our lab, which you'll be touring shortly, and we use Ixia to generate RoCEv2 traffic. We're actually generating so much traffic that it overwhelms the receiving NICs. It goes through four leaf and spine switches, and the congestion manifests at the egress leaves; that's where the buffers are filling up, and that's where we apply the algorithm I'm going to show you. We monitor the ECN and PFC counters and the tail drops via Apstra, then we make a decision and push out the config.

4:32 Now, there are a few options here. You can say "wherever the congestion is happening, only tweak the congestion parameters on that switch." Or, since Apstra has this concept of leaf and spine that Jay showed you, if it's happening on one leaf you can choose to apply the change to all the other leaves, because it may happen on those leaves as well, or to the spines. We also have ServiceNow integration, so we push every decision this app makes onto ServiceNow.
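As one hypothetical illustration of that last piece, appending an auto-tune decision to an existing incident through ServiceNow's standard Table API could look like this (the instance name, credentials, and sys_id are placeholders):

```python
import requests

def log_decision(instance, user, password, sys_id, note):
    """Append an auto-tune decision as a work note on an existing incident
    using ServiceNow's Table API. Instance, credentials, and sys_id are
    placeholders for illustration."""
    url = f"https://{instance}.service-now.com/api/now/table/incident/{sys_id}"
    resp = requests.patch(
        url,
        auth=(user, password),
        headers={"Accept": "application/json"},
        json={"work_notes": note},
    )
    resp.raise_for_status()

# e.g. log_decision("dev12345", "apstra.bot", "secret", "<incident sys_id>",
#                   "PFCs seen on leaf2: ECN window shifted left to Kmin=50/Kmax=80")
```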

5:04 Praful went through this, and I'm going to go into a little more detail before I show you the algorithm we're using. As you know, with DCQCN there are two primary mechanisms in Ethernet to control congestion. The first one is PFC. It came first, and it's more of a brute-force, hammer approach. In this picture, look at the red link on the spine: let's say it's congested. The switch monitors its buffer, and if occupancy gets to around 90% (it's configurable), to a point where packet drops are imminent because the buffers are about to fill up, it sends a pause frame to the neighboring switch downstream, saying "stop sending me traffic of this priority for a particular amount of time." The way priority works is that the NICs usually mark a DSCP value on all RDMA traffic (there's a default value for NVIDIA and others), and you map that into a queue where you apply both of these techniques. Eventually the paused switch gets into trouble too and its buffers fill up, the pause frames reach the NICs, and they stop. The goal here is to make Ethernet lossless, and this is what makes it lossless. But the problem with this approach is that it has no granularity: if one link is blocked, it blocks all traffic of that priority from the neighboring switch, which has 62 or 63 other ports where the traffic could be perfectly fine. That's what's troublesome about this technique.
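For a sense of scale on those pause frames: the pause time in a PFC frame is expressed in quanta of 512 bit times (IEEE 802.1Qbb), so the same quanta value pauses for far less wall-clock time on a faster link. A quick calculation:

```python
def pfc_pause_seconds(quanta, link_speed_bps):
    """A PFC pause quantum is 512 bit times (IEEE 802.1Qbb), so the same
    quanta value pauses for less wall-clock time on a faster link."""
    return quanta * 512 / link_speed_bps

# Longest possible pause (the 16-bit field's maximum, 65535 quanta):
for gbps in (100, 400):
    t_us = pfc_pause_seconds(65535, gbps * 1e9) * 1e6
    print(f"{gbps}G link: max pause ~ {t_us:.1f} microseconds")  # ~335.5 / ~83.9
```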

6:48 But then there is ECN. With ECN, take the same example: one of the spine's interfaces is congested. You have these two knobs you can configure, and the switch starts probabilistically marking packets. These are one-way flows: one GPU is trying to write a memory chunk into another GPU's memory, and the NICs are continuously looking for this marking. In the IP header, right next to the DSCP bits, there's an ECN bit that the switch flips. The receiver detects that there is congestion somewhere in the fabric and sends an explicit congestion notification packet back to the sender, and then the sender backs off the flow: if it was sending at 400, it steps down to 200, and so on. This is more effective because (a) it's granular, so you're only impacting the flows that pass through the particular congested queue, and (b) it actually remediates the problem, because as the flows back down, the congestion goes away, at least temporarily. So DCQCN is really the judicious use of both of these techniques, or sometimes, as Praful was saying, you can use only ECN as your primary method. That is DCQCN.

8:08 "So you mentioned that PFC causes all the traffic at that priority to stop?"

8:16 Yes, from the neighboring switch.

8:19 "Yeah, on that interface."
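To make the sender-side ECN reaction described a moment ago concrete, here is a deliberately simplified sketch. Real DCQCN scales the rate cut with an estimated congestion level (alpha) and has distinct recovery stages; this version just halves on a CNP and creeps back up, matching the 400-to-200 example from the talk:

```python
def on_cnp(rate_gbps):
    """Deliberately simplified sender reaction: halve the flow's rate when a
    Congestion Notification Packet (CNP) arrives (the talk's 400 -> 200 example).
    Real DCQCN scales the cut with an estimated congestion level (alpha)."""
    return rate_gbps / 2

def on_quiet_period(rate_gbps, line_rate_gbps, step_gbps=25):
    """Simplified recovery: creep back toward line rate while no CNPs arrive.
    Real DCQCN has distinct fast-recovery and additive-increase stages."""
    return min(line_rate_gbps, rate_gbps + step_gbps)
```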

8:23 Okay, so let's see what we've done. I'm simplifying a little here, but this makes it easy to understand. The x-axis is the buffer occupancy, in percent. Whenever congestion happens, a switch is congested at some queue on some port, and this is what you apply: there are two configuration parameters for ECN, Kmin and Kmax. As the buffers fill up and hit the Kmin lower watermark, the switch starts probabilistically marking packets; if you're somewhere around 10%, say, the switch marks one in ten packets with ECN. If ECN doesn't kick in fast enough and you keep filling the buffers past Kmax, that's when every packet in the queue gets marked ECN, which means all the flows are impacted.
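That marking curve is the classic WRED-style ramp; a minimal sketch (with Kmin=0 and Kmax=100 it reproduces the one-in-ten example, which matches the linear 0-to-100 curve mentioned later in the talk):

```python
def ecn_mark_probability(fill_pct, kmin, kmax, pmax=1.0):
    """WRED-style ECN marking curve from the talk: no marking below Kmin,
    a linear ramp between Kmin and Kmax, and mark everything past Kmax."""
    if fill_pct < kmin:
        return 0.0
    if fill_pct >= kmax:
        return 1.0
    return pmax * (fill_pct - kmin) / (kmax - kmin)

# With the linear 0-to-100 curve, a buffer that is 10% full marks
# roughly one packet in ten:
print(ecn_mark_probability(10, kmin=0, kmax=100))  # 0.1
```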

9:23 And then you have PFC. Sending these control packets back to the sender takes time, and if there's some delay in there, ECN may still not act fast enough; that's when you keep PFC as insurance, to kick in and buy some time for ECN to eventually take effect and slow things down. Now, this is the set of values a network admin is somehow supposed to know: "what should my pattern be?" The common practice is to shift them all the way right, starting very high, because you don't want to slow things down unnecessarily. But with that approach you don't have much buffer left to actually prevent drops. So what we've done here is keep monitoring; what you see on the top right is how we adjust left. Everybody starts with a very relaxed ECN config so all the applications can breathe and run at the maximum possible rate; your load balancing is your first line of defense. But if congestion still happens, we keep watching buffer occupancy by checking whether any PFCs are being triggered or whether tail drops are happening. If they are, that indicates your ECN is not reacting fast enough with whatever config you set, so we need to move a little to the left and trigger ECN sooner, so that the eventuality of dropping packets is averted. That's the decision we make: we keep moving left, then watch the impact; if PFCs are still triggered or packets are still dropping, we keep moving left until the system is stable.

11:07 Once we've done that, how do we move right? When you've avoided the bogeyman, the situation that may cause packet loss, we start checking: if the buffer occupancy sits between your low and high watermarks, that indicates my ECN is tuned well and the round-trip times for these CNPs are right, so let me actually allow the applications to run at higher bandwidth, to breathe more, and move right. It will help you find the most optimal value based on whatever workloads come in, in real time.
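Condensed into one decision step, the rule might look like the sketch below. It collapses the coarse shift-left and finer shift-right phases of the demo into a single function; the step sizes match the steps of 10 and 5 used there:

```python
def next_window(kmin, kmax, saw_pfc_or_drops, fill_between_watermarks,
                coarse=10, fine=5):
    """One decision step of the auto-tune logic, condensed. The window keeps
    its width; only its position moves.
      - PFCs or tail drops: ECN is firing too late, shift left (coarse step).
      - occupancy holding between Kmin and Kmax: nicely tuned, shift right
        (fine step) to give the applications more headroom.
      - otherwise: hold position and keep watching."""
    if saw_pfc_or_drops and kmin - coarse >= 0:
        return kmin - coarse, kmax - coarse
    if fill_between_watermarks and kmax + fine <= 100:
        return kmin + fine, kmax + fine
    return kmin, kmax
```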

11:42 "So the metric it's trying to drive is packet drops, rather than job completion time or something like that, which is a higher-level metric?"

11:51 You're right, because it doesn't have that notion: job completion time can only be measured by the application. But if congestion is happening, the packet losses can slow the job completion time, so it's trying to avoid that: slow the applications down, not to a full crawl, but reasonably, so they're still running. That may help the job completion time, compared with packet losses happening and retransmissions being triggered.

12:22 I was just going to add to what Vikram said: this is an application that runs on top of Apstra using the APIs. It's one implementation we've done, but in the end you can take it and do what you want. You're right that if you have access to job completion time, you can absolutely make it part of the algorithm you implement. We're going to experiment with a bunch of things and learn along with our customers about how to make this as effective as possible, so this is just one example implementation of what you can do.

12:51 "But the application people are never going to tell the network engineer about their jobs."

12:58 No, I mean, yes and no. What we've seen in terms of that divide is that they're never going to allow us to configure anything in the application. It's kind of like the integrations we have with, say, VMware: we don't go and configure VMware, but yes, we do get visibility into it, we get access to that telemetry. At the end of the day we're controlling the network, if they'll allow us; and yeah, it'll be even harder with the AI/ML applications.

13:28 Matas, I know you had a question earlier about how many drops and so on. Once we know that, we could make it tweakable: "how many drops do you want?" to push this further. You could do that. I just want to say that this is one idea we're testing, and as we do more experiments we're going to find more opportunities: what else can we tune, what other parameters, load balancing, other things we'll enhance this with. So Apstra now allows you not only to deploy and operate, but also helps you fine-tune some of these performance KPIs so you can run this optimally. With that, I'll let Raj do the demo.

14:09 Just to set the stage: like Vikram said, I think of packet drops as an edge. You want to get away from the edge as quickly as possible, and then, because the view is really nice from the edge, you want to creep closer and closer and find the spot where you're safe. That was kind of my mental image when I wrote this code. This application basically uses Apstra to manage the fabric. Jay spoke about configlets, which are basically config snippets you can push out through Apstra, and Jay spoke about probes, which basically tell you what's happening. So I'll just let it run.

15:04 We'll run it at two times speed, because we like fast. The situation we see here is that both ECN and PFC anomalies are happening on the fabric, because the Ixia is pushing traffic out. Whenever we see PFCs or tail drops, we start moving left, and that's basically what's going to happen. What you also see here is that, because it's so much fun to watch log messages, I put everything up on ServiceNow, so it updates an incident ticket. The general idea is that we want to make the network admin's life easy: the network admin would much rather see what happened than be told "something really bad happened." So as this application sees PFCs, it starts shifting left, basically shifting that window left, hopefully slowing the traffic down, and it keeps updating ServiceNow: "hey, I just dropped it down; I dropped it down some more; some more," and so on. That's what we're seeing here. While this runs: this was written in Python, using the REST API that Jay spoke about. And you can already see it started at 60/90 and has gone to 40/70. Two cycles have happened, it's at 40/70, and then it went to 30/60.

16:43 Now, at this point, within a couple of seconds it's going to notice that there are no PFCs. The PFCs are gone; there are still ECNs, but basically we've moved far away from the edge. We still want the application to breathe, so knowing we're away from the edge, we start moving right, to actually find the edge. So we go right, and we find the edge. It now knows that if it sees PFCs again, that's the danger zone (if I were a singer I'd start singing the Danger Zone song, but I'm not, so everyone is safe). It has seen the edge, it knows where the edge is, so now it wants to find a spot where it can still breathe but won't fall over. So it starts moving left again, slowly: when we moved initially, we moved in steps of 10; now we move in steps of 5, just to find the edge. From 40/70 it went to 35/65; it moved a little bit, and at this point I'm willing to take bets on whether it's going to stop here. I know what's going to happen.

18:02 We have about 80 seconds for your bets. The general idea is that because you're kind of safe here, you pause here longer, so the changes can propagate and the system can settle down for a bit. Once it settles in a particular zone for long enough, this particular application will just stop: it basically says "I think I found the safe spot," and it stops at that point. It could also potentially keep running: it could say "I don't see any PFCs, I haven't seen any PFCs in the last minute, minute and a half; maybe I can move again, maybe I can get to a better spot." So it can run in a non-stopping, while(1) mode, or it can say "I'm done, this is good, this is where I'm going to stop." It's basically up to you, and that's kind of the power of Apstra: you can build your own application like this. This whole concept did not exist three or four months ago, and now it's a thing, and we're talking to the Apstra PLM, so it's going to become a thing in Apstra. That's where we want to go: we want to use this as a general framework where anybody can develop their own applications with Apstra. This code is going to go out on GitHub like everything else, and anybody can do whatever they want with it. You want to make config backups? Make config backups, be happy. Use our code, be happy. That's really what we want. And as you can see: 30/60. If anybody had 30/60, you won. At 30/60 everything is happy, and it's stopped.

20:11 That's 30/60, yeah. There's honestly thesis-quality research to be done here, because we're doing something very simple and straightforward: the window is only 30 wide, and we move it in lock step. We could expand or contract the window; we could change the marking percentages. Right now the probability ramps linearly from 0 to 100 across the window, and you could play with that. Somebody asked whether we could change this based on completion time: of course we can. This is a pattern, and the pattern in which this window moved is something you could match against completion times, over a period of time, for different kinds of loads; you could figure out what the ideal pattern is, and then (I wish I had a joke about learning, but I don't) this application can be smart. It could know that this kind of workload probably needs this kind of window, and do that automatically.

21:31 "This Python application, you mean?"

21:33 Yeah, I just wrote Python code, man.

21:35 "Well, you're describing an ability of the software, of the networking software engineer, using whichever platform they prefer (you prefer Python), to do all these kinds of things."

21:47 Yeah, exactly. That's what I meant.

21:51 "But what we would like to see is that Apstra has an AI in it that determines..."

21:57 No, we don't... we don't want to say that at this moment.

22:00 "My question on the Python is: is there a good Python library for Apstra, or are we directly making REST API calls inside the Python code?"

22:09 A brand-new API library is coming out. I used an older version, but the latest... I like SDKs, and a really fancy SDK is coming out, so you'll have that for sure.
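Until then, the direct-REST route the question mentions can be as plain as the sketch below. Note that the controller address, endpoint path, and auth header here are illustrative placeholders rather than something copied from the Apstra API reference:

```python
import requests

APSTRA = "https://apstra.example.net"  # placeholder controller address
TOKEN = "..."                          # session token from the login endpoint

def get_blueprints():
    """Bare-bones direct-REST call. The path and auth header are written
    for illustration; check the published Apstra API reference."""
    resp = requests.get(
        f"{APSTRA}/api/blueprints",    # placeholder path
        headers={"AuthToken": TOKEN},  # placeholder auth header
        verify=False,                  # typical for a lab controller with a self-signed cert
    )
    resp.raise_for_status()
    return resp.json()
```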

22:29 "You didn't talk about flowlets in the optimization, but that's another potential thing you could turn on or off based on how things are going."

22:37 Yeah, yes. I mean, I don't make those decisions, but I would get some grad students for this work: they get a thesis, we get an application, everyone's happy. But you're right. We took the hardest problem first, congestion management in Ethernet, and now we're looking at others; what you pointed out is on our list.

23:07 To expand on that, there was an earlier question about SmartNICs as well. One of the ideas we've started building, an app two (let's call this one app one), is to look at congestion metrics not just on the switches but also on the SmartNICs. If you see, for example, out-of-order packets on the SmartNIC, then you know your DLB is not configured properly, so you can go tweak your DLB timeout on the switches to respond to congestion that you see on the SmartNIC. That's an app that's going to follow very soon after this one.
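Sketched as a single decision rule, that idea might look like this (the microsecond values are illustrative, not recommended settings):

```python
def tune_dlb_timeout(out_of_order_pkts, timeout_us, step_us=16, max_us=512):
    """Sketch of the 'app two' idea: out-of-order packets counted on the
    SmartNICs suggest the switches' dynamic load balancing is re-pathing
    flows mid-burst, so widen the DLB inactivity timeout. All numbers
    are illustrative placeholders."""
    if out_of_order_pkts > 0:
        return min(max_us, timeout_us + step_us)  # be less eager to re-path
    return timeout_us
```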

23:37 Yeah, thanks, Praful, I'd forgotten: Michael from Juniper is working on that. One of the next things I'm going to do is integrate his work with this code, so we can have more smartness.
