Rajagopalan Subrahmanian and Vikram Singh

Automated Congestion Management in the AI Data Center with Juniper Networks


Cloud Field Day 20: Automated Congestion Management in the AI Data Center

To maximize throughput and minimize packet loss, Ethernet fabrics use the DCQCN congestion management scheme, but DCQCN introduces significant operational complexity for human operators. Learn how Juniper Apstra takes this new challenge in stride, automatically optimizing for throughput and the "right amount" of packet loss.


You’ll learn

  • The importance of adjusting parameters based on real-time data rather than static, manual settings

  • How Apstra’s DCQCN auto-tune app monitors network KPIs and adjusts configs in real time

Who is this for?

Network Professionals, Business Leaders

Hosts

Rajagopalan Subrahmanian
Product Manager, Juniper Networks
Vikram Singh
Sr. Product Manager, Juniper Networks

Transcript

0:10 Hi, my name is Vikram Singh. I'm a PLM on the AI data center team, and I'm going to talk about automated congestion management in AI/ML back-end fabrics.

0:21 If I may draw an analogy for what we're going to present today: we're all familiar with metering lights, which are primarily used to manage congestion on our freeways during rush hour. The way this works is that cameras or sensors detect how much traffic is flowing, or how much congestion there is, on the main freeway, and then the meters change the duration of time the light stays red, which in practice controls how much traffic you're letting in. If you allow too much, it adds to the congestion and just worsens the situation; if the lights stay red for longer, you're creating congestion at a different point and not utilizing all the bandwidth available. So what we're trying to do is give you that flexibility of fine-tuning, depending on when congestion happens.

1:16 Just to describe the challenge, let's put ourselves in the shoes of a network admin. You're managing this back-end cluster, which Praful will talk about, and there are two types of complexity. One is monitoring complexity: there are so many entities you have to monitor. For example, here you see 256 GPUs. Even in a small cluster like that, you may have 12 switches with 64 ports each, so that's 768 ports; each port can have six queues, so very quickly you're at 5,000-plus entities to monitor, and you may need historical views, as Jay was showing.
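As a quick sanity check on that monitoring math, the arithmetic below just multiplies out the figures quoted in the talk:

```python
# Entity count for the small cluster described in the talk:
# 256 GPUs served by 12 switches with 64 ports each, 6 queues per port.
switches = 12
ports_per_switch = 64
queues_per_port = 6

ports = switches * ports_per_switch       # 768 ports
queues = ports * queues_per_port          # 4,608 queues
print(f"{ports} ports, {queues} queues")  # roughly the "5,000 plus" entities quoted
```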

2:00 Then there's provisioning complexity. There are only a few dials available to tune congestion (they're listed on the bottom right), but today it's a manual process: it's tedious, it's error-prone, and somehow the network admin is supposed to know the optimal values for this network without knowing anything about what traffic will come from these machine learning workloads. Sometimes you may need to tweak certain things for a particular model. More importantly, in a multi-tenant training network, where multiple jobs are running and their traffic flows through the same fabric, by the time you get a ticket, by the time somebody tells you "hey, my machine learning job is slow," you may not be able to recreate the other job that was interfering at that moment, so you don't know what to tweak. These are some of the challenges today, and that's what we're trying to solve through automation. So let's see how, with Apstra and all the things you've seen (Terraform and the probes), we can build something to solve it.

3:07 Here's what we've done: we've written this DCQCN auto-tune app. It leverages everything Jay showed you with the probes, so we are continuously monitoring the key KPIs that tell you when congestion happens, the performance KPIs (we'll do a demo soon, so you'll see them). Once we detect that the intent, the intended config, is not doing what it's supposed to do, we make a decision, and then we do closed-loop automation: we use Terraform to tweak that configuration throughout your fabric, and we continue to monitor the effect of that change, until you have an optimal set of these congestion configurations.
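Schematically, that monitor-decide-apply loop might look like the sketch below. The helper functions are stand-ins rather than Apstra or Terraform APIs; they only mark where the probe reads and the config pushes would go:

```python
import time

def read_congestion_kpis(fabric):
    """Stand-in for reading PFC/ECN/tail-drop counters from the Apstra probes."""
    return {"pfc_frames": 0, "tail_drops": 0}

def apply_dcqcn_config(fabric, kmin, kmax):
    """Stand-in for pushing new ECN thresholds to the fabric, e.g. via Terraform."""
    print(f"{fabric}: set Kmin={kmin}% Kmax={kmax}%")

def autotune_loop(fabric, kmin=60, kmax=90, step=10, interval_s=30):
    """Closed-loop sketch: monitor -> decide -> apply -> monitor again."""
    while True:
        kpis = read_congestion_kpis(fabric)
        if kpis["pfc_frames"] or kpis["tail_drops"]:
            # PFC frames or tail drops mean ECN is reacting too late:
            # shift the marking window left so marking starts sooner.
            kmin, kmax = max(0, kmin - step), max(0, kmax - step)
            apply_dcqcn_config(fabric, kmin, kmax)
        time.sleep(interval_s)  # let the fabric settle before re-evaluating
```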

3:51 There was a question about how we're creating this. In this demo, this is our lab, which you'll be touring shortly, and we use Ixia to generate RoCEv2 traffic. We're actually generating so much traffic that it overwhelms the receiving NICs. It goes through four leaf and spine switches, and the congestion manifests at the egress leaves; that's where the buffers are filling up, and that's where we apply the algorithm I'm going to show you. We monitor the ECN and PFC counters and the tail drops via Apstra, then we make a decision and push out the config.

4:32 Now, there are a few options here. You can say "wherever the congestion is happening, only tweak the congestion parameters on that switch." Or, since Apstra has this concept of leaf and spine that Jay showed you, if it's happening on one leaf you can choose to apply the change to all the other leaves, because it may happen on those leaves as well, or to the spines. We also have ServiceNow integration, so we push every decision this app makes onto ServiceNow.
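As one hypothetical illustration of that last piece, appending an auto-tune decision to an existing incident through ServiceNow's standard Table API could look like this (the instance name, credentials, and sys_id are placeholders):

```python
import requests

def log_decision(instance, user, password, sys_id, note):
    """Append an auto-tune decision as a work note on an existing incident
    using ServiceNow's Table API. Instance, credentials, and sys_id are
    placeholders for illustration."""
    url = f"https://{instance}.service-now.com/api/now/table/incident/{sys_id}"
    resp = requests.patch(
        url,
        auth=(user, password),
        headers={"Accept": "application/json"},
        json={"work_notes": note},
    )
    resp.raise_for_status()

# e.g. log_decision("dev12345", "apstra.bot", "secret", "<incident sys_id>",
#                   "PFCs seen on leaf2: ECN window shifted left to Kmin=50/Kmax=80")
```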

5:04 Praful went through this, and I'm going to go into a little more detail before I show you the algorithm we're using. As you know, with DCQCN there are two primary mechanisms in Ethernet to control congestion. The first one is PFC. It came first, and it's more of a brute-force, hammer approach. In this picture, look at the red link on the spine: let's say it's congested. The switch monitors its buffer, and if occupancy gets to around 90% (it's configurable), to a point where packet drops are imminent because the buffers are about to fill up, it sends a pause frame to the neighboring switch downstream, saying "stop sending me traffic of this priority for a particular amount of time." The way priority works is that the NICs usually mark a DSCP value on all RDMA traffic (there's a default value for NVIDIA and others), and you map that into a queue where you apply both of these techniques. Eventually the paused switch gets into trouble too and its buffers fill up, the pause frames reach the NICs, and they stop. The goal here is to make Ethernet lossless, and this is what makes it lossless. But the problem with this approach is that it has no granularity: if one link is blocked, it blocks all traffic of that priority from the neighboring switch, which has 62 or 63 other ports where the traffic could be perfectly fine. That's what's troublesome about this technique.
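For a sense of scale on those pause frames: the pause time in a PFC frame is expressed in quanta of 512 bit times (IEEE 802.1Qbb), so the same quanta value pauses for far less wall-clock time on a faster link. A quick calculation:

```python
def pfc_pause_seconds(quanta, link_speed_bps):
    """A PFC pause quantum is 512 bit times (IEEE 802.1Qbb), so the same
    quanta value pauses for less wall-clock time on a faster link."""
    return quanta * 512 / link_speed_bps

# Longest possible pause (the 16-bit field's maximum, 65535 quanta):
for gbps in (100, 400):
    t_us = pfc_pause_seconds(65535, gbps * 1e9) * 1e6
    print(f"{gbps}G link: max pause ~ {t_us:.1f} microseconds")  # ~335.5 / ~83.9
```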

6:48 But then there is ECN. With ECN, take the same example: one of the spine's interfaces is congested. You have these two knobs you can configure, and the switch starts probabilistically marking packets. These are one-way flows: one GPU is trying to write a memory chunk into another GPU's memory, and the NICs are continuously looking for this marking. In the IP header, right next to the DSCP bits, there's an ECN bit that the switch flips. The receiver detects that there is congestion somewhere in the fabric and sends an explicit congestion notification packet back to the sender, and then the sender backs off the flow: if it was sending at 400, it steps down to 200, and so on. This is more effective because (a) it's granular, so you're only impacting the flows that pass through the particular congested queue, and (b) it actually remediates the problem, because as the flows back down, the congestion goes away, at least temporarily. So DCQCN is really the judicious use of both of these techniques, or sometimes, as Praful was saying, you can use only ECN as your primary method. That is DCQCN.

8:08 "So you mentioned that PFC causes all the traffic at that priority to stop?"

8:16 Yes, from the neighboring switch.

8:19 "Yeah, on that interface."
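To make the sender-side ECN reaction described a moment ago concrete, here is a deliberately simplified sketch. Real DCQCN scales the rate cut with an estimated congestion level (alpha) and has distinct recovery stages; this version just halves on a CNP and creeps back up, matching the 400-to-200 example from the talk:

```python
def on_cnp(rate_gbps):
    """Deliberately simplified sender reaction: halve the flow's rate when a
    Congestion Notification Packet (CNP) arrives (the talk's 400 -> 200 example).
    Real DCQCN scales the cut with an estimated congestion level (alpha)."""
    return rate_gbps / 2

def on_quiet_period(rate_gbps, line_rate_gbps, step_gbps=25):
    """Simplified recovery: creep back toward line rate while no CNPs arrive.
    Real DCQCN has distinct fast-recovery and additive-increase stages."""
    return min(line_rate_gbps, rate_gbps + step_gbps)
```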

8:23 Okay, so let's see what we've done. I'm simplifying a little here, but this makes it easy to understand. The x-axis is the buffer occupancy, in percent. Whenever congestion happens, a switch is congested at some queue on some port, and this is what you apply: there are two configuration parameters for ECN, Kmin and Kmax. As the buffers fill up and hit the Kmin lower watermark, the switch starts probabilistically marking packets; if you're somewhere around 10%, say, the switch marks one in ten packets with ECN. If ECN doesn't kick in fast enough and you keep filling the buffers past Kmax, that's when every packet in the queue gets marked ECN, which means all the flows are impacted.
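That marking curve is the classic WRED-style ramp; a minimal sketch (with Kmin=0 and Kmax=100 it reproduces the one-in-ten example, which matches the linear 0-to-100 curve mentioned later in the talk):

```python
def ecn_mark_probability(fill_pct, kmin, kmax, pmax=1.0):
    """WRED-style ECN marking curve from the talk: no marking below Kmin,
    a linear ramp between Kmin and Kmax, and mark everything past Kmax."""
    if fill_pct < kmin:
        return 0.0
    if fill_pct >= kmax:
        return 1.0
    return pmax * (fill_pct - kmin) / (kmax - kmin)

# With the linear 0-to-100 curve, a buffer that is 10% full marks
# roughly one packet in ten:
print(ecn_mark_probability(10, kmin=0, kmax=100))  # 0.1
```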

9:23 And then you have PFC. Sending these control packets back to the sender takes time, and if there's some delay in there, ECN may still not act fast enough; that's when you keep PFC as insurance, to kick in and buy some time for ECN to eventually take effect and slow things down. Now, this is the set of values a network admin is somehow supposed to know: "what should my pattern be?" The common practice is to shift them all the way right, starting very high, because you don't want to slow things down unnecessarily. But with that approach you don't have much buffer left to actually prevent drops. So what we've done here is keep monitoring; what you see on the top right is how we adjust left. Everybody starts with a very relaxed ECN config so all the applications can breathe and run at the maximum possible rate; your load balancing is your first line of defense. But if congestion still happens, we keep watching buffer occupancy by checking whether any PFCs are being triggered or whether tail drops are happening. If they are, that indicates your ECN is not reacting fast enough with whatever config you set, so we need to move a little to the left and trigger ECN sooner, so that the eventuality of dropping packets is averted. That's the decision we make: we keep moving left, then watch the impact; if PFCs are still triggered or packets are still dropping, we keep moving left until the system is stable.

11:07 Once we've done that, how do we move right? When you've avoided the bogeyman, the situation that may cause packet loss, we start checking: if the buffer occupancy sits between your low and high watermarks, that indicates my ECN is tuned well and the round-trip times for these CNPs are right, so let me actually allow the applications to run at higher bandwidth, to breathe more, and move right. It will help you find the most optimal value based on whatever workloads come in, in real time.
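Condensed into one decision step, the rule might look like the sketch below. It collapses the coarse shift-left and finer shift-right phases of the demo into a single function; the step sizes match the steps of 10 and 5 used there:

```python
def next_window(kmin, kmax, saw_pfc_or_drops, fill_between_watermarks,
                coarse=10, fine=5):
    """One decision step of the auto-tune logic, condensed. The window keeps
    its width; only its position moves.
      - PFCs or tail drops: ECN is firing too late, shift left (coarse step).
      - occupancy holding between Kmin and Kmax: nicely tuned, shift right
        (fine step) to give the applications more headroom.
      - otherwise: hold position and keep watching."""
    if saw_pfc_or_drops and kmin - coarse >= 0:
        return kmin - coarse, kmax - coarse
    if fill_between_watermarks and kmax + fine <= 100:
        return kmin + fine, kmax + fine
    return kmin, kmax
```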

11:42 "So the metric it's trying to drive is packet drops, rather than job completion time or something like that, which is a higher-level metric?"

11:51 You're right, because it doesn't have that notion: job completion time can only be measured by the application. But if congestion is happening, the packet losses can slow the job completion time, so it's trying to avoid that: slow the applications down, not to a full crawl, but reasonably, so they're still running. That may help the job completion time, compared with packet losses happening and retransmissions being triggered.

12:22 I was just going to add to what Vikram said: this is an application that runs on top of Apstra using the APIs. It's one implementation we've done, but in the end you can take it and do what you want. You're right that if you have access to job completion time, you can absolutely make it part of the algorithm you implement. We're going to experiment with a bunch of things and learn along with our customers about how to make this as effective as possible, so this is just one example implementation of what you can do.

12:51 "But the application people are never going to tell the network engineer about their jobs."

12:58 No, I mean, yes and no. What we've seen in terms of that divide is that they're never going to allow us to configure anything in the application. It's kind of like the integrations we have with, say, VMware: we don't go and configure VMware, but yes, we do get visibility into it, we get access to that telemetry. At the end of the day we're controlling the network, if they'll allow us; and yeah, it'll be even harder with the AI/ML applications.

13:28 Matas, I know you had a question earlier about how many drops and so on. Once we know that, we could make it tweakable: "how many drops do you want?" to push this further. You could do that. I just want to say that this is one idea we're testing, and as we do more experiments we're going to find more opportunities: what else can we tune, what other parameters, load balancing, other things we'll enhance this with. So Apstra now allows you not only to deploy and operate, but also helps you fine-tune some of these performance KPIs so you can run this optimally. With that, I'll let Raj do the demo.

14:09 Just to set the stage: like Vikram said, I think of packet drops as an edge. You want to get away from the edge as quickly as possible, and then, because the view is really nice from the edge, you want to creep closer and closer and find the spot where you're safe. That was kind of my mental image when I wrote this code. This application basically uses Apstra to manage the fabric. Jay spoke about configlets, which are basically config snippets you can push out through Apstra, and Jay spoke about probes, which basically tell you what's happening. So I'll just let it run.

15:04 We'll run it at two times speed, because we like fast. The situation we see here is that both ECN and PFC anomalies are happening on the fabric, because the Ixia is pushing traffic out. Whenever we see PFCs or tail drops, we start moving left, and that's basically what's going to happen. What you also see here is that, because it's so much fun to watch log messages, I put everything up on ServiceNow, so it updates an incident ticket. The general idea is that we want to make the network admin's life easy: the network admin would much rather see what happened than be told "something really bad happened." So as this application sees PFCs, it starts shifting left, basically shifting that window left, hopefully slowing the traffic down, and it keeps updating ServiceNow: "hey, I just dropped it down; I dropped it down some more; some more," and so on. That's what we're seeing here. While this runs: this was written in Python, using the REST API that Jay spoke about. And you can already see it started at 60/90 and has gone to 40/70. Two cycles have happened, it's at 40/70, and then it went to 30/60.

16:43 Now, at this point, within a couple of seconds it's going to notice that there are no PFCs. The PFCs are gone; there are still ECNs, but basically we've moved far away from the edge. We still want the application to breathe, so knowing we're away from the edge, we start moving right, to actually find the edge. So we go right, and we find the edge. It now knows that if it sees PFCs again, that's the danger zone (if I were a singer I'd start singing the Danger Zone song, but I'm not, so everyone is safe). It has seen the edge, it knows where the edge is, so now it wants to find a spot where it can still breathe but won't fall over. So it starts moving left again, slowly: when we moved initially, we moved in steps of 10; now we move in steps of 5, just to find the edge. From 40/70 it went to 35/65; it moved a little bit, and at this point I'm willing to take bets on whether it's going to stop here. I know what's going to happen.

18:02 We have about 80 seconds for your bets. The general idea is that because you're kind of safe here, you pause here longer, so the changes can propagate and the system can settle down for a bit. Once it settles in a particular zone for long enough, this particular application will just stop: it basically says "I think I found the safe spot," and it stops at that point. It could also potentially keep running: it could say "I don't see any PFCs, I haven't seen any PFCs in the last minute, minute and a half; maybe I can move again, maybe I can get to a better spot." So it can run in a non-stopping, while(1) mode, or it can say "I'm done, this is good, this is where I'm going to stop." It's basically up to you, and that's kind of the power of Apstra: you can build your own application like this. This whole concept did not exist three or four months ago, and now it's a thing, and we're talking to the Apstra PLM, so it's going to become a thing in Apstra. That's where we want to go: we want to use this as a general framework where anybody can develop their own applications with Apstra. This code is going to go out on GitHub like everything else, and anybody can do whatever they want with it. You want to make config backups? Make config backups, be happy. Use our code, be happy. That's really what we want. And as you can see: 30/60. If anybody had 30/60, you won. At 30/60 everything is happy, and it's stopped.

20:11 That's 30/60, yeah. There's honestly thesis-quality research to be done here, because we're doing something very simple and straightforward: the window is only 30 wide, and we move it in lock step. We could expand or contract the window; we could change the marking percentages. Right now the probability ramps linearly from 0 to 100 across the window, and you could play with that. Somebody asked whether we could change this based on completion time: of course we can. This is a pattern, and the pattern in which this window moved is something you could match against completion times, over a period of time, for different kinds of loads; you could figure out what the ideal pattern is, and then (I wish I had a joke about learning, but I don't) this application can be smart. It could know that this kind of workload probably needs this kind of window, and do that automatically.

21:31 "This Python application, you mean?"

21:33 Yeah, I just wrote Python code, man.

21:35 "Well, you're describing an ability of the software, of the networking software engineer, using whichever platform they prefer (you prefer Python), to do all these kinds of things."

21:47 Yeah, exactly. That's what I meant.

21:51 "But what we would like to see is that Apstra has an AI in it that determines..."

21:57 No, we don't... we don't want to say that at this moment.

22:00 "My question on the Python is: is there a good Python library for Apstra, or are we directly making REST API calls inside the Python code?"

22:09 A brand-new API library is coming out. I used an older version, but the latest... I like SDKs, and a really fancy SDK is coming out, so you'll have that for sure.
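Until then, the direct-REST route the question mentions can be as plain as the sketch below. Note that the controller address, endpoint path, and auth header here are illustrative placeholders rather than something copied from the Apstra API reference:

```python
import requests

APSTRA = "https://apstra.example.net"  # placeholder controller address
TOKEN = "..."                          # session token from the login endpoint

def get_blueprints():
    """Bare-bones direct-REST call. The path and auth header are written
    for illustration; check the published Apstra API reference."""
    resp = requests.get(
        f"{APSTRA}/api/blueprints",    # placeholder path
        headers={"AuthToken": TOKEN},  # placeholder auth header
        verify=False,                  # typical for a lab controller with a self-signed cert
    )
    resp.raise_for_status()
    return resp.json()
```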

22:29 "You didn't talk about flowlets in the optimization, but that's another potential thing you could turn on or off based on how things are going."

22:37 Yeah, yes. I mean, I don't make those decisions, but I would get some grad students for this work: they get a thesis, we get an application, everyone's happy. But you're right. We took the hardest problem first, congestion management in Ethernet, and now we're looking at others; what you pointed out is on our list.

23:07 To expand on that, there was an earlier question about SmartNICs as well. One of the ideas we've started building, an app two (let's call this one app one), is to look at congestion metrics not just on the switches but also on the SmartNICs. If you see, for example, out-of-order packets on the SmartNIC, then you know your DLB is not configured properly, so you can go tweak your DLB timeout on the switches to respond to congestion that you see on the SmartNIC. That's an app that's going to follow very soon after this one.
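Sketched as a single decision rule, that idea might look like this (the microsecond values are illustrative, not recommended settings):

```python
def tune_dlb_timeout(out_of_order_pkts, timeout_us, step_us=16, max_us=512):
    """Sketch of the 'app two' idea: out-of-order packets counted on the
    SmartNICs suggest the switches' dynamic load balancing is re-pathing
    flows mid-burst, so widen the DLB inactivity timeout. All numbers
    are illustrative placeholders."""
    if out_of_order_pkts > 0:
        return min(max_us, timeout_us + step_us)  # be less eager to re-path
    return timeout_us
```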

23:37 Yeah, thanks, Praful, I'd forgotten: Michael from Juniper is working on that. One of the next things I'm going to do is integrate his work with this code, so we can have more smartness.
