AIDC Now: Backend AI Training Data Centers


In this AIDC Now demo, we’ll share how quick and easy it is to configure a backend data center for AI training using Juniper Validated Designs (JVDs) built specifically for rail-optimized deployments. Rail-optimized designs maximize performance and efficiency while reducing network costs.


You’ll learn

  • How Apstra helps automate rail-optimized designs

Who is this for?

  • Network Professionals

  • Business Leaders

Transcript

0:00 With this demo, I'm going to share how our representative customer

0:04 A1 Arcade, can quickly and easily build a backend AI

0:08 training data center.

0:10 This DC is made of racks built with two spines, eight leaf switches, and 16

0:14 GPU servers for a total of 128 GPUs per rack.

0:19 Apstra allows us to drill down for a closer look at the network template,

0:23 which in this case is a relatively small data center of 512 GPUs.

0:28 Drilling down further, we can see the actual data center

0:31 design and look at more details of the links and nodes.

0:34 As you can see, this particular network template is

0:37 made up of four of the 128 GPU racks.
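
As a quick sanity check on the arithmetic in the transcript, here is a minimal sketch; the constants come from the numbers stated above, and none of the names are Apstra constructs:

```python
# Fabric scale described in the demo (illustrative sketch, not Apstra config).
SERVERS_PER_RACK = 16   # GPU servers per rack
GPUS_PER_SERVER = 8     # one GPU per rail
LEAVES_PER_RACK = 8     # one leaf switch per rail
SPINES_PER_RACK = 2
RACKS = 4               # this template uses four 128-GPU racks

gpus_per_rack = SERVERS_PER_RACK * GPUS_PER_SERVER  # 16 * 8 = 128
total_gpus = gpus_per_rack * RACKS                  # 128 * 4 = 512

print(f"{gpus_per_rack} GPUs per rack, {total_gpus} GPUs total")
# → 128 GPUs per rack, 512 GPUs total
```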

0:41 By zooming in on one of these racks, it becomes clear

0:44 that this isn't a normal Clos fabric.

0:46 So let's take a second to talk about the fabric design.

0:49 This is a rail-optimized fabric introduced by NVIDIA, which provides a dense

0:54 mesh of links and the capacity to handle massive datasets, elephant flows,

0:59 GPU-to-GPU,

1:02 GPU-to-memory, and GPU-to-storage traffic.

1:05 So what does the training process look like in this scenario?

1:09 Training requires many cycles

1:11 or epochs, to train an AI model.

1:13 Datasets are chopped up into smaller flows, which are distributed

1:17 across a network fabric to GPUs, where parallel processing shortens

1:21 the compute cycles.

1:23 That data is again sent across the fabric

1:26 where it is combined before another epoch can be run.

1:29 This is where network performance becomes so critical.

1:32 It's the key to reducing tail latency and shortening the job completion time.

1:37 So while your network is the smallest investment in an

1:40 AI data center, it's also the most critical to AI

1:43 training and GPU efficiency. Do it right,

1:46 and everything works together.

1:47 Do it wrong, and you waste vast amounts of time and money.

1:52 To optimize GPU efficiency, increase speed, and minimize cost,

1:56 we use these rail-optimized designs along with protocols like RoCEv2

2:01 to provide one-hop connectivity between GPUs.

2:04 In this design, GPU1 from each server

2:07 is connected to leaf switch 1, which is called a rail.

2:11 GPU2 from each server is connected to leaf switch 2 and so on.

2:15 Typically across eight switches, something Juniper refers to as a stripe.
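
The rail wiring just described can be sketched as follows; the names (`leafN`, `serverN/gpuN`) are illustrative placeholders, not Apstra or Juniper identifiers:

```python
# Sketch of rail-optimized wiring: GPU r of every server connects to leaf
# switch r, so each leaf aggregates one "rail" of same-numbered GPUs.
NUM_RAILS = 8     # one leaf per rail; eight leaves form a "stripe"
NUM_SERVERS = 16  # GPU servers in the rack

def rail_wiring(num_servers, num_rails):
    """Map each leaf (rail) to the GPU ports it aggregates."""
    return {
        f"leaf{rail + 1}": [
            f"server{s + 1}/gpu{rail + 1}" for s in range(num_servers)
        ]
        for rail in range(num_rails)
    }

wiring = rail_wiring(NUM_SERVERS, NUM_RAILS)
# Every GPU1 lands on leaf1, so same-rail GPU-to-GPU traffic is one hop.
print(wiring["leaf1"][:2])  # → ['server1/gpu1', 'server2/gpu1']
```

Because same-numbered GPUs share a leaf, most GPU-to-GPU traffic stays on its rail and never has to traverse the spines, which is where the one-hop benefit comes from.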

2:20 As shown in our Apstra overview demo,

2:22 we can simplify the implementation of these fabrics by leveraging JVDs

2:27 that are specifically designed for rail-optimized deployments.
