WEKA Storage Solution
The WEKA Data Platform is a software-based solution built to modernize enterprise data stacks. Its advanced AI-native, data pipeline-oriented architecture delivers high performance at scale, so AI workloads run faster and work more efficiently.
We selected the WEKA Data Platform as part of the AI JVD design due to the following benefits:
- High Performance: WEKA's architecture is designed for extreme performance, making it suitable for AI/ML workloads, big data analytics, and high-performance computing (HPC) environments.
- Scalability: WEKA can scale from a few terabytes to exabytes of data, allowing customers to grow their storage capacity without compromising performance. WEKA's distributed architecture differs from typical scale-up storage systems, appliances, and hypervisor-based, software-defined storage solutions. It overcomes the traditional storage scaling and file-sharing limitations that can bottleneck large-scale AI deployments, making it a preferred choice for customers.
- Unified Storage: WEKA provides a single storage solution that supports multiple protocols (for example, NFS, SMB, POSIX, and S3), giving flexibility in how data is accessed and managed, and it supports NVIDIA GPUDirect Storage access.
- Data Resilience: WEKA offers advanced data protection features, including erasure coding, which protects data against hardware failures. With a minimum configuration of six storage servers, the cluster can survive a two-server failure.
- Ease of Management: WEKA's software-defined storage solution is easy to deploy and manage, with a user-friendly interface and automated management features. It can be installed on any standard AMD EPYC™ or Intel Xeon™ Scalable Processor-based hardware with the appropriate memory, CPU, networking, and NVMe solid-state drives.
- Support for GPUs: WEKA is optimized for GPU acceleration, making it well suited to environments that rely heavily on GPU computing, such as AI and machine learning applications.
- Low Latency: WEKA's architecture allows very low-latency access to data, which is crucial for applications that require real-time data processing.
WEKA Storage Cluster in the AI JVD Lab
We built the WEKA storage cluster with eight Supermicro-based servers connected to the Storage Backend fabric, providing 242TB of usable storage. WEKA recommends eight cluster nodes and requires a minimum of six nodes for production deployment.
Each WEKA Server has the following specifications:
- AMD EPYC 9454P processors
- 384GB System Memory
- OS drives: 2x 1.92TB M.2 NVMe Data Center SSD (PCIe 4.0)
- Data drives: 7x 7.68TB U.2 NVMe Data Center SSD (PCIe 4.0)
- Onboard OOB network connection (RJ45) and the following additional interface cards:
  - 1 x NVIDIA Mellanox ConnectX-6 DX Adapter Card, 100GE, dual-port QSFP28, PCIe 4.0 x16
  - 2 x NVIDIA Mellanox ConnectX-6 VPI Adapter Card, HDR IB & 200GE, dual-port QSFP56, OCP 3.0
- Software:
  - Operating system: Ubuntu 22.04 LTS
  - WEKA release tested in this design: 4.2.5
  - WEKA Flash Tier license with Snapshot and high-performance protocol services (POSIX, NFS-W, S3, and SMB-W)
Common Setting Changes Required
WEKA strongly recommends specific BIOS settings and matching Mellanox driver versions across all nodes. For convenience, these changes are documented here.
WEKA provides the WEKA Management Station (WMS), a tool that can automate the BIOS setting changes, verify your configuration (including driver revisions), and deploy your WEKA version. It can be downloaded from the WEKA website: https://get.weka.io/ui/wms/download. Juniper highly recommends using the WMS to configure the WEKA cluster. All the devices are configured to perform ECMP load balancing, as explained later in the document.
BIOS settings:
The BIOS settings can be changed by applying the bios_settings.yml:
Supermicro:
  AMD:
    ACPISRATL3CacheAsNUMADomain#0099: Disabled
    IOMMU#00EA: Disabled
    NUMANodesPerSocket#703F: Auto
    SMTControl#00CB: Disabled
    SR-IOVSupport#0067: Enabled
    DFCstates#7104: Disabled
    GlobalC-stateControl#00CD: Disabled
This is an AMD CPU-powered cluster; the settings may differ for Intel-based CPUs.
For more details on how to apply these changes, refer to the weka/bios_tool repository on GitHub, a tool for viewing and setting BIOS settings on WEKA servers.
Network Configuration for the Juniper WEKA Cluster
As described in the Storage Backend sections, the WEKA servers are dual-homed and connect to separate storage backend switches (storage-backend-weka-leaf1 and storage-backend-weka-leaf2) using the 200GE ports on the NVIDIA Mellanox ConnectX-6 VPI Adapter Cards. The additional QSFP28 100GE ports are not used in this JVD but can be used for front-end ingress/egress traffic, staging, and management.
Figure 98: Storage Interface Connectivity
The switch-side ports must be configured with auto-negotiation disabled and the speed set to 200G.
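The switch-side port requirement can be expressed as Junos set commands. The following is a minimal sketch, not the configuration generated by Apstra: it is a small shell helper that prints the commands for a list of ports. The helper name gen_storage_port_config and the interface names are hypothetical, and the statement syntax (speed 200g, ether-options no-auto-negotiation) is an assumption that should be verified against the switch platform and Junos release in use.

```shell
#!/bin/bash
# Hypothetical helper: print Junos set commands that fix the port speed at
# 200G and disable auto-negotiation for each interface name given as an
# argument. The statement syntax is an assumption; verify it on your platform.
gen_storage_port_config() {
  local port
  for port in "$@"; do
    printf 'set interfaces %s speed 200g\n' "$port"
    printf 'set interfaces %s ether-options no-auto-negotiation\n' "$port"
  done
}
```

For example, running gen_storage_port_config et-0/0/0 et-0/0/1 prints four set commands that can be reviewed before they are applied to a leaf switch.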
OFED Drivers:
WEKA recommends following NVIDIA's guidance for OFED (Mellanox) drivers when using ConnectX cards; see Installing Mellanox OFED in the NVIDIA documentation.
The driver release should be 5.8 or later.
Ensure that the OFED driver version is the same on all nodes in the WEKA cluster (for example, verify that weka01 runs the same OFED release as the other nodes).
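One way to confirm that the driver versions are aligned is to collect the output of ofed_info -s from every node and compare the results. The following is a minimal sketch under stated assumptions: check_ofed_alignment is a hypothetical helper, and it expects one "hostname version" pair per line on standard input.

```shell
#!/bin/bash
# Hypothetical helper: read "host version" pairs on stdin and report whether
# every node runs the same OFED release. Returns 0 when aligned, 1 otherwise.
check_ofed_alignment() {
  awk '
    NF >= 2 { seen[$2] = seen[$2] " " $1 }   # group hostnames by version string
    END {
      n = 0
      for (v in seen) n++
      if (n == 1) { print "OK: all nodes report the same OFED release"; exit 0 }
      for (v in seen) printf "MISMATCH %s:%s\n", v, seen[v]
      exit 1
    }
  '
}
```

The input could be gathered with a loop such as for h in weka01 weka02; do echo "$h $(ssh "$h" ofed_info -s)"; done | check_ofed_alignment, where every hostname other than weka01 is a placeholder.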
For Ubuntu, the following command is recommended:
./mlnxofedinstall --force --dkms --all
The following script can also be run (as root) on all machines to set the appropriate Mellanox firmware settings.
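#!/bin/bash
# Start the Mellanox Software Tools service so devices appear under /dev/mst
mst start
for MLXDEV in /dev/mst/* ; do
  # Apply the firmware settings, then reset the device so they take effect
  mlxconfig -d "${MLXDEV}" -y s ADVANCED_PCI_SETTINGS=1 PCI_WR_ORDERING=1
  mlxfwreset -y -d "${MLXDEV}" reset
done
# Re-apply the network configuration after the firmware reset
netplan apply
mst stop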
Best Practices for WEKA Data Platform with Juniper Switches
Our cluster is configured using the WEKA distributed POSIX client, which requires some tuning to integrate with the rest of the design.
We recommend the following:
- Set the MTU to 9000.
- If the back-end storage fabric is shared with another resource, set up appropriate CoS prioritization to ensure that AI ingest and checkpoint traffic is not interrupted by network I/O requests from other applications.
- If GPUDirect Storage is used instead of the WEKA distributed POSIX client, congestion management and mitigation using Explicit Congestion Notification (ECN) and Priority Flow Control (PFC) must be configured on the network.
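The MTU recommendation can be verified on each server by reading the interface MTU from sysfs. The following is a minimal sketch, assuming a Linux host; check_mtu is a hypothetical helper, and its third argument (the sysfs root) exists only so that the function can be exercised without a real interface.

```shell
#!/bin/bash
# Hypothetical helper: compare an interface's MTU with the expected value.
# Usage: check_mtu <interface> [expected_mtu] [sysfs_root]
# Returns 0 on match, 1 on mismatch, 2 if the interface cannot be read.
check_mtu() {
  local iface="$1" expected="${2:-9000}" root="${3:-/sys/class/net}"
  local actual
  actual=$(cat "$root/$iface/mtu" 2>/dev/null) || { echo "UNKNOWN $iface"; return 2; }
  if [ "$actual" -eq "$expected" ]; then
    echo "OK $iface mtu=$actual"
  else
    echo "MISMATCH $iface mtu=$actual expected=$expected"
    return 1
  fi
}
```

On a storage server, this could be called once per 200GE interface, for example check_mtu ens1f0 9000, where ens1f0 is a placeholder interface name.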
WEKA also provides tools that can be used to test and measure network activity from a WEKA system perspective.
The weka stats command-line tool reports network goodput as a percentage:
weka stats --start-time -24h --end-time -1m --show-internal --stat GOODPUT_TX_RATIO,GOODPUT_RX_RATIO
Any value below 85% indicates potential issues that require further examination.
Examples:
NODE  CATEGORY  TIMESTAMP            STAT              VALUE
all   network   2024-06-14T12:58:00  GOODPUT_RX_RATIO  99.7636 %
all   network   2024-06-14T12:58:00  GOODPUT_TX_RATIO  99.7636 %
all   network   2024-06-14T12:57:00  GOODPUT_RX_RATIO  99.7663 %
all   network   2024-06-14T12:57:00  GOODPUT_TX_RATIO  99.7663 %
all   network   2024-06-14T12:56:00  GOODPUT_RX_RATIO  99.752 %
all   network   2024-06-14T12:56:00  GOODPUT_TX_RATIO  99.752 %
all   network   2024-06-14T12:55:00  GOODPUT_RX_RATIO  99.7578 %
all   network   2024-06-14T12:55:00  GOODPUT_TX_RATIO  99.7578 %
all   network   2024-06-14T12:54:00  GOODPUT_RX_RATIO  99.7795 %
all   network   2024-06-14T12:54:00  GOODPUT_TX_RATIO  99.7795 %
all   network   2024-06-14T12:53:00  GOODPUT_RX_RATIO  99.7685 %
all   network   2024-06-14T12:53:00  GOODPUT_TX_RATIO  99.7685 %
all   network   2024-06-14T12:52:00  GOODPUT_RX_RATIO  99.775 %
all   network   2024-06-14T12:52:00  GOODPUT_TX_RATIO  99.775 %
weka stats --category=network --show-internal --stat DROPPED_PACKETS --start-time -24h --end-time -1m -Z
NODE  CATEGORY  TIMESTAMP            STAT             VALUE
all   network   2024-06-14T13:06:00  DROPPED_PACKETS  0 Packets/Sec
all   network   2024-06-14T13:05:00  DROPPED_PACKETS  0 Packets/Sec
all   network   2024-06-14T13:04:00  DROPPED_PACKETS  0 Packets/Sec
all   network   2024-06-14T13:03:00  DROPPED_PACKETS  0 Packets/Sec
all   network   2024-06-14T13:02:00  DROPPED_PACKETS  0 Packets/Sec
all   network   2024-06-14T13:01:00  DROPPED_PACKETS  0 Packets/Sec
all   network   2024-06-14T13:00:00  DROPPED_PACKETS  0 Packets/Sec
all   network   2024-06-14T12:59:00  DROPPED_PACKETS  0 Packets/Sec
all   network   2024-06-14T12:58:00  DROPPED_PACKETS  0 Packets/Sec
all   network   2024-06-14T12:57:00  DROPPED_PACKETS  0 Packets/Sec
all   network   2024-06-14T12:56:00  DROPPED_PACKETS  0 Packets/Sec
all   network   2024-06-14T12:55:00  DROPPED_PACKETS  0 Packets/Sec
all   network   2024-06-14T12:54:00  DROPPED_PACKETS  0 Packets/Sec
all   network   2024-06-14T12:53:00  DROPPED_PACKETS  0 Packets/Sec
The example above shows zero dropped packets. If the weka stats command reports a nonzero DROPPED_PACKETS value, further investigation is warranted.
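Both checks (goodput below 85% and nonzero dropped packets) can be automated by filtering the weka stats output. The following is a minimal sketch, assuming the column layout shown in the examples (NODE, CATEGORY, TIMESTAMP, STAT, VALUE); check_weka_net_stats is a hypothetical helper and is not part of the WEKA tooling.

```shell
#!/bin/bash
# Hypothetical helper: read weka stats output on stdin and print every sample
# that needs attention. The optional argument is the minimum acceptable
# goodput percentage (default 85). Returns 1 if any sample is flagged.
check_weka_net_stats() {
  local min_goodput="${1:-85}"
  awk -v min="$min_goodput" '
    # Column 4 is the stat name, column 5 the numeric value
    $4 ~ /^GOODPUT_(TX|RX)_RATIO$/ && $5 + 0 < min {
      printf "LOW_GOODPUT %s %s %s\n", $3, $4, $5; bad++
    }
    $4 == "DROPPED_PACKETS" && $5 + 0 > 0 {
      printf "DROPS %s %s\n", $3, $5; bad++
    }
    END { exit (bad > 0 ? 1 : 0) }
  '
}
```

The output of either weka stats command shown above can be piped into check_weka_net_stats. No output and a zero exit status mean that no sample crossed a threshold.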
More details and additional tools can be found on the WEKA website, in Manually prepare the system for WEKA configuration.
Test Objectives
The primary objectives of the JVD testing can be summarized as:
- Qualification of the complete AI fabric design functionality including the Frontend, GPU Backend, and Storage Backend fabrics, and connectivity between NVIDIA GPUs and WEKA Storage.
- Qualification of the deployment steps based on Juniper Apstra.
- Confirmation that the design is well documented and produces a reliable, predictable deployment for the customer.
The qualification objectives included validation of:
- Blueprint deployment, device upgrade, incremental configuration pushes and provisioning, telemetry and analytics checking, failure mode analysis, congestion avoidance and mitigation, and verification of host, storage, and GPU traffic.
Test Goals
The AI JVD testing for the described network included the following:
- Design and blueprint deployment through Apstra of three distinct fabrics
- Fabric operation and monitoring through Apstra analytics and telemetry dashboard
- Congestion management with PFC and ECN, including failure scenarios
- End-to-end traffic flow, with Dynamic Load Balancing
- System health, ARP, ND, MAC, BGP (route, next hop), interface traffic counters, and so on
- Software operation verification (no anomalies or issues found)
- AI fabric with Juniper Apstra performing successfully under the following required scenarios:
  - Node failure (reboot)
  - Interface failures (interface down/up, laser on/off)
Under these scenarios, the following were evaluated and validated:
- Completion of AI job models within MLCommons Training benchmarks
- Traffic recovery after all failure scenarios
- Impact on the fabric and anomaly reporting in Apstra
Other features tested:
- Mellanox ConnectX NIC default settings
- DSCP and CNP configuration on the NICs
- Connectivity between fabric-connected hosts created by Apstra towards NSX-managed hosts.
- BERT/DLRM test completion times
- Llama2 Inference against existing infrastructure.
Refer to the test report for more information.