Terminology and Basics
Large model training typically uses single-machine, 8-GPU hosts assembled into clusters; common host models include 8*A100, 8*A800, 8*H100, and 8*H800. Below is the hardware topology of a typical 8*A100 GPU host:
PCIe Switch Chip
Devices that support PCIe, such as CPUs, memory, NVMe storage, GPUs, and network cards, can connect to the PCIe bus or to a dedicated PCIe switch chip to interconnect with one another.
Currently, there are five generations of PCIe products, with the latest being Gen5.
NVLink
Definition
According to Wikipedia, NVLink is a wire-based serial multi-lane near-range communications link developed by Nvidia. Unlike PCI Express, a device can consist of multiple NVLinks, and devices use mesh networking to communicate instead of a central hub. The protocol was first announced in March 2014 and uses a proprietary high-speed signaling interconnect (NVHS).
In summary, NVLink is a high-speed interconnect method between different GPUs within the same host. It is a short-range communication link that ensures successful packet transmission, offers higher performance, and serves as a replacement for PCIe. It supports multiple lanes, with link bandwidth increasing linearly with the number of lanes. GPUs within the same node are interconnected via NVLink in a full-mesh manner (similar to spine-leaf architecture), utilizing NVIDIA’s proprietary technology.
Evolution: Generations 1/2/3/4
The main differences lie in the number of lanes per NVLink and the bandwidth per lane (the figures provided are bidirectional bandwidths):
For example:
A100: 2 lanes/NVSwitch * 6 NVSwitch * 50GB/s/lane = 600GB/s bidirectional bandwidth (300GB/s unidirectional). Note: This is the total bandwidth from one GPU to all NVSwitches.
A800: Reduced by 4 lanes, resulting in 8 lanes * 50GB/s/lane = 400GB/s bidirectional bandwidth (200GB/s unidirectional).
Monitoring
Real-time NVLink bandwidth can be collected based on DCGM metrics.
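As a minimal sketch of how this can be done, the snippet below polls DCGM's NVLink profiling fields through the dcgmi CLI. It assumes DCGM is installed, the nv-hostengine daemon is running, and that the profiling field IDs 1011/1012 (DCGM_FI_PROF_NVLINK_TX_BYTES / DCGM_FI_PROF_NVLINK_RX_BYTES) are available on the GPU in question:

```python
# Minimal sketch: sample per-GPU NVLink TX/RX throughput via the DCGM CLI.
# Assumes DCGM is installed and the nv-hostengine daemon is running.
import subprocess

# DCGM profiling field IDs for NVLink traffic (bytes/s):
NVLINK_TX_BYTES = "1011"  # DCGM_FI_PROF_NVLINK_TX_BYTES
NVLINK_RX_BYTES = "1012"  # DCGM_FI_PROF_NVLINK_RX_BYTES

def watch_nvlink_bandwidth(interval_ms: int = 1000, samples: int = 10) -> None:
    """Print NVLink throughput for all GPUs, one row per sampling interval."""
    cmd = [
        "dcgmi", "dmon",
        "-e", f"{NVLINK_TX_BYTES},{NVLINK_RX_BYTES}",  # fields to sample
        "-d", str(interval_ms),                        # sampling interval in ms
        "-c", str(samples),                            # number of samples to take
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    watch_nvlink_bandwidth()
```

In practice these same fields are usually exported continuously via dcgm-exporter and scraped into Prometheus/Grafana rather than polled by hand.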
NVSwitch
Refer to the diagram below for a typical 8*A100 GPU host hardware topology.
NVSwitch is an NVIDIA switch chip encapsulated within the GPU module, not an independent external switch.
Below is an image of an actual machine from Inspur. The eight boxes represent the eight A100 GPUs, and the six thick heat sinks on the right cover the NVSwitch chips:
NVLink Switch
Although NVSwitch sounds like a switch, it is actually a switch chip on the GPU module used to connect GPUs within the same host.
In 2022, NVIDIA released this chip as an actual switch called NVLink Switch, designed to connect GPU devices across hosts. The names can be easily confused.
HBM (High Bandwidth Memory)
Origin
Traditionally, GPU memory and regular memory (DDR) are mounted on the motherboard and connected to the processor (CPU, GPU) via PCIe. This creates a speed bottleneck at PCIe, with Gen4 offering 64GB/s and Gen5 offering 128GB/s. To overcome this, some GPU manufacturers (not just NVIDIA) stack multiple DDR chips and package them with the GPU. This way, each GPU can interact with its own memory without routing through the PCIe switch chip, significantly increasing speed. This “High Bandwidth Memory” is abbreviated as HBM. The HBM market is currently dominated by South Korean companies like SK Hynix and Samsung.
Evolution: HBM 1/2/2e/3/3e
According to Wikipedia, the AMD MI300X uses a 192GB HBM3 configuration with a bandwidth of 5.2TB/s. HBM3e is an enhanced version of HBM3, with speeds ranging from 6.4GT/s to 8GT/s.
Bandwidth Units
The performance of large-scale GPU training is directly related to data transfer speeds. Many links are involved, such as PCIe bandwidth, memory bandwidth, NVLink bandwidth, HBM bandwidth, and network bandwidth. Network bandwidth is typically expressed in bits per second (b/s) and usually refers to a single direction (TX/RX). The bandwidth of other modules is generally expressed in bytes per second (B/s) or transfers per second (T/s) and usually refers to total bidirectional bandwidth. It is important to distinguish and convert these units when comparing bandwidths.
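To make such comparisons less error-prone, it helps to normalize everything to the same unit first. The helpers below are purely illustrative (the function names are my own), converting bit-based network speeds into GB/s and total bidirectional figures into per-direction figures:

```python
# Illustrative unit helpers: network links are quoted in bits/s (unidirectional),
# while PCIe/NVLink/HBM figures are usually quoted in bytes/s (often bidirectional totals).

def gbps_to_gb_per_s(gbps: float) -> float:
    """Convert Gb/s (bits) to GB/s (bytes)."""
    return gbps / 8.0

def bidirectional_to_unidirectional(total_gb_per_s: float) -> float:
    """Halve a total bidirectional figure to get the per-direction bandwidth."""
    return total_gb_per_s / 2.0

if __name__ == "__main__":
    print(gbps_to_gb_per_s(100))                 # 100 Gbps NIC -> 12.5 GB/s per direction
    print(gbps_to_gb_per_s(400))                 # 400 Gbps NIC -> 50.0 GB/s per direction
    print(bidirectional_to_unidirectional(600))  # A100 NVLink: 600 GB/s total -> 300 GB/s per direction
```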
Typical 8*A100/8*A800 Host
Host Topology: 2-2-4-6-8-8
- 2 CPUs (and their respective memory, NUMA)
- 2 storage network cards (for accessing distributed storage, in-band management, etc.)
- 4 PCIe Gen4 Switch chips
- 6 NVSwitch chips
- 8 GPUs
- 8 GPU-dedicated network cards
The following diagram provides a more detailed view:
Storage Network Cards
These are directly connected to the CPU via PCIe. Their purposes include:
- Reading and writing data from distributed storage, such as reading training data and writing checkpoints.
- Normal node management, SSH, monitoring, etc.
The official recommendation is to use BF3 DPU, but as long as the bandwidth meets the requirements, any solution will work. For cost-effective networking, use RoCE; for the best performance, use IB.
NVSwitch Fabric: Intra-Node Full-Mesh
The 8 GPUs are connected in a full-mesh configuration via 6 NVSwitch chips, also known as NVSwitch fabric. Each link in the full-mesh has a bandwidth of n * bw-per-nvlink-lane:
For A100 using NVLink3, it is 50GB/s per lane, so each link in the full-mesh is 12*50GB/s = 600GB/s (bidirectional), with 300GB/s unidirectional.
For A800, which is a reduced version, 12 lanes are reduced to 8 lanes, so each link is 8*50GB/s = 400GB/s (bidirectional), with 200GB/s unidirectional.
Using nvidia-smi topo to View Topology
Below is the actual topology displayed by nvidia-smi on an 8*A800 machine (network cards are bonded in pairs, NIC 0~3 are bonded):
- Between GPUs (top-left area): all NV8, indicating 8 NVLink connections.
- Between NICs:
  - Under the same CPU: NODE, indicating no need to cross NUMA, but a need to cross PCIe switch chips.
  - Under different CPUs: SYS, indicating the need to cross NUMA.
- Between GPUs and NICs:
  - Under the same CPU and the same PCIe switch chip: NODE, indicating only the need to cross PCIe switch chips.
  - Under the same CPU but different PCIe switch chips: NODE, indicating the need to cross PCIe switch chips and the PCIe Host Bridge.
  - Under different CPUs: SYS, indicating the need to cross NUMA and PCIe switch chips; this is the longest distance.
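For reference, the labels in this matrix follow nvidia-smi's own legend. The sketch below simply wraps the CLI and restates that legend (the descriptions paraphrase the "Legend" section that nvidia-smi prints; the dictionary is my own convenience):

```python
# Sketch: print the GPU/NIC topology matrix and the meaning of each connection label.
import subprocess

TOPO_LEGEND = {
    "X":    "self",
    "NV#":  "connection traversing a bonded set of # NVLinks",
    "PIX":  "connection traversing at most a single PCIe bridge",
    "PXB":  "connection traversing multiple PCIe bridges (not the PCIe Host Bridge)",
    "PHB":  "connection traversing a PCIe Host Bridge (typically the CPU)",
    "NODE": "connection traversing PCIe and the interconnect between PCIe Host Bridges within a NUMA node",
    "SYS":  "connection traversing PCIe and the inter-CPU interconnect between NUMA nodes (e.g. QPI/UPI)",
}

if __name__ == "__main__":
    subprocess.run(["nvidia-smi", "topo", "-m"], check=True)  # dump the raw matrix
    for label, meaning in TOPO_LEGEND.items():                # then the legend for quick lookup
        print(f"{label:>5}: {meaning}")
```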
GPU Training Cluster Networking: IDC GPU Fabric
GPU Node Interconnection Architecture:
Compute Network:
The GPU network interface cards (NICs) are directly connected to the top-of-rack switches (leaf). These leaf switches are connected in a full-mesh topology to the spine switches, forming an inter-host GPU compute network. The purpose of this network is to facilitate data exchange between GPUs on different nodes. Each GPU is connected to its NIC through a PCIe switch chip: GPU <–> PCIe Switch <–> NIC.
Storage Network:
Two NICs directly connected to the CPU are linked to another network, primarily for data read/write operations and SSH management.
RoCE vs. InfiniBand:
Both the compute and storage networks require RDMA to achieve the high performance needed for AI. Currently, there are two RDMA options:
RoCEv2: Public cloud providers typically use this network for 8-GPU hosts, such as the CX6 with an 8*100Gbps configuration. It is relatively inexpensive while meeting performance requirements.
InfiniBand (IB): Offers over 20% better performance than RoCEv2 at the same NIC bandwidth but is twice as expensive.
Data Link Bandwidth Bottleneck Analysis:
Key link bandwidths are indicated in the diagram:
Intra-host GPU communication: Utilizes NVLink with a bidirectional bandwidth of 600GB/s (300GB/s unidirectional).
Intra-host GPU to NIC communication: Uses PCIe Gen4 switch chips with a bidirectional bandwidth of 64GB/s (32GB/s unidirectional).
Inter-host GPU communication: Relies on NICs for data transmission. The mainstream NIC bandwidth for A100/A800 hosts in China is 100Gbps (12.5GB/s unidirectional), so inter-host performance is significantly lower than intra-host communication.
- 200Gbps (25GB/s): Approaches the unidirectional bandwidth of PCIe Gen4.
- 400Gbps (50GB/s): Exceeds the unidirectional bandwidth of PCIe Gen4.
Thus, using 400Gbps NICs on these hosts is of limited benefit, since 400Gbps requires PCIe Gen5 to be fully utilized.
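A quick back-of-the-envelope comparison of the per-direction bandwidths mentioned above makes the bottleneck obvious (figures taken from the text; this is only a sketch):

```python
# Per-direction bandwidths on an A100/A800 host, in GB/s (figures from the text above).
links_gb_per_s = {
    "NVLink (intra-host GPU <-> GPU, A100)": 300.0,
    "PCIe Gen4 x16 (GPU <-> NIC)":            32.0,
    "NIC 100 Gbps":                          100 / 8,  # 12.5
    "NIC 200 Gbps":                          200 / 8,  # 25.0, close to PCIe Gen4 x16
    "NIC 400 Gbps":                          400 / 8,  # 50.0, exceeds PCIe Gen4 x16
}

for name, bw in links_gb_per_s.items():
    print(f"{name:42s} {bw:6.1f} GB/s")

# A 400 Gbps NIC (50 GB/s per direction) cannot be fed through a 32 GB/s
# PCIe Gen4 x16 link, which is why it only pays off on PCIe Gen5 hosts.
```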
Typical 8*H100/8*H800 Hosts
There are two types of GPU Board Form Factor:
- PCIe Gen5
- SXM5: Offers higher performance.
H100 Chip Layout:
The internal structure of an H100 GPU chip includes:
- 4nm process technology.
- A bottom row of 18 Gen4 NVLink lanes, providing a total bidirectional bandwidth of 900GB/s (18 lanes * 50GB/s/lane, i.e. 450GB/s unidirectional).
- The middle blue section is the L2 cache.
- The sides contain the HBM chips, which serve as the GPU memory.
Intra-host Hardware Topology:
Similar to the A100 8-GPU structure, with the following differences:
- The number of NVSwitch chips has been reduced from 6 to 4.
- The connection to the CPU has been upgraded from PCIe Gen4 x16 to PCIe Gen5 x16, with a bidirectional bandwidth of 128GB/s.
Networking:
Similar to the A100, but the standard configuration is upgraded to 400Gbps CX7 NICs; otherwise, the gap between the network bandwidth and the PCIe switch and NVLink/NVSwitch bandwidths would be even larger.
Typical 4*L40S/8*L40S Hosts
The L40S is a new generation of cost-effective, multifunctional GPUs set to be released in 2023, positioned as a competitor to the A100. While it is not suitable for training large foundational models (as will be explained later), it is advertised as being capable of handling almost any other task.
Comparison of L40S and A100 Configurations and Features
One of the key features of the L40S is its short time-to-market: the period from order to delivery is much shorter than for the A100/A800/H800. This is due to both technical and non-technical reasons, for example:
- The removal of FP64 and NVLink.
- The use of GDDR6 memory, which does not depend on HBM production capacity (or advanced packaging).
The lower cost is attributed to several factors, which will be detailed later. In brief:
- The primary reduction likely comes from the GPU itself, due to the removal of certain modules and functions or the use of cheaper alternatives.
- Savings in the overall system, such as the elimination of a layer of PCIe Gen4 switches; compared to the 4x/8x GPUs themselves, the rest of the system components are almost negligible in cost.
Performance Comparison Between L40S and A100
Below is an official performance comparison:
Performance: 1.2x to 2x (depending on the specific scenario).
Power consumption: Two L40S units consume roughly the same power as a single A100.
It is important to note that the official recommendation for L40S hosts is a single machine with 4 GPUs rather than 8 (the reasons for this will be explained later). Therefore, comparisons are generally made between two 4*L40S units and a single 8*A100 unit. Additionally, many performance improvements in various scenarios have a major prerequisite: the network must be a 200Gbps RoCE or IB network, which will be explained next.
L40S System Assembly
Recommended Architecture: 2-2-4
Compared to the A100’s 2-2-4-6-8-8 architecture, the officially recommended L40S GPU host architecture is 2-2-4. The physical topology of a single machine is as follows:
The most noticeable change is the removal of the PCIe switch chip between the CPU and GPU. Both the NIC and GPU are directly connected to the CPU’s built-in PCIe Gen4 x16 (64GB/s):
- 2 CPUs (NUMA)
- 2 dual-port CX7 NICs (each NIC 2*200Gbps)
- 4 L40S GPUs
Additionally, only one dual-port storage NIC is provided, directly connected to any one of the CPUs.
This configuration provides each GPU with an average network bandwidth of 200Gbps.
Non-Recommended Architecture: 2-2-8
As shown, compared to a single machine with 4 GPUs, a single machine with 8 GPUs requires the introduction of two PCIe Gen5 switch chips:
- The current price of a single PCIe Gen5 switch chip is said to be $10,000 (unverified), and a single machine needs 2 of them, which is not cost-effective.
- Only one manufacturer produces these PCIe switches, with limited production capacity and long lead times.
- The network bandwidth per GPU is halved (see the quick calculation below).
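A quick sanity check on the per-GPU network bandwidth of the two layouts (figures from above; the helper is my own illustration):

```python
# Per-GPU network bandwidth for the recommended 2-2-4 and non-recommended 2-2-8 layouts.
# Each dual-port CX7 NIC provides 2 x 200 Gbps (figures from the text above).

def per_gpu_gbps(num_dual_port_nics: int, num_gpus: int, port_gbps: int = 200) -> float:
    total_gbps = num_dual_port_nics * 2 * port_gbps
    return total_gbps / num_gpus

print(per_gpu_gbps(num_dual_port_nics=2, num_gpus=4))  # 2-2-4: 200 Gbps per GPU
print(per_gpu_gbps(num_dual_port_nics=2, num_gpus=8))  # 2-2-8: 100 Gbps per GPU (halved)
```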
Networking
The official recommendation is for 4-GPU models, paired with 200Gbps RoCE/IB networking.
Analysis of Data Link Bandwidth Bottlenecks
Using two L40S GPUs under the same CPU as an example, there are two possible link options:
- Direct CPU Processing:
Path: GPU0 <–PCIe–> CPU <–PCIe–> GPU1
Bandwidth: PCIe Gen4 x16 with a bidirectional bandwidth of 64GB/s (32GB/s unidirectional).
CPU Processing Bottleneck: To be determined.
- Bypassing CPU Processing:
Path: GPU0 <–PCIe–> NIC <– RoCE/IB Switch –> NIC <–PCIe–> GPU1
Bandwidth: PCIe Gen4 x16 with a bidirectional bandwidth of 64GB/s (32GB/s unidirectional).
Average Bandwidth per GPU: Each GPU has a unidirectional 200Gbps network port, equivalent to 25GB/s.
NCCL Support: The latest version of NCCL is being adapted for the L40S, with the default behavior routing data externally and back.
Although this method appears longer, it is reportedly faster than the first method, provided the NICs and switches are properly configured with a 200Gbps RoCE/IB network. In this network architecture, with sufficient bandwidth, the communication bandwidth and latency between any two GPUs are consistent, regardless of whether they are within the same machine or under the same CPU. This allows for horizontal scaling of the cluster.
Cost and Performance Considerations:
The cost of the GPU machines themselves is reduced, but for workloads that do not actually need that much network bandwidth, the cost of NVLink is effectively shifted onto the network. A 200Gbps network is therefore essential to fully utilize the performance of multi-GPU training with the L40S.
Bandwidth Bottlenecks in Method Two:
The bandwidth bottleneck between GPUs within the same host is determined by the NIC speed. Even with the recommended 2*CX7 configuration:
- L40S: 200Gbps (unidirectional NIC speed)
- A100: 300GB/s (unidirectional NVLINK3) == 12x200Gbps
- A800: 200GB/s (unidirectional NVLINK3) == 8x200Gbps
It is evident that the inter-GPU bandwidth of the L40S is only 1/12 that of the A100 NVLink and 1/8 that of the A800 NVLink, making it unsuitable for data-intensive foundational model training.
Testing Considerations:
As mentioned, even when testing a single 4-GPU L40S machine, a 200Gbps switch is required to achieve optimal inter-GPU performance.
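One common way to verify this is to run the all-reduce benchmark from NVIDIA's nccl-tests and compare the reported bus bandwidth with and without the 200Gbps fabric. The wrapper below is only a sketch: it assumes nccl-tests has already been built and that the binary sits at the path shown.

```python
# Sketch: run NVIDIA's nccl-tests all-reduce benchmark on a 4-GPU L40S node
# and inspect the reported "busbw" column.
# Assumes nccl-tests is built; adjust ALL_REDUCE_PERF to the actual location.
import subprocess

ALL_REDUCE_PERF = "./nccl-tests/build/all_reduce_perf"  # assumed build path

def run_allreduce_bench(num_gpus: int = 4) -> None:
    cmd = [
        ALL_REDUCE_PERF,
        "-b", "8",    # start at 8-byte messages
        "-e", "4G",   # sweep up to 4 GiB messages
        "-f", "2",    # double the message size each step
        "-g", str(num_gpus),
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_allreduce_bench()
```

If NCCL is routing traffic out through the NICs as described above, the measured bus bandwidth should only approach the 25GB/s-per-GPU ceiling when the 200Gbps RoCE/IB network is actually in place.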