How to Build a Cluster with 128 DGX H100s?

The NVIDIA DGX H100, released in 2022, is equipped with 8 single-port ConnectX-7 network cards supporting NDR 400Gb/s bandwidth, plus 2 dual-port BlueField-3 DPUs (200Gb/s) that can serve InfiniBand or Ethernet networks. The appearance is shown in the following figure.

[Figure: DGX H100 appearance, with the in-band system management ports labeled]

The DGX H100 has 4 QSFP56 ports for the storage network and the in-band management network. In addition, there is one 10G Ethernet port for remote host OS management and one 1G Ethernet port for remote system management.

The figure of the server's internal network topology shows 4 OSFP ports used for the compute network (the purple blocks). The blue blocks are the network cards, which act both as NICs and as PCIe switch expanders, serving as the bridge between the CPUs and the GPUs.

If the NVIDIA SuperPOD NVLink interconnection scheme is adopted, 32 H100 systems are interconnected through external NVLink switches. The 8 GPUs inside each server connect to 4 NVSwitch modules, each NVSwitch module maps to 4-5 OSFP optical modules (18 OSFPs in total), and those OSFPs then connect to 18 external NVLink switches. (The H100 systems currently on the market do not ship with these 18 OSFP modules.) This article does not discuss the NVLink networking method and focuses on the InfiniBand networking method instead.

According to the NVIDIA reference design document, in a DGX H100 cluster every 32 DGX H100 systems form a scalable unit (SU), and every 4 DGX H100 systems are placed in a separate rack (the power of each rack is estimated to be close to 40 kW), while the various switches are placed in two independent racks. Each SU therefore contains 10 racks (8 for servers and 2 for switches). The compute network only needs a two-layer Spine-Leaf design (Mellanox QM9700 switches); the network topology is shown in the following figure.

[Figure: Spine-Leaf compute network topology]
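As a quick sanity check on the rack math above, here is a minimal Python sketch. The per-system power figure of roughly 10.2 kW is an assumption used only to reproduce the ~40 kW-per-rack estimate; the other constants come from the reference design described in the text.

```python
# Sketch of the SU/rack layout described above.
# Assumption: ~10.2 kW maximum draw per DGX H100 (not stated in the text),
# which is where the ~40 kW-per-rack estimate comes from.

TOTAL_SERVERS = 128
SERVERS_PER_SU = 32
SERVERS_PER_RACK = 4
SWITCH_RACKS_PER_SU = 2
DGX_H100_MAX_KW = 10.2  # assumed per-system power draw

num_sus = TOTAL_SERVERS // SERVERS_PER_SU                  # 4 SUs
server_racks_per_su = SERVERS_PER_SU // SERVERS_PER_RACK   # 8 server racks per SU
racks_per_su = server_racks_per_su + SWITCH_RACKS_PER_SU   # 10 racks per SU
rack_power_kw = SERVERS_PER_RACK * DGX_H100_MAX_KW         # ~40.8 kW per server rack

print(f"SUs: {num_sus}")
print(f"Racks per SU: {racks_per_su} ({server_racks_per_su} server + {SWITCH_RACKS_PER_SU} switch)")
print(f"Total racks: {num_sus * racks_per_su}")
print(f"Estimated power per server rack: {rack_power_kw:.1f} kW")
```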

Switch usage: Every 32 DGX H100 systems form an SU with 8 Leaf switches, and the 128-server cluster contains 4 SUs, so there are 32 Leaf switches in total. Each DGX H100 in an SU needs a connection to all 8 Leaf switches of that SU. Since each server has only 4 OSFP ports for the compute network, an 800G optical module is plugged into each port and split into two 400G QSFP ports, giving each DGX H100 the 8 links it needs to reach the 8 Leaf switches. Each Leaf switch has 16 uplink ports that connect to the 16 Spine switches.
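The switch counts and the per-server fan-out above can be written down as a short sketch. The variable names are illustrative; the constants are the ones quoted in the text.

```python
# Sketch of the Leaf/Spine counts and per-server fan-out described above.
TOTAL_SERVERS = 128
SERVERS_PER_SU = 32
LEAVES_PER_SU = 8
SPINE_SWITCHES = 16
OSFP_PORTS_PER_SERVER = 4   # compute-network OSFP cages per DGX H100
QSFP_PER_OSFP = 2           # each 800G OSFP module splits into two 400G ports

num_sus = TOTAL_SERVERS // SERVERS_PER_SU                   # 4 SUs
total_leaves = num_sus * LEAVES_PER_SU                      # 32 Leaf switches
links_per_server = OSFP_PORTS_PER_SERVER * QSFP_PER_OSFP    # 8 links per server

# Each server must reach every Leaf switch in its SU exactly once.
assert links_per_server == LEAVES_PER_SU
print(f"Leaf switches: {total_leaves}, Spine switches: {SPINE_SWITCHES}")
```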

Optical module usage: 400G optical modules are required at the Leaf switch downlink ports, with a demand of 32 × 8 × 4 = 1024. 800G optical modules are used at the Leaf switch uplink ports, with a demand of 16 × 8 × 4 = 512, and 800G optical modules are likewise used at the Spine switch downlink ports (another 512). Adding the 512 modules on the server side (128 servers × 4 OSFP ports), the compute network of the 128 DGX H100 cluster uses 1536 800G optical modules and 1024 400G optical modules.
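The module accounting can be reproduced with the arithmetic below. The assumption that the Spine downlink side contributes the same 512 twin-port 800G modules as the Leaf uplink side is inferred from the 1536 total rather than stated explicitly in the text.

```python
# Sketch of the optical-module accounting described above.
# Assumption: Spine downlinks mirror the Leaf uplinks (512 x 800G modules),
# consistent with the stated total of 1536 x 800G.
TOTAL_SERVERS = 128
SERVERS_PER_SU, LEAVES_PER_SU, SUS = 32, 8, 4
UPLINK_800G_PER_LEAF = 16
OSFP_PORTS_PER_SERVER = 4

leaf_downlink_400g = SERVERS_PER_SU * LEAVES_PER_SU * SUS       # 32*8*4 = 1024
leaf_uplink_800g = UPLINK_800G_PER_LEAF * LEAVES_PER_SU * SUS   # 16*8*4 = 512
spine_downlink_800g = leaf_uplink_800g                          # assumed mirror: 512
server_side_800g = TOTAL_SERVERS * OSFP_PORTS_PER_SERVER        # 128*4 = 512

total_800g = server_side_800g + leaf_uplink_800g + spine_downlink_800g  # 1536
total_400g = leaf_downlink_400g                                         # 1024
print(f"800G modules: {total_800g}, 400G modules: {total_400g}")
```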
