IB Network and Mellanox Products

What is IB (InfiniBand)?

IB is the abbreviation of InfiniBand (literally "infinite bandwidth"), a computer network communication standard for high-performance computing. Characterized by extremely high throughput and extremely low latency, IB technology is used for data interconnection between computers. InfiniBand is also used as a direct or switched interconnect between servers and storage systems, and as an interconnect between storage systems. With the rise of AI, IB has grown in popularity and is currently the preferred networking choice for high-end GPU server clusters.

Here is the development history of InfiniBand:

  • 1999: The InfiniBand Trade Association (IBTA) released the InfiniBand architecture, which was originally intended to replace the PCI bus.
  • 2000: Version 1.0 of the InfiniBand architecture specification was officially released. In 2001, the first InfiniBand products came out, and many manufacturers began to launch products supporting InfiniBand, including servers, storage systems, and network equipment.
  • 2003: Thanks to its high throughput and low latency, InfiniBand moved into a new application area, HPC cluster interconnection, and was widely used in the TOP500 supercomputers of the time.
  • 2004: Another important InfiniBand non-profit organization was founded, the OpenFabrics Alliance (OFA).
  • 2005: InfiniBand found a new scenario, the connection of storage devices, and has been continuously updated and iterated since then.
  • 2015: InfiniBand accounted for more than 50% of the TOP500 supercomputers for the first time, reaching 51.4%. This marked the first time InfiniBand overtook Ethernet to become the most popular internal interconnect technology in supercomputers.
  • Since 2023: AI large-model training has been highly dependent on high-performance computing clusters, and InfiniBand networks are their best partner.

Core advantages of Mellanox and IB Network

Relationship between Mellanox and InfiniBand

Currently, when people mention IB, the first name that comes to mind is Mellanox. In 2019, NVIDIA announced a US$6.9 billion acquisition of Mellanox, bringing it in as an NVIDIA sub-brand. Jensen Huang said publicly that this was a combination of two of the world’s leading high-performance computing companies: NVIDIA focuses on accelerated computing, while Mellanox focuses on interconnect and storage.

According to forecasts by industry organizations, the market size of InfiniBand will reach US$98.37 billion in 2029, a 14.7-fold increase from US$6.66 billion in 2021. Driven by high-performance computing and AI, InfiniBand has a bright future.

InfiniBand network architecture and features

An InfiniBand system consists of channel adapters, switches, routers, cables, and connectors, as shown in the following figure.

InfiniBand System

The core features are summarized as follows:

  • Low latency: extremely low latency, with native RDMA support
  • High bandwidth: up to 400 Gb/s of data transmission per port
  • Ease of use: well suited to building large-scale data center clusters

IB Network and RDMA

When talking about IB networks, we have to mention RDMA. RDMA (Remote Direct Memory Access) was created to address the delay of server-side data processing during network transmission. It lets one host or server read and write the memory of another directly, without involving the remote CPU, which frees the CPU to do its own work. InfiniBand is a network technology designed specifically for RDMA, and IB networks support RDMA natively.

What makes RDMA so powerful is its kernel-bypass mechanism, which allows applications to exchange data directly with the network card, reducing data-transfer latency inside the server to roughly 1 µs. RoCE is the technology that transplants RDMA onto Ethernet.
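
As an illustration of this kernel-bypass model, below is a minimal sketch using the Linux libibverbs user-space API (not code from this article): it opens the first IB device and registers (pins) a buffer so the HCA can read and write it directly, without the kernel on the data path. The buffer size and the file name rdma_sketch.c are arbitrary choices for the example, and queue-pair setup and connection establishment are omitted.

```c
/* Minimal libibverbs sketch: open an IB device and register a buffer
 * for RDMA. After ibv_reg_mr(), the HCA can DMA directly into this
 * memory; the kernel is no longer on the data path (kernel bypass).
 * Build (typical): gcc rdma_sketch.c -libverbs
 */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { perror("ibv_open_device"); return 1; }
    printf("opened device: %s\n", ibv_get_device_name(devs[0]));

    struct ibv_pd *pd = ibv_alloc_pd(ctx);   /* protection domain */
    size_t len = 1 << 20;                    /* 1 MiB example buffer */
    void *buf = malloc(len);
    if (!pd || !buf) { fprintf(stderr, "allocation failed\n"); return 1; }

    /* Pin the buffer and hand its translation to the HCA. A remote peer
     * holding mr->rkey can then RDMA-read/write it without CPU involvement. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```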

Comparison of the two mainstream RDMA solutions (InfiniBand and RoCEv2)

  • Compared with traditional data centers, the new intelligent computing centers place higher demands on the communication network, such as low latency, high bandwidth, stability, and large scale.
  • Both RDMA-based InfiniBand and RoCEv2 can meet the needs of intelligent computing center communication networks.
  • InfiniBand currently has the advantage over RoCEv2 in performance, while RoCEv2 currently has the advantage over InfiniBand in cost and versatility.

Taking the early SDR (Single Data Rate) specification as an example, the raw signaling bandwidth of a 1X link is 2.5 Gb/s, a 4X link is 10 Gb/s, and a 12X link is 30 Gb/s. The actual data bandwidth of a 1X link is 2.0 Gb/s (due to 8b/10b encoding), and since the link is bidirectional, the total bandwidth relative to the bus is 4 Gb/s. Over time, InfiniBand network bandwidth has continued to be upgraded.
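
The arithmetic above can be reproduced directly. The short snippet below is purely illustrative: it applies the 8b/10b encoding factor to the SDR lane rate of 2.5 Gb/s for 1X, 4X, and 12X link widths.

```c
/* Illustrative calculation of SDR InfiniBand link bandwidth.
 * SDR signals at 2.5 Gb/s per lane; 8b/10b encoding leaves 80% of
 * that as usable data bandwidth, counted per direction.
 */
#include <stdio.h>

int main(void)
{
    const double lane_signal_gbps = 2.5;            /* SDR raw rate per lane */
    const double encoding_efficiency = 8.0 / 10.0;  /* 8b/10b overhead       */
    const int widths[] = {1, 4, 12};                /* 1X, 4X, 12X links     */

    for (int i = 0; i < 3; i++) {
        double raw  = lane_signal_gbps * widths[i];
        double data = raw * encoding_efficiency;
        printf("%2dX link: raw %5.1f Gb/s, data %5.1f Gb/s per direction "
               "(%5.1f Gb/s bidirectional)\n",
               widths[i], raw, data, 2 * data);
    }
    return 0;
}
```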

What does the "DR" in names such as HDR and NDR mean? Every generation of IB technology is abbreviated with a "DR" suffix, which stands for Data Rate: for example, SDR is Single Data Rate, QDR is Quad Data Rate, HDR is High Data Rate, and NDR is Next Data Rate. Links with 4 lanes (4X) are the mainstream.

The following figure shows InfiniBand network bandwidth from SDR, DDR, QDR, FDR, and EDR through HDR and NDR. The speeds are based on 4X link speed. Currently EDR, HDR, and NDR are mainstream, corresponding to PCIe 3.0, 4.0, and 5.0 server platforms respectively.

NVIDIA InfiniBand mainstream products: the latest NDR network cards

The ConnectX-7 IB card (HCA) comes in a variety of form factors, with single- and dual-port options, OSFP and QSFP112 interfaces, and support for 200 Gb/s and 400 Gb/s rates. The CX-7 network card supports x16 PCIe 5.0 or PCIe 4.0 and complies with the CEM specification. Up to 16 lanes can be connected, with support for an optional auxiliary card that enables 32 lanes of PCIe 4.0 using NVIDIA Socket Direct® technology.

Other form factors include Open Compute Project (OCP) 3.0 with an OSFP connector, OCP 3.0 with a QSFP112 connector, and CEM PCIe x16 with a QSFP112 connector.
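
To see what link width and speed a card such as the ConnectX-7 has actually negotiated, the active port attributes can be read through the verbs API. The sketch below is a hedged example, not vendor documentation: it assumes the first RDMA device and port 1, and prints the raw attribute values rather than decoding them.

```c
/* Minimal sketch: query the active link width/speed of the first
 * RDMA device, port 1, via libibverbs. Values are printed raw; see
 * infiniband/verbs.h for how active_width/active_speed are encoded.
 * Build (typical): gcc query_port.c -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { perror("ibv_open_device"); return 1; }

    struct ibv_port_attr attr;
    if (ibv_query_port(ctx, 1, &attr)) { perror("ibv_query_port"); return 1; }

    printf("device %s port 1: state=%d active_width=%u active_speed=%u lid=%u\n",
           ibv_get_device_name(devs[0]), attr.state,
           attr.active_width, attr.active_speed, attr.lid);

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```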

Mellanox’s latest NDR switch

Mellanox’s IB switches are divided into two types: fixed-configuration switches and modular switches. Reportedly, the modular switches are no longer sold in the latest NDR generation (they still appear on the official website, but are no longer available for purchase).

NDR’s fixed-configuration switch, the MQM9700 series, is equipped with 32 physical OSFP connectors and supports 64 ports of 400 Gb/s (which can be split into up to 128 ports of 200 Gb/s). The series provides a total of 51.2 Tb/s of bidirectional throughput (backplane bandwidth, i.e. 64 ports × 400 Gb/s × 2 directions) and a packet forwarding rate of 66.5 billion packets per second.

The sub-models all share the same port count and speeds; they differ in whether a management function is included, in the power-supply option, and in the cooling (airflow) direction. Usually, a single switch with management functions is sufficient.

Mellanox’s latest interconnect cables and modules

Mellanox’s LinkX cables and transceivers are typically used to link ToR switches downward to NVIDIA GPU and CPU server network cards and storage devices, and/or upward in switch-to-switch interconnect applications throughout the network infrastructure.

The cable options are Active Optical Cables (AOC), Direct Attach Copper cables (DAC), and the newer active copper cables (ACC), which include a signal-enhancement integrated circuit (IC) at each end of the cable.

Mellanox Latest Interconnect Cables and Modules

Switch-to-switch and switch-to-NIC links can be interconnected through different cable types, and on the switch-to-NIC side a single switch port can be broken out to two or four network cards (1-to-2 or 1-to-4).

Mellanox NIC topology in H100

  • Inside an H100 server, the HGX module is logically connected to the host through 4 or 8 PCIe switch (PCIe SW) chips.
  • Each PCIe switch serves two GPU cards and two network cards, so the eight 400G IB cards are designed to map one-to-one to the eight H100 GPUs.
  • If the system is fully populated with eight 400G IB cards for the GPUs, any additional network cards must be attached through other PCIe connections from the CPU.

Mellanox NIC topology in H100
