Data Center Network Architecture
Crossbar architecture
- An architecture derived from the earliest telephone switching networks (the crossbar switch).
- Consists of multiple input ports, multiple output ports, and a switching matrix.
- Very flexible and efficient: it can set up an arbitrary connection between any input and any output.
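As a rough illustration (not taken from any vendor implementation), a crossbar can be modeled as a simple connection map in which each output port is driven by at most one input port; the port count N below is hypothetical:

```c
/* Minimal sketch: an N x N crossbar as a connection map.  Each output is
 * driven by at most one input, and any input-to-output permutation can be
 * set up without blocking. */
#include <stdio.h>

#define N 4                      /* hypothetical port count */

int main(void) {
    int out_to_in[N];            /* out_to_in[o] = input feeding output o, -1 = idle */
    for (int o = 0; o < N; o++) out_to_in[o] = -1;

    /* Close three crosspoints: in0->out2, in1->out0, in3->out3. */
    out_to_in[2] = 0;
    out_to_in[0] = 1;
    out_to_in[3] = 3;

    for (int o = 0; o < N; o++) {
        if (out_to_in[o] >= 0)
            printf("output %d <- input %d\n", o, out_to_in[o]);
        else
            printf("output %d idle\n", o);
    }
    /* The price of this flexibility: a full crossbar needs N*N crosspoints. */
    printf("crosspoints required: %d\n", N * N);
    return 0;
}
```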
Clos architecture
- Proposed by Charles Clos, a Bell Labs researcher, in his 1953 paper on non-blocking switching networks.
- Clos architecture mainly describes the structure of a multi-stage circuit switching network
- The Clos architecture improves on the crossbar: it can still provide a non-blocking network, but with far fewer crosspoints, which lowers cost and increases efficiency.
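To see where the cost saving comes from, here is a back-of-the-envelope sketch in C comparing crosspoint counts for a single crossbar against a three-stage Clos fabric under the classic strict-sense non-blocking condition m >= 2n - 1; the port counts are illustrative only:

```c
/* Crosspoint count: one big crossbar vs. a 3-stage Clos network C(m, n, r)
 * with r ingress switches (n x m), m middle switches (r x r), and r egress
 * switches (m x n).  m = 2n - 1 gives strict-sense non-blocking. */
#include <stdio.h>

int main(void) {
    long n = 32;              /* inputs per ingress switch (example value)  */
    long r = 32;              /* number of ingress/egress switches          */
    long m = 2 * n - 1;       /* middle switches for strict non-blocking    */
    long N = n * r;           /* total ports: 1024                          */

    long crossbar = N * N;                          /* single N x N crossbar      */
    long clos     = 2 * r * (n * m) + m * (r * r);  /* ingress+egress plus middle */

    printf("total ports          : %ld\n", N);
    printf("crossbar crosspoints : %ld\n", crossbar);
    printf("Clos crosspoints     : %ld (m = %ld)\n", clos, m);
    printf("savings              : %.1fx fewer\n", (double)crossbar / clos);
    return 0;
}
```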
Fat-Tree architecture
A Fat-Tree is a type of Clos network architecture.
Compared with a traditional tree topology, a Fat-Tree looks more like a real tree, with thicker branches near the root: from the leaves up to the root, link bandwidth does not taper off (no convergence).
The basic idea: use a large number of low-performance commodity switches to build a large-scale non-blocking network. For any communication pattern, there are always enough paths for every server to drive its network card at full rate.
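The following sketch applies the standard k-ary Fat-Tree sizing formulas (k pods, k/2 edge and k/2 aggregation switches per pod, (k/2)^2 core switches, k^3/4 hosts); the value k = 48 is just an example, not a figure from this article:

```c
/* k-ary Fat-Tree sizing: build a full-bisection fabric entirely from
 * identical k-port switches. */
#include <stdio.h>

int main(void) {
    int k = 48;                           /* ports per switch (must be even) */
    int pods         = k;
    int edge_per_pod = k / 2;             /* access (ToR) switches per pod   */
    int agg_per_pod  = k / 2;             /* aggregation switches per pod    */
    int core         = (k / 2) * (k / 2); /* core switches                   */
    int hosts        = k * k * k / 4;     /* servers at full bisection bw    */

    printf("pods: %d, edge/pod: %d, agg/pod: %d, core: %d\n",
           pods, edge_per_pod, agg_per_pod, core);
    printf("switches total: %d, hosts supported: %d\n",
           pods * (edge_per_pod + agg_per_pod) + core, hosts);
    return 0;
}
```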
When the Fat-Tree architecture was introduced into the data center, it was deployed as the classic three-layer structure:
Access layer: connects all the computing nodes, usually in the form of a rack switch (TOR, Top of Rack).
Aggregation layer: interconnects the access layer and serves as the Layer 2 / Layer 3 boundary of the aggregation zone; services such as firewalls and load balancers are also deployed here.
Core layer: interconnects the aggregation layer and provides Layer 3 connectivity between the whole data center and external networks.
The disadvantages of the Fat-Tree architecture:
Resource waste: In the traditional three-layer structure, a lower-layer switch connects to two upper-layer switches over two links, but because Spanning Tree Protocol (STP) is used, only one link actually carries traffic; the other uplink is blocked and serves purely as a backup, wasting bandwidth.
Large fault domain: By design, STP must reconverge whenever the network topology changes, which is failure-prone and can affect the entire VLAN.
Not suitable for east-west traffic: Server-to-server communication has to traverse the access switch, the aggregation switch, and the core switch, adding hops and latency.
Spine-Leaf network
Like Fat-Tree, Spine-Leaf belongs to the Clos network model.
Compared with the traditional three-layer architecture, the Spine-Leaf network is flattened into a two-layer architecture.
Leaf switch: equivalent to the access switch in the traditional three-layer architecture; as a TOR (Top of Rack) switch it connects directly to the physical servers. Everything above the leaf switches is a Layer 3 network, and each leaf is an independent L2 broadcast domain. If servers under two different leaf switches need to communicate, the traffic is forwarded through a spine switch.
Spine switch: equivalent to the core switch. Leaf and spine switches dynamically select among the multiple equal-cost paths between them using ECMP (Equal-Cost Multi-Path) routing.
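The toy sketch below illustrates per-flow ECMP load balancing: hash a flow's 5-tuple and pick one of the equal-cost uplinks. Real switches use hardware hash functions; the FNV-1a hash and the flow values here are purely illustrative:

```c
/* Toy ECMP: hash the 5-tuple, take it modulo the number of equal-cost
 * spine uplinks, and send every packet of that flow on the chosen path. */
#include <stdint.h>
#include <stdio.h>

static uint32_t fnv1a(const void *data, size_t len) {
    const uint8_t *p = data;
    uint32_t h = 2166136261u;
    while (len--) { h ^= *p++; h *= 16777619u; }
    return h;
}

int main(void) {
    int num_spines = 4;                                   /* equal-cost uplinks */
    /* Example 5-tuple: src IP, dst IP, src port, dst port, protocol (TCP). */
    uint32_t flow[5] = { 0x0A000001u, 0x0A000102u, 40000u, 80u, 6u };

    int uplink = fnv1a(flow, sizeof flow) % num_spines;
    printf("flow hashed to spine uplink %d of %d\n", uplink, num_spines);
    return 0;
}
```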
The number of downlink ports on a spine switch determines the maximum number of leaf switches, and the number of uplink ports on a leaf switch determines the maximum number of spine switches; together they determine the scale of the Spine-Leaf network.
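A minimal sizing sketch of that relationship, with hypothetical port counts:

```c
/* Spine-Leaf scale from port counts: spine downlinks cap the number of
 * leaves, leaf uplinks cap the number of spines. */
#include <stdio.h>

int main(void) {
    int spine_ports    = 64;  /* downlink ports per spine  -> max leaves  */
    int leaf_uplinks   = 8;   /* uplink ports per leaf     -> max spines  */
    int leaf_downlinks = 48;  /* server-facing ports per leaf             */

    int max_leaves  = spine_ports;
    int max_spines  = leaf_uplinks;
    int max_servers = max_leaves * leaf_downlinks;

    /* Oversubscription at the leaf = downlink bw / uplink bw (same-speed links). */
    double oversub = (double)leaf_downlinks / leaf_uplinks;

    printf("max leaves: %d, max spines: %d, max servers: %d\n",
           max_leaves, max_spines, max_servers);
    printf("leaf oversubscription: %.1f:1\n", oversub);
    return 0;
}
```

Adding spine switches raises uplink capacity (lower oversubscription); adding leaf switches adds server ports.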
The Advantages of Spine-Leaf Network
High bandwidth utilization
Each leaf switch's uplinks carry traffic in a load-balanced manner, making full use of the available bandwidth.
Predictable network latency
In this model, the number of communication paths between any two leaf switches is fixed, and each path crosses exactly one spine switch, so east-west latency is predictable.
Good scalability
When bandwidth is insufficient, spine switches can be added to scale bandwidth horizontally. When the number of servers grows, leaf switches can be added to expand the data center. Planning and expansion are straightforward.
Reduced requirements for switches
North-south traffic can exit through either the leaf nodes or the spine nodes, and east-west traffic is spread across many paths, so expensive high-performance, high-bandwidth switches are not required.
High security and availability
Traditional networks rely on STP, which must reconverge when a device fails, degrading performance and sometimes causing outages. In the Spine-Leaf architecture, a device failure requires no reconvergence: traffic simply continues over the remaining paths, connectivity is unaffected, and total bandwidth drops only by a single path's worth, so the performance impact is negligible.
InfiniBand
RDMA (Remote Direct Memory Access) protocol
In the traditional TCP/IP stack, data arriving at the network card is first copied into kernel memory and then copied again into the application's address space; on the send side, data is copied from application space into kernel memory before the NIC puts it on the wire. This I/O model forces every transfer through kernel memory, lengthening the data path, increasing CPU load, and adding transmission latency.
RDMA's kernel-bypass mechanism lets the application and the network card read and write data directly, cutting the data-transfer latency inside the server to roughly 1 µs.
At the same time, RDMA's zero-copy mechanism lets the receiver read data directly from the sender's registered memory, without any staging in kernel buffers, greatly reducing CPU load and improving CPU efficiency.
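A minimal libibverbs sketch of the zero-copy idea, assuming an RDMA-capable NIC and the rdma-core userspace library: the application registers a buffer with the HCA so the NIC can DMA into and out of it directly, with no kernel copies. Queue-pair setup and the actual RDMA read/write postings are omitted for brevity:

```c
/* Register application memory with the HCA for direct (zero-copy) access. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* open first HCA     */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                /* protection domain  */

    size_t len = 1 << 20;
    void *buf = malloc(len);                              /* application buffer */

    /* Registration pins the buffer and gives the HCA a DMA mapping plus
     * lkey/rkey, so local and remote peers can access it without any
     * intermediate kernel copy. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { fprintf(stderr, "memory registration failed\n"); return 1; }
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```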
The background of InfiniBand
InfiniBand (abbreviated IB) is a powerful communication technology and protocol; the name literally suggests "infinite bandwidth". It was born in the 1990s to replace the PCI (Peripheral Component Interconnect) bus: PCI, which Intel had introduced into the PC architecture, evolved slowly, severely limiting I/O performance and becoming the bottleneck of the entire system.
The development history of InfiniBand
In the 1990s, Intel, Microsoft, and SUN led the development of the “Next Generation I/O (NGIO)” technology standard, while IBM, Compaq, and HP led the development of “Future I/O (FIO)”.
In 1999, the FIO Developers Forum and the NGIO Forum merged and established the InfiniBand Trade Association (IBTA).
In 2000, the InfiniBand architecture specification version 1.0 was officially released.
In May 1999, several employees who left Intel and Galileo Technology founded a chip company in Israel and named it Mellanox.
After Mellanox was established, it joined NGIO; later it moved to the InfiniBand camp and launched its first InfiniBand product in 2001.
Starting in 2003, InfiniBand turned to a new application field: computer cluster interconnection.
In 2004, another important InfiniBand non-profit organization was born: the OpenFabrics Alliance (OFA).
In 2005, InfiniBand found yet another scenario: the connection of storage devices.
Since then, InfiniBand has entered a stage of rapid development.
InfiniBand Network Architecture
InfiniBand is a channel-based structure, consisting of four main components:
- HCA (Host Channel Adapter), which connects the host to the InfiniBand network.
- TCA (Target Channel Adapter), which connects the target device (such as storage) to the InfiniBand network.
- InfiniBand link, which can be a cable, fiber, or on-board link, connects the channel adapters to the switches or routers.
- InfiniBand switch and router, which provide network connectivity and routing for the InfiniBand network.
Channel adapters are what establish the InfiniBand channels: every transmission starts or ends at a channel adapter, which enforces security and operates at a given QoS (Quality of Service) level.
Mellanox was acquired by NVIDIA in 2020. Since then, InfiniBand has been widely used in the training of large AI models.
RoCE
The birth of RoCE
In April 2010, IBTA released RoCE (RDMA over Converged Ethernet), which "ported" InfiniBand's RDMA technology onto Ethernet. In 2014, it proposed the more mature RoCEv2. With RoCEv2, Ethernet greatly narrowed the performance gap with InfiniBand and, combined with its inherent cost and compatibility advantages, began to fight back.
RoCE V2
RoCE v1: an RDMA protocol carried directly over the Ethernet link layer (the switch must support flow-control technologies such as PFC to guarantee reliable transmission at the physical layer). It only allows communication between two hosts in the same VLAN.
RoCE v2: overcomes RoCE v1's limitation of being bound to a single VLAN. By changing the packet encapsulation to include IP and UDP headers, RoCE v2 can be used across both L2 and L3 networks.
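The difference is easiest to see as two header stacks. The sketch below is a simplified model (field layouts collapsed into byte arrays), assuming the well-known RoCE v1 EtherType 0x8915 and the RoCE v2 UDP destination port 4791:

```c
/* Simplified RoCE v1 vs. v2 encapsulation: v1 rides directly on Ethernet
 * (so it cannot cross an L3 boundary), while v2 wraps the same InfiniBand
 * transport header in IP + UDP and can therefore be routed. */
#include <stdint.h>
#include <stdio.h>

/* InfiniBand Base Transport Header carried by both versions (12 bytes). */
struct ib_bth { uint8_t bytes[12]; };

struct roce_v1_frame {
    uint8_t eth[14];         /* Ethernet header, EtherType 0x8915 (RoCE)   */
    uint8_t grh[40];         /* InfiniBand Global Route Header             */
    struct ib_bth bth;       /* IB transport header                        */
    /* payload + ICRC follow */
};

struct roce_v2_packet {
    uint8_t eth[14];         /* Ethernet header, EtherType 0x0800 (IPv4)   */
    uint8_t ip[20];          /* IPv4 header -> routable across L3          */
    uint8_t udp[8];          /* UDP header, destination port 4791          */
    struct ib_bth bth;       /* same IB transport header as v1             */
    /* payload + ICRC follow */
};

int main(void) {
    printf("RoCE v1 headers before payload: %zu bytes\n", sizeof(struct roce_v1_frame));
    printf("RoCE v2 headers before payload: %zu bytes\n", sizeof(struct roce_v2_packet));
    return 0;
}
```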
Related Products:
- Mellanox MMA1B00-E100 Compatible 100G InfiniBand EDR QSFP28 SR4 850nm 100m MTP/MPO MMF DDM Transceiver Module $40.00
- Mellanox MMA1T00-HS Compatible 200G Infiniband HDR QSFP56 SR4 850nm 100m MPO-12 APC OM3/OM4 FEC PAM4 Optical Transceiver Module $200.00
- Mellanox MMS1W50-HM Compatible 200G InfiniBand HDR QSFP56 FR4 PAM4 CWDM4 2km LC SMF FEC Optical Transceiver Module $650.00
- NVIDIA MMS4X00-NS400 Compatible 400G OSFP DR4 Flat Top PAM4 1310nm MTP/MPO-12 500m SMF FEC Optical Transceiver Module $800.00
- NVIDIA MFP7E20-N050 Compatible 50m (164ft) 8 Fibers Low Insertion Loss Female to Female MPO12 to 2xMPO12 Polarity B APC to APC LSZH Multimode OM4 50/125 $145.00
- NVIDIA MFP7E20-N015 Compatible 15m (49ft) 8 Fibers Low Insertion Loss Female to Female MPO12 to 2xMPO12 Polarity B APC to APC LSZH Multimode OM3 50/125 $67.00
- NVIDIA MFS1S90-H015E Compatible 15m (49ft) 2x200G QSFP56 to 2x200G QSFP56 PAM4 Breakout Active Optical Cable $830.00
- NVIDIA MMA4Z00-NS-FLT Compatible 800Gb/s Twin-port OSFP 2x400G SR8 PAM4 850nm 100m DOM Dual MPO-12 MMF Optical Transceiver Module $850.00
- NVIDIA MMS4X00-NM-FLT Compatible 800G Twin-port OSFP 2x400G Flat Top PAM4 1310nm 500m DOM Dual MTP/MPO-12 SMF Optical Transceiver Module $1200.00
- NVIDIA MFS1S50-H015V Compatible 15m (49ft) 200G InfiniBand HDR QSFP56 to 2x100G QSFP56 PAM4 Breakout Active Optical Cable $630.00
- NVIDIA MMA4Z00-NS Compatible 800Gb/s Twin-port OSFP 2x400G SR8 PAM4 850nm 100m DOM Dual MPO-12 MMF Optical Transceiver Module $750.00
- NVIDIA MMS4X00-NM Compatible 800Gb/s Twin-port OSFP 2x400G PAM4 1310nm 500m DOM Dual MTP/MPO-12 SMF Optical Transceiver Module $1100.00
- NVIDIA Mellanox MCX653105A-HDAT-SP ConnectX-6 InfiniBand/VPI Adapter Card, HDR/200GbE, Single-Port QSFP56, PCIe3.0/4.0 x16, Tall Bracket $1400.00
- Mellanox MCP7H50-H003R26 Compatible 3m (10ft) Infiniband HDR 200G QSFP56 to 2x100G QSFP56 PAM4 Passive Breakout Direct Attach Copper Cable $75.00
- Mellanox MFS1S50-H003E Compatible 3m (10ft) 200G HDR QSFP56 to 2x100G QSFP56 PAM4 Breakout Active Optical Cable $605.00
- NVIDIA Mellanox MCX75510AAS-NEAT ConnectX-7 InfiniBand/VPI Adapter Card, NDR/400G, Single-port OSFP, PCIe 5.0x 16, Tall Bracket $1650.00