A Survey of the AI Network Positions of the Leading Vendors

In July 2023, the Ultra Ethernet Consortium (UEC) was officially launched under the Linux Foundation and its Joint Development Foundation, dropping a depth charge into the already turbulent AI network interconnect ecosystem. In August 2023, at IEEE Hot Interconnects (HOTI), an international forum devoted to advanced hardware and software architectures and interconnect network implementations, representatives from Intel, NVIDIA, AMD, and other companies joined a panel discussion on the question of “EtherNET or EtherNOT” and laid out their views on Ethernet. Emerging AI/ML workloads are driving demand for high-performance network interconnects. About ten years ago, RDMA over Converged Ethernet (RoCE) brought low-latency data transmission to the Ethernet architecture, yet compared with other interconnect technologies, Ethernet has appeared to lag in its technical development. Is the battle between EtherNET and EtherNOT coming around again? Cloud providers, equipment vendors, and other stakeholders each have their own interests at stake, and the industry is at a critical decision point. How will they choose?
The “EtherNET or EtherNOT” question had already been debated at the HOTI conference back in 2005; nearly two decades later, the debate has returned.
At the 2023 HOTI panel, Brad Burres, senior researcher and chief hardware architect for the Network and Edge Group at Intel, and Frank Helms, data center GPU system architect at AMD, came down on the side of Ethernet. Burres argued that whatever technology is adopted, an open ecosystem is needed to lower costs across the industry and to build out the required software infrastructure; as the protocols mature, Ethernet will be the winner unless another open, standards-based fabric (such as CXL) emerges soon. Helms pointed out that Frontier, Aurora, and LUMI, which rank first, second, and fifth on the global TOP500 supercomputer list, are all interconnected with HPE Cray’s Ethernet-based Slingshot-11 fabric, and he argued that Ethernet is already at the forefront of interconnect technology. The emergence of the Ultra Ethernet Consortium likewise reflects substantial pent-up demand for Ethernet in large-scale AI training cluster interconnects.

Larry Dennison, director of network research at NVIDIA, countered that Ethernet still falls short of what AI workloads need. If Ethernet were extended to meet all of those needs, would it still be Ethernet, and how long would that take? The Ethernet market is huge and will not disappear, but over the next few years Ethernet will not evolve fast enough to satisfy this market. Torsten Hoefler, professor at ETH Zurich and an advisor to Microsoft on large-scale AI and networking, took the view that Ethernet is the present and the future of data centers and supercomputers, but not the Ethernet we talk about today: Ethernet needs to evolve.
Open Ecosystem or Vendor Lock-in?
Historically, InfiniBand and Ethernet, both open standards, have competed for dominance of the AI/HPC market. The key difference today is that InfiniBand is effectively backed by a single vendor, NVIDIA, while Ethernet enjoys multi-vendor support, fostering a vibrant and competitive ecosystem. Even so, Ethernet solutions for AI/HPC networks often come with a “partially customized” label, which can itself lead to vendor lock-in.
For example, Broadcom’s Jericho3 Ethernet switch requires the entire fabric to use the same switch silicon when running in its high-performance “fully scheduled fabric” mode. Cisco’s Silicon One switches and NVIDIA’s Spectrum-X switches are in a similar position: pushing for the highest performance can pull a buyer into a single-vendor design. Some hyperscalers have also designed custom NICs, which likewise lead to custom networks. So even when choosing an Ethernet solution, one may still encounter custom implementations and vendor lock-in. AI/HPC networks may eventually transition to a new, open, and more capable transport standard that partially or fully replaces the RoCEv2 RDMA protocol; this is the vision the Ultra Ethernet Consortium is pursuing.
AI/ML Networking Technology Inventory
How do the hyperscale vendors choose their AI/ML network technologies? Is it EtherNET or EtherNOT?
Amazon AWS
Amazon drew inspiration from InfiniBand’s Reliable Datagram (RD) protocol and developed the Scalable Reliable Datagram (SRD) transport for its HPC networks. Amazon exclusively uses its own Elastic Network Adapters (ENA), which are built on its proprietary Nitro chips. SRD runs over UDP, sprays packets across multiple links, and drops the in-order packet delivery requirement, reducing fabric congestion and tail latency; when necessary, packet reordering is handled by the layer above SRD. Amazon continues to pursue a home-grown AI/HPC network strategy and is probably the hyperscaler that cooperates least with NVIDIA.
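To make the packet-spraying idea concrete, here is a minimal Python sketch, not AWS code: the SprayingSender and ReorderingReceiver classes and all parameters are invented for illustration. It shows a message being sprayed across several paths and reassembled by the receiver regardless of arrival order.

```python
import random
from dataclasses import dataclass

@dataclass
class Packet:
    seq: int        # message-level sequence number
    path: int       # path chosen by the sender for this packet
    payload: bytes

class SprayingSender:
    """Sprays one message's packets across all available paths (SRD-style)."""
    def __init__(self, num_paths: int):
        self.num_paths = num_paths

    def send(self, message: bytes, mtu: int = 4) -> list:
        chunks = [message[i:i + mtu] for i in range(0, len(message), mtu)]
        # Round-robin path selection: no single flow is pinned to one path.
        return [Packet(seq=i, path=i % self.num_paths, payload=c)
                for i, c in enumerate(chunks)]

class ReorderingReceiver:
    """Accepts packets in any order; the layer above sees the original message."""
    def __init__(self):
        self.buffer = {}

    def receive(self, pkt: Packet) -> None:
        self.buffer[pkt.seq] = pkt.payload

    def deliver(self) -> bytes:
        return b"".join(self.buffer[s] for s in sorted(self.buffer))

# Different paths have different queueing delays, so packets may arrive out of
# order; an SRD-style transport tolerates this instead of forcing in-order
# delivery inside the fabric.
sender, receiver = SprayingSender(num_paths=4), ReorderingReceiver()
packets = sender.send(b"all-reduce gradient shard")
random.shuffle(packets)  # simulate out-of-order arrival
for p in packets:
    receiver.receive(p)
assert receiver.deliver() == b"all-reduce gradient shard"
```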
Google

Google uses a mix of its own TPUs and NVIDIA GPUs; the two compete with each other and are deployed according to workload suitability. Google is unlikely to use InfiniBand products in its network. Google’s AI/ML network is relatively customized, and the company has been deploying an NVLink-like “coherent” architecture for years. Google has also innovated heavily in the network stack and has deployed home-grown optical circuit switches (OCS), circuit switches built from Micro-Electro-Mechanical-System (MEMS) mirrors, in both its regular data centers and its AI data centers. Optical switches typically eliminate a tier of electrical switches, support higher-radix configurations, and reduce power consumption and latency. Because they simply “reflect” light, they are independent of network protocols and of switch-silicon upgrades. The downside is that mirror reconfiguration is slow, on the order of tens of milliseconds, so these OCS devices operate as fixed-capacity “circuits”. For AI training networks this is not a major issue, because the traffic patterns are predictable.
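A rough back-of-the-envelope calculation illustrates why slow mirror reconfiguration is acceptable for training traffic; the reconfiguration time and circuit lifetime below are assumptions chosen for the example, not Google’s published figures.

```python
# Illustrative only: both numbers are assumptions, not measured values.
reconfig_time_s = 0.030      # ~30 ms MEMS mirror settling time (assumed)
circuit_lifetime_s = 600     # circuit held for ~10 minutes of a training job (assumed)

overhead = reconfig_time_s / (reconfig_time_s + circuit_lifetime_s)
print(f"Capacity lost to reconfiguration: {overhead:.4%}")  # a few thousandths of a percent

# For contrast, a hypothetical workload that changed circuits every 100 ms
# would lose a large fraction of its capacity to reconfiguration:
churn_interval_s = 0.100
print(f"Churning every 100 ms would lose: "
      f"{reconfig_time_s / (reconfig_time_s + churn_interval_s):.1%}")
```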
Microsoft
Microsoft is the most pragmatic of the hyperscalers: it adopted InfiniBand early on to build AI networks for its partner OpenAI. Although Microsoft developed its own custom network adapter and uses a custom RDMA protocol in the Azure cloud, its openness to InfiniBand, its embrace of NVIDIA’s full-stack AI/ML solutions, and its close collaboration with OpenAI make it NVIDIA’s preferred customer. Microsoft also acquired Fungible, the inventor of TrueFabric, a UDP-based reliable datagram protocol that handles flow, congestion, and error control and optimizes tail latency. Some of Fungible’s technical innovations may appear in Microsoft’s future products and open-source contributions.
Meta
Meta is a dark horse in the AI race, and its artificial intelligence program stands out in several ways:
- It adopts an open source approach using foundational models such as Llama.
- It makes AI user-friendly and accessible to every software engineer through the PyTorch software framework/ecosystem.
- It establishes the Open Compute Project community as a key pillar of open hardware innovation.
- It deploys large-scale GPU clusters and stays at the forefront of AI innovation with its recommendation system (DLRM model).
Meta’s foundation models and the PyTorch ecosystem power a huge open-source AI innovation community; the company deploys AI/ML clusters based on both Ethernet and InfiniBand and builds its own ASICs for its DLRM models and for video transcoding.
Meta is democratizing AI; it has not yet received enough recognition for this, but that will soon change.
Oracle
Oracle firmly backs Ethernet and does not use InfiniBand. Oracle Cloud Infrastructure (OCI) pairs NVIDIA GPUs with ConnectX NICs to build superclusters on RoCEv2 RDMA. OCI runs a separate RDMA network that relies on DC-QCN congestion control with custom congestion-notification settings, minimizes the use of PFC, and fine-tunes separate profiles for AI and HPC workloads.
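As an illustration of what such per-workload tuning can look like, the sketch below encodes two hypothetical DC-QCN/ECN profiles as plain Python data. The parameter names follow the DCQCN literature (Kmin/Kmax/Pmax ECN marking thresholds), but every value and profile name here is invented for the example and does not describe OCI’s actual configuration.

```python
# Hypothetical DC-QCN tuning profiles; values are illustrative, not OCI's.
DCQCN_PROFILES = {
    "ai_training": {            # many long-lived, bandwidth-hungry flows
        "ecn_kmin_kb": 150,     # start ECN marking early ...
        "ecn_kmax_kb": 1500,    # ... and ramp marking probability over a wide band
        "ecn_pmax": 0.10,       # maximum marking probability at Kmax
        "pfc_enabled": False,   # rely on ECN; keep PFC as a rarely used backstop
        "rate_increase_mbps": 50,
    },
    "hpc_latency": {            # smaller messages, latency-sensitive
        "ecn_kmin_kb": 40,
        "ecn_kmax_kb": 400,
        "ecn_pmax": 0.20,
        "pfc_enabled": False,
        "rate_increase_mbps": 20,
    },
}

def render_profile(name: str) -> str:
    """Pretty-print one profile, e.g. to feed a (hypothetical) provisioning tool."""
    profile = DCQCN_PROFILES[name]
    return "\n".join(f"{name}.{key} = {value}" for key, value in profile.items())

print(render_profile("ai_training"))
```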
NVIDIA
NVIDIA’s GPUs and its full-stack AI/ML solutions make it the undisputed upstream leader in this market. The NVIDIA DGX Cloud solution integrates Quantum-2 (25.6 Tb/s) InfiniBand switches with ConnectX and BlueField network adapters, which support both Ethernet and InfiniBand. NVIDIA and its OEMs will also sell this DGX Cloud full-stack InfiniBand solution into the telecom and enterprise markets. At the same time, NVIDIA is investing heavily in Ethernet through its Spectrum-X switch. A few years ago, InfiniBand was the preferred fabric for AI training, which made it the natural choice for NVIDIA’s integrated DGX Cloud solution. With the launch of the Spectrum-X Ethernet switch (51.2 Tb/s, twice the capacity of its InfiniBand switch), NVIDIA can shift to Ethernet for very large GPU deployments to take advantage of Ethernet’s higher port speeds, cost-effectiveness, and scalability. Spectrum-X supports advanced RoCEv2 extensions: RoCE adaptive routing and congestion control, telemetry, and in-network collective computation via NVIDIA’s SHARP.
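To see why in-network collectives matter, here is a small worked comparison under simplified assumptions (uniform links, no overlap, an invented message size): the per-node bytes sent for one allreduce with a software ring algorithm versus a SHARP-style in-network reduction where the switches aggregate the data. It is a model sketch, not a measurement of any NVIDIA product.

```python
# Simplified allreduce traffic model; all numbers are illustrative assumptions.
def ring_allreduce_bytes_per_node(msg_bytes: float, n_nodes: int) -> float:
    # Classic ring allreduce: each node sends 2 * (N - 1) / N of the message
    # (reduce-scatter followed by all-gather).
    return 2 * (n_nodes - 1) / n_nodes * msg_bytes

def in_network_allreduce_bytes_per_node(msg_bytes: float) -> float:
    # SHARP-style: each node sends its contribution upstream once and receives
    # the reduced result once; the switches perform the aggregation.
    return msg_bytes

msg = 1e9          # 1 GB of gradients per allreduce (assumed)
nodes = 1024
ring = ring_allreduce_bytes_per_node(msg, nodes)
innet = in_network_allreduce_bytes_per_node(msg)
print(f"ring allreduce: {ring / 1e9:.2f} GB sent per node")
print(f"in-network    : {innet / 1e9:.2f} GB sent per node "
      f"(~{ring / innet:.1f}x less data injected per node)")
```

Under this model the per-node traffic roughly halves; the larger practical benefit of in-network reduction is that it also removes several software aggregation steps from the critical path.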
Broadcom
Broadcom offers a comprehensive AI/HPC network portfolio, including switch silicon and network adapters. Broadcom’s strategic acquisition of Correct Networks brought in EQDS, a UDP-based transport that moves all queuing out of the core network and into the transmitting host or the leaf switch. This approach underpins the switch optimizations in the Jericho3/Ramon3 chip combination, a “fully scheduled fabric” with packet spraying, reordering buffers in the leaf switches, path rebalancing, congestion notification and drop handling, and hardware-driven in-band fault recovery. The Tomahawk series (51.2 Tb/s) is optimized for single-chip capacity and is not a fully scheduled fabric. Tomahawk switches still support edge queues and latency-critical functions in hardware, such as global fabric-level load balancing and path rebalancing; however, Tomahawk does not reorder packets in the leaf switch, so reordering buffers must be implemented in the network adapters (endpoints).
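The sketch below gives a deliberately simplified picture of the edge-queuing idea behind EQDS: packets wait in a queue at the sending edge and enter the core only when the receiver grants credits, so the core itself stays close to queue-free. The class names, credit sizes, and pacing policy are invented for illustration and are not Broadcom’s implementation.

```python
from collections import deque

class EdgeQueuedSender:
    """Holds packets at the edge until the receiver grants credit (EQDS-style sketch)."""
    def __init__(self):
        self.queue = deque()
        self.credits = 0          # packets we are currently allowed to inject

    def enqueue(self, pkt: str) -> None:
        self.queue.append(pkt)    # queuing happens here, not in the core switches

    def grant(self, n: int) -> None:
        self.credits += n         # credit arrives from the receiver's pacer

    def transmit(self) -> list:
        sent = []
        while self.credits > 0 and self.queue:
            sent.append(self.queue.popleft())
            self.credits -= 1
        return sent               # only credited packets ever reach the fabric

class ReceiverPacer:
    """Grants credits no faster than the receiver's downlink can drain them."""
    def __init__(self, downlink_pkts_per_tick: int):
        self.rate = downlink_pkts_per_tick

    def tick(self, senders: list) -> None:
        # Share the downlink among active senders (simplified round-robin).
        active = [s for s in senders if s.queue]
        for s in active:
            s.grant(max(1, self.rate // max(1, len(active))))

# Two senders bursting toward one receiver: the edge queues absorb the burst,
# and the core only ever carries paced, credited traffic.
a, b = EdgeQueuedSender(), EdgeQueuedSender()
for i in range(8):
    a.enqueue(f"A{i}")
    b.enqueue(f"B{i}")
pacer = ReceiverPacer(downlink_pkts_per_tick=4)
for tick in range(4):
    pacer.tick([a, b])
    print(f"tick {tick}:", a.transmit() + b.transmit())
```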
Cisco
Cisco recently launched its 51.2 Tb/s Silicon One switch, demonstrating the versatility of its networking silicon. The switch is P4-programmable, so it can be flexibly programmed for a variety of network use cases. Cisco’s Silicon One-based switches provide support for fully scheduled fabrics, load balancing, hardware fault isolation, and telemetry, and Cisco partners with multiple NIC vendors to deliver complete AI/ML network solutions.
Conclusion
Ethernet standardization for AI/HPC networks has only just begun, and driving cost and power down further will require scale, open innovation, and multi-vendor competition. The Ultra Ethernet Consortium brings together the major networking stakeholders and is committed to creating an open, “full-stack” Ethernet solution tailored to AI/HPC workloads. As discussed above, most of the “necessary” AI/HPC network technologies have already been deployed in some form by the various Ethernet vendors and hyperscalers. The standardization challenge, therefore, is not technical; it is about building consensus.