A Glimpse Inside the Marvell 51.2T (64-Port 800G) Switch

Today, we are exploring the internals of a massive switch: 64 ports of 800GbE. The Marvell Teralynx 10 is a 51.2Tbps switch that is set to be a key component of AI clusters heading into 2025, and it is a fascinating piece of hardware.

Marvell acquired Innovium in 2021, after we had already torn down the startup's 12.8Tbps generation in the form of an Innovium Teralynx 7-based 32x 400GbE switch.

Innovium emerged as the most successful startup of its generation, making significant inroads into hyperscale data centers. For instance, in 2019, Intel announced its acquisition of Barefoot Networks to obtain Ethernet switch chips. However, by Q4 2022, Intel declared its intention to divest this Ethernet switch business. Broadcom holds a significant position in the commercial switch chip market, while Innovium/Marvell has successfully penetrated hyperscale data centers, unlike other companies that invested heavily but failed.

Given the scale at which AI clusters are being built, the 51.2Tbps switch chip generation is a big deal. We asked Marvell if we could update that 2021 Teralynx 7 teardown with a look inside the new Marvell Teralynx 10.

Front View of the Switch

The switch features a 2U chassis, primarily composed of OSFP cages and airflow channels. There are 64 OSFP ports in total, each operating at 800Gbps.
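
As a quick sanity check on the headline figure, the aggregate bandwidth is simply the port count times the per-port speed; here is a trivial sketch using the numbers above:

```python
# Aggregate front-panel bandwidth of the Teralynx 10 switch
ports = 64                 # OSFP cages on the front panel
gbps_per_port = 800        # each OSFP port runs at 800GbE

total_tbps = ports * gbps_per_port / 1000
print(f"{ports} x {gbps_per_port}GbE = {total_tbps:.1f} Tbps")  # 51.2 Tbps
```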

Each port is equipped with OSFP pluggable optics, which are generally larger than the QSFP+/QSFP28 generation devices you might be accustomed to.

Marvell has introduced several optical modules, leveraging components from its acquisition of Inphi. We have discussed this in various contexts, such as the Marvell COLORZ 800G silicon photonics modules and the Orion DSP for next-generation networks. This switch can utilize these optical modules, and the ports can operate at speeds other than 800Gbps.

One of the intriguing aspects is the long-distance optical modules, capable of achieving 800Gbps over hundreds of kilometers or more. These modules fit into OSFP cages and do not require the large long-distance optical boxes that have been industry standards for years.

OSFP modules can have integrated heat sinks, eliminating the need for heat sinks in the cages. In some 100GbE and 400GbE switches, optical cages require heat sinks due to the high power consumption of the modules.

On the right side of the switch, there are management and console ports.

The back of the switch houses the fans and the power supplies, and each power supply has its own fan.

Given that this switch can use optical modules consuming around 1.8kW and has a 500W switch chip, power supplies rated over 2kW are expected.
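
Here is a rough, hypothetical power budget consistent with those figures; the per-module optics average and the fan/management allowance are our own assumptions, not Marvell specifications:

```python
# Rough, back-of-the-envelope power budget for a fully loaded chassis.
# Assumptions (not official figures): ~1.8kW total for optics, a 500W ASIC,
# plus an estimated allowance for fans and the management complex.
optics_total_w = 1800          # ~64 OSFP modules; roughly 28W each on average
asic_w = 500                   # Teralynx 10 switch chip
fans_and_misc_w = 300          # assumed: fan modules, management board, etc.

total_w = optics_total_w + asic_w + fans_and_misc_w
print(f"Estimated load: ~{total_w} W")                 # ~2600 W
print("So power supplies rated over 2kW make sense.")
```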

Next, let’s delve into the internals of the switch to see what powers these OSFP cages.

We will start from the OSFP cages on the right and move towards the power supplies and fans on the left.

Overview of the Switch

Upon opening the switch, the first thing that catches the eye is the large heat sink.

This heat sink, shown with an expired passport for scale, is quite substantial.

Here is a bottom view of the heat sink.

The chip itself is a 500W, 5nm component.

Marvell allowed us to clean the chip to take some photos without the heat sink.

This gives us a clear view of the OSFP cages without the heat sink.

From this perspective, only 32 of the OSFP cages are visible because the switch PCB sits between the two banks of cages.

Behind the OSFP cages, we have the Teralynx 10 chip.

For those interested, more detailed information about the Teralynx 10 can be found in our earlier feature diagram.

One notable difference is that many components on the switch are angled, rather than being horizontal or parallel to the edges of the switch chip.

Here is a top-down photo of the switch, showcasing the 64-port 800GbE switch chip. For those more familiar with servers, think of 800GbE single-port NICs as a PCIe Gen6-era technology, while today's NICs top out at 400GbE over PCIe Gen5 x16. This one chip has enough bandwidth to feed 128 of the fastest PCIe Gen5 400GbE NICs available today.
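
The NIC comparison is simple arithmetic; a quick sketch:

```python
# How many of today's fastest server NICs one Teralynx 10 could feed
switch_tbps = 51.2
nic_gbps = 400                 # fastest common PCIe Gen5 x16 NIC speed today

nics = round(switch_tbps * 1000 / nic_gbps)
print(f"{switch_tbps}Tbps / {nic_gbps}GbE = {nics} NICs")  # 128
```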

Like many switches, the Teralynx 10 switch has a dedicated management controller, in this case a Marvell Octeon-based management board. We were told that other switch designs might use x86 instead.

An M.2 SSD is located on the main power distribution board.

An interesting feature is the built-in PCIe slot for diagnostics.

Just below this, there is a 10Gbase-T port exposed internally as a management interface.

Another aspect to consider is the thickness of the switch PCB. If server motherboards were this thick, many 1U server designs would face significant cooling challenges. In terms of cooling, the switch has a relatively simple fan setup, with four fan modules at the rear of the chassis.

Marvell has a lab in another building where these switches are tested. The company temporarily cleared the lab to allow us to photograph the switch in operation.

Here is the back view.

Next to the Teralynx 10 switch is the Keysight Ixia AresONE 800GbE test box.

Generating 800GbE traffic on a single port is no easy feat, as it is faster than PCIe Gen5 x16 on servers. It was fascinating to see this device in operation in the lab. We had previously purchased a neat second-hand Spirent box for 10GbE testing, but Spirent refused to provide a media/analyst license. Devices like this 800GbE box are incredibly expensive.
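
To see why, here is a simplified comparison of an 800GbE port against a PCIe Gen5 x16 slot, using the standard 32GT/s-per-lane and 128b/130b encoding figures and ignoring higher-level protocol overhead:

```python
# Why a single 800GbE port outruns a PCIe Gen5 x16 slot (simplified model)
lanes = 16
gen5_gt_per_lane = 32.0            # GT/s per PCIe Gen5 lane
encoding = 128 / 130               # 128b/130b line encoding

pcie_gbps = lanes * gen5_gt_per_lane * encoding   # ~504 Gbps, before
                                                  # TLP/DLLP protocol overhead
print(f"PCIe Gen5 x16: ~{pcie_gbps:.0f} Gbps")
print(f"800GbE port:    800 Gbps -> {800 / pcie_gbps:.2f}x the slot's bandwidth")
```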

The company also has a larger chassis in the lab for 100GbE testing. As a switch vendor, Marvell needs such equipment to validate performance under various conditions.

Here is an example of dual 400GbE running through the Teralynx switch at approximately 99.3% line rate.
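
For context, here is the basic math behind a line-rate figure like that, using standard Ethernet per-frame overhead (preamble, start-of-frame delimiter, and inter-frame gap); the frame size below is just an illustrative choice, not the test's actual configuration:

```python
# Theoretical line rate vs. an observed throughput percentage
port_gbps = 400
ports = 2
frame_bytes = 1518                      # example frame size
overhead_bytes = 7 + 1 + 12             # preamble + SFD + inter-frame gap

wire_bits = (frame_bytes + overhead_bytes) * 8
max_fps = ports * port_gbps * 1e9 / wire_bits
print(f"Theoretical max: {max_fps/1e6:.1f} Mfps across {ports}x {port_gbps}GbE")

observed = 0.993                        # ~99.3% of line rate, as in the lab demo
print(f"At {observed:.1%}: ~{ports * port_gbps * observed:.0f} Gbps of traffic")
```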

Why Choose a 51.2Tbps Switch?

There are two main forces driving the adoption of 51.2T switches in the market. The first is the ever-popular topic of AI, and the second is the impact of power consumption and radix.

Marvell’s Teralynx 10 offers a latency of approximately 500 nanoseconds while providing immense bandwidth. This predictable latency, combined with the switch chip’s congestion control, programmability, and telemetry features, helps ensure that large clusters maintain optimal performance. Allowing AI accelerators to idle while waiting for the network is a very costly proposition.

Another example is radix. Larger switches can reduce the number of switching layers, which in turn reduces the number of switches, fibers, cables, and other components needed to connect the cluster.

Since the Teralynx 10 supports a radix of 512, connecting up to 512x 100GbE links, some networks can drop from three tiers of switching to just two. In large AI training clusters, this saves not only on capital equipment but also significantly reduces power consumption. Marvell shared an example where the larger radix could cut power consumption by over 1MW.
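
To illustrate why radix matters, consider a simplified non-blocking two-tier leaf-spine model in which each switch splits its ports evenly between hosts and uplinks; this is our own illustrative sketch, not a Marvell reference design:

```python
# Simplified non-blocking two-tier leaf-spine capacity as a function of radix.
# Each leaf uses R/2 ports for hosts and R/2 for uplinks; with R-port spines,
# up to R leaves fit, giving R^2/2 host ports in two tiers.
def two_tier_hosts(radix: int) -> int:
    return radix * radix // 2

for radix in (256, 512):
    print(f"radix {radix}: up to {two_tier_hosts(radix):,} host ports in two tiers")
# radix 256: 32,768 host ports; radix 512: 131,072 host ports.
# A higher radix lets many clusters that would otherwise need three tiers fit in two.
```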

Marvell also shared a slide showing a switch with an interesting cooler extending from the chassis. This appears to be a desktop prototype, which we found quite intriguing.

Finally, while we often see the front and even the back of switches in online and data center photos, we rarely get to see how these switches operate internally. Thanks to Marvell, we were able to see the switch in operation and even disassemble it down to the silicon.

Innovium, now a subsidiary of Marvell, is one of the few teams in the industry that has successfully competed with Broadcom and achieved hyperscale victories. We have seen other major silicon suppliers fail in this process. Given the market demand for high-radix, high-bandwidth, low-latency switching in AI clusters, the Teralynx 10 is likely to become the company’s largest product line since the Teralynx 7. The competition in this field is intense.

Of course, there are many layers to any network. We could do an entire deep dive on the optical modules alone, not to mention software, performance, and more. Still, showing what actually happens inside these switches is fascinating.
