Overview
Last year, traditional cloud providers cut general-purpose computing spending and most AI orders were captured by NVIDIA, so AEC (Active Electrical Cable) did not fully benefit from demand for high-speed interconnects, and the market was relatively sluggish.
Starting in the second half of this year, as cloud providers took control of their own AI network deployments and deployed more compute based on self-developed chips, AEC’s advantages (medium-to-long reach, controllable error rates, and cost-effectiveness) won over more customers, leading to significant growth.
Major buyers such as AWS and X.AI are now deploying AEC on a large scale for high-speed interconnections within and between cabinets. Microsoft, Google, and Chinese firms (Alibaba, ByteDance, etc.) have also begun adopting AEC.
Over the next 1-2 years, the AEC market is expected to see volume rise while price falls: volumes doubling rapidly and prices declining gradually. The overall market is set to expand steadily, and the competitive landscape will become more diversified.
Market Changes in AEC Over the Past Year and a Half
Around May and June of last year, when NVIDIA launched the GB200, there were discussions, including among companies like FiberMall, about using AEC (Active Electrical Cable) connections. At that time, Microsoft was not very satisfied with the first batch of FiberMall’s AEC, so no large orders were placed. The industry generally believed that AEC had difficulty meeting data center requirements for reach and error rate. Why, then, has this technology suddenly become popular a year and a half later, with large companies like Amazon now placing orders?
Why AEC Became Popular Again
Previously, NVIDIA’s solution used copper cables. In the GH200 system, the first layer consists of eight cards interconnected with the first-layer switch using what are called “cartridges,” which are essentially bundles of copper cables. NVIDIA did not use AEC for these links; it used passive copper cables (DAC).
Traditional cloud computing networks, however, had already been using FiberMall’s AEC. Around February and March of last year, Microsoft cut many AEC orders because traditional cloud computing businesses were squeezed by AI investment, and many orders were canceled or postponed.
At that time, AEC was mainly used at low-to-mid data rates in general-purpose computing data centers. Traditional cloud computing reduced inventory and capital expenditures, diverting funds to AI-related projects. Last year was not favorable for AEC, because AI growth was captured by NVIDIA’s packaged solutions (GPUs plus interconnect) sold to cloud computing providers, which left no place for AEC. NVIDIA used DAC (Direct Attach Copper) and AOC (Active Optical Cables) with multimode optical modules.
Since the second half of this year, more customers (cloud providers) have started building their AI networks independently, not fully relying on NVIDIA’s packaged solutions. Consequently, AEC has seen a surge in demand, particularly with noticeable orders from AWS.
Why does NVIDIA persist with ACC for interconnection instead of opting for AEC?
NVIDIA appears to favor Active Copper Cables (ACC) over AEC. Three considerations help explain why.
Latency Considerations
AEC relies on retimer chips to retime the signal, whereas ACC uses simpler redriver amplification, which adds less latency. NVIDIA prioritizes latency, making ACC the more attractive option.
High-Density Deployment
NVIDIA’s GPUs have high computational density, with short distances between cards within a rack, and low-latency ACC suits this environment. AEC’s advantage is longer reach (5-7 meters), which suits chip clusters with lower computational density than NVIDIA’s, such as AWS’s Trainium2, where many cards must be interconnected over longer distances.
Cost Differences
NVIDIA considers the cost difference between ACC and AEC to be minimal. Although ACC might be slightly cheaper, its lower latency aligns better with NVIDIA’s product positioning. From the perspective of cloud service providers, AEC would be selected for longer distances and lower-density structures.
Growth in Demand for AEC from Various Manufacturers
AWS (Trainium2)
AWS procures approximately 1.5 million cards annually, mostly interconnected using AEC. Trainium2, with lower computational power than NVIDIA’s H100, can operate with 400G AEC (instead of 800G). With the potential introduction of Trainium3 by the end of the year, the demand for 800G AEC may increase. Currently, FiberMall alone cannot meet AWS’s demand and is actively expanding its AEC production capacity.
Microsoft
Historically, Microsoft’s procurement of AEC has been stable, primarily for use in general-purpose data centers. AI-related demand for AEC has not yet surged dramatically. Microsoft is now starting to use AEC to build AI networks, although the growth rate is slower than that of AWS.
Other Manufacturers
X.AI has recently shown significant demand for AEC, and its growth may outpace Microsoft’s next year. It buys NVIDIA cards in large volumes but prefers cost-effective solutions like AEC for first-layer interconnections. Google’s TPU interconnect (ICI) currently uses passive copper cables (DAC); as speeds increase, however, it may transition to AEC. In China, companies like Alibaba and ByteDance are also considering or have begun adopting AEC.
The Relationship Between AEC and Optical Modules: Substitutive or Complementary?
Layered Structure
In an AI network, interconnections can be layered as follows:
- GPU/Accelerator Card ↔ Top-of-Rack (ToR) Switch
- ToR ↔ Higher-Level Switches
For the first layer (within a rack), where distances are short, copper options (DAC, ACC, and AEC) and AOC are all viable. Optical modules are typically used for longer, cross-rack distances.
Limited Substitution Effect
Switching from passive copper cables (DAC) to AEC does not affect optical modules. AEC can partially replace AOC (short-distance active optical cables) or multimode optical modules, but manufacturers like NVIDIA are unlikely to abandon optical solutions completely.
Overall, while AEC may capture some market share from AOC or multimode optical modules, the extent depends on factors like cabling needs, latency, cost, and maintenance considerations. Accurate predictions are challenging without specific design details from various manufacturers. Current order information suggests that AEC won’t significantly impact the share of optical modules.
Market Size and Outlook for AEC
Growth Rate
This year’s AEC market is valued at less than $300 million and is expected to double to around $600 million next year. Shipment volumes might increase from 1-2 million units this year to about 5 million units next year, accompanied by a price drop.
Price Trends
Currently, a 400G AEC costs approximately $150 per unit, while an 800G AEC costs about $250. With more manufacturers entering the market, competition is expected to drive prices down by around 20% per year. The entry of Chinese manufacturers will further pressure profit margins, leading to overall price reductions.
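To make the price trend concrete, the short Python sketch below compounds the roughly 20% annual decline from the current price points. It assumes the decline is constant and applies equally to 400G and 800G, which the figures above do not guarantee; the function name `project_price` is purely illustrative.

```python
def project_price(start_price: float, annual_decline: float, years: int) -> list[float]:
    """Return the price at year 0, 1, ..., `years` under a constant annual decline."""
    if not 0.0 <= annual_decline < 1.0:
        raise ValueError("annual_decline must be in [0, 1)")
    return [start_price * (1.0 - annual_decline) ** y for y in range(years + 1)]


# Price points quoted in the text (USD per cable).
PRICES_TODAY = {"400G": 150.0, "800G": 250.0}
# Assumption: the ~20% yearly decline holds every year and for both speeds.
ANNUAL_DECLINE = 0.20

for name, price in PRICES_TODAY.items():
    path = project_price(price, ANNUAL_DECLINE, years=2)
    print(name, [round(p, 2) for p in path])
```

Under these assumptions a 400G cable would fall from $150 to about $120 after one year and about $96 after two, and an 800G cable from $250 to about $200 and then $160.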
Customer Breakdown
- AWS: Anticipated to be the largest contributor to growth, with over 2 million units by the end of this year and next year.
- Microsoft: Incremental growth remains steady, primarily in cloud computing and some AI networks.
- X.AI: Experiencing rapid growth, potentially requiring 800,000-900,000 units annually.
- Google/NVIDIA: Only minor upgrades are planned.
- China’s Alibaba, ByteDance, etc.: Gradually increasing adoption, contributing to an overall upward trend.
Case Study: Interconnection of X.AI with GB200/B300
X.AI has purchased numerous GB200 or B300 chips from NVIDIA. However, NVIDIA uses passive copper cables or ACC for its internal 72-card interconnection, not AEC. So, where does X.AI use AEC?
Within a GPU rack (72 cards), the connections between the cards and the Top-of-Rack (ToR) switch require cable lengths ranging from several meters to more than 5 meters. In high-density large cabinets, where copper cables must bend and wind, lengths of 3-5 meters or more are often required. When ACC or DAC cannot cover that distance or produces higher error rates, AEC, which supports 5-7 meters, is needed. X.AI therefore uses AEC for the connections from inside the cabinet to the ToR switch, while the links from the top of the cabinet to other switches may use optical modules.
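The selection logic described here can be sketched as a simple reach-based rule. This is only an illustration: the 7-meter AEC limit is the upper end of the 5-7 meter range quoted above, while the DAC and ACC limits (2 m and 3 m) are hypothetical placeholders, since the text gives no figures for them. Real choices also weigh latency, error rate, and cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReachLimits:
    """Maximum usable cable length per type, in meters."""
    dac_max_m: float = 2.0  # hypothetical placeholder, not from the text
    acc_max_m: float = 3.0  # hypothetical placeholder, not from the text
    aec_max_m: float = 7.0  # upper end of the 5-7 m range cited for AEC


def pick_cable(length_m: float, limits: ReachLimits = ReachLimits()) -> str:
    """Pick the simplest cable type whose reach covers the required length."""
    if length_m <= 0:
        raise ValueError("length_m must be positive")
    if length_m <= limits.dac_max_m:
        return "DAC"
    if length_m <= limits.acc_max_m:
        return "ACC"
    if length_m <= limits.aec_max_m:
        return "AEC"
    # Beyond copper reach: fall back to an optical link.
    return "optical"


for length in (1.0, 2.5, 5.0, 10.0):
    print(f"{length} m -> {pick_cable(length)}")
```

With these placeholder limits, a 5-meter run from a card to the ToR switch lands on AEC, which matches the X.AI case, while anything beyond 7 meters falls back to optics.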
Google TPU Interconnection
In Google’s TPU clusters, 64 chips form a “Cube” (similar to a cabinet), with internal interconnections using ICI. Currently, passive copper cables are predominantly used.
Current Usage of DAC
For Google TPU v6, ICI interconnections within a single 64-chip cabinet mostly use DAC.
Potential Upgrade to AEC
As speeds increase further, DAC may fall short on reach and error rate, necessitating active solutions like AEC to ensure transmission quality.
Limited Impact on Optical Modules
The connections within the cabinet are not long-distance, so this is not the primary battlefield for optical modules; inter-cabinet connections typically require optical modules or optical circuit switches (OCS).
Substitution Rate of AEC for Optical Modules
Will widespread use of AEC significantly reduce orders for optical modules?
Overall, the impact is limited. Optical modules are primarily used for cross-cabinet, long-distance scenarios. For the first layer or some short-distance interconnections, the choice might be between DAC, AOC, or AEC. Even within the same data center, customers may use a mix of different solutions. AEC will not completely replace AOC or optical modules. The specific substitution ratio depends on factors such as customer topology design, price, maintenance costs, and latency requirements.
Adoption of AEC by Chinese Manufacturers
Will ByteDance and Alibaba start adopting AEC? And for which chips might it be used?
ByteDance
ByteDance is purchasing chips from multiple suppliers, including Cambricon and NVIDIA, and deploying large numbers of cards in parallel. Multiple suppliers also provide copper cable solutions. For Cambricon-based systems, companies like Broadex Technologies are providing AEC and AOC.
Alibaba
Alibaba is starting to adopt 400G AEC, potentially reaching tens of thousands of units or more, depending on the supply chain’s capacity to meet demand.
Price Estimation and Outlook
AEC volumes are expected to increase two to three times next year, and unit prices are likely to fall somewhat as a result.
This year’s market is approximately $200-$300 million, potentially reaching $600 million next year, with continued high growth in the following years. As more manufacturers enter the competition, prices will continue to fall, and the market structure will undergo reshuffling.
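The volume and revenue figures above can be cross-checked against the price trend. Since revenue is volume times average unit price, the implied price change is the revenue multiple divided by the volume multiple, minus one. The sketch below assumes this year’s market sits at the upper end of the range ($300 million), so revenue doubles; the function name is illustrative.

```python
def implied_price_change(revenue_multiple: float, volume_multiple: float) -> float:
    """Fractional change in average unit price implied by revenue and volume growth.

    revenue = volume * average_price, so
    price_multiple = revenue_multiple / volume_multiple.
    """
    if volume_multiple <= 0:
        raise ValueError("volume_multiple must be positive")
    return revenue_multiple / volume_multiple - 1.0


# Assumption: this year's market is at the top of the $200-$300M range.
REVENUE_MULTIPLE = 600 / 300

# Volume growth scenarios spanning the "two to three times" quoted in the text.
for volume_multiple in (2.0, 2.5, 3.0):
    change = implied_price_change(REVENUE_MULTIPLE, volume_multiple)
    print(f"volume x{volume_multiple}: average price change {change:+.0%}")
```

On these assumptions, a 2.5x volume increase alongside a doubling of revenue implies an average price drop of about 20%, in line with the annual decline cited earlier; a 3x increase would imply roughly a one-third drop.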