SemiAnalysis argues that the innovations NVIDIA unveiled at GTC 2025, including inference-time token scaling, the Dynamo inference stack, and Co-Packaged Optics (CPO), will significantly reduce the total cost of ownership for AI, greatly lowering the cost of deploying efficient inference systems and strengthening NVIDIA's leading position in the global AI ecosystem.
On Tuesday, March 18th local time, $NVIDIA(NVDA.US)$ CEO Jensen Huang delivered the keynote at GTC 2025, NVIDIA's AI conference held in San Jose, California. The well-known semiconductor research firm SemiAnalysis published an in-depth interpretation of the speech, detailing NVIDIA's latest advances in AI inference performance.
The market worries that the large cost savings from DeepSeek-style software optimization and NVIDIA-led hardware advances could reduce demand for AI hardware. But demand responds to price: as the cost of AI falls, the boundaries of AI capability keep being pushed outward, and demand grows.
With NVIDIA improving inference efficiency in both hardware and software, the cost of deploying model inference and AI agents has fallen sharply, producing a cost-driven diffusion effect that actually increases consumption, as Jensen Huang's new slogan puts it: "The more you save, the more you buy."
The following are the key points of the article:
· Inference token explosion: the synergy of pre-training, post-training, and inference-time (test-time) scaling laws keeps pushing AI model capability forward.
· Jensen Huang's math rules: FLOPs quoted at 2:4 sparsity, bandwidth quoted bidirectionally, and a new convention that counts GPUs by the number of GPU chips per package.
· GPU and system roadmap: key specifications and performance gains of Blackwell Ultra B300, Rubin, and Rubin Ultra, highlighting next-generation breakthroughs in compute, memory, and interconnect.
· The newly released Dynamo inference stack: new features such as the Smart Router, GPU Planner, improved NCCL collectives, NIXL, and the NVMe KV-Cache Offload Manager greatly improve inference throughput and efficiency.
· Co-Packaged Optics (CPO): the advantages of CPO in cutting power consumption, raising switch radix, and flattening the network, and its potential in future large-scale deployments.
The article argues that these innovations will significantly reduce the total cost of ownership for AI, greatly lowering the cost of deploying efficient inference systems and consolidating NVIDIA's leading position in the global AI ecosystem.
The following is the full text of SemiAnalysis's in-depth interpretation (AI-translated).
Inference Token Explosion
AI models are improving at an accelerating pace, with more progress in the past six months than in the six months before that. This trend will continue, because three scaling laws (pre-training scaling, post-training scaling, and inference-time scaling) are working together to drive it.
This year's GTC (GPU Technology Conference) focuses on addressing these new scaling paradigms.

Claude 3.7 has demonstrated impressive performance in software engineering. DeepSeek V3 shows that the cost of the previous generation of models is falling sharply, which will further broaden their adoption. OpenAI's o1 and o3 models prove that extending inference time and search significantly improves answer quality. As with the earlier pre-training scaling laws, there appears to be no upper limit to the compute that can be applied in the post-training phase. This year, NVIDIA is committed to dramatically improving inference cost efficiency, targeting a 35x improvement in inference cost to support model training and deployment.
Last year's slogan was "The more you buy, the more you save"; this year's has become "The more you save, the more you buy." The inference-efficiency improvements in NVIDIA's hardware and software significantly reduce the cost of deploying model inference and AI agents, creating a cost-driven diffusion effect, a classic instance of the Jevons Paradox.
The market worries that the large cost savings from DeepSeek-style software optimization and NVIDIA's hardware advances could reduce demand for AI hardware and leave the market with a surplus of tokens. But price shapes demand: as AI costs fall, the boundaries of AI capability keep being broken and demand rises. Today AI capability is constrained by inference cost, and as that cost declines, actual consumption may well increase.
Concerns about token deflation echo old debates about the falling cost per bit of fiber-optic internet connectivity, which overlooked the eventual impact of websites and internet applications on our lives, society, and economy. The key difference is that bandwidth demand has a ceiling, whereas with capability rising sharply and costs falling, demand for AI can grow essentially without bound.
The data NVIDIA presented supports the Jevons Paradox view: existing models already involve more than 100 trillion tokens, while a single reasoning model generates 20 times as many tokens and demands 150 times more compute.

At test time, each query can require hundreds of thousands of tokens, with billions of queries every month. In the post-training scaling phase, the model "goes to school": each model processes trillions of tokens, and there are hundreds of thousands of post-trained models. On top of that, agentic AI means multiple models will work together to solve increasingly complex problems.
Jensen Huang's math changes every year
Every year Jensen Huang introduces new math conventions. This year is especially involved, as we now observe a third new rule.
The first rule is that the FLOPs NVIDIA publishes are measured with 2:4 sparsity (which, in practice, no one uses), while the true performance indicator is dense FLOPs: the H100 is marketed at roughly 1,979 TFLOPs of FP16 with sparsity, whereas its actual dense FP16 performance is 989.4 TFLOPs.
The second rule is that bandwidth is quoted bidirectionally. NVLink 5 is advertised at 1.8TB/s because its 900GB/s of transmit bandwidth is added to 900GB/s of receive bandwidth. The arithmetic holds up on the spec sheet, but in networking the convention is to quote unidirectional bandwidth.
Now a third rule has emerged: GPUs are counted by the number of GPU chips in a package rather than by the number of packages. The convention starts with the Rubin generation: the first Vera Rubin rack will be called NVL144 even though its system architecture is similar to the GB200 NVL72, using the same Oberon rack and 72 GPU packages. Puzzling as this new counting is, in Jensen Huang's world we can only accept the change.
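To make the three conventions concrete, here is a small arithmetic sketch in Python that converts the marketed numbers back to the conventional ones, using only figures quoted above.

```python
# Converting NVIDIA's marketed numbers back to conventional ones,
# using only the figures quoted above.

# Rule 1: FLOPs are quoted with 2:4 sparsity, which doubles the dense figure.
h100_fp16_dense_tflops = 989.4
h100_fp16_marketed_tflops = h100_fp16_dense_tflops * 2        # ~1979, the sparse number

# Rule 2: bandwidth is quoted bidirectionally (TX + RX added together).
nvlink5_tx_gb_s = 900
nvlink5_marketed_gb_s = nvlink5_tx_gb_s * 2                   # 1800 GB/s = 1.8 TB/s
nvlink5_unidirectional_gb_s = nvlink5_tx_gb_s                 # networking convention

# Rule 3: GPUs are counted per chip, not per package.
vera_rubin_packages = 72
chips_per_package = 2
marketed_gpu_count = vera_rubin_packages * chips_per_package  # 144 -> "NVL144"

print(h100_fp16_marketed_tflops, nvlink5_marketed_gb_s, marketed_gpu_count)
```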
Now, let’s review the roadmap.
Roadmap for GPUs and Systems

Blackwell Ultra B300

The Blackwell Ultra B300 was previewed, and the details are essentially consistent with what we shared last Christmas. The main specifications: the GB300 will not be sold as a full board, but as the B300 GPU on an SXM module together with the Grace CPU in a BGA package. On performance, the B300 delivers over 50% more dense FP4 FLOPs than the B200. Memory capacity rises to 288GB per package (8 stacks of 12-Hi HBM3E), while bandwidth stays at 8TB/s. The key to the FLOPs gain is stripping out many (but not all) of the FP64 units and replacing them with FP4 and FP6 units. Double-precision workloads serve mainly HPC and supercomputing rather than AI; this may disappoint the HPC community, but NVIDIA is shifting its emphasis to the more important AI market.
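As a quick sanity check on the 288GB figure, the stack math works out as follows (12-Hi HBM3E with 24Gb DRAM dies, the same per-die density noted in the HBM section below):

```python
# Back-of-envelope check of the 288GB-per-package HBM figure.
stacks_per_package = 8
dies_per_stack = 12        # 12-Hi HBM3E
gbit_per_die = 24          # 24Gb DRAM dies

capacity_gb = stacks_per_package * dies_per_stack * gbit_per_die / 8  # bits -> bytes
print(capacity_gb)  # 288.0
```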
The HGX version of the B300 is now called the B300 NVL16. It uses the single-die GPU previously known as the 'B300A', now simply called the 'B300'. Because the single-die B300 lacks the high-speed die-to-die (D2D) interface that links two GPU chips, there may be additional communication overhead.
The B300 NVL16 replaces the B200 HGX form factor, putting 16 packages, each with one GPU chip, on a single baseboard. This is done by placing two single-chip packages on each SXM module, across 8 SXM modules in total. It remains unclear why NVIDIA did not simply continue with 8 dual-chip B300s instead; we suspect the goal is better yield from smaller CoWoS modules and package substrates. Notably, this packaging uses CoWoS-L rather than CoWoS-S, a significant decision: the maturity and capacity of CoWoS-S were the original rationale for the single-chip B300A, so the shift suggests CoWoS-L has matured rapidly and its yields have stabilized after a slow start.
These 16 GPUs communicate over the NVLink protocol, as on the B200 HGX, with two NVSwitch 5.0 ASICs sitting between the two rows of SXM modules.
A new detail is that, unlike previous HGX models, the B300 NVL16 will no longer use PCIe retimers from $Astera Labs(ALAB.US)$. However, some hyperscale cloud providers may choose to add PCIe switches. We disclosed this to Core Research subscribers earlier this year.
Another important detail: the B300 introduces the CX-8 NIC, which provides 4 lanes of 200G for 800G of total throughput, double the existing CX-7 NIC and delivering next-generation InfiniBand network speeds.
Rubin Technical Specifications


Rubin will be built on $Taiwan Semiconductor (TSM.US)$'s 3nm process, with two reticle-sized compute dies and two I/O tiles on either side. The I/O tiles house all of the NVLink, PCIe, and NVLink C2C IP, freeing up more area on the main dies for compute.
Rubin offers an incredible 50 PFLOPs of dense FP4 compute, more than three times the B300. How did NVIDIA achieve this generational jump? By scaling along several key vectors:
1. As mentioned above, moving I/O into separate tiles frees up perhaps 20%-30% more die area for additional streaming multiprocessors and tensor cores.
2. Rubin will use a 3nm-class process, either a custom NVIDIA '3NP' or standard N3P. The move from 4NP to 3nm significantly improves logic density, although SRAM shrinks very little.
3. Rubin will run at a higher TDP, which we estimate at around 1800W; this may also allow higher clock frequencies.
4. Architecturally, NVIDIA's tensor-core systolic array keeps growing generation over generation: from 32×32 in Hopper to 64×64 in Blackwell, and possibly 128×128 in Rubin (a rough per-array sketch follows below). Larger systolic arrays offer better data reuse and lower control overhead, and are more efficient in area and power. Despite the added programming difficulty, NVIDIA sustains high parametric yield through built-in redundancy and repair, so the failure of individual compute units does not compromise overall performance. This contrasts with TPUs, whose very large tensor cores lack the same fault tolerance.
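To put rough numbers on point 4, here is an illustrative sketch of how the per-array MAC count grows with systolic-array size; the 128×128 figure for Rubin is, as noted above, speculation rather than a confirmed specification, and this says nothing about how many arrays each SM carries.

```python
# Illustrative only: MACs per tensor-core systolic array per cycle.
# The 128x128 Rubin entry is speculative, as noted in the text.
arrays = {"Hopper": 32, "Blackwell": 64, "Rubin (speculated)": 128}

for name, dim in arrays.items():
    macs = dim * dim                 # multiply-accumulate units in the array
    flops_per_cycle = 2 * macs       # one multiply plus one add per MAC
    print(f"{name}: {dim}x{dim} -> {macs} MACs, {flops_per_cycle} FLOPs per cycle per array")
```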

Rubin will continue to use the Oberon rack architecture, like the GB200/GB300 NVL72, and will pair with the Vera CPU, the 3nm successor to Grace. Notably, Vera uses NVIDIA's fully custom cores, whereas Grace relied heavily on Arm's Neoverse CSS cores. NVIDIA has also built a custom interconnect that lets a single CPU core access far more memory bandwidth, something AMD and Intel will find hard to match.
This is the origin of the new naming convention. The new rack will be named VR200 NVL144. Although the system architecture is similar to the previous GB200 NVL72, the new system contains 2 compute chips per package for a total of 144 compute chips (72 packages × 2 compute chips/package), and NVIDIA is changing the way we count the number of GPUs.
As for AMD, its marketing team should take note: under this counting convention, the MI300X family could be marketed as 64 GPUs per system (8 packages per system × 8 XCD chiplets per package), a marketing opportunity currently going unused.
HBM and Interconnect
NVIDIA keeps HBM capacity at 288GB per package from generation to generation, but upgrades to HBM4: 8 stacks, each 12-Hi, with per-die density unchanged at 24Gb. HBM4 lifts total bandwidth to 13TB/s, mainly because the bus width doubles to 2048 bits per stack, with a pin speed of 6.5Gbps in line with the JEDEC standard.
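The 13TB/s figure follows directly from the quoted bus width and pin speed; a quick check:

```python
# Back-of-envelope check of the ~13TB/s HBM4 bandwidth figure.
stacks = 8
bus_width_bits = 2048      # per stack, doubled versus HBM3E
pin_speed_gbps = 6.5       # per pin

per_stack_gb_s = bus_width_bits * pin_speed_gbps / 8   # ~1664 GB/s per stack
total_tb_s = per_stack_gb_s * stacks / 1000            # ~13.3 TB/s
print(per_stack_gb_s, total_tb_s)
```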

Sixth-generation NVLink doubles speed to 3.6TB/s (bidirectional) by doubling the lane count, with NVIDIA still using 224G SerDes.
Returning to the Oberon rack: it still uses a copper backplane, but we believe the cable count has grown accordingly to accommodate the doubled lanes per GPU.
The next-generation NVSwitch ASIC likewise doubles total bandwidth by doubling its lane count, further improving switch performance.
Rubin Ultra specifications.

Rubin Ultra is where performance steps up dramatically. NVIDIA puts 16 HBM stacks directly in one package, up from 8. Each package consists of four reticle-sized GPU chips with two I/O chips in the middle. Compute area doubles, and compute performance doubles as well, to 100 PFLOPs of dense FP4. HBM capacity grows to 1024GB, more than 3.5 times that of the regular Rubin, by increasing both stack count and stack height: to reach 1TB of memory, the package carries 16 HBM4E stacks, each with 16 layers of 32Gb DRAM dies.
We believe this package will be split across two interposers on the substrate to avoid a single enormous interposer (nearly 8 times reticle size). The two middle GPU chips will be linked through the slim I/O chips, with the communication routed through the substrate. This requires a very large ABF substrate, exceeding the current JEDEC package-size limits of 120mm in both width and height.
The system has 365TB of fast memory in total. Each Vera CPU carries 1.2TB of LPDDR, for 86TB across 72 CPUs, leaving roughly 2TB of LPDDR per GPU package as an additional second tier of memory. This is where custom HBM base dies come in: the LPDDR memory controllers are integrated into the base die to serve this extra memory tier, which sits in LPCAMM modules on the board and works alongside the second-tier memory attached to the Vera CPUs.
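The 1024GB figure likewise follows from the stack configuration described above:

```python
# Back-of-envelope check of the 1024GB (1TB) HBM per Rubin Ultra package.
stacks = 16                # doubled from 8
dies_per_stack = 16        # 16-Hi HBM4E
gbit_per_die = 32          # 32Gb DRAM dies

capacity_gb = stacks * dies_per_stack * gbit_per_die / 8  # bits -> bytes
print(capacity_gb)  # 1024.0
```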

This is also when we will see the launch of the Kyber rack architecture.
Kyber rack architecture
The key new feature of the Kyber rack architecture is that NVIDIA rotates the compute trays 90 degrees to increase density. For the NVL576 (144 GPU packages) configuration, this is another major step up in scale-up network size.

Let's take a look at the key differences between the Oberon rack and the Kyber rack:

· The compute trays are rotated 90 degrees into vertical, cartridge-like blades, enabling higher rack density.
· Each rack consists of 4 silos, and each silo holds two levels of 18 compute cards (the arithmetic is sketched after this list).
For NVL576, each compute card carries one R300 GPU and one Vera CPU.
Each silo therefore holds 36 R300 GPUs and 36 Vera CPUs.
This brings the NVLink world size to 144 GPU packages (576 GPU chips).
· A PCB backplane replaces the copper cable backplane as the key component carrying the scale-up links between GPUs and NVSwitches.
This change is mainly because routing that many cables is difficult in the smaller footprint.
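Rolling up the numbers from the list above gives the NVL576 counts:

```python
# How the Kyber rack counts roll up, using the figures listed above.
silos_per_rack = 4
levels_per_silo = 2
cards_per_level = 18
chips_per_package = 4        # Rubin Ultra: four reticle-sized compute chips per package

packages = silos_per_rack * levels_per_silo * cards_per_level   # 144 R300 packages
gpu_chips = packages * chips_per_package                        # 576 -> "NVL576"
print(packages, gpu_chips)
```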

There are signs in the supply chain of a Kyber rack variant, the VR300 NVL1,152 (288 GPU packages). Counting the packages shown in the GTC keynote, you arrive at 288 GPU packages (highlighted in red). We believe this may be a future SKU, with rack density and NVLink scale doubling from the NVL576 (144 packages) shown on stage to NVL1,152 (288 packages).
Also worth noting is a brand-new generation of NVSwitch. This is the first time NVSwitch is introduced to the midplane, and it raises the switch's total bandwidth and radix, scaling a single domain to 576 GPU chips (144 packages). The topology, however, may no longer be a fully connected single-level multi-plane structure; it could move to a two-level multi-plane topology with oversubscription, or even a non-Clos topology.
Blackwell Ultra's improved exponential hardware unit
Various attention mechanisms (such as flash-attention, MLA, MQA, and GQA) all require matrix multiplication (GEMM) and the SOFTMAX function (row reduction and element-wise exponential operations).
In GPUs, GEMM operations are mainly performed by tensor cores. Although the performance of tensor cores has continually improved in each generation, the advancements for the multifunction unit (MUFU) responsible for softmax calculations have been modest.
On Hopper in bf16 (bfloat16), computing the softmax for the attention layer takes about 50% as many cycles as the GEMM. Kernel engineers therefore have to 'hide' the softmax latency by overlapping it with the matrix multiplies, which makes writing attention kernels exceptionally difficult.

On Hopper in FP8 (8-bit floating point), the softmax of the attention layer needs as many cycles as the GEMM. Without any overlap, the attention layer's compute time would double: roughly 1536 cycles for the matrix multiplies plus another 1536 cycles for the softmax. This is where overlapping becomes critical to throughput. Because softmax and GEMM need the same number of cycles, engineers must design a perfectly overlapped kernel, which is hard to achieve in practice; per Amdahl's Law, whatever fraction is left unoverlapped directly limits the realized hardware performance.
In the world of Hopper GPUs, this challenge is particularly evident, and the first generation of Blackwell also faces similar issues. NVIDIA addressed this issue with Blackwell Ultra, which redesigned the SM (streaming multiprocessor) and added new instructions, increasing the speed of MUFU computing the softmax portion by 2.5 times. This will reduce reliance on perfect overlapping calculations, giving CUDA developers greater fault tolerance when writing attention kernels.
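To see why the faster MUFU reduces the dependence on perfect overlap, here is a toy model that uses the cycle counts quoted above (1536 GEMM cycles and 1536 softmax cycles for FP8 attention on Hopper) and treats the 2.5x MUFU speedup as simply shrinking the softmax cycle count; the overlap fractions are assumptions for illustration, not measured values.

```python
# Toy model: attention-layer cycles as a function of softmax/GEMM overlap.
# Cycle counts are those quoted in the text; overlap fractions are assumed.
def attention_cycles(gemm_cycles, softmax_cycles, overlap_fraction):
    # The overlapped part of softmax hides behind the GEMM; the rest is
    # exposed and adds directly to the critical path (Amdahl-style).
    exposed = softmax_cycles * (1.0 - overlap_fraction)
    return gemm_cycles + exposed

gemm = 1536
cases = {"Hopper FP8": 1536, "Blackwell Ultra (2.5x faster MUFU)": 1536 / 2.5}
for label, softmax in cases.items():
    for overlap in (0.0, 0.8, 1.0):
        print(f"{label}, overlap {overlap:.0%}: {attention_cycles(gemm, softmax, overlap):.0f} cycles")
```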

This is exactly where NVIDIA's new inference stack and Dynamo technology shine.
Inference Stack and Dynamo
At last year's GTC, NVIDIA discussed how the GB200 NVL72's large scale-up domain improves inference throughput by 15 times over the H200 at FP8.

NVIDIA has not slowed down; it is accelerating inference-throughput improvements on both the hardware and software fronts.
The Blackwell Ultra GB300 NVL72 delivers 50% more dense FP4 performance than the GB200 NVL72 and 50% more HBM capacity, both of which lift inference throughput. The roadmap also includes several network-speed upgrades in the Rubin series that will meaningfully improve inference performance.
The next hardware leap in inference throughput comes from the larger scale-up network in Rubin Ultra, growing from 144 GPU chips (72 packages) in Rubin to 576 GPU chips (144 packages). And that is only part of the hardware story.
On the software side, NVIDIA has launched NVIDIA Dynamo, an open AI engine stack designed to simplify inference deployment and scaling. Dynamo has the potential to disrupt vLLM and SGLang, offering more features and higher performance. Combined with the hardware advances, Dynamo pushes the inference throughput-versus-interactivity curve further to the right, with especially large gains for use cases that demand high interactivity.

Dynamo has introduced several key new features:
· Smart Router: intelligently routes each token across a multi-GPU inference deployment, keeping load balanced in both the prefill and decode phases to avoid bottlenecks.
· GPU Planner: automatically scales prefill and decode nodes, dynamically adding or reallocating GPU resources as demand fluctuates through the day, further balancing load.
· Improved NCCL collectives for inference: new algorithms in the NVIDIA Collective Communications Library (NCCL) cut small-message latency by a factor of four, significantly improving inference throughput.
· NIXL (NVIDIA Inference Transfer Engine): NIXL uses InfiniBand GPUDirect Async (IBGDA) to move both control flow and data flow directly from the GPU to the NIC without going through the CPU, greatly reducing latency.
· NVMe KV-Cache Offload Manager: stores KV caches on NVMe devices instead of discarding them, avoiding recomputation in multi-turn conversations, which speeds up responses and frees prefill capacity.
Smart Router
The Smart Router intelligently routes each token across the prefill and decode GPUs of a multi-GPU inference deployment. During prefill, it ensures that incoming tokens are spread evenly across the prefill GPUs, avoiding bottlenecks from overloading any individual expert.
Likewise, during decode it is crucial to balance sequence lengths and request counts across the decode GPUs. Experts that carry a heavier load can also be replicated by the GPU Planner to keep the load balanced.
In addition, the Smart Router balances load across all model replicas, something many inference engines such as vLLM do not offer.
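A minimal sketch of the load-aware routing idea, under the assumption of a simple queue-depth heuristic. This is illustrative pseudologic, not Dynamo's actual API; the Worker and Request classes here are hypothetical.

```python
# Illustrative sketch of load-aware routing across prefill and decode workers.
# Not Dynamo's API: the classes and the heuristic below are hypothetical.
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    queued_tokens: int = 0           # crude proxy for current load

@dataclass
class Request:
    prompt_tokens: int
    phase: str                       # "prefill" or "decode"

def route(request, prefill, decode):
    # Send the request to the least-loaded worker in the pool for its phase,
    # so no single worker (or the experts it hosts) becomes a hotspot.
    pool = prefill if request.phase == "prefill" else decode
    target = min(pool, key=lambda w: w.queued_tokens)
    target.queued_tokens += request.prompt_tokens
    return target

prefill_pool = [Worker("prefill-0"), Worker("prefill-1")]
decode_pool = [Worker("decode-0"), Worker("decode-1")]
print(route(Request(8192, "prefill"), prefill_pool, decode_pool).name)  # prefill-0
```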

GPU Planner
The GPU Planner is an autoscaler for prefill and decode nodes that brings up additional nodes as demand naturally fluctuates through the day. It can also provide a degree of load balancing across the experts of a mixture-of-experts (MoE) model, in both the prefill and decode phases, activating extra GPUs to add compute for heavily loaded experts. And it can dynamically reallocate resources between prefill and decode nodes as needed, maximizing utilization.
It also supports adjusting the ratio of GPUs used for decode versus prefill, which matters especially for applications like Deep Research that need to prefill a large amount of context but generate relatively little output.
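A toy version of the planning decision described above: split a fixed GPU budget between prefill and decode in proportion to their measured backlogs. The proportional rule and parameters are assumptions for illustration, not Dynamo's actual algorithm.

```python
# Toy autoscaler: divide a GPU budget between prefill and decode according
# to their current token backlogs. Illustrative only, not Dynamo's logic.
def plan_gpus(total_gpus, prefill_backlog_tokens, decode_backlog_tokens, min_per_role=1):
    total_backlog = prefill_backlog_tokens + decode_backlog_tokens
    if total_backlog == 0:
        half = total_gpus // 2
        return half, total_gpus - half
    prefill_gpus = round(total_gpus * prefill_backlog_tokens / total_backlog)
    prefill_gpus = max(min_per_role, min(total_gpus - min_per_role, prefill_gpus))
    return prefill_gpus, total_gpus - prefill_gpus

# A Deep-Research-like workload: huge context to prefill, little to decode.
print(plan_gpus(total_gpus=16, prefill_backlog_tokens=2_000_000, decode_backlog_tokens=100_000))
```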

Improved NCCL Collective Communications
A new set of low-latency communication algorithms added to the NVIDIA Collective Communications Library (NCCL) can reduce the latency of small message transmission by 4 times, significantly enhancing overall inference throughput.
At this year's GTC, Sylvain walked through these improvements in detail, focusing on how the one-shot and two-shot all-reduce algorithms achieve this effect.
Because AMD's RCCL library is essentially a copy of NVIDIA's NCCL, Sylvain's overhaul of NCCL further widens CUDA's moat: AMD must spend significant engineering resources just keeping pace with NVIDIA's major refactor, while NVIDIA uses that time to keep pushing the frontier of collective-communication software and algorithms.
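For readers unfamiliar with the names, here is a conceptual numpy sketch of one-shot versus two-shot all-reduce. These are the textbook forms of the algorithms, shown only for intuition (NCCL's actual implementations are far more sophisticated): one-shot finishes in a single communication step by having every rank pull all peers' buffers and reduce locally, which is why it wins on latency for small messages, while two-shot does a reduce-scatter followed by an all-gather, moving less data per rank at the cost of an extra step.

```python
# Conceptual sketch of one-shot vs. two-shot all-reduce on simulated ranks.
# Textbook forms for intuition only, not NCCL's implementation.
import numpy as np

def one_shot_allreduce(buffers):
    # One communication step: every rank receives every peer's full buffer
    # and reduces locally. More data moved, but minimal latency.
    total = np.sum(buffers, axis=0)
    return [total.copy() for _ in buffers]

def two_shot_allreduce(buffers):
    # Step 1 (reduce-scatter): each rank reduces one shard of the buffer.
    # Step 2 (all-gather): ranks exchange their reduced shards.
    n = len(buffers)
    shards = [sum(b.reshape(n, -1)[i] for b in buffers) for i in range(n)]
    full = np.concatenate(shards)
    return [full.copy() for _ in buffers]

ranks = [np.arange(8, dtype=np.float32) + r for r in range(4)]   # 4 ranks, 8 elements each
one = one_shot_allreduce(ranks)
two = two_shot_allreduce(ranks)
assert all(np.allclose(a, b) for a, b in zip(one, two))
print(one[0])
```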

NIXL: NVIDIA Inference Transfer Engine
Moving data between prefill nodes and decode nodes requires a low-latency, high-bandwidth transfer library. NIXL uses InfiniBand GPUDirect Async (IBGDA).
Today in NCCL, the control path runs through a CPU proxy thread, while the data path goes directly to the NIC without passing through CPU buffers. With IBGDA, both the control path and the data path go directly from the GPU to the NIC with no CPU involvement, significantly reducing latency.
In addition, NIXL can abstract the complexity of transferring data between CXL, local NVMe, remote NVMe, CPU memory, remote GPU memory, and GPUs, simplifying the data movement process.

NVMe KVCache Offload Manager
The KV cache offload manager improves overall prefill efficiency by storing the KV cache from a user's earlier conversation turns on NVMe devices instead of simply discarding it.

When a user holds a multi-turn conversation with a large language model (LLM), the model must take the earlier questions and answers as input tokens. Traditionally, inference systems discard the KV cache that was built while generating those earlier turns, so the same computation has to be repeated from scratch.
With NVMe KV cache offload, when the user steps away, the KV cache is offloaded to NVMe storage; when the user asks another question, the system quickly pulls the cache back from NVMe, avoiding the recomputation.
This frees up prefill node capacity to absorb more incoming traffic, and it also improves the user experience by significantly shortening the time from starting a conversation to receiving the first token.
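A minimal sketch of the offload-and-reuse flow described above, assuming a hypothetical store keyed by the conversation prefix. This is illustrative pseudologic; Dynamo's actual offload manager is more sophisticated, and the dict below merely stands in for an NVMe-backed store.

```python
# Illustrative sketch of NVMe KV-cache offload for multi-turn chat.
# The dict below stands in for a hypothetical NVMe-backed store.
import hashlib

nvme_store = {}   # prefix hash -> serialized KV cache

def prefix_key(conversation_tokens):
    return hashlib.sha256(str(conversation_tokens).encode()).hexdigest()

def get_or_build_kv_cache(conversation_tokens, run_prefill):
    key = prefix_key(conversation_tokens)
    if key in nvme_store:                         # hit: skip prefill recomputation
        return nvme_store[key], True
    kv = run_prefill(conversation_tokens)         # miss: run prefill once
    nvme_store[key] = kv                          # offload for later turns
    return kv, False

fake_prefill = lambda toks: f"kv-cache({len(toks)} tokens)"
_, hit1 = get_or_build_kv_cache([1, 2, 3], fake_prefill)   # first turn: miss
_, hit2 = get_or_build_kv_cache([1, 2, 3], fake_prefill)   # follow-up: hit
print(hit1, hit2)  # False True
```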

According to DeepSeek's GitHub write-up on day 6 of its Open Source Week, its disk KV cache hit rate is 56.3%, suggesting that multi-turn dialogue workloads typically see KV cache hit rates of 50%-60%, which meaningfully improves the efficiency of prefill deployments. Although recomputation can be cheaper than loading from disk for short conversations, the overall cost savings from an NVMe-based approach are substantial.
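As a rough back-of-envelope on when reloading beats recomputing, take some assumed round numbers (none of these are from the article): ~140 GFLOPs of prefill compute per token for a 70B-class model, 500 TFLOP/s sustained on one GPU, ~320KB of KV cache per token, 6GB/s of NVMe read bandwidth, and 2ms of fixed lookup overhead. Under these particular assumptions reloading wins at every size shown; with many GPUs parallelizing the prefill or slower storage, the balance tips back toward recompute for short prefixes, which is the caveat noted above.

```python
# Back-of-envelope: recompute prefill vs. reload KV cache from NVMe.
# Every constant below is an assumed round number for illustration only.
PREFILL_FLOPS_PER_TOKEN = 140e9    # ~2 * parameters for a 70B-class model
SUSTAINED_FLOP_S = 500e12          # assumed sustained throughput of one GPU
KV_BYTES_PER_TOKEN = 320e3         # assumed KV-cache footprint per token
NVME_READ_B_S = 6e9                # assumed NVMe read bandwidth (bytes/s)
NVME_FIXED_LATENCY_S = 0.002       # assumed fixed lookup/seek overhead

def recompute_s(tokens):
    return tokens * PREFILL_FLOPS_PER_TOKEN / SUSTAINED_FLOP_S

def reload_s(tokens):
    return NVME_FIXED_LATENCY_S + tokens * KV_BYTES_PER_TOKEN / NVME_READ_B_S

for cached_tokens in (500, 5_000, 50_000):
    print(cached_tokens, f"recompute {recompute_s(cached_tokens):.3f}s",
          f"reload {reload_s(cached_tokens):.3f}s")
```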
Readers who followed DeepSeek's Open Source Week will recognize the techniques above; they are a good shortcut to understanding what NVIDIA has built with Dynamo, and NVIDIA will be releasing more documentation on Dynamo.
Taken together, these new features deliver a significant acceleration in inference performance. NVIDIA has even discussed the further gains from deploying Dynamo on existing H100 nodes. In essence, Dynamo brings DeepSeek's innovations to the whole community, not just to teams with elite inference-deployment engineering; every user can stand up an efficient inference system.
Finally, because Dynamo broadly handles disaggregated inference and expert parallelism, it is especially beneficial for single-replica, higher-interactivity deployments. Of course, a large number of nodes is a prerequisite for fully exploiting Dynamo's capabilities and realizing the biggest gains.

The total cost of AI ownership has decreased.
After walking through Blackwell, Jensen Huang joked that these innovations make him the "chief revenue destroyer." He noted that Blackwell improves performance by 68 times over Hopper, driving an 87% reduction in cost, and that Rubin is expected to deliver a 900-fold performance gain over Hopper with a 99.97% drop in cost.
Clearly, NVIDIA is relentlessly driving technological advancement—just as Jensen Huang said: "When Blackwell starts shipping at scale, you won't even be able to give Hopper away for free."

We emphasized the importance of deploying compute early in the product cycle back in October last year in the "AI Neocloud Action Guide," and this is what has driven the accelerating decline in H100 rental prices since mid-2024. We have consistently urged the ecosystem to prioritize deployment of next-generation systems such as the B200 and GB200 NVL72 rather than continuing to procure H100s or H200s.
Our AI cloud Total Cost of Ownership (TCO) model has already shown customers the leap in productivity of each generation of chips, and how this leap drives changes in AI Neocloud rental prices, which in turn affects the net present value for chip owners. So far, our H100 rental price forecasting model released in early 2024 has achieved a 98% accuracy rate.

CPO (Co-Packaged Optics) Technology

In the keynote, NVIDIA announced its first Co-Packaged Optics (CPO) solution, to be deployed in its scale-out switches. With CPO, pluggable transceivers are replaced by external laser sources (ELS) working together with optical engines (OE) placed directly next to the switch silicon. Fibers now plug straight into ports on the switch and are routed to the optical engines, with no traditional transceiver ports in the path.

CPO's main advantage is a large reduction in power consumption. Because digital signal processors (DSPs) are no longer needed on the switch and lower-power laser sources can be used, substantial power is saved. Linear pluggable optics (LPO) can achieve a similar effect, but CPO also enables a higher switch radix, which flattens the network: with CPO, an entire cluster can run on a two-tier network instead of the traditional three tiers. That cuts cost as well as power, and the savings from removing a network tier are nearly as significant as the transceiver savings themselves.
Our analysis shows that for a 400k-GPU GB200 NVL72 deployment, moving from a three-tier network built on DSP transceivers to a two-tier network built on CPO can save up to 12% of total cluster power, cutting transceiver power from about 10% of compute power to roughly 1%.
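A heavily simplified toy model, with assumed round numbers rather than SemiAnalysis's actual TCO model, shows where savings of this magnitude come from; it lands around 9-10%, the same ballpark as the up-to-12% figure from the full analysis, which accounts for effects not modeled here.

```python
# Toy model of cluster power with DSP transceivers vs. CPO.
# All values are assumed, normalized so the DSP baseline totals 100 units.
compute = 88.0          # assumed compute / other IT load
optics_dsp = 9.0        # pluggable DSP optics, ~10% of compute (per the text)
switches_3tier = 3.0    # assumed switch power across three tiers

baseline = compute + optics_dsp + switches_3tier        # = 100.0

optics_cpo = 0.9        # CPO drops optics to ~1% of compute (per the text)
switches_2tier = 2.0    # one fewer switch tier (assumption)
with_cpo = compute + optics_cpo + switches_2tier

saving_pct = 100 * (baseline - with_cpo) / baseline
print(f"{saving_pct:.1f}% of total cluster power saved")   # ~9%
```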

NVIDIA launched several CPO-based switches today, including a CPO version of the Quantum X-800 3400 switch that debuted at GTC 2024, with 144 ports of 800G for 115T of total throughput, 144 MPO ports, and 18 ELS; it is expected to ship in the second half of 2025. Another, a Spectrum-X switch, offers 512 ports of 800G, also aimed at high-speed, flat network topologies; this Ethernet CPO switch is planned for the second half of 2026.

Groundbreaking as today's announcements are, we believe NVIDIA is only warming up on CPO. Over the long run, CPO's biggest contribution to scale-up networking is a dramatic increase in the radix and aggregate bandwidth available to GPUs, enabling faster and flatter topologies and unlocking scale-up domains far beyond 576 GPUs. A more detailed article exploring NVIDIA's CPO solutions will follow soon.
NVIDIA still reigns supreme, targeting your computing costs.
Today, The Information published an article reporting that $Amazon (AMZN.US)$ prices its Trainium chips at only 25% of the H100's price. Meanwhile, Jensen Huang declared, "When Blackwell starts shipping at scale, you won't even be able to give away the H100 for free," and we believe that statement carries real weight. Technology keeps driving down the total cost of ownership, and outside of the TPU we see copies of NVIDIA's roadmap everywhere. Jensen Huang keeps pushing for breakthrough after breakthrough: new architectures, rack designs, algorithmic improvements, and CPO all set NVIDIA apart from its competitors. NVIDIA leads in almost every area, and by the time competitors catch up in one, it has broken through in another. With NVIDIA holding to its annual cadence, we expect this momentum to continue. Some argue ASICs are the future of computing, but as in the CPU era, platform advantages are hard to overcome. NVIDIA is rebuilding that platform around GPUs, and we expect it to stay at the forefront.
As Jensen Huang said: good luck keeping up with this "chief revenue destroyer."
Editor/Rocky