NVLink, Nvidia's other trump card

Yuanchuan Research Institute · Dec 18, 2023 23:11

Source: Yuanchuan Technology Review
Author: He Lüheng

The US Department of Commerce's rules keep getting tighter, forcing the famous "Huang-style knife technique" back onto the market: multiple sources have confirmed that Nvidia is about to launch three special-edition GPUs for China. Because of the export restrictions, even the H20, the strongest of the three, has roughly 80% less computing power than the H100.

With computing power capped, Nvidia can only find room elsewhere. The H20's biggest highlight is its bandwidth:

Its interconnect bandwidth reaches 900 GB/s, on par with the H100 and the highest of any Nvidia product, a significant jump from the 600 GB/s of the A100 and the 400 GB/s of the other two China-specific chips, the A800 and H800.

Cutting computing power while boosting bandwidth may look like fleecing customers, but there is real substance to it.

The H20 treads right on the export-control red line, keeping it clear of sanctions

Simply put, bandwidth determines how much data can be fed to a GPU per unit of time. Given artificial intelligence's extreme appetite for data throughput, bandwidth has become the most important metric, alongside computing power, for judging a GPU.
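
To make the throughput point concrete, here is a rough back-of-the-envelope sketch in Python. The 140 GB payload (roughly the FP16 weights of a 70B-parameter model) is an illustrative assumption of mine; the 900/600/400 GB/s figures are the interconnect bandwidths quoted above.

```python
# Rough illustration: time to move a fixed amount of data at different
# interconnect bandwidths. The 140 GB payload is a hypothetical example
# (roughly the FP16 weights of a 70B-parameter model), not a figure from
# the article.
payload_gb = 140

for name, bandwidth_gb_per_s in [
    ("H100 / H20 NVLink", 900),
    ("A100 NVLink", 600),
    ("A800 / H800 NVLink", 400),
]:
    seconds = payload_gb / bandwidth_gb_per_s
    print(f"{name:>18}: {seconds:.2f} s to move {payload_gb} GB")
```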

On top of that, cloud service providers and large-model developers do not buy just a few chips; they purchase hundreds or thousands at a time to form clusters, so the efficiency of data transmission between chips has become a pressing issue.

The problem of GPU-to-GPU data transmission has brought Nvidia's other trump card, beyond raw computing power and the CUDA ecosystem, to the surface: NVLink.

Data transmission, the shackle on computing power

To understand the importance of NVLink, you first need to understand how a data center works.

When we play games, one CPU plus one GPU is usually enough. Training a large model, however, requires a "cluster" of hundreds or thousands of GPUs.

Inflection once claimed that the AI cluster it was building would include as many as 22,000 Nvidia H100s. According to Musk, training GPT-5 may require 30,000 to 50,000 H100s. Although Altman denied this, it still gives a sense of how many GPUs large models consume.

Tesla's own Dojo ExaPOD supercomputer consists of multiple cabinets, each cabinet holds multiple training tiles, and each training tile contains 25 D1 chips. A complete ExaPOD contains 3,000 D1 chips.

However, this kind of computing cluster runs into a serious problem: the chips are independent of one another, so how is data transferred between them?

Tesla's ExaPOD supercomputer

When a computing cluster executes a task, a simple way to think about it is that the CPU issues the commands and the GPU does the computing. The process roughly looks like this:

The GPU first receives data from the CPU; the CPU issues a command and the GPU performs the computation; when the GPU is done, the result is sent back to the CPU. This cycle repeats until the CPU has collected all the results.
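
A minimal sketch of that cycle, written with PyTorch purely for illustration (it assumes a machine with at least one CUDA GPU; the matrix sizes are arbitrary):

```python
# Minimal sketch of the CPU -> GPU -> CPU round trip described above,
# using PyTorch as the host/device API. The matrix sizes are arbitrary.
import torch

a = torch.randn(4096, 4096)   # data starts in CPU (host) memory
b = torch.randn(4096, 4096)

a_gpu = a.to("cuda:0")        # 1. the GPU gets the data from the CPU
b_gpu = b.to("cuda:0")        #    (a transfer over the host interconnect)

c_gpu = a_gpu @ b_gpu         # 2. the GPU performs the computation

c = c_gpu.cpu()               # 3. the result is copied back to the CPU
print(c.shape)
```

Every `.to(...)` and `.cpu()` call above is exactly the kind of transfer whose speed is bounded by the interconnect.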

Because data travels back and forth, transmission efficiency is critical. With multiple GPUs, tasks also have to be distributed across them, which involves yet more data transmission.

So if a company buys 100 H100s, the computing power it gets is not simply the sum of 100 chips; the losses caused by data transmission have to be factored in.

PCIe has long been the mainstream solution for data transmission. In 2001, Intel proposed replacing the older bus protocols with PCIe and joined forces with more than 20 industry players to draft the specification; Nvidia was one of the beneficiaries. Today, however, PCIe's shortcomings are becoming more and more obvious.

First, data transmission efficiency has been left far behind by improvements in computing power.

From 2001 to 2017, the computing power of computing devices grew roughly 5,000-fold. Over the same period, PCIe only iterated to 4.0, with per-lane bandwidth rising from 250 MB/s to about 2 GB/s, a mere 8-fold increase.
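
Those per-lane figures can be reproduced from each generation's raw transfer rate and line encoding, a rough check that ignores protocol overhead; the rates and encodings below are standard PCIe specifications, not numbers taken from the article.

```python
# Back-of-the-envelope per-lane PCIe bandwidth:
# bandwidth ≈ transfer rate (GT/s) x line-encoding efficiency / 8 bits per byte.
generations = {
    "PCIe 1.0": (2.5, 8 / 10),     # 8b/10b encoding
    "PCIe 2.0": (5.0, 8 / 10),
    "PCIe 3.0": (8.0, 128 / 130),  # 128b/130b encoding
    "PCIe 4.0": (16.0, 128 / 130),
}

for gen, (gt_per_s, efficiency) in generations.items():
    gb_per_s = gt_per_s * efficiency / 8
    print(f"{gen}: ~{gb_per_s:.2f} GB/s per lane")   # 0.25 ... ~1.97 GB/s
```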

The widening gap between computing power and transmission drags efficiency down dramatically. It is like laying out a full banquet and handing everyone an ear-pick for cutlery: the food is there, but no one can enjoy it.

Second, artificial intelligence has revealed PCIe's design flaws.

In PCIe's design, data transmission between GPUs has to go through the CPU. In other words, if GPU1 wants to exchange data with GPU2, the CPU has to broker the exchange.

This was not a problem before, but artificial intelligence believes in brute-force scale, and the number of GPUs in a computing cluster is expanding rapidly. If every GPU had to communicate through the CPU, efficiency would plummet. As the familiar saying goes, "if you alone waste one minute, the whole class wastes an hour."

Dramatically increasing PCIe bandwidth does not fit Intel's habit of squeezing out incremental upgrades. Dramatically increasing CPU processing power would be another solution, but if Intel could do that, Nvidia and AMD would not be where they are today.

So Nvidia, unwilling to keep waiting, started plotting a different path.

In 2010, Nvidia introduced GPUDirect Shared Memory, which speeds up the GPU1-CPU-GPU2 path by eliminating one copy step.

The following year, Nvidia launched GPUDirect P2P, which removes the detour through CPU memory entirely and speeds up transfers further.
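
A hedged sketch of the difference, again using PyTorch rather than the GPUDirect API itself. It assumes a machine with at least two CUDA devices; whether the direct copy really bypasses host memory depends on the platform's peer-to-peer support.

```python
# Sketch: GPU-to-GPU copy with and without an explicit trip through the CPU.
# Assumes at least two CUDA devices; P2P availability depends on the system.
import torch

x = torch.randn(8192, 8192, device="cuda:0")

# Path 1: explicitly stage the data in CPU (host) memory,
# i.e. the old GPU0 -> CPU -> GPU1 route.
y_staged = x.cpu().to("cuda:1")

# Path 2: request a direct device-to-device copy. When peer access is
# available (e.g. over NVLink or PCIe P2P), the driver can route this
# GPU0 -> GPU1 without touching host memory.
if torch.cuda.can_device_access_peer(0, 1):
    y_direct = x.to("cuda:1")
```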

But these incremental improvements were still built on top of PCIe.

Like CUDA, PCIe's strength lies in its ecosystem, and the core of any "ecosystem" is that everyone else is already using it. Since most devices use a PCIe interface, even if Nvidia wanted to set up its own table, customers would have to weigh the compatibility cost.

The turning point came in 2016, when AlphaGo defeated Lee Sedol 4:1. Overnight, the GPU went from a gaming card that "poisoned teenagers" to the crown jewel of artificial intelligence, and Nvidia could finally make its move out in the open.

NVLink, breaking PCIe's seal

In September 2016, IBM released a new version of its Power8 servers, equipped with Nvidia GPUs:

Two Power8 CPUs are paired with four Nvidia P100 GPUs, and the data links between them were switched from PCIe to Nvidia's own NVLink. Bandwidth reached 80 GB/s, communication speed increased roughly 5-fold, and performance improved by 14%.

The Power8 + P100 architecture
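
The "5-fold" figure can be sanity-checked against the contemporary alternative. The 80 GB/s number is quoted above; treating a PCIe 3.0 x16 link as roughly 16 GB/s per direction is my assumed baseline.

```python
# Sanity check of the "5x" claim: NVLink 1.0 on the P100 versus a PCIe 3.0
# x16 link, which carries roughly 16 GB/s in each direction (assumed baseline).
nvlink_p100_gb_per_s = 80
pcie3_x16_gb_per_s = 16

print(f"speedup ≈ {nvlink_p100_gb_per_s / pcie3_x16_gb_per_s:.0f}x")  # -> 5x
```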

At the same time, NVLink enables direct GPU-to-GPU transfers, so the GPUs can talk to each other without going through PCIe at all.

In 2017, a model trained on the Power8+P100 setup reached a recognition accuracy of 33.8% on the 22K-class ImageNet dataset. The accuracy was only about 4 percentage points higher than the previous year's result, but the training time dropped dramatically, from 10 days to 7 hours.

With the trial run a success, Jensen Huang stopped holding back.

Starting with the Volta architecture in 2017, Nvidia has equipped each GPU generation with NVSwitch chips built on NVLink to handle data transmission between GPUs.

The relationship between NVLink and NVSwitch can be summed up simply: NVLink is the technology, while the NVSwitch chip and the NVLink switch are both hardware that carries it.

In the latest DGX H100 server, eight H100 GPUs are interconnected through four NVSwitch chips.

An annotated NVSwitch die

Alongside the DGX H100 server, Nvidia also released an NVLink switch containing two NVSwitch chips to handle data transmission between DGX H100 servers.

In other words, NVLink not only connects the eight GPUs inside a DGX server, it also carries the GPU-to-GPU traffic between servers.

According to Nvidia's design, an H100 SuperPod uses 32 servers with a total of 256 H100 GPUs and delivers up to 1 exaFLOPS of computing power. Each SuperPod is equipped with 18 NVLink switches, on top of the 128 NVSwitch chips inside the 32 servers.
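
The SuperPod numbers line up if you multiply things out; the roughly 4 PFLOPS of FP8 throughput per H100 used below is my own approximation, not a figure from the article.

```python
# Multiplying out the DGX H100 SuperPod figures quoted above.
servers = 32
gpus_per_server = 8
nvswitch_chips_per_server = 4
fp8_pflops_per_h100 = 4          # ~4 PFLOPS FP8 per H100 (assumed approximation)

total_gpus = servers * gpus_per_server                    # 256 GPUs
total_nvswitch = servers * nvswitch_chips_per_server      # 128 NVSwitch chips
total_eflops = total_gpus * fp8_pflops_per_h100 / 1000    # ~1 exaFLOPS

print(total_gpus, total_nvswitch, f"{total_eflops:.1f} EFLOPS")
```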

As mentioned above, the computing power of a cluster is not a simple addition of the computing power of each GPU; the efficiency of data transmission between servers is the main limiting factor. As clusters grow in size, NVLink becomes more and more important.

As NVLink gains traction, Jensen Huang's ambition is also taking shape: unlike the PCIe consortium, NVLink can only be used with Nvidia chips. Of course, given how entrenched the PCIe ecosystem is, the H100 line still includes PCIe versions.

To expand its sphere of influence further, Nvidia also launched the ARM-based Grace server CPU, bundling Nvidia CPUs, Nvidia GPUs and Nvidia interconnects together in a bid to take over the data center market.

With that groundwork laid, the H20's potency is easier to understand.

Although its computing power has been cut so deeply that it cannot handle training models with very large parameter counts, the H20's high bandwidth and NVLink support let it form larger clusters, and for training and inference of smaller models it can be more cost-effective.

Following Nvidia's lead, the AI chip arms race has also shifted from raw computing power to interconnect technology.

Interconnects, the second half of the AI chip race

In December 2023, AMD released the long-anticipated MI300 series, taking direct aim at the Nvidia H100.

At the launch event, besides the routine comparisons of paper computing power, Lisa Su emphasized that the MI300X is far ahead in bandwidth: 5.2 TB/s, about 1.6 times that of the H100.

That is true, but the claim needs some water squeezed out of it first.

The card Lisa Su used for the comparison was the H100 SXM; the higher-spec H100 NVL, which pairs two GPUs over NVLink, reaches 7.8 TB/s of bandwidth, still higher than the MI300X.

Even so, it shows how much weight AMD places on bandwidth, and it points to the new front in AI chip competition: interconnect technology.

A few months after Nvidia unveiled NVLink, AMD launched Infinity Fabric, a high-speed interconnect offering up to 512 GB/s of bandwidth between CPUs, later extended to CPU-GPU and GPU-GPU links.

Watching its two biggest rivals shake off the bandwidth shackles, Intel, the standard-bearer of PCIe, naturally has mixed feelings.

In 2019, Intel teamed up with Dell, HP and others to launch a new interconnect standard, CXL. Like NVLink and Infinity Fabric, it aims to break the bandwidth bottleneck; the 2.0 version of the standard supports transfer rates of up to 32 GT/s.

Intel's angle is that, because CXL is built as an extension of PCIe, it stays compatible with the PCIe interface. Devices that used PCIe in the past can switch to CXL "painlessly", and the law of ecosystems works in Intel's favor once again.

The chip giants are going all-in on interconnect technology, and the AI giants that design their own chips are also tackling the connectivity problem.

Google adopted self-developed optical circuit switching (OCS) for its TPUs and even built its own optical switch chip, Palomar, just to speed up communication among the thousands of TPUs in its data centers. Tesla likewise developed its own communication protocol to handle data transmission inside Dojo.

Coming back to the beginning of this article: it is precisely this gap that has turned NVLink into a new "knife technique" for Nvidia.

The computing power that large models require is not out of reach for domestically produced AI chips, but weak data transmission technology still creates cost problems that cannot be ignored.

Here is a rough, illustrative example:

Suppose an H20 and a domestic AI chip both cost 10,000 yuan and each delivers 1 unit of computing power on paper. Because of cluster-scale losses, the H20 loses 20% (thanks to NVLink) while the domestic chip loses 50%. A data center that needs 100 units of computing power then requires 125 H20s or 200 domestic chips.

In cost terms, that is the difference between 1.25 million and 2 million yuan.
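
Spelling that arithmetic out, using only the article's illustrative numbers:

```python
# The illustrative cost comparison above, spelled out.
import math

target_compute = 100        # units of computing power the data center needs
unit_price_yuan = 10_000    # assumed identical price per chip

chips = {
    # name: (computing power per chip, cluster-scaling loss)
    "H20 (with NVLink)": (1, 0.20),
    "domestic AI chip":  (1, 0.50),
}

for name, (per_chip, loss) in chips.items():
    effective = per_chip * (1 - loss)
    count = math.ceil(target_compute / effective)
    cost = count * unit_price_yuan
    print(f"{name}: {count} chips, {cost:,} yuan")
# -> 125 chips / 1,250,000 yuan  vs  200 chips / 2,000,000 yuan
```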

The larger the model, the more chips a data center needs, and the wider the cost gap. If Jensen Huang wanted to be ruthless, he could sharpen the knife and cut prices further. If you were the procurement director of a domestic AIGC company, how would you choose?

Weakness in interconnect technology has handed Nvidia yet another trump card.

According to current information, the H20, originally slated for release in November, has been pushed to the first quarter of next year, with order-taking and shipping delayed accordingly. The reason for the delay is unclear, but until the H20 officially goes on sale, the window of opportunity left for domestic chips is already counting down.

Nvidia's greatness lies in its foresight: it almost single-handedly paved the highway for artificial intelligence.

And its success lies in the fact that Jensen Huang built toll booths, ahead of time, on every lane you might ever drive down.

Editor/jayden
