Source: Tencent Technology
Author: Zhang Shujia Morris
On October 17, the United States updated its export-control standards, requiring an export license for advanced chips whose performance exceeds certain thresholds. Under the tightened restrictions, Nvidia's (NVDA) special-edition H800 and A800 chips for the Chinese market also face a sales ban. The following are the performance standards the US Department of Commerce has set for advanced chips:
●Total processing performance (TPP) ≥ 4800; or
●TPP ≥ 1600 and performance density ≥ 5.92;
●2400 ≤ TPP < 4800 and 1.6 ≤ performance density < 5.92; or
●TPP ≥ 1600 and 3.2 ≤ performance density < 5.92.
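To make these thresholds concrete, here is a minimal sketch (Python, for illustration only) of checking a chip against them. It assumes the BIS definitions, i.e. total processing performance (TPP) is roughly peak dense TOPS multiplied by the operation's bit length, and performance density is TPP divided by die area in mm²; the H20 figures plugged in (148 dense FP16 TFLOPS, 814 mm² die) come from later in this article and are industry estimates, not official numbers.

```python
# Minimal check of a chip's specs against the thresholds above.
# Assumes: TPP = 2 x MacTOPS x bit length, which equals peak dense TOPS x bit length
# when vendor TOPS figures already count each MAC as two operations;
# performance density = TPP / die area (mm^2).

def tpp(peak_dense_tops: float, bit_length: int) -> float:
    return peak_dense_tops * bit_length

def classify(tpp_value: float, die_area_mm2: float) -> str:
    density = tpp_value / die_area_mm2
    if tpp_value >= 4800 or (tpp_value >= 1600 and density >= 5.92):
        return "top-tier control"
    if (2400 <= tpp_value < 4800 and 1.6 <= density < 5.92) or (
        tpp_value >= 1600 and 3.2 <= density < 5.92
    ):
        return "second-tier control"
    return "below the thresholds"

# Assumed H20 figures from later in this article: 148 dense FP16 TFLOPS, 814 mm^2 die.
h20 = tpp(148, 16)                     # ~2368
print(round(h20), classify(h20, 814))  # density ~2.91 -> "below the thresholds"
```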
Faced with the new regulations, Nvidia has two options: first, work with the US Department of Commerce to apply for licenses that would "open up" specific Chinese customers; second, once again design a new special edition tailored to the new rules.
Nvidia Chief Financial Officer Colette Kress confirmed this on the company's just-concluded fiscal third-quarter earnings call. Kress said Nvidia is working with some customers in the Middle East and China to obtain US government licenses to sell them high-performance products. In addition, Nvidia is trying to develop new data-center products that comply with government policy and do not require a license.
01. How is the H800 "castrated" into the H20?
The new special editions Nvidia is trying to develop are the H20, L20, and related products that have been widely discussed in the industry. The latest reports indicate that their launch has been pushed back to the first quarter of 2024.
The problem is that the development, design, and production of new special chips such as the H20 falls completely outside the normal cadence of chip development. How did Nvidia put together this special-supply solution in such a short time?
The answer is one of the key questions this article wants to discuss: physically disconnecting circuits late in the production flow, or, in the industry's more colloquial term, "castration."
Judging from normal design and production cycles and product release cadence, the China-specific H20/L20 chips announced at this point in time are unlikely to be the product of new photomasks and a fresh tape-out. A more reasonable inference is that they are new SKUs created by reworking and repackaging existing dies through physical disconnection in the back end of the manufacturing flow.
Physical disconnection ("point-break") is a modification technique used in the back end of line (BEOL) of semiconductor manufacturing. Certain transistor/wiring repair steps can be carried out without redoing the masks, including laser fusing on the die surface, cuts at the CoWoS interposer level, and even manual circuit trimming under a tunneling microscope.
Picture a scenario like this: in the clean rooms of Fab 18A in the Southern Taiwan Science Park, Fab 15B in Taichung, and Advanced Packaging Fab 5, Nvidia's H800 foundry TSMC (TSM) holds several batches of bare dies produced to reduced specifications that have not yet been diced or had their metal leads and electrodes attached. Rather than being packaged as H800 and L40S, they are put through late-stage disconnection processing and packaged as H20 and L20.
02. Surface laser fusing is a traditional craft of semiconductor manufacturing
According to industry practice, both the cache size and the underlying physical interconnect (PHY channels) of a digital logic chip can be disabled through repair/disconnection steps in post-processing. In particular, reworking low-binned bare dies this way has been a traditional craft for decades. For example, one of the key differences between early Pentium and Celeron processors was cache that had been fused off.
For small regions, this used to be done by hand (essentially micro-engraving); for slightly larger areas, the layout can be designed in advance with reserved fuse points, and a machine then completes the disconnection.
In practice, most fabs are equipped with dedicated tools that laser-cut lines/trenches directly on the die, while Intel's Fab 42 in Chandler, Arizona is also said to have equipment for hand-trimming transistors directly under a special tunneling microscope, claimed to work at atomic scale and unlike an ordinary scanning tunneling microscope. Intel referenced this equipment in a promotional video a few years ago; rumor has it there are no more than 14 licensed operators worldwide.
In fact, back in the planar-transistor era, microscopic hand trimming was not a difficult operation. After the move to FinFET, with its vertical 3D gate structure, the cost of hand-trimming equipment and operators became prohibitive.
Coming back to the H20/L20 specifically: how were these two special products derived from the H800 and L40S by downgrading their specs? Take a look at the relevant parameters first:
H20: same family as the H100/H800, Hopper architecture (HBM3, 2.5D CoWoS packaging, NVLink)
L20: same family as the L40S, Ada Lovelace architecture (GDDR6, standard 2D packaging, PCIe Gen4)
*Note: Firmware is modified accordingly;
Looking back at the key difference in the underlying physical interconnect (SerDes PHY) between the architecturally identical H100 and H800, downgrading an H100 into an H800 can be achieved with localized physical fusing. The H20, although it shares the same die design as the other two, is speculated to have a much larger disabled ("dark silicon") area, so it is uncertain whether ordinary fusing is worth it; a rearranged layout may be needed.
Beyond the SerDes PHY differences, there are also differences in the die area devoted to double-precision floating point (FP64) units and to tensor cores (used for matrix and convolution workloads). This part is inconclusive, but it plausibly works much like physical redundancy design plus masking: today's design methodologies emphasize modularity, a single tape-out yields both "70-point" and "90-point" binned dies, and a GPU contains far more than one FP64/tensor unit, so disabling some of them through localized physical fusing is entirely reasonable.
03. Design redundancy creates the conditions for fusing, and is a staple skill of the big chipmakers
For example: (a) the Intel F-series CPUs still on the market today are "70-point" dies with the integrated graphics fused off; (b) the first two generations of Apple Silicon officially advertised an 8-core NPU, but there are actually 9 cores on the die, a deliberate design redundancy.
These are considered routine operations in wafer manufacturing. In particular, during the transition from pilot plants/lines through alpha/beta tape-outs, minor mistakes are fixed directly by hand rather than by re-spinning the masks.
From the chip designer's perspective, design redundancy has always been part of chip development: the preceding lithography steps are judged on yield, specifically on how many transistors fail, and the test flow determines yield at the module level. Defective spots can simply be cut out of the circuit, and the subsequent bonding and capping steps do not change.
For example, about three years ago Intel brought F-series CPUs without integrated graphics to market. They were physically downgraded ("castrated") parts: the integrated graphics were cut off and the chips were repackaged and sold. Yet these chips occasionally drew abnormally high power. After user complaints, reproducing the issue showed that the unexplained power draw came from the integrated graphics, which had supposedly been disabled through physical disconnection but was not under control once energized.
This case illustrates the point above: the same production line can keep selling chips after they have been fused off, with the downstream wiring/pin and packaging steps unchanged. Intel's 10nm yields were very low early on, and it was precisely the backlog of low-binned dies with failed integrated graphics that were given the F suffix and kept on sale.
There is probably plenty of room for this kind of "redundancy" today. After all, the H100 is already a huge 814 mm² chip, close to the reticle limit (26 mm × 33 mm = 858 mm²). Yet the newly released, downgraded H20 delivers only about 15% of the H100's performance while its bill of materials is nearly the same.
04. Disconnection at the package level offers better operability and economics
Besides laser fusing on the surface of the logic die, there is also a need to disconnect certain special locations, such as cuts in the CoWoS interposer.
CoWoS, TSMC's 2.5D packaging solution, allows multiple chips to be packaged together: the compute die and memory devices are interconnected through a silicon interposer, yielding a small package, low power consumption, and a low pin count.
Compared with laser cutting on the die surface, differentiating products in the front half of CoWoS (the CoW portion, i.e. the through-silicon vias and the interposer) is more economical and makes yield easier to guarantee. Because the compute logic die and the I/O die are separate, the underlying physical interconnect channels can be blocked and HBM3 memory performance reduced there, and such modifications are easier to make in the silicon interposer. Compared with doing everything on the logic die, the cost is lower, because the required line-width precision on the interposer is looser, and even lines at top-metal-layer widths can be severed.
However, interposer-level cuts in CoWoS can only block the physical interconnect and HBM memory; they cannot disable compute logic such as the FP64 units and Tensor Cores. That still requires the surface-level fusing on the logic die described above.
Also, under normal circumstances a physically fused circuit cannot be detected by an outside third party, and the process is irreversible. In particular, modern chips have roughly ten metal layers, so a modification on the die surface is hidden beneath the top metal and invisible unless a reverse-engineering see-through scan is used.
In summary, for further-downgraded special-supply models such as the H20/L20, it can be inferred that they are H800 and L40S bare dies modified through late-stage physical disconnection, then repackaged and given modified firmware to become new SKUs.
Recall that Nvidia still had a roughly US$5 billion backlog of GPU orders sold to China that had not yet been delivered. If those parts were sent back to the fab for post-process modification, that would explain how new SKUs could be released so quickly, and it suggests the US$5 billion in orders from Chinese manufacturers may be converted into these three models.
05. What the H20 can and cannot do after being "castrated"
[Chart: key AI chip parameters against the export-control conditions; "APPLIES" marks regulated items, "DOESN'T APPLY" marks unregulated ones]
The following is a side-by-side comparison of the H20 against the H100/H800/A100, covering product specifications, single-card and cluster compute efficiency, material costs, and pricing:
In terms of overall compute, the H100/H800 are currently the top choices for AI data-center (AIDC) compute clusters. The H100's theoretical scaling limit is a 50,000-card cluster delivering up to 100,000 P of compute; the largest H800 cluster is 20,000 to 30,000 cards, totaling about 40,000 P; and the largest A100 cluster is 16,000 cards, totaling about 9,600 P.
For the H20, the theoretical cluster scaling limit is also 50,000 cards. At a single-card compute of 0.148 P (FP16/BF16), such a cluster provides a total of 7,400 P, far lower than the H100/H800/A100.
Meanwhile, factoring in the balance between compute and communication, a reasonable median estimate of the effective compute of 50,000 H20 cards is about 3,000 P. For training models at the 100-billion-parameter scale, the H20 would likely be stretched thin, and the cluster network topology would need to be expanded further outward.
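As a rough illustration of the arithmetic above, the sketch below multiplies the per-card figure quoted in this article by the cluster size, then applies a hypothetical effective-utilization factor of about 40% purely to reconcile the 7,400 P theoretical peak with the ~3,000 P median estimate; that factor is an assumption for illustration, not a figure from Nvidia or the author.

```python
# Rough cluster-compute arithmetic with the figures quoted in this article.
CARDS = 50_000
H20_FP16_PFLOPS = 0.148        # per-card FP16/BF16 compute (PFLOPS)
ASSUMED_UTILIZATION = 0.40     # hypothetical compute-vs-communication efficiency

peak = CARDS * H20_FP16_PFLOPS           # = 7,400 P theoretical total
effective = peak * ASSUMED_UTILIZATION   # ~ 2,960 P, close to the ~3,000 P estimate
print(f"peak = {peak:,.0f} P, effective ~ {effective:,.0f} P")
```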
Judging from the HGX H20's overall hardware parameters, it maxes out nearly every metric other than the compute and performance-density thresholds strictly limited by the Commerce Department's ban. It is clearly positioned as a general-purpose processor.
For the LLM business, however, the H20 would in practice be used for thousand-card distributed training. Although most of the effectively used time is GPU matrix-multiplication time, and the share of time spent on communication and memory access shrinks, the single-card compute spec is still low, and scaling a thousand-card cluster beyond its limits erodes its cost-effectiveness. The H20 is better suited to training/inference for vertical models, and will struggle to meet the training requirements of 100-billion-parameter LLMs.
It should be noted that trying to match or surpass the performance of an ultra-high-compute GH200 by stacking larger numbers of cheaper, lower-spec GPUs in parallel is a false proposition.
Such a setup faces too many constraints, and the ROI of building and operating it is low: there is no good answer on compute utilization, parallelization strategy, overall cluster power consumption, hardware cost, or networking cost. An H20 cluster can reasonably be compared with an A800 cluster, but comparing it with H100/GH200 clusters is unrealistic.
On basic specs, the H20's compute is roughly 50% of an A100 and 15% of an H100: single-card compute of 0.148 P (FP16) / 0.296 P (INT8), 900 GB/s NVLink, and six HBM3e stacks (the memory configuration matches the H100 SXM version, i.e. 6 × 16 GB = 96 GB), on the same 814 mm² die.
Considering that HBM accounts for 55%-60% of the H100 single-card bill of materials, and that the whole card's BOM is about US$3,320 (the H20's cost is similar, or even higher once the extra L2 cache, the additional fusing steps, and the larger HBM capacity and NVLink bandwidth relative to the H800 are factored in), then under the usual channel pricing rules, the H20's channel unit price may land at a level similar to the H100/H800.
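For reference, here is a short sketch of the spec and cost arithmetic above. The A100/H100 per-card dense FP16 tensor figures (roughly 0.312 P and 0.99 P) are commonly cited public specs rather than numbers from this article, and the BOM split simply applies the article's 55%-60% HBM share to the estimated US$3,320 card BOM.

```python
# Compute ratios behind "about 50% of an A100 and 15% of an H100"
# (A100/H100 per-card dense FP16 figures are commonly cited public specs).
H20_FP16, A100_FP16, H100_FP16 = 0.148, 0.312, 0.99   # PFLOPS per card
print(f"vs A100: {H20_FP16 / A100_FP16:.0%}, vs H100: {H20_FP16 / H100_FP16:.0%}")

# BOM split implied by the article's estimates.
CARD_BOM_USD = 3_320               # estimated H100 single-card bill of materials
HBM_SHARE = (0.55, 0.60)           # estimated HBM share of the card BOM
low, high = (CARD_BOM_USD * s for s in HBM_SHARE)
print(f"HBM cost ~ ${low:,.0f}-${high:,.0f} of ${CARD_BOM_USD:,}")
```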
For reference, several earlier market prices (channel prices from a certain Internet company and a certain tier-1 server manufacturer):
- DGX A800 PCIe 8-card server about 1.45 million yuan/unit, NVLink version 2 million yuan/unit
- DGX H800 NVLink version server, domestic channel price is about 3.1 million yuan/unit (excluding IB)
- DGX H100 NVLink version server, Hong Kong channel price is about 450,000 US dollars/unit (excluding IB)
- The H100 PCIe single card is about US$25,000-30,000; the H800 PCIe single-card price is unconfirmed, and single-card distribution channels are not regulated
Editor/jayden