Nvidia “downgrades” for China: the H800 becomes the H20. How is it done technically, and is the performance good enough?

Tencent Technology · Nov 28, 2023 09:07

Source: Tencent Technology
Author: Zhang Shujia Morris

On October 17, the United States updated its export control rules so that advanced chips whose performance exceeds specified thresholds require an export license. Under these strict limits, the H800 and A800, the two chips that Nvidia (NVDA.US) built specifically for the Chinese market, also face a sales ban. The US Department of Commerce defines the advanced-chip performance thresholds as follows:

●Total computing power ≥ 4800 TOPS;
●Total computing power ≥ 1600, and performance density ≥ 5.92;
●2400 ≤ total computing power < 4800, and 1.6 < performance density < 5.92;
●Total computing power ≥ 1600, and 3.2 ≤ performance density < 5.92.
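
The four conditions listed above can be read as a simple predicate over two numbers. The following is a minimal sketch, assuming the thresholds exactly as this article summarizes them (not the legal text of the rule); the function names and any inputs are hypothetical.

```python
def matched_conditions(total: float, density: float) -> list[int]:
    """Return the 1-based indices of the article's four bullets that a chip meets.

    `total` is total computing power and `density` is performance density,
    in the units the article uses. Thresholds are copied from the article's
    summary, not from the regulation itself.
    """
    conditions = [
        total >= 4800,
        total >= 1600 and density >= 5.92,
        2400 <= total < 4800 and 1.6 < density < 5.92,
        total >= 1600 and 3.2 <= density < 5.92,
    ]
    return [i for i, hit in enumerate(conditions, start=1) if hit]


def is_restricted(total: float, density: float) -> bool:
    """True if the chip meets at least one of the listed conditions."""
    return bool(matched_conditions(total, density))
```

Returning the matching bullet numbers rather than a bare boolean makes it easier to see which threshold a hypothetical part trips.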

Faced with the new rules, Nvidia offered two responses: first, negotiate with the US Department of Commerce for licenses that would whitelist specific Chinese customers; second, design yet another custom, China-specific product line around the new rules.

On the just-concluded fiscal third-quarter earnings call, Nvidia Chief Financial Officer Colette Kress confirmed this. Kress said Nvidia is working with some customers in the Middle East and China to obtain US government licenses to sell high-performance products. In addition, Nvidia is trying to develop new data center products that comply with government policy and do not require a license.

01. How was the H800 “cut down” into the H20?

The new China-specific products Nvidia is trying to develop are the H20, L20, and related parts widely rumored in the industry. According to the latest reports, their launch has been pushed back to the first quarter of 2024.

The puzzle is that developing, designing, and producing new China-specific chips such as the H20 falls completely outside the normal chip cadence. How did Nvidia produce this solution in such a short time?

The answer is one of the key topics of this article: a back-end “point-cut” production process, in which selected circuits are physically disconnected. In the more commonly used term, the chip is cut down.

Judging by normal design and production cycles and product release cadence, China-specific chips such as the H20 and L20 appearing at this point in time are unlikely to be the result of new photomasks and a new tape-out. A more reasonable inference is that they are new SKUs created by modifying existing dies with a physical point-cut process in the semiconductor back end and then repackaging them.

Point-cutting is a modification method in the back-end-of-line (BEOL) stage of semiconductor manufacturing. It allows certain transistor and wire repair techniques to be applied without remaking the photomask, including laser point-cutting on the die surface, point-cutting at the CoWoS level, and even hand-carving wires under a tunneling microscope.

Imagine this scenario: in the clean rooms of TSMC (TSM.US), which manufactures the H800 for Nvidia, at Fab 18A in the Southern Taiwan Science Park, Fab 15B in Taichung, and Advanced Packaging Fab 5 in Taichung, several batches of dies previously produced to the reduced spec had not yet been diced, metallized with wires and electrodes, or packaged into H800 and L40S parts. They were instead put through the back-end point-cut process and repackaged as the H20 and L20.

02. Laser point-cutting on the die surface is a long-standing semiconductor manufacturing technique

By industry convention, the cache size and the underlying physical interconnect (PHY channels) of a digital logic chip can both be disabled and masked off through rework or point-cutting at the back-end packaging and test stage. Reworking low-scoring dies in particular has been standard practice for decades. For example, one important difference between the early Pentium and Celeron processors was cache that had been cut off.

Very small local areas could once be handled by hand (akin to micro-engraving). For somewhat larger areas, the layout can be redesigned to reserve point-cut locations, and a machine then performs the disabling cut.

In practice, fabs are usually equipped with dedicated tools in which a laser cuts traces or trenches directly on the die. At Intel's Fab 42 in Chandler, Arizona, there is also said to be equipment for hand-carving transistors directly under a dedicated tunneling microscope, claimed to work at atomic scale and to differ from an ordinary scanning tunneling microscope. Intel mentioned the tool in a promotional video a few years ago, and rumor has it that no more than 14 certified operators exist worldwide.

Back in the days of planar transistors, hand-carving under a microscope was not an especially difficult operation. With FinFET and its vertical 3D gate structure, however, the cost of hand-carving equipment and operators became prohibitive.

So how were the H20 and L20 derived from the H800 and L40S by cutting down the spec? First, the relevant parameters:

H20: corresponds to the H100/H800 series, Hopper architecture (HBM3, 2.5D CoWoS packaging, NVLink)
L20: corresponds to the L40S series, Ada Lovelace architecture (GDDR6, 2D InFO packaging, PCIe Gen4)
*Note: firmware is modified accordingly.

Recall the key difference between the H100 and H800, which share an architecture: the underlying physical interconnect (SerDes PHY). Cutting the H100 down to the H800 can be done by disabling parts locally with physical point-cuts. The H20 shares the same architecture as those two products, but the area of dark silicon cut off is presumably larger, so it is unclear whether conventional point-cutting is still worthwhile; a new layout may be needed.

Beyond the physical-layer interconnect (SerDes PHY), the products also differ in the die area devoted to double-precision floating-point (FP64) units and to tensor cores (used for matrix and convolution workloads). This part is hard to call, but it is plausibly a similar case of building in physical redundancy and then masking it off. Today's design methodologies all push modularity, post-tape-out testing already separates “70-point” dies from “90-point” dies, and a GPU contains more than one FP64 unit, so disabling units locally with physical point-cuts is reasonable.

03. Design redundancy makes point-cutting possible, and is standard practice at major chipmakers

Two examples: A. Intel's F-series CPUs, still on the market today, are 70-point dies with the integrated graphics core cut off; B. the first two generations of Apple silicon were officially announced with an 8-core NPU but actually contain 9 cores, which is design redundancy.

These are basic operations in wafer manufacturing, especially on pilot fabs and lines. During the transition between alpha and beta tape-outs, small errors are fixed directly by hand rather than by going back to revise the mask and taping out again.

From the chip designer's perspective, design redundancy is built into the development flow, because front-end lithography emphasizes high yield. Failed transistors are counted, the test stage judges yield at the module level, bad blocks can be cut out of the circuit directly, and the subsequent wire-bonding and lidding steps stay unchanged.
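
To see why a single spare block helps yield, consider a simple binomial sketch. It assumes each core is defect-free independently with the same probability `p_core`; that value and the function name are hypothetical, chosen only for illustration, and the 8-of-9 figures mirror the NPU example mentioned earlier (8 cores announced, 9 built).

```python
from math import comb


def p_at_least(k: int, n: int, p_core: float) -> float:
    """Probability that at least k of n independent cores are defect-free."""
    return sum(
        comb(n, j) * p_core**j * (1 - p_core) ** (n - j)
        for j in range(k, n + 1)
    )


p_core = 0.95  # hypothetical per-core yield, not a figure from the article

no_spare = p_at_least(8, 8, p_core)   # 8 cores built, all 8 must work
one_spare = p_at_least(8, 9, p_core)  # 9 cores built, any 8 must work

print(f"sellable with 8 of 8 good: {no_spare:.3f}")
print(f"sellable with 8 of 9 good: {one_spare:.3f}")
```

Under these assumptions the spare core raises the share of sellable dies noticeably, which is the economic case for designing in blocks that will later be cut off.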

For example, three years ago Intel launched F-series CPUs without integrated graphics, a product of physical cut-down: the graphics core was point-cut and the chip repackaged for sale. However, these chips occasionally drew very large amounts of power. After user complaints, a test setup showed that the graphics core, which had been disabled by physical point-cut, behaved uncontrollably once powered, causing unexplained power faults.

This case illustrates what we described above: on the same production line, a chip disabled by point-cuts goes through unchanged wiring, pin, and packaging steps and can still be sold. Early Intel 10nm yields in particular were very low, leaving a large backlog of such low-scoring dies, which is why chips with a disabled graphics core were stamped with an F and sold.

The room for such “redundancy” may be large today. The H100 is already a big chip at 814 square millimeters, close to the reticle limit (26 mm × 33 mm = 858 mm²). The newly released cut-down H20 delivers roughly 15% of the H100's performance, yet its bill-of-materials cost is nearly the same.
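
The figures in this paragraph can be checked with a few lines of arithmetic. This is a rough sketch using only the numbers quoted above; treating the H20's material cost as exactly equal to the H100's is a simplifying assumption, and the variable names are illustrative.

```python
RETICLE_W_MM = 26.0
RETICLE_H_MM = 33.0
H100_DIE_MM2 = 814.0
H20_RELATIVE_PERF = 0.15  # "about 15% of the H100", per the article

reticle_mm2 = RETICLE_W_MM * RETICLE_H_MM   # maximum area of a single exposure
fill_ratio = H100_DIE_MM2 / reticle_mm2     # share of the reticle the die uses
headroom_mm2 = reticle_mm2 - H100_DIE_MM2   # area left before hitting the limit

# Assumption: identical material cost, so cost per unit of compute scales as 1/perf.
cost_per_perf_multiple = 1.0 / H20_RELATIVE_PERF

print(f"reticle area: {reticle_mm2:.0f} mm^2")
print(f"die fills {fill_ratio:.1%} of the reticle, {headroom_mm2:.0f} mm^2 to spare")
print(f"material cost per unit of compute vs H100: about {cost_per_perf_multiple:.1f}x")
```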

04. Point-cutting at the package level is more practical and more economical

Besides laser point-cutting on the surface of the logic die, some special locations also call for point-cuts, such as the CoWoS interposer.

CoWoS is TSMC's 2.5D packaging technology. It lets multiple dies be packaged together, with interconnect, memory, and other components all connected through a silicon interposer, yielding a small package, low power consumption, and few pins.

Compared with laser point-cutting on the die surface, performing the cuts in the front portion of CoWoS, that is, the CoW part consisting of through-silicon vias and the interposer, is a more economical way to differentiate products and makes yield easier to guarantee. Because the compute logic die and the I/O dies are separate, it is possible to mask off physical interconnect channels and to scale back HBM3 memory performance. Differentiating on the silicon interposer is also easier and cheaper than making all the changes on the logic die, because the line-width precision required on the interposer is lower; cutting only the topmost metal layer may even suffice.

However, work on the CoWoS interposer can only mask off physical interconnect and HBM memory. It cannot mask off compute logic area such as FP64 units and Tensor Core units, so it has to be supplemented with the point-cuts on the logic die surface described earlier.

In addition, a circuit disabled by physical point-cut normally cannot be detected by an outside third party, and the process is irreversible. Today's chips have more than ten metal layers; once the die surface has been modified, the metal layers above it cannot be seen through without reverse-engineering imaging scans.

In summary, the further cut-down, China-specific models such as the H20 and L20 can be judged to be H800 and L40S dies modified by a back-end physical point-cut process, then repackaged with modified firmware to become new SKUs.

Recall that Nvidia's earlier backlog of US$5 billion in GPUs destined for China had not yet been delivered. If those parts went back to the fab for back-end modification, which would explain how the new SKUs could be released so quickly, then the US$5 billion in orders from Chinese companies may well be converted to these three models.

05. What the cut-down H20 can and cannot do

Key AI chip parameters and export control status; APPLIES means controlled, DOESN'T APPLY means not controlled

Below is a side-by-side comparison of the H20 with the H100, H800, and A100 along four dimensions: product specifications, single-card and cluster compute efficiency, bill-of-materials cost, and pricing.

For aggregate cluster compute, the H100 and H800 are currently the top-tier deployments in AIDC compute clusters. The H100's theoretical scaling limit is a 50,000-card cluster delivering up to 100,000 P of compute; the largest H800 cluster is 20,000 to 30,000 cards, totaling 40,000 P; the largest A100 cluster is 16,000 cards, totaling 9,600 P.

For the H20, the theoretical cluster scaling limit is 50,000 cards. At 0.148 P per card (FP16/BF16), the cluster delivers 7,400 P in total, far below the H100, H800, and A100.

Moreover, based on an estimate of the balance between compute and communication, a reasonable median for the overall compute of 50,000 H20 cards is about 3,000 P. For training models with hundreds of billions of parameters, the H20 would likely be stretched thin and would need the cluster network topology to scale out further.
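
The cluster figures above reduce to simple multiplication and division. The sketch below uses only the numbers quoted in the article; the H800 is left out because its card count is given as a range, and the variable names are illustrative.

```python
# (cards, total cluster compute in P) as quoted in the article
quoted_clusters = {
    "H100": (50_000, 100_000),
    "A100": (16_000, 9_600),
}

H20_CARDS = 50_000
H20_PER_CARD_P = 0.148      # FP16/BF16, per the article
H20_MEDIAN_CLUSTER_P = 3_000  # the article's estimated realistic median

# Per-card compute implied by the quoted cluster totals.
implied_per_card = {
    name: total_p / cards for name, (cards, total_p) in quoted_clusters.items()
}

h20_peak_p = H20_CARDS * H20_PER_CARD_P
h20_effective_share = H20_MEDIAN_CLUSTER_P / h20_peak_p

for name, per_card in implied_per_card.items():
    print(f"{name}: {per_card:.2f} P per card implied by the cluster total")
print(f"H20 peak cluster compute: {h20_peak_p:.0f} P")
print(f"H20 estimated median as a share of peak: {h20_effective_share:.0%}")
```

On these numbers, the article's 3,000 P estimate corresponds to roughly two fifths of the H20 cluster's nominal peak.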

Yet taken together, the HGX H20's hardware parameters max out nearly every metric other than the compute threshold strictly limited by the US Department of Commerce's performance-density rule. It is clearly positioned as a general-purpose processor for both training and inference.

For large language models (LLMs), though, consider actually using the H20 for distributed training across a thousand cards. Most of the effective utilization time would be matrix multiply-accumulate time on the GPU, with communication and memory access taking a smaller share. Even so, per-card compute is low, and scaling the cluster beyond that point lowers its cost-effectiveness. The H20 is better suited to training and inference for vertical, domain-specific models and will struggle to meet the training needs of LLMs with hundreds of billions of parameters.

Note that trying to match or exceed the performance of a single ultra-high-compute GH200 by clustering more low-spec, cheaper GPUs in parallel is a paradox.

This approach has many constraints, and the ROI of building and operating such an environment is not high: compute utilization, execution of parallelism strategies, total cluster energy consumption, hardware cost, and networking cost all fall short of the ideal. H20 cluster performance is comparable to that of an A800 cluster; comparing it with H100 or GH200 clusters is unrealistic.

As for basic specifications, the H20's compute is roughly 50% of an A100 and 15% of an H100. Per-card compute is 0.148 P (FP16) / 0.296 P (INT8), with 900 GB/s NVLink and 6 HBM3e stacks (the same memory bill of materials as the H100 SXM version, i.e. 6 × 16 GB = 96 GB). The die size is likewise 814 mm².
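
A quick consistency check on these specifications, as a minimal sketch using only the figures in this paragraph. The "implied" A100 and H100 numbers are back-calculated from the article's percentages and are not datasheet values; the variable names are illustrative.

```python
H20_FP16_P = 0.148
H20_INT8_P = 0.296
HBM_STACKS = 6
GB_PER_STACK = 16

hbm_capacity_gb = HBM_STACKS * GB_PER_STACK    # total HBM capacity
int8_to_fp16_ratio = H20_INT8_P / H20_FP16_P   # INT8 throughput relative to FP16

# Back-calculated from "about 50% of an A100 and 15% of an H100".
implied_a100_fp16_p = H20_FP16_P / 0.50
implied_h100_fp16_p = H20_FP16_P / 0.15

print(f"HBM capacity: {hbm_capacity_gb} GB")
print(f"INT8 / FP16 throughput: {int8_to_fp16_ratio:.1f}x")
print(f"implied A100 FP16: {implied_a100_fp16_p:.3f} P")
print(f"implied H100 FP16: {implied_h100_fp16_p:.3f} P")
```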

HBM accounts for 55% to 60% of the bill-of-materials cost of a single H100 GPU card, and the whole card's materials cost about US$3,320. The H20's cost is similar, or even higher given the larger L2 cache and the added point-cut step; compared with the H800 it also has more HBM3 capacity and more NVLink lane bandwidth. Under the resulting channel pricing rules, the H20's channel unit price may therefore be close to that of the H100 and H800.
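
The HBM share translates into dollar amounts as follows. This is a rough sketch based solely on the two figures quoted above (a US$3,320 card-level material cost and a 55% to 60% HBM share); the variable names are illustrative.

```python
H100_BOM_USD = 3320.0
HBM_SHARE_LOW = 0.55
HBM_SHARE_HIGH = 0.60

# Dollar range attributable to HBM.
hbm_cost_low = H100_BOM_USD * HBM_SHARE_LOW
hbm_cost_high = H100_BOM_USD * HBM_SHARE_HIGH

# Whatever remains covers the GPU die, packaging, board, and other parts.
rest_cost_low = H100_BOM_USD - hbm_cost_high
rest_cost_high = H100_BOM_USD - hbm_cost_low

print(f"HBM: about ${hbm_cost_low:,.0f} to ${hbm_cost_high:,.0f}")
print(f"everything else: about ${rest_cost_low:,.0f} to ${rest_cost_high:,.0f}")
```

Because the memory configuration is the dominant cost item and is unchanged, disabling compute units on the die barely moves the total, which is consistent with the article's point that the H20 costs about as much to build as the H100.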

For comparison, here are several street prices (channel prices from a top-tier internet company and a top-tier server vendor):

- DGX A800 PCIe 8-card server: about 1.45 million yuan per unit; NVLink version: 2 million yuan per unit
- DGX H800 NVLink server: quoted at about 3.1 million yuan per unit through mainland China channels (excluding IB)
- DGX H100 NVLink server: quoted at about US$450,000 per unit through Hong Kong channels (excluding IB)
- H100 PCIe single card: quoted at about US$25,000 to US$30,000; H800 PCIe single-card pricing is still uncertain, and single cards circulate through unofficial channels

Editor/jayden
