

Compared to the H100, how do Nvidia's H20, L20, and L2 chips perform?

wallstreetcn ·  Nov 10, 2023 15:36

Source: Wall Street News
Author: Bu Shuqing

Theoretically, the H100 is more than 6 times faster than the H20, but in LLM inference, the H20 is more than 20% faster than the H100.

According to the latest media reports, Nvidia (NVDA.US) will soon launch at least three new AI chips, including the H20 SXM, PCIe L20, and PCIe L2, to replace the H100, which the US has restricted from export. All three chips are based on the Hopper GPU architecture, with maximum theoretical performance of up to 296 TFLOPS (trillions of floating-point operations per second).

What is almost certain is that all three AI chips are "cut-down" or "reduced-spec" versions of the H100.

Theoretically, the H100 is 6.68 times faster than the H20. However, according to a recent blog post by analyst Dylan Patel, even if the H20's actual utilization rate (MFU) reaches 90%, its performance in a real multi-card interconnect environment is still only close to 50% of the H100's.

There are also media reports that the overall computing power of the H20 is only equivalent to about 20% of the H100's, and that the addition of HBM memory and NVLink interconnect modules significantly raises its compute cost.

However, the H20 also has clear advantages: it is more than 20% faster than the H100 in large language model (LLM) inference. The reason is that the H20 is similar in some respects to the H200, the next-generation AI super chip to be released next year.

Nvidia has already produced samples of these three chips. The H20 and L20 are expected to launch in December this year, while the L2 will launch in January next year. Product sampling will begin one month before launch.

H20 vs. H100

Let's look at the H100 first. It has 80 GB of HBM3 memory, 3.4 TB/s of memory bandwidth, 1979 TFLOPS of theoretical performance, and a performance density (TFLOPS/die size) of 19.4, making it the most powerful GPU in Nvidia's current product line.

The H20 has 96 GB of HBM3 memory and memory bandwidth of up to 4.0 TB/s, both higher than the H100, but its computing power is only 296 TFLOPS and its performance density is 2.9, far below the H100.

Theoretically, the H100 is 6.68 times faster than the H20. However, it is worth noting that this comparison is based on FP16 Tensor Core floating-point compute (FP16 Tensor Core FLOPS) with sparsity enabled (which greatly reduces the amount of computation and therefore significantly increases speed), so it does not fully reflect all of the chips' computational capabilities.

In addition, the GPU's thermal design power (TDP) is 400W, lower than the H100's 700W, and it can be configured as an 8-GPU setup in the HGX solution (Nvidia's GPU server platform). It also retains the 900 GB/s NVLink high-speed interconnect and provides 7-way MIG (Multi-Instance GPU) functionality.

H100 SXM FP16 (sparsity) TFLOPS = 1979
H20 SXM FP16 (sparsity) TFLOPS = 296
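
To make the 6.68x figure concrete, here is a minimal sketch (mine, not the article's) that simply divides the two peak FP16-with-sparsity numbers quoted above:

```python
# Minimal sketch: reproduce the theoretical H100/H20 gap from the peak
# FP16-with-sparsity figures quoted in the article (values in TFLOPS).
H100_PEAK_TFLOPS = 1979
H20_PEAK_TFLOPS = 296

ratio = H100_PEAK_TFLOPS / H20_PEAK_TFLOPS
# Prints ~6.69; the article quotes 6.68, presumably from unrounded spec values.
print(f"Theoretical H100/H20 gap: {ratio:.2f}x")
```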

According to Patel's LLM performance comparison model, the H20's peak tokens per second at low batch sizes is 20% higher than the H100's, and its token-to-token latency at low batch sizes is 25% lower than the H100's. This is due to the reduction in the number of chips required for inference from two to one. If 8-bit quantization is also used, the LLaMA 70B model can run effectively on a single H20 instead of requiring two H100s.
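
The single-H20 claim rests on simple memory arithmetic: at 8 bits per weight, a 70B-parameter model's weights fit in one 96 GB card, whereas at FP16 they exceed one 80 GB H100. The sketch below is my illustration of that back-of-the-envelope calculation (weights only, ignoring KV cache and activations), not a figure from the article:

```python
# Back-of-the-envelope memory math for serving a 70B-parameter model.
# Counts model weights only; KV cache and activations need extra headroom.

def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    # 1e9 parameters * N bytes/param = N GB (decimal) per billion parameters
    return params_billions * bytes_per_param

fp16_gb = weight_memory_gb(70, 2.0)  # FP16: 2 bytes per weight
int8_gb = weight_memory_gb(70, 1.0)  # 8-bit quantization: 1 byte per weight

print(f"70B weights in FP16: ~{fp16_gb:.0f} GB -> exceeds one 80 GB H100, so two are needed")
print(f"70B weights in INT8: ~{int8_gb:.0f} GB -> fits within a single 96 GB H20")
```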

It is worth mentioning that although the H20's computing power is only 296 TFLOPS, far less than the H100's 1979, if the H20's actual utilization rate (MFU) can reach 90% (the H100's MFU is currently only 38.1%), it can effectively deliver about 270 TFLOPS, which puts its performance in a real multi-card interconnect environment close to 50% of the H100's.
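
The arithmetic behind the ~270 TFLOPS figure is just peak throughput multiplied by utilization; the sketch below assumes the roughly 90% H20 MFU the article cites:

```python
# Effective throughput = peak TFLOPS * model FLOPS utilization (MFU).
H20_PEAK_TFLOPS = 296
H20_MFU = 0.90  # utilization the article assumes the H20 can reach

h20_effective = H20_PEAK_TFLOPS * H20_MFU
# Prints ~266 TFLOPS, which the article rounds to about 270; combined with the
# H100's quoted 38.1% MFU and multi-card interconnect overhead, this is the basis
# for the "close to 50% of the H100" claim.
print(f"H20 effective throughput: ~{h20_effective:.0f} TFLOPS")
```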

From the perspective of traditional compute, the H20 is a downgrade from the H100, but in LLM inference the H20 will actually be more than 20% faster than the H100. The reason is that the H20 is similar in some respects to the H200 to be released next year. Note that the H200 is the successor to the H100, a super chip for complex AI and HPC workloads.

L20 and L2 configurations are more streamlined

Meanwhile, the L20 comes with 48 GB of memory and 239 TFLOPS of compute performance, while the L2 comes with 24 GB of memory and 193 TFLOPS.

The L20 is based on the L40 and the L2 on the L4, but these two chips are not commonly used for LLM inference and training.

Both the L20 and L2 adopt the PCIe form factor, with PCIe specifications suited to workstations and servers. Compared with higher-specification models such as the Hopper-based H800 and the A800, their configurations are also more streamlined.

But Nvidia's software stack for AI and high performance computing is so valuable to some customers that they are reluctant to abandon the Hopper architecture, even if the specifications are downgraded.

L40 FP16 (sparsity) TFLOPS = 362
L20 FP16 (sparsity) TFLOPS = 239
L4 FP16 (sparsity) TFLOPS = 242
L2 FP16 (sparsity) TFLOPS = 193

Editor: tolk
