
Nvidia, overtaken on the curve?

Semiconductor Industry Observation · Jan 6, 16:40

Source: Semiconductor Industry Observation

Author: Shao Yiqi

According to incomplete statistics, the semiconductor industry has developed roughly 1,000 package types to date. Divided by interconnect type, they include wire bonding, flip chip, wafer-level packaging (WLP), and through-silicon vias (TSVs). Countless dies connected by these interconnects make up today's increasingly prosperous packaging market.

Among them, advanced packaging has drawn the most attention and popularity over the past two years, and its importance has only grown as advanced process nodes progress slowly. The traditional "big three", Advanced Micro Devices (AMD), Intel (INTC), and NVIDIA (NVDA), have all stepped in, moving from 2D to 2.5D packaging and now challenging the peak of 3D packaging.

In June 2023, AMD officially launched the MI300X and MI300A AI accelerators in San Francisco. The MI300X uses eight XCDs, four I/O dies, eight HBM3 stacks, up to 256 MB of AMD Infinity Cache, and a 3.5D package design, and it supports new math formats such as FP8 and sparsity. Designed for all AI and HPC workloads, it packs 153 billion transistors, making it the largest chip AMD has manufactured to date.

AMD said the MI300X delivers 1.6 times the performance of the Nvidia H100 on AI inference workloads and comparable performance on training, giving the industry a much-needed high-performance alternative to Nvidia's GPUs. The accelerator also offers more than twice the HBM3 memory capacity of Nvidia's GPUs, an astonishing 192 GB, allowing the MI300X platform to host more than twice as many LLMs per system and to run larger models than the H100 HGX.

Of course, the most notable point is the "3.5D packaging" claimed by AMD, which it says was achieved by combining 3D hybrid bonding with a 2.5D silicon interposer.

Sam Naffziger, senior vice president and corporate fellow at AMD, said, "This is a truly amazing silicon stack, providing the highest density performance the industry has ever known." The integration uses two TSMC technologies: SoIC (System on Integrated Chips) and CoWoS (Chip on Wafer on Substrate). The former, SoIC, uses hybrid bonding to stack smaller dies on top of larger ones, connecting directly to the copper pads on each die without solder; it is what stacks the V-Cache high-speed cache die on top of AMD's highest-end CPU dies. The latter, CoWoS, mounts dies on a larger piece of silicon, called an interposer, that accommodates high-density interconnects.

While Nvidia also used TSMC's 2.5D CoWoS packaging in the H200, AMD was first to combine TSMC's SoIC 3D packaging with CoWoS 2.5D packaging, and its early bet on chiplets seems to have laid the groundwork for this overtaking on the curve.

Building chips like building blocks

First, let's review the specific architecture of the MI300X and MI300A. According to AMD's official explanation, the MI300 series uses TSMC's SoIC (System on Integrated Chips) hybrid bonding to 3D-stack the various compute elements, whether CPU CCDs (core complex dies) or GPU XCDs (accelerator complex dies), on top of four underlying I/O dies. Each I/O die can carry two XCDs or three CCDs. Each CCD is the same CCD used in existing EPYC chips, with eight SMT-enabled Zen 4 cores. The MI300A uses three CCDs and six XCDs, while the MI300X uses eight XCDs.

The XCD is the chiplet responsible for compute in AMD's GPUs. On the MI300X, the eight XCDs together contain 304 CDNA 3 compute units, i.e., 38 CUs per XCD. For comparison, the AMD MI250X has 220 CUs, so this is a big leap forward.

The HBM stacks, on the other hand, are connected using a standard 2.5D interposer. Each I/O die contains a 32-channel HBM3 memory controller hosting two of the eight HBM stacks, giving the device a total of 128 16-bit memory channels. The MI300X uses 12-Hi HBM3 stacks for a total capacity of 192 GB, while the MI300A uses 8-Hi stacks for 128 GB.
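
As a quick sanity check on the figures above, here is a minimal Python sketch that reproduces the configuration arithmetic; the per-stack HBM3 capacities (24 GB for a 12-Hi stack, 16 GB for an 8-Hi stack) are assumptions consistent with the quoted totals rather than figures stated in the article.

    # Minimal sanity check of the MI300 configuration arithmetic quoted above.
    # Assumption: 24 GB per 12-Hi HBM3 stack and 16 GB per 8-Hi stack (2 GB per layer).
    def mi300_summary(name, xcds, cus_per_xcd, hbm_stacks, gb_per_stack,
                      io_dies=4, hbm_channels_per_io_die=32):
        return {
            "part": name,
            "total_CUs": xcds * cus_per_xcd,              # compute units
            "hbm_capacity_GB": hbm_stacks * gb_per_stack,
            "memory_channels": io_dies * hbm_channels_per_io_die,
        }

    mi300x = mi300_summary("MI300X", xcds=8, cus_per_xcd=38, hbm_stacks=8, gb_per_stack=24)
    mi300a = mi300_summary("MI300A", xcds=6, cus_per_xcd=38, hbm_stacks=8, gb_per_stack=16)
    print(mi300x)   # 304 CUs, 192 GB HBM3, 128 channels
    print(mi300a)   # 228 CUs, 128 GB HBM3, 128 channels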

Specifically, AMD's CPU CCDs communicate with the underlying I/O dies through 3D hybrid bonding; in a standard 2.5D package they would use a GMI3 interface. AMD has added a new bond-pad-via interface that bypasses the GMI3 link, providing the TSVs required to stack the dies vertically.

The 5 nm XCD GPU die marks the full chipletization of AMD's GPU design. The XCDs and IODs have hardware-assisted mechanisms to break jobs into smaller pieces, dispatch them, and keep them in sync, reducing host-system overhead; these units also have hardware-assisted cache coherency.

AMD had long been preparing for this step in the MI300 series' packaging. The earliest origins trace back to 1965, when engineers first proposed splitting each large chip into smaller pieces, the idea behind the "chiplet" concept.

In the CPU competition with Intel, the failure of the Bulldozer architecture left AMD in a precarious position; it urgently needed a low-cost way to compete with Intel's more advanced architectures, and Zen was born. The new generation of Ryzen processors adopted a chiplet, or MCM (multi-chip module), architecture, marking a complete transformation of the entire PC and chipmaking industry.

The original Zen architecture was relatively simple. It used an SoC design, with everything from the cores to the I/O and controllers on the same die. It also introduced the CCX concept: the CPU cores were grouped into quad-core complexes and combined over the Infinity Fabric interconnect to form a single die. Consumer parts, however, were still monolithic single-die designs.

Zen+ remained largely the same (on a more advanced node), but Zen 2 was a major upgrade: the first chiplet-based consumer CPU design, with up to two compute dies (CCDs) plus an I/O die. AMD added a second CCD to Ryzen 9, bringing core counts never before seen in the consumer segment.

Zen 3 further refined the chiplet design, removing the split CCX and merging eight cores and 32 MB of cache into a unified complex within the CCD, which greatly reduced cache latency and simplified the memory subsystem. For the first time, AMD Ryzen processors offered better gaming performance than rival Intel parts. Zen 4 made no significant changes to the CCD layout other than shrinking it to a newer node.

In the EPYC series, the first-generation AMD EPYC processor was built from four replicated chiplets. Each die provided 8 Zen CPU cores, 2 DDR4 memory channels, and 32 PCIe lanes to meet the performance targets, and AMD had to reserve some die area for the Infinity Fabric interconnect between the four chiplets.

The first chiplet of the second-generation EPYC is the I/O die (IOD), built on a 12 nm process and containing 8 DDR4 memory channels, 128 PCIe Gen4 lanes, other I/O (such as USB and SATA), the SoC data fabric, and other system-level features. The second chiplet is the core complex die (CCD), built on a 7 nm process. In an actual product, AMD pairs one IOD with up to 8 CCDs, each providing 8 Zen 2 CPU cores, for up to 64 cores in a single package.

The third-generation EPYC offers up to 64 cores and 128 threads using AMD's latest Zen 3 cores. The processor is built from eight chiplets of eight cores each, and this time all eight cores within a chiplet are directly connected, effectively doubling the L3 cache available to each core and lowering overall cache latency.

The fourth-generation EPYC uses up to 12 chiplets on top of the existing architecture, with 5 nm core complex dies (CCDs) surrounding an I/O die built on a 6 nm process. Each CCD carries 32 MB of L3 cache, and each core has 1 MB of L2 cache.

These CPUs paved the technological way for the MI300 series' chiplets.

In January 2021, a patent application from AMD for an MCM GPU chiplet design surfaced at the US Patent and Trademark Office, entitled "GPU Chiplets Using High Bandwidth Crosslinks" and numbered "US 2020/0409859 A1." In the patent description, AMD outlines a chiplet future for graphics chips: one GPU chiplet communicates directly with the CPU, while the other chiplets communicate with one another over passive high-bandwidth crosslinks and are arranged as a system-on-chip (SoC) on an interposer.

In November 2023, AMD disclosed another chiplet patent describing a GPU layout very different from existing designs: a large number of memory cache dies (MCDs) distributed around a large main GPU die. It describes a system that distributes the geometry workload across multiple chips, all working in parallel; furthermore, no "central chip" assigns work to subordinate chips, as they all operate independently. The patent indicates that AMD is exploring chiplets for building GCDs rather than relying on one giant piece of silicon.

From the consumer segment to supercomputing to AI, AMD has used chiplets to whip up a red storm, and the force sustaining that storm is advanced packaging technology from TSMC.

The people behind AMD

In an interview with IEEE Spectrum, AMD product technology architect Sam Naffziger said, "Five or six years ago, we started developing the EPYC and Ryzen CPU series. At the time, we conducted extensive research to find the most suitable packaging technology to connect the chips. It's a complex equation involving cost, performance, bandwidth density, power consumption, and manufacturing capability. Coming up with a good packaging technology is relatively easy; actually reaching low-cost mass production is another matter entirely."

In 2011, TSMC developed its first 2.5D package, CoWoS, which was immediately adopted for Xilinx's high-end FPGAs, but its high price kept it from gaining broad traction in the packaging market. Only when the AI wave swept the world did Nvidia, AMD, Google, and Intel come courting, propelling CoWoS to the position of most popular advanced package.

TSMC's CoWoS (Chip on Wafer on Substrate) allows multiple chips or dies to be integrated in a single package. Different types of chips, such as processors, memory, and graphics dies, can thus be combined into one package, increasing performance, reducing power consumption, and shrinking the form factor. The dies are connected through through-silicon vias (TSVs) and interconnected using microbumps. Compared with traditional 2D packaging, this approach shortens interconnect lengths, reduces power consumption, and improves signal integrity.

CoWoS has contributed a great deal to AMD's chiplet products. By dividing large monolithic chips into smaller chiplets, designers can focus on optimizing each chiplet's specific function. This enables better power management, higher clock speeds, and better performance per watt, while also helping to integrate these high-performance dies with other components, such as memory, into a single package, further improving system performance.

CoWoS provided valuable experience for subsequent 3D packaging. In 2018, TSMC introduced SoIC, an innovative multi-chip stacking technology that mainly performs wafer-level bonding for process nodes below 10 nm. Compared with CoWoS, SoIC offers higher packaging density and smaller bond pitches, and it can also be combined with CoWoS to integrate multiple chiplets.

At the IEDM conference, a TSMC vice president presented more details of the company's SoIC roadmap. According to the roadmap, TSMC will start with the 9 μm bond pitch available today, then move to 6 μm, followed by 4.5 μm and 3 μm. In other words, TSMC hopes to introduce a new bond pitch roughly every two years, scaling the pitch to about 70% of the previous generation each time.
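
Put concretely, each step in that roadmap shrinks the bond pitch to roughly 70% of the previous one, and because bond density scales with the inverse square of the pitch, each step roughly doubles the number of connections per unit area:

    \( \tfrac{6}{9} \approx 0.67, \quad \tfrac{4.5}{6} = 0.75, \quad \tfrac{3}{4.5} \approx 0.67, \qquad \text{density} \propto \tfrac{1}{p^{2}} \;\Rightarrow\; \tfrac{1}{0.7^{2}} \approx 2\times \text{ per step.} \)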

He also used an AMD processor as an example of SoIC in use: AMD designed a processor and an SRAM die on the 7 nm process, handed production over to TSMC, and finally bonded them using SoIC technology at a 9 μm bond pitch.

What is being referred to here is the 3D V-Cache added to the EPYC processor codenamed Milan-X, which AMD introduced in 2021, also the world's first data center processor to use 3D die stacking.

AMD said that 3D V-Cache adds 64 MB on top of the third-generation EPYC CPU's existing 32 MB of SRAM per compute die, bringing Milan-X's L3 cache to 96 MB per die. Since the Milan-X architecture has up to 8 compute dies, the CPU's total L3 cache can reach 768 MB. The extra L3 relieves memory-bandwidth pressure and reduces latency, significantly improving application performance.
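
The totals follow directly from the per-die figures:

    \( 32\,\text{MB} + 64\,\text{MB} = 96\,\text{MB per compute die}, \qquad 8 \times 96\,\text{MB} = 768\,\text{MB of L3 in total.} \)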

TSMC's SoIC technology contributed greatly to this step. It permanently bonds the V-Cache die's interconnects directly to the CPU die, shrinking the distance between the dies and achieving a communication bandwidth of 2 TB/s. Compared with the 2D chiplet packaging used in the rest of the third-generation EPYC line, the interconnects in the Milan-X CPU consume only one-third of the energy per bit, a threefold gain in efficiency, while increasing interconnect density by more than 200 times.

This technology later trickled down to the Ryzen 7 5800X3D processor and began to make waves in the consumer market; the latest Ryzen 9 7950X3D also uses 3D V-Cache.

In 2023, TSMC highlighted its new 3DFabric technology at its North America Technology Symposium. 3DFabric mainly consists of three parts: advanced packaging, 3D chip stacking, and design enablement. Advanced packaging lets more processors and memory be placed in a single package to improve computational efficiency; on the design side, TSMC has released the latest version of its open-standard design language to help chip designers handle large, complex chips.

From 2011 to 2023, TSMC's packaging technology evolved for more than a decade, and AMD's chiplet dream was finally realized: the MI300 series is built on the latest 3DFabric, integrating TSMC's SoIC front-end stacking with CoWoS back-end packaging, the culmination of advanced packaging technology in mass production.

The blue giant's packaging strategy

For Intel, packaging is also one of the focuses of its development, and unlike AMD, Intel chose to do its own packaging in an attempt to master the entire process of chip development, production and application.

Intel's counterpart to TSMC's CoWoS 2.5D packaging is called EMIB, which was first applied in products in 2017; Intel's data center processor Sapphire Rapids uses it. Its first-generation 3D IC packaging is called Foveros, which was already used in Intel's Lakefield PC processor in 2019.

EMIB's defining feature is that it connects the various dies, such as memory (HBM) and compute, from below through a silicon bridge. Because the silicon bridge is embedded in the substrate and connected to the dies, memory and compute dies can be linked directly, improving the chip's speed and energy efficiency.

Foveros, on the other hand, is a 3D stack. Chiplets with different functions, such as memory and compute, are stacked on top of one another, and copper interconnects running through each layer tie them together. The stacked dies are then sent to the packaging site for assembly, where the copper interconnects are joined to the circuitry on the board.

In 2022, Intel combined 2.5D and 3D packaging for the first time in a technology named Co-EMIB, an innovative application of EMIB and Foveros that can interconnect two or more Foveros assemblies at essentially single-chip levels of performance. With it, Intel launched Ponte Vecchio, at the time the SoC with the largest transistor count, aimed mainly at the high-performance computing market.

Each Ponte Vecchio processor is actually a mirrored pair of chiplet assemblies connected together using Intel's Co-EMIB. Co-EMIB forms high-density interconnect bridges between the two 3D chiplet stacks; each bridge is a small piece of silicon embedded in the organic package substrate, and interconnects on silicon can be much narrower than interconnects on an organic substrate. The normal connection pitch between Ponte Vecchio and the package substrate is 100 microns, and the connection density within the Co-EMIB bridges is nearly double that. The Co-EMIB bridges also connect the high-bandwidth memory (HBM) and Xe Link I/O chiplets to the "base silicon" (the largest chiplet), on top of which the other dies are stacked.

The base die also uses Intel's 3D stacking technology, Foveros, which establishes a dense array of vertical die-to-die connections between two dies. These connections sit only 36 microns apart and are made by bonding the dies "face to face", that is, the top of one die is bonded to the top of the other. Signals and power enter the stack through through-silicon vias (TSVs), fairly wide vertical interconnects that pass straight through most of the silicon. The Foveros used on Ponte Vecchio is an improvement on the version used in Intel's Lakefield mobile processors, doubling the signal connection density.
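
As a rough check on that "doubling" claim, areal connection density scales with the inverse square of the bump pitch; assuming Lakefield's Foveros pitch was roughly 50 microns (an assumption, not a figure from the article):

    \( \text{density} \propto \frac{1}{p^{2}}, \qquad \left(\frac{50\,\mu\text{m}}{36\,\mu\text{m}}\right)^{2} \approx 1.9, \)

which is indeed consistent with roughly double the signal connection density.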

Achieving this wasn't easy. Intel Fellow Wilfred Gomes said it required innovation in production management, clock circuitry, thermal regulation, and power delivery. For example, Intel engineers chose to feed the processor a higher-than-normal voltage (1.8 volts) to reduce current and simplify the packaging; circuitry in the substrate then steps the voltage down to roughly 0.7 volts for the compute dies, and each compute die must have its own power domain within the substrate.

For Intel, Ponte Vecchio represents the peak of its current advanced packaging technology, and it is hardly inferior to AMD's MI300 series; the two can be described as the red and blue twin stars of today's advanced packaging.

In fact, although Intel lags slightly behind TSMC in advanced process nodes, it is on par with TSMC in advanced packaging. Intel says its flexible foundry services let customers mix and match its wafer manufacturing and packaging offerings, and as an established manufacturer with fabs and packaging plants spread around the world, it can use that geographic footprint to expand capacity and services.

Intel CEO Pat Gelsinger also said in an interview that Intel has advanced next-generation memory architecture capabilities and 3D stacking advantages that apply not only to chiplets but also to oversized packages for artificial intelligence and high-performance servers, and that Intel will apply these technologies to its own products while also offering them to Intel Foundry Services (IFS) customers.

Why Chiplet?

After reading through the technology history of AMD, Intel, and TSMC, many readers will surely have a question: why are they so obsessed with 3D packaging and chiplets?

The reason stems from demand inside the semiconductor industry itself. Moore's Law has allowed ever-higher device integration within the same physical footprint: each lithographic shrink reduces the area of a standard building block by roughly 30%, so about 42% more circuitry can fit without increasing the chip size.
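
Reading the 30% as an area reduction of the standard building block, the two percentages are consistent:

    \( \frac{1}{1 - 0.30} \approx 1.43, \)

that is, roughly 42 to 43 percent more circuitry fits in the same die area after the shrink.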

However, not all semiconductor devices enjoy this dividend; I/O, for example, which can include analog circuits, scales at only about half the rate of logic, forcing designers to look for new approaches. Moreover, shrinking lithography is not cheap: wafers processed on the 7 nm node cost more than wafers on 14 nm, 5 nm costs more than 7 nm, and so on. As wafer prices rise, chiplets are often more economical than monolithic dies.

Furthermore, as new chip designs require design and engineering resources, and as the complexity of new nodes continues to increase, the typical cost of a new design for each new process node also increases, further encouraging people to create reusable designs.

The chiplet design philosophy makes this possible, because new product configurations can be created simply by changing the number and combination of dies. By assembling a single chiplet into 1-, 2-, 3-, and 4-die configurations, four different processor variants can be created from a single tape-out; building them as monolithic chips would require four separate tape-outs.

In its technical presentation on the new Radeon RX 7900 series "Navi 31" graphics processors, AMD explained in detail why the chiplet route is necessary for high-end GPUs.

In fact, compared with its CPUs, AMD's Radeon GPUs have not fared well over the past decade in either profit or revenue, and in the face of Nvidia's competition the need to cut manufacturing costs has become more pressing. With the GeForce "Ada Lovelace" generation, Nvidia continues to bet on monolithic GPUs, with even the largest "AD102" chip being a single die, which gives AMD an opening to reduce GPU manufacturing costs.

Chiplets enabled AMD to wage a price war with Nvidia for market share. The most typical example is AMD's relatively aggressive pricing of $999 and $899 for the Radeon RX 7900 XTX and RX 7900 XT, respectively. According to AMD's own figures, these two products can compete with Nvidia's $1,199 RTX 4080 and, in some cases, even with the $1,599 RTX 4090.

This, in fact, is one of the most notable advantages of chiplets. By using them, AMD can improve yield and simplify design and verification while choosing the best process for each die: the logic can be manufactured on a cutting-edge node, high-capacity SRAM on a roughly 7 nm node, and the I/O and peripheral circuits on roughly 12 nm or 28 nm nodes, reducing design and manufacturing costs.
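
To illustrate the yield point, here is a minimal sketch using a simple Poisson defect-yield model; the die areas (a 600 mm² monolithic die versus four 150 mm² chiplets) and the defect density are illustrative assumptions, not AMD figures, and packaging or assembly yield losses are ignored.

    import math

    # Illustrative Poisson defect-yield model; all numbers are assumptions.
    def die_yield(area_mm2, defects_per_cm2):
        """Fraction of defect-free dies for a given die area and defect density."""
        return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)

    D0 = 0.1  # assumed defects per cm^2

    # One 600 mm^2 monolithic die vs. four 150 mm^2 chiplets that are tested
    # individually before assembly, so a defect scraps only one small die.
    mono_silicon_per_good = 600 / die_yield(600, D0)
    chiplet_silicon_per_good = 4 * 150 / die_yield(150, D0)

    print(f"wafer area per good monolithic product: {mono_silicon_per_good:.0f} mm^2")   # ~1093
    print(f"wafer area per good chiplet product:    {chiplet_silicon_per_good:.0f} mm^2")  # ~697

Under these assumed numbers, the chiplet approach spends roughly a third less wafer area per good product, because a defect only scraps one small, individually tested die rather than the whole large one.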

Furthermore, chiplets make it easy to build derivative products, such as the same logic with different peripheral circuits, or the same peripherals with different logic, and to mix and match dies from different manufacturers rather than being tied to a single supplier.

What goes for AMD goes for Intel too. AMD relies on TSMC's existing technology and pours its effort into chip architecture design; Intel has to work harder, advancing its manufacturing processes and packaging on one hand while iterating on its chips and chiplets on the other. The two companies are now sparring directly over packaging.

Who wins that contest matters less than the fact that 3D packaging and chiplets are gradually moving from data centers and AI accelerators into consumer PC processors, and will eventually reach notebooks and phones; they have become a trend everyone recognizes.

Closing thoughts

Compared with AMD and Intel, Nvidia has been conspicuously "slow" on 3D packaging and chiplets.

In June 2017, Nvidia published the paper "MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability", which proposed an MCM design that can essentially be regarded as today's chiplets.

However, Nvidia never put this design into practice. Instead, in December 2021 it published a paper titled "GPU Domain Specialization via Composable On-Package Architecture"; the COPA-GPU architecture it proposes only splits off the L2 cache, suggesting that Nvidia intends to keep sticking with monolithic single-die designs for now.

The reason Nvidia insists on big dies is simple: compared with the communication bandwidth available inside a monolithic die, chiplets may not suit workloads that demand extreme AI compute, and are better placed to shine in the CPU field. The Grace CPU Superchip that Nvidia released in 2022 uses NVLink-C2C for high-speed chip-to-chip interconnect, and the chip also follows UCIe, a die-to-die interconnect specification.

This caution on chiplets has also left Nvidia out of step with 3D packaging. Although Nvidia is currently one of the biggest customers of TSMC's 2.5D CoWoS, it is not yet among SoIC's customers, making it the last of the big three to embrace this advanced technology.

As chiplets develop rapidly, Nvidia may yet embrace this design concept. The leaker Kopite7kimi said this year that Nvidia's next-generation Blackwell GB100 GPU for high-performance computing (HPC) and artificial intelligence (AI) customers will fully adopt a chiplet design.

Now AMD has pulled a step ahead in AI chips, using chiplets and 3.5D packaging to build a bigger, stronger MI300X, and Intel has likewise fully embraced chiplets and 3D packaging. Nvidia still commands a huge AI market, but a barely perceptible crack has appeared in its throne. Red, blue, and green: who will have the real say in chip packaging?



