Source: Xin Zhiyuan
As of now, the combined computing power of the world's top five technology players in AI, $Microsoft (MSFT.US)$, $Meta Platforms (META.US)$, $Alphabet-C (GOOG.US)$, $Amazon (AMZN.US)$, and xAI, totals approximately 3.55 million H100-equivalent GPUs. $NVIDIA (NVDA.US)$ is expected to sell 7 million GPUs in 2025, almost all of them from the latest Hopper and Blackwell series.
In the chip battle of AI giants, Google and Microsoft currently occupy the top two positions. As a newcomer, xAI is rapidly rising. In this competition, who will emerge as the ultimate winner?
This year, Musk created a global sensation with the world's largest AI supercomputer, Colossus.
This supercomputer is equipped with 0.1 million NVIDIA H100/H200 graphics cards and is expected to expand to 0.2 million in the near future.
Since then, the other AI giants have been feeling the pressure, fueling a fierce data-center race in which each is drawing up its own build-out plans.
Recently, a post on LessWrong used public data to estimate NVIDIA's chip production and the number of GPUs/TPUs held by the major AI players, and to look ahead at where the chip race is going.
Here is the computing power owned by the world's top five technology companies at the end of 2024, along with forecasts for 2025:
Microsoft has 0.75 million-0.9 million equivalent H100 chips, expected to reach 2.5 million-3.1 million next year.
Google has 1 million-1.5 million equivalent H100 chips, expected to reach 3.5 million-4.2 million next year.
Meta has 0.55 million-0.65 million equivalent H100 chips, expected to reach 1.9 million-2.5 million next year.
Amazon has 0.25 million-0.4 million equivalent H100 chips, expected to reach 1.3 million-1.6 million next year.
xAI has 0.1 million equivalent H100 chips, expected to reach 0.55 million-1 million next year.
Summary of chip quantity estimates
Clearly, all of them are aggressively building out compute in order to train the next generation of more capable models.
Google Gemini 2.0 is expected to be officially launched this month. Before this, Musk also revealed that Grok 3 will debut by the end of the year, but the specific timing is still unknown.
He stated that after training on legal issue data sets, the next generation Grok 3 will be a powerful private lawyer, able to provide services around the clock.
To catch up with these formidable rivals, OpenAI is reportedly already training its o2 model.
None of this training can happen without GPUs and TPUs.
NVIDIA firmly dominates the GPU market, with 2025 sales expected to peak at 7 million units.
NVIDIA has unquestionably grown into the largest producer of data-center GPUs.
Based on the Q3 FY2025 results NVIDIA reported on November 21, data-center revenue for calendar 2024 is on track to reach $110 billion, more than double 2023's $42 billion, and is expected to exceed $173 billion in 2025.
GPUs are the main source of that revenue.
It is estimated that NVIDIA's sales in 2025 will be between 6.5 million and 7 million units of GPUs, almost all of which are from the latest Hopper and Blackwell series.
According to production ratios and yield expectations, this includes approximately 2 million units of Hopper and 5 million units of Blackwell.
This year's production: up to 5 million H100-class units.
So what was NVIDIA's actual production in 2024? Information on this is limited, and the figures that do exist don't always agree.
Estimates suggest that around 1.5 million Hopper GPUs will be produced in the fourth quarter of 2024 alone. However, that figure includes some lower-performance H20 chips, so it should be treated as an upper bound.
Extrapolating across quarters from the split of data-center revenue, full-year production could be as high as 5 million units. That figure assumes revenue of roughly $0.02 million per H100-equivalent chip, which looks low; using a more plausible $0.025 million per chip puts actual production closer to 4 million units.
This conflicts with estimates from earlier in the year of 1.5 million to 2 million H100s for 2024, and it is unclear whether the gap reflects the distinction between H100 and H200, expanded capacity, or other factors.
Because those earlier estimates sit poorly with the revenue data, the higher figure is used as the reference here.
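As a rough illustration of the arithmetic above, the sketch below simply divides revenue by an assumed price per chip; it is a back-of-the-envelope check, not the LessWrong post's exact method, and the per-chip figures are the assumptions stated in the text.

```python
# Back-of-the-envelope: deriving 2024 Hopper-class production from revenue.
# Assumptions from the text: ~$110B of calendar-2024 data-center revenue,
# and revenue of $0.02M (low) or $0.025M (more plausible) per H100-equivalent.
# Real accounting would also strip out networking and non-GPU revenue,
# so these are upper bounds.
dc_revenue_2024 = 110e9  # USD

for price_per_chip in (20_000, 25_000):
    units = dc_revenue_2024 / price_per_chip
    print(f"${price_per_chip:,} per chip -> ~{units / 1e6:.1f} million units")

# Roughly 5.5M units at $20k and 4.4M at $25k, in the same ballpark as the
# "up to 5 million" and "around 4 million" figures quoted above, which sit
# slightly below this naive calculation.
```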
Previous production volume
For assessing who holds the most compute now and going forward, data from before 2023 has little impact on the overall picture.
That is mainly because GPU performance itself has improved so much, and Nvidia's sales data show production volumes rising sharply.
Estimates suggest that Microsoft and Meta each obtained approximately 0.15 million H100 GPUs in 2023. Considering Nvidia's data center revenue, the total production volume of H100 and similar products in 2023 is likely to be around 1 million units.
H100-equivalent estimates for the five tech giants
By the end of 2024, how many equivalent H100 units will Microsoft, Meta, Google, Amazon, and xAI own? How many GPU/TPU units will they expand to by 2025?
From Nvidia's quarterly reports (10-Q) and annual reports (10-K), it is evident that Nvidia's customers are categorized into 'direct customers' and 'indirect customers'.
Of that, 46% of revenue comes from direct customers, system integrators such as SMC, HPE, and Dell.
These integrators buy GPUs, assemble them into servers, and supply those servers to indirect customers.
The range of indirect customers is very broad, including public cloud providers, consumer internet companies, enterprises, public-sector institutions, and startups.
Put simply, Microsoft, Meta, Google, Amazon, and xAI are all 'indirect customers' (Nvidia's disclosures about them are looser, so the related figures are less reliable).
In its fiscal 2024 annual report, Nvidia disclosed that a single indirect customer, buying through system integrators and distributors, accounted for approximately 19% of total revenue.
Under securities reporting rules, Nvidia must disclose any customer that accounts for more than 10% of revenue. So what does this tell us?
Since no second indirect customer is named, either the second-largest is at most about half the size of the largest, or there is a meaningful margin of error in these figures.
Among these, who might be the largest customer?
From the existing information, the most likely candidate is Microsoft.
Microsoft and Meta
Microsoft has very likely been Nvidia's biggest customer over the past two years, a judgment based on several factors:
First, Microsoft runs one of the world's largest public cloud platforms; second, it is OpenAI's main compute supplier; third, unlike Google and Amazon, it has not deployed its own custom chips at scale; and finally, Microsoft appears to have a special partnership with Nvidia: it was the first company to receive Blackwell GPUs.
In October this year, Microsoft Azure began testing server racks with 32 GB200s.
Microsoft's revenue-share data for 2024 are less precise than for 2023: Nvidia's second-quarter 10-Q put Microsoft at 13% of first-half revenue, while the third-quarter filing says only 'over 10%'.
This suggests Microsoft's share of Nvidia's sales has declined relative to 2023.
According to Bloomberg's statistics, Microsoft accounts for 15% of Nvidia's revenue, followed by Meta at 13%, Amazon at 6%, and Google at approximately 6% (however, the data does not specifically indicate which years these figures correspond to).
According to Omdia research from last year, by the end of 2023 Meta and Microsoft each had 0.15 million H100s, while Amazon, Google, and $Oracle (ORCL.US)$ had 0.05 million each. This is broadly consistent with the Bloomberg figures.
Meta, however, has previously said that by the end of 2024 it will have compute equivalent to 0.6 million H100s. That reportedly includes 0.35 million actual H100s, with the remainder likely H200s plus a small number of Blackwell chips delivered in the final quarter.
Taking that 0.6 million figure at face value and combining it with the revenue-share data makes it possible to estimate Microsoft's available compute more precisely.
Microsoft is expected to have 25% to 50% more equivalent H100 computing power than Meta, which means around 0.75 million to 0.9 million H100s.
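In numbers, the step from Meta's figure to the Microsoft range is straightforward; the sketch below just applies the 25% to 50% premium stated above to Meta's self-reported total.

```python
# Sketch: Microsoft's H100-equivalent estimate, scaled up from Meta's
# self-reported ~0.6M H100 equivalents by the assumed 25-50% premium.
meta_h100_equiv = 600_000

low = meta_h100_equiv * 1.25   # Microsoft at +25% vs Meta
high = meta_h100_equiv * 1.50  # Microsoft at +50% vs Meta
print(f"Microsoft: ~{low / 1e6:.2f}M to {high / 1e6:.2f}M H100 equivalents")
# -> roughly 0.75M to 0.90M, the range quoted above.
```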
Google and Amazon
Just from the revenue contribution from NVIDIA, Amazon and Google are undoubtedly behind Microsoft and Meta. However, the situations of these two companies have significant differences.
Google already has a large number of in-house developed customized TPUs, which are the main computing chips for internal workloads.
In December last year, Google launched the next generation of its most powerful AI accelerator to date, TPU v5p.
Semianalysis pointed out in a report at the end of 2023 that Google is the only company with outstanding in-house chips.
Google's ability to run large-scale AI deployments cheaply, fast, and reliably is almost unmatched, making it arguably the most compute-rich company in the world.
Moreover, Google's infrastructure investment is only going to grow. Its third-quarter 2024 results put AI spending at roughly 13 billion US dollars, with 'most' of it going to technical infrastructure, of which about 60% is servers (GPUs/TPUs).
'Most' likely means 7 to 11 billion US dollars, implying roughly 4.5 to 7 billion US dollars of spending on TPU/GPU servers.
Assuming a roughly 2:1 split between TPU and GPU spending, and conservatively assuming that TPUs deliver the same performance per dollar as Microsoft's GPU spending, Google is estimated to hold the equivalent of 1 million to 1.5 million H100s by the end of 2024.
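The chain of assumptions can be laid out explicitly. The sketch below reproduces only the spending steps described above; the 55% to 85% reading of 'most' is an illustrative assumption, and the final jump to 1 to 1.5 million H100 equivalents rests on the comparison against Microsoft's GPU spending rather than on a per-chip price.

```python
# Sketch: Google's Q3 2024 spending -> technical infrastructure -> server spend.
capex_q3 = 13e9  # ~$13B of quarterly spending, per the text above

# "Most" of it goes to technical infrastructure; read here as 55-85%
# (an illustrative assumption consistent with the $7B-$11B range above).
infra_low, infra_high = capex_q3 * 0.55, capex_q3 * 0.85

# Roughly 60% of technical infrastructure spending is servers (GPU/TPU).
server_low, server_high = infra_low * 0.6, infra_high * 0.6
print(f"TPU/GPU server spend: ~${server_low / 1e9:.1f}B to ${server_high / 1e9:.1f}B")

# With TPU:GPU spending assumed at 2:1, about two-thirds of that goes to TPUs;
# scaling this spending against Microsoft's (at assumed parity in performance
# per dollar) yields the ~1M-1.5M H100-equivalent estimate quoted above.
tpu_share = 2 / 3
print(f"Assumed TPU share of server spend: ~{tpu_share:.0%}")
```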
In contrast, Amazon's internal AI workload is likely to be much smaller.
Amazon holds a significant number of Nvidia chips mainly to serve external GPU demand through its cloud platform, and in particular to meet Anthropic's compute needs.
After all, just as Microsoft does for OpenAI, Amazon is a major backer responsible for supplying ample compute to OpenAI's main rival, Anthropic.
On the other hand, while Amazon also has self-developed Trainium and Inferentia chips, their start in this area is much later than Google's TPU.
These chips appear to lag well behind the industry's state of the art: Amazon has even offered up to 0.11 billion US dollars in free credits to get users to try them, suggesting market uptake so far has been underwhelming.
By the middle of this year, however, Amazon's custom chips seemed to be turning a corner.
On the third-quarter 2024 earnings call, CEO Andy Jassy highlighted the Trainium2 chips, saying they had attracted significant market interest and that Amazon had gone back to its manufacturing partners several times to substantially raise the original production plan.
According to the Semianalysis report, 'Based on the data we know, Microsoft and Google's investment plans in AI infrastructure in 2024 are far ahead of Amazon's deployed computing power.'
How these chips compare with the H100 is unclear, and specific Trainium/Trainium2 volumes are hard to pin down; all that is known is that about 0.04 million units were made available under the free-credit program mentioned above.
xAI
The most iconic AI infrastructure event this year was xAI building the world's largest supercomputer in just 122 days, comprising 0.1 million H100s.
Moreover, this scale is still expanding. Musk has announced future plans to expand to a supercomputer composed of 0.2 million units of H100 and H200.
The xAI supercomputer is currently said to be facing some issues with on-site power supply.
Blackwell chip forecasts for 2025
The latest State of AI Report 2024 estimates Blackwell purchase volumes.
Large cloud computing companies are actively purchasing GB200 systems: Microsoft between 0.7 million and 1.4 million units, Google 0.4 million units, AWS 0.36 million units. It is rumored that OpenAI alone has at least 0.4 million GB200 units.
If Microsoft's GB200 estimate is pegged at 1 million units, then Google's and AWS's figures are roughly consistent with their purchases from NVIDIA relative to Microsoft's.
It also implies Microsoft would account for about 12% of Nvidia's total revenue, consistent with its revenue share slipping slightly in 2024.
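A quick consistency check makes the claim concrete. The sketch below compares each company's GB200 order against Microsoft's and against the Bloomberg revenue shares cited earlier; it ignores pricing differences and delivery timing, so it is only a rough sanity check.

```python
# Rough check: do the GB200 order estimates line up with each company's share
# of Nvidia revenue relative to Microsoft's (Bloomberg figures cited above)?
gb200_orders = {"Microsoft": 1_000_000, "Google": 400_000, "AWS": 360_000}
revenue_share = {"Microsoft": 0.15, "Google": 0.06, "AWS": 0.06}

for name in ("Google", "AWS"):
    order_ratio = gb200_orders[name] / gb200_orders["Microsoft"]
    revenue_ratio = revenue_share[name] / revenue_share["Microsoft"]
    print(f"{name}: orders at {order_ratio:.0%} of Microsoft's, "
          f"revenue share at {revenue_ratio:.0%} of Microsoft's")

# Google: 40% vs 40%; AWS: 36% vs 40% -- roughly consistent, as claimed above.
```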
Although the report does not provide specific estimates for Meta, Meta expects significant acceleration in infrastructure spending related to artificial intelligence next year, indicating that it will continue to maintain a high share of spending on Nvidia.
The LessWrong post expects Meta's spending in 2025 to stay at roughly 80% of Microsoft's.
Although xAI is not mentioned, Musk claims that a computing cluster with 0.3 million Blackwell chips will be deployed in the summer of 2025.
Given Musk's habit of exaggeration, a more reasonable estimate is that xAI will actually hold 0.2 million to 0.4 million such chips by the end of 2025.
So, how many H100 chips does one B200 equal? This question is crucial for evaluating computational power growth.
For training, performance is expected to jump sharply (as of November 2024): Nvidia's launch figures state that a GB200, which pairs two B200s, delivers 7 times the performance of an H100 and trains 4 times faster.
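To illustrate how such a factor converts order counts into H100 equivalents, the sketch below applies the 4x training figure as a naive multiplier; this is a hypothetical conversion, not the LessWrong post's actual weighting, which also has to account for delivery timing and utilization through 2025.

```python
# Hypothetical conversion: GB200 orders -> H100 equivalents, using Nvidia's
# quoted ~4x training speedup per GB200 (two B200s) versus a single H100.
TRAIN_SPEEDUP_VS_H100 = 4  # assumption taken from the launch figure above

def h100_equivalents(gb200_units: int) -> float:
    """Very rough H100-equivalent count for a batch of GB200s."""
    return gb200_units * TRAIN_SPEEDUP_VS_H100

# Example: the rumored 0.4 million GB200s attributed to OpenAI above.
print(f"~{h100_equivalents(400_000) / 1e6:.1f}M H100 equivalents")  # ~1.6M

# Effective year-end capacity will be lower than this naive figure, since it
# depends on when chips arrive and how efficiently clusters are utilized.
```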
For the 2025 projections, Nvidia chips are assumed to account for one-third of Google's marginal compute, and about 75% of Amazon's.
It is worth noting that a large number of H100 and GB200 chips have not been included in the above statistics.
Some buyers fall below Nvidia's revenue-disclosure threshold, while cloud providers such as Oracle and other small and mid-sized clouds may hold considerable numbers of chips.
The same goes for some of Nvidia's important non-U.S. customers.
After fully understanding how much GPU/TPU computing power each company has, the next question is, where will this computing power be used?
How much computing power have the tech giants used to train models?
The above discussions are speculations about the total computing power of various AI giants, but many may be more concerned about how much computing resources are used for training the latest cutting-edge models.
The following will discuss the situations of OpenAI, Google, Anthropic, Meta, and xAI.
However, since these companies are either non-public companies or are so large that they do not need to disclose specific cost details (such as Google, where AI training costs are currently only a small part of its enormous business), the following analysis is somewhat speculative.
OpenAI and Anthropic
In 2024, OpenAI's training cost is expected to reach 3 billion US dollars, while the inference cost is 4 billion US dollars.
Microsoft is reported to have supplied OpenAI with 0.4 million GB200s to support its training. That exceeds AWS's entire GB200 capacity and gives OpenAI training capabilities far beyond Anthropic's.
On the other hand, Anthropic is expected to lose approximately 2 billion US dollars in 2024, with revenue only in the hundreds of millions of dollars.
Since Anthropic's revenue comes mainly from API services that should carry a positive gross margin, its inference costs should be relatively low, which implies that most of that 2 billion US dollars goes to model training.
A conservative estimate puts its training spend at 1.5 billion US dollars, roughly half of OpenAI's, yet that has not kept it from competing at the frontier.
The gap is also understandable: Anthropic's main cloud provider is AWS, whose resources are generally thinner than those of Microsoft, which supplies OpenAI's compute, and that may limit what Anthropic can do.
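Put as arithmetic, the comparison is simple; the sketch below restates the reasoning above, and the split of Anthropic's loss between training and everything else is the stated assumption.

```python
# Sketch: bounding Anthropic's 2024 training spend from its expected loss.
anthropic_loss = 2.0e9   # ~$2B expected 2024 loss (from the text above)
training_spend = 1.5e9   # conservative training estimate from the text

other_costs = anthropic_loss - training_spend  # ~$0.5B for inference, payroll, etc.
openai_training = 3.0e9  # OpenAI's estimated 2024 training spend

print(f"Non-training share of the loss: ~${other_costs / 1e9:.1f}B")
print(f"Anthropic vs OpenAI training spend: {training_spend / openai_training:.0%}")
# -> roughly half of OpenAI's, as stated above.
```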
Google and Meta
Google's Gemini Ultra 1.0 used about 2.5 times the training compute of GPT-4, but was released 9 months later; its compute was also about 25% higher than that of Meta's latest Llama model.
Although Google may have more computing power than other companies, as a cloud service giant, it faces a more diverse workload. Unlike Anthropic or OpenAI, which focus on model training, Google and Meta both need to support a large number of other internal workloads, such as recommendation algorithms for social media products.
Llama 3 uses fewer computing resources than Gemini and was released 8 months later, indicating that Meta allocates fewer resources to cutting-edge models compared to OpenAI and Google.
xAI
It is reported that xAI used 0.02 million H100 chips to train Grok 2, and plans to use 0.1 million H100 chips to train Grok 3.
For reference, GPT-4 reportedly used 0.025 million A100 chips for 90-100 days of training.
Given that an H100 delivers roughly 2.25 times the performance of an A100, Grok 2's training compute works out to about twice that of GPT-4, while Grok 3 is expected to reach roughly five times GPT-4's, putting it at the frontier of compute usage.
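The multiples follow from simple chip-count arithmetic; the sketch below ignores training duration and utilization, which is presumably why the Grok 3 figure above is quoted as roughly five times rather than the naive chip-count ratio.

```python
# Sketch: training-compute multiples vs GPT-4 from chip counts alone.
H100_PER_A100 = 2.25  # assumed H100:A100 performance ratio (from the text)
gpt4_a100 = 25_000    # A100s reportedly used to train GPT-4

grok2_h100 = 20_000
grok3_h100 = 100_000

grok2_multiple = grok2_h100 * H100_PER_A100 / gpt4_a100  # ~1.8x
grok3_multiple = grok3_h100 * H100_PER_A100 / gpt4_a100  # ~9x by chips alone
print(f"Grok 2 ~{grok2_multiple:.1f}x GPT-4, Grok 3 ~{grok3_multiple:.1f}x (chip count only)")

# Grok 2 lands near the ~2x quoted above; the ~5x quoted for Grok 3 presumably
# also factors in training time and utilization, not just raw chip counts.
```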
In addition, xAI does not rely entirely on its own chips; some capacity is leased, with an estimated 0.016 million H100s rented from the Oracle cloud platform.
If xAI devotes a share of its compute to training similar to OpenAI's or Anthropic's, its training scale is likely comparable to Anthropic's but below that of OpenAI and Google.
Editor/rice