
How is Musk thinking about building the world's most powerful supercomputer cluster with 100,000 H100s?


半導體行業觀察 (Semiconductor Industry Observation) ·  Jul 23 16:15

Elon Musk's AI startup xAI has brought online a supercomputer cluster in Memphis, Tennessee, built from 100,000 liquid-cooled Nvidia H100 GPUs. Musk confirmed the milestone in a post on the social media platform X. The cluster is expected to be used to train the company's large language model, Grok, which is currently offered as a feature for X Premium subscribers. Earlier this month, Musk wrote on X that xAI's Grok 3 would be trained on 100,000 H100 GPUs, so "this should be a very special thing." The H100 is highly sought after by AI model providers, including Musk's rivals at OpenAI. Musk pointed out that the cluster runs on a single RDMA (Remote Direct Memory Access) fabric, which Cisco describes as enabling more efficient, lower-latency data transfer between compute nodes without adding to the central processing unit (CPU) burden.
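To illustrate the RDMA point above, here is a toy Python model of why RDMA reduces CPU overhead: a conventional kernel-mediated transfer makes the CPU copy the payload between user, kernel, and NIC buffers, while with RDMA the NIC moves data directly between the two nodes' registered application memory. The function names and copy counts are a simplified sketch, not a real networking API.

```python
# Toy model contrasting CPU-mediated transfers with RDMA.
# Each bytes() call below stands in for one CPU-driven memory copy.

def kernel_mediated_send(payload: bytes) -> tuple[bytes, int]:
    """Classic socket path: the CPU copies data twice."""
    cpu_copies = 0
    kernel_buf = bytes(payload)   # user buffer -> kernel buffer
    cpu_copies += 1
    nic_buf = bytes(kernel_buf)   # kernel buffer -> NIC buffer
    cpu_copies += 1
    return nic_buf, cpu_copies

def rdma_send(payload: bytes) -> tuple[bytes, int]:
    """RDMA path: the NIC DMA-reads registered user memory directly,
    so the CPU performs zero payload copies."""
    return bytes(payload), 0

data = b"gradient shard"
_, copies_socket = kernel_mediated_send(data)
_, copies_rdma = rdma_send(data)
print(f"socket path CPU copies: {copies_socket}, RDMA CPU copies: {copies_rdma}")
```

At the scale of 100,000 GPUs exchanging gradients every step, eliminating those per-message CPU copies is what makes a single low-latency fabric practical.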




Clearly, xAI's goal is to train its own large models on the supercluster. But more importantly, Musk said in his reply that the company's goal is to train "the world's most powerful artificial intelligence by all indicators" and achieve this goal "before December of this year".

He also tweeted that the Memphis supercluster will provide "significant advantages" for this.

In May, we reported on Musk's ambitious plan to open a supercomputing factory by the fall of 2025. At the time, Musk was eager to get the supercluster underway, which meant buying the current-generation "Hopper" H100 GPUs. This suggests the tech mogul was not patient enough to wait for the H200, let alone the upcoming Blackwell-based B100 and B200 GPUs, even though the newer Blackwell data-center GPUs are expected to ship by the end of 2024.

So, if the supercomputing factory is expected to open in the fall of 2025, does today's news mean the project is a year ahead of schedule? It may well be, but it is more likely that the sources who were interviewed by Reuters and The Information earlier this year were mistaken or misquoted about the project timeline. Additionally, with xAI's Memphis supercluster already up and running, the question of why xAI did not wait for more powerful or next-generation GPUs has been answered.

Supermicro provided most of the hardware, and its CEO commented on Musk's post, praising the team's execution; he had earlier voiced enthusiastic support for Musk's liquid-cooled AI data center.

In subsequent tweets, Musk explained that the new supercluster will "train the world's most powerful artificial intelligence by all indicators." Given his earlier statements of intent, we can assume that xAI's 100,000 H100 GPUs will now be used to train Grok 3. Musk said the improved LLM should complete its training phase "by December of this year."

To put the Memphis supercluster's computing resources in context: in terms of sheer scale, the new xAI machine easily surpasses any supercomputer on the latest Top500 list in GPU horsepower. The world's most powerful supercomputers, such as Frontier (37,888 AMD GPUs), Aurora (60,000 Intel GPUs), and Microsoft Eagle (14,400 Nvidia H100 GPUs), appear to lag far behind the xAI machine.
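The scale gap above can be made concrete with a quick tally. Note this compares raw GPU counts across different GPU generations (AMD accelerators in Frontier, Intel GPUs in Aurora, Nvidia H100s in Eagle), so it is a crude size comparison, not a performance benchmark.

```python
# GPU counts for the systems named in the article.
clusters = {
    "xAI Memphis": 100_000,
    "Frontier": 37_888,
    "Aurora": 60_000,
    "Microsoft Eagle": 14_400,
}

xai = clusters["xAI Memphis"]
# Sort largest-first and show how many times over xAI's count exceeds each.
for name, gpus in sorted(clusters.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {gpus:>7,} GPUs  (xAI count is {xai / gpus:.1f}x this)")
```

Even against Aurora, the largest of the three by GPU count, the Memphis cluster fields over one and a half times as many GPUs.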

As early as June, it was reported that xAI would build its supercomputer cluster in a former Electrolux factory in Memphis covering 785,000 square feet, informally known as the "computing super factory." The Greater Memphis Chamber announced in a press release that xAI's supercomputer project is the largest capital investment by a new-to-market company in Memphis history.

xAI's investment is huge. According to a report by Benzinga, the cost of each Nvidia H100 GPU is estimated to be between $30,000 and $40,000. Considering that xAI uses 100,000 Nvidia H100 units, Elon Musk's AI startup seems to have spent about $3 billion to $4 billion on the project.
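The article's cost estimate is simple arithmetic, sketched below: 100,000 units multiplied by the reported $30,000-$40,000 per-GPU price range. This covers GPUs only; networking, power, cooling, and the facility itself would add to the total.

```python
# Back-of-the-envelope check of the reported GPU spend.
gpus = 100_000
unit_price_low, unit_price_high = 30_000, 40_000  # USD per H100, per Benzinga

total_low = gpus * unit_price_low
total_high = gpus * unit_price_high
print(f"Estimated GPU spend: ${total_low / 1e9:.0f}B - ${total_high / 1e9:.0f}B")
```

This confirms the roughly $3 billion to $4 billion figure cited above.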

It is worth mentioning that Musk's Tesla has deployed about 35,000 Nvidia H100s to train its self-driving software and is also developing supercomputers based on its custom Dojo chip.

Edited by Jeffrey


