
The fastest large model in history blows up the scene! Groq goes viral overnight, its self-developed LPU crushing Nvidia GPUs on speed

wallstreetcn ·  Feb 20 20:39

Source: Wall Street News

As soon as I woke up, the AI world changed again.

Before the shock of Sora had even worn off, another Silicon Valley startup shot to the top of trending searches with the fastest large model in history and a self-developed LPU chip.

Just yesterday, AI chip startup Groq (not to be confused with Musk's Grok) opened its products to free trials. Compared with other AI chatbots, Groq's lightning-fast responses quickly set off discussion across the internet. In netizens' tests, Groq's generation speed approached 500 tokens per second, crushing GPT-4's roughly 40 tokens per second.

Some netizens were shocked and said:

It responds faster than I blink.

However, it should be emphasized that Groq has not developed a new model; it is a model runner. Its homepage serves the open-source models Mixtral 8x7B-32K and Llama 2 70B-4K.
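For readers who want to reproduce the netizens' speed tests, here is a minimal sketch that measures raw generation throughput. It assumes Groq's Python SDK (`pip install groq`) with its OpenAI-style streaming interface, and the model ID `mixtral-8x7b-32768` for the Mixtral 8x7B-32K model mentioned above; both are assumptions based on Groq's public documentation at the time and may have changed.

```python
# Minimal throughput measurement (tokens/second) against a hosted Groq model.
# Assumes the `groq` SDK and the model ID below; adjust to current docs.
import os
import time

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

start = time.perf_counter()
stream = client.chat.completions.create(
    model="mixtral-8x7b-32768",  # assumed ID for Mixtral 8x7B-32K
    messages=[{"role": "user", "content": "Explain LPUs in 300 words."}],
    stream=True,
)

chunks = 0
for chunk in stream:
    # Each streamed chunk carries a small delta of the reply; counting chunks
    # approximates the token count closely enough for a throughput estimate.
    if chunk.choices[0].delta.content:
        chunks += 1
elapsed = time.perf_counter() - start

print(f"~{chunks / elapsed:.0f} tokens/s over {elapsed:.1f}s")
```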

The fastest response speed in the large-model world comes from the hardware driving the models: Groq does not use Nvidia (NVDA) GPUs; instead, it developed its own new AI chip, the LPU (Language Processing Unit).

500 tokens per second: writing a paper faster than the blink of an eye

The most prominent characteristic of the LPU is that it is fast.

According to test results from January 2024, Meta's Llama 2 model running on the Groq LPU delivered far-and-away leading inference performance, 18 times that of top cloud computing vendors.

Image source: GitHub

Wall Street News mentioned in an earlier article that the Groq LPU running Meta's Llama 2 70B can generate as many words as Shakespeare's "Hamlet" within 7 minutes, 75 times faster than an average person's typing speed.

As shown in the picture below, a Twitter user asked a professional marketing question, and Groq wrote a reply thousands of words long within four seconds.

Another netizen tested Gemini, GPT-4, and Groq side by side on the same code-debugging problem.

As a result, Groq's output speed was 10 times faster than Gemini's and 18 times faster than GPT-4's.

Groq's overwhelming speed advantage over other AI models led netizens to exclaim, "The Captain America of the AI inference world is here."

The LPU: a challenger to Nvidia's GPUs?

To reiterate: Groq has not developed a new model; it just runs existing models on a different chip.

According to Groq's official website, the LPU is a chip designed specifically for AI inference. The GPUs that drive mainstream models such as GPT are parallel processors designed for graphics rendering, with hundreds of cores built around a SIMD (single instruction, multiple data) architecture. The LPU takes a different approach: its design lets the chip use each clock cycle more effectively, guarantees consistent latency and throughput, and reduces the need for complex scheduling hardware:

Groq's LPU inference engine is not an ordinary processing unit; it is an end-to-end system designed to deliver the fastest inference for compute-intensive, sequential-processing applications such as LLMs. By eliminating external memory bottlenecks, the LPU inference engine achieves performance several orders of magnitude higher than traditional GPUs.
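The "external memory bottleneck" claim rests on a simple roofline argument: in single-batch decoding, every generated token requires streaming all model weights through the processor once, so memory bandwidth sets a hard ceiling on tokens per second. The sketch below works that arithmetic through with rough public figures; the H100 bandwidth, Groq's per-chip SRAM bandwidth, and the 572-chip deployment size are assumptions taken from public reports, not measurements.

```python
# Roofline arithmetic for single-batch LLM decoding: each new token requires
# reading every model weight once, so memory bandwidth caps throughput.
# All hardware numbers are rough public figures (assumptions).

PARAMS = 70e9                  # Llama 2 70B parameters
BYTES_PER_PARAM = 2            # FP16 weights
model_bytes = PARAMS * BYTES_PER_PARAM   # ~140 GB streamed per token

def decode_ceiling(bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s if weight reads saturate memory bandwidth."""
    return bandwidth_bytes_per_s / model_bytes

hbm_bw = 3.35e12               # ~3.35 TB/s: one H100's HBM3 bandwidth
sram_bw_per_chip = 80e12       # ~80 TB/s: reported on-die SRAM bandwidth per Groq chip
cluster_bw = sram_bw_per_chip * 572   # weights sharded across a 572-chip deployment

print(f"one GPU on HBM:      ~{decode_ceiling(hbm_bw):>9.0f} tokens/s ceiling")
print(f"572-chip LPU (SRAM): ~{decode_ceiling(cluster_bw):>9.0f} tokens/s ceiling")
# These are loose upper bounds (they ignore compute, interconnect, and
# batching), but they show why moving weights from external HBM into on-chip
# SRAM shifts the binding constraint away from memory by orders of magnitude.
```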

Simply put, for users, the most intuitive experience is “fast.”

Readers who have used GPT know how painful it is to wait for a large model to spit out characters one by one; an LPU-driven model can respond essentially in real time.

As shown below, Wall Street News asked Groq about the difference between the LPU and the GPU. Groq generated the answer in under 3 seconds, with none of the noticeable delay of GPT or Gemini. Asking in English makes generation even faster.

Groq's official introduction also notes that its innovative chip architecture can connect multiple Tensor Streaming Processors (TSPs) without the traditional bottlenecks of GPU clusters, making it extremely scalable and simplifying the hardware requirements of large-scale AI models.

Energy efficiency is another highlight of the LPU. By reducing the overhead of managing multiple threads and avoiding core under-utilization, the LPU can deliver more computation per watt.

In interviews, Groq founder and CEO Jonathan Ross never misses a chance to take a dig at Nvidia.

He previously told the media that in large-model inference scenarios, the Groq LPU chip is 10 times faster than Nvidia's GPUs, at one-tenth the price and one-tenth the power consumption.
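Taken at face value, those two multiples compound: energy per token is power divided by throughput, so 10 times the speed at one-tenth the power implies roughly a hundredfold advantage in energy per token. A quick sanity check of that implication, using only the vendor's own claimed numbers (not independent measurements):

```python
# Taking Ross's claims at face value: 10x the tokens/s at 1/10 the power.
# Energy per token = power / throughput, so the implied gap compounds to ~100x.

gpu_speed, gpu_power = 1.0, 1.0    # normalized GPU baseline
lpu_speed, lpu_power = 10.0, 0.1   # Groq's claimed multiples

gpu_energy_per_token = gpu_power / gpu_speed   # 1.00 (normalized)
lpu_energy_per_token = lpu_power / lpu_speed   # 0.01

print(f"implied energy-per-token advantage: "
      f"{gpu_energy_per_token / lpu_energy_per_token:.0f}x")
```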

Real-time inference runs data through a trained AI model to deliver immediate results for AI applications, enabling a smooth end-user experience. As large AI models have developed, demand for real-time inference has surged.

Ross believes that inference costs are becoming an issue for companies using artificial intelligence in their products, because the cost of running models rises rapidly as the number of customers using those products grows. Compared with Nvidia GPUs, Groq LPU clusters will offer higher throughput, lower latency, and lower cost for large-model inference.

He also stressed that, thanks to a different technical path, Groq's chips are in more ample supply than Nvidia's and will not be held hostage by suppliers such as TSMC (TSM) or SK Hynix:

The GroqChip LPU is unique in that it relies on neither Samsung's nor SK Hynix's HBM, nor TSMC's CoWoS packaging technology for attaching external HBM to the chip.

However, some other AI experts said on social media that the actual cost of the Groq chip is not that low.

As AI expert Jia Yangqing analyzed, Groq's all-in cost works out to more than 30 times that of Nvidia's GPUs.

Since each Groq chip has only 230 MB of memory, actually running the model requires 572 chips, for a total cost as high as 11.44 million US dollars.

By comparison, an 8-GPU H100 system delivers comparable performance, but its hardware costs only $300,000, with an annual electricity bill of around $24,000. A three-year total-cost-of-ownership comparison shows the Groq system costing far more to operate than the H100 system.
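The comparison above can be reproduced with back-of-the-envelope arithmetic. In the sketch below, the per-card price is inferred from the reported totals ($11.44 million across 572 cards, about $20,000 per card); the Groq card's power draw and the electricity rate are illustrative assumptions, not reported figures.

```python
# Rough three-year cost-of-ownership comparison, reproducing the figures cited
# above. Per-card price is implied by the reported totals; power draw and
# electricity rate for the Groq cluster are illustrative assumptions.

GROQ_CARDS = 572                          # cards reported to serve the model
GROQ_CARD_PRICE = 11.44e6 / GROQ_CARDS    # ~$20,000 per card (implied)
GROQ_CARD_WATTS = 200                     # assumed draw per card under load
ELEC_PER_KWH = 0.10                       # assumed electricity rate ($/kWh)

H100_SYSTEM_PRICE = 300_000               # 8-GPU H100 server, as cited
H100_ELEC_PER_YEAR = 24_000               # annual electricity bill, as cited

YEARS = 3
HOURS = 24 * 365 * YEARS

groq_total = (GROQ_CARDS * GROQ_CARD_PRICE
              + GROQ_CARDS * GROQ_CARD_WATTS / 1000 * HOURS * ELEC_PER_KWH)
h100_total = H100_SYSTEM_PRICE + H100_ELEC_PER_YEAR * YEARS

print(f"Groq cluster, 3 years: ~${groq_total / 1e6:.2f}M")
print(f"H100 system, 3 years:  ~${h100_total / 1e6:.2f}M")
print(f"ratio: ~{groq_total / h100_total:.0f}x")  # lands near the '30x' cited
```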

More importantly, the LPU is currently used only for inference; to train large models, you still need to buy Nvidia GPUs.

The founder, one of Google's TPU designers, believes Groq can deploy 1 million LPUs within the next 2 years

Before becoming an instant hit on the internet today, Groq had been in low-key development for over 7 years.

According to public information, Groq was founded in 2016 and is headquartered in Mountain View, California, USA. Founder Jonathan Ross is a former senior Google engineer and one of the designers of Google's self-developed AI chip, the TPU. Product Director John Barrus previously served as a product executive at Google and Amazon.

Estelle Hong, a vice president and the only Chinese face among the executives, has been with the company for four years, having previously worked for the US military and Intel.

Just last August, Groq announced a partnership with Samsung: its next-generation chips will be fabricated on a 4 nm process at Samsung's chip plant in Texas, USA, with mass production expected in the second half of 2024.

Looking ahead to the next generation of LPUs, Ross believes GroqChip's energy efficiency will improve 15- to 20-fold, allowing more matrix computation and SRAM memory to be packed into the device within the same power envelope.

In an interview late last year, Ross said that, given the shortage and high cost of GPUs, he believes in Groq's future development potential:

In 12 months, we can deploy 100,000 LPUs, and in 24 months, we can deploy 1 million LPUs.

Editor/jayden
