Semiconductor industry tracking

Amazon founder invests in Jim Keller's Tenstorrent, targeting Nvidia

Semiconductor Industry Watch · Dec 3 09:58

Source: Semiconductor Industry Watch

Tenstorrent, led by CEO and industry veteran Jim Keller, recently announced that it has closed a $693 million Series D round led by Samsung Securities and AFW Partners. The round values the AI chip startup at roughly $2.6 billion.

In an interview, Tenstorrent CEO and semiconductor pioneer Jim Keller said the company wants to develop a chip that breaks the monopoly $NVIDIA (NVDA.US)$ holds over the AI business, and that it raised the money in a round led by South Korea's AFW Partners and Samsung Securities. Bezos Expeditions joined LG Electronics Inc. and Fidelity in the round, betting on Keller's track record and on the fast-growing opportunity in AI technology.

Notably, Bezos Expeditions is controlled by $Amazon (AMZN.US)$ founder Jeff Bezos. Given how many Nvidia chips AWS buys, the significance of this investment is easy to see.

Beyond the lead investors, many well-known investors took part in the round, including XTX Markets, Corner Capital, MESH, Export Development Canada, Healthcare of Ontario Pension Plan, LG Electronics, Hyundai Motor Group, Fidelity Management & Research, Baillie Gifford, and Bezos Expeditions.

Tenstorrent said the round was oversubscribed owing to strong investor demand. In the interview, Keller reiterated that the company aims to build a chip that breaks Nvidia's monopoly on the AI business.

Who is Tenstorrent?

Jim Keller has been covered extensively in the media, so we will not repeat his story here; his distinguished career is laid out in "The Legendary Journey of Jim Keller's Chip Development." Tenstorrent, for its part, is a company that Keller backs and leads as CEO.

Tenstorrent is headquartered in Santa Clara, California. It develops and sells computing systems designed for AI workloads, all built around the company's Tensix core. Its vision is to break Nvidia's monopoly in the chip market by designing more affordable hardware for AI training and deployment, avoiding expensive components such as the high-bandwidth memory (HBM) that Nvidia uses.

"If you use HBM, you cannot beat Nvidia, because Nvidia buys the most HBM and has a cost advantage," Jim Keller said in an interview with Bloomberg. "But they will never be able to bring the price down, given the way HBM is built into their products and sockets."

Nvidia famously offers developers a full suite of proprietary technology, covering everything from chips to interconnects and even data center layout, with the promise that all the parts work better together because they are co-designed. Competitors such as AMD and Tenstorrent instead aim for greater interoperability with other technology providers, whether through shared industry standards or by opening their designs for others to use.

To attract more potential customers, the company focuses on hardware designs that interoperate with other vendors' products. It uses the open-standard RISC-V processor architecture, aiming to give engineers and developers a more open ecosystem for deploying its processors and systems in their data center and server setups. "In the past, I worked with proprietary technology, and it was really tough," Jim Keller said. "Open source helps you build a bigger platform. It attracts engineers. And yes, it is a passion project."

To that end, Tenstorrent licenses its AI and RISC-V intellectual property to customers who want to own and customize their own specialized chips. RISC-V is an open-standard instruction set architecture (ISA), based on reduced instruction set computing (RISC) principles, for developing custom processors for different applications. That makes it easy to adopt, customize, and optimize for power, performance, and features.

Like RISC-V and its Japanese partner Rapidus, Tenstorrent still has a lot to prove. To date, the young company has signed customer contracts worth nearly $150 million in total, which is dwarfed by Nvidia's data center revenue of tens of billions of dollars per quarter.

The company said it will use the new funding to build out its open-source AI software stack and to hire developers as it expands its global development and design centers. That will let it build systems and clouds on which AI developers can run and test models.

Tenstorrent said its first chips were manufactured by $GlobalFoundries (GFS.US)$, and that its next generation will come from Taiwan Semiconductor Manufacturing Company and Samsung Electronics. The company has also begun designing for leading-edge 2-nanometer manufacturing. $Taiwan Semiconductor (TSM.US)$ and Samsung are due to begin 2nm mass production next year, and Tenstorrent is in talks with both as well as with Japan's Rapidus, which is targeting 2nm production in 2027.

Joshua Leahy, Chief Technology Officer of XTX Markets, said: "We find Tenstorrent's open-source-driven approach refreshing, especially in the proprietary and often secretive field of AI accelerators."

As the company starts using the new funding to scale up, it will face headwinds in a market dominated by Nvidia. Jim Keller nevertheless believes that offering more affordable AI chips that can be tailored to business needs, and releasing a new processor every two years, will keep the company's products commercially viable in the AI chip industry.

In a media interview, Jim Keller summed it up this way:

"Tenstorrent is a design company. We design CPUs, we design AI engines, and we design AI software stacks.

"So whether it is soft IP, hard IP, chiplets, or complete chips, those are all implementations. We are flexible on that front. On the CPU, for example, we will license it multiple times before we tape out our own chiplet. We are talking to six companies that want to do things like custom memory chips or NPU accelerators. I think for our next generation, both CPU and AI, we will build CPU and AI chiplets. But then other people will do other chiplets, and we will put them together in the system."

What does it have to challenge Nvidia with?

The sections above covered Tenstorrent's vision. Next, let's look at the company's products and roadmap.

In March 2023, Tenstorrent's chief CPU architect Wei-Han Lien said in a media interview that because Tenstorrent aims to address a broad range of AI applications, it needs not only different systems-on-chip or systems-in-package but also a variety of CPU microarchitecture implementations and system-level architectures to hit different power and performance targets.

Tenstorrent says its CPU team has developed an out-of-order RISC-V microarchitecture and implemented it in five different ways to meet the needs of different applications.

Tenstorrent now has five RISC-V CPU core IPs, with two-wide, three-wide, four-wide, six-wide, and eight-wide decode, which it can use in its own processors or license to interested parties. For potential customers who need a very basic CPU, the company can offer a small two-wide core, while for those who need higher performance for edge, client PC, and high-performance computing it has the six-wide Alastor and eight-wide Ascalon cores.

Each out-of-order Ascalon core (RV64ACDHFMV) with eight-wide decode has six ALUs, two FPUs, and two 256-bit vector units. Given that modern x86 designs use four-wide (Zen 4) or six-wide (Golden Cove) decoders, this is a very capable core.

Alongside its general-purpose RISC-V cores, Tenstorrent has proprietary Tensix cores tailored for neural network inference and training. Each Tensix core consists of five RISC cores, an array math unit for tensor operations, a SIMD unit for vector operations, 1MB or 2MB of SRAM, and fixed-function hardware for accelerating network packet operations and compression/decompression. Tensix cores support a range of data formats, including BF4, BF8, INT8, FP16, BF16, and even FP64.

As of March 2023, Tenstorrent had two products. The first is Grayskull, a machine learning processor that delivers about 315 INT8 TOPS and plugs into a PCIe Gen4 slot. The second is the networked Wormhole ML processor, which delivers about 350 INT8 TOPS and has a GDDR6 memory subsystem, a PCIe Gen4 x16 interface, and a 400GbE connection to other machines.

Both devices require a host CPU and are available either as add-in boards or built into pre-built Tenstorrent servers. A 4U Nebula server holds 32 Wormhole ML cards and delivers around 12 INT8 POPS at a power draw of 6kW.
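As a rough sanity check on the Nebula figures above, the short Python sketch below multiplies out the quoted per-card numbers. The function and variable names are illustrative only, and the inputs are the article's approximate figures, not measured values. With 32 cards at about 350 TOPS each, the product is 11.2 POPS, a little under the quoted "around 12", which is consistent with one or both figures having been rounded.

```python
# Sanity check of the quoted Nebula server figures (all inputs approximate).
CARDS_PER_SERVER = 32      # Wormhole cards in one 4U Nebula server
INT8_TOPS_PER_CARD = 350   # approximate per-card INT8 throughput
SERVER_POWER_KW = 6        # quoted server power draw


def nebula_totals(cards: int, tops_per_card: float, power_kw: float) -> tuple[float, float]:
    """Return (aggregate INT8 POPS, INT8 TOPS per watt) for one server."""
    total_tops = cards * tops_per_card
    total_pops = total_tops / 1000                  # 1 POPS = 1000 TOPS
    tops_per_watt = total_tops / (power_kw * 1000)  # kW -> W
    return total_pops, tops_per_watt


pops, tops_per_watt = nebula_totals(CARDS_PER_SERVER, INT8_TOPS_PER_CARD, SERVER_POWER_KW)
print(f"{pops:.1f} INT8 POPS, {tops_per_watt:.2f} INT8 TOPS/W")
```

On these inputs the server works out to a little under 2 INT8 TOPS per watt at the system level.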

At Hot Chips this August, Tenstorrent disclosed its Blackhole AI accelerator. Unlike the earlier Grayskull and Wormhole parts, which were deployed as PCIe-based accelerators, Blackhole is designed to run as a standalone AI computer.

The company claims the accelerator can beat an Nvidia A100 on raw compute and scalability. Each Blackhole chip is said to deliver 745 teraFLOPS of FP8 performance (372 teraFLOPS at FP16), with 32GB of GDDR6 memory and an Ethernet-based interconnect capable of 1TBps of total bandwidth across its ten 400Gbps links.
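The link arithmetic can be checked with a few lines of Python. This is only a sketch: the article does not say how the 1TBps total is counted, so the code assumes it sums both directions of all ten links. That assumption, and the helper name, are not from the source.

```python
# Convert the quoted Ethernet link count and speed into aggregate bandwidth.
LINKS = 10            # Ethernet links per Blackhole chip
GBPS_PER_LINK = 400   # gigabits per second, per link, per direction


def aggregate_gbytes_per_s(links: int, gbps_per_link: float, bidirectional: bool) -> float:
    """Aggregate bandwidth in gigabytes per second.

    Assumption (not stated in the article): a "total" figure counts
    traffic in both directions on every link.
    """
    gbytes = links * gbps_per_link / 8   # 8 bits per byte
    return gbytes * 2 if bidirectional else gbytes


one_way = aggregate_gbytes_per_s(LINKS, GBPS_PER_LINK, bidirectional=False)
both_ways = aggregate_gbytes_per_s(LINKS, GBPS_PER_LINK, bidirectional=True)
print(f"one direction: {one_way:.0f} GB/s, both directions: {both_ways / 1000:.1f} TB/s")
```

Under that assumption the ten links give 500 GB/s in each direction, which matches the quoted 1TBps only when both directions are added together.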

Tenstorrent showed its latest chip holding a slight performance edge over the Nvidia A100 GPU, although it trails in both memory capacity and bandwidth. Like the A100, however, Blackhole is designed to be deployed as part of a scale-out system. The AI chip startup plans to connect 32 Blackhole accelerators in a 4x8 mesh, pack them into a single node, and call it the Blackhole Galaxy.

In total, a single Blackhole Galaxy promises 23.8 petaFLOPS of FP8 or 11.9 petaFLOPS of FP16, along with 1TB of memory capable of 16TBps of raw bandwidth. Tenstorrent also says the chip's core-dense architecture (discussed further below) means each of these systems can serve as a compute node, a memory node, or a high-bandwidth 11.2TBps AI switch.
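The Galaxy totals follow directly from the per-chip figures quoted earlier, as the sketch below shows. The class and function names are hypothetical, and the per-chip memory bandwidth is only inferred by dividing the quoted 16TBps across 32 chips; it is not a figure stated in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipSpec:
    """Per-chip figures as quoted in the article."""
    fp8_tflops: float
    fp16_tflops: float
    memory_gb: int


BLACKHOLE = ChipSpec(fp8_tflops=745, fp16_tflops=372, memory_gb=32)
GALAXY_CHIPS = 4 * 8           # 4x8 mesh of accelerators in one node
GALAXY_MEM_BW_TBPS = 16        # quoted aggregate memory bandwidth


def galaxy_totals(chip: ChipSpec, chips: int) -> dict[str, float]:
    """Scale per-chip figures up to a whole node."""
    return {
        "fp8_pflops": chip.fp8_tflops * chips / 1000,
        "fp16_pflops": chip.fp16_tflops * chips / 1000,
        "memory_tb": chip.memory_gb * chips / 1024,   # binary TB: 1024 GB
    }


totals = galaxy_totals(BLACKHOLE, GALAXY_CHIPS)
implied_bw_per_chip_tbps = GALAXY_MEM_BW_TBPS / GALAXY_CHIPS  # inferred, not quoted

print(totals)
print(f"implied memory bandwidth per chip: {implied_bw_per_chip_tbps} TB/s")
```

The products come to 23.84 and 11.904 petaFLOPS, which round to the quoted 23.8 and 11.9, and 32 chips with 32GB each give exactly 1TB.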

Davor Capalija, a senior fellow for AI software and architecture at Tenstorrent, said: "You can use it as a Lego block to build out an entire training cluster."

Notably, Tenstorrent uses on-board Ethernet, which means it avoids the challenge of juggling multiple interconnect technologies for chip-to-chip and node-to-node networking, whereas Nvidia has to use NVLink and InfiniBand/Ethernet. In this respect, Tenstorrent's scale-out strategy closely resembles Intel's Gaudi platform, which also uses Ethernet as its primary interconnect. Given how many Blackhole accelerators Tenstorrent plans to pack into a single box, let alone a training cluster, it will be interesting to see how it handles hardware failures.

Tenstorrent says Blackhole can run as a standalone AI computer mainly thanks to 16 "Big RISC-V" 64-bit, dual-issue, in-order CPU cores arranged in four clusters. Crucially, these cores are powerful enough to act as an on-device host running Linux. They are paired with 752 "Baby RISC-V" cores, which handle memory management, off-chip communication, and data processing.

The actual compute, however, is handled by Tenstorrent's 140 Tensix cores, each made up of five "Baby RISC-V" cores, a pair of routers, a compute complex, and some L1 cache.

The compute complex consists of a tile math engine for accelerating matrix workloads and a vector math engine. The former will support Int8, TF32, BF/FP16, FP8, as well as block floating-point data types ranging from 2 to 8 bits, while the vector engine targets FP32, Int16, and Int32.

According to the company, this configuration lets the chip support a variety of data patterns common in AI and HPC applications, including matrix multiplication, convolution, and sharded data layouts.

In total, Blackhole's Tensix cores account for 700 of the chip's 752 "Baby RISC-V" cores. The remaining cores handle memory management ("D" for DRAM), off-chip communication ("E" for Ethernet), system management ("A"), and PCIe ("P").
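The core counts quoted above can be reconciled with simple arithmetic. The sketch below uses only the numbers in the article; the variable names are illustrative, and the split of the leftover cores among the "D", "E", "A", and "P" roles is not given in the source, so it is left as a single total.

```python
# Reconcile the RISC-V core counts quoted for one Blackhole chip.
TENSIX_CORES = 140            # compute cores
BABY_CORES_PER_TENSIX = 5     # "Baby RISC-V" cores inside each Tensix core
TOTAL_BABY_CORES = 752        # all "Baby RISC-V" cores on the chip
BIG_CORES = 16                # "Big RISC-V" host-capable CPU cores

baby_in_tensix = TENSIX_CORES * BABY_CORES_PER_TENSIX
baby_elsewhere = TOTAL_BABY_CORES - baby_in_tensix   # DRAM, Ethernet, system, PCIe roles
total_riscv = TOTAL_BABY_CORES + BIG_CORES

print(f"baby cores in Tensix: {baby_in_tensix}")
print(f"baby cores elsewhere: {baby_elsewhere}")
print(f"all RISC-V cores:     {total_riscv}")
```

So 140 Tensix cores with five small cores each give the 700 figure, leaving 52 small cores for the supporting roles.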

Alongside the new chip, Tenstorrent also disclosed TT-Metalium, the low-level programming model for its accelerators.

Anyone familiar with Nvidia's CUDA platform knows that software can make or break even the highest-performing hardware. TT-Metalium is somewhat reminiscent of GPU programming models such as CUDA or OpenCL in that it is heterogeneous, but it differs in that it was built from the ground up for "AI and scale-out" compute, Capalija explained.

One difference is that the kernels themselves are plain C++ with APIs. "We didn't see the need for a special kernel language," he explained.

Together with other software libraries such as TT-NN, TT-MLIR, and TT-Forge, Tenstorrent aims to support running any AI model on its accelerators using popular runtimes such as PyTorch, ONNX, JAX, TensorFlow, and vLLM.

Closing thoughts

Many people want to displace Nvidia, but it looks like a hard goal for anyone to achieve. It is well known, for example, that Nvidia's secure position rests not only on its leading hardware but also on its software strength, CUDA included, which is the root of its continued dominance.

Jim Keller, however, once said: "CUDA is not a moat, it's a swamp." He also argues that GPUs are not the whole story when it comes to running AI.

"I want to help customers build their own products, which is a really cool thing. You can own it and control it without paying someone else a 60% or 80% gross margin. So when people tell us Nvidia has already won and ask why Tenstorrent is competing, it's because wherever there is a monopoly with extremely high margins, it creates business opportunities," Jim Keller said.

In the author's view, how Amazon goes on to battle Nvidia will also be an interesting topic to watch.

Editor/Rocky
