

Generative AI may usher in the next wave: the TTT model

wallstreetcn ·  07:52

The “brain” of the Transformers architecture that underpins models such as Sora is a lookup table, the so-called hidden state. TTT replaces that hidden state with a machine learning model, like AI's nested dolls, a model within a model, and unlike a Transformer's lookup table, this inner model doesn't grow as more data is processed.

The focus of the next generation of generative artificial intelligence (AI) may be test-time training models, known as TTT for short.

The Transformers architecture is the foundation of OpenAI's video model Sora and the core of text-generation models such as Anthropic's Claude, Google's Gemini, and OpenAI's flagship model GPT-4o. But the evolution of these models is beginning to run into technical hurdles, particularly around computation. Transformers aren't especially efficient at processing and analyzing large amounts of data, at least when they run on off-the-shelf hardware. Companies are building and expanding infrastructure to meet Transformers' needs, which has driven a sharp increase in electricity demand, demand that may not even be possible to meet continuously.

This month, researchers from Stanford University, UC San Diego, UC Berkeley, and Meta jointly announced the TTT architecture, which took them a year and a half to develop. Not only can the TTT model process much more data than Transformers, it also doesn't consume nearly as much computing power, the research team said.

Why do outsiders think the TTT model is more promising than Transformers? First, you need to understand that one of the basic components of a Transformer is the “hidden state,” which is essentially a long list of data. As the Transformer processes something, it adds entries to its hidden state so it “remembers” what it has just processed. For example, if the model is working through a book, the hidden state values would be representations of words (or parts of words).
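
To make the idea concrete, here is a toy Python sketch of that behavior. It is not the researchers' code; the names and the use of placeholder strings instead of real vector representations are purely illustrative.

    # Toy sketch: a Transformer-style "hidden state" is essentially a list
    # that gains one entry for every token the model processes.

    hidden_state = []  # the "lookup table" of token representations

    def process_token(token_representation):
        # Append the new token's representation; the table only ever grows.
        hidden_state.append(token_representation)

    # Processing a 1,000-word book leaves 1,000 entries behind, all of which
    # must be consulted later when generating the next word.
    for word_id in range(1000):
        process_token(f"representation_of_word_{word_id}")

    print(len(hidden_state))  # 1000 -> memory grows with the amount of input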

Yu Sun, a postdoctoral fellow at Stanford University who took part in the TTT study, recently explained to the media that if a Transformer is viewed as an intelligent entity, then the lookup table, its hidden state, is the Transformer's brain. This brain implements some of the features Transformers are well known for, such as in-context learning.

The hidden state helps make Transformers powerful, but it also holds them back. For example, if a Transformer has just read a book, then to “say” even one word about that book, the model has to scan its entire lookup table, a computational requirement comparable to rereading the whole book.
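
The cost problem can be shown with a small illustration (again my own sketch, not the researchers' code): the work needed to produce a single word grows with everything stored so far.

    # Producing each new word requires an attention-style pass over every
    # entry in the stored table, so per-word work grows with the context.

    def entries_scanned_per_word(hidden_state):
        scores = [1.0 for _ in hidden_state]  # placeholder relevance scores
        return len(scores)

    for book_length in (1_000, 10_000, 100_000):
        table = list(range(book_length))  # stand-in for stored representations
        scanned = entries_scanned_per_word(table)
        print(f"{book_length:>7} stored entries -> {scanned:>7} scanned per word")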

As a result, Sun and the other TTT researchers thought of replacing the hidden state with a machine learning model, like AI's nested dolls, a model within a model. Unlike a Transformer's lookup table, the TTT model's internal machine learning model doesn't grow as more data is processed. Instead, it encodes the processed data into representative variables called weights, which is why the TTT model performs so efficiently. No matter how much data the TTT model processes, the size of its internal model doesn't change.
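
A minimal sketch of that idea, under my own assumptions, looks like this: the hidden state is itself a small model whose weights are nudged by a training step on each incoming token, so its size stays fixed. The dimensions, the reconstruction loss, and the learning rate below are illustrative choices, not the paper's exact formulation.

    import numpy as np

    rng = np.random.default_rng(0)
    D = 16                # embedding dimension (an illustrative choice)
    W = np.zeros((D, D))  # the "inner model": a fixed-size weight matrix

    def ttt_update(W, x, lr=0.1):
        # One test-time training step: nudge the inner model so it can
        # reconstruct x from a corrupted copy (a stand-in self-supervised
        # objective, 1/2 * ||W @ corrupt(x) - x||^2, chosen for simplicity).
        x_corrupt = 0.5 * x
        pred = W @ x_corrupt
        grad = np.outer(pred - x, x_corrupt)  # gradient of the loss w.r.t. W
        return W - lr * grad

    # Stream 10,000 "tokens" through the layer: W is refined but never grows.
    for _ in range(10_000):
        token = rng.standard_normal(D)
        W = ttt_update(W, token)

    print(W.shape)  # (16, 16) -- constant-size state, however long the stream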

Sun believes future TTT models could efficiently process billions of pieces of data, from words to images and from audio recordings to videos, far beyond the capabilities of existing models. A TTT-based system can say X words about a book without the complicated computation of rereading the book X times. “Large video models based on Transformers, such as Sora, can only process 10 seconds of video because they only have one lookup-table 'brain'. Our ultimate goal is to develop a system that can process long videos resembling the visual experience of a human life.”
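
A back-of-the-envelope comparison makes the claim concrete. The numbers below are made up purely for illustration and are not measurements from the study.

    # Per-word cost after reading a long book: the Transformer-style model
    # rescans its full table for every word, while a fixed-size inner model
    # only touches its own weights.

    book_length = 100_000        # tokens already processed
    words_to_generate = 1_000
    inner_state_size = 1_024     # assumed fixed-size state for the TTT-style model

    transformer_ops = words_to_generate * book_length    # rescan the table each word
    ttt_ops = words_to_generate * inner_state_size       # touch only the fixed state

    print(f"Transformer-style: {transformer_ops:,} entry visits")
    print(f"TTT-style:         {ttt_ops:,} entry visits")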

Will the TTT model eventually replace Transformers? The media believes this is possible, but it is too early to draw conclusions. The TTT model is not yet a drop-in replacement for Transformers, and the researchers developed only two small models for the study, so it is hard for now to compare TTT with the results achieved by some large Transformer models.

Mike Cook, a senior lecturer in the Department of Informatics at King's College London who did not take part in the TTT study, commented that TTT is a very interesting innovation. If the data supports the claim that it improves efficiency, that is good news, but he couldn't say whether TTT is better than existing architectures. Cook said that when he was an undergraduate, an old professor liked to tell a joke: how do you solve any problem in computer science? Add another layer of abstraction. Putting a neural network inside a neural network reminded him of that punchline.


