
Tencent's version of 'Sora' joins the text-to-video battlefield

wallstreetcn ·  10:53

Source: Wallstreetcn

Author: Huang Yu

Still in the exploration phase.

At the beginning of the year, the debut of the text-to-video model Sora set off a global race in AI video generation. Nearly 10 months on, Sora has still not been opened to the public, and Tencent Hunyuan, a latecomer, has now stepped onto this battlefield.

On December 3rd, the Hunyuan large model from $TENCENT (00700.HK)$ officially launched video generation capabilities. Consumer users can apply for a trial through the Tencent Yuanbao app, while enterprise customers can access the service via Tencent Cloud; the API is currently open for beta applications.

Putting text-to-video on the table marks another milestone for the Tencent Hunyuan large model, following text-to-text, text-to-image, and 3D generation. Tencent has also open-sourced this video generation model, which, at 13 billion parameters, is currently the largest open-source video generation model.

According to Wallstreetcn, Tencent Hunyuan's video generation has almost no barrier to entry: a user simply types a text description, and the Hunyuan video generation model produces a five-second video.

Compared with Sora's minute-long clips and the 10-second output of some "Sora-like" products, Hunyuan's video length is not particularly impressive.

At the media briefing that day, the head of multimodal generation technology at Tencent Hunyuan said that video duration is not a technical problem but purely a matter of computing power and data: because doubling the duration leads to a quadratic increase in compute, longer clips are not very cost-effective, as the sketch below illustrates.
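
As a rough back-of-the-envelope sketch of that claim (the latent-grid shapes below are illustrative placeholders, not Hunyuan's actual configuration): under full spatio-temporal attention, cost scales with the square of the token count, so doubling the frame count roughly quadruples the attention FLOPs.

```python
def attention_flops(frames: int, height: int, width: int, dim: int = 1024) -> float:
    """Approximate FLOPs of one full-attention layer over a latent token grid."""
    tokens = frames * height * width    # sequence length N
    return 2.0 * tokens * tokens * dim  # ~2*N^2*d for the QK^T and AV matmuls

# Illustrative 5s vs 10s clips at the same latent resolution:
five_sec = attention_flops(frames=32, height=30, width=40)
ten_sec = attention_flops(frames=64, height=30, width=40)
print(f"10s / 5s attention cost: {ten_sec / five_sec:.1f}x")  # -> 4.0x
```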

He added that most people work shot by shot when producing videos, so the first version of the Hunyuan video generation model ships with a 5-second duration to serve the majority of needs first. "If strong demand for continuous long shots emerges in the future, we will upgrade."

Currently, videos generated by Tencent Hunyuan exhibit four main characteristics: realistic image quality, semantic fidelity to the prompt, smooth motion, and native scene transitions.

On the technical side, Tencent Hunyuan's video generation model adopts a DiT architecture similar to Sora's, with several upgrades to the architecture design, including a multimodal large language model as the text encoder, a full-attention DiT guided by its own scaling-law studies, and a self-developed 3D VAE.
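
For orientation, here is a minimal, hypothetical sketch of one full-attention DiT block in PyTorch. It shows the general DiT pattern (self-attention plus an MLP over flattened spatio-temporal latent tokens, with the diffusion timestep injected through adaptive layer norm); it is not Hunyuan's actual implementation, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One full-attention DiT block over flattened spatio-temporal latents.
    The diffusion timestep conditions both sublayers via adaptive layer
    norm (adaLN), the core DiT conditioning mechanism."""

    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # timestep embedding -> per-block scale/shift/gate parameters
        self.ada = nn.Linear(dim, 6 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) latents; t_emb: (batch, dim) timestep embedding
        s1, b1, g1, s2, b2, g2 = self.ada(t_emb).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1                       # modulated norm
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]  # full attention
        h = self.norm2(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)                             # gated MLP
```

In a video model, `x` would hold the 3D VAE's latent tokens flattened into a single sequence, which is what makes the attention "full" across space and time.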

The head of Hunyuan's multimodal generation technology pointed out that Hunyuan is among the first in the industry, or at least among a small number of players, to use a multimodal large language model as the text encoder for a video generation model; most of the industry currently uses T5 or CLIP as the text encoder.
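
A hedged sketch of the general recipe for this kind of conditioning (names and dimensions are placeholders, not Hunyuan's actual pipeline): run the prompt through the frozen language model, take its final-layer hidden states, and project them into the DiT's width as per-token conditioning.

```python
import torch
import torch.nn as nn

d_llm, d_model, prompt_len, batch = 4096, 1024, 77, 2

# Stand-in for frozen_llm(prompt).last_hidden_state in a real pipeline:
llm_hidden = torch.randn(batch, prompt_len, d_llm)

project = nn.Linear(d_llm, d_model)  # bridge LLM width -> DiT width
text_cond = project(llm_hidden)      # (batch, prompt_len, d_model)

# The DiT then attends to `text_cond`, e.g. via cross-attention or by
# concatenating text tokens with video tokens inside full attention.
```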

Tencent Hunyuan made this choice because it sees three main advantages in the direction: stronger understanding of complex text, native image-text alignment capability, and support for system prompts.

He also mentioned that before building GPT, OpenAI spent considerable effort verifying that the scaling law (the observation that training larger models on more data yields predictable gains) holds for language models; whether a scaling law holds for video generation, however, has not been publicly established by academia or industry.

Against this backdrop, the Tencent Hunyuan team verified the scaling laws of image and video generation themselves, and concluded that with a two-stage approach, in which video training builds on an image DiT, the image DiT exhibits scaling-law behavior.
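
As a sketch of what such a verification typically involves (the data points below are invented purely to illustrate the fit, not Hunyuan's measurements): train a family of models of increasing size and check whether loss versus parameter count follows a power law, i.e., a straight line in log-log space.

```python
import numpy as np

# Hypothetical eval losses for models of increasing size (illustrative only).
params = np.array([0.1e9, 0.5e9, 1e9, 5e9, 13e9])  # parameter counts N
loss = np.array([2.80, 2.35, 2.20, 1.90, 1.75])    # made-up eval losses

# A scaling law loss = a * N^(-b) is a line in log-log space:
# log(loss) = log(a) - b * log(N). Fit it with linear regression.
slope, intercept = np.polyfit(np.log(params), np.log(loss), 1)
print(f"fitted exponent b = {-slope:.3f}")  # near-linear fit => power law holds
```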

"So our first version of the Tencent Mixed Reality video generation model is based on this rigorous Scaling Law inference, and we built a 13 billion model," said the Tencent Mixed Reality multimodal generation technology leader.

Meanwhile, Tencent Hunyuan is also exploring the ecosystem around video generation, including image-to-video models, video voiceover and dubbing models, and driving realistic digital humans from 2D photos.

The head of Hunyuan's multimodal generation technology noted that image-to-video models will mature faster than text-to-video in terms of usability, and Hunyuan may release its latest progress in less than a month.

In the two years since ChatGPT set off the AI large model boom, the technology path of large language models has converged, while video generation models remain in the exploration phase.

An Orient Securities analyst pointed out that under OpenAI's technical leadership, language models have essentially converged on the GPT route; in multimodal technology, by contrast, no company holds an absolute lead, and the technology path still has room for exploration.

The head of Hunyuan's multimodal generation technology likewise said that text-to-video as a whole is at a relatively immature stage, with low overall success rates.

As the most challenging area of multimodal generation, video generation demands more computing power, data, and other resources. It is currently less mature than text and image generation, and its productization and commercialization are progressing slowly.

OpenAI has likewise announced that Sora's rollout is delayed due to a shortage of computing power, which is why it remains closed to the public.

Nevertheless, in the rush to seize the market, the video generation field has seen a dense stream of releases since November of last year.

So far, many large model vendors at home and abroad have shipped Sora-like products, including MiniMax, Zhipu, ByteDance, Kuaishou, and Aishi Technology in China, and Runway, Pika, and Luma overseas. Constrained by computing power and technology, however, generated video length is generally within 10 seconds.

To advance commercialization, large model vendors must find more applications for video generation. Tencent's angle this time: the Hunyuan video generation model produces high-quality footage, making it suitable for industrial-grade business scenarios such as advertising, anime production, and creative video generation.

Video AI is the final piece of the multimodal puzzle, and also the area most likely to produce hit applications. How to balance computing power investment against commercialization, however, remains a major challenge that today's "Sora-like" video generation models must solve.

Editor/Rocky
