The field is getting crowded: Tencent's Hunyuan large model pushes into text-to-video, and getting users to 'use it' is the key

cls.cn · Dec 4 09:19

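① Tencent's Hunyuan large model has officially launched video generation and open-sourced the video generation model, which has 13 billion parameters and is currently the largest open-source video model. ② Tencent believes video generation has not yet reached the stage of large-scale commercial use and that many technical difficulties remain. For Hunyuan text-to-video, the priority at this stage is to open-source the model so that more people "use it", letting the model's flywheel spin up quickly and drive improvements to the model itself.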

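Star Market Daily, December 4 (Reporter Zhang Yangyang): Yesterday, Tencent's Hunyuan large model officially launched video generation, its latest capability after text-to-text, text-to-image, and 3D generation. At the same time, Tencent open-sourced the video generation model, which has 13 billion parameters and is currently the largest open-source video model.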
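To put the 13-billion-parameter figure in perspective, the sketch below estimates how much memory the weights alone would occupy at a few common numeric precisions. The precisions and the helper name `weight_memory_gb` are illustrative assumptions, not details from the article, and the estimate ignores activations, the text encoder, and the VAE.

```python
# Back-of-the-envelope estimate of weight storage for a 13B-parameter model.
# Only the parameter count comes from the article; the precisions are assumptions.

PARAMS = 13_000_000_000  # 13 billion parameters, as reported


def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Hypothetical precisions a developer might load the weights in.
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1}

for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_memory_gb(PARAMS, nbytes):.0f} GB")
```

At 16-bit precision the weights alone come to roughly 26 GB, which suggests why a model of this size is demanding to run on a single consumer GPU.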

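"Users only need to enter a description to generate a video," a person in charge of Tencent Hunyuan said. Video generation currently supports prompts in both Chinese and English, multiple video sizes, and multiple resolutions. The model is now live in the Tencent Yuanbao app, where users can apply for a trial in the "AI Video" section under AI applications. Enterprise users can access the service through Tencent Cloud, and applications for the API's closed beta have opened at the same time.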

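Ever since OpenAI's Sora, built on the DiT (Diffusion Transformer) architecture, raised the quality of long video generation to an unprecedented level, AI companies around the world have rushed in, setting off a video generation boom.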

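As 2024 draws to a close, video generation stands out as the liveliest segment of the large-model field this year. ByteDance's Doubao is running a closed beta of text-to-video, and MiniMax, Kuaishou, SenseTime, and others have launched text-to-video products one after another. Vidu, jointly developed by Tsinghua University and Shengshu Technology, is billed as China's first long-duration, high-consistency, high-dynamics video model.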

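Doing text-to-video well is not easy, however. That much is clear from the fact that OpenAI, having unveiled Sora early this year, has still not officially opened it to the public.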

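The main reason is that a considerable gap remains between what current video generation technology produces and what users expect. These models fall short in understanding and applying physical rules, and they lack effective controllability during generation.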

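According to Tencent, the main strengths of the Hunyuan text-to-video model are hyper-realistic image quality, video that closely follows the prompt, and smooth footage that resists distortion.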

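"For example, when generating scenes with large movements such as surfing or dancing, Tencent Hunyuan can produce very smooth, plausible motion shots in which objects are unlikely to deform. Light, shadow, and reflections largely obey physical laws, and in scenes with mirrors or a person looking into a mirror, the movements inside and outside the mirror stay consistent. The model can also cut between shots automatically while keeping the main subject unchanged, a capability that most models in the industry lack."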

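On the technical side, according to the person in charge of Tencent Hunyuan, the Hunyuan model is based on a DiT architecture similar to Sora's, with several upgrades to the architectural design.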

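The Hunyuan video generation model adopts a new-generation text encoder to improve prompt adherence. Its strong semantic-following ability lets it handle descriptions involving multiple subjects and render more detailed instructions and scenes. It uses a unified full attention mechanism, which makes transitions between frames smoother and enables multi-angle shot changes with a consistent subject. An advanced hybrid image-video VAE (a 3D variational autoencoder) gives the model a marked improvement in detail, particularly in scenes with small faces or fast-moving shots.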
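The article does not spell out what "unified full attention" involves. One common reading is that every spatio-temporal token attends to every other token across all frames, in contrast to factorized designs that run spatial attention within each frame and temporal attention across frames separately. The sketch below only counts query-key pairs under each scheme to show the cost trade-off. The grid size and all names are hypothetical and are not taken from Tencent's model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentGrid:
    """Hypothetical grid of latent tokens a video VAE might produce."""
    frames: int
    height: int
    width: int

    @property
    def tokens_per_frame(self) -> int:
        return self.height * self.width

    @property
    def total_tokens(self) -> int:
        return self.frames * self.tokens_per_frame


def full_attention_pairs(grid: LatentGrid) -> int:
    """Every token attends to every token in every frame."""
    return grid.total_tokens ** 2


def factorized_attention_pairs(grid: LatentGrid) -> int:
    """Spatial attention inside each frame plus temporal attention
    along each spatial position."""
    spatial = grid.frames * grid.tokens_per_frame ** 2
    temporal = grid.tokens_per_frame * grid.frames ** 2
    return spatial + temporal


# Illustrative size only: 8 latent frames of 16 x 16 tokens.
grid = LatentGrid(frames=8, height=16, width=16)
print("full:", full_attention_pairs(grid))
print("factorized:", factorized_attention_pairs(grid))
```

Under this reading, full attention considers far more token pairs than a factorized scheme, so it costs more compute. In exchange, a token in one frame can attend directly to any position in any other frame, which is plausibly what supports the smoother transitions and subject consistency described above.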

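One example prompt reads: a Chinese beauty wearing Hanfu, her hair flowing, with London in the background, and then the camera cuts to a close-up.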

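In video generation, however, domestic players such as Kuaishou, Douyin, Zhipu, and Shengshu Technology have already launched products, and some have even begun commercializing them. Tencent Hunyuan's pace this time is not especially fast.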

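Asked about this in an interview with Star Market Daily, the person in charge of Tencent Hunyuan said that in terms of usability, video generation technology has not yet reached the stage of large-scale commercial use, and many technical difficulties remain. Hunyuan's text-to-video feature is in no rush, the person said. What matters more at this stage is open-sourcing the model so that more people use it, letting the model's flywheel spin up quickly and drive improvements to the model itself.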

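As for practical applications, the same person said that videos generated by the Hunyuan model can be used in industrial-grade commercial scenarios such as advertising, animation production, and creative video generation. Tencent has not yet worked out a detailed plan for future commercialization.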

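The open-sourced video generation model has now been released on Hugging Face and GitHub, including the model weights, inference code, and model algorithms, and it is free for enterprises and individual developers to use and to build ecosystem plugins on. With Tencent Hunyuan's open-source models, developers and enterprises can run inference directly without training from scratch, and they can build their own applications and services on the Hunyuan series.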
