Tencent's Hunyuan large model has officially launched video generation, and Tencent has open-sourced the 13-billion-parameter video generation model behind it, currently the largest open-source video model. Tencent believes video generation has not yet reached the stage of large-scale commercial use and that many technical challenges remain. At this stage, the priority for Hunyuan text-to-video is to open-source it so that more people can use it, letting the model's flywheel spin up quickly and drive the optimization of the model itself.
Star Market Daily, December 4 (Reporter Zhang Yang): Yesterday, Tencent's Hunyuan large model officially launched video generation, its latest capability following text-to-text, text-to-image, and 3D generation. At the same time, Tencent open-sourced the 13-billion-parameter video generation model, currently the largest open-source video model.
"Users only need to enter a description to generate a video," said the person in charge of Tencent Hunyuan. Video generation currently supports prompts in both Chinese and English, multiple video sizes, and multiple video resolutions. The model is live in the Tencent Yuanbao app, where users can apply for a trial under "AI Video" in the AI applications section. Enterprise users can access the service through Tencent Cloud, where the API has simultaneously opened for internal-testing applications.
Since OpenAI's Sora, built on the DiT (Diffusion Transformer) architecture, raised the quality of long video generation to an unprecedented level, AI companies around the world have rushed in, setting off a wave of video generation products.
As 2024 draws to a close, the liveliest segment of the large-model field this year has undoubtedly been video generation. ByteDance's Doubao has begun internal testing of text-to-video, and companies such as MiniMax, Kuaishou, and SenseTime have successively launched text-to-video products. Vidu, jointly developed by Tsinghua University and Shengshu Technology, claims to be China's first long-duration, highly consistent, highly dynamic video large model.
However, doing text-to-video well is not easy: OpenAI released Sora earlier this year but has still not officially opened it to the public.
This is mainly because there is still a considerable gap between the results produced by current video generation technology and user expectations. These models fall short in understanding and applying physical rules, and lack effective controllability during the generation process.
According to Tencent, the main strengths of the Hunyuan text-to-video large model are hyper-realistic image quality, generated footage that closely follows the prompt, and smooth motion that resists distortion.
For example, in motion-heavy scenes such as surfing and dancing, Tencent Hunyuan can generate smooth, plausible motion shots in which objects are less prone to distortion. Light, shadow, and reflections largely conform to physical laws, and in scenes with mirrors or reflective surfaces, the action inside and outside the mirror stays consistent. The model can also switch shots automatically while keeping the main subject in the frame unchanged, a capability that most models in the industry do not have.
From a technical perspective, according to the person in charge of Tencent Hunyuan, the Hunyuan large model is based on a DiT architecture similar to Sora's and has undergone multiple upgrades in its architectural design.
The Hunyuan video generation model adopts a new generation of text encoder to improve semantic adherence, so that it handles descriptions with multiple subjects better and renders instructions and visuals in finer detail. It uses a unified full attention mechanism, which makes transitions between frames smoother and keeps subjects consistent across changes of camera angle. It also uses an advanced mixed image-and-video VAE (a 3D variational autoencoder), which significantly improves detail, especially in scenes with small faces, high-speed shots, and the like.
For example, a prompt might read: a Chinese beauty wearing Hanfu, hair fluttering, with London in the background, then a cut to a close-up shot.
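The unified full attention mentioned above can be contrasted with a factorized spatial-then-temporal layout by counting query-key pairs. The sketch below is illustrative only: the function name, the toy grid sizes, and the factorized baseline are assumptions chosen for comparison, not details taken from Tencent's released code.

```python
def attention_pairs(frames: int, height: int, width: int) -> dict:
    """Count query-key pairs for two attention layouts over a latent video grid.

    Hypothetical helper for illustration; the grid is frames x height x width tokens.
    """
    spatial = height * width      # tokens in a single frame
    total = frames * spatial      # tokens in the whole clip

    # Full attention: every token in the clip attends to every other token.
    full = total * total

    # Factorized baseline (assumed for comparison): spatial attention inside
    # each frame, then temporal attention across frames at each spatial location.
    factorized = frames * spatial * spatial + spatial * frames * frames

    return {"tokens": total, "full": full, "factorized": factorized}


# Toy example: 8 latent frames on a 4 x 4 grid.
stats = attention_pairs(frames=8, height=4, width=4)
print(stats)
```

On this toy grid, full attention scores 16,384 pairs against 3,072 for the factorized layout. The extra cost buys every token a direct view of every other token in the clip, which is one plausible reason such a design would help consistency across frames and shots.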
However, in the video generation field, domestic companies such as Kuaishou, Douyin, Zhipu Technology, and Shengshu Technology have all launched corresponding products, some of which have already been commercialized. Tencent Hunyuan's pace here has not been especially fast.
In response, the person in charge of Tencent Hunyuan told the Star Market Daily in an interview that, in terms of usability, current video generation technology has not yet reached the stage of large-scale commercial use. Many technical challenges remain, and the Hunyuan large model is in no hurry to roll out features for now. At the current stage, it is more important to open-source the model so that more people can use it, letting the model's flywheel spin up quickly and optimize the model itself.
Regarding practical applications, the same person said that videos generated by the Hunyuan large model can be used in industrial-grade commercial scenarios such as advertising, animation production, and creative video generation. As for future commercialization, Tencent has not yet detailed a specific plan.
Tencent has announced that the video generation large model is now open-sourced on Hugging Face and GitHub. The release covers the complete model, including model weights, inference code, and model algorithms, and is free for enterprises and individual developers to use and to build ecosystem plugins on. With the open-source Tencent Hunyuan model, developers and companies do not need to train from scratch; they can use it directly for inference and build their own applications and services on the Tencent Hunyuan series.