Source: Wallstreetcn
Author: Huang Yu
Still in the exploration phase.
At the beginning of the year, the debut of the text-to-video model Sora set off a global race in AI video generation; nearly 10 months later, Sora has still not been opened to the public, and Tencent Hunyuan, as a latecomer, has now stepped onto this battlefield.
On December 3rd, $TENCENT (00700.HK)$'s Hunyuan large model officially launched video generation capabilities: consumer users can apply for a trial through the Tencent Yuanbao app, while corporate clients can access the service via Tencent Cloud. The API is currently open for internal-testing applications.
Putting text-to-video on the table is another milestone for the Tencent Hunyuan large model, following text-to-text, text-to-image, and 3D generation. Tencent has also open-sourced the video generation model; at 13 billion parameters, it is currently the largest open-source video model.
According to Wallstreetcn, Tencent Hunyuan's video generation has almost no barrier to entry: users simply input a text description, and the Hunyuan video generation model produces a five-second video.
Compared with the minute-long clips of Sora and the 10-second clips of some "Sora-like" products, Tencent Hunyuan's video length is not especially impressive.
At the day's media briefing, the head of Tencent Hunyuan's multimodal generation technology said that video duration is not a technical problem but purely a matter of computing power and data: doubling the duration roughly quadruples the compute required, so longer clips are not cost-effective.
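The quadratic relationship mentioned above can be illustrated with a back-of-the-envelope calculation (the token counts and dimensions below are hypothetical, not Tencent's actual figures): under full self-attention, compute grows with the square of the number of video tokens, so a clip twice as long costs roughly four times as much to attend over.

```python
# Illustrative sketch: full self-attention cost scales quadratically
# with the number of tokens, so doubling clip duration (and thus the
# token count) roughly quadruples the attention compute.

def attention_flops(num_tokens: int, dim: int = 1024) -> int:
    # QK^T and the attention-weighted V each cost ~num_tokens^2 * dim
    # multiply-adds, hence the factor of 2.
    return 2 * num_tokens * num_tokens * dim

tokens_5s = 10_000           # hypothetical token count for a 5-second clip
tokens_10s = 2 * tokens_5s   # doubling the duration doubles the token count

ratio = attention_flops(tokens_10s) / attention_flops(tokens_5s)
print(ratio)  # -> 4.0: twice the duration, four times the attention compute
```

This is why, as the briefing notes, duration is gated by compute budget rather than by model capability.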
He further pointed out that most people work shot by shot when making videos, so the first version of the Hunyuan video generation model ships with a 5-second duration to meet the needs of the majority. "If there is strong demand for continuous long shots in the future, we will upgrade."
Currently, videos generated by Tencent Hunyuan exhibit four main characteristics: realistic image quality, semantic compliance, smooth motion, and native scene transitions.
In terms of technical direction, Tencent Hunyuan's video generation model adopts a DiT architecture similar to Sora's, with several upgrades to the design: a multimodal large language model serving as the text encoder, a full-attention DiT sized according to the team's own Scaling Law experiments, and a self-developed 3D VAE.
The head of Tencent Hunyuan's multimodal generation technology noted that Hunyuan is among the first in the industry, or at least among a small number of players, to use a multimodal large language model as the text encoder for a video generation model; most of the industry currently chooses T5 or CLIP models as text encoders.
Tencent Hunyuan made this choice because it values three advantages of the approach: stronger understanding of complex text, native image-text alignment, and support for system prompts.
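The components described above fit into a three-stage pipeline: a text encoder embeds the prompt, a DiT iteratively denoises video latents conditioned on those embeddings, and a 3D VAE decodes the latents into pixel frames. The sketch below is a hypothetical stand-in for that flow, with all function names, shapes, and update rules invented for illustration; it is not Tencent's implementation.

```python
# Hypothetical sketch of the text -> DiT -> 3D VAE pipeline described above.
# All components are toy stand-ins with invented shapes and logic.
import numpy as np

rng = np.random.default_rng(0)

def mllm_encode(prompt: str, dim: int = 64) -> np.ndarray:
    # Stand-in for the multimodal-LLM text encoder: one embedding per token.
    tokens = prompt.split()
    return rng.standard_normal((len(tokens), dim))

def dit_denoise(latents: np.ndarray, text_emb: np.ndarray, steps: int = 4) -> np.ndarray:
    # Stand-in for iterative DiT denoising conditioned on the text embeddings
    # (the conditioning is not actually used in this toy update).
    for _ in range(steps):
        latents = latents - 0.1 * latents  # placeholder "denoising" step
    return latents

def vae3d_decode(latents: np.ndarray) -> np.ndarray:
    # Stand-in for the 3D VAE decoder: upscale the latent grid 8x spatially.
    return np.repeat(np.repeat(latents, 8, axis=1), 8, axis=2)

text = mllm_encode("a corgi surfing at sunset")
latents = rng.standard_normal((13, 16, 16, 4))  # (frames, h, w, channels) in latent space
frames = vae3d_decode(dit_denoise(latents, text))
print(frames.shape)  # (13, 128, 128, 4): 13 frames of 128x128 pixels
```

The division of labor is the point: the text encoder carries language understanding, the DiT carries generation, and the 3D VAE keeps the expensive denoising work in a compact latent space.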
He also mentioned that before building GPT, OpenAI spent considerable effort verifying that Scaling Laws (training larger models with more data yields predictably better results) hold for language models, but neither academia nor industry has publicly shown whether Scaling Laws hold in video generation.
Against this background, the Tencent Hunyuan team verified the Scaling Laws of image and video generation themselves, concluding that the image DiT exhibits Scaling Law behavior, and that video training built on top of the image DiT in a second stage does as well.
"So our first version of the Tencent Mixed Reality video generation model is based on this rigorous Scaling Law inference, and we built a 13 billion model," said the Tencent Mixed Reality multimodal generation technology leader.
Meanwhile, Tencent Hunyuan is also exploring the broader ecosystem around video generation, including image-to-video models, video dubbing models, and driving 2D photos as realistic digital humans.
The head of Tencent Hunyuan's multimodal generation technology pointed out that, compared with text-to-video, image-to-video models will progress faster toward usability, and Hunyuan may release its latest results in less than a month.
Since the AI large-model boom sparked by ChatGPT two years ago, the technology path of large language models has converged, while video generation models remain in the exploration phase.
An Orient Securities analyst pointed out that, under OpenAI's technological leadership, the technology path for language models has basically converged on the GPT route; in multimodal technology, by contrast, no company holds an absolute lead, and the technology path still has room for exploration.
The head of Tencent Hunyuan's multimodal generation technology also acknowledged that text-to-video overall is at a relatively immature stage, with low success rates.
As the most challenging area in multimodal generation, video generation demands more computing power, data, and other resources. It is currently less mature than text and image generation, and its commercialization and productization are progressing slowly.
OpenAI has likewise announced that Sora's release is delayed due to a shortage of computing power, which is why it has not yet been opened to the public.
Nevertheless, in the race to seize the market, releases in video generation have come thick and fast since November last year.
So far, many large-model vendors at home and abroad have shipped Sora-like products, including MiniMax, Zhipu, ByteDance, Kuaishou, and Aishike Technology domestically, and Runway, Pika, and Luma overseas. However, constrained by computing power and technology, generated video lengths are generally within 10 seconds.
To advance commercialization, large-model vendors must find more applications for video generation. Tencent's pitch this time is that the Hunyuan video generation model's image quality is high enough for industrial-grade business scenarios such as advertising, animation production, and creative video generation.
Video AI is the final piece of the multimodal puzzle and the area most likely to produce breakout applications. But how to balance computing-power investment against commercialization remains a major challenge that today's "Sora-like" video generation models must solve.
Editor/Rocky