Hands-on with Kuaishou's large AI text-to-video model: geared more toward the commercial side, has a "domestic Sora" arrived? · Jun 21 22:35

① The video generation model released by Kuaishou performs well, supporting text-to-video, image-to-video, video extension, and other functions. ② Some institutions have pointed out that AIGC fields that are not sensitive to "hallucination" issues, such as text generation, text-to-image, text-to-video, and digital humans, are expected to take the lead in commercialization.

Cailian Press, June 21 (Reporter Tang Zhixiao). Has the Chinese version of Sora arrived?

Recently, Kuaishou launched a large video generation model, Keling (Kling), which supports text-to-video, image-to-video, and video extension functions.

The Cailian Press reporter learned that the Keling model, which understands text and video semantics based on the Diffusion Transformer (DiT) architecture, supports outputting up to 2 minutes of 1080p video at 30 frames per second, a generated length that directly surpasses Sora's.
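For background, a Diffusion Transformer generates video by iteratively denoising a latent tensor, with a Transformer rather than a U-Net predicting the noise at each step. The numpy sketch below is only a toy illustration of that general loop, not Keling's implementation; the `predict_noise` stand-in is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_noise(x, t):
    """Stand-in for the Diffusion Transformer's noise prediction.
    A real model runs transformer blocks over spatio-temporal patches."""
    return 0.1 * x  # hypothetical placeholder, not a trained model

def denoise(shape=(8, 16, 16, 4), steps=10):
    """Iteratively refine pure noise into a video latent.
    shape = (frames, height, width, channels) of the latent."""
    x = rng.standard_normal(shape)   # start from Gaussian noise
    for t in range(steps, 0, -1):
        eps = predict_noise(x, t)    # model's noise estimate at step t
        x = x - eps                  # simplified denoising update
    return x

latent = denoise()
print(latent.shape)  # (8, 16, 16, 4)
```

In a real pipeline the final latent would then be decoded back to pixels, in Keling's case reportedly by a 3D VAE.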

On June 21, Keling launched a video extension function that supports one-click extension of a generated video, multiple times in a row. Each extension adds about 5 seconds, and a video can be extended to a maximum length of about 3 minutes.

Applications for Keling's closed beta are made in the "AI Creation" module of Kuaishou's video-editing app, and the Cailian Press reporter had an in-depth hands-on experience after the application was approved.

Currently, Keling performs quite well in terms of generation speed: each 5-second text-to-video clip can basically be completed within 2 to 3 minutes. According to public information, in addition to Kuaishou and Sora, Luma AI has released Dream Machine, a text-to-video model, and begun its closed beta; Adobe's Firefly has added generative extension, with audio and video generation functions coming soon; and Meitu has built an AI short-film workflow and developed MOKI, an AI short-film creation tool expected to launch on July 31 this year.

Some research institutions have pointed out that, in a context where the accuracy of AI model output cannot be fully guaranteed, AIGC fields that are not sensitive to "hallucination" issues, such as text generation, text-to-image, text-to-video, and digital humans, are expected to take the lead in commercialization.

Text-to-video is fairly natural, while image-to-video still has room for improvement

To show the power of Keling's AI, the Cailian Press reporter tested the model on two dimensions: accurate language understanding and accurate video rendering. Rendering accuracy was in turn judged on two levels: first, the presentation of light and shadow effects; second, object relationships (such as human-human and human-object interaction).

For ease of viewing, the Cailian Press reporter converted the videos into animated images, which affects video quality and frame rate to some degree but still basically shows Keling's video generation capabilities.


First, in terms of text recognition and processing, the Cailian Press reporter tried describing a scene in detail: "A middle-aged woman with white hair, wearing a dark blue suit, shows a bottle of Blue Moon laundry detergent, a white bottle with a green cap, to a Canon camera. The background is a sunset scene over a beach and the ocean."

Keling basically reproduced the described scene as requested, but the camera mentioned in the text did not appear in the video, and the laundry detergent's brand was mosaicked over, probably to avoid copyright disputes.


Next, the reporter tried a shorter description: "A Bichon Frise dances in a nightclub wearing a spacesuit and high heels."

Although the costume on the Bichon Frise still differs from an actual spacesuit, the fidelity this time is quite high.

Next, the Cailian Press reporter described two more scenes to test Keling's light and shadow effects and its fidelity to object relationships.


The description of the video above reads: "In a deep-sea tunnel with complex lighting, a silver-white Maybach with a black hood drives through a pool of standing water at 120 kilometers per hour, splashing water all over the camera."


The description of the video above reads: "On a desolate planet of death, a group of Kamen Riders fight with lightsabers and cut off each other's helmets."


The description of the video above reads: "Two burly men slap each other in the Water Cube."


The description of the video above reads: "A kitten delivers takeout in an abstract style, bringing pizza to people."


The description of the video above reads: "The kitten takes the helmet off its head with its front paws and puts it in the front basket of the electric scooter."


The description text for the video above reads: “A little girl eats noodles.”


The description of the video above reads: "A woman pushes her bicycle backward, and a cherry blossom petal falls on her head."

Currently, the image-to-video function can make the subject of a picture move and perform actions given precise keywords, but it does not handle complex object interactions well.

For example, in the clip where the kitten takes off its helmet with its front paws, the AI did not correctly recognize the cat's front paws in the picture but generated another front paw instead; the helmet was never taken off, and the generated paw ended up attached to the helmet in the front basket.

The little girl eating noodles basically showed a chewing and swallowing effect, and her facial features and the food were clear.

However, in the clip of the woman with the bicycle, the bicycle rolled backward and the woman walked backward, so while the direction of movement was correct, the petal only fell in front of the camera, not on the woman's head.

From the above tests, we can basically draw the following conclusions:

As for light and shadow, Keling generally has no problem presenting the relationships between fluids and subjects, and the accuracy of its text-to-video output is not far from Sora's: for example, the differing changes in metallic reflection between the front windshield and the hood as light sweeps across the roof of the car, or the spray thrown up as the vehicle passes through standing water.

There is still room for improvement in Keling's handling of object relationships. For example, in the lightsaber duel there was visible clipping, with objects passing through each other.

Furthermore, Keling's output basically conforms to real-world rules of motion. In the tests, the driving vehicle, the girl eating, and other scenes basically matched physical reality and the keyword prompts.

As for the problem of unrecognized keywords, one view holds that this arises because current video generation models basically learn physical knowledge directly from video data, but real videos often contain a great deal of information, making it difficult for large models to accurately distinguish and learn every physical law.

A Cailian Press reporter learned that Kuaishou's large-model team developed its own 3D VAE network and 3D attention mechanism to better implement spatio-temporal modeling using multimodal technology.
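The reported mechanism can be pictured as self-attention applied jointly over time and space: every latent patch in every frame attends to every other patch, which is what ties motion across frames to spatial content. The numpy sketch below is a toy illustration of that idea only, with no projections, attention heads, or positional encodings, and is not Kuaishou's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_3d(tokens):
    """Joint spatio-temporal self-attention over a video latent.

    tokens: array of shape (T, H, W, C), one feature vector per patch
    per frame. Time and space are flattened into a single sequence so
    each patch can attend to every patch in every frame.
    """
    T, H, W, C = tokens.shape
    seq = tokens.reshape(T * H * W, C)     # one token per patch
    scores = seq @ seq.T / np.sqrt(C)      # pairwise similarity
    out = softmax(scores, axis=-1) @ seq   # attention-weighted mix
    return out.reshape(T, H, W, C)

video = np.random.rand(4, 2, 2, 8)         # 4 frames of 2x2 patches
out = attention_3d(video)
print(out.shape)  # (4, 2, 2, 8)
```

The alternative, factorized design would run spatial attention within each frame and temporal attention across frames separately; flattening everything into one sequence, as above, is the more expressive but more expensive choice.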

Wan Pengfei, head of Kuaishou's Visual Generation and Interaction Center, has stated publicly: "Kuaishou is a platform with massive video data, which enables full-process, automated, and efficient training and evaluation of the model."

He added that Kuaishou has a multi-dimensional video tagging system that can finely filter data or adjust the data distribution.

What are the commercial possibilities? Deployment scenarios may lean toward the business side

According to the Cailian Press reporter, the number of applicants for Keling's closed beta has now exceeded 140,000, many of them creators.

A video content creator told the Cailian Press reporter that videos generated with AI tools look very cool, but such tools mean little in the hands of ordinary people, and the cost of AI-generated video is not low. "This can be seen from how open ChatGPT and Sora are: ChatGPT can be opened to 100 million users, while only a handful of people have tried Sora so far."

Moreover, some video platforms do not encourage AI-synthesized content. Such videos do not receive much traffic, and some are even throttled. Currently, all major content platforms have relevant restrictions, and AI-generated content is labeled "this work is suspected to be AI-synthesized; please view with discretion".

The content creator added that the real significance of AI video generation is to simplify the video production process. It can help mature content creators generate material free of copyright disputes and accelerate content creation, and it can also help traditional text-and-image creators turn existing content into video, accelerating content migration.

He believes that with accurate descriptions, content creators can save the time spent selecting the right material. Professional content creators already buy video footage or activate the corresponding memberships; the only question now is whether the pricing is reasonable.

In the long run, however, neither Keling nor Sora is likely to make consumer-side applications its main direction of development; the ability to deploy applications in concrete scenarios holds more promise.

Another film and television industry practitioner told the Cailian Press reporter that AIGC tools are already in use in the industry. For example, the footage of the protagonist in "Everything Everywhere All at Once" rapidly traveling through multiple universes can be generated quickly with AI technology, which reduces production costs. "With a traditional production process, producing even a one-minute video would take a huge team months to complete, involving multiple steps such as scripting, modeling, and post-production rendering."

A Cailian Press reporter learned that after receiving closed-beta invitations to Luma's Dream Machine, some filmmakers have used the AI tool to make micro-films and teaser videos. Following AI-generated skit scripts, AI video generation tools are likely to be used to generate the skits themselves, an experiment that would also shorten the production chain of the short-drama track.

Currently, Keling's business-side commercial service has not yet launched, but judging from previous AIGC technology applications, short-video slicing, comment-section interaction, and digital human livestream hosts are all scenarios where content e-commerce can be implemented.

Reportedly, e-commerce platforms including JD, Kuaishou, and Douyin are already using AI models to assist merchants in their operations. For example, JD's free digital human broadcasting service can livestream continuously for 24 hours, and Kuaishou offers functions such as "AI script generation + intelligent highlight slicing + full-modality search large model".

According to data provided by Kuaishou, the application of AIGC technology has begun to improve marketing conversion efficiency, with Pangu video AIGC materials increasing the overall marketing conversion rate by 33%.

Some industry insiders believe that in the future, merchants may also use text-to-video tools to produce short videos showing product functions and usage scenarios. "Compared with actually shooting each product, the time and labor cost of generating videos directly with AI tools may be lower."

As for the impact on the cost side, some conclusions can be drawn from current digital human livestream applications. Wang Sixun, head of investment promotion for Kuaishou's Magnetic Engine and head of its center, shared a set of data: "With distractions filtered out as far as possible, our tests showed that the metrics of a human-hosted livestream room and a digital human livestream room were almost identical. AIGC technology automatically generates livestream and short-video material, making enterprises' livestreaming risk manageable and improving operational efficiency."

Analysts at Tianfeng Securities believe that generative AI will make significant strides in video creation and world models, penetrating downstream application scenarios such as video, 3D, and gaming. In downstream fields such as short video, creative tools, and games, AI-native products such as Keling and Sora are expected to be integrated into workflows, improving the user experience, lowering the barrier to use, further reducing creation costs, and greatly expanding the boundaries of creators' abilities.

The securities analyst added: "Unlike other OpenAI products, the DiT architecture path represented by Sora is not especially difficult to replicate given sufficient computing power, and the speed at which leading domestic Internet companies roll out generative video tools may continue to exceed expectations."

The translation is provided by third-party software.
