
China Merchants Securities: Text-to-video model Sora exceeds expectations, driving demand for computing power network construction

Zhitong Finance ·  Feb 20 13:57

The Zhitong Finance App learned that China Merchants Securities released a research report saying that Sora opens up AIGC application space in the vision domain, and that the continued shortage of computing power network supply is driving demand for hardware infrastructure construction. The bank estimates that training the Sora model requires about 70,900 H100 GPUs running for one month. On the inference side, according to relevant research estimates, generating one image consumes compute roughly equivalent to generating 256 words of text; from this, the bank estimates that generating a one-minute short video consumes about a thousand times more computing power than generating a text conversation. The bank believes that computing power will remain in short supply in the short to medium term and will not fully meet inference-side demand.

Event: On February 16, OpenAI launched the text-to-video model Sora, which can create realistic and imaginative scenes from text instructions and generate high-definition videos up to one minute long of complex scenes with multiple characters, specific types of motion, and accurate details of subjects and backgrounds. Sora's better-than-expected performance demonstrates the effectiveness of the Transformer architecture in the vision domain, laying the foundation for accelerated iteration of visual models.

The views of China Merchants Securities are as follows:

The Sora model's showcased results are stunning, establishing a milestone for visual models.

Unlike previous visual models, OpenAI's Sora is a general-purpose model for visual data. By giving the model many frames to predict at once, it solves the challenging problem of keeping a subject consistent even when it is temporarily out of view. It can generate videos and images of varying duration, aspect ratio, and resolution, and can output up to a minute of high-definition video. Sora's core strengths are consistency, flexibility, and stability. Sora can flexibly generate images at various resolutions and in various formats, generate videos from images, and extend existing video content into new videos. Compared with other models, Sora maintains subject consistency across frames even in videos up to one minute long, which previous visual models struggled to do. Sora also shows an ability to follow the laws of physics: without explicit constraints, its generated footage largely obeys physical rules, making it more realistic.

Sora is the "GPT-3 moment" for visual models; model iteration has entered a period of acceleration.

Before Sora, although large language models gradually became the main research direction following GPT's success, diffusion models still dominated visual generation: widely used visual models such as DALL·E and Stable Diffusion all use diffusion. The large-language-model route that Google proposed in 2023 did not perform well in the video field, but the problem lay mainly in how videos were represented for the model rather than in the model itself, which still demonstrated the viability of the large-language-model route for text-to-video. Sora's breakthrough is that it is built on the DiT (Diffusion Transformer) architecture, combining the advantages of large language models and diffusion models. Diffusion models can also be scaled up, proving that the brute-force scaling behind GPT-4 can produce the same "emergent" effects in the vision domain. Sora marks the success of the diffusion-plus-language-model fusion route and has great iteration potential ahead, comparable to GPT-3 as a landmark. Continued iteration along this path is expected to produce even more realistic visual models within the next one to two years.
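As a rough illustration of the DiT idea described above (a sketch only; the patch sizes, video shape, and function name here are illustrative assumptions, not Sora's actual configuration), a video can be cut into "spacetime patches" that a transformer then treats as tokens, much as a language model treats words:

```python
import numpy as np

def spacetime_patchify(video, pt=2, ph=16, pw=16):
    """Cut a video array (T, H, W, C) into flattened spacetime patches.

    Each patch spans pt frames x ph x pw pixels; a diffusion transformer
    would denoise these tokens jointly, which is what lets it keep a
    subject consistent across frames. Sizes here are illustrative.
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Split each axis into (num_patches, patch_size) pairs ...
    patches = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # ... group the patch-index axes together, then flatten each patch.
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)
    return patches.reshape(-1, pt * ph * pw * C)

# A toy 16-frame, 256x256 RGB clip becomes a sequence of 2048 tokens,
# each a 1536-dimensional vector (2*16*16*3 values).
video = np.zeros((16, 256, 256, 3))
tokens = spacetime_patchify(video)
print(tokens.shape)  # (2048, 1536)
```

The key design point is that the token sequence has no fixed length: clips of different duration, aspect ratio, or resolution simply yield more or fewer tokens, which is how such a model can handle variable-sized video.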

Sora greatly boosts demand for computing power and drives investment in hardware construction.

According to rough estimates by Dr. Saining Xie, one of the proposers of the DiT architecture, the Sora model has roughly 3 billion parameters. Based on research into available training data, major overseas video sites receive about 500 hours of uploaded video every minute. From this, the bank estimates that training the Sora model requires about 70,900 H100 GPUs running for one month. On the inference side, according to relevant research estimates, generating one image consumes compute roughly equivalent to generating 256 words of text; from this, the bank estimates that generating a one-minute short video consumes about a thousand times more computing power than generating a text conversation. Computing power will remain in short supply in the short to medium term and will not fully meet inference-side demand.
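The thousand-fold inference claim can be sanity-checked with a back-of-envelope calculation. The frame rate and chat length below are my own assumptions, not figures from the report; only the 256-words-per-image equivalence comes from the text:

```python
# Back-of-envelope check of the ~1000x inference-cost claim.
TOKENS_PER_IMAGE = 256   # report's figure: one image ~= 256 words of text
FPS = 30                 # assumed frame rate (not from the report)
VIDEO_SECONDS = 60       # one-minute clip
CHAT_WORDS = 500         # assumed length of a text conversation

video_cost = TOKENS_PER_IMAGE * FPS * VIDEO_SECONDS  # 460,800 word-equivalents
chat_cost = CHAT_WORDS
ratio = video_cost / chat_cost
print(f"video/chat compute ratio ~ {ratio:.0f}x")  # ~922x, i.e. on the order of 1000x
```

Under these assumptions the ratio lands near 1,000x, consistent with the bank's order-of-magnitude estimate; a shorter chat reply or higher frame rate would push it higher.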

Investment advice: Sora opens up AIGC application space in the vision domain. The continued shortage of computing power networks is driving demand for hardware infrastructure construction.

In the optical module segment, the bank focuses on recommending the core North American optical module suppliers Zhongji Xuchuang (300308.SZ) and Xinyisheng (300502.SZ), their upstream core supplier Tianfu Communications (300394.SZ), and leading domestic optical chip manufacturer Yuanjie Technology (688498.SH);

In the switch segment, the bank recommends focusing on Ziguang (000938.SZ) and Ruijie Networks (301165.SZ), as well as domestic switch chip leader Shengke Communications (688702.SH), and also recommends domestic ICT giant ZTE (000063.SZ);

In the video codec segment, the bank suggests focusing on high-quality video codec companies Danghong Technology (688039.SH) and Weihai (301318.SZ).

Risk warning: assumptions behind the core calculations may be inaccurate, implementation progress of the Sora model may fall short of expectations, and the industry's competitive landscape may deteriorate.

The translation is provided by third-party software.

