Source: CITIC Construction Investment Securities Research
GPT-4 delivers significant improvements in comprehension, combined image-and-text understanding, and customizable personality. For the application layer, we can already see the potential for multimodal models to help applications grow revenue while cutting costs and raising efficiency at the same time. We have previously compared the present moment to the eve of the mobile-internet explosion, and we expect GPT-4 to accelerate that process.
Specifically, we believe "multimodal + image/video applications" form the foundation of application development; "+ games" will lift revenue by improving demand, while cutting R&D costs for large games and marketing costs for small and mid-sized games; and "+ virtual humans" will resolve the "false demand" criticism that has constrained the industry, rooted in avatars that are merely "skins" over human operators.
OpenAI officially released GPT-4 on March 15. According to OpenAI, GPT-4 is a multimodal model: it can understand both text and images and respond in text, with stronger comprehension than GPT-3 and ChatGPT. GPT-4's text input and output are already live in ChatGPT, and an API has been opened; image input will debut through a partnership with Be My Eyes. According to the Be My Eyes website, its Virtual Volunteer feature will be built on GPT-4, and the iOS and Android apps have opened a waitlist for it.
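For reference, this is roughly what consuming the newly opened API looks like. A minimal sketch, assuming the `openai` Python package with the chat-completions interface that shipped around GPT-4's release (v0.27-era) and an API key that has been granted GPT-4 access; the prompt is our own illustrative example.

```python
# Minimal sketch: one GPT-4 call through the chat-completions API.
# Assumes the openai Python package (v0.27-era interface) and a key
# with GPT-4 access; the prompt below is purely illustrative.
import openai

openai.api_key = "sk-..."  # placeholder; load from the environment in practice

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "user",
         "content": "Summarize the key risks in this earnings call transcript: ..."},
    ],
)

# The reply text sits on the first choice's message.
print(response["choices"][0]["message"]["content"])
```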
According to OpenAI's website, GPT-4 offers major improvements over ChatGPT and GPT-3 in the following six areas:
1) GPT-4's comprehension is substantially better, which we expect to markedly improve the user experience in productivity scenarios such as office work. According to OpenAI, GPT-4 with vision scores higher on most simulated exams, including the AP, SAT, GRE, and the US bar exam. Across 26 mock exams, GPT-4 did better on 17, with percentile gains of nearly 40 points or even more in science subjects such as calculus, chemistry, and physics. The Verge has reported that ChatGPT previously made frequent errors in mathematical calculation; judging from the results OpenAI has shown, mathematical reasoning has clearly improved. The largest ranking gain was on the US bar exam, where GPT-3.5 scored in the bottom 10% of test takers while GPT-4 reaches the top 10%.
2) The multimodal model understands text and images together and so gives better responses, which we expect to be especially valuable for user experience in education. GPT-4 extracts features from both images and text, processes them as unified data, and responds in text. In OpenAI's demonstration, GPT-4 could therefore explain the joke in a meme about an absurdly oversized iPhone charging cable. We believe combined image-and-text understanding can improve interactive experiences: in education, for example, moving from purely text/voice interaction to combined visual-and-language understanding enables better feedback, which should diversify teaching formats and raise teaching quality.
3) GPT-4 performs better in non-English settings. OpenAI used Azure Translate to render 14,000 multiple-choice questions across 57 subjects into 26 languages and tested GPT-4 on them. In 24 of the 26 languages, GPT-4's accuracy exceeded the English-language performance of GPT-3.5, Chinchilla, Google's PaLM, and other LLMs, including low-resource languages such as Latvian, Welsh, and Swahili. This is further evidence that GPT-4's language understanding surpasses other LLMs.
4) GPT-4's "steerability" will give the AI distinct personalities, which we expect to further advance the possibility of virtual humans becoming "people." Unlike ChatGPT's fixed style, GPT-4 lets API users customize the AI's "personality." We expect this to further improve virtual humans' response mechanisms. The domestic AI chat app Glow, for example, lets users converse with virtual characters of different backgrounds and settings, such as "Iron Man" Tony Stark; bringing this technology into virtual-human scenarios makes virtual humans genuinely "people."
We therefore see ChatGPT's contribution, freeing virtual humans from motion capture and giving them an AI response mechanism so they become "people," as the first step, while GPT-4 unlocks the second step in virtual-human development: making them "people" with distinct personalities. This helps resolve the criticism that conversing and interacting with virtual humans is a "false demand" because they are merely skinned avatars with flat personalities.
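OpenAI's announcement ties steerability to the API's "system" message, which prescribes the assistant's style and persona. A minimal sketch under the same assumptions as above; the persona text is our own hypothetical example, not Glow's implementation.

```python
# Steerability sketch: the system message fixes the assistant's persona,
# so the same model answers with a customized "personality".
# Persona text is hypothetical; same openai (v0.27-era) interface as above.
import openai

persona = (
    "You are Xiaoyu, an upbeat virtual companion. You speak in short, "
    "playful sentences, remember what the user tells you, and never "
    "break character."
)

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": persona},
        {"role": "user", "content": "I had a rough day at work."},
    ],
    temperature=0.8,  # slightly higher variety suits a character voice
)

print(response["choices"][0]["message"]["content"])
```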
5) On the experience side, GPT-4 has stronger safeguards around safety, ethics, and legality. OpenAI's researchers refined the model using the harmful and leading prompts users kept submitting after ChatGPT opened to the public, so GPT-4 is now more vigilant about safety, ethics, and the law.
6) GPT-4 accepts much longer inputs. Versus the roughly 4,096-token cap of GPT-3.5 and ChatGPT (around 3,000 English words), GPT-4 accepts up to 32,768 tokens (over 25,000 words, per OpenAI), eight times as much. GPT-4 can therefore sustain many more rounds of dialogue without quickly "forgetting" earlier context.
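Because these limits are counted in tokens rather than words, applications typically measure a prompt before sending it. A minimal sketch assuming the `tiktoken` tokenizer package; the reserved-reply figure is an arbitrary illustrative choice.

```python
# Token-budget sketch: check a prompt against the 32,768-token window
# cited above. Assumes the tiktoken package; the reserve size is arbitrary.
import tiktoken

GPT4_32K_CONTEXT = 32_768  # window shared by prompt and completion

enc = tiktoken.encoding_for_model("gpt-4")

def fits_in_context(prompt: str, reserved_for_reply: int = 1_024) -> bool:
    """True if the prompt leaves room for the reply inside the window."""
    return len(enc.encode(prompt)) + reserved_for_reply <= GPT4_32K_CONTEXT

print(fits_in_context("Here is the full meeting transcript: ..."))
```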
Take the generative-AI startups backed by Y Combinator as an example: most are text-in, text-out applications spanning customer service, office assistance, and fintech, followed by variations on text-to-image, such as generating short videos in different art styles (plotless montages of AI illustrations) and generating 3D models and assets for games.
With the release of the multimodal GPT-4, we believe that, on the one hand, in the scenarios that are already easiest to commercialize, interactive applications such as productivity tools, education, and customer service, GPT-4's assistive capability has advanced further and improves the user experience of existing deployments. On the other hand, the release shows what multimodal models could become: this upgrade extends the input side from text-only understanding to combined text-and-image understanding, and in the future we can look forward to the output side producing text combined with images, video, and other formats, enabling richer features to land across image/video applications, games, virtual humans, and other scenarios.
We regard "multimodal + image/video applications" as the foundation of the application layer, raising productivity and lowering costs. Today's applications of AIGC technology remain fairly uniform, mostly variations on text-to-image. Multimodal models make it possible both to jointly understand text, images, video, and other content formats and to output combinations of them. Ultimately they can serve consumer scenarios as entertainment and productivity tools for daily life, and also serve as assistive tools in producing content such as games and virtual humans. We therefore view "multimodal + image/video applications" as the base on which applications in this field will land.
"Multimodal + games": 1) Lifting industry demand: far stronger interactivity, addressing the pain point of slowing demand. After a brief surge early in the pandemic, overall market demand has been soft. According to the China games industry report, actual sales revenue of China's game market in 2022 was RMB 265.88 billion, down 10.3% year on year, a decline of RMB 30.63 billion. Applying multimodal AIGC models should improve games' interactive experience; NetEase, for example, has already applied AIGC to NPCs in 逆水寒 (Justice Online) to enrich player interaction. Going forward, we expect AIGC to break games' fixed storylines, expand content volume, and deepen interactivity, ultimately using technology to counter slowing growth in game demand.
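To make the NPC idea concrete: NetEase has not disclosed its implementation, but an LLM-driven NPC can be reduced to a loop that persists the dialogue history and replays it each turn. A hypothetical sketch (the NPC's name and lore are invented), reusing the same openai interface as above.

```python
# Hypothetical LLM-driven NPC loop; the character is invented and this
# is not NetEase's disclosed design. Same openai (v0.27-era) interface.
import openai

history = [{
    "role": "system",
    "content": ("You are Shen Qiu, a blacksmith NPC in a wuxia game. "
                "Stay in character; you know only your forge, local "
                "rumors, and the weapons you sell."),
}]

def npc_reply(player_line: str) -> str:
    """One dialogue turn: append the player's line, query the model, remember both."""
    history.append({"role": "user", "content": player_line})
    resp = openai.ChatCompletion.create(model="gpt-4", messages=history)
    reply = resp["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})  # keep context
    return reply

print(npc_reply("Heard any rumors from the docks lately?"))
```

Persisting the history is what lets the NPC depart from a fixed script while staying consistent across turns.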
2) Cutting costs: lower R&D costs for large games, lower marketing costs for small and mid-sized games. Beyond revenue, multimodal models can also produce games with more content at lower production cost. TechCrunch, for example, reported that a team at the IT University of Copenhagen applied AIGC to Super Mario to build MarioGPT, which generates unlimited levels; for large games this can cut R&D costs.
For small and mid-sized games, R&D is a limited share of costs; there the cost-saving logic resembles that of advertising and marketing firms. Based on the content users watch on platforms such as Weibo and Douyin, plus external signals such as weather and location, models can generate personalized, "one face per user" ad creative, ultimately lifting ad ROI. Multimodal models can thus cut the production cost of ad assets and improve ad performance, saving costs for small and mid-sized games.
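A minimal sketch of the "one face per user" mechanic: fold the user's context signals into the prompt and generate one ad variant per user. The signal fields and the copy brief are our own illustrative assumptions.

```python
# Personalized ad-copy sketch: context signals in, one variant out.
# Field names and the brief are illustrative assumptions.
import openai

def personalized_ad_copy(signals: dict) -> str:
    brief = (
        "Write a 30-word mobile-game ad. "
        f"User context: city={signals['city']}, "
        f"weather={signals['weather']}, "
        f"recent interests={', '.join(signals['interests'])}."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": brief}],
    )
    return resp["choices"][0]["message"]["content"]

print(personalized_ad_copy(
    {"city": "Chengdu", "weather": "rainy", "interests": ["wuxia drama", "esports"]}
))
```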
"Multimodal + virtual humans": becoming real "people" and solving the industry's pain point. Because today's virtual humans are often just skins over human operators, or AI-generated ones have flat personalities, doubts persist over whether virtual humans are a "false demand." The GPT-4 release shows that AI can now carry a personality, and multimodality combines text/language with image/visual understanding, letting virtual humans better grasp how people actually feel and respond accordingly, improving the interactive experience and addressing the industry's pain point.
We believe the multimodal GPT-4 opens more paths for image/video applications, games, and virtual humans to deploy AIGC technology, helping them simultaneously grow revenue, cut costs, and raise efficiency, and ultimately improving valuation upside for the industry and individual stocks.
Risk warning:
Generative AI technology developing more slowly than expected; cross-domain technology integration progressing more slowly than expected; computing-power support falling short of expectations; data quality and quantity falling short of expectations; user demand falling short of expectations; technology-monopoly risk; bias in original training data; algorithmic bias and discrimination; algorithmic transparency risk; increased regulatory difficulty; policy and regulatory risk; commercialization capability falling short of expectations; relevant laws and regulations maturing more slowly than expected; copyright-ownership risk; deepfake risk; human-rights and ethics risks; risks to the health and safety of the internet content ecosystem; insufficient corporate risk-identification and governance capability; and the risk of shifts in user aesthetic preferences.
Editor/Somer