SenseTime released “SenseNova 5.5,” a large model with streaming multimodal interaction and the first domestic model benchmarked against GPT-4o. On July 5, SenseTime held its “Love Without Borders · Xiang Xinli” artificial intelligence forum at WAIC 2024 and released “SenseNova 5.5,” the first domestic large model with streaming, natively multimodal interaction capabilities. Its overall performance is 30% higher than that of “SenseNova 5.0,” released two months earlier, and its interaction quality and several core benchmark scores are on par with GPT-4o. The main updates in “SenseNova 5.5” include: (1) An overall performance improvement of the 600-billion-parameter base model. Extensive use of synthetic, high-quality chain-of-thought data strengthens its deductive reasoning, with significant gains in mathematical logic, English, and instruction following. (2) The launch of “SenseNova 5o,” China’s first “what you see is what you get” model, which provides streaming multimodal interaction and introduces a new mode of AI interaction.
(3) A full upgrade of the on-device model with the release of “SenseNova 5.5 Lite.” Compared with the April 5.0 release, model accuracy improved by 10%, inference efficiency by 15%, and first-packet latency fell by 40%. In multimodal capabilities in particular, “SenseNova 5.5” matches or even surpasses GPT-4o on most core test-set metrics.
As AI models evolve, innovative interaction modes will take the lead in defining the industry’s development. By integrating cross-modal information from sound, text, images, and video, “SenseNova 5o” delivers a real-time streaming multimodal AI interaction experience that feels as direct as human communication itself: the model can see what the user sees and understand the user’s needs. This interaction mode adapts well to multitasking; it can handle a variety of tasks naturally within the same model and adaptively adjust its behavior and output to different contexts. From scene understanding and analysis, object descriptions, and summaries of book pages and figures, to rough stick-figure sketches and facial expressions, “SenseNova 5o” can grasp them accurately, interact smoothly, and converse with people in playful language.
We are paying close attention to on-device AI and industry applications, as AI commercialization accelerates. According to SenseTime, for everyone to be able to use large AI models, one must start with the terminal device. The on-device large language model of “SenseNova 5.5,” “SenseChat Lite-5.5,” has been upgraded across all dimensions and is currently the on-device model with the best overall performance; at the same time, it works in concert with the cloud model to guarantee both performance and speed. SenseTime’s “SenseNova” on-device models have already penetrated various industries, establishing commercial engagements with more than 150 customers and covering many IoT device deployments such as smartphones, tablets, all-in-one VR headsets, in-vehicle computers, and smart desk lamps.
Access to SenseTime’s “SenseNova · SenseChat” on-device large model costs as little as RMB 9.9 per device per year. SenseTime’s on-device large model has several advantages: (1) Versatility. It supports various vertical business directions, with optimizations for different domains such as writing and encyclopedic knowledge. (2) Availability. It supports both on-device deployment and cloud-side calls.
(3) Low barrier to entry. The on-device SDK is easy to integrate and supports rapid deployment. At present, SenseTime’s “SenseNova” large model family has demonstrated practical value across a large number of application scenarios and vertical industries. In programming, large models provide features such as intelligent code completion, which can significantly improve programmers’ day-to-day efficiency. In healthcare, from pre-diagnosis triage to health consultation to post-diagnosis follow-up, large models improve the patient experience throughout the care process. In finance, SenseTime has cooperated with banks, insurers, brokerages, and asset-management clients across multiple modalities and scenarios. In the consumer sector, SenseTime has worked with many leading domestic manufacturers to turn their scenarios into large-model-powered services, for example a Copilot that helps users generate forms, analyze data, and write copy to improve personal productivity. Furthermore, to help more enterprise users get on board at a lower threshold, SenseTime recently launched its “Large Model for 0 Yuan” program: all newly registered “SenseNova” users receive a number of free service packages covering activation, migration, and training, along with 50 million free tokens and a dedicated migration consultant who provides a series of migration trainings from OpenAI to “SenseNova.”
Vimi, a large model for controllable character video generation, was released, accelerating the consumer-facing (2C) rollout of AI+video. According to the official WeChat account of Vimi Camera, at WAIC 2024 SenseTime released Vimi, the first large model for controllable character video generation, which received the WAIC exhibition’s highest honor, “Treasure of the Hall,” making it the most innovative exhibit of this conference.
Built on the capabilities of SenseTime’s “SenseNova” large model, Vimi can generate a video of a person performing target movements from just one photo in any style. It not only achieves precise control of facial expressions but also controls the person’s natural body movement within the bust area. It supports multiple driving methods and can be driven by existing character videos, animations, audio, and text. Vimi’s advantage lies in SenseTime’s years of accumulated facial tracking technology and precise control of facial detail, which make characters’ expressions more vivid. Compared with other models on the market, Vimi controls the face and upper body more accurately and produces videos with high consistency and harmonious light and shadow. Vimi is also highly stable: especially in long-video scenarios it keeps the character’s face controllable, generating single-shot character videos of one minute or longer whose image quality does not degrade or distort over time, genuinely meeting the need for stable, long-duration video generation in entertainment and interactive settings. In character video scene generation, Vimi can change the entire environment in response to body control, including generating plausible hair movement. It can even simulate the motion of the input camera: if the input shot gradually zooms in, the output naturally exhibits the same gradual zoom. Naturally flowing hair, costume changes, and background environments can all be rendered by Vimi, making the resulting video more realistic and vivid. It also supports simulating changes in lighting and shadow, giving every scene in the video a cinematic quality.
In addition, the Vimi model can control camera angles and generate plausible hair-movement effects, giving video creators more creative freedom. Vimi Camera is the first consumer (C-end) product in Vimi’s controllable character video model family and targets the entertainment needs of a broad base of female users. Users only need to upload high-definition images of a person from different angles to automatically generate a digital avatar and photos and videos in different styles; a variety of generation styles, such as glamour photography and fantasy, let users feel as if they have traveled through different dimensions and enjoy immersive, cinematic visual effects. For users who love emoji packs, Vimi Camera can generate a variety of fun character memes from a single image, with diverse gameplay and creative freedom. We believe the release of Vimi has pushed the company into a new era in AI+video: Vimi’s capabilities further broaden the boundaries of AI large model applications and lay a solid foundation for the company’s business expansion.
“SenseChat” released a local version in Hong Kong, as AI increasingly targets market segments. In July, SenseTime’s “SenseChat” mobile app and web version were made available to Hong Kong users free of charge. The Hong Kong “SenseChat” is based on the Cantonese version of the “SenseChat” multimodal large model that SenseTime launched in May this year. Drawing on SenseTime’s strong language and multimodal capabilities, as well as its deep understanding of Cantonese, local culture, and trending topics, “SenseChat” is positioned as a “caring little cotton jacket” (an intimate, considerate companion) for Hong Kong users. Users can chat with it in their most familiar Cantonese, typing or speaking directly to ask questions, search for things, generate images, and write copy. From daily life to study to work, “SenseChat” delivers a genuinely local AI experience: it is well versed in the latest local information and social topics, and even uses local buzzwords fluently. Users can download the “SenseChat” iOS app from the App Store and register with a Hong Kong phone number or email to enjoy the smartest, fastest, most authentic AI experience anytime, anywhere, free of charge; an Android version will launch soon. The “SenseChat” app supports text or voice input for a convenient experience. Its main features include: (1) Localized experience. “SenseChat” has an in-depth understanding of Hong Kong’s local culture, customs, and popular social topics.
In the mobile app, users can converse with “SenseChat” naturally and fluently in Cantonese mixed with English. (2) Multimodal Q&A. Users can directly upload a file or image for “SenseChat” to analyze thoroughly, generate a summary, and answer questions about its content. (3) Real-time search. “SenseChat” can integrate information from multiple sources so that users quickly obtain the latest information, including real-time news and weather conditions, and can conduct further searches. (4) Image generation.
With just a simple description, “SenseChat” can quickly generate images in various styles that users can share with friends in real time or upload to their own social platforms, making creation effortless. (5) Copywriting. Whether for advertising copy, business plans, or academic writing, users can get professional copywriting advice from “SenseChat” to inspire their writing. Furthermore, the web version of “SenseChat” has powerful multimodal file-processing capabilities, can understand, reason over, and generate ultra-long texts, and supports uploading up to 50 files. Whether asking for life tips, solving math problems, analyzing images, or writing code, the web version of “SenseChat” handles it with ease. We believe the release of the Hong Kong version of “SenseChat” is an important experiment in the company’s push into market segments; its adaptation to the Cantonese environment also highlights the company’s leading technical strength in large models. The company’s future AI commercialization is worth looking forward to.
Actively cooperating with Huawei, the Ascend platform helps power SenseTime’s AI. During WAIC 2024, the Ascend Artificial Intelligence Industry Summit 2024 was held, focusing on large model inference and best practices from customers and partners, and exploring ways to accelerate large model innovation and application deployment. Yang Fan, co-founder of SenseTime and president of its Large Device (SenseCore) Business Group, was invited to deliver the keynote “Ecological Connectivity Leads the Wave of Innovation in the Large Model Era,” sharing the full-stack, native development practices of SenseTime’s “SenseNova” large model family on the Ascend AI hardware and base software platform. As an important engine for accelerating AI innovation, native development is gradually becoming an industry focus. Gong Ruihao, research director of model research at SenseTime, was invited to attend the “Ascend AI Partner Native Development Results Release,” where SenseCore committed to working with partners to jointly promote technological innovation and integrated industrial development. Notably, at SenseTime’s WAIC 2024 artificial intelligence forum, a signing ceremony for Ascend-native model cooperation was held: SenseTime and Huawei Technologies Co., Ltd. signed a cooperation agreement to push native large model development to a new level. From infrastructure construction to breakthroughs in large models to a flourishing application layer, everything depends on close collaboration across the upstream and downstream ecosystem. Over the past year, SenseTime has worked closely with the Ascend and MindSpore teams to jointly build the next-generation large model base and a new large model training ecosystem.
For example, SenseTime can achieve industry-leading compute utilization on clusters of more than 3,000 accelerator cards, allowing it to serve downstream enterprises with higher-performance, higher-efficiency cluster capabilities. Previously, the SenseCore AI Cloud, the “SenseNova · SenseChat” language model, and SenseTime’s medical model “Big Doctor” all passed mutual compatibility tests with Huawei’s Atlas series servers, enabling safer, more efficient, and more reliable full-stack AI solutions and application experiences for customers. Yang Fan said, “The deep integration of SenseTime’s platform, algorithm, and software capabilities in industry scenarios with Ascend’s hardware and underlying base software will deliver greater value and more diverse solutions as artificial intelligence serves all industries in the future.” Going forward, SenseTime will continue to deepen cooperation with Huawei to build more efficient, lower-cost, lower-threshold AI infrastructure, better serve more industries and scenarios, bring more and better intelligent services to consumers and enterprises, and promote the continued development of China’s AI technology and industry. We believe that through active cooperation with Huawei, the company has secured an important domestic computing-power partner; as the Ascend ecosystem develops, the deployment of SenseTime’s AI should also receive significant support.
Profit forecast and investment advice. We believe the release of the streaming multimodal interaction model “SenseNova 5.5,” the first domestic model benchmarked against GPT-4o, further demonstrates SenseTime’s strong technical capabilities and lays a solid foundation for AI commercialization. The release of Vimi has likewise moved the company’s AI+video business into a new era, and continued iteration of the SenseNova models should drive ongoing development of the company’s related AI applications. New room for growth has opened up, and the company’s future development is worth looking forward to. Weighing these factors, we assign SenseTime Group a 2024 PS multiple of 16-20x, corresponding to a fair value range of HK$2.27-2.84 per share (HK$1 = RMB 0.9315), and an “outperform the market” rating.
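The PS-based range above is internally consistent: holding the implied 2024 revenue per share fixed, the upper bound follows from the lower bound by scaling the multiple from 16x to 20x. A minimal sketch of the arithmetic, using only figures stated in the report (the implied revenue per share is backed out here for illustration, not disclosed in the report):

```python
# Back out the revenue per share implied by the low end of the range,
# then check that the high end follows from the same revenue figure.
ps_low, ps_high = 16.0, 20.0   # 2024 PS multiple range from the report
price_low_hkd = 2.27           # fair value per share (HK$) at 16x PS
hkd_to_rmb = 0.9315            # HK$1 = RMB 0.9315, as stated

# Implied 2024 revenue per share in RMB (an inferred quantity, not disclosed).
rev_per_share_rmb = price_low_hkd * hkd_to_rmb / ps_low   # ~0.132 RMB

# Applying 20x PS to the same revenue figure reproduces the upper bound.
price_high_hkd = ps_high * rev_per_share_rmb / hkd_to_rmb

print(f"{rev_per_share_rmb:.4f}")   # implied 2024 revenue per share (RMB)
print(f"{price_high_hkd:.2f}")      # ~2.84, the report's upper bound
```

In other words, the HK$2.27-2.84 range is simply the 16-20x multiple applied to one implied revenue-per-share figure; the HKD/RMB rate cancels out of the ratio between the two bounds.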
Risk warning. AI commercialization may fall short of expectations; the company’s internationalization may fall short of expectations; etc.