ChatGPT's latest Advanced Voice Mode can now respond in real time to video and screen-sharing content. With Christmas approaching, the voice feature has also gained a Santa Claus mode.
ChatGPT's Advanced Voice Mode (AVM) now has video and screen-sharing features! These features are rolling out to paid ChatGPT Plus and Pro subscribers starting Thursday, while enterprise and education customers will gain access in January.
On the sixth day of the '12 Days of OpenAI' event, the AI startup announced that ChatGPT can recognize objects captured by the camera or displayed on the device screen and respond through its Advanced Voice Mode. Users can chat with ChatGPT using their phone's camera, and the model will 'see' what they see.
OpenAI previously previewed this feature when it launched the GPT-4o model in May. The startup says AVM is powered by its natively multimodal GPT-4o model, meaning it can handle audio input and respond in a natural, conversational manner.
OpenAI's video mode feels like a video call, because ChatGPT responds in real time to whatever the user shows on camera. It can see the user's surroundings, identify objects, and even remember people who introduce themselves. During a livestream, the company's Chief Product Officer (CPO) Kevin Weil and other team members demonstrated how ChatGPT assisted in making pour-over coffee: by pointing the camera at the brewing setup, they had AVM guide the team through the process step by step, showing that it understood the principles of brewing coffee.
ChatGPT can also recognize content on the screen. In the demo, OpenAI researchers turned on screen sharing, opened a messaging application, and asked ChatGPT to help reply to a photo received via text message.
This long-awaited news arrived the day after Google launched its next-generation flagship model, Gemini 2.0. Gemini 2.0 can also process visual and audio inputs and has stronger agent capabilities, meaning it can perform multi-step tasks on behalf of users. Those agent capabilities currently span three named research prototypes: Project Astra, a general AI assistant; Project Mariner, which carries out tasks in the browser; and Jules, an agent for developers.
Additionally, last week Microsoft released a preview of Copilot Vision, which lets Pro subscribers open a Copilot chat that can see the web page they are browsing. Copilot Vision can view what is shown on the screen and even help play map-guessing games. Google's Project Astra can read the browser in a similar way.
OpenAI is not to be outdone: its demonstration showed ChatGPT's vision mode accurately recognizing objects even when the conversation was interrupted mid-response. The company also added a holiday voice option for Santa Claus in voice mode, featuring a deep, cheerful voice and plenty of "ho-ho-hos." Users can chat with OpenAI's version of Santa Claus by tapping the snowflake icon in ChatGPT. Media outlets joked that it remains unclear whether the real Santa Claus contributed his voice for AI training or whether OpenAI used it without prior consent.
The advanced voice mode with visual capabilities had been postponed several times, reportedly in part because OpenAI announced the feature before it was ready. In April of this year, OpenAI promised to roll out the advanced voice mode to users "within a few weeks"; months later, the company was still saying it needed more time.