
Voice AI Assistant Battle Begins! OpenAI vs. Google: the “iOS vs. Android” of AI Phones

wallstreetcn ·  May 16 17:56

Author: Lee Siu-yin

Source: Hard AI

This week, the AI community kept “exploding”: Google announced that it has entered the Gemini era and rolled out a flurry of updates, only to be upstaged by OpenAI, which held its own launch one day earlier.

OpenAI's GPT-4o impressed with its real-time interaction capabilities, and Google's Project Astra fought back with comparable capabilities, triggering a surge in industry discussion about AI assistants.

According to publicly available information, GPT-4o and Project Astra, both AI voice assistants, are built on multi-modal models that can receive and generate text, images, audio, and video, enabling ultra-low latency and real-time interaction.

Also, according to previous media reports, Apple has reached an agreement with OpenAI to bring ChatGPT technology into its new operating system, iOS 18, while Google controls the “lifeblood” of Android. One can't help but wonder: will this duel between GPT-4o and Gemini become the “iOS vs. Android” of AI phones?

Head-to-head: who is better?

If you compare GPT-4o and Project Astra (which provides the Gemini Live functionality in Gemini) point by point, you will find that the two AI assistants do differ in the details.

1) Usage scenarios

GPT-4o has an average response latency of 320 ms, and its fastest response to audio input is 232 ms, which is close to human response time in conversation. In the launch presentation, the everyday usage scenarios shown for GPT-4o included live interpretation, reading and writing code, math tutoring, summarizing and explaining information, and recognizing emotions from video.

Gemini Live's visual recognition and voice interaction are comparable to GPT-4o's. It also provides a conversational natural-language voice interface and can analyze video in real time through a phone camera, and it responds fast enough for natural everyday conversation. DeepMind CEO Demis Hassabis described the goal as “always hoping to create a general-purpose smart device useful in everyday life.”

In terms of ease of use, there isn't much difference between the two.

However, one point that may lead the market to react differently is that the GPT-4o demo was performed live, while Google's demo was recorded before the press conference.

2) Multi-modal Capabilities

Multi-modal capabilities are the main selling point of both AI assistants. For now, GPT-4o appears to be slightly ahead in audio, while the visual features shown by Project Astra look stronger.

In its demo, GPT-4o showed realistic voices, a smooth conversational flow, singing, and even the ability to guess emotions from the user's facial expressions, while Project Astra showed more “advanced” visual features, such as being able to “remember” where you put your glasses.

In terms of the underlying multimodal models, Gemini relies on other models for output, using Imagen 3 for images and Veo for video, while GPT-4o is natively multimodal and generates images and audio itself.

3) Product Positioning

The launch of GPT-4o has sparked discussion of a real-life version of “Her”, because its AI assistant has an emotionally expressive female voice and can even make small talk and joke. Project Astra also uses a female voice, but its tone is calmer and more matter-of-fact.

This reflects how differently the two companies position their “AI assistant” products. OpenAI wants its assistant to be more “anthropomorphic”, while Google wants its assistant to be more “agent-like.”

Google has stated that it intends to avoid producing “Her” type artificial intelligence.

In a paper published last month, DeepMind detailed the potential drawbacks of anthropomorphic AI, arguing that such assistants blur the “human-machine boundary” and may cause problems such as disclosure of sensitive information, emotional dependence, and weakened agency.

4) Access path

OpenAI said it is rolling out GPT-4o's text and vision features on the web interface and in the ChatGPT apps starting now. The company also said it will add voice functionality in the coming weeks, and developers can already access the text and vision features through the API.

Google said Gemini Live will launch “in the next few months” through Gemini Advanced, its premium AI plan.

Some observers argue that because OpenAI released its new features earlier, its products may have an advantage in acquiring new users.

5) Costs

GPT-4o is free for all ChatGPT users, and the API price has been cut by 50%.

However, free use is officially capped at a certain number of messages. Once that cap is exceeded, free users are switched back to GPT-3.5, while paid users (starting at $20 per month) get five times the GPT-4o message limit.

Gemini Advanced offers a two-month free trial and costs $20 per month thereafter.

Will AI glasses be the next battleground?

As on-device AI applications advance, AI assistants will be put to real use in everyday life, and only then will their actual usefulness become clear.

Still, AI voice assistants seem to hint at a new trend in consumer technology: a shift from text to audio.

Next, the deep integration of visual abilities also seems to be on the way.

At the press conference, Google said another potential use of Project Astra is pairing it with Google Glass: blind people wearing the glasses could get real-time audio explanations in their daily lives.

Meta has also launched its voice assistant, Meta AI, for its VR headsets and Ray-Ban smart glasses.

Some argue that, at this stage, adding AI voice assistants may make AI phones the winners, but that in the long run the ultimate form factor for these voice AI models will be smart glasses.


The translation is provided by third-party software.

