

OpenAI has launched a complete suite of voice models: AI will speak more expressively and transcribe more accurately...

cls.cn ·  Mar 21 04:40

① OpenAI released three new voice models; the text-to-speech model GPT-4o Mini TTS produces more realistic voices and lets developers instruct it in natural language how to speak. ② The accuracy of the new speech-to-text models has greatly improved, with a word error rate of only about 2% in English and Spanish, and about 7% in Mandarin.

On Thursday Eastern Time, OpenAI held a technical live stream to release three new voice models: the speech-to-text models GPT-4o Transcribe and GPT-4o Mini Transcribe, and the text-to-speech model GPT-4o Mini TTS.

OpenAI says these models represent significant progress over previous versions and mark a further step toward its vision of AI agents.

A more realistic voice generation model

OpenAI says its new text-to-speech model, GPT-4o Mini TTS, not only produces more detailed and realistic-sounding voices but is also more "controllable" than its previous generation of speech synthesis models.

Developers can use natural language to tell the model how to speak - for example, "speak like a mad scientist," "speak like an empathetic customer service agent," or "use a calm voice, like a mindfulness teacher."
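As a concrete illustration, here is a minimal sketch of what such steering might look like through the OpenAI Python SDK's speech endpoint. The model name follows the article; the voice name, the `instructions` parameter, and the output handling are assumptions based on OpenAI's published API and should be checked against current documentation.

```python
# Minimal sketch: steering delivery with a natural-language instruction.
# Model name from the article; the voice name and `instructions`
# parameter are assumptions based on OpenAI's published API docs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of the built-in voices
    input="I'm so sorry about the mix-up with your order.",
    instructions="Speak like an empathetic customer service agent.",
)

# The endpoint returns binary audio; write it to disk.
with open("apology.mp3", "wb") as f:
    f.write(response.read())
```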

OpenAI provides six different tone examples on its official website.

Jeff Harris, a member of OpenAI's product staff, said the goal is to let developers customize both the voice "experience" and "environment."

Harris said: "In different situations, you don't just want a flat, monotonous voice... In a customer support experience, you want the voice to sound apologetic because it has made a mistake, and you can have the voice convey that emotion... Our belief is that developers and users want real control not only over what is said, but over how it is said."

The accuracy of the speech-to-text models has significantly improved

As for OpenAI's new speech-to-text models, GPT-4o Transcribe and GPT-4o Mini Transcribe, their accuracy is markedly higher than that of Whisper, OpenAI's previously released speech-to-text model, achieving lower word error rates (WER) across multiple languages.
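For context, the new models are served through the same transcription endpoint that previously exposed Whisper. A minimal sketch with the OpenAI Python SDK, assuming the model name reported in the article and a hypothetical local audio file:

```python
# Minimal sketch: transcribing an audio file with the new model.
# The endpoint mirrors the one previously used for whisper-1; the
# model name follows the article, and "meeting.mp3" is hypothetical.
from openai import OpenAI

client = OpenAI()

with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)  # plain-text transcription
```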

The new models show significantly lower error rates across multiple languages.

OpenAI says that after training on a "diverse and high-quality audio dataset," the new models can better capture accents and varied voices, even in noisy environments.

OpenAI also said the new models are less prone to hallucination. Whisper is known to fabricate words, and even entire passages, in its transcriptions; Harris added that "the new model has significantly improved in this regard compared to Whisper."

Harris said: "Making sure the models are accurate is essential to a reliable voice experience, and accurate in this context means the model heard the words exactly and isn't filling in details it didn't hear."

Of course, the models' accuracy depends heavily on the language being transcribed.

According to OpenAI's internal benchmarks, GPT-4o Transcribe, the more accurate of the two new transcription models, has a word error rate of only about 2% in English and Spanish and around 7% in Mandarin, while in Hindi and Dravidian languages (such as Tamil and Telugu) its word error rate is still close to 30%, meaning roughly 3 of every 10 transcribed words differ from a human transcription in those languages.
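For readers unfamiliar with the metric: word error rate is the minimum number of word-level substitutions, deletions, and insertions needed to turn the model's output into a reference transcript, divided by the number of words in the reference. The sketch below is the standard textbook edit-distance formulation, not OpenAI's benchmark code:

```python
# Illustrative word error rate (WER) via word-level edit distance.
# Standard dynamic-programming formulation, not OpenAI's benchmark code.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match/substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution and one deletion over 10 reference words -> 0.2,
# i.e. 2 of every 10 words differ from the human transcription.
print(wer("the cat sat on the mat today it rained hard",
          "the cat sat in the mat today rained hard"))
```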

Moving closer to AI agents

OpenAI claims that these models align with its broader vision of "AI agents": building automated systems that can independently perform tasks on behalf of users.

Although the definition of an "agent" is contested, Olivier Godement, OpenAI's product head, described one interpretation as a chatbot that can talk with a business's customers.

"In the coming months, we will see more and more AI agents emerging," Godement stated, "Therefore, the overall theme is to help clients and developers utilize useful, usable, and accurate agents."

Breaking with its past practice, OpenAI does not plan to release its new transcription models openly. The company previously released Whisper under the MIT license, allowing commercial use.

Harris said that GPT-4o Transcribe and GPT-4o Mini Transcribe are "much larger than Whisper" and therefore not well suited to an open release.

"They are not the kind of models that can run locally on laptops, like Whisper," he continued, "We want to ensure that if we release something in an open-source manner, it is well thought out, and we have a model that truly addresses specific needs."

Editor/lambor
