

OpenAI has launched a complete suite of voice models: AI will speak more expressively and transcribe more accurately...

cls.cn ·  Mar 21 04:40

① OpenAI released three new voice models; the text-to-speech model GPT-4o Mini TTS produces more realistic voices and lets developers instruct it in natural language how to speak. ② The accuracy of the new speech-to-text models has greatly improved, with a word error rate of only about 2% in English and Spanish, and about 7% in Mandarin.

On Thursday Eastern Time, OpenAI held a technical live stream to release three new voice models: the speech-to-text models GPT-4o Transcribe and GPT-4o Mini Transcribe, and the text-to-speech model GPT-4o Mini TTS.

OpenAI says these models represent significant progress over previous versions and mark a further step toward its vision of AI agents.

A more realistic voice generation model

OpenAI says its new text-to-speech model, GPT-4o Mini TTS, not only produces more detailed and realistic-sounding voices but is also more "controllable" than its previous generation of speech synthesis models.

Developers can use natural language to tell the model how to speak - for example, "speak like a mad scientist," "speak like an empathetic customer service agent," or "use a calm voice, like a mindfulness teacher."
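As a concrete illustration, here is a minimal sketch of what such steering might look like through the OpenAI Python SDK's speech endpoint. The model name follows the article; the voice name, the `instructions` parameter, and the output handling are assumptions based on OpenAI's published API and should be checked against current documentation.

```python
# Minimal sketch: steering delivery with a natural-language instruction.
# Model name from the article; the voice name and `instructions`
# parameter are assumptions based on OpenAI's published API docs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of the built-in voices
    input="I'm so sorry about the mix-up with your order.",
    instructions="Speak like an empathetic customer service agent.",
)

# The endpoint returns binary audio; write it to disk.
with open("apology.mp3", "wb") as f:
    f.write(response.read())
```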

OpenAI provides six different tone examples on its official website.

Jeff Harris, a member of OpenAI's product staff, said the goal is to let developers customize both the voice "experience" and "environment."

Harris said: "In different situations, you don't just want a flat, monotonous voice... In a customer support experience, you want the voice to sound apologetic because it has made a mistake, and you can have the voice convey that emotion... Our belief is that developers and users want real control not only over what is said, but over how it is said."

The accuracy of the speech-to-text models has significantly improved

As for OpenAI's new speech-to-text models, GPT-4o Transcribe and GPT-4o Mini Transcribe, their accuracy is markedly higher than that of Whisper, OpenAI's previously released speech-to-text model, achieving lower word error rates (WER) across multiple languages.
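For context, the new models are served through the same transcription endpoint that previously exposed Whisper. A minimal sketch with the OpenAI Python SDK, assuming the model name reported in the article and a hypothetical local audio file:

```python
# Minimal sketch: transcribing an audio file with the new model.
# The endpoint mirrors the one previously used for whisper-1; the
# model name follows the article, and "meeting.mp3" is hypothetical.
from openai import OpenAI

client = OpenAI()

with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)  # plain-text transcription
```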

The new models show significantly lower error rates across multiple languages.

OpenAI says that after training on a "diverse and high-quality audio dataset," the new models can better capture accents and varied voices, even in noisy environments.

OpenAI also said the new models are less prone to hallucination. Whisper is known to fabricate words, and even entire passages, in its transcriptions; Harris added that "the new model has significantly improved in this regard compared to Whisper."

Harris said: "Making sure the models are accurate is essential to a reliable voice experience, and accurate in this context means the model heard the words exactly and isn't filling in details it didn't hear."

Of course, the models' accuracy depends heavily on the language being transcribed.

According to OpenAI's internal benchmarks, GPT-4o Transcribe, the more accurate of the two new transcription models, has a word error rate of only about 2% in English and Spanish and around 7% in Mandarin, while in Hindi and Dravidian languages (such as Tamil and Telugu) its word error rate is still close to 30%, meaning roughly 3 of every 10 transcribed words differ from a human transcription in those languages.
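For readers unfamiliar with the metric: word error rate is the minimum number of word-level substitutions, deletions, and insertions needed to turn the model's output into a reference transcript, divided by the number of words in the reference. The sketch below is the standard textbook edit-distance formulation, not OpenAI's benchmark code:

```python
# Illustrative word error rate (WER) via word-level edit distance.
# Standard dynamic-programming formulation, not OpenAI's benchmark code.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match/substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution and one deletion over 10 reference words -> 0.2,
# i.e. 2 of every 10 words differ from the human transcription.
print(wer("the cat sat on the mat today it rained hard",
          "the cat sat in the mat today rained hard"))
```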

Moving closer to AI agents

OpenAI claims that these models align with its broader vision of "AI agents": building automated systems that can independently perform tasks on behalf of users.

Although the definition of an "agent" is contested, Olivier Godement, OpenAI's product head, described one interpretation as a chatbot that can talk with a business's customers.

"In the coming months, we will see more and more AI agents emerging," Godement stated, "Therefore, the overall theme is to help clients and developers utilize useful, usable, and accurate agents."

Breaking with its past practice, OpenAI does not plan to release its new transcription models openly. The company previously released Whisper under the MIT license, allowing commercial use.

Harris said that GPT-4o Transcribe and GPT-4o Mini Transcribe are "much larger than Whisper" and therefore not well suited to an open release.

"They are not the kind of models that can run locally on laptops, like Whisper," he continued, "We want to ensure that if we release something in an open-source manner, it is well thought out, and we have a model that truly addresses specific needs."

Editor/lambor
