
Alibaba Cloud's large model gets a new addition! AI tool "Tongyi Tingwu" enters public beta: summarize long videos in seconds, take notes automatically, and translate subtitles

QbitAI (量子位) · Jun 1, 2023 13:57

Source: QbitAI (量子位) | Author: Yuyang (鱼羊)

Another group-meeting lifesaver powered by large models has entered free public beta!

The large model behind it is Alibaba's Tongyi Qianwen. As for why it counts as a group-meeting lifesaver --

Look, this is my Bilibili "mentor," Mr. Li Mu, walking his students through a large-model paper.

Unfortunately, at that very moment my boss urged me to hurry up and get back to work. I had no choice but to quietly take off my headphones, click the plug-in called "Tongyi Tingwu," and switch tabs.

Guess what happened? Even though I wasn't "at" the group meeting, Tingwu recorded its entire content for me.

It even extracted keywords, a full-text summary, and key takeaways for me with one click.

Simply put, Tongyi Tingwu, which has just gained large-model capabilities, is a large-model-powered AI assistant for work and study that focuses on audio and video content.

Unlike earlier recording and transcription tools, it does more than simply turn recordings and videos into text: besides summarizing the full text with one click, it can also summarize the viewpoints of different speakers:

It can even be used as a real-time subtitle translator:

Clearly, this is not just a handy tool for group meetings: for us at QbitAI, who often have to deal with piles of recordings, late nights, and all kinds of overseas press events, it looks like a new everyday-work lifesaver.

We immediately put it through a thorough round of hands-on testing.

Hands-on with Tongyi Tingwu

For organizing and analyzing audio content, the most basic and most important thing is transcription accuracy.

Round 1: we first upload a 10-minute Chinese video to see how Tingwu's accuracy compares with similar tools.

These AI tools all process medium-length audio and video quite quickly; the transcription was finished in under 2 minutes.

Let's look at Tongyi Tingwu's performance first:

In this roughly 200-character passage, Tingwu made only two mistakes: "strong" came out as "wall," and "all good" came out as "just right." Physics terms like nucleus, charge, and repulsion were all recognized correctly.

We also ran the same video through Feishu Minutes (飞书妙记). Overall it did fine, but it made two more mistakes than Tingwu: it wrote one instance of "atom" as "garden" and mis-heard "repulsive force" as a different word.

Interestingly, every mistake Tingwu made, Feishu reproduced one for one. It seems the blame really lies with the speaker who mumbles and swallows syllables (just kidding).

iFLYTEK Tingjian (讯飞听见), for its part, correctly caught the "just right" that the other two contestants missed. However, it transcribed essentially every "wall" as "strong," even producing the baffling combination "strong sugar grains." And of the three contestants, only iFLYTEK heard "electromagnetic force" as "electronic force."

Generally speaking, Chinese recognition is not much of a challenge for these AI tools. So how do they perform on English material?

We uploaded a recent interview in which Musk talks about his old grievances with OpenAI.

Let's look at Tongyi Tingwu's results first. In this segment of Musk's answer, Tingwu failed to pick out Larry Page's name; apart from that, it recognized everything correctly.

It is worth mentioning that Tingwu can translate the English transcript directly into Chinese and display the two side by side, and the translation quality is quite good.

Feishu Minutes did catch Larry Page's name, but, like Tingwu, it made some minor mistakes because Musk speaks quickly and colloquially overall, for example writing "stay at his house" as "say this house."

iFLYTEK Tingjian handled names and even small pronunciation details very well, but it was also occasionally misled by Musk's colloquialisms, transcribing "long into the evening" as "longing to the evening," for example.

Seen this way, on the basic skill of speech recognition these AI tools have all reached a high accuracy rate, and set against their enormous efficiency gains, the minor slip-ups are easy to overlook.

So let's raise the difficulty a level. Round 2 tests their ability to summarize videos that run closer to an hour.

The test video is a 40-minute roundtable on new opportunities for AIGC in China, with a total of five participants.

On Tingwu's side, it took less than 5 minutes to go from a finished transcription to extracted keywords and a full-text summary.

The result looked like this:

It not only produced keywords but also summarized the roundtable discussion well and split the video into sections by key points.

Comparing this with the main points excerpted by our human editors, I caught a faint whiff of crisis...

It is worth mentioning that it can also give a separate summary of what each guest said.

We threw the same task at Feishu Minutes. At present, when it comes to summarization, Feishu Minutes can only produce keywords.

Meeting minutes still have to be marked up manually on the transcribed text.

iFLYTEK Tingjian does have a document-analysis product in closed beta built on its Spark cognitive model, but you have to submit an application and wait in line.

The basic iFLYTEK Tingjian product currently has no comparable summarization feature.

So the scoreboard for this round looks like this:

That said, the most surprising thing about Tongyi Tingwu in this hands-on test was actually a "small" design:

The Chrome plug-in.

Whether you're watching an English-language video, following a live stream, or sitting in an online class or meeting, one click on the Tingwu plug-in transcribes and translates the audio and video in real time.

As shown at the beginning, it can serve as real-time subtitles, with low latency, fast translation, and bilingual side-by-side display. Recordings and transcripts can also be saved with one click for later reference.

Mom no longer has to worry about me struggling to get through English video material.

Also, I have a bold idea...

With Tingwu listening in during group meetings, you no longer have to fear being called on by your advisor out of the blue.

Currently, Tingwu is integrated with Alibaba Cloud Drive: audio and video stored in the drive can be transcribed with one click, and subtitles are displayed automatically when a drive video is played online. In the enterprise edition, audio and video files organized by the AI will also be quickly shareable inside a company in the future.

The Tingwu team also revealed what comes next: Tingwu will keep adding new large-model capabilities, such as extracting PPT screenshots directly from videos and letting users ask the AI questions about audio and video content directly...

The technology behind it: large language model + speech SOTA

In fact, before the public beta, Tongyi Tingwu had already been polished inside Alibaba for quite some time.

At the end of last year, some netizens got hold of closed-beta trial cards; that version already offered offline audio/video transcription and real-time transcription.

What this public beta mainly adds is the summarization and dialogue capability of the Tongyi Qianwen model. Specifically, Tingwu is built on Tongyi Qianwen and folds in the R&D team's work on reasoning, alignment, and dialogue/Q&A.

First, accurately extracting key information is what lets this kind of tool actually improve productivity, and that depends on the reasoning ability of large models.

In 2022, Alibaba's AI team proposed Proton (Probing Turning from Large Language Models), a framework for probing knowledge inside large language models and using it for reasoning; related papers have appeared at international conferences such as KDD 2022 and SIGIR 2023.

The core idea of the framework is to probe the knowledge stored inside the large model and use chains of thought as the carrier for passing that knowledge along and putting it to use.

Proton has taken first place on three major leaderboards: CommonsenseQA 2.0 (commonsense reasoning), PIQA (physical commonsense reasoning), and NumerSense (numerical commonsense reasoning).

On the TabFact fact-checking leaderboard, Proton surpassed human performance for the first time, using knowledge decomposition and trustworthy chain-of-thought techniques.
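To make the idea of using a chain of thought as the knowledge carrier a bit more concrete, here is a minimal two-step prompting sketch. The query_llm helper and both prompt templates are hypothetical placeholders for illustration only, not the actual Proton implementation.

```python
# Minimal sketch of "probe internal knowledge, then reuse the chain of thought".
# query_llm is a hypothetical stand-in for any chat-completion client; the
# prompts are illustrative and NOT the actual Proton framework.

def query_llm(prompt: str) -> str:
    """Hypothetical placeholder; replace with a real LLM API call."""
    return f"[model output for a {len(prompt)}-char prompt]"

def probe_and_answer(question: str) -> str:
    # Step 1: elicit the model's internal knowledge as an explicit reasoning chain.
    thought_chain = query_llm(
        f"Question: {question}\n"
        "List the relevant facts you know, then reason step by step."
    )
    # Step 2: feed the chain back in as the carrier for the final answer.
    return query_llm(
        f"Question: {question}\n"
        f"Reasoning:\n{thought_chain}\n"
        "Based only on the reasoning above, give a concise final answer."
    )

print(probe_and_answer("Why do like charges repel?"))
```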

Second, to ensure that the content and format of summaries meet user expectations, Tingwu also uses ELHF, an efficient alignment method based on human feedback.

The method needs only a small number of high-quality human feedback samples to achieve alignment; in subjective evaluations of model quality, ELHF raises the model's win rate by 20%.

In addition, the R&D team behind Tingwu has released Doc2Bot, the first large-scale Chinese document-grounded dialogue dataset. The team's Re3G method for improving question answering was accepted at ICASSP 2023: through the four stages Retrieve, Rerank, Refine, and Generate, it strengthens the model's ability to understand user questions, retrieve knowledge, and generate responses, and it took first place on the two major document-grounded dialogue leaderboards, Doc2Dial and MultiDoc2Dial.
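Purely to illustrate the four-stage flow named above, here is a minimal pipeline sketch; every component below is a hypothetical placeholder (including the trivial ranking logic), not the team's actual Re3G code.

```python
# Illustrative Re3G-style pipeline: Retrieve -> Rerank -> Refine -> Generate.
# All four components are hypothetical placeholders, not the actual Re3G code.
from typing import List

def retrieve(question: str, corpus: List[str], k: int = 20) -> List[str]:
    """Stage 1: pull candidate passages (in practice BM25 or a dense retriever)."""
    return corpus[:k]                      # placeholder ranking

def rerank(question: str, passages: List[str], k: int = 5) -> List[str]:
    """Stage 2: reorder candidates with a stronger (e.g. cross-encoder) scorer."""
    return passages[:k]                    # placeholder scoring

def refine(question: str, passages: List[str]) -> str:
    """Stage 3: condense the kept passages into just the evidence needed."""
    return " ".join(passages)

def generate(question: str, evidence: str) -> str:
    """Stage 4: produce a grounded response from the question plus evidence."""
    return f"Answer to '{question}', grounded in: {evidence[:80]}..."

def answer(question: str, corpus: List[str]) -> str:
    passages = retrieve(question, corpus)
    passages = rerank(question, passages)
    evidence = refine(question, passages)
    return generate(question, evidence)

print(answer("How do I export a meeting summary?", ["Doc page A ...", "Doc page B ..."]))
```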

Beyond the large model, Tingwu is also the culmination of Alibaba's speech technology.

The speech recognition model behind it, Paraformer, comes from Alibaba DAMO Academy. It was the first to resolve, at the level of industrial applications, the trade-off between end-to-end recognition accuracy and efficiency:

Not only is its inference roughly 10 times more efficient than conventional models, it also swept multiple authoritative datasets when it was released, setting a new SOTA for speech recognition accuracy. To this day, Paraformer-large remains the most accurate Chinese speech recognition model in the SpeechIO TIOBE white-box test, a professional third-party evaluation of public-cloud Chinese speech recognition.

Paraformer is a single-pass non-autoregressive model composed of five parts: an encoder, a predictor, a sampler, a decoder, and a loss function.

Through the innovative design of its predictor, Paraformer accurately predicts the number of target tokens and the corresponding acoustic hidden variables.

In addition, the researchers borrowed the idea of the glancing language model (GLM) from machine translation and designed a GLM-based sampler to strengthen the model's contextual semantic modeling.

Paraformer was also trained on tens of thousands of hours of industrial-scale data covering a rich range of scenarios, further improving recognition accuracy.
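As a rough illustration of the single-pass, non-autoregressive idea described above (the predictor estimates how many tokens to emit, and the decoder produces them all in parallel), here is a toy PyTorch sketch. The layer sizes, the sigmoid length predictor, and the adaptive pooling used as a stand-in for the acoustic-embedding integration are all simplifying assumptions, and the GLM sampler is omitted; this is not the DAMO Academy implementation.

```python
# Toy single-pass non-autoregressive ASR forward pass in the spirit of Paraformer.
# Sizes, the sigmoid length predictor, and adaptive pooling (a crude stand-in for
# the real acoustic-embedding integration) are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyParaformer(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, vocab_size=4000):
        super().__init__()
        self.front = nn.Linear(feat_dim, d_model)
        enc = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.predictor = nn.Linear(d_model, 1)       # per-frame weights -> token count
        dec = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec, num_layers=2)  # bidirectional, parallel
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats):                        # feats: (1, frames, feat_dim), one utterance
        h = self.encoder(self.front(feats))          # acoustic encodings
        alpha = torch.sigmoid(self.predictor(h))     # per-frame weights
        n_tokens = max(int(alpha.sum().round()), 1)  # predicted number of output tokens
        # Pool frames into n_tokens "acoustic embeddings" for the parallel decoder.
        pooled = F.adaptive_avg_pool1d(h.transpose(1, 2), n_tokens).transpose(1, 2)
        return self.out(self.decoder(pooled)), n_tokens   # all tokens decoded in one pass

model = TinyParaformer()
logits, n = model(torch.randn(1, 300, 80))           # 300 frames of fbank-like features
print(logits.shape, n)                                # -> (1, n, 4000)
```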

Accurately separating speakers in multi-person discussions, meanwhile, is the job of DAMO Academy's CAM++ speaker recognition base model. CAM++ uses a densely connected time-delay network (D-TDNN) in which each layer's input is the concatenation of the outputs of all preceding layers. This reuse of hierarchical features, together with the one-dimensional convolutions of the time-delay network, significantly improves computational efficiency.
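To make the "each layer's input is the concatenation of all previous layers' outputs" idea concrete, here is a minimal densely connected 1-D TDNN block in PyTorch; the channel sizes and depth are illustrative assumptions, not the CAM++ configuration.

```python
# Minimal densely connected TDNN block: every layer consumes the concatenation of
# the block input and all previous layer outputs. Channel sizes and depth are
# illustrative, not the CAM++ configuration.
import torch
import torch.nn as nn

class DenseTDNNBlock(nn.Module):
    def __init__(self, in_channels=64, growth=32, num_layers=4, context=3):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv1d(channels, growth, kernel_size=context, padding=context // 2),
                nn.BatchNorm1d(growth),
                nn.ReLU(),
            ))
            channels += growth                  # next layer sees the growing concatenation

    def forward(self, x):                       # x: (batch, channels, frames)
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))   # reuse all earlier features
        return torch.cat(feats, dim=1)

block = DenseTDNNBlock()
y = block(torch.randn(8, 64, 200))              # 8 utterances, 64-dim features, 200 frames
print(y.shape)                                  # -> (8, 64 + 4 * 32, 200)
```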

On VoxCeleb and CN-Celeb, the industry's mainstream English and Chinese benchmarks respectively, CAM++ set new best accuracy records.

Large models open up, users benefit

According to a report by the China Institute of Scientific and Technical Information, incomplete statistics put the number of large models already released in China at 79.

Riding this wave of large models opening up, the evolution of AI applications has once again entered a sprint phase.

From the user's point of view, a welcome situation is gradually taking shape:

Under the "orchestration" of large models, all kinds of AI technologies are flourishing on the application side, making tools more efficient and more intelligent.

From smart documents that auto-complete a work plan from a single "/", to audio and video transcription and analysis tools that quickly pull out the key points for you, the spark of AGI in generative large models is letting more and more people feel the magic of AI.

At the same time, for technology companies, new challenges and new opportunities have undoubtedly also emerged.

The challenge is that every product is being swept up in the large-model storm, and technological innovation has become an unavoidable, decisive issue.

The opportunity is that the time has come for new killer apps to rewrite the market landscape. But who comes out on top will depend on whose technology is better prepared and whose technology evolves faster.

In any case, users will benefit from the opening of technology.

Edit: lambor


