Google releases Gemini, its weapon of revenge, late at night: the strongest natively multi-modal model crushes GPT-4, with language understanding that surpasses human experts

新智元 ·  12/07/2023 11:53

Google's weapon of revenge, Gemini, suddenly went live late at night!

After a full year under pressure from ChatGPT, Google chose this day in December to launch its strongest counterattack.

The multi-modal Gemini, Google's largest and most capable model to date, surpassed GPT-4 across text, video, and voice, truly wiping away a year of humiliation.

Humans have five senses. The world we build and the media we consume are all presented in this way.

And the advent of Gemini is the first step towards a truly universal AI model!

The birth of Gemini represents a huge leap forward for AI models, and every Google product will be transformed along with it.

Search engines, advertising products, Chrome browsers... this is the future that Google has given us.

Epic multi-modal innovation

In the past, a multi-modal model was stitched together from separate text-only, vision-only, and audio-only models, as with OpenAI's GPT-4, DALL·E, and Whisper. However, this is not an optimal solution.

In contrast, Gemini was designed to be multi-modal from the start.

From the beginning, Gemini was trained on different modalities. The researchers then fine-tuned it with additional multi-modal data, further improving its effectiveness. As a result, it can "seamlessly" understand and reason about input in all of these modalities.

Judging from the results, Gemini's performance is far superior to that of existing multi-modal models, and its capabilities are state of the art (SOTA) in almost every field.

Being the largest and most capable model also means that Gemini can understand the world around us the way humans do, and can take in and produce any type of content: text, code, audio, images, or video.

Gemini guessed the paper ball was in the leftmost cup

Demis Hassabis, CEO and co-founder of Google DeepMind, said that Google has always been interested in a very versatile system.

What matters most here is how to mix all of these modalities: how to gather as much data as possible from any number of inputs and senses, and then respond with the same diversity.

And indeed, after DeepMind and Google Brain merged, they delivered something substantial.

It is named Gemini because of the merger of Google's two major AI laboratories. Another explanation is that the name refers to NASA's Project Gemini, which paved the way for the Apollo lunar landing program.

Surpassing humans for the first time and decisively beating GPT-4

Although nothing has been officially announced, internal sources say Gemini has parameters on the trillion scale and that the compute used to train it was five times that of GPT-4.

Since it is a model built to take on GPT-4 head-on, Gemini naturally has to go through the most stringent tests.

Google evaluated the performance of the two models on various tasks and was pleasantly surprised to discover: from natural image, audio, and video understanding to mathematical reasoning, Gemini Ultra has surpassed GPT-4 on 30 of the 32 commonly used academic benchmarks!

In the MMLU (Massive Multitask Language Understanding) test, Gemini Ultra surpassed human experts for the first time with a high score of 90.0%.

Gemini is the first model to surpass human experts on MMLU (Massive Multitask Language Understanding)

The MMLU test includes 57 disciplines, such as mathematics, physics, history, law, medicine, and ethics, and aims to examine world knowledge and problem solving skills.

In each of these 50+ different subject areas, Gemini is as good as the best experts in these fields.

Google's new approach to the MMLU benchmark lets Gemini use its reasoning skills to think more carefully before answering difficult questions. Compared with relying only on its first, intuitive response, this approach brings significant improvements.
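The article does not spell out how this "think before answering" step works. One plausible reading, sketched here purely as an assumption, is that the model samples several reasoning chains, takes the majority answer when enough of them agree, and otherwise falls back to its direct first answer. The function name `route_answer`, the `threshold` value, and the example answers are all hypothetical.

```python
from collections import Counter
from typing import List


def route_answer(sampled_answers: List[str], greedy_answer: str,
                 threshold: float = 0.6) -> str:
    """Return the majority answer if enough samples agree, else the direct answer.

    sampled_answers: final answers extracted from several sampled reasoning chains.
    greedy_answer:   the model's direct "first impression" answer.
    threshold:       hypothetical minimum share of votes needed to trust the majority.
    """
    if not sampled_answers:
        return greedy_answer
    answer, votes = Counter(sampled_answers).most_common(1)[0]
    if votes / len(sampled_answers) >= threshold:
        return answer
    return greedy_answer


# Four of five reasoning chains agree on "B", so the vote overrides the direct answer.
confident = route_answer(["B", "B", "B", "C", "B"], greedy_answer="C")
# No answer reaches the threshold, so the direct answer is kept.
uncertain = route_answer(["A", "B", "C", "D", "B"], greedy_answer="A")
print(confident, uncertain)
```

The design choice worth noting is the fallback: voting helps only when the sampled chains agree, so a low-consensus question keeps the direct answer rather than an unreliable majority.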

Gemini Ultra also scored a high 59.4% on the new MMMU benchmark, which consists of multi-modal tasks across different fields that require deliberate reasoning.

Gemini Ultra also outperformed previous leading models on image benchmarks, and it did so without the help of an OCR system!

These tests show that Gemini has strong multi-modal processing abilities and great potential for more complex reasoning.

More information can be found in the Gemini technical report:

Report address:

Medium cup, big cup, super big cup!

Gemini Ultra is the most powerful LLM ever created by Google, capable of completing highly complex tasks, mainly for data centers and enterprise applications.

Gemini Pro is the best-performing model across a wide range of tasks. It will power many of Google's AI services and will be the backbone of Bard from now on.

Gemini Nano is the most efficient model for on-device tasks. It can run locally and offline on Android devices, and Pixel 8 Pro users can try it right away. Nano-1 has 1.8B parameters and Nano-2 has 3.25B.

The most basic Gemini models handle text input and text output, while more powerful models such as Gemini Ultra can process images, video, and audio at the same time.

Not only that, Gemini can even learn to move and touch, which is more like a robot!

In the future, Gemini will gain more senses and become more aware and more accurate.

While the hallucination problem is still unavoidable, the more the model knows, the better it will perform.

Accurate understanding of text, images, and audio

Gemini 1.0 is trained to recognize and understand text, images, audio, and other forms of input at the same time, so it can better grasp nuanced information and answer questions on complex topics.

For example, a user first uploaded an audio clip in a language other than English, then recorded a second clip in English to ask questions about it.

Note that audio summarization normally relies on prompts entered as text. Gemini, on the other hand, can process two audio clips in different languages at the same time and accurately output the requested summary.

What's even more amazing is that if you want to make an omelette, you can not only ask Gemini by voice but also take a picture of the ingredients you have on hand.

Gemini will then teach you how to make the omelette step by step, combining the request in the audio with the ingredients in the picture.

You can even take a photo each time you complete a step, and Gemini will guide you on what to do next based on your actual progress.

Clumsy cooks and people who can't cook at all are saved!

Furthermore, this ability also makes Gemini particularly good at explaining reasoning problems in complex subjects such as mathematics and physics.

For example, if parents want to save some trouble while tutoring their children on homework, what should they do?

The answer is very simple. Just take a picture and upload it. Gemini's reasoning ability is sufficient to solve various science problems such as mathematics and physics.

You can ask Gemini for a more detailed explanation of any step.

You can even focus on the point where the mistake was made and ask Gemini to generate a practice problem that targets the same type of error.

Complex reasoning made easy

Additionally, Gemini 1.0 has multi-modal reasoning capabilities to better understand complex written and visual information. This gives it superior performance in uncovering hard-to-discern knowledge buried in massive amounts of data.

By reading, filtering, and understanding information, Gemini 1.0 can also extract unique insights from thousands of documents, helping to achieve new breakthroughs in many fields, from science to finance.

AlphaCode 2: coding ability that beats 85% of human competitors

Of course, benchmarks are only tests. The real test for Gemini is the users who want to use it to write code.

Writing code is a killer feature that Google created for Gemini.

The Gemini 1.0 model can understand, explain, and generate high-quality code in the world's most popular programming languages, such as Python, Java, C++, and Go. It can also work across languages and reason about complex information.

From this point of view, there is no doubt that Gemini will become one of the world's leading foundation models for coding.

Two years ago, Google launched AlphaCode, the first AI code generation system to reach a competitive level in programming competitions.

Based on a customized version of Gemini, Google has launched a more advanced code generation system — AlphaCode 2.

When faced with problems involving not only programming, but also complex mathematics and computer science theory, AlphaCode 2 showed excellent performance.

Google developers also tested AlphaCode 2 on the same testing platform as the original AlphaCode.

The results showed significant progress: the new model solved almost twice as many problems as the original AlphaCode.

AlphaCode 2 outperforms about 85% of human competitors. In contrast, AlphaCode outperformed only about 50%.

Furthermore, when human programmers collaborate with AlphaCode 2 and specify particular requirements for the code samples, AlphaCode 2's performance improves further.

AlphaCode 2 is built on powerful LLMs and incorporates advanced search and reranking mechanisms designed specifically for competitive programming.

As shown in the figure below, the new model mainly consists of the following parts:

- Multiple policy models that generate individual code samples for each problem;

- A sampling mechanism that generates diverse code samples to search the space of possible programs;

- A filtering mechanism to remove code samples that do not match the problem description;

- Clustering algorithms to group semantically similar code samples to reduce repetition;

- A scoring model to select the best candidate from each of the 10 largest clusters of code samples.
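The stages listed above can be illustrated with a toy pipeline. This is a minimal sketch under stated assumptions, not AlphaCode 2's actual implementation: the hand-written `SAMPLES` stand in for model-generated programs, clustering by identical outputs on a few probe inputs is an assumed notion of "semantically similar", and the `score` function (prefer shorter source) is a hypothetical placeholder for a learned scoring model.

```python
from collections import defaultdict

# Toy "code samples" for the problem "return 2*x + 1": (source text, callable).
# Hand-written stand-ins for programs that policy models would sample.
SAMPLES = [
    ("2*x + 1", lambda x: 2 * x + 1),
    ("x + x + 1", lambda x: x + x + 1),
    ("1 + 2*x", lambda x: 1 + 2 * x),
    ("abs(2*x) + 1", lambda x: abs(2 * x) + 1),  # wrong for negative x
    ("x + 1", lambda x: x + 1),                  # fails a public example
    ("3*x", lambda x: 3 * x),                    # fails a public example
]

PUBLIC_EXAMPLES = [(0, 1), (1, 3)]  # (input, expected output) from the problem text
PROBE_INPUTS = [-2, -1, 5]          # extra inputs used only to compare behaviour


def passes_public_examples(fn):
    """Filtering: keep only samples consistent with the problem description."""
    return all(fn(x) == y for x, y in PUBLIC_EXAMPLES)


def cluster_by_behaviour(samples):
    """Clustering: group samples that give identical outputs on the probe inputs."""
    clusters = defaultdict(list)
    for src, fn in samples:
        signature = tuple(fn(x) for x in PROBE_INPUTS)
        clusters[signature].append((src, fn))
    return list(clusters.values())


def score(sample):
    """Hypothetical scoring model: prefer shorter source text."""
    src, _ = sample
    return -len(src)


def select_candidates(samples, max_clusters=10):
    """Filter, cluster, then pick the best sample from each of the largest clusters."""
    survivors = [s for s in samples if passes_public_examples(s[1])]
    clusters = cluster_by_behaviour(survivors)
    clusters.sort(key=len, reverse=True)  # largest clusters first
    return [max(c, key=score)[0] for c in clusters[:max_clusters]]


selected = select_candidates(SAMPLES)
print(selected)
```

In this toy run the three equivalent correct programs collapse into one cluster, so only one of them is submitted, and the remaining slots are left for behaviourally different candidates. That is the point of clustering before scoring: it reduces repetition among the submitted solutions.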

For details, please refer to the AlphaCode 2 technical report:

Report address:

More reliable, more efficient, and scalable

Equally important to Google, Gemini is clearly a more efficient, reliable, and scalable model.

It was trained on Google's own Tensor Processing Units (TPUs) and runs faster and more cheaply than Google's previous models, such as PaLM.

The developers trained Gemini 1.0 at scale on AI-optimized infrastructure using TPU v4 and v5e, the tensor processing units developed in-house by Google.

Making Gemini the most reliable and scalable model to train and the most efficient to serve is an important goal for Google.

On TPU, Gemini runs significantly faster than earlier, smaller, less capable models. These custom-designed AI accelerators are at the core of Google's big model products.

After all, the products they power, such as Search, YouTube, Gmail, Google Maps, Google Play, and Android, serve billions of users. TPUs have also helped tech companies around the world train large models economically and efficiently.

In addition to Gemini, Google today unveiled its most powerful, efficient, and scalable TPU system to date, Cloud TPU v5p, designed specifically for training cutting-edge AI models.

The next generation TPU will accelerate the development of Gemini, help developers and enterprise customers train large-scale generative AI models faster, and develop new products and features.

Gemini, making Google great again?

Clearly, in the view of Pichai and Hassabis, the Gemini launch is just the beginning: a larger project is about to start.

Gemini is the model Google has been waiting for. After OpenAI and ChatGPT took the world by storm, Gemini is the answer Google arrived at after a year of exploration.

Google has been playing catch-up since it declared a "code red", but both men said they do not want to move too fast just to keep up, especially as we get closer to AGI.

Will Gemini change the world? In the best case, it can help Google catch up with OpenAI in the generative AI race.

But Pichai, Hassabis, and others all seem to think that this is the beginning of Google's true greatness.

The technical report released today did not reveal architecture details, model parameters, or training data sets.

Oren Etzioni, former CEO of the Allen Institute for Artificial Intelligence, said, “There is no reason to doubt that Gemini is better than GPT-4 on these benchmarks, but GPT-5 may do better than Gemini.”

Building a massive model like Gemini could cost hundreds of millions of dollars, but for companies that dominate the delivery of AI through the cloud, the final return could be billions or even trillions of dollars.

“This is a war that cannot be lost; it must be won.”

