
GPT-4 gets “beaten up” live by an on-device small model; SenseTime SenseNova 5.0: fully benchmarked against GPT-4 Turbo

量子位 (QbitAI) · Apr 25 20:24

Source: QbitAI (量子位)

Hard as it may be to believe, GPT-4 just got “beaten up” in public, with barely a chance to fight back:

Yes, this memorable scene played out during a live head-to-head match of Street Fighter.

What's more, the two contestants weren't even in the same “weight class”:

  • Green fighter: controlled by GPT-4

  • Red fighter: controlled by an on-device small model

So where does this small but mighty player come from?

No suspense: it is SenseChat Lite (商量轻量版, the lightweight edition of SenseChat), the on-device large model in the SenseNova (日日新) family that SenseTime recently released.

In Street Fighter alone, this little model embodies the old kung-fu adage that “the only unbeatable technique is speed”:

While GPT-4 is still deciding what to do, SenseChat Lite's fist has already landed.

SenseTime CEO Xu Li then raised the difficulty on stage, switching the phone offline and testing directly without a network connection.

For example, here it is drafting an employee's one-week leave request while fully offline:

△ Recorded at real-time speed on site

(Of course, Xu Li joked: “That's too long a leave, so it won't be approved~”)

It can also quickly summarize a long text:

△ Recorded at real-time speed on site

It can do this because SenseChat Lite's performance reaches SOTA level among models of the same scale.

In several benchmarks it even punched above its weight, beating Llama 2-7B and even the 13B version.

On speed, SenseChat Lite uses an MoE framework that works in tandem with the cloud: in some scenarios as much as 70% of inference runs on the device, which further lowers inference cost.

Concretely, against a human reading speed of roughly 20 words per second, SenseChat Lite reaches an inference speed of 18.3 words per second on mid-range phones.

On high-end flagship phones, the inference speed soars to 78.3 words per second.
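The article doesn't spell out how the end-cloud split is decided; purely as an illustration of the idea, a routing layer along these lines could keep most requests on the device (the Request fields and the rule below are invented, not SenseTime's implementation):

```python
from dataclasses import dataclass

# Hypothetical sketch of end-cloud collaborative routing. The article only says
# the MoE setup keeps roughly 70% of inference on the device in some scenarios;
# the rule below is illustrative, not SenseTime's actual logic.

@dataclass
class Request:
    prompt: str
    needs_long_context: bool = False     # e.g. very long documents
    needs_heavy_reasoning: bool = False  # e.g. multi-step tool use

def route(req: Request) -> str:
    """Decide whether a request runs on the device model or falls back to cloud."""
    if req.needs_long_context or req.needs_heavy_reasoning:
        return "cloud"    # heavier requests go to the full cloud model
    return "device"       # everyday prompts stay on the phone: low latency, works offline

print(route(Request("Draft a one-week leave request")))                        # -> device
print(route(Request("Analyze this long contract", needs_long_context=True)))  # -> cloud
```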

Beyond text generation, Xu Li also demonstrated the multimodal capabilities of SenseTime's on-device model live.

Take image expansion (outpainting), for example: even starting half a beat later, SenseTime's on-device model expanded three different images faster than a competitor's model expanded one:

The presenters even took photos directly on stage, shrank them substantially, and then expanded them back out at will:

You have to admit, SenseTime was bold enough to run real, live demos.

That said, looking at the event as a whole, the on-device model was only a small part of this launch.

On the “foundation” side, SenseTime also pushed its SenseNova (日日新) model family to a major new version, SenseNova 5.0, and positioned it at a new level:

Fully benchmarked against GPT-4 Turbo!

So how strong is SenseNova 5.0 really? Let's run a round of hands-on tests~

Bring on “Ruozhiba”!

Ever since playing with large models became popular, Ruozhiba (弱智吧) questions have been a go-to test of a model's logical ability, informally dubbed the “Ruozhiba benchmark.”

(Ruozhiba is a sub-forum of Baidu Tieba, a Chinese community known for absurd, bizarre, and deliberately illogical statements.)

Not long ago, Ruozhiba even appeared in a serious AI paper, where it turned out to be some of the best-performing Chinese training data, sparking plenty of discussion.

So when SenseChat 5.0, the text-dialogue model, meets Ruozhiba, what kind of sparks will fly?

Logical reasoning: Ruozhiba

Please listen to the first question:

Why didn't my parents invite me to their wedding?

SenseChat's answer differs from other AIs: it responds in the first person, in a more anthropomorphic way, and there is no redundant filler in the result; it accurately answers and explains that “you had not been born yet when they got married.”

Please listen to question 2:

You can get online at an internet bar (网吧), so why can't you “get ruozhi” at Ruozhiba (弱智吧)?

Again, SenseChat directly and accurately pointed out that “this is a joke question” and that “Ruozhiba is not an actual place.”

It's easy to see that when it comes to Ruozhiba-style trick logic, SenseChat 5.0 can already hold its own.

Natural language: the gaokao “Dream of the Red Chamber” essay

Beyond logical reasoning, for natural-language generation we can directly pit GPT-4 against SenseChat 5.0 on the 2022 gaokao (college entrance exam) essay prompt.

Judging from the results, GPT-4's essay still reads like an “AI template,” while SenseChat 5.0's is genuinely literary: its parallel phrasing is neat, and it quotes the classics.

Clearly, the AI's train of thought has opened up and diverged.

Math ability: making the complex simple

Putting GPT-4 and SenseChat 5.0 on the same stage once more, this time we test their math ability:

Mom brewed Yuanyuan a cup of coffee. After drinking half the cup, Yuanyuan topped it up with water; she then drank another half cup and topped it up with water again, and finally drank it all. Question: did Yuanyuan drink more coffee or more water? How many cups of each did she drink?

For humans this is a fairly simple question, but GPT-4 went through what looked like a careful, serious derivation and still got the result wrong.

The underlying reason is that the chain-of-thought logic behind the model is not fully built out, so it easily trips on off-the-beaten-path problems; SenseChat 5.0, by contrast, got both the approach and the result right.
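For reference, the bookkeeping behind the puzzle is simple: Yuanyuan empties the cup in the end, so she drinks the entire original cup of coffee plus the two half-cup top-ups of water, i.e. one cup of coffee and one cup of water, the same amount of each. A minimal simulation (ours, not from the article) confirms it:

```python
# Track how much coffee and water Yuanyuan drinks; all quantities are in cups.
cup = {"coffee": 1.0, "water": 0.0}      # start with a full cup of coffee
drunk = {"coffee": 0.0, "water": 0.0}

def drink(amount):
    """Drink `amount` cups of the current (well-mixed) contents."""
    total = cup["coffee"] + cup["water"]
    for k in cup:
        share = cup[k] * amount / total
        cup[k] -= share
        drunk[k] += share

drink(0.5); cup["water"] += 0.5          # drink half a cup, top up with water
drink(0.5); cup["water"] += 0.5          # drink another half cup, top up again
drink(cup["coffee"] + cup["water"])      # finish the rest

print(drunk)   # {'coffee': 1.0, 'water': 1.0} -> equal amounts, one cup each
```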

Similarly, on the “eagle catches the chicks” question below (based on the traditional children's game), GPT-4 apparently failed to grasp the rules of the game and again calculated the wrong answer:

The hands-on experience gives you a feel for its ability, and the benchmark leaderboard numbers reflect it even more directly:

On mainstream objective evaluations it has matched or surpassed GPT-4 Turbo.

So how did SenseNova 5.0 pull this off? In a nutshell: data in one hand, compute in the other.

First, to break the bottleneck at the data level, SenseTime trained on more than 10TB of tokens, ensuring full coverage with high-quality data and giving the model a basic understanding of objective knowledge and the world.

SenseTime also synthesized hundreds of billions of tokens of chain-of-thought data, the key data-level move this time, which activates the model's capacity for strong reasoning.

Second, at the compute level, SenseTime jointly optimizes algorithm design and compute infrastructure: the topological limits of the infrastructure define the next stage of the algorithm, and new advances in algorithms in turn demand a new understanding of how the infrastructure is built.

This joint iteration of algorithms and compute is the core capability of SenseTime's AI infrastructure (its “large device”).

Overall, the highlights of the SenseNova 5.0 update can be summarized as follows:

  • Adopts an MoE architecture

  • Trained on over 10TB of tokens, with a large amount of synthetic data

  • Inference context window of up to 200K

  • Knowledge, reasoning, math, and coding abilities fully benchmarked against GPT-4 Turbo

Beyond this, in the multimodal domain SenseNova 5.0 also achieved leading results on a number of core metrics:

As usual, let's keep looking at the multimodal generation results.

Better at reading images

For example, “feed” SenseChat 5.0 a super-long image (646×130000 pixels) and simply ask it to read it, and you get an overview of everything in it:
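The article doesn't say how such an extreme 646×130000 image is handled internally; a common generic trick is to slice very tall screenshots into overlapping tiles before passing them to a vision-language model. A hypothetical sketch:

```python
from PIL import Image

def slice_long_image(path: str, tile_height: int = 1024, overlap: int = 64):
    """Yield vertical tiles of a very tall image, with a small overlap between tiles."""
    img = Image.open(path)
    width, height = img.size
    top = 0
    while top < height:
        bottom = min(top + tile_height, height)
        yield img.crop((0, top, width, bottom))
        if bottom == height:
            break
        top = bottom - overlap   # step down, keeping a little context from the previous tile

# tiles = list(slice_long_image("long_screenshot.png"))
# Each tile would then be captioned/OCR'd separately and the results merged;
# this is a generic approach, not necessarily what SenseNova 5.0 does.
```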

Throw SenseChat 5.0 an amusing photo of a cat, and it infers the cat is celebrating its birthday from details such as the party hat, the cake, and the “happy birthday” lettering.

For a more practical case, upload a complex screenshot and SenseChat 5.0 accurately extracts and summarizes the key information, whereas GPT-4 made mistakes during recognition:

SenseMirage (秒画) 5.0: head-to-head with the top three

For text-to-image generation, SenseNova 5.0 went head-to-head with Midjourney, Stable Diffusion, and DALL·E 3.

In terms of style, for example, the image generated by SenseMirage is arguably closer to the “National Geographic” look specified in the prompt:

For portraits, it can render more complex skin textures:

Even text can be accurately embedded in images:

There is also an anthropomorphic large model

SenseTime also launched a somewhat special large model in this release: the anthropomorphic (character role-play) model.

In practice, it can already play all kinds of dimension-crossing characters, such as film and TV characters, real-life celebrities, and anime characters, and hold emotionally attuned conversations with you.

Functionally, the anthropomorphic model supports character creation and customization, knowledge-base construction, long conversation memory, and more, even group chats with three or more participants.

Building on all these multimodal capabilities, another major member of SenseTime's large-model family, Raccoon (小浣熊), has also been upgraded.

Office work and programming get easier

SenseTime's Raccoon currently comes in two flavors, Office Raccoon and Code Raccoon, which, as the names suggest, target office scenarios and programming scenarios respectively.

With Office Raccoon, processing spreadsheets, documents, and even code files becomes a “drop a file, ask a question” affair.

Take a procurement scenario as an example: first upload supplier list files from different sources, then tell Office Raccoon:

Unit, unit price, notes. Since the header information differs across the sheets, similar headers can be merged. Display the resulting table in the dialog box and generate a local download link, thank you.

A few moments later, we get the processed results.

Moreover, in the left-hand panel Office Raccoon also shows the Python code behind the analysis, keeping the whole process “traceable.”
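That generated code isn't shown in the article, but a code-interpreter workflow of this kind typically reduces to something like the pandas sketch below (the file names, header mapping, and output path are made up for illustration):

```python
import pandas as pd

# Hypothetical sketch of the kind of code a code-interpreter assistant might
# generate for the supplier-list task; names here are invented, not the demo's.
files = ["supplier_a.xlsx", "supplier_b.csv"]

# Map inconsistent headers from different sources onto one common schema.
header_map = {
    "单价": "unit_price", "Unit Price": "unit_price",
    "单位": "unit",       "Unit": "unit",
    "备注": "notes",      "Remarks": "notes",
}

frames = []
for path in files:
    df = pd.read_excel(path) if path.endswith(".xlsx") else pd.read_csv(path)
    df = df.rename(columns=header_map)        # merge similar headers
    frames.append(df)

merged = pd.concat(frames, ignore_index=True)
merged.to_excel("merged_suppliers.xlsx", index=False)  # the "local download" artifact
print(merged.head())
```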

We can also upload multiple documents such as inventory information and purchase requirements at the same time:

Keep issuing requests, and Office Raccoon keeps completing the tasks quickly.

And even if the data format is not standardized, it can detect and resolve it on its own:

Numerical calculations, of course, are no problem either; they too are just one request away.

Office Raccoon can also do visualization from data files, directly rendering tricky charts such as heat maps:

In short, Office Raccoon can process multiple files of different types (Excel, CSV, JSON, and so on) and is strong at understanding Chinese, mathematical calculation, and data visualization. By working as a code interpreter, it also improves the accuracy and controllability of what the large model generates.

At the launch event, Office Raccoon also demonstrated on-the-spot analysis over a complex database.

Last week, Zhou Guanyu, China's first F1 driver, completed his home race at the F1 Chinese Grand Prix. At the event, SenseTime “fed” Office Raccoon a database file containing a huge amount of data so it could analyze Zhou Guanyu and F1 racing on the spot.

For example, it pulled up Zhou Guanyu's race statistics, counted how many F1 drivers have won championships, and ranked them from most to fewest titles. These calculations involved larger, more logically complex data tables and details across more dimensions, such as lap counts and title counts, and it ultimately gave completely correct answers for all of them.
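As an illustration (not the demo's actual code or schema), the championship ranking boils down to a simple group-and-count over a results table:

```python
import pandas as pd

# Hypothetical table with one row per season: columns "season" and "driver".
# The file name and columns are invented for illustration.
champions = pd.read_csv("champions.csv")

titles = (
    champions.groupby("driver")["season"]
    .count()
    .sort_values(ascending=False)   # most championship titles first
    .rename("titles")
)

print("drivers with at least one championship:", titles.shape[0])
print(titles.head(10))
```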

On the programming side, Code Raccoon can take a programmer's efficiency straight to “Pro Max” levels.

For example, just install the plugin in VS Code:

From then on, every step of programming becomes a matter of typing a sentence of natural language.

For example, hand the requirements document to Code Raccoon and simply say:

Help me write a detailed PRD for WeChat scan-to-pay on the public cloud. The PRD's format and content should follow the “Product Requirements Document (PRD) Template,” and the generated content should be clear, complete, and detailed.

Code Raccoon then gets to work on the requirements analysis:

Code Raccoon can also do architecture design for you:

You can also request code in natural language, or use one-click actions to comment, test, generate, translate, refactor, or modify code:
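As a rough illustration of what one-click commenting and test generation amounts to (this example is ours, not Code Raccoon's actual output):

```python
# Illustration only: a small function plus the kind of docstring and unit test
# an assistant might add via "one-click" actions. Not Code Raccoon's output.

def parse_amount(text: str) -> float:
    """Parse a price string such as '¥1,234.50' into a float."""
    return float(text.replace("¥", "").replace(",", "").strip())

def test_parse_amount():
    # A generated test would cover typical and edge inputs like these.
    assert parse_amount("¥1,234.50") == 1234.5
    assert parse_amount("99") == 99.0
```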

The final software-testing stage can likewise be left to Code Raccoon.

All in all, Code Raccoon can take over the repetitive, fiddly programming chores of everyday work.

And SenseTime didn't stop there: it also “packaged” Code Raccoon into a lightweight all-in-one appliance.

A single appliance can support development for a team of 100 people, at a cost of only 4.5 yuan per person per day.

That covers the main content of SenseTime's launch.

Finally, one more topic is worth summing up.

SenseTime's large-model playbook

Looking at the launch as a whole, the first and most immediate impression is that it was comprehensive enough.

Whether the on-device model or the SenseNova 5.0 “foundation,” this was a full-stack release or upgrade spanning cloud, edge, and device; in terms of capability, it covers nearly every mainstream AIGC “tag”: language, knowledge, reasoning, math, code, and multimodality.

Second, it packs enough of a punch.

Take SenseNova 5.0's overall strength as an example. Among domestic large-model players today, arguably only a handful can fully stand up to GPT-4; moreover, SenseTime dared to test multiple capabilities live on stage and to open up hands-on access as early as possible, which says something about its confidence in its own strength.

Finally, it's fast enough.

SenseTime's speed is not just the runtime speed of the on-device model; seen more broadly, it is the pace of its own iterative optimization. Stretch out the timeline and this speed becomes especially obvious:

  • SenseNova 1.0 → 2.0: 3 months

  • SenseNova 2.0 → 4.0: 6 months

  • SenseNova 4.0 → 5.0: 3 months

On average, that's a major version upgrade roughly once a quarter, with a substantial jump in overall capability each time.

So the next question is: why can SenseTime do this?

First, at the macro level, it comes down to the “large model + large device” playbook SenseTime has always emphasized.

“Large model” refers to the SenseNova model system, which provides a range of large models and capabilities such as natural language processing, image generation, automated data annotation, and custom model training.

“Large device” refers to SenseTime's high-efficiency, low-cost, large-scale next-generation AI infrastructure, built around the development, generation, and application of large AI models; its total computing power reaches 12,000 petaFLOPS, with more than 45,000 GPUs already deployed.

What the two have in common is that they were laid out long ago: neither is a product of the AIGC boom; both are forward-looking efforts that trace back several years.

Second, going deeper to the model level: based on its own testing and hands-on practice, SenseTime has formed its own understanding and interpretation of the scaling law the industry broadly agrees on.

The scaling law generally says that as data volume, parameter count, and training time grow, the model's performance improves, a “brute force works miracles” flavor of progress.

This law also contains two hidden assumptions:

  • Predictability: performance can be predicted accurately across 5 to 7 orders of magnitude

  • Order preservation: performance advantages verified at small scale are preserved at larger scale

The scaling law can therefore guide the search for optimal model architectures and data recipes within limited R&D resources, so that large models learn efficiently.
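As an aside on the predictability assumption: in generic practice it is exploited by fitting a power law to small pilot runs and extrapolating to far larger budgets. A minimal sketch with made-up numbers (not SenseTime's methodology):

```python
import numpy as np

# Fit loss ≈ a * C^b (b < 0) on small pilot runs, then extrapolate.
compute = np.array([1e17, 1e18, 1e19, 1e20])   # FLOPs of small runs (made up)
loss    = np.array([3.2, 2.9, 2.65, 2.45])     # observed validation losses (made up)

b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)   # linear fit in log-log space
a = np.exp(log_a)

predicted = a * (1e24) ** b   # extrapolate several orders of magnitude upward
print(f"predicted loss at 1e24 FLOPs: {predicted:.2f}")
```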

It is also out of this observation and practice that SenseTime's “small but capable” on-device model was born.

Beyond this, SenseTime also has its own understanding of a three-tier architecture for large models, KRE (Knowledge, Reasoning, Execution).

Xu Li gave an in-depth interpretation of this.

The first tier is knowledge: the comprehensive infusion of knowledge about the world.

Today's new productivity tools, large models included, solve problems almost entirely on this basis: they answer your questions using solutions others have already worked out.

This counts as the basic skill of a large model, but higher-level knowledge should come from reasoning on top of it, that is, new knowledge obtained by inference. That is the second tier of the architecture: reasoning, a qualitative leap in rational thinking.

This tier of capability is the key and the core, determining whether a large model is smart enough and whether its ability can generalize beyond what it has seen.

Finally, execution refers to interactively transforming content in the world, that is, how to interact with the real world (for now, embodied intelligence sits at this tier as something of a stock with potential).

Although the three tiers are independent of one another, they are also tightly linked layer by layer. Xu Li offered a vivid analogy:

From knowledge to reasoning is like the cerebrum; from reasoning to execution is like the cerebellum.

In SenseTime's view, this three-tier architecture describes the abilities a large model ought to have, and it is the key insight that guided SenseTime in building high-quality data; more than that, many of the products in this launch are built on the logic of KRE.

The last question, then: built on KRE and the “large model + large device” route, how far has SenseNova actually been put to work in industry?

As the saying goes, “practice is the sole criterion for testing truth,” and feedback from customers is probably the most realistic answer.

On this front, SenseTime turned in a high-scoring report card: on site, Huawei, WPS, Xiaomi, Yuewen (China Literature), and Haitong Securities all shared the cost savings and efficiency gains the SenseNova model system has brought to their businesses, from office work to entertainment and from finance to devices.

All in all, with technology, compute, methodology, and scenarios in hand, SenseNova's next steps in the AIGC era are worth looking forward to.

Editor: lambor
