
GPT-4 Shock Release: A Multimodal Large Model, Direct Upgrades to ChatGPT and Bing, an Open API. Is the Game Over?

機器之心 ·  Mar 15, 2023 16:04

Source: Heart of the Machine

ChatGPT set the tech world alight. Can GPT-4 keep the flame burning?

Who can dethrone ChatGPT? For now, it looks like only OpenAI itself.

After ChatGPT took the tech world by storm, people began discussing what the next step for AI would be. Many scholars pointed to multimodality, and the wait was short: early this morning, OpenAI released GPT-4, a large multimodal pre-trained model.

GPT-4 brings dramatic improvements in several areas: strong image understanding; a text input limit raised to 25,000 words; significantly better answer accuracy; and the ability to generate lyrics and creative text in different styles.

“GPT-4 is the world's most advanced AI system, and we hope to bring it to everyone soon,” an OpenAI engineer said in the introduction video.

It seems OpenAI wants to end the game in one stroke: it not only published a paper (really more of a technical report) and a System Card, but also upgraded ChatGPT directly to GPT-4 and opened up the GPT-4 API.

Moreover, immediately after GPT-4's release, a Microsoft marketing director said: “If you've used the new Bing preview at any point in the past six weeks, you've already had an early taste of OpenAI's latest model.” Yes, Microsoft's new Bing is already running on GPT-4.

Next, let's take a closer look at this shocking release.

GPT-4: 710 on the SAT, and fit to be a lawyer

GPT-4 is a large multimodal model that accepts image and text input and outputs text. Experiments show that GPT-4 performs at a human level on a variety of professional and academic benchmarks. For example, it passed a simulated bar exam with a score in the top 10% of test takers; GPT-3.5, by comparison, scored in the bottom 10%.

OpenAI spent six months iteratively aligning GPT-4, using its adversarial testing program and lessons learned from ChatGPT, to achieve its best results ever on factuality, steerability, and related measures.

Over the past two years, OpenAI rebuilt its entire deep learning stack and, together with Azure, co-designed a supercomputer for its workloads from the ground up. A year ago, OpenAI first exercised this system while training GPT-3.5; since then it has found and fixed a series of bugs and improved its theoretical foundations. As a result, the GPT-4 training run was unprecedentedly stable, so much so that OpenAI could accurately predict GPT-4's training performance in advance, the first large model for which this was achieved. OpenAI says it will keep focusing on reliable scaling and further refine these methods so it can anticipate future capabilities ever further ahead of time, something it considers critical for safety.

OpenAI is releasing GPT-4's text input capability through ChatGPT and an API (with a waitlist). For image input, OpenAI is working with partners to prepare for wider availability.

OpenAI is also open-sourcing OpenAI Evals today, its framework for automatically evaluating AI model performance. OpenAI says the move lets anyone report shortcomings in its models and help it improve them further.

Interestingly, in casual conversation the difference between GPT-3.5 and GPT-4 can be subtle. The difference appears once task complexity crosses a sufficient threshold: GPT-4 is more reliable and more creative than GPT-3.5 and can handle much more nuanced instructions. To understand the differences between the two models, OpenAI tested them on a variety of benchmarks, including exams originally designed for humans.

OpenAI also evaluated GPT-4 on traditional benchmarks designed for machine learning models. GPT-4 comfortably outperforms existing large language models, as well as most state-of-the-art (SOTA) models:

Many existing machine learning benchmarks are written in English. To get an initial sense of GPT-4's capabilities in other languages, the research team used Azure Translate to translate the MMLU benchmark — a set of 14,000 multiple-choice questions covering 57 topics — into multiple languages. In 24 of the 26 languages tested, GPT-4 outperformed GPT-3.5 and other large language models (Chinchilla, PaLM) even at their English-language performance:

Like the many companies building on ChatGPT, OpenAI says it uses GPT-4 internally as well, examining how large language models perform in applications such as content generation, sales, and programming. OpenAI also uses GPT-4 to help humans evaluate AI outputs, the second phase of its alignment strategy. OpenAI is thus both the developer and a user of GPT-4.

GPT-4: I can get memes

GPT-4 accepts prompts made of both text and images. This capability parallels the text-only setting: it lets users specify any vision or language task.

Specifically, given input composed of interspersed text and images, it generates the corresponding text output (natural language, code, etc.). Across a range of domains, including documents with text and photos, diagrams, and screenshots, GPT-4 shows the same capabilities it has on plain-text input. It can also be augmented with test-time techniques developed for text-only language models, including few-shot and chain-of-thought prompting.

For example, give GPT-4 a picture of a strange-looking charger and ask what's funny about it.

GPT-4 answered: “A VGA cable is being used to charge an iPhone.”

Ask it to read a chart and work out the average daily per-capita meat consumption for Georgia and Western Asia:

Clearly, today's GPT no longer fumbles the arithmetic:

Still too simple? Then give it a real exam question, this time a physics problem:

GPT-4 understood the French question and gave a complete answer:

GPT-4 can understand “what's wrong” in a picture:

GPT-4 can also devour papers in seconds. Hand it the InstructGPT paper and ask it to summarize the abstract, and you get this:

What if you're curious about one of the figures in the paper? GPT-4 can explain that too:

Next, ask GPT-4 what an infographic means:

It gave a detailed answer:

What about comics?

Asked to explain a comic about piling ever more layers onto neural networks, GPT-4 even shows a touch of humour of its own.

However, as OpenAI notes, image input is still a research preview and not yet open to the public.

The researchers evaluated GPT-4's visual abilities against academic benchmarks, but benchmarks are not the whole story: the team keeps discovering that the model can handle novel tasks in surprising ways. The contest now is between the AI's capability and human imagination.

Seeing this, some researchers may already be lamenting: computer vision (CV) is finished.

Controllability

Unlike the classic ChatGPT persona, with its fixed verbosity, tone, and style, developers (and soon ChatGPT users) can now prescribe their AI's style and task by describing them in a “system” message.

System messages let API users customize the user experience within bounds. OpenAI knows full well that many of you have been making ChatGPT role-play, and it encourages you to keep doing so.
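In API terms, a system message is simply the first entry in the `messages` list of a chat request. The sketch below only builds such a payload to show its shape; the persona text and helper function are our own illustration, and actually sending the request requires the OpenAI client library and an API key.

```python
# Sketch: a chat-request body whose "system" message pins the assistant's
# persona. We only construct the payload here; no network call is made.
def build_request(system_prompt, user_prompt, model="gpt-4"):
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }

req = build_request(
    "You are a Socratic tutor: never give answers, only guiding questions.",
    "Why does my loop never terminate?",
)
```

Swapping out the system string is all it takes to turn the same model into a different “character” for each session.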

Limitations

Despite its power, GPT-4 has limitations similar to earlier GPT models, the most important being that it is still not fully reliable. According to OpenAI, GPT-4 still hallucinates, generates wrong answers, and makes reasoning errors.

For now, language model outputs should be used with careful review, following a protocol matched to the needs of the specific use case (such as human review, grounding with additional context, or avoiding high-stakes uses altogether).

Overall, GPT-4 markedly reduces hallucinations relative to its predecessors (which themselves improved with each iteration). On OpenAI's internal adversarial factuality evaluations, GPT-4 scores 40% higher than the latest GPT-3.5:

GPT-4 also makes progress on external benchmarks such as TruthfulQA, which tests a model's ability to separate facts from an adversarially selected set of incorrect statements. The results are shown in the figure below.

The experiments show that the GPT-4 base model is only slightly better than GPT-3.5 at this task; after RLHF post-training, however, the gap becomes large. Here is an example from the test: GPT-4 doesn't always make the right choice.

The model can still exhibit various biases in its outputs. OpenAI says it has made progress here, with the goal of giving the AI systems it builds reasonable default behaviors that reflect a broad range of user values.

GPT-4 generally knows little about events after September 2021, when the vast majority of its pre-training data cuts off, and it does not learn from experience. It sometimes makes simple reasoning errors that seem out of step with its competence across so many domains, or is overly gullible, accepting obviously false statements from users. And sometimes it fails at hard problems the same way humans do, such as introducing security vulnerabilities into the code it generates.

GPT-4 can also be confidently wrong in its predictions, failing to double-check its work when a mistake is likely. Interestingly, the base pre-trained model is highly calibrated (its confidence in an answer generally matches the probability of being correct), but OpenAI's current post-training process reduces that calibration.

Risks and mitigation measures

OpenAI says the research team has been iterating on GPT-4 since the start of training to make it safer and more aligned, with efforts spanning pre-training data selection and filtering, evaluations and expert engagement, model safety improvements, and monitoring and enforcement.

GPT-4 carries risks similar to previous models, such as generating harmful advice, buggy code, or inaccurate information. At the same time, GPT-4's additional capabilities create new levels of risk. To understand the extent of these risks, the team hired more than 50 experts in AI alignment risk, cybersecurity, biorisk, trust and safety, and international security to adversarially test the model's behavior in high-risk areas. These areas require expertise to assess, and the experts' feedback and data fed into mitigations and model improvements.

Risk prevention

According to the OpenAI engineers in the demo video, GPT-4's training finished last August; the time since then was spent on fine-tuning, improvements, and, above all, scrubbing out dangerous content generation.

GPT-4 adds an extra safety reward signal during RLHF training to reduce harmful output, training the model to refuse requests for such content. The reward comes from GPT-4's zero-shot classifiers, which judge safety boundaries and how safety-related prompts are completed. To keep the model from refusing valid requests, the team collected a diverse dataset from multiple sources (e.g., labeled production data, human red-teaming, model-generated prompts) and applied the safety reward signal (with positive or negative values) to both allowed and disallowed categories.

These measures substantially improve GPT-4's safety properties. Compared with GPT-3.5, the model's tendency to respond to requests for disallowed content is down 82%, and GPT-4 responds to sensitive requests (such as medical advice or self-harm) in line with policy 29% more often.

Training process

Like previous GPT models, the GPT-4 base model is trained to predict the next word in a document, using publicly available data (such as internet data) as well as licensed data. The training data is a web-scale corpus that includes correct and incorrect solutions to math problems, weak and strong reasoning, self-contradictory and consistent statements, and a great variety of ideologies and ideas.

As a result, the base model's response to a question can be far from what the user intends. To align it with user intent, OpenAI again fine-tunes the model's behavior with Reinforcement Learning from Human Feedback (RLHF). Note that the model's capabilities seem to come mostly from pre-training: RLHF does not improve exam performance (and may even reduce it). But the model's steerability comes from post-training; the base model needs prompt engineering even to know it should be answering questions.

A major focus of the GPT-4 project was building a deep learning stack that scales predictably, chiefly because for a training run as large as GPT-4's, extensive model-specific tuning is not feasible. The team developed infrastructure and optimizations that behave predictably across many scales. To verify this, they accurately predicted GPT-4's final loss on an internal codebase (not part of the training set) in advance, extrapolating from models trained with the same method but 1/10,000th the compute.

OpenAI can now accurately predict the metric (loss) it optimizes during training. For example, the pass rate on a subset of the HumanEval dataset was successfully extrapolated from a model trained with 1/1,000th the compute:
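The idea behind predictable scaling can be illustrated with a toy power-law fit: train several small runs, fit final loss as a function of compute, then extrapolate to the target budget. The functional form and every constant below are invented for illustration; OpenAI has not published its actual fit.

```python
import math

# Toy predictable-scaling sketch: fit L(C) = a * C**(-b) + c on small runs,
# then extrapolate to a much larger compute budget C. All numbers invented.
def fit_power_law(compute, loss, c=1.0):
    # Linearize: log(L - c) = log(a) - b * log(C), then ordinary least squares.
    xs = [math.log(C) for C in compute]
    ys = [math.log(L - c) for L in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    b = -slope                      # loss decreases with compute, so slope < 0
    a = math.exp(my + b * mx)       # intercept back-transformed from log space
    return a, b

# Synthetic "small runs" drawn from a known law: L = 2.0 * C**-0.05 + 1.0
compute = [1e18, 1e19, 1e20]
loss = [2.0 * C ** -0.05 + 1.0 for C in compute]
a, b = fit_power_law(compute, loss)

# Extrapolate 100x beyond the largest small run.
predicted = a * (1e22) ** -b + 1.0
```

Because the synthetic data follows the assumed law exactly, the fit recovers a = 2.0 and b = 0.05; real runs are noisy, which is part of why predicting capabilities (rather than loss) remains hard.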

Some abilities remain hard to predict. The Inverse Scaling Prize, for example, set out to find metrics that get worse as model compute increases, and hindsight neglect was one of the winning tasks. GPT-4 reverses this trend.

Accurately predicting future machine learning capabilities is critical to safety, yet it has not received the attention it deserves. OpenAI says it is investing more effort in developing the relevant methods and calls on the industry to join in.

OpenAI also says it is open-sourcing OpenAI Evals, its software framework for creating and running benchmarks that evaluate models like GPT-4, while inspecting model performance sample by sample.

Direct upgrade of ChatGPT to GPT-4

After GPT-4 was released, OpenAI directly upgraded ChatGPT. ChatGPT Plus subscribers can get GPT-4 access with a usage limit on chat.openai.com.

To access the GPT-4 API (which uses the same ChatCompletions API as gpt-3.5-turbo), users can sign up for the waitlist; OpenAI will gradually invite developers to try it.

Once granted access, users can make text-only requests to the GPT-4 model (image input is still in a limited alpha). Pricing is $0.03 per 1K prompt tokens and $0.06 per 1K completion tokens, with default rate limits of 40K tokens per minute and 200 requests per minute.

GPT-4's context length is 8,192 tokens. OpenAI also offers limited access to a 32,768-token-context version (roughly 50 pages of text), which will likewise be updated automatically over time (the current version, gpt-4-32k-0314, is supported until June 14). Its pricing is $0.06 per 1K prompt tokens and $0.12 per 1K completion tokens.
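From the listed prices, a call's cost is simple arithmetic on its token counts. A small helper (the function and table names are our own) makes the math concrete:

```python
# Cost estimate from the per-1K-token prices listed above (USD).
PRICES = {
    "gpt-4":     {"prompt": 0.03, "completion": 0.06},
    "gpt-4-32k": {"prompt": 0.06, "completion": 0.12},
}

def estimate_cost(model, prompt_tokens, completion_tokens):
    """Return the USD cost of one call given its token counts."""
    p = PRICES[model]
    return (prompt_tokens / 1000) * p["prompt"] + (
        completion_tokens / 1000
    ) * p["completion"]
```

For example, a gpt-4 call with 1,500 prompt tokens and 500 completion tokens costs 1.5 × $0.03 + 0.5 × $0.06 = $0.075.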

That's all of today's GPT-4 news from OpenAI. One disappointment: the technical report OpenAI published contains no further details on model architecture, hardware, compute, or the like, so in that sense it is not very open.

In any case, users who couldn't wait are probably already putting it through its paces.

Finally, we'd like to ask our readers: what's your reaction to the GPT-4 release?

Editor/Somer

The translation is provided by third-party software.


The above content is for informational or educational purposes only and does not constitute any investment advice related to Futu. Although we strive to ensure the truthfulness, accuracy, and originality of all such content, we cannot guarantee it.