
The dark side of the AI industry: how OpenAI, Google, and Meta source their training data

cls.cn · Apr 8 08:10

① As early as two or three years ago, the major tech giants had all hit bottlenecks in training data and, one after another, set off down the path of "taking shortcuts";

② These practices have since embroiled them in a large number of copyright disputes;

③ Even with these workarounds, the depletion of high-quality Internet data has become an urgent problem for the major tech companies.

Various indications suggest that these companies, now at the forefront of the global AI wave, had fallen into a desperate hunt for training data as early as several years ago; to feed it, they were willing to rewrite their own policy terms and disregard the rules governing the use of Internet content, all to make their products more capable.

In an investigative report published this weekend, The New York Times detailed some of the "shortcut" measures that companies such as OpenAI, Google, and Meta have taken to obtain training data, and laid bare the predicament closing in on the entire industry.

US tech giants each take “shortcuts”

At the end of 2021, OpenAI, then training GPT-4, ran into a difficult problem: the company had exhausted every reliable English-language text resource on the Internet, yet it needed ever larger volumes of data to train more powerful models.

To deal with this problem, OpenAI created its Whisper speech recognition tool, which it used to transcribe the audio of videos on Google's YouTube platform and generate large amounts of conversational text.

The report said that the team, which included OpenAI President Greg Brockman, transcribed more than one million hours of YouTube videos in total. The resulting text was then fed into GPT-4's training and became part of the foundation of the chatbot ChatGPT.
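
For readers unfamiliar with the tool, the version of Whisper that OpenAI later open-sourced can be driven in a few lines of Python. The sketch below shows the general transcription technique only; it is not OpenAI's internal YouTube pipeline, and the audio file name is a placeholder.

```python
# Minimal transcription sketch using the open-source "whisper" package
# (pip install openai-whisper). Illustrative only: this is not OpenAI's
# internal pipeline, and "talk.mp3" is a placeholder file path.
import whisper

model = whisper.load_model("base")      # small multilingual checkpoint
result = model.transcribe("talk.mp3")   # run speech-to-text on the file
print(result["text"])                   # the plain transcript
```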

Google's policies prohibit users from repurposing videos on the platform for "standalone" applications, and from accessing its videos through any automated means such as crawlers.

Interestingly, while OpenAI was quietly harvesting YouTube videos, Google was transcribing content from its own video platform to train large models, likewise at the risk of infringing copyright. Partly for that reason, even though some Google employees knew what OpenAI was doing, they did not intervene: had Google protested, it could have drawn the same fire onto itself.

Asked whether it had used YouTube videos to train its AI, OpenAI said it drew on data from "multiple sources." Google spokesman Matt Bryant said the company had been unaware of OpenAI's actions and that it prohibits anyone from "scraping or downloading YouTube videos without authorization." Bryant added, however, that the company acts only when it has a clear legal or technical basis to do so.

Google's own terms allow it to use these videos to develop new features for the video platform, but there is considerable doubt about whether that wording means Google may use the data to develop commercial AI products.

Meanwhile, Meta's internal meeting minutes show that engineers and product managers discussed a plan to buy the large American publisher Simon & Schuster outright in order to obtain long-form text. They also discussed collecting copyrighted content from across the Internet, saying that "it takes too much time to negotiate licenses with publishers, artists, musicians, and the news industry."

According to the report, some Meta executives argued that since OpenAI appeared to be using copyrighted material, the company could follow that "market precedent."

A more visible change is that Google revised its terms of service last year. Internal documents show that one motivation for changing the privacy policy was to let Google use public Google Docs files, restaurant reviews on Google Maps, and other online material to develop AI products. Google ultimately released the revised privacy policy on July 1, just before the US Independence Day (July 4) holiday, incorporating "using publicly available information to train AI models" for the first time.

Bryant responded that the company does not use users' Google Docs content to train AI without their "explicit permission," a reference to an opt-in program for testing experimental features.

Even so, it's not enough

Precisely because of these maneuvers, and as the public has grown astonished at AI's capabilities in recent years, more and more rights holders have realized that their data was quietly taken to train AI. The New York Times, along with a number of filmmakers and writers, has taken these technology companies to court, and the US Copyright Office is drafting guidelines on how copyright law applies in the AI era.

The problem is that even as some writers and producers describe the technology companies' conduct as "the biggest theft in American history," there is still not enough data for those companies to develop the next generation of AI.

At the beginning of 2020, Jared Kaplan, a theoretical physicist at Johns Hopkins University (now Anthropic's chief science officer), published a paper stating plainly that the more data a large language model is trained on, the better it performs. "Scale is everything" has been the creed of the AI industry ever since.
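
The paper's central claim can be written compactly. In Kaplan et al.'s formulation, when model size and compute are ample, test loss falls as a power law in the number of training tokens D; the exponent below is the paper's fitted value, and this simplified form omits the model-size and compute terms of the full result.

```latex
% Data-scaling law from Kaplan et al. (2020), simplified to the
% data-limited case: loss L falls as a power law in dataset size D.
L(D) \approx \left( \frac{D_c}{D} \right)^{\alpha_D},
\qquad \alpha_D \approx 0.095
```

In plain terms, a tenfold increase in training tokens multiplies the loss by roughly 10^(-0.095) ≈ 0.80, a cut of about 20 percent, which is why each generation of models demands vastly more text than the last.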

GPT-3, released in November 2020, was trained on roughly 300 billion tokens. In 2022, Google DeepMind tested 400 AI models; the best performer among them, a model called Chinchilla, was trained on 1.4 trillion tokens. By 2023, the Skywork model developed by Chinese researchers used 3.2 trillion English and Chinese tokens for training, while the training data for Google's PaLM 2 reached 3.6 trillion tokens.

The research institute Epoch puts it bluntly: technology companies are now consuming data faster than the Internet produces it, and they could exhaust its stock of high-quality data as soon as 2026.

Faced with this problem, OpenAI CEO Sam Altman has proposed a way out: companies like OpenAI will eventually train AI on AI-generated data, known as synthetic data. Developers could then build more powerful technology while reducing their reliance on copyrighted material.

OpenAI and a number of other institutions are currently studying whether two different models can be paired to produce more useful and reliable synthetic data: one system generates the data, while the other judges its quality. The viability of this technical path is still debated; a rough sketch of the idea follows below.
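
As an illustration of that two-model loop, the hypothetical Python sketch below generates candidate samples and keeps only those a judge scores highly. Both models here are toy stand-ins (in practice each would be a separately trained LLM), and no real vendor API is implied.

```python
# Hypothetical sketch of a generator/judge loop for synthetic training
# data. Both "models" are toy stubs standing in for real LLMs.
import random

def generator(prompt: str) -> str:
    """Toy 'generator' model: emits a candidate answer for the prompt."""
    return f"{prompt} -> candidate answer #{random.randint(1, 1000)}"

def judge(sample: str) -> float:
    """Toy 'judge' model: scores a candidate's usefulness in [0, 1]."""
    return random.random()

def build_synthetic_dataset(prompts, threshold=0.8, tries_per_prompt=5):
    """Keep only generated samples the judge scores above the threshold."""
    kept = []
    for prompt in prompts:
        for _ in range(tries_per_prompt):
            sample = generator(prompt)
            if judge(sample) >= threshold:
                kept.append(sample)
                break  # one accepted sample per prompt is enough here
    return kept

if __name__ == "__main__":
    data = build_synthetic_dataset(["Explain photosynthesis", "Sum 2+2"])
    print(f"kept {len(data)} judged samples")
```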

Jeff Clune, a former OpenAI researcher, compares the data these AI systems need to a path through the jungle: if companies train only on synthetic data, the AI may get lost in it.
