
The dark side of the AI industry: how OpenAI, Google, and Meta source their training data

cls.cn · Apr 8 08:10

① As early as two or three years ago, the major tech giants had all hit bottlenecks in training data and, one after another, set off down the path of "taking shortcuts";

② These practices have since embroiled them in a large number of copyright disputes;

③ Even with these workarounds, the depletion of high-quality Internet data has become an urgent problem for the major tech companies.

Various indications suggest that these companies, now at the forefront of the global AI wave, had fallen into a desperate hunt for training data as early as several years ago; to feed it, they were willing to rewrite their own policy terms and disregard the rules governing the use of Internet content, all to make their products more capable.

In an investigative report published this weekend, The New York Times detailed some of the "shortcut" measures that companies such as OpenAI, Google, and Meta have taken to obtain training data, and laid bare the predicament closing in on the entire industry.

US tech giants each take “shortcuts”

At the end of 2021, OpenAI, then training GPT-4, ran into a difficult problem: the company had exhausted every reliable English-language text resource on the Internet, yet it needed ever larger volumes of data to train more powerful models.

To deal with this problem, OpenAI created its Whisper speech recognition tool, which it used to transcribe the audio of videos on Google's YouTube platform and generate large amounts of conversational text.

The report said that the team, which included OpenAI President Greg Brockman, transcribed more than one million hours of YouTube videos in total. The resulting text was then fed into GPT-4's training and became part of the foundation of the chatbot ChatGPT.
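
For readers unfamiliar with the tool, the version of Whisper that OpenAI later open-sourced can be driven in a few lines of Python. The sketch below shows the general transcription technique only; it is not OpenAI's internal YouTube pipeline, and the audio file name is a placeholder.

```python
# Minimal transcription sketch using the open-source "whisper" package
# (pip install openai-whisper). Illustrative only: this is not OpenAI's
# internal pipeline, and "talk.mp3" is a placeholder file path.
import whisper

model = whisper.load_model("base")      # small multilingual checkpoint
result = model.transcribe("talk.mp3")   # run speech-to-text on the file
print(result["text"])                   # the plain transcript
```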

Google's policies prohibit users from repurposing videos on the platform for "standalone" applications, and from accessing its videos through any automated means such as crawlers.

Interestingly, while OpenAI was quietly harvesting YouTube videos, Google was transcribing content from its own video platform to train large models, likewise at the risk of infringing copyright. Partly for that reason, even though some Google employees knew what OpenAI was doing, they did not intervene: had Google protested, it could have drawn the same fire onto itself.

Asked whether it had used YouTube videos to train its AI, OpenAI said it drew on data from "multiple sources." Google spokesman Matt Bryant said the company had been unaware of OpenAI's actions and that it prohibits anyone from "scraping or downloading YouTube videos without authorization." Bryant added, however, that the company acts only when it has a clear legal or technical basis to do so.

Google's own terms allow it to use these videos to develop new features for the video platform, but there is considerable doubt about whether that wording means Google may use the data to develop commercial AI products.

Meanwhile, Meta's internal meeting minutes show that engineers and product managers discussed a plan to buy the large American publisher Simon & Schuster outright in order to obtain long-form text. They also discussed collecting copyrighted content from across the Internet, saying that "it takes too much time to negotiate licenses with publishers, artists, musicians, and the news industry."

According to the report, some Meta executives argued that since OpenAI appeared to be using copyrighted material, the company could follow that "market precedent."

A more visible change is that Google revised its terms of service last year. Internal documents show that one motivation for changing the privacy policy was to let Google use public Google Docs files, restaurant reviews on Google Maps, and other online material to develop AI products. Google ultimately released the revised privacy policy on July 1, just before the US Independence Day (July 4) holiday, incorporating "using publicly available information to train AI models" for the first time.

Bryant responded that the company does not use users' Google Docs content to train AI without their "explicit permission," a reference to an opt-in program for testing experimental features.

Even so, it's not enough

Precisely because of these maneuvers, and as the public has grown astonished at AI's capabilities in recent years, more and more rights holders have realized that their data was quietly taken to train AI. The New York Times, along with a number of filmmakers and writers, has taken these technology companies to court, and the US Copyright Office is drafting guidelines on how copyright law applies in the AI era.

The problem is that even as some writers and producers describe the technology companies' conduct as "the biggest theft in American history," there is still not enough data for those companies to develop the next generation of AI.

At the beginning of 2020, Jared Kaplan, a theoretical physicist at Johns Hopkins University (now Anthropic's chief science officer), published a paper stating plainly that the more data a large language model is trained on, the better it performs. "Scale is everything" has been the creed of the AI industry ever since.
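
The paper's central claim can be written compactly. In Kaplan et al.'s formulation, when model size and compute are ample, test loss falls as a power law in the number of training tokens D; the exponent below is the paper's fitted value, and this simplified form omits the model-size and compute terms of the full result.

```latex
% Data-scaling law from Kaplan et al. (2020), simplified to the
% data-limited case: loss L falls as a power law in dataset size D.
L(D) \approx \left( \frac{D_c}{D} \right)^{\alpha_D},
\qquad \alpha_D \approx 0.095
```

In plain terms, a tenfold increase in training tokens multiplies the loss by roughly 10^(-0.095) ≈ 0.80, a cut of about 20 percent, which is why each generation of models demands vastly more text than the last.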

GPT-3, released in November 2020, was trained on roughly 300 billion tokens. In 2022, Google DeepMind tested 400 AI models; the best performer among them, a model called Chinchilla, was trained on 1.4 trillion tokens. By 2023, the Skywork model developed by Chinese researchers used 3.2 trillion English and Chinese tokens for training, while the training data for Google's PaLM 2 reached 3.6 trillion tokens.

The research institute Epoch puts it bluntly: technology companies are now consuming data faster than the Internet produces it, and they could exhaust its stock of high-quality data as soon as 2026.

Faced with this problem, OpenAI CEO Sam Altman has proposed a way out: companies like OpenAI will eventually train AI on AI-generated data, known as synthetic data. Developers could then build more powerful technology while reducing their reliance on copyrighted material.

OpenAI and a number of other institutions are currently studying whether two different models can be paired to produce more useful and reliable synthetic data: one system generates the data, while the other judges its quality. The viability of this technical path is still debated; a rough sketch of the idea follows below.
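
As an illustration of that two-model loop, the hypothetical Python sketch below generates candidate samples and keeps only those a judge scores highly. Both models here are toy stand-ins (in practice each would be a separately trained LLM), and no real vendor API is implied.

```python
# Hypothetical sketch of a generator/judge loop for synthetic training
# data. Both "models" are toy stubs standing in for real LLMs.
import random

def generator(prompt: str) -> str:
    """Toy 'generator' model: emits a candidate answer for the prompt."""
    return f"{prompt} -> candidate answer #{random.randint(1, 1000)}"

def judge(sample: str) -> float:
    """Toy 'judge' model: scores a candidate's usefulness in [0, 1]."""
    return random.random()

def build_synthetic_dataset(prompts, threshold=0.8, tries_per_prompt=5):
    """Keep only generated samples the judge scores above the threshold."""
    kept = []
    for prompt in prompts:
        for _ in range(tries_per_prompt):
            sample = generator(prompt)
            if judge(sample) >= threshold:
                kept.append(sample)
                break  # one accepted sample per prompt is enough here
    return kept

if __name__ == "__main__":
    data = build_synthetic_dataset(["Explain photosynthesis", "Sum 2+2"])
    print(f"kept {len(data)} judged samples")
```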

Jeff Clune, a former OpenAI researcher, compares the data these AI systems need to a path through the jungle: if companies train only on synthetic data, the AI may get lost in it.
