
AI's “data shortage”: Microsoft, Google, and others are turning to “synthetic data” to train AI

cls.cn ·  May 12 10:39

① The amount of data available on the internet is limited, a headache for technology companies that need large amounts of data to train models; ② AI companies are turning to an alternative: synthetic data; ③ Synthetic data is data generated by artificial intelligence systems. Some large companies have already begun using it, but opinion on the approach is sharply divided.

Cailian Press, May 12: Artificial intelligence chatbots depend on massive amounts of high-quality data. Traditionally, artificial intelligence systems have relied on large volumes of data drawn from web sources such as articles, books, and online comments to understand user queries and generate responses.

Obtaining more high-quality data has long been a major challenge for artificial intelligence companies. Because the data available on the internet is limited, these companies are seeking an alternative: synthetic data.

Synthetic data is artificial data generated by artificial intelligence systems. Technology companies use their own artificial intelligence models to generate synthetic data (sometimes also described as fake data) and then use it to train future iterations of their systems.

Synthetic data is generated by giving an AI model specific parameters and prompts for the content it should create. This approach allows more precise control over the data used to train an AI system.

For example, Microsoft researchers gave an artificial intelligence model a list of 3,000 words that a four-year-old can understand, then asked it to create a children's story using one noun, one verb, and one adjective from the list. By repeating the prompt millions of times over several days, the researchers had the model generate millions of short stories.
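The prompt-building step described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' actual code: the tiny word lists stand in for the 3,000-word vocabulary, the prompt wording is invented, and the names `VOCABULARY`, `build_story_prompt`, and `build_prompts` are hypothetical. The call to a text-generation model is deliberately left out.

```python
import random

# Hypothetical stand-in for the 3,000-word vocabulary, grouped by part of
# speech. These words are illustrative only and do not come from the source.
VOCABULARY = {
    "noun": ["dog", "ball", "tree", "boat"],
    "verb": ["jump", "find", "share", "sing"],
    "adjective": ["happy", "tiny", "brave", "soft"],
}


def build_story_prompt(rng: random.Random) -> str:
    """Pick one noun, one verb, and one adjective and wrap them in a prompt."""
    noun = rng.choice(VOCABULARY["noun"])
    verb = rng.choice(VOCABULARY["verb"])
    adjective = rng.choice(VOCABULARY["adjective"])
    return (
        "Write a short story that a four-year-old can understand. "
        f"Use the noun '{noun}', the verb '{verb}', "
        f"and the adjective '{adjective}'."
    )


def build_prompts(count: int, seed: int = 0) -> list[str]:
    """Build `count` prompts; each one would be sent to a text-generation model."""
    rng = random.Random(seed)
    return [build_story_prompt(rng) for _ in range(count)]


if __name__ == "__main__":
    for prompt in build_prompts(3):
        print(prompt)
```

Varying the word triple on every request is what gives the generated stories their variety; scaling `count` into the millions corresponds to the repeated prompting the article describes.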

Although synthetic data is not a new concept in computing, the rise of generative artificial intelligence has spurred the creation of higher-quality synthetic data at scale.

Dario Amodei, CEO of the artificial intelligence startup Anthropic, has called this approach an “infinite data generation engine,” one that is meant to avoid some of the copyright and privacy issues associated with traditional data collection.

Existing use cases and divergent views

Currently, major artificial intelligence companies such as Meta, Google, and Microsoft have begun using synthetic data to develop advanced models, including chatbots and language processors.

For example, Anthropic uses synthetic data to power its chatbot Claude; Google DeepMind uses this method to train models that can solve complex geometric problems; at the same time, Microsoft has disclosed small language models developed using synthetic data.

Some proponents believe that, if properly implemented, synthetic data can produce accurate and reliable models.

However, some AI experts are concerned about the risks associated with synthetic data. Researchers at well-known universities have observed cases of “model collapse,” in which artificial intelligence models trained on synthetic data developed irreversible defects and produced nonsensical output. There are also concerns that synthetic data may amplify biases and errors in a data set.
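A toy analogy can give some intuition for how training on one's own output might degrade over generations. The sketch below is an assumption-laden illustration, not the method used in the research mentioned above: the "model" is just a normal distribution fitted to a small sample, and each generation is trained only on samples drawn from the previous fit. The function name `simulate_generations` and all parameter values are hypothetical.

```python
import random
import statistics


def simulate_generations(generations: int, sample_size: int, seed: int = 0) -> list[float]:
    """Repeatedly fit a normal distribution to samples drawn from the previous fit.

    Returns the fitted standard deviation at each generation, starting with
    the original "real" distribution (mean 0, standard deviation 1).
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0
    history = [std]
    for _ in range(generations):
        # "Synthetic data": samples produced by the current model.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # "Retraining": refit the model on its own output only.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_generations(generations=200, sample_size=10)
    print(f"initial spread: {history[0]:.3f}")
    print(f"final spread:   {history[-1]:.3e}")
```

With small samples, the fitted spread tends to shrink from one generation to the next, so the later "models" cover less and less of the original distribution. Real language models are far more complex, so this should be read only as a rough picture of the concern.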

Dr. Zakhar Shumaylov of Cambridge University wrote in an email, “Synthetic data can be useful if handled properly. However, there are currently no clear answers on how to handle them properly; some biases may be difficult for humans to detect.”

Furthermore, there is a philosophical debate surrounding the reliance on synthetic data, and people are questioning the nature of artificial intelligence—if machine-synthesized data is used, is artificial intelligence still a machine that mimics human intelligence?

Stanford professor Percy Liang stressed the importance of incorporating real human intelligence into the data generation process and pointed to the complexity of creating synthetic data at scale. In his words, “Synthetic data isn't real data; it's like dreaming of climbing Mount Everest rather than actually reaching the summit.”

There is currently no consensus on best practices for generating synthetic data, which highlights the need for further research and development in this area. As the field continues to evolve, collaboration between AI researchers and domain experts is critical to harnessing the potential of artificial intelligence to develop synthetic data.

Editor: emily

The translation is provided by third-party software.

