Microsoft, Google and Meta are betting on synthetic data to build AI models

Author:editor|Category:Travel

Behind every clever chatbot response lies a vast amount of data: in some cases, trillions of words drawn from articles, books and online comments are used to teach artificial intelligence systems to understand users' queries. The conventional wisdom in the industry is that building the next generation of artificial intelligence products will require ever more information.

However, there is a big problem with this plan: the supply of high-quality data available on the internet is limited. To obtain it, artificial intelligence companies typically either pay publishers millions of dollars to license content or scrape data from websites, exposing themselves to copyright disputes. A growing number of leading artificial intelligence companies are exploring another approach, one that is dividing the industry: the use of synthetic data, which is essentially fake data.

Here's how it works: technology companies can use their own artificial intelligence systems to generate text and other media. This data can then be used to train future versions of the same systems, which Anthropic CEO Dario Amodei calls a potential "unlimited data generation engine." In this way, artificial intelligence companies can sidestep many legal, ethical and privacy issues.

Synthetic data is not a new idea in computing: the technique has been used for decades, for everything from anonymizing personal information to simulating road conditions for self-driving technology. However, the rise of generative artificial intelligence makes it easier to produce higher-quality synthetic data at scale, and it has also given the practice new urgency.

Anthropic says it used synthetic data to build the latest model powering its chatbot, Claude. Meta and Google have used it to develop their latest open source models. Google DeepMind recently said it relied on this approach to help train a model that can solve geometry problems at the Olympiad level. Many people have speculated about whether OpenAI is using such data to train Sora, its text-to-video generator. (OpenAI said it is exploring the use of synthetic data but would not confirm further details.)

At Microsoft, the generative artificial intelligence research team used synthetic data in a recent project. They wanted to build a smaller, less resource-intensive artificial intelligence model that still had effective language and reasoning capabilities. To do this, they tried to imitate the way children learn language by reading stories.

Instead of feeding the artificial intelligence model a large number of children's books, the team drew up a list of 3,000 words that a four-year-old could understand. They then asked an artificial intelligence model to write a children's story using one noun, one verb and one adjective from that vocabulary. The researchers repeated this prompt millions of times over a few days, producing millions of short stories that eventually helped to develop another, more capable language model. Microsoft has made this new family of "small" language models, Phi-3, open source and publicly available.
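
The prompting procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny vocabulary, the prompt wording and the names build_story_prompt and build_prompts are all hypothetical, and the call to an actual language model is left out because the article does not say how the team issued its prompts.

```python
import random

# Hypothetical miniature vocabulary; the article describes a list of about
# 3,000 words, which is not reproduced here.
VOCAB = {
    "noun": ["dog", "ball", "tree", "boat"],
    "verb": ["jump", "find", "share", "sleep"],
    "adjective": ["happy", "tiny", "red", "brave"],
}


def build_story_prompt(rng, vocab=VOCAB):
    """Pick one noun, one verb and one adjective and wrap them in a prompt."""
    noun = rng.choice(vocab["noun"])
    verb = rng.choice(vocab["verb"])
    adjective = rng.choice(vocab["adjective"])
    # The wording below is an assumption, not the team's actual prompt.
    prompt = (
        "Write a short story that a four-year-old could understand. "
        f"It must use the noun '{noun}', the verb '{verb}' "
        f"and the adjective '{adjective}'."
    )
    return prompt, (noun, verb, adjective)


def build_prompts(count, seed=0):
    """Generate `count` prompts; a fixed seed makes the run repeatable."""
    rng = random.Random(seed)
    return [build_story_prompt(rng) for _ in range(count)]


if __name__ == "__main__":
    for prompt, words in build_prompts(3):
        print(words, "->", prompt)
```

Varying the three required words is what keeps millions of repetitions of the same instruction from yielding millions of near-identical stories; each prompt would then be sent to a text-generating model of one's choice.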

"All of a sudden," said Sébastien Bubeck, vice president of generative artificial intelligence at Microsoft, "you have far more control than you used to. You can decide at a much finer level what you want your model to learn."

With synthetic data, you can also better guide artificial intelligence systems through the learning process by adding more explanation to the data at points where a machine might otherwise get confused, Bubeck said.

However, some artificial intelligence experts are concerned about the risks of this technique. A team of researchers at Oxford, Cambridge and several other leading universities published a paper last year explaining how using synthetic data generated by ChatGPT to build new artificial intelligence models can lead to what they call "model collapse."

In their experiments, artificial intelligence models trained on ChatGPT output began to show "irreversible defects" and seemed to forget what they had originally been trained on. In one example, the researchers prompted a large language model with text about historic British buildings. After they retrained the model on synthetic data many times, it began to produce meaningless gibberish about long-eared rabbits.
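
The loss of the original material can be illustrated with a toy simulation. This is not the experiment from the paper, only a loose analogy: it assumes that "training on your own output" can be crudely modeled as redrawing a dataset, with replacement, from the previous generation's dataset. The function name resample_generations is hypothetical.

```python
import random


def resample_generations(original, generations, seed=0):
    """Repeatedly replace the dataset with samples drawn from itself.

    Each generation is drawn with replacement from the previous one, a crude
    stand-in for training a model only on an earlier model's output.
    Returns the number of distinct items present in each generation.
    """
    rng = random.Random(seed)
    data = list(original)
    distinct_counts = [len(set(data))]
    for _ in range(generations):
        # Every new item is copied from the previous generation, so nothing
        # that has dropped out can ever reappear.
        data = [rng.choice(data) for _ in range(len(data))]
        distinct_counts.append(len(set(data)))
    return distinct_counts


if __name__ == "__main__":
    counts = resample_generations(range(100), generations=50)
    print(counts)
```

Because each generation only copies items from the one before it, the number of distinct items can never go up, and anything lost along the way is gone for good. That one-way loss of variety is the intuition behind the "irreversible defects" the researchers describe, although real language models degrade in more complicated ways.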

The researchers are also concerned that synthetic data may amplify the biases and toxicity in a data set. Some proponents of synthetic data say that, with appropriate safeguards, models developed this way can be as accurate as, or even better than, those built on real data.

Dr Zakhar Shumaylov of the University of Cambridge said in an email: "If handled properly, synthetic data can be very useful. However, there is no clear answer as to how to handle it properly; some biases may be difficult for humans to detect." Shumaylov is one of the co-authors of the above-mentioned paper on model collapse.

There is also a more philosophical debate: if large language models are trapped in an endless cycle of training on their own output, will artificial intelligence eventually become less a machine that mimics human intelligence and more a machine that mimics the language of other machines?

To generate useful synthetic data, companies still need genuine products of human intelligence, such as books, articles and program code, says Percy Liang, a computer science professor at Stanford University. "Synthetic data is not real data, just as you didn't really climb Mount Everest in a dream," Liang said in an email.

Pioneers in the fields of synthetic data and artificial intelligence agree that humans cannot be excluded from this process. Real people are still needed to build and refine artificial datasets.

"Synthetic data is not simply a matter of pressing a button and saying, 'Hey, help me generate some data,'" Bubeck said. "It is a very complicated process. A lot of human labor goes into building synthetic data at scale."

2024-05-12 05:05:07
