Is it possible to train an AI using only data generated by another AI? It may sound like a crazy idea. But the idea has been around for quite some time, and it is gaining traction as new, real data becomes increasingly difficult to obtain.
Anthropic used some synthetic data to train one of its flagship models, Claude 3.5 Sonnet. Meta used AI-generated data to fine-tune its Llama 3.1 models. And OpenAI is said to be sourcing synthetic training data from o1, its "reasoning" model, for the upcoming Orion.
But why does AI need data in the first place? And what kind of data does it need? And can this data really be replaced with synthetic data?
Importance of annotations
AI systems are statistical machines. Trained on many examples, they learn the patterns in those examples to make predictions, such as that "to whom" in an email typically precedes "it may concern."
Annotations, typically text labeling the meaning of or parts of the data these systems ingest, are a key piece of these examples. They serve as guideposts, "teaching" a model to distinguish objects, places, and ideas.
Consider a photo classification model shown many photos of kitchens labeled with the word "kitchen." As it trains, the model begins to associate "kitchen" with general characteristics of kitchens, such as containing a refrigerator and a countertop. After training, given a photo of a kitchen that was not among its initial examples, the model should be able to identify it as such. (Of course, if the kitchen photos were labeled "cow," it would identify them as cows, which highlights the importance of good annotation.)
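To make the role of labels concrete, here is a minimal sketch, not taken from any particular vendor's pipeline, of a classifier trained on annotated examples. The feature vectors and label strings are invented for illustration; real systems learn from raw pixels and far larger datasets.

```python
# Toy illustration of how annotations steer a classifier.
# Assumes each photo has already been reduced to a small feature vector;
# the annotation is simply the label string attached to it.
from sklearn.linear_model import LogisticRegression

# features: [has_refrigerator, has_countertop, has_grass]
photos = [
    [1, 1, 0],  # annotated "kitchen": refrigerator + countertop
    [1, 0, 0],  # annotated "kitchen": refrigerator only
    [0, 0, 1],  # annotated "cow": grass, no appliances
    [0, 0, 1],  # annotated "cow"
]
labels = ["kitchen", "kitchen", "cow", "cow"]

model = LogisticRegression().fit(photos, labels)

# A new, unseen kitchen-like photo is identified via the patterns the labels taught.
print(model.predict([[1, 1, 0]]))  # expected: ['kitchen']
```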
The demand for AI, and the need to provide labeled data for its development, have expanded the market for annotation services. Dimension Market Research estimates that it is currently worth $838.2 million and will be worth $10.34 billion within the next 10 years. There are no precise estimates of how many people work in labeling jobs, but a 2022 paper puts the number in the "millions."
Companies large and small rely on workers employed by data annotation firms to create labels for AI training sets. Some of these jobs pay quite well, particularly if the labeling requires specialized knowledge (such as mathematical expertise). Others can be grueling: annotators in developing countries are paid, on average, just a few dollars an hour, with no benefits and no guarantee of future work.
Drying data wells
There is therefore a humanitarian reason to seek alternatives to human-created labels. But there are also practical ones.
Humans can only label so quickly. Annotators also have biases that can show up in their annotations and, in turn, in the models trained on them. They make mistakes, or get tripped up by labeling instructions. And paying humans to do the work is expensive.
What's more, data is generally expensive. Shutterstock charges AI vendors tens of millions of dollars for access to its archives, while Reddit makes hundreds of millions of dollars licensing its data to the likes of Google and OpenAI.
Finally, data acquisition is also becoming more difficult.
Most models are trained on large collections of public data. Data owners are increasingly choosing to gate their data for fear of it being stolen or not receiving credit or attribution. Over 35% of the world's top 1,000 websites currently block OpenAI's web scrapers. Additionally, one recent study found that approximately 25% of data from “high-quality” sources is restricted from the primary datasets used to train models.
Research group Epoch AI predicts that if current trends in blocking access continue, developers will run out of data to train their generative AI models sometime between 2026 and 2032. That, combined with fears of copyright lawsuits and objectionable material making their way into open datasets, has left AI vendors facing a reckoning.
Synthetic substitutes
At first glance, it appears that synthetic data can solve all these problems. Need annotations? Generate them. More sample data? No problem. The sky is the limit.
And this is true to some extent.
"If data is the new oil, then synthetic data is pitched as a biofuel that can be created without the real thing's negative externalities," Os Keyes, a doctoral candidate at the University of Washington who studies the ethical implications of emerging technologies, told TechCrunch. "You can take a small starting set of data and simulate and extrapolate new entries from it."
The AI industry has adopted this concept and is running with it.
Writer, an enterprise generative AI company, this month debuted Palmyra X 004, a model trained almost entirely on synthetic data. According to Writer, its development cost was just $700,000, compared to an estimated cost of $4.6 million for a comparable OpenAI model.
Microsoft's Phi open models were partially trained on synthetic data, as were Google's Gemma models. This summer, Nvidia announced a family of models designed to generate synthetic training data, and AI startup Hugging Face recently released what it claims is the largest AI training dataset of synthetic text.
Synthetic data generation has become a business in itself, potentially reaching a value of $2.34 billion by 2030. Gartner predicts that 60% of the data used for AI and analytics projects will be generated synthetically this year.
Luca Soldaini, a senior researcher at the Allen Institute for AI, said synthetic data techniques can be used to generate training data in a format that cannot be easily obtained through scraping (or content licensing). For example, in training the video generator Movie Gen, Meta used Llama 3 to create captions for footage in the training data, which humans then adjusted to add details such as lighting descriptions.
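As a rough sketch of that pattern, and not Meta's actual pipeline, the workflow might look like the following, where `draft_caption` and `human_review` are hypothetical placeholders: a model drafts annotations that would be hard to scrape or license, and annotators refine them before the pairs are used for training.

```python
# Hypothetical sketch of model-drafted annotations refined by humans.
# draft_caption and human_review are stand-ins, not a real API.

def draft_caption(clip_path: str) -> str:
    """Placeholder for a caption proposed by a captioning model."""
    return f"A short clip showing a kitchen scene ({clip_path})"

def human_review(caption: str) -> str:
    """Placeholder for an annotator adding detail, e.g. lighting descriptions."""
    return caption + ", lit by soft morning light from a side window"

training_pairs = []
for clip in ["clip_001.mp4", "clip_002.mp4"]:
    draft = draft_caption(clip)      # synthetic first pass
    final = human_review(draft)      # human adds the detail the model missed
    training_pairs.append({"video": clip, "caption": final})

print(training_pairs)
```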
Along these same lines, OpenAI says it used synthetic data to fine-tune GPT-4o and build sketchpad-like Canvas functionality for ChatGPT. And Amazon said it is generating synthetic data to supplement the real-world data used to train Alexa's voice recognition models.
“Synthetic data models allow us to quickly extend human intuition about what data is needed to achieve a particular model behavior,” Soldaini said.
Synthetic risk
However, synthetic data is not a panacea. It suffers from the same "garbage in, garbage out" problem as all AI. Models create synthetic data, and if the data used to train those models contains biases and limitations, their output will be tainted in the same way. For instance, groups that are poorly represented in the base data will be poorly represented in the synthetic data.
"The problem is, there's only so much you can do," Keyes said. "Say you only have 30 Black people in a dataset. Extrapolating out might help, but if those 30 people are all middle-class, or all light-skinned, that's what the 'representative' data will all look like."
To this point, a 2023 study by researchers at Rice University and Stanford found that over-reliance on synthetic data during training can create models with "gradually decreasing quality and diversity." Sampling bias, a poor representation of the real world, causes a model's diversity to worsen after a few generations of training, the researchers say (though they also found that mixing in a bit of real-world data helps mitigate this).
Keyes sees an additional risk in complex models such as OpenAI's o1: they can produce hallucinations that are harder to detect in synthetic data. Those, in turn, can reduce the accuracy of models trained on that data, especially when the source of the hallucinations is not easy to identify.
“Complex models create hallucinations. The data produced by complex models contains hallucinations,” Keyes added. “And with models like o1, developers themselves can't necessarily explain why artifacts appear.”
Compounding hallucinations can produce a model that spouts gibberish. A study published in the journal Nature shows how models trained on error-ridden data generate even more error-ridden data, and how this feedback loop degrades future generations of models. Over generations, the researchers found, models lose their grasp of more esoteric knowledge, become more generic, and often produce answers that are irrelevant to the questions being asked.
[Figure from the Nature study on model collapse. Image credit: Ilia Shumailov et al.]
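The feedback loop the study describes can be illustrated with a toy simulation; the setup below is an invented, simplified analogue, not the study's method. Each "generation" fits a simple model to data produced entirely by the previous generation, and the spread of the data steadily collapses.

```python
# Toy analogue of recursive training on model-generated data.
# Each generation fits a normal distribution to the previous generation's
# output and then samples from it; with tiny samples, diversity collapses.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=5)  # generation 0: a tiny "real" dataset

for gen in range(1, 101):
    mu, sigma = samples.mean(), samples.std()   # "train" on the current data
    samples = rng.normal(mu, sigma, size=5)     # next generation sees only model output
    if gen % 25 == 0:
        print(f"generation {gen:3d}: estimated std = {sigma:.6f}")

# The estimated spread shrinks toward zero over generations: rare, "esoteric"
# regions of the original distribution are the first to disappear.
```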
A follow-up study showed that other types of models, such as image generators, are not immune to this kind of collapse either:
[Figure from the follow-up study on image-generator collapse. Image credit: Ilia Shumailov et al.]
Soldaini agrees that "raw" synthetic data cannot be trusted, at least if the goal is to avoid training forgetful chatbots and homogeneous image generators. Using it "safely," he says, requires thoroughly reviewing, curating, and filtering it, and ideally pairing it with fresh, real-world data, just as you would with any other dataset.
Failing to do so can eventually lead to model collapse, in which a model's outputs become less "creative" and more biased, ultimately seriously compromising its functionality. Although this process can be identified and arrested before it becomes serious, it is a risk.
“Researchers need to examine the data generated, repeat the generation process, and identify safeguards to remove low-quality data points,” Soldaini said. “Synthetic data pipelines are not self-improving machines; their output must be carefully inspected and improved before being used for training.”
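In practice, that inspection step often amounts to code that scores and discards generated examples before training. The sketch below is a minimal, hypothetical illustration of the idea, with an invented quality heuristic rather than a production filter.

```python
# Hypothetical safeguard: score synthetic examples, drop low-quality ones,
# and blend the survivors with real data before training.

def quality_score(example: str) -> float:
    """Toy heuristic: penalize very short or highly repetitive text."""
    words = example.split()
    if len(words) < 5:
        return 0.0
    return len(set(words)) / len(words)  # lexical diversity as a crude proxy

synthetic = [
    "the cat sat on the mat near a sunny window",
    "data data data data data",   # degenerate model output
    "ok",                         # too short to be useful
]
real = ["a paragraph collected from a licensed, human-written source"]

filtered = [s for s in synthetic if quality_score(s) > 0.5]
training_set = filtered + real    # curated synthetic data mixed with real data
print(training_set)
```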
OpenAI CEO Sam Altman once argued that AI will someday produce synthetic data good enough to effectively train itself. But, assuming that is even feasible, the technology doesn't exist yet. No major AI lab has released a model trained solely on synthetic data.
At least for the foreseeable future, it seems humans will need to stay in the loop somewhere to make sure a model's training doesn't go off the rails.