Elon Musk, like other AI experts, says there is little real-world data left to train AI models.
“We're basically depleting the human reservoir of knowledge right now… in AI training,” Musk said during a conversation with Stagwell chairman Mark Penn that was livestreamed on X late Wednesday. “That's basically what happened last year.”
Musk, who owns the AI company xAI, echoed a theme touched on by former OpenAI chief scientist Ilya Sutskever in a December speech at the machine learning conference NeurIPS. Sutskever said the AI industry has reached what he calls “peak data,” and predicted that the lack of training data will force a shift away from current model development methods.
Indeed, Musk suggested that synthetic data, meaning data generated by AI models themselves, is the way forward. With synthetic data, he said, AI would in effect grade itself and go through a process of self-learning.
Other companies, including tech giants like Microsoft, Meta, OpenAI, and Anthropic, are already using synthetic data to train their flagship AI models. Gartner estimated that 60% of the data used for AI and analytics projects in 2024 would be synthetically generated.
Microsoft's Phi-4, which was open sourced early Wednesday, was trained on synthetic data alongside real-world data. So were Google's Gemma models. Anthropic used some synthetic data to develop Claude 3.5 Sonnet, one of its highest performing systems. And Meta used AI-generated data to fine-tune its latest Llama series of models.
Training on synthetic data has other benefits too, such as cost savings. AI startup Writer claims that its Palmyra X 004 model, built almost entirely from synthetic sources, cost just $700,000 to develop. By comparison, the estimated development cost for a comparable OpenAI model is $4.6 million.
However, there are also disadvantages. Some studies suggest that synthetic data can lead to model collapse, in which a model's outputs become less “creative” and more biased, ultimately severely impairing its functionality. Because models create the synthetic data, any biases or limitations in the data used to train those models will contaminate their outputs as well.