Elon Musk agrees with other AI experts that there is too little real-world data left to train AI models.
“We've now exhausted basically the cumulative sum of human knowledge…. in AI training,” Musk said during a livestreamed conversation with Stagwell Chairman Mark Penn on X late Wednesday. “That happened basically last year.”
Musk, who owns AI company xAI, echoed themes that former OpenAI chief scientist Ilya Sutskever touched on during a talk in December at NeurIPS, the machine learning conference. Sutskever, who said the AI industry has reached what he called “peak data,” predicted that a lack of training data would force a shift away from the way models are built today.
Indeed, Musk suggested that synthetic data (data generated by AI models themselves) is the way forward. “The only way to supplement [real-world data] is with synthetic data, where the AI creates [training data],” he said. “With synthetic data… [AI] will sort of grade itself and go through this process of self-learning.”
Other companies, including tech giants like Microsoft, Meta, OpenAI, and Anthropic, are already using synthetic data to train flagship AI models. Gartner estimates that 60% of the data used for AI and analytics projects in 2024 was synthetically generated.
Microsoft's Phi-4, which was open-sourced early Wednesday, was trained on synthetic data as well as real-world data. So were Google's Gemma models. Anthropic used some synthetic data to develop one of its most performant systems, Claude 3.5 Sonnet. And Meta fine-tuned its latest Llama series of models using AI-generated data.
Training on synthetic data has other advantages, such as cost savings. AI startup Writer claims its Palmyra X004 model, built almost entirely using synthetic sources, cost just $700,000 to develop, compared to estimates of $4.6 million for a comparably sized OpenAI model.
But there are disadvantages as well. Some research suggests that synthetic data can lead to model collapse, where a model becomes less “creative” and more biased in its outputs, eventually seriously compromising its functionality. Because models create synthetic data, if the data used to train these models has biases and limitations, their outputs will be similarly tainted.