Elon Musk, owner of X and CEO of xAI, among other companies, says AI systems have nearly exhausted the online data available for training.
His solution involves crossing the Rubicon of model training by using synthetic data, meaning AI models will generate the data they learn from.
Why it matters. A shortage of training data would mark a pivotal moment in the development of AI tools, and it could slow technological progress.
Context. Large language models require vast amounts of data to improve their performance. The depletion of real data, generated by humans through traditional means, is pushing the industry to seek alternatives to enhance products like chatbots and image generators.
- The idea isn’t new. Other AI projects have already adopted it. Gartner predicted that 60% of the data used in AI projects in 2024 would be synthetically generated. Companies such as Microsoft, OpenAI, Anthropic, and Meta are turning to synthetic data.
- Writer’s Palmyra X 004, a model designed to power existing AI applications, was trained this way at a cost of $700,000.
- By comparison, training a similarly sized OpenAI model costs an estimated $4.6 million.
What’s different about Musk’s proposal? So far, synthetic data has supplemented real data, not replaced it. Musk believes synthetic data will soon become the only viable training source.
Between the lines. Musk isn’t alone in raising concerns. In December, Ilya Sutskever, OpenAI’s former chief scientist, issued a similar warning: “We have reached the peak of data, and there will be no more data in the future.”
- The issue with synthetic data lies in the risk of creating a closed loop, where biases and limitations become amplified.
- This could lead to model collapse, a gradual loss of creativity and accuracy as each generation of models learns from the output of the last.
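The closed loop can be illustrated with a toy simulation. The sketch below is not how any of the companies mentioned train their models; it is a minimal illustration that assumes the "model" is just a bell curve fitted to a handful of numbers. Each generation draws a small synthetic dataset from the previous fit and refits to it alone. The function name `simulate_closed_loop` and its parameters are hypothetical, chosen for this example.

```python
import random
import statistics


def simulate_closed_loop(generations=200, sample_size=10, seed=0):
    """Return the fitted standard deviation after each generation.

    Toy assumption: the "model" is a normal distribution described by
    a mean and a standard deviation, refit each round on its own output.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # stands in for the original human-made data
    history = [std]
    for _ in range(generations):
        # The current model generates a small synthetic dataset...
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # ...and the next model is fit to that synthetic data alone.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


history = simulate_closed_loop()
for generation in (0, 50, 100, 200):
    print(f"generation {generation:3d}: std = {history[generation]:.6g}")
```

In this toy setting the spread of the data tends to shrink from one generation to the next, because every refit loses a little of the variety in the previous round. It is a rough analogue of the loss of diversity described above, not a measurement of real systems.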
Despite these risks, the industry continues to embrace synthetic data.
Image | Xataka On with Grok