OpenAI Has Blatantly Used Anything Online to Train Its AI Models. It’s Now Accusing DeepSeek of Stealing Its Data

  • OpenAI officials believe that DeepSeek has distilled their models.

  • This technique is common in the AI field but is prohibited by OpenAI’s terms of use.

  • Additionally, OpenAI has been accused of training its models on data, including copyrighted materials, without the consent of the original owners.

Javier Pastor

Senior Writer

Computer scientist turned tech journalist. I've written about almost everything related to technology, but I specialize in hardware, operating systems and cryptocurrencies. I like writing about tech so much that I do it both for Xataka and Incognitosis, my personal blog.

DeepSeek’s AI models are impressive. Recent benchmarks place them on par with leading models like ChatGPT, Claude, and Gemini. While this has garnered praise, it’s also raised suspicions. Some people question how DeepSeek could achieve these results with a training cost of only $5.6 million.

Interestingly, OpenAI has now made new accusations against DeepSeek.

Data theft accusations. OpenAI representatives told the Financial Times that they’ve found evidence suggesting that DeepSeek is using OpenAI’s data without permission. They allege that DeepSeek has employed “distillation” techniques to replicate the capabilities of OpenAI’s models.

What is “distillation” in AI? DeepSeek developers have used several techniques to create the company’s efficient models. One primary method is reinforcement learning, but they’re also known to use LLM distillation. This technique involves training a smaller “student model” to mimic the behavior of a larger, more advanced “teacher model.” By learning from the teacher model’s outputs rather than from raw data alone, the student model can be faster and cheaper to run while retaining comparable performance on specific tasks.
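The student/teacher idea can be illustrated with a toy example. The sketch below is not DeepSeek's or OpenAI's code; it assumes the classic "soft label" formulation of distillation, in which the student is nudged to match the teacher's output probabilities. The function names, the three-value logit lists, the temperature, and the learning rate are all hypothetical choices made for illustration. A student that only sees text returned by an API would instead be fine-tuned on that generated text, but the principle of learning from the teacher's outputs is the same.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_step(teacher_logits, student_logits, temperature=2.0, lr=0.5):
    """One gradient-descent step on the student's logits (hypothetical toy 'student')."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # The gradient of KL(p || q) with respect to student logit i is (q_i - p_i) / T.
    return [z - lr * (qi - pi) / temperature
            for z, pi, qi in zip(student_logits, p, q)]


# Hypothetical data: the teacher strongly prefers the first of three options,
# while the student starts with no preference at all.
teacher = [4.0, 1.0, 0.5]
student = [0.0, 0.0, 0.0]

initial_loss = distillation_loss(teacher, student)
for _ in range(200):
    student = distill_step(teacher, student)
final_loss = distillation_loss(teacher, student)

print(f"loss before: {initial_loss:.4f}, loss after: {final_loss:.4f}")
```

In a real system the "student" would be a full neural network and the update would flow through its weights, but the objective has the same shape: reduce the gap between what the student predicts and what the teacher predicts.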

Unauthorized use of data. While distilling models is a common practice in the industry, OpenAI’s terms of use explicitly prohibit the use of its models for this purpose. Users aren’t allowed to “copy” any of OpenAI’s services or “use [the] Output [of OpenAI models] to develop models that compete with OpenAI.”

OpenAI and Microsoft have investigated this issue. According to Bloomberg, both companies investigated accounts last fall that were suspected of being used by DeepSeek developers to exploit their chatbots. These developers were reportedly using the OpenAI API, and there were concerns that they had violated the terms of use by using that access to distill their own models.

DeepSeek isn’t the only one. David Sacks, the White House AI advisor, informed President Donald Trump about the situation, claiming there was evidence DeepSeek had used data from OpenAI. OpenAI representatives said, “We know [China]-based companies, and others, are constantly trying to distill the models of leading U.S. AI companies.”

A thief believes everybody steals. The irony in this situation is that OpenAI has also been criticized for harvesting data from the Internet to train its models, often violating the terms of service of other platforms. For instance, it was revealed in 2024 that OpenAI transcribed a million hours of YouTube videos to train GPT-4. Computer scientist Timnit Gebru, who was dismissed from Google amid public controversy, wrote on LinkedIn, “OpenAI has to be the most insufferable company in the world.” She added, “They can steal from the whole world and guzzle all possible resources. But no one can give them a taste of their own medicine even a little bit.”

If it’s online, you’re free to use it, right? Companies often employ the “fair use” argument to justify their actions, collecting public content from the Internet without seeking permission from users or platforms. Moreover, there are suspicions that, in many instances, these companies train their models using copyrighted works, leading to numerous lawsuits.

Image | TechCrunch
