Some Companies Are Training Their AI Models on Reddit Data Without Permission, And Reddit Is Declaring War

Artificial intelligence companies rely on data to train their models. To gather this data, they often use web scraping, a technique that involves extracting and storing public information from web pages. This is typically done without the consent of the content creators or rights holders, and without paying them.

This week, Reddit announced a new measure to prevent companies from scraping its site. The platform, which hosts millions of conversations on various topics within subreddits, will block unauthorized companies from using its public content. To do so, it’s going to make a backend-level change to its robots.txt file, which implements the Robots Exclusion Protocol, “in the coming weeks.”
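For readers unfamiliar with robots.txt, the sketch below shows how a well-behaved crawler might consult such a file before fetching a page, using Python's standard `urllib.robotparser` module. The rules, bot names, and URL are hypothetical and chosen only for illustration; they are not Reddit's actual robots.txt. Note that the protocol is advisory: it only stops crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for illustration only.
# This is NOT Reddit's actual file; the bot names are made up.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URL; any path is covered by the rules above.
    url = "https://example.com/r/some_subreddit/"
    for agent in ("ExampleArchiveBot", "SomeUnlistedScraper"):
        verdict = "allowed" if is_allowed(EXAMPLE_ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent}: {verdict}")
```

In this sketch, the explicitly listed bot is allowed everywhere, while any other user agent falls through to the wildcard rule and is disallowed. A scraper that never checks the file is unaffected by it.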

Reddit Is at War With Web Scrapers

The announcement aims to cut off access to Reddit’s content for anyone who doesn’t have an agreement with the platform, which is led by Steve Huffman. In recent months, major tech companies such as OpenAI, the maker of ChatGPT, and Google, the creator of Gemini, have entered into formal partnerships with Reddit for access to its data. Essentially, without an agreement, access to the data is restricted.

The modifications announced on Wednesday have been incorporated into the platform’s Public Content Policy. While the company is cracking down on web scrapers, it states that it’ll continue to provide researchers and academics access to its content. Reddit added that it’ll ensure access for moderators and organizations like the Internet Archive, which is dedicated to preserving online content.

In today’s AI-driven world, the importance of text, images, music, and videos can’t be overlooked. Companies have been “scraping” the web to supply their models with various content for a while now. However, entities like OpenAI aren’t forthcoming about the sources of the data they use, saying only that they rely on content licensed through agreements and on “publicly available” content.



Reddit is taking steps to prevent companies from using automated bots to collect public data from its platform.

Companies must have a licensing agreement in order to access Reddit’s data.

