Artificial intelligence companies rely on data to train their models. To gather this data, they often use web scraping, a technique that involves extracting and storing public information from web pages. This is typically done without the consent of the content creators or licensees, and no payment is involved.
This week, Reddit announced a new measure to prevent companies from scraping its site. The platform, which hosts millions of conversations on various topics within subreddits, will block unauthorized companies from using its public content. To do so, it’s going to make a backend-level change to its robots.txt file, which implements the Robots Exclusion Protocol, “in the coming weeks.”
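For readers unfamiliar with robots.txt, here is a minimal sketch of how a well-behaved crawler consults such a file before fetching a page, using Python's standard `urllib.robotparser` module. The rules, the bot names (`ExampleArchiveBot`, `SomeScraperBot`), and the `example.com` URL are hypothetical illustrations, not Reddit's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Reddit's real robots.txt.
# One named crawler is allowed everywhere, every other crawler is blocked.
HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(HYPOTHETICAL_ROBOTS_TXT)
    page = "https://example.com/r/some_subreddit/"
    for bot in ("ExampleArchiveBot", "SomeScraperBot"):
        print(bot, "allowed:", is_allowed(rules, bot, page))
```

Note that the Robots Exclusion Protocol is advisory: the file states the site's wishes, and it is up to each crawler to check and honor them, as the sketch above does.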
Reddit Is at War With Web Scrapers
The announcement aims to cut off access to Reddit’s content for anyone who doesn’t have an agreement with the company, which is led by Steve Huffman. In recent months, major tech companies such as OpenAI, the owner of ChatGPT, and Google, the creator of Gemini, have entered into formal partnerships with Reddit for access to its data. In short, no agreement means no access.
The modifications announced on Wednesday have been incorporated into the platform’s Public Content Policy. While the company is cracking down on web scrapers, it states that it’ll continue to provide researchers and academics access to its content. Reddit added that it’ll ensure access for moderators and organizations like the Internet Archive, which is dedicated to preserving online content.
In today’s AI-driven world, the importance of text, images, music, and videos can’t be overlooked. Companies have been “scraping” the web to supply their models with various content for a while now. However, companies like OpenAI aren’t forthcoming about the sources of the data they use, saying only that they rely on content licensed through agreements and on “publicly available” content.
Nevertheless, this hasn’t prevented major players like The New York Times from suing Microsoft and OpenAI for copyright infringement. Additionally, record labels including Sony Music, Warner Music, and Universal Music are embroiled in legal disputes with the music generators Suno AI and Udio for allegedly using their songs. We’re witnessing a real-time battle for data to fuel AI. Only time will tell how it will all unfold.
Image | Reddit