AI Models Are Being Trained With Photos of Children. And It Doesn’t Matter if Parents Try to Avoid It

  • The LAION-5B dataset, widely used to train AI models, appears to contain numerous photos of children.

  • In some cases, it’s even possible to extract sensitive data about minors.

  • AI models were trained on these photos even though they had low online visibility because they were shared only with friends and family.

Javier Pastor

Senior Writer

Computer scientist turned tech journalist. I've written about almost everything related to technology, but I specialize in hardware, operating systems and cryptocurrencies. I like writing about tech so much that I do it both for Xataka and Incognitosis, my personal blog. LinkedIn

Human Rights Watch has long monitored how technology can threaten people's rights and freedoms. Now, it has flagged a new problem related to AI. The most disturbing part is that the victims of this threat are children.

What were those photos of children doing there? Human Rights Watch researcher Hye Jung Han discovered something disturbing last month: LAION-5B, a dataset widely used to train AI models, contained 170 photos of Brazilian children. According to Wired, the images came from parenting and personal blogs and from rarely viewed YouTube videos, possibly uploaded to share with friends and family. YouTube’s terms of service prohibit collecting personally identifiable information except in exceptional circumstances. However, as has happened on other occasions, the damage was already done.

Now, Human Rights Watch has discovered more. The same researcher found another 190 photos of children, this time from Australia. The pictures span the entirety of childhood, from newborn babies to girls in bathing suits at a carnival, children blowing bubbles, and even photos of Indigenous Australian children. And then there’s a disturbing fact: The parents had tried to prevent these photos from being shown to the public.

Stolen photos. Human Rights Watch states that few people had seen these images, which were protected by “certain privacy measures.” They couldn’t be found through an online search because their owners had posted them on personal blogs or video-sharing sites. Schools and photographers hired by families also shared some of these photos. “Some were uploaded years or even a decade before LAION-5B was created,” Human Rights Watch notes.

Identifiable children. The research highlighted how URLs in the dataset sometimes reveal information about the children, including their names or the locations where the photos were taken. From a photo described as “two boys, ages 3 and 4, grinning from ear to ear as they hold paintbrushes in front of a colorful mural,” the researcher was able to obtain “both children’s full names and ages, and the name of the preschool they attend in Perth, in Western Australia.” There was no information about the children anywhere else on the Internet, which makes it clear that their parents had taken steps to prevent them from being identified.
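
LAION-5B doesn’t distribute the images themselves: it’s essentially a table of image URLs paired with captions, which is why a URL or caption alone can expose a name or a school. The sketch below is a minimal illustration of that mechanism, using hypothetical rows and rough regex heuristics; it isn’t Human Rights Watch’s actual methodology.

```python
# Minimal sketch: how URL/caption pairs in a LAION-style index can leak
# identifying details. The rows below are hypothetical examples, not
# entries taken from LAION-5B itself.
import re

# Hypothetical (url, caption) rows in the shape LAION-style metadata uses.
rows = [
    ("https://example-preschool.example/2013/05/jack-and-oliver-mural.jpg",
     "Two boys, ages 3 and 4, painting a mural at our preschool"),
    ("https://cdn.example.org/cats/tabby-123.jpg",
     "A tabby cat sleeping on a windowsill"),
]

# Very rough heuristics for potentially identifying content.
CHILD_TERMS = re.compile(
    r"\b(boys?|girls?|child|children|kids?|baby|babies|preschool|kindergarten)\b",
    re.I,
)
NAME_LIKE = re.compile(r"/([a-z]+)-and-([a-z]+)-", re.I)  # e.g. "jack-and-oliver" in the URL path

for url, caption in rows:
    hits = []
    if CHILD_TERMS.search(caption):
        hits.append("caption mentions a child")
    m = NAME_LIKE.search(url)
    if m:
        hits.append(f"URL contains name-like tokens: {m.group(1)}, {m.group(2)}")
    if hits:
        print(url)
        for hit in hits:
            print("  -", hit)
```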

And this is undoubtedly just the tip of the iceberg. Human Rights Watch explains that its researchers could only review “fewer than 0.0001 percent of the 5.85 billion images and captions contained in the data set.” Han says, “It’s stunning that this came out of a random set of about 5,000 images and that these 190 photos of Australian children immediately popped up. You’d expect to find more pictures of cats than personal pictures of children,” given that LAION-5B is theoretically “a reflection of the entire Internet.”
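
For scale, a quick back-of-the-envelope check of those figures (both numbers come from the paragraph above; the snippet only performs the division):

```python
# Back-of-the-envelope check: what fraction of LAION-5B did the ~5,000-image sample cover?
sample_size = 5_000              # images reviewed, per Han's estimate above
dataset_size = 5_850_000_000     # images and captions in LAION-5B, per HRW
fraction = sample_size / dataset_size
print(f"{fraction:.7%}")         # ~0.0000855%, i.e. fewer than 0.0001 percent
```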

AI doesn’t know how to keep secrets. For Human Rights Watch, AI models are dangerous because previous studies have already demonstrated that it’s possible to recover sensitive data, such as medical records, that ends up in the datasets used to train AI.

What LAION-5B creators say. The developers of this dataset are part of LAION, an NGO that states it has a “zero tolerance policy for illegal content.” Nathan Tyler, one of its spokespeople, told Ars Technica that they’re working to fix the problem, but that removing these images is a slow and ineffective process. As Han points out, removing the links from the dataset doesn’t change the AI models that have already been trained on it: “They can’t forget the data they were trained on, even if that data was later deleted from the training dataset.”

Image | Robert Collins

Related | ‘Have I Been Trained?’ Is a Site That Helps You Find Out if Your Data and Work Have Been Used to Train AI
