A new report issued by Human Rights Watch reveals that a widely used, web-scraped AI training dataset includes images of and information about real children — meaning that generative AI tools have been trained on data belonging to real children without their knowledge or consent.
The watchdog group says it discovered over 170 traceable photos of real Brazilian children in the LAION-5B image-text dataset, which is built from data gleaned from the web-scraped content repository Common Crawl and has been used to train AI models including Stability AI's Stable Diffusion image generator.
Per the report, some of the retrieved photos were accompanied by alarmingly revealing information. One image of a two-year-old and her baby sister, for example, included details about the children's names and the "precise location" of the baby's birth. The photos also span decades of content: as Wired notes, the images were scraped "from content posted as recently as 2023 and as far back as the mid-1990s."
That AI is being trained on web-scraped images of children at all is, on its face, a serious privacy concern. Add that AI tools trained on this data are being used to create content like nonconsensual deepfakes and fake child sexual abuse material, and the finding sheds light on a particularly grim reality of both how AI models are trained and what they can ultimately be used to produce.
"Their privacy is violated in the first instance when their photo is scraped and swept into these datasets," Human Rights Watch children's rights and technology researcher Hye Jung Han, who found the images, told Wired. "And then these AI tools are trained on this data and therefore can create realistic imagery of children."
"The technology is developed in such a way that any child who has any photo or video of themselves online is now at risk," Han continued, "because any malicious actor could take that photo, and then use these tools to manipulate them however they want."
It's also worth noting that many of the images discovered were sourced from web content that few people would ever stumble across, like personal blog posts or, per Wired, stills from YouTube videos with extremely low view counts. In other words, AI is being trained on content that wasn't necessarily intended for mass public dissemination.
"Just looking at the context of where they were posted," Han told Wired, children and families "enjoyed an expectation and a measure of privacy."
"Most of these images were not possible to find online through a reverse image search," the researcher added.
LAION, the nonprofit AI research group that created LAION-5B, confirmed to Wired that it removed the flagged photos from the dataset. But this is just the tip of the iceberg: as noted in the Human Rights Watch warning, the group examined "less than 0.0001 percent of the 5.85 billion images and captions contained in the data set," meaning that the 170-plus figure is likely a "significant undercount of the total amount of children's personal data" used in LAION-5B alone.
As far as the History of Posting Stuff Online goes, it's safe to say that few posters ever expected their musings, images, and videos — especially those shared before 2023 — to get swept into data-hungry AI models. The reality, though, is that they have, and largely without anyone's express knowledge or consent. And while AI companies have proven themselves to be very liberal with what they consider fair use, the nonconsensual use of minors' data feels far from ethically gray.