Google has been training its AI image generator on child sexual abuse material.

As 404 Media reports, AI nonprofit LAION has taken down its widely used LAION-5B machine learning dataset, which even Google uses to train its AI models, out of "an abundance of caution" after a recent and disturbing Stanford study found that it contained 1,008 externally validated instances of child sexual abuse material (CSAM) and 3,226 suspected instances in total.

It's a damning finding that highlights the very real risks of indiscriminately training AI models on huge swathes of data. And it's not just Google using LAION's datasets: Stability AI's popular image generator Stable Diffusion has also trained its models on LAION-5B, one of the largest datasets of its kind, made up of billions of images scraped from the open web, including user-generated content.

The findings come just months after attorneys general from all 50 US states signed a letter urging Congress to take action against the proliferation of AI-generated CSAM and to expand existing laws to account for the distribution of synthetic child abuse content.

But as it turns out, the issue is even more deeply rooted than bad actors using image generators to create new CSAM: even the datasets these image generators are trained on appear to be tainted.

As detailed in the study, conducted by the Stanford Internet Observatory, researchers found the offending instances through a hash-based detection system.
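
In broad strokes, hash-based detection means computing a compact fingerprint of each image and comparing it against reference lists of known abuse material maintained by child-safety organizations; the study itself relied on industry tooling and non-public hash lists. As a purely illustrative sketch of the general technique, using the open-source imagehash library and a made-up reference set, the matching step might look something like this:

```python
# Purely illustrative sketch of generic hash-based matching; the reference
# hashes below are made-up placeholders, not entries from any real hash list.
from PIL import Image
import imagehash

# Hypothetical reference set of known-image perceptual hashes (hex strings).
KNOWN_HASHES = {"d1c4a0b2e3f49587", "8f3e1a2b4c5d6e7f"}

def scan_images(paths, max_distance=4):
    """Flag images whose perceptual hash is within `max_distance`
    Hamming-distance bits of any hash in the reference set."""
    references = [imagehash.hex_to_hash(h) for h in KNOWN_HASHES]
    flagged = []
    for path in paths:
        candidate = imagehash.phash(Image.open(path))  # 64-bit perceptual hash
        if any(candidate - ref <= max_distance for ref in references):
            flagged.append(path)
    return flagged

if __name__ == "__main__":
    print(scan_images(["example1.jpg", "example2.jpg"]))
```

Real-world systems use proprietary perceptual hashes and vetted databases held by organizations like NCMEC rather than anything resembling the placeholder values above.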

"We find that having possession of a LAION‐5B dataset populated even in late 2023 implies the possession of thousands of illegal images," the paper reads, "not including all of the intimate imagery published and gathered non‐consensually, the legality of which is more variable by jurisdiction."

While the material "does not necessarily indicate that the presence of CSAM drastically influences the output of the model above and beyond the model’s ability to combine the concepts of sexual activity and children, it likely does still exert influence," the researchers write.

In other words, Google and Stability AI, the maker of Stable Diffusion, have seemingly been facilitating the generation of CSAM, or allowing existing CSAM to be used to generate other potentially harmful images.

The images "basically gives the [AI] model an advantage in being able to produce content of child exploitation in a way that could resemble real life child exploitation," lead author and STO chief technologist David Thiel told the Washington Post.

The finding also suggests that researchers may be unintentionally storing disgusting and illegal images on their computers.

"If you have downloaded that full dataset for whatever purpose, for training a model for research purposes, then yes, you absolutely have CSAM, unless you took some extraordinary measures to stop it," Thiel told 404 Media.

Worse yet, LAION's leadership has reportedly known since at least 2021 that its datasets might sweep up CSAM, as internal Discord messages obtained by 404 Media show.

Thiel told 404 Media that the nonprofit didn't do enough to scan for CSAM, despite some early attempts.

LAION has since promised in a statement that it will take down the offending content.

But as the Stanford researchers noted, plenty of other images may still have fallen through the cracks.

While Thiel once believed these datasets could be effectively cleared of CSAM, he's now of the opinion that the datasets "just need to be scratched," he told 404 Media. "Places should no longer host those datasets for download."

More on CSAM: Every Single State's Attorney General Is Calling for Action on AI-Generated Child Abuse Material

