LAION, the German research organization that creates the data used to train generative AI models such as Stable Diffusion, has published a new dataset that it claims has been “thoroughly cleaned of any known links to suspected child sexual abuse material (CSAM).”
The new dataset, Re-LAION-5B, is a re-release of the older LAION-5B dataset, with “fixes” implemented on the recommendations of the nonprofit Internet Watch Foundation, Human Rights Watch, the Canadian Centre for Child Protection, and the now-defunct Stanford Internet Observatory. Two versions are available for download, Re-LAION-5B Research and Re-LAION-5B Research-Safe (which additionally removes NSFW content); according to LAION, both filter out thousands of links to known (and “likely”) CSAM.
“LAION has been committed to removing illegal content from its datasets from the very beginning and has implemented appropriate measures to achieve this from the outset,” LAION said in a blog post. “LAION strictly adheres to the principle of removing illegal content as soon as it becomes aware of it.”
It’s important to note that LAION’s dataset does not contain, and never has contained, images: it is a curated index of links to images and image alt text, all of which come from a separate dataset of scraped sites and web pages (Common Crawl).
The release of Re-LAION-5B follows the results of a December 2023 investigation by the Stanford Internet Observatory, which found that LAION-5B (specifically a subset called LAION-5B 400M) contained at least 1,679 links to illicit images collected from social media posts and popular adult sites. According to the report, 400M also contained links to “a variety of inappropriate content, including pornographic images, racist slurs, and harmful social stereotypes.”
The report’s Stanford co-authors noted that removing the problematic content is difficult and that the presence of CSAM does not necessarily affect the output of models trained on the dataset. In response, LAION temporarily took LAION-5B offline.
The Stanford report recommended that models trained on LAION-5B “should be retired and, where possible, removed from distribution.” Perhaps relatedly, AI startup Runway recently removed the Stable Diffusion 1.5 model from the AI hosting platform Hugging Face; we’ve reached out to the company for more information. (Runway partnered with Stability AI, the company behind Stable Diffusion, to help train the original Stable Diffusion model, which was released in 2022.)
LAION says that the metadata in the new Re-LAION-5B dataset, which contains approximately 5.5 billion text-image pairs and has been released under the Apache 2.0 license, can be used by third parties to clean up existing copies of LAION-5B by removing the matching illegal content.
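As a rough sketch of what that cleanup could look like in practice, a lab holding a local copy of LAION-5B could keep only the rows whose links survive in the cleaned release. The file paths and the “url” column name below are hypothetical assumptions for illustration, not LAION’s published schema:

```python
# Hypothetical sketch: filtering a local LAION-5B shard against Re-LAION-5B.
# File paths and the "url" column name are assumptions, not LAION's real schema.
import pandas as pd

relaion = pd.read_parquet("re-laion-5b-metadata.parquet")      # cleaned release
local_shard = pd.read_parquet("laion-5b-shard-00000.parquet")  # old local copy

# Any link LAION removed (via its partners' lists of link and image hashes)
# is absent from the cleaned metadata, so dropping non-matching rows removes it.
allowed_urls = set(relaion["url"])
cleaned = local_shard[local_shard["url"].isin(allowed_urls)]

cleaned.to_parquet("laion-5b-shard-00000.cleaned.parquet")
```

At the scale of 5.5 billion rows, a real pipeline would join shards on hashed identifiers rather than hold every URL in memory, but the principle is the same: rows absent from Re-LAION-5B get dropped.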
LAION emphasizes that its datasets are intended for research, not commercial use, but history suggests that this won’t deter some organizations: beyond Stability AI, Google previously used LAION datasets to train its image-generating models.
“A total of 2,236 links [to suspected CSAM] were removed after being checked against a list of links and image hashes provided by our partners,” LAION continued in the post. “These links also included 1,008 links discovered in the Stanford Internet Observatory report from December 2023… We strongly encourage all labs and organizations still using the old LAION-5B to migrate to the Re-LAION-5B dataset as soon as possible.”