Researchers from the Stanford Internet Observatory revealed in a study published earlier this week that a vast public dataset, commonly used to train popular AI image-generating models, contained over a thousand images of child sexual abuse material.
The research uncovered more than 3,200 images of suspected child sexual abuse within the extensive AI database LAION. This database, used to train prominent AI image generators such as Stable Diffusion, contains explicit images and captions scraped from the internet.
The watchdog organisation, based at Stanford University, worked with the Canadian Centre for Child Protection and other anti-abuse charities to identify the illicit content and subsequently reported the original photo links containing the illegal material to law enforcement. Of these, more than 1,000 suspected images were verified as child sexual abuse material.
What is LAION?
LAION, a German non-profit, stands for Large-scale Artificial Intelligence Open Network. The large dataset it provides serves as training data for generative AI models. Training involves exposing the AI system to a vast collection of images so that it learns the patterns, features, and styles present in that data. The model then learns to generate new, realistic images that share similarities with the examples it has seen during training.
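As a rough illustration of that "expose the model to images, minimise a loss" loop, here is a toy sketch in Python with PyTorch. It trains a tiny model to reconstruct a batch of placeholder images; it is a simplified stand-in for the idea, not the actual training code of Stable Diffusion or any other production model, and all data in it is synthetic.

```python
# Toy illustration only: a tiny model learning to reproduce images.
# Real systems use far larger architectures and billions of image-caption
# pairs; this sketch just shows the basic training loop.
import torch
import torch.nn as nn

# Stand-in "dataset": 64 random 3x32x32 images (placeholder data).
images = torch.rand(64, 3, 32, 32)

model = nn.Sequential(              # minimal encoder/decoder pair
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(5):
    optimizer.zero_grad()
    reconstruction = model(images)          # model's attempt to reproduce the input
    loss = loss_fn(reconstruction, images)  # how far it is from the real images
    loss.backward()                         # learn from the error
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```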
According to the report, the LAION-5B dataset, used by Stability AI, the creator of Stable Diffusion, contained at least 1,679 illicit images sourced from social media posts and well-known adult websites. Stable Diffusion is a generative AI tool that creates detailed images from short text prompts and can also generate words within images.
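For context, generating an image from a short prompt with a Stable Diffusion checkpoint typically looks like the sketch below, which uses Hugging Face's diffusers library; the model identifier, prompt, and output file name are illustrative, and a CUDA-capable GPU is assumed.

```python
# Illustrative only: text-to-image generation with a Stable Diffusion checkpoint.
# Model ID and prompt are examples; a CUDA-capable GPU is assumed.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# A short text prompt is enough to produce a detailed image.
image = pipe("a watercolour painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```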
In September 2023, researchers initiated an examination of the LAION dataset to assess the presence of child sexual abuse material (CSAM). The investigation involved scrutinising hashes, which are image identifiers, and submitting them to CSAM detection platforms such as PhotoDNA. The results were subsequently verified by the Canadian Centre for Child Protection.
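At a high level, hash-based matching computes a fingerprint of each image and checks it against a database of fingerprints of previously identified material. The sketch below shows the idea using ordinary cryptographic hashes and a hypothetical set of known hashes; production services such as PhotoDNA use proprietary perceptual hashes that tolerate resizing and re-encoding, and access to them is restricted to vetted organisations, so this is only a conceptual illustration.

```python
# Conceptual sketch of hash-based matching; not the PhotoDNA API.
# KNOWN_HASHES is a hypothetical placeholder for a database of fingerprints
# of previously identified material.
import hashlib
from pathlib import Path

KNOWN_HASHES = {
    "3f79bb7b435b05321651daefd374cdc681dc06faa65e374e38337b88ca046dea",  # example value
}

def fingerprint(image_path: Path) -> str:
    """Return a SHA-256 hex digest of the raw image bytes."""
    return hashlib.sha256(image_path.read_bytes()).hexdigest()

def flag_matches(image_dir: Path) -> list[Path]:
    """Return image files whose fingerprints appear in the known-hash set."""
    return [p for p in image_dir.glob("*.jpg") if fingerprint(p) in KNOWN_HASHES]

if __name__ == "__main__":
    for match in flag_matches(Path("images")):
        print(f"match found: {match}")  # in practice, such hits are reported, not printed
```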
Even though these images constitute only a small portion of LAION's index of approximately 5.8 billion images, the Stanford group suggests they likely affect the capacity of AI tools to produce harmful output. The material is also believed to compound the prior victimisation of real individuals whose images appear repeatedly in the dataset.
According to its own website, the LAION dataset doesn't store images; it only keeps links to images and their alt text scraped from the internet. The German non-profit also said in a statement that it has a “zero tolerance policy for illegal content”.
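In practice, the dataset is distributed as metadata tables of links and captions rather than as image files. The sketch below shows how such a table might be inspected with pandas; the file name and the column names ("URL", "TEXT") are assumptions based on how LAION-style metadata is commonly laid out, not an official schema.

```python
# Sketch of inspecting LAION-style metadata: links plus alt text, no image files.
# File name and column names are assumptions, not an official schema.
import pandas as pd

metadata = pd.read_parquet("laion_metadata_shard.parquet")

# Each row points at an image hosted elsewhere on the web.
for row in metadata.head(5).itertuples():
    print(row.URL)   # link to the externally hosted image
    print(row.TEXT)  # the alt text scraped alongside it
```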
In light of the report, this week LAION also announced the creation of "rigorous filters" to identify and eliminate illicit content before the release of its datasets, with ongoing efforts to enhance these filters. It stated its intention to conduct a comprehensive safety review of its dataset by the second half of January and aims to republish it following the completion of this review.
The Stanford report recognised that LAION's developers had made certain efforts to filter out explicit content involving individuals who are underage. However, the report suggested that consulting with child safety experts earlier could have resulted in more effective measures.
AI and the CSAM challenge
The researchers acknowledged it would be difficult to fully remove the problematic content, especially from the AI models trained on it. They recommended that models trained on LAION-5B, such as Stable Diffusion 1.5, “should be deprecated and distribution ceased where feasible”.
However, this is not the first time AI has been drawn into a CSAM controversy. In September, the safety watchdog group Internet Watch Foundation flagged that paedophiles were exploiting freely available artificial intelligence software to produce CSAM. Its report also highlighted that offenders were manipulating photos of celebrity children or known victims to generate new content with AI.
Similarly, last month, a UK-based organisation found 3,000 AI-made abuse images available online. In these cases, the technology was being used to generate images of celebrities that had been digitally altered to appear younger and to depict them in scenarios involving sexual abuse. The organisation also noted instances of AI tools being used to remove clothing from pictures of clothed children sourced from the internet.