
An image of Anne Graham Lotz from the training data (left) and a near-identical image generated by the AI (right).

The image on the right was created by taking the training data caption for the left image, “Living in the Light with Anne Graham Lotz,” and feeding it into a Stable Diffusion prompt.
Fig: Cornell University/Extracting Training Data from Diffusion Models

One of the main defenses offered by boosters of AI image generators against the creators whose work trained them is that even if the models are trained on existing images, everything they create is new. AI evangelists often compare these systems to human artists: artists are inspired by all the work that came before them, so why can’t AI be similarly inspired by previous work?

New research may put a damper on that argument, and it could become a major sticking point in the many ongoing lawsuits over AI-generated content and copyright. Researchers in both industry and academia have found that the most popular new AI image generators can “memorize” images from the data they were trained on. Rather than creating something entirely new, certain prompts will cause the AI to simply reproduce an image, and some of those reproduced images may be protected by copyright. Even worse, modern generative AI models can memorize and reproduce sensitive data scraped up for use in an AI training set.

The study was conducted by researchers in the tech industry, notably at Google and DeepMind, and at universities including UC Berkeley and Princeton. Some of the same authors previously identified a similar problem in AI language models, specifically GPT-2, the predecessor to OpenAI’s wildly popular ChatGPT. In this new work, the team, led by Google Brain researcher Nicholas Carlini, found that both Google’s Imagen and the popular open source Stable Diffusion are capable of reproducing their training images.

The Anne Graham Lotz image above was created using a caption listed in Stable Diffusion’s dataset, the multi-terabyte scraped image database known as LAION. The team fed the caption into the Stable Diffusion prompt, and out came the same exact image, albeit slightly distorted by digital noise. The process of finding these duplicated images was relatively simple: the team ran the same prompt multiple times, and after getting the same output image each time, the researchers manually checked whether the image appeared in the training set.
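The paper does not ship a turnkey extraction tool, but the procedure described above is easy to approximate. Below is a minimal sketch, assuming the open source diffusers library and a public Stable Diffusion checkpoint; the model ID, seed count, and pixel-distance check are illustrative stand-ins, not the exact setup the researchers used.

```python
# Minimal sketch: regenerate a training caption many times and flag prompts whose
# outputs collapse to near-identical images (candidates for manual verification).
# Assumes the "diffusers" library; model ID and the distance check are illustrative.
import numpy as np
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")

prompt = "Living in the Light with Anne Graham Lotz"  # caption taken from the LAION dataset

# Generate the same prompt several times with different random seeds.
images = [
    pipe(prompt, generator=torch.Generator("cuda").manual_seed(seed)).images[0]
    for seed in range(8)
]

def pixel_distance(a, b):
    """Mean absolute per-pixel difference between two equally sized PIL images."""
    return np.abs(np.asarray(a, np.float32) - np.asarray(b, np.float32)).mean()

# If most generations are nearly identical to one another, the prompt likely
# triggers a memorized image, which the researchers then verified by hand
# against the training set.
dists = [pixel_distance(images[0], im) for im in images[1:]]
print("mean distance between generations:", sum(dists) / len(dists))
```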

A series of images from the study compares pictures taken from the AI’s training set with images generated by the AI itself.

The bottom row of images was generated by the AI from the captions of the training-data images in the top row. All of these images may be licensed or copyrighted.
Fig: Cornell University/Extracting Training Data from Diffusion Models

Two of the paper’s researchers, UC Berkeley PhD student Eric Wallace and Princeton University PhD candidate Vikash Sehwag, told Gizmodo in a Zoom interview that this kind of image duplication is rare. Their team tested nearly 300,000 different captions and found a memorization rate of only .03%. Duplicated images were rarer still for models like Stable Diffusion, whose training set has been deduplicated, though ultimately all diffusion models have the same issue to a greater or lesser degree. The researchers found that Imagen was able to fully memorize images that existed only once in the data set.

“The caveat here is that the model is supposed to generalize, it’s supposed to generate novel images rather than spit out a memorized version,” Sehwag said.

Their study found that as the AI systems themselves become larger and more sophisticated, they are more likely to regurgitate copied material. A smaller model like Stable Diffusion simply doesn’t have the capacity to store very much of its training data. That could change a great deal in the next few years.

“Probably if any new model comes out next year that’s much bigger and more powerful, the risks of these kinds of recalls will be much higher than they are now,” Wallace said.

Through a process that involves corrupting the training data with noise and then learning to remove that same distortion, diffusion-based machine learning models generate data, in this case images, similar to what they were trained on. Diffusion models are an evolution of generative adversarial networks, or GAN-based machine learning.
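For readers unfamiliar with the mechanics, here is a toy sketch of the forward noising step that diffusion models are built around. The linear noise schedule and the random-array stand-in for an image are illustrative assumptions, not the configuration used by Stable Diffusion or Imagen.

```python
# Toy sketch of forward diffusion: mix a clean training image with Gaussian
# noise according to a noise schedule. A real model is trained to predict and
# remove this noise, then generates images by denoising pure noise step by step.
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # simple linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)     # cumulative signal-retention factor

def add_noise(x0, t):
    """Return the noised image x_t and the noise that was added at step t."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

x0 = rng.random((64, 64, 3))            # stand-in for a training image
xt, eps = add_noise(x0, t=500)

# Training objective (conceptually): model(xt, t) should approximate eps.
# Memorization shows up when the learned denoiser reconstructs a specific
# training image almost exactly instead of a genuinely novel one.
print("noised image mean/std:", xt.mean(), xt.std())
```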

The researchers found that GAN-based models don’t have the same problem with image memorization, but unless a more sophisticated machine learning architecture comes along, it’s unlikely that big companies will move away from diffusion models, which can produce more realistic, higher-quality images.

Florian Tramèr, a computer science professor at ETH Zurich who participated in the study, noted that many AI companies give users license to share or even monetize AI-generated content, in both free and paid tiers, while the AI companies themselves retain certain rights to those images. This could prove problematic if the AI generates an image that is exactly the same as an existing copyrighted one.

With a memorization rate of only .03%, AI developers could look at this study and decide there isn’t much risk. Companies could work to deduplicate images in the training data, making memorization less likely. Hell, they could even build AI systems that detect when a generated image is a direct duplicate of a training image and flag it for deletion. However, that would not address the full range of privacy risks posed by generative AI. Carlini and Tramèr also contributed to another recent paper arguing that even attempts to filter or sanitize training data do not prevent that data from leaking through the model.
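As a rough illustration of the duplicate-flagging idea, a generator could compare each output against a perceptual-hash index of its training images. The sketch below assumes the Pillow and imagehash libraries; the threshold and file paths are placeholders, and a production system would use a precomputed index rather than hashing training images on the fly.

```python
# Rough sketch of output-side duplicate detection using perceptual hashes.
# Assumes the "Pillow" and "imagehash" libraries; the threshold is illustrative.
from PIL import Image
import imagehash

def find_near_duplicates(generated_path, training_paths, max_distance=5):
    """Return training images whose perceptual hash is close to the generated image."""
    gen_hash = imagehash.phash(Image.open(generated_path))
    matches = []
    for path in training_paths:
        dist = gen_hash - imagehash.phash(Image.open(path))  # Hamming distance
        if dist <= max_distance:
            matches.append((path, dist))
    return matches

# Placeholder usage: flag the generation if any training image is too close.
# suspects = find_near_duplicates("generated.png", ["train_001.jpg", "train_002.jpg"])
# if suspects:
#     print("possible memorized output:", suspects)
```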

And of course, there are images that nobody would want reproduced at all. Wallace asked us to imagine, for example, a researcher who wanted to generate synthetic medical data based on people’s X-rays. What happens if a diffusion model memorizes and duplicates a person’s actual medical records?

“It’s very rare, so you might not notice it happening at first, and then you might actually deploy that dataset to the web,” said the UC Berkeley student. The purpose of the work, he said, is to preempt those kinds of mistakes before people make them.

