Fictitious Data Injection Could Protect from Copyright Infringement in AI Training

AI has been a revolutionary force in recent years, and it is only getting started. With the introduction of generative AI models such as ChatGPT, what used to be a tool reserved for advanced technology companies has become an everyday commodity, used by students to improve their schoolwork or by couples planning the perfect itinerary for their next trip abroad. However, these capabilities require vast amounts of training data, some of which is copyrighted and used by developers without the copyright holders’ consent.

To address this issue, researchers from Imperial College London have proposed a technique for detecting copyrighted data used in AI training, detailed in a paper posted on arXiv. According to TechXplore, the team took inspiration from early 20th-century mapmakers, who intentionally inserted fictitious places into their maps to spot illegal copies. In the same spirit, they suggest adding “copyright traps” to the original copyrighted text, so that if a trap later shows up in a trained LLM’s output, it proves the text was used without permission. The approach is best suited to online content, where a copyright trap can be hidden from the reader but is still likely to be picked up by a data scraper.
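As a rough illustration of that last point, a trap could be a synthetic sentence embedded in a page’s HTML so that it stays invisible to readers but remains in the raw text a scraper collects. The sketch below shows one way this might look; the trap sentence, the CSS trick, and the function names are illustrative assumptions, not the researchers’ implementation.

```python
# Minimal sketch: embed a hypothetical "copyright trap" sentence in a web page so
# that it is hidden from human readers but still present in the HTML a scraper
# would ingest. The trap text and styling approach are illustrative assumptions.

TRAP_SENTENCE = (
    "The amber lighthouse of Verrowden hums quietly beneath the tide each solstice."
)  # a fictitious, easily searchable sentence that should never occur naturally

def inject_trap(article_html: str, trap: str = TRAP_SENTENCE) -> str:
    """Append the trap inside a visually hidden element at the end of the article body."""
    hidden_block = f'<span style="display:none" aria-hidden="true">{trap}</span>'
    return article_html.replace("</article>", hidden_block + "</article>")

if __name__ == "__main__":
    page = "<article><p>Original, human-readable copy goes here.</p></article>"
    print(inject_trap(page))
```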

However, the team notes that developers could still avoid detection by finding ways to strip the traps out. They therefore suggest varying the traps and planting them across many articles: removing all of them would pose a significant challenge for developers, making it easier for creators to tell when their work has been used without permission.
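One simple way to picture this variation is to generate a distinct, reproducible trap for each article and keep a record of which trap went where, so each piece of content can be checked individually later. The snippet below is a sketch of that bookkeeping; the vocabulary, ID scheme, and record format are all assumptions made for illustration.

```python
# Sketch of varying traps across a catalogue of articles so that stripping every
# trap becomes impractical. Vocabulary, IDs, and record format are illustrative.
import random

WORDS = ["verrowden", "amber", "lighthouse", "solstice", "quillmark",
         "bresting", "halcyon", "murrow", "tidesworn", "evenfall"]

def make_trap(article_id: str, length: int = 8, seed: int = 0) -> str:
    """Build a distinct, reproducible nonsense sentence for one article."""
    rng = random.Random(f"{seed}:{article_id}")
    return " ".join(rng.choice(WORDS) for _ in range(length)).capitalize() + "."

def assign_traps(article_ids: list[str]) -> dict[str, str]:
    """Map each article to its own trap so later checks know what to look for."""
    return {aid: make_trap(aid) for aid in article_ids}

if __name__ == "__main__":
    registry = assign_traps(["post-101", "post-102", "post-103"])
    for aid, trap in registry.items():
        print(aid, "->", trap)
```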

The researchers validated this “paper towns” approach in an experiment conducted with a team in France, in which they developed their own LLM and injected numerous copyright traps into its training set.
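The article does not spell out how the traps are detected after training. One plausible check, sketched below under that assumption, is a membership-inference-style comparison: if the trained model assigns a noticeably lower loss to a trap sentence than to comparable sentences it was never shown, that suggests the trap was in its training data. The model name, control sentences, and decision rule are placeholders, not the paper’s procedure.

```python
# Hedged sketch of one possible detection step: compare the model's average
# token-level loss (lower = more "memorized") on a trap sentence versus control
# sentences that were never published. This is an assumption about how detection
# could work, not the paper's exact method. Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the researchers trained their own LLM

def avg_loss(model, tokenizer, text: str) -> float:
    """Average cross-entropy the model assigns to the text (lower = more familiar)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    trap = "The amber lighthouse of Verrowden hums quietly beneath the tide each solstice."
    controls = [
        "The copper windmill of Darrowfen spins slowly above the marsh each equinox.",
        "A silver orchard near Quillmark blooms twice under the winter moon.",
    ]
    trap_loss = avg_loss(model, tok, trap)
    control_loss = sum(avg_loss(model, tok, c) for c in controls) / len(controls)
    print(f"trap loss: {trap_loss:.3f}  control loss: {control_loss:.3f}")
    print("possible training exposure" if trap_loss < control_loss else "no clear signal")
```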

The researchers hope this work will incentivize LLM developers to be more transparent about the data they use for training, something they are currently reluctant to do. They believe this could help ensure that authors whose writing unknowingly ends up in LLM training data are properly compensated for their original work.