Data poisoning could be a tool we use to identify AI that has used copyrighted material, or we could use it to mess with AI.
https://www.vice.com/en/article/infinite-ai-homer-simpson-cover-songs-poisoned-soulseek/
https://mosis.eecs.utk.edu/harmonycloak.html
https://mosis.eecs.utk.edu/publications/meerza2024harmonycloak.pdf
identify AI that has used copyrighted material
But that is basically all modern “AI”.
(The only LLM I’ve heard of which actually claims that its training corpus is freely licensed is Apertus…)
Traditionally, with machine learning, it is standard practice to mention which datasets and/or pretrained models were used, so that the results are transparent and can be replicated. With GPT-2, it was WebText, OpenAI’s own crawl of about 8 million web pages (they looked at Common Crawl but decided against it), and since then I feel it’s mostly left out, with (easily manipulated) benchmarks offered instead 😬
Yep. But just providing a list of millions of URLs and saying “we trained on this” as some models in the past have done also didn’t make it possible to replicate; by the time anyone re-fetches them all, many of the URLs will inevitably have changed or disappeared.
That’s exactly why projects like Common Crawl exist, though!
People have actually been doing this to catch plagiarism for centuries, long before LLMs were a thing.
See trap streets for one of the better-known examples.
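
For anyone who wants to play with the trap street idea on text, here is a rough Python sketch. It is only an illustration, not how HarmonyCloak or any real tool works: you plant unique marker strings ("canaries") in your own documents and later check whether some suspect text reproduces them. The function names `make_canary`, `plant_canaries` and `find_canaries` are made up for this example, and it only catches verbatim copying.

```python
import secrets


def make_canary(prefix: str = "canary") -> str:
    """Return a random marker string that is very unlikely to occur by chance.

    Hypothetical helper: the format (prefix + 16 hex characters) is an
    assumption made for this sketch.
    """
    return f"{prefix}-{secrets.token_hex(8)}"


def plant_canaries(documents: list[str], canaries: list[str]) -> list[str]:
    """Return a copy of `documents` with one canary appended to each of the
    first len(canaries) documents (the textual analogue of a trap street)."""
    planted = list(documents)
    for i, canary in enumerate(canaries):
        if i >= len(planted):
            break  # more canaries than documents: ignore the extras
        planted[i] = f"{planted[i]} {canary}"
    return planted


def find_canaries(text: str, canaries: list[str]) -> list[str]:
    """Return the canaries that appear verbatim in `text`."""
    return [c for c in canaries if c in text]


if __name__ == "__main__":
    my_canaries = [make_canary() for _ in range(2)]
    published = plant_canaries(["first post", "second post", "third post"], my_canaries)
    # Pretend this came back from some model or scraper:
    suspect_output = published[1]
    print(find_canaries(suspect_output, my_canaries))
```

A hit is decent evidence that the text was copied from you, just like a fake street on someone else's map. No hit proves nothing, since a model may never reproduce the marker word for word.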




