Among the many anxieties that have bubbled up in AI’s wake since ChatGPT launched in November 2022 (and ushered in the dawn of GenAI, or ‘generative AI’), few provoke as much outrage among creators of intellectual property (IP) as the prospect that their work is being used to train AI systems − without permission, attribution, or payment.
Lawsuits have started piling up. There was an early victory for copyright holders (Thomson Reuters vs. Ross Intelligence), but the technology in question did not use GenAI, so that outcome sits somewhat outside the main debate. Roughly ten matters are now ongoing, and the wind seems to be shifting in favour of the AI companies.
The reasons for this lie in the nuts and bolts of how GenAI engines work. None of the original training data is stored, regurgitated, or distributed. In the words of Substack writer Enrique Dans, “Instead, what it does is transform, remix, and reinterpret. It is not a copy, it is a synthesis.”
Judges are starting to hear that message.
I have previously written about the following thought experiment. Let’s say you are a young student, passionate about art. You go to university. You attend lectures on the great artists. You study their rendering techniques. You browse galleries, go abroad to look at the originals, buy art books, and read biographies of the artists.
Then you get your degree and go out into the world to start making your own art. Nobody would dispute that the original works you produce are an amalgam of everything you have studied and every artwork you have ever seen, plus the fuel of your own integrative creative powers.
Of course, none of the artists you have studied or seen would think to sue you for copyright breach. Those who are still around would likely be flattered by your interest and by their influence on you.
Ingests content
It turns out that AI works in much the same way. It ingests content − images, text, music, audio. It looks for hidden relationships between subsets of this training data, relationships buried so deep within the AI “brain” that they are not fully understood even by its developers and trainers. Then it produces new content from those relationships, guided by light instruction from a “prompter”.
GenAI output is even further removed from the original training data. If you look under the hood of ChatGPT or one of the other foundation models, you will not find the training data at all. All you find are billions of numbers in matrices describing the relationships between very small subsets of the original training data. If the Mona Lisa is part of the training set of one of these systems (which it no doubt is), you will not be able to find the image anywhere in the AI system once training is complete − not even a trace.
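A toy illustration makes the point concrete. The sketch below is not a real LLM − it is a crude next-character model “trained” on one sentence − but it shows the same principle: after training, the model is nothing but a matrix of numbers encoding statistical relationships, and the original text appears nowhere in it. All names here are invented for the example.

```python
import numpy as np

# Toy illustration (NOT a real LLM): "train" a next-character model on one
# sentence, then check whether the sentence survives inside the weights.
text = "the quick brown fox jumps over the lazy dog"
chars = sorted(set(text))               # the model's tiny "vocabulary"
idx = {c: i for i, c in enumerate(chars)}
n = len(chars)

# Count how often each character follows each other character.
counts = np.zeros((n, n))
for a, b in zip(text, text[1:]):
    counts[idx[a], idx[b]] += 1

# Normalise rows into transition probabilities. The trained "model" is now
# just this matrix of floats -- relationships, not content.
row_sums = counts.sum(axis=1, keepdims=True)
weights = counts / np.where(row_sums == 0, 1.0, row_sums)

# Is the training text stored anywhere in the serialized model? No.
print(text.encode() in weights.tobytes())  # False: no literal copy survives
```

A real foundation model does the same thing at vastly greater scale: billions of weights standing in for the statistics of the training corpus, with no retrievable copy of any individual work.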
So, when a plaintiff sues an AI defendant for copyright infringement, he or she faces a difficult problem. The judge might well order the defendant to open up the software for analysis − and the original IP will be nowhere to be found. Where, inside this system, is the song you wrote or the copyrighted image you produced? It will not be there. As Mr Dans puts it: “If there is no copy, there is no copyright.”
This brings us to one of the most closely watched recent cases, which looks to be heading toward defeat for the copyright holders. At the end of last year, Concord Music Group sued AI company Anthropic for “systematic and widespread infringement of their copyrighted song lyrics” without permission.
The arguments got a little gnarly. Anthropic argued that the lyrics are all over the web, and that using them in training was ‘fair use’ − a doctrine that seeks to balance the rights of copyright holders against the public interest in access to, and use of, information and creativity.
Careful prompting
Anthropic also argued that the lyrics were never meant to be output by the chatbot in full, and that the AI system could be “forced” to do so only by careful prompting (as the plaintiffs had done). Outputting a complete set of lyrics, they argued, was never a goal of the training. And, of course, they made the argument about statistical relationships between bits of data that I outlined above − the lyrics are not stored in, and retrieved from, anything like a database.
The judge refused to grant a preliminary injunction halting the use of the copyrighted song lyrics, finding that Concord had failed to demonstrate immediate and irreparable harm. In addition, the court granted Anthropic’s request to dismiss all other claims brought by Concord.
It is perhaps easier to argue the legal question than to answer the gut emotional response of artists and creators: their work is being used, in mathematically abstruse representations, and they are not getting any love at all, monetary or otherwise. That matter is not going to resolve easily. But if I were a smart lawyer representing IP creators, I would try to convince the judge that the AI may not store the copy, but it certainly remembers it − albeit in a strange new language of numbers and matrices.
So, is AI stealing? IP creators tend to think so. Is it illegal? It seems not. Unless, of course, copyright law is changed − and I suspect that this is where we are headed.
[Image: reve.art]
The views of the writer are not necessarily the views of the Daily Friend or the IRR.