The rise of artificial intelligence (AI) has thrown American copyright law into a tailspin as regulators frantically attempt to monitor and regulate AI training practices. The current copyright system in the United States risks centralizing AI into the hands of a few massive corporations, and an aggressive application of these laws would blow our current lead and yield the industry to China.
To avoid this outcome, we must ensure that copyright laws are applied loosely to AI systems.
The development of AI requires tremendous amounts of data to train these new models. The Pile, an open-source dataset for training large language models like those behind ChatGPT, consists of 825 GB of raw text. For context, that comes out to over 140 billion words, or around 200 million copies of this article. Closed-source datasets, those kept private by corporations, can be an order of magnitude larger. When working with this amount of data, it becomes nearly impossible to ensure no copyright violations occur, and many have voiced concerns about models regurgitating privately owned content.
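For readers who want to check the scale figures above, here is a rough back-of-the-envelope sketch in Python. It rests on assumptions that are not stated in this article: about 6 bytes per English word (five letters plus a space), decimal gigabytes, and an article length of roughly 700 words. The function names are purely illustrative.

```python
# Back-of-the-envelope check of the dataset-size figures quoted above.
# All three constants below are assumptions made for illustration.
DATASET_BYTES = 825 * 10**9   # 825 GB as quoted, read as decimal gigabytes
BYTES_PER_WORD = 6            # assumption: ~5 letters plus one space per word
ARTICLE_WORDS = 700           # assumption: rough word count of this article


def estimate_words(dataset_bytes: int, bytes_per_word: int = BYTES_PER_WORD) -> int:
    """Estimate how many words fit in a plain-text dataset of the given size."""
    return dataset_bytes // bytes_per_word


def article_equivalents(total_words: int, article_words: int = ARTICLE_WORDS) -> int:
    """Express a word count as a number of article-length documents."""
    return total_words // article_words


words = estimate_words(DATASET_BYTES)
copies = article_equivalents(words)
print(f"{words / 1e9:.1f} billion words")
print(f"{copies / 1e6:.0f} million article-length copies")
```

Under these assumptions the estimate lands in the same range as the figures quoted above; a slightly smaller bytes-per-word value, or reading the size in binary units, pushes the word count past 140 billion.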
This is further complicated by code licenses. AI models are frequently used to write code, and thus must be trained on it. Publicly available code can be published under widely varying licenses, but all of it is open for humans to study. The line between applying lessons learned from code and copying code can be a blurry one, and it is plausible that AI models will reproduce sections of code that are not licensed for reproduction.
Some software developers have sued companies like Microsoft simply for including in their training data code that isn’t licensed for that use, and those lawsuits are still playing out. The issue of copyright has been just as contentious in the AI art world. Companies behind models like Stable Diffusion and Midjourney have faced copyright lawsuits from artists who claim their art was illegally used for training. Many artists have also complained that AI models will reproduce their particular styles upon request, and companies have taken steps to try to mitigate this.
All of this raises a very important question: should AI systems be bound by copyright like old-school deterministic software, or should they be free to learn from (and quote!) whatever information they come across, as we are? I would argue that these systems should be treated more like minds than like other software. If you own a book, you should be able to let an AI “read” the book and learn its contents.
Likewise, the same art that is the subject of recent copyright lawsuits is freely available online for individuals to view and learn from; simply having learned from copyrighted art does not bar those individuals from using their skills commercially. There is no clear reason to treat AI models differently from people, as they learn from other people’s works just as we do.
While we argue about how these systems should interact with copyright laws in the United States, massive corporations like Microsoft and OpenAI are plowing ahead, confident that their lawyers can minimize risk and that they can pay any fines or win any lawsuits that result from their questionable conduct. Likewise, foreign adversaries like China will happily vacuum up as much data as they can get their hands on, completely ignoring any copyright or privacy laws that might aim to restrict them. As such, we find ourselves hurtling towards a dystopia where these entities have a near monopoly on intelligence itself.
As a nation, we need to ensure that we are not hindering AI with eighteenth-century intellectual property laws. Not only should we avoid expanding these laws to cover AI, but we should also ensure our current laws are not broadly applied to new technologies.
Courts have broad authority under federal copyright law to determine fair use based upon the “purpose and character” of the use of copyrighted content. They should use this authority to protect all AI systems, so long as they do not exist solely to reproduce prior works. If the courts fail to take this stance, then federal law must be changed for the same purpose.
Dylan Dean studied electrical and computer engineering at Montana State University and Lake Washington Institute of Technology, and is an advocate for democratization of emerging technologies.