The magazine Popular Mechanics, where I once worked, used to have a column called “Saturday Mechanic.” It was a guide to basic car repair for the weekend tinkerer, and its author had decades of experience both in fixing cars and writing about them. Nonetheless, for each column, he would perform the task in question, carefully documenting each step with photographs. It was a lot of work, in other words.
One day I happened upon a website offering a wide range of how-to advice, including several of the topics our Saturday Mechanic column had recently covered. It took only a minute to see that this author was simply paraphrasing our columns virtually line by line. If our article said, “Bleed the excess air out of the brake line,” the rip-off version might read, “Expel the unnecessary vapor from the brake tube.” The author, who was no doubt being paid pennies a word, probably thought her clever rewording was enough to keep her on the safe side of the plagiarism line. Of course, her awkward word choices also made it obvious that she’d never lifted a car hood in her life.
And that’s how Artificial Intelligence works.
OK, not exactly. Not technically. But large language models (LLMs) like OpenAI’s ChatGPT have a lot in common with that hapless author paraphrasing car-repair columns. Google, Meta, Microsoft, OpenAI, and other companies developing so-called generative AI models train their systems on vast quantities of content. The companies often simply say they get their training materials from “the Internet” and are coy about the details. But it appears the most valuable source material comes from online newspapers, magazines, news websites, and books. In other words, copyrighted material written by professionals with at least some knowledge of their fields. An LLM breaks these texts down to their smallest possible components and then makes sophisticated models of how different words (or even parts of words) are most likely to fit together in different contexts. If the model combines the letters “work” with “ing,” for example, it knows the word that will follow is statistically more likely to be “out” or “hard” than, say, “potato.” LLMs build up matrices of these “word vectors” that can predict with surprising subtlety how thousands of words relate to each other, and how they can be assembled into readable paragraphs.
When you ask ChatGPT a question, it simply assembles words according to those algorithms, producing smooth prose that mimics the kinds of word choices thousands of previous authors have made when writing about the topic. Like the struggling author who produced Saturday Mechanic retreads, the AI system doesn’t really know what it is talking about. But unlike her, the LLM does not usually paraphrase an individual text directly. Nor does it store a file of information about the world. LLMs like ChatGPT are not super-duper versions of Wikipedia. Instead, everything an LLM needs to produce passable prose on any topic is contained within its vast matrix of predictable word relationships.
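For readers who like to see the gears turn, here is a toy illustration of that next-word guessing, written in Python. It is a minimal sketch and nothing like a real LLM: it simply counts which word follows which in a four-sentence corpus invented for this example, whereas real systems learn far richer patterns from billions of words. The corpus and the function names (`build_counts`, `next_word_probabilities`) are assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus invented for this illustration; a real model trains on billions of words.
CORPUS = (
    "she is working out at the gym . "
    "he is working hard on the car . "
    "they are working hard this week . "
    "we are working out the details ."
)


def build_counts(text):
    """Count how often each word is immediately followed by each other word."""
    words = text.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def next_word_probabilities(follows, word):
    """Turn the raw follow-counts for `word` into probabilities."""
    counts = follows.get(word, Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


follows = build_counts(CORPUS)
probs = next_word_probabilities(follows, "working")
print(probs)  # {'out': 0.5, 'hard': 0.5}
print(probs.get("potato", 0.0))  # 0.0, because "potato" never follows "working" here
```

Even this crude counter "knows" that "out" and "hard" are plausible after "working" and "potato" is not, without knowing anything about gyms or cars.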
These databases of word vectors can generate shockingly sophisticated text, including detailed facts and cogent opinions, despite having no direct knowledge of the topic at hand. So in effect they are absorbing and reformatting not just patterns of words but, indirectly, the knowledge and ideas embedded within those patterns. And that creates a dilemma.
The recently deposed Harvard president, Claudine Gay, got caught copying the words of other scholars without giving them proper credit. Gay’s supporters and critics are still arguing about whether her borrowings involved merely generic prose or constituted real intellectual theft. We could ask the same about LLMs. These systems can ingest huge swaths of human knowledge and then re-create them using slightly different words. (Other AI models do something similar for music and the visual arts.) This is a stunning and no doubt world-changing innovation. But is it plagiarism? Do the authors who did the hard work of creating the content LLMs devour have any say in the matter? I think they should.
Questions of authorship and ownership arise every time information technology takes a leap forward. The early printing press made it possible for rogue publishers to rush out bootlegged versions of popular literary works just weeks after the originals hit the streets. During the Industrial Revolution an inventor might spend years designing a labor-saving machine only to see a competitor copy the design overnight. Our Founding Fathers believed protecting the rights of innovators was one of the federal government’s core responsibilities, as set forth in Article 1 of the Constitution: “To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.” Eventually, lawmakers enumerated a whole body of laws protecting intellectual property, including copyrights, trademarks, patents, and the like.
Nonetheless, IP law often struggles to keep up with technological innovation. The invention of the phonograph and the growth of radio led to decades of legal and business battles over how—and how much—songwriters should be paid when their songs hit the airwaves. When Sony introduced its Betamax VCR in the 1970s, Disney and other entertainment companies filed suit, claiming that people recording televised shows and movies at home were violating their copyrights. That “Betamax case” went all the way to the Supreme Court, which in 1984 ruled that making home recordings of Dallas or the Sunday Night Movie fell under the category of “fair use.” One’s home is one’s copyright castle, the Court basically decided. The advent of online music and video file-sharing led to similar legal skirmishes over the next two decades.
But the coming legal war over AI and copyright laws is likely to dwarf all previous IP battles for two reasons. First, AI is much more extensive—more invasive—than any individual information medium. Content channels such as radio, streaming services, and websites deliver songs, movies, articles, and the like. But AI systems aim to extract the essence out of almost every form of human expression, from online comments to books to photography to computer code. I’m sure teams at Amazon and Google are right now devising methods to mine our kitchen and bedroom conversations for AI insights. (I’m talking to you, Alexa.) While only a handful of companies have pockets deep enough to build major AI platforms, tens of millions of people around the world make music, art, or journalism, and billions produce social-media content. All that material will be fodder for the AI giants. Which means that never before have so few been in a position to steal so much from so many.
Second, AI leaves few fingerprints. As I described above, AI systems don’t search the Internet looking for answers to our questions. Instead, they generate answers using the compositional rules they’ve extracted from studying billions of words on every topic under the sun. In other words, these large language models aren’t copying content exactly; they are reconstituting content based on abstract underlying codes. (At least that’s how it works in theory. There’s a big debate in AI science right now about how much LLMs sometimes “memorize” large chunks of text.) In principle, LLMs generate text in a process similar to cloning an organism using only the knowledge of its DNA. But instead of the clone being an exact replica of a single animal, it is more like an average based on many—a generic example of the species. That indirect route from source material to copy is a challenge when it comes to proving claims of plagiarism.
One school of thought insists that AI systems are doing the same thing humans do when they absorb knowledge or influences and then create something similar. A writer might read various articles about, say, copyright and AI, and then write his own take on the topic. Or a songwriter could admire another artist’s work and then write a song in the same vein. This kind of indirect borrowing generally falls under the legal heading of fair use. Advocates for free-range artificial intelligence say that AI systems should be given the same legal deference.
In comments to the U.S. Copyright Office, the venture-capital giant Andreessen Horowitz, a major AI backer, wrote that imposing the costs of copyright liability on AI companies “will either kill or significantly hamper their development.” People who produce copyrighted content naturally disagree. “This is a new frontier,” Keith Kupferschmid, the president of the Copyright Alliance, an organization that defends copyright holders, told the New York Times. “Nobody knows how this is going to come out, and anyone who tells you ‘It’s definitely fair use’ is wrong.”
Kupferschmid sees two ways AI systems can take advantage of copyright holders: They can rely on copyrighted material to train their models without offering compensation; or they can generate nearly identical clones of such material in response to prompts from users. Clearly, both scenarios are happening today. In December, the New York Times sued OpenAI and Microsoft for using millions of Times articles to train their chatbots. “Defendants seek to free-ride on The Times’s massive investment in its journalism,” the complaint says, accusing the companies of using the paper’s content “to create products that substitute for The Times and steal audiences away from it.” One doesn’t have to be an uncritical fan of the New York Times to see that the Gray Lady has a point.
The Times was the first major media company to sue the AI pioneers; more will follow. (Others, including AP, have reached agreements with OpenAI to license their content to the company.) Gary Marcus, an influential AI sage and New York University professor emeritus, believes that OpenAI, as the current leader in the field, will attract the most legal artillery. “Lawsuits against the company are likely to come fast and furious,” he wrote on Substack in January. Much like the Betamax case, one or more of these suits involving AI and copyright will eventually reach the Supreme Court.
The AI companies will likely argue that, like human writers synthesizing knowledge, their chatbots are merely reconstructing well-known facts from a wide range of sources: You can’t attribute any single answer to a particular copyrighted source. However, Marcus notes that “because of the immense amounts of data they are now trained on, and because of the immense number of parameters on which they are now trained, that reconstruction can in fact sometimes come very close to memorization.” The New York Times’ suit includes dozens of examples in which OpenAI’s ChatGPT wound up regurgitating entire paragraphs of Times copy verbatim. That’s going to be hard to explain to the judges.
Such “plagiaristic outputs” are especially, well, visible when it comes to image-based AI platforms including Midjourney, OpenAI’s DALL-E, and, most recently, Google’s Bard. Marcus and other researchers have shown that these platforms lean heavily on copyrighted and trademarked content. Ask one of these image generators to “draw a videogame plumber” and it will return an image that looks almost exactly like Nintendo’s famous Mario. A request for “animated toys” will likely turn up characters that look like they just stepped out of Toy Story or another Pixar movie. “Golden droid from classic sci-fi”? Bingo! Star Wars’ C-3PO coming right up.
My concern for the copyright travails of Disney and other media conglomerates is less than overwhelming. (Over the decades, copyright law has expanded to their enormous benefit.) I’m more worried about the lesser-known writers, musicians, photographers, and artists whose creative talents are being scraped up by the AI bots. A nature photographer might spend months tracking the rare snow leopard. When some website or ad agency buys that image from a stock agency, the photographer makes a small sum. But when a DALL-E user asks for a “cute leopard in snow,” a virtually identical image might appear in seconds. Already, many websites and other image users are relying on such AI-generated images instead of copyrighted photos or illustrations. We will see similar examples of AI appropriation in every field of creative expression.
But content creators aren’t the only ones who will lose out in the AI era. The Web watchdog NewsGuard has identified dozens of AI-generated websites that, like my Saturday Mechanic rewriter, create paraphrased retreads of news articles from legitimate sources. Already, readers searching for news or information online can easily land on sketchy sites peddling a mix of reliable but plagiarized material combined with lies, conspiracy theories, and faked AI photos. Without knowing the source, we have no way to judge the veracity of any of this material.
And even when the content we consume isn’t corrupted by falsehoods, it will suffer from an oppressive sameness. AI bots can ingest all the content in the world. They can blend and homogenize it into an indistinguishable mélange. But they can’t create anything genuinely original. I have no doubt that the AI revolution will bring enormous benefits in medicine and other fields. But will it also create a cultural landscape that is weary, stale, flat, and unprofitable for anyone but the AI giants themselves? We’re about to find out how Hobbesian the AI world is going to be.