The New York Times has become the first major media organisation to begin legal action against the use of its journalism by large language models and the “AI” chatbots, such as ChatGPT, on which they are based. In a lawsuit against OpenAI and Microsoft, filed in New York yesterday (27 December), the NYT claims that millions of its articles – products it spent “billions of dollars” creating – have been used to train the technology behind ChatGPT.
On the one hand this seems like a fairly straightforward case: publicly available information confirms that the NYT was the largest privately owned component of the Common Crawl dataset – a “copy of the internet” that comprised 60 per cent of the data used to train OpenAI’s GPT-3 model. The NYT’s complaint claims “at least 16 million unique records of the content from the Times” were used to train GPT-3, and the later models used in ChatGPT seem likely to have used much of the same data.
The NYT’s evidence includes repeated examples in which it claims ChatGPT produced responses that were effectively identical to the paper’s reporting, including when the software was asked to replicate paywalled information. OpenAI and Microsoft (which are, after some recent boardroom drama, closer than ever to being the same company) have made huge gains from ChatGPT: OpenAI is looking for new investment at a $100bn valuation, while Microsoft’s market value has risen by a trillion dollars in the last year.
On the other hand, the case addresses the wider question of who owns information, and who gets to profit from it. To touch briefly on the semantics that may be debated in court: OpenAI/Microsoft might argue that because LLMs simply predict text using statistics, ChatGPT isn’t copying the NYT’s article, it’s coming up with a prediction for what the text of the NYT article would be (a prediction that is note-perfect, because it also happens to have accessed the article). The technology is new but the business model – distribute and replicate information, without claiming to own it – has been fundamental to Big Tech for more than two decades.
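The statistical prediction described above can be sketched in miniature. This is not how GPT models actually work internally – they use neural networks over billions of subword tokens, not word counts – but a toy bigram model (with an invented corpus) illustrates the principle of “predicting the next word from statistics” on which the OpenAI/Microsoft defence rests:

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
corpus = (
    "the times reported the story and the times published the report"
).split()

# Count how often each word follows another (a "bigram" model).
successors = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    successors[current][nxt] += 1

def predict_next(word):
    """Return the statistically most likely next word, or None if unseen."""
    counts = successors.get(word)
    return counts.most_common(1)[0][0] if counts else None

# "times" follows "the" more often than any other word in this corpus,
# so the model "predicts" it -- without storing the sentence verbatim.
print(predict_next("the"))
```

The point of contention is what happens when the training data contains one source so often, or so distinctively, that the “prediction” reproduces it word for word.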
There are great arguments for making information free, as exemplified by Wikipedia (also used to train ChatGPT). No-one should own the periodic table or the Canterbury Tales or the right to make paracetamol. The same could be said of the news: if the government is investing in the same companies as the Prime Minister’s wife, the public should know. But in order for someone to do the work of finding that out, there has to be space for someone to profit from gathering the news.
For three centuries, this was achieved by selling the paper on which the news was printed, so there was no need for anyone to claim that the news itself was their intellectual property. Search engines and social media bypassed this mechanism, and while legislators and regulators spent decades dithering over who should be accountable for what, Google and Facebook were free to profit as distributors of journalism (and much else), without taking on any of the responsibilities of a publisher.
Investors in companies such as OpenAI are keen for this mistake to be repeated, at a greater scale. Marc Andreessen, a prominent venture capitalist and investor in OpenAI, recently wrote to the US Copyright Office: “Imposing the cost of actual or potential copyright liability on the creators of AI models will either kill or significantly hamper their development” – a statement that a cynic might translate as “if we’re not allowed to use everyone else’s work for free, we can’t make a fortune”.
Andreessen’s implication is that Big Tech must be allowed to continue to enjoy profit without responsibility, or another state – China, which has even less regard for intellectual property – will take the lead in AI. To restrain Big Tech would also restrain financial markets, which depend heavily on the performance of tech companies.

Machine learning has huge benefits to offer the world; the automated analysis of large amounts of data is transforming fields such as astronomy and drug discovery. But the risks are also becoming sharply apparent: the internet is already drowning in auto-regurgitated soup, and many local newspapers – vital to democracy – have become zombies, used for advertising and propaganda. The NYT’s lawsuit highlights what could be lost if chatbots are given free rein to dominate the distribution of information: the world will take another step in dismantling a system that has held power to account for most of the post-feudal era.
[See also: The strange resurrection of the local paper]