Big Tech’s race for our data is on

The question is who we want to own the data, and how badly we want the AI that it feeds.

Illustration by Donna Grethen / Ikon Images

High-quality and wide-ranging data has been key in creating artificial intelligence for years. It is all the more important with the advent of, for example, OpenAI’s ChatGPT, which seems to be all anyone can talk about right now.

When OpenAI researchers created GPT-3, a precursor to ChatGPT, they filtered out 98.7 per cent of the data from their original dataset, and used datasets of books and English-language Wikipedia to achieve the resulting model. In fact, progress in AI may be limited more by a lack of high-quality data than by researchers’ access to specialised computers or talent.

This means that the race for data is on between companies such as Google and Microsoft to build increasingly powerful AI models in a bid for dominance in AGI (“artificial general intelligence”), a hypothetical AI that can do most human-level tasks as well as a human can. Or, at least, in the hopes of achieving economic dominance.

Google’s claim on data is already significant: it has access to its users’ Gmail messages, Google Docs and YouTube content, for example. The company knows the browsing activities and locations of billions of users. Your emails, among billions of other people’s messages, are regularly sent to Google to train its Gmail autocomplete feature, albeit with some encryption guarantees. This kind of data arrangement is critical for Google.

Companies and researchers that don’t have private repositories of data will have to partner with those that do. OpenAI, for example, has signed deals with Shutterstock to train DALL-E 2 (a programme that creates realistic images and art based on a text input) using Shutterstock data to create new stock images, and with BuzzFeed to help it to generate personalised quizzes and listicles. These deals are not just a victory for Shutterstock and BuzzFeed, which get extremely low-cost personalised content; they are also good for OpenAI, which potentially gets more high-quality training data.

Movie studios might be next. Warner Bros and Universal Studios have massive repositories of video data, probably of better quality than you could get from a pool of consumers. And maybe we will see one of the big tech companies buy training rights to all books from, say, HarperCollins and Simon & Schuster.

If the algorithms themselves get commoditised – as in, they become indistinguishable between companies – then data may become the economic moat in AI, the one factor that differentiates companies. But then, without much in the way of data protection laws, it’s probably easier for the companies whose products we already use every day to collect individual data than to sign deals with mega-corporations.

So it ultimately comes down to governments’ desire to regulate – and to us, the public. Do we, as societies and communities across the world, want increasingly advanced AI, even AGI? And if we do, who do we want to own the data? Which tech giants do we want to give the power to?