Are you ready for the era of Big Data?

Businesses and governments agree: the more personal information they gather about us, the more “helpful” they can be. Should we give in to this “harmless” new science of benign surveillance?

Illustration: Andrew Baker/Ikon Images

Data will save us. All we need to do is measure the world. When we have quantified everything, problems both technical and social will melt away. That, at least, is the promise of “Big Data” – the buzzphrase for the practice of collecting mountains of data about a subject and then crunching away on it with shiny supercomputers. The term has lately become so ubiquitous that people make wry jokes about “small data”. But Big Data is not only something geeks do in the science lab or the start-up company; it affects us all. So we had better understand what its plans are.

Miraculous things can already be accomplished. By analysing web searches tied to geographical location, Google Flu Trends can track the spread of an influenza epidemic in near-real time, thus helping to direct medical resources to the right places. Another of the company’s services, Google Translate, is so effective not because it understands language to any degree, but because it holds huge corpora of written examples in various tongues and knows statistically which phrase in the target language most often corresponds to a given phrase in the source text. Meanwhile, aircraft and other complex engineering systems can be made more reliable once components are able wirelessly to phone home information about how they are functioning. This mammoth store of telemetry data can be analysed to predict part failures before they happen.
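For the curious, the statistical heart of that translation trick fits in a few lines of Python. What follows is a minimal sketch: the phrase pairs are invented, and real systems learn from billions of aligned sentences with far subtler probability models, but the principle of picking the target phrase most often paired with a given source phrase is the same.

```python
from collections import Counter, defaultdict

# A toy "phrase table" built from a parallel corpus: for each source phrase,
# count how often each candidate translation appears alongside it.
# These aligned pairs are invented for illustration.
aligned_pairs = [
    ("guten morgen", "good morning"),
    ("guten morgen", "good morning"),
    ("guten morgen", "morning"),
    ("vielen dank", "thank you very much"),
    ("vielen dank", "many thanks"),
    ("vielen dank", "thank you very much"),
]

phrase_table = defaultdict(Counter)
for source, target in aligned_pairs:
    phrase_table[source][target] += 1

def translate(phrase: str) -> str:
    """Return the target phrase most often paired with the source phrase."""
    candidates = phrase_table.get(phrase)
    if not candidates:
        return phrase  # no evidence: fall back to the input unchanged
    return candidates.most_common(1)[0][0]

print(translate("guten morgen"))  # -> good morning
print(translate("vielen dank"))   # -> thank you very much
```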

But Big Data is not just an approach that improves uncontroversially useful systems. It’s also a hype machine. IBM, for instance, offers to furnish companies with its own Big Data platform, the PR material for which is a savoury mix of space-age techspeak and corporate mumbo-jumbo. “Big data represents a new era of computing,” the company promises, “an inflection point of opportunity where data in any format may be explored and utilised for breakthrough insights – whether that data is in-place, in-motion, or at-rest.” It sounds rather cruel to disturb data that is “at-rest”, presumably power-napping, but inflection points of opportunity wait for no man or megabyte.

Another platform capable of changing the game, albeit in an unhappily permanent manner, might be the Big Data skiing goggles marketed by the tech-shades manufacturer Oakley. Rather as the computerised spectacles known as Google Glass promise to do for the whole world, these $600 goggles project into your eyes all kinds of fascinating information about your skiing, including changes in speed and altitude, and can even display incoming messages from your mobile.

Of course, it might happen that while reading a titillating sext from a co-worker you ski at high speed into a tree. And so the goggles are sold with a splendidly self-defeating warning on the box: “Do not operate product while skiing.” Clearly, all the information all of the time is not always desirable. Still less so when Big Data’s tendrils move out from cool gadgets or website tools into our personal lives, the workplace and government – notwithstanding the Panglossian boosters of global datagasm.

What is “data” in the first place? The big lie of much Big Data publicity is that data is neutral, ambient information that can just be hoovered up; that it is already “given” – the meaning of the classical root of “data”. (Because the word is the Latin plural of datum, some people prefer to say “data are”. I am not one of them.)

But data is not simply collected; it is manufactured. There are always questions about how you choose what to measure, how you measure it, and how you analyse it afterwards; at each step, you make theoretical assumptions. A few years ago Chris Anderson, the former editor of Wired magazine, wrote an article claiming that big data meant “the end of theory” in science.               

But as Viktor Mayer-Schönberger and Kenneth Cukier point out in their useful recent book, Big Data: “Big data itself is founded on theory.” And once you’ve manufactured data with instruments that operate according to certain theories, you then need to analyse it theoretically. At the Large Hadron Collider, subatomic smashing generates a million gigabytes of data every second. Automated systems keep just a millionth of this for analysis (discarding the rest based on theories), but the bit-heap is still Brobdingnagian. And it needs to be analysed according to still other theories before scientists will understand what is going on. Until then, the data itself is just inscrutable numbers. Raw data is not knowledge. According to IBM, 90 per cent of the world’s extant data has been created in the past two years. Unless I missed something important, that is not because the human race has very rapidly become much wiser.
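That filtering step can be caricatured in code. In the sketch below everything is invented: the “events” are random numbers and the “trigger” a bare threshold, whereas the real thing applies layers of physics-derived criteria in custom hardware. But the shape of the operation, a theory-laden test that throws almost everything away, is the point.

```python
import random

# Toy "trigger": keep only events whose measured energy clears a
# theory-derived threshold. Events and threshold are invented;
# real triggers apply many layered physics criteria.
ENERGY_THRESHOLD = 99.9999  # set so roughly one event in a million survives

def trigger(event_energy: float) -> bool:
    return event_energy > ENERGY_THRESHOLD

events = (random.uniform(0.0, 100.0) for _ in range(10_000_000))
kept = sum(1 for e in events if trigger(e))
print(f"kept {kept} of 10,000,000 simulated events")  # expect about 10
```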

Nor can data always tell us why things are how they are. In Big Data, Mayer-Schönberger and Cukier adopt an optimistic meta-theoretical attitude on this point, arguing that increasing use of Big Data will wean us off our obsession with causation – finding out what causes what.

The authors argue that with data-sets that approach the entirety of the relevant information, rather than mere statistical samples, correlation – when two things are regularly associated with each other – will be king. When we can easily find out that one thing goes with another, we won’t worry too much about what is causing what.

This seems an odd claim to make when set beside, say, the history of research into the harmful effects of smoking. It was known for a long time that smoking was correlated with lung cancer – but then, so were many other things. Siddhartha Mukherjee pointed this out recently in the New York Times: “Asked about the strikingly concomitant increases in lung cancer and smoking rates in the 1930s, Evarts Graham, a surgeon, countered dismissively that ‘the sale of nylon stockings’ had also increased.” It took another few decades, and careful experimentation, for us to become sure that cigarettes were carcinogenic. Why mere correlations – even in large data-sets – should suddenly have become magical truth machines in the meantime is not clear.
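Graham’s retort is easy to reproduce. In the sketch below, with wholly invented figures, two quantities that merely trend in the same direction over a decade correlate perfectly, whatever the causal facts:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented yearly figures, both simply rising through the 1930s:
lung_cancer_rate = [10 + 2 * i for i in range(10)]  # rises steadily
stocking_sales = [5 + 3 * i for i in range(10)]     # also rises steadily

# ~1.0: perfectly correlated, causally unrelated
print(pearson(lung_cancer_rate, stocking_sales))
```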

The quarks and other colourful subatomic fauna at the Large Hadron Collider presumably don’t care all that much that vast quantities of data about them are being recorded, but human beings might be more nervous, and with good reason. Big Data also holds out the promise of, for instance, total supervision in the workplace. Lest perfect surveillance of employees sound alarming, this new field is given the blandly technocratic name of “workforce science”. Every phone call, email and even mouse-click of an employee can be stored and analysed to guide management in making decisions.

So “workforce science” is a scaled-up and automated version of the “scientific management” promoted by Frederick Winslow Taylor in his highly influential 1911 book, The Principles of Scientific Management, which recounted how he performed time-and-motion studies on labourers in order to get more work out of them. It has since been alleged that Taylor fiddled the data, but that didn’t stop him becoming an eponym: “taylorisation” is the breaking-down of some activity into discrete repetitive units, supposedly to improve efficiency. Big Data promises taylorisation on steroids.

Increasingly, workers are also forced to submit to “personality tests” so that their scores can be added to the swelling data file. Such tests are another example of what Mayer-Schönberger and Cukier call “datafication” – quantifying the previously unquantifiable – but their reliability is highly controversial. The Myers-Briggs personality type indicator, for instance, is derided by psychologists but widely employed in business.

At the same time, the UK government’s “nudge unit” has designed a psychometric test for jobseekers called My Strengths. Unfortunately, as bloggers have demonstrated, its results bear no relation to the answers given. In the age of Big Data, it begins to seem as though any kind of data, whether true or false, is better than no data at all.

We don’t choose to be targets of surveillance in the office, but we do when we use internet services such as Facebook. In assiduously entering our personal information, accepting “friends”, “checking in” to bars and restaurants and “liking” things, we have been inveigled into a digital taylorising of our social lives. Facebook’s billion users constitute a global underclass of volunteer labour in a giant programme of corporate welfare. Facebook owns this information and sells advertising against it. What we put in the “cloud” – a comfortingly fluffy name for a collection of gargantuan physical data centres, owned and operated by an oligopoly of corporations including Facebook, Google, Amazon, Microsoft and Apple – does not belong to us any more, and often has a worryingly long life. Apple keeps the questions you ask its voice-operated virtual assistant, Siri, for two years before deleting the data.

Through Big Data analysis, the “cloud” comes to know an awful lot about us. Simply analysing a person’s Facebook “likes” can reveal his or her sexual orientation or history of drug use. Even just searching for things and filling out online surveys can lead to personal information about you being bought and sold by big marketing-analytics companies. When the Big Data is data about you, privacy becomes a faint memory. And this is true not just on the web. The Data Privacy Lab at Harvard University recently managed to identify 40 per cent of individuals who had taken part (supposedly anonymously) in a large-scale DNA study, the Personal Genome Project.
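The mechanics behind such inferences are unremarkable. A minimal sketch, with invented users, pages and labels (the published analyses trained on millions of real profiles): represent each user as a row of binary “likes” and fit a standard classifier.

```python
# A minimal sketch of trait prediction from "likes": each user is a row of
# 1s and 0s over pages, and an off-the-shelf classifier is fitted to users
# whose trait is known. All users, pages and labels here are invented.
from sklearn.linear_model import LogisticRegression

# columns: four hypothetical pages; rows: users who did (1) or didn't (0) like them
likes = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 1],
]
trait = [1, 1, 0, 0]  # the private attribute being inferred (invented labels)

model = LogisticRegression().fit(likes, trait)
new_user = [[1, 0, 1, 1]]
print(model.predict_proba(new_user))  # estimated probability of the trait
```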

Depending on your taste, you might be more or less worried to find that Big Data about you is held by the state rather than a profit-seeking internet company. At least governments are supposed to be democratically accountable. Both David Cameron and Barack Obama announced “open data” initiatives on assuming office, though the types of data released were very carefully chosen. The British government put online information about all items of local-authority spending over £500, salaries of civil servants, and a big database of government spending. The Communities Secretary, Eric Pickles, announced that this move would “unleash an army of armchair auditors” among the public with fresh ideas for savings. Unfortunately, this citizen army never showed up; from the comfort of their armchairs, they presumably preferred to shoot one another repeatedly in the face on Call of Duty. Or perhaps, for all the rhetoric of transparency, people simply noticed that the government was inviting citizens to do its work for free.

Meanwhile, legislators were preparing to grant themselves new powers to trawl through our data. Nick Clegg promised to block the Communications Data Bill, aka the “snoopers’ charter”, which would require services including Facebook and Skype to record information about every citizen’s communications and to give the police access to those records on demand.

These days, a promise from Nick Clegg probably ranks quite low on the list of things that the British public will bank on, but his statement does indicate that he knows which way public sentiment on official surveillance is leaning. When such legislation is proposed, critics’ rhetoric about a “Big Brother state” is hardly overblown. Indeed, Mayer-Schönberger and Cukier go further, remarking tartly that the Stasi, too, were Big Data fanatics avant la lettre.

There is no reason in principle why Big Data cannot be used by the government (or anyone else) to serve the public good, as in Google Flu Trends. Although Big Data in medicine raises privacy issues (as with the UK government’s current plans for data-sharing across the NHS), it can also prevent needless deaths. Another positive development is the Big Data operation announced in the US in April by the Consumer Financial Protection Bureau. Banks and other financial-services companies already collect and pool colossal volumes of data about their customers; now the official watchdog will force them to supply the same data to its analysts, so that the regulator can have detailed oversight of lenders’ behaviour. In this instance, amusingly, it is anti-regulation Republicans who are crying “Big Brother”. They don’t mind the credit companies holding and sharing such data, but heaven forbid that a body tasked with protecting consumers should get its hands on it, too.

A subtler problem with Big Data, though, is that it might lead us to downgrade what can’t easily be measured. When you have a hammer, everything starts to look like a nail. When you have such vast quantities of information, you might care only about what is quantifiable. Stuff that can’t easily be turned into a forest of numbers for crunching might get sidelined.

Big-data analysis in the humanities is called “culturomics”. Its findings can be very interesting. Who would have guessed that, as a Harvard study cited by Mayer-Schönberger and Cukier found, fewer than half of the English words that appear in books are included in dictionaries, the rest being lexical “dark matter”? Another recent study by researchers at Bristol, Durham, Sheffield and Stockholm analysed the appearance of what it calls “mood words” in a large number of 20th-century English-language novels. (The “mood words” are those semantically associated with six mood categories: anger, disgust, fear, joy, sadness and surprise.) The study concluded that “American English has become decidedly more ‘emotional’ than British English in the last half-century”. The scare quotes around “emotional” may be taken as an acknowledgment of the limits of such a data-driven approach. After all, prose can be “emotional” in more ways than one, and there are many “moods” besides the six named here.
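In code, the method amounts to little more than word-counting. A toy version, with tiny invented word lists standing in for the study’s lexicons:

```python
# A toy version of the "mood words" count: tally words from each category
# and normalise by text length. The word lists here are tiny invented
# stand-ins for the study's lexicons.
MOOD_WORDS = {
    "anger":    {"rage", "fury", "angry"},
    "disgust":  {"vile", "loathsome", "disgust"},
    "fear":     {"terror", "dread", "afraid"},
    "joy":      {"delight", "happy", "joy"},
    "sadness":  {"grief", "sorrow", "weep"},
    "surprise": {"astonished", "sudden", "amazed"},
}

def mood_profile(text: str) -> dict:
    words = [w.strip(".,;:!?\"'") for w in text.lower().split()]
    total = len(words) or 1
    return {mood: sum(w in vocab for w in words) / total
            for mood, vocab in MOOD_WORDS.items()}

sample = "She wept with grief, then sudden delight: happy, astonished, afraid."
print(mood_profile(sample))
```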

No doubt counting words in books is, for such researchers, an amusing activity that harms no one and might offer fascinating results. But when just the same kind of analysis, only on a larger scale and tied to more personal information, is done by a corporate giant such as Amazon, we might feel more squeamish. Thanks to its Kindle devices and apps, Amazon is sitting on a prodigious quantity of data about not only what people read, but which passages they highlight and at what point in a book they are most likely to give up reading. In this titanic datachest lurks an intriguing potential business plan for Amazon, if not good news for the culture at large, as Evgeny Morozov notes in his latest book, To Save Everything, Click Here. One day, Amazon might be able to build a system that uses this aggregated mountain of reading data to write new books automatically – books that readers are statistically guaranteed to like. At that point, will writers and readers the world over shrug and admit that the data knows best?
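The kind of question such a datachest answers is easy to imagine. A toy sketch, with invented figures: given the furthest point each reader reached, where in a book do most people give up?

```python
from collections import Counter

# Furthest fraction of the book each (invented) reader reached; 1.0 = finished.
furthest = [0.10, 0.15, 0.12, 0.55, 0.11, 0.95, 1.0, 0.13, 0.60, 1.0]

abandoned = [p for p in furthest if p < 1.0]        # readers who gave up
by_decile = Counter(int(p * 10) for p in abandoned)  # bucket by tenth of the book
decile, count = by_decile.most_common(1)[0]
print(f"most abandonments fall in decile {decile}: {count} of {len(furthest)} readers")
```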

The great Victorian scientist Lord Kelvin once said: “When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science . . .”

This is a fine nostrum for mathematical physics and other sciences, and could serve as a slogan (of much more dubious aptness) for the Big Data hype today in culture, commerce and politics. It might also be the reasoning behind one of the weirdest modern tech subcultures, the Quantified Self movement. Using wearable gadgets and smartphone apps, Selfers collect information about every measurable aspect of their daily lives, including food intake, numbers of footsteps walked, variations in heart rate, emails written and received, and even evacuations of waste matter into porcelain receptacles. This is perhaps best understood as a kind of hoarding; the piling-up of terabytes of such stuff seems a modern kind of monstrous egotism. One might look indulgently on the Selfers as harmless eccentrics, but even they signal the wider concerns we should have about Big Data. If your friend is a Selfer who records all his social interactions with a webcam strapped to his head (yes, some people do this) then he is going to be storing video footage of you. And where exactly is he putting it? And who else can get at it? The central question of Big Data, as its invisible bit-storm blows through every aspect of our lives, is going to be who owns it and controls access. Big Data is big power.

At least it is potentially. More heartening, in the context of such dystopian worries, is a recent story of human beings’ loveable fallibility in the face of their own complex tools. Willy Brandt International Airport in Berlin, Germany, is a dazzlingly lit ghost town. It was supposed to open in 2011; in January this year it was announced that it still won’t be ready by the latest target of October 2013. So why keep the lights on? Because officials can’t figure out how to turn them off. The data-gobbling computer that runs the airport’s systems is so complicated that no one knows how to use it. Sometimes, apparently, Big Data really is too big.

Steven Poole’s latest book is “You Aren’t What You Eat” (Union Books, £7.99)