No, Facebook political ads didn’t “cause Brexit” – but that doesn’t mean there is no scandal

Everything you need to know about how Cambridge Analytica used a quiz to harvest data from 50m Facebook users.

For a company which somehow managed to amass roughly one in four of the world’s population among its regular users without attracting serious scrutiny, Facebook suddenly cannot catch a break.

Following a series of bombshell reports in the Observer, New York Times and Channel 4 News, Facebook is once again on the ropes, this time over an apparent data breach affecting 50 million users, connected to Cambridge Analytica – the company accused of using political advertising to convince voters of the merits of Brexit and Donald Trump.

The three outlets have done solid reporting, boosted by the presence of an articulate, on-record, on-camera former employee sporting a shock of pink hair – but, as ever with in-depth tech reporting in the modern era, the story is widely misunderstood and has been prompting all the wrong questions.

What actually happened

At the core of most headlines around this story lies the allegation that Cambridge Analytica harvested the data of around 50 million users, through the use of an online quiz.

This is roughly how it worked: some Cambridge academics designed a quiz which could use your public Facebook likes and other information to predict your personality, based on the widely used OCEAN scale.

Cambridge Analytica, seeing potential in this quiz, started a subsidiary with one of those academics, which then promoted a new version of the quiz designed to harvest data. This quiz was heavily promoted to US users through advertising, and through paying US users to take the quiz.

It’s believed a total of around 270,000 people did some version of this quiz. However, when those users gave Facebook access to their accounts, they were also granting access to the public profile information of all of their Facebook friends – so if your Likes, relationship status, location and similar were set to “viewable by anyone”, they were open to be harvested.

The average Facebook user has more than 300 friends, so it’s not hard to see how this sample of 270,000 users taking the quiz amassed a dataset of 50 million people. This information is at the centre of the story.
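As a rough sanity check on those numbers, here is a minimal back-of-envelope sketch. It assumes, purely for illustration, that every quiz-taker had exactly 300 friends; the function names are invented for this example and the real distribution of friend counts is not given in the reporting.

```python
# Back-of-envelope check of the figures quoted above.
# Assumption: every quiz-taker has exactly AVG_FRIENDS friends (a simplification).
QUIZ_TAKERS = 270_000       # people believed to have taken some version of the quiz
AVG_FRIENDS = 300           # the article says "more than 300", so this is a floor
DATASET_SIZE = 50_000_000   # profiles reportedly harvested


def upper_bound_profiles(takers: int, avg_friends: int) -> int:
    """Friend profiles reachable if no two quiz-takers shared a single friend."""
    return takers * avg_friends


def unique_profiles_per_taker(dataset: int, takers: int) -> float:
    """Distinct profiles each quiz-taker contributed, on average."""
    return dataset / takers


if __name__ == "__main__":
    print(upper_bound_profiles(QUIZ_TAKERS, AVG_FRIENDS))
    print(round(unique_profiles_per_taker(DATASET_SIZE, QUIZ_TAKERS)))
```

Under that assumption the ceiling is 81 million friend profiles, so a dataset of 50 million is plausible once overlapping friendships and non-public profiles are taken into account: it works out at roughly 185 distinct profiles per quiz-taker.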

Was this against the rules?

This bit gets a little bit complex – but not overly so. Facebook has a set of tools that developers can use to build Facebook features – like open login, or social sharing, or more – into their websites or apps. This is known as an “API”, and at the time the data-harvesting quiz was in operation Facebook’s API allowed for this kind of information to be harvested.

This was controversial at the time, and provoked a privacy backlash – even though the app or website would tell users, as they gave authorisation, what information it could access, many were (correctly) worried that people didn’t properly read those notices.

Partly in response to these concerns, and partly owing to the fact Facebook hadn’t anticipated people using the API in quite this way to harvest data, an update across 2014/2015 removed this functionality – collecting data in the way this app did has been impossible for more than two years.

Eagle-eyed users of Mechanical Turk – the Amazon-owned service Cambridge Analytica used to recruit test-takers – noticed from the privacy notices that the test-takers were being used to collect data on their Facebook friends, and raised concerns on public forums at the time: this was done in plain sight.

That means that to call this a “hack” or a “breach” would be to extend either term to the point of meaninglessness (obviously that hasn’t stopped lots of outlets doing just that). But just because it lined up with Facebook rules doesn’t mean it’s OK.

The companies can try to say this was at the time possibly within Terms of Service – more on this later – and only took data set to “public”, but that doesn’t mean it didn’t violate users’ reasonable expectations of privacy and responsible data handling. And despite this tool being targeted at US Facebook users, given the international nature of many people’s friendships, there will inevitably be UK and EU citizens within the database – and it’s not clear this would be compliant with EU data rules.

So who knew what, when?

This is the big and important question – and one outlets everywhere have managed to get angry MPs, regulators, and congressmen to ask: when did Facebook know about this data harvesting, and when could politicians have found out about it?

It’s also a very, very easy question to answer: Facebook knew about this in 2015, if for no other reason than a journalist asked them about it and then published their answer – along with all of the other information above.

The actual fact of the data harvesting by Cambridge Analytica was reported in an extensive Guardian investigation in 2015 (which was duly credited in the Observer reporting this week), and has been available ever since to any member of the public, Facebook staff, or any politician around the world.

As occasionally happens with the world’s media (and politicians), everyone has homed in on a question which, if they’d followed the story – or done a good search of newspaper cuttings – they could have answered years ago.

The new reporting does reveal some significant new facts for the timeline, though. Between Facebook’s statement and the new articles, it is clear Facebook wrote to Cambridge Analytica in 2016 and said it was their view that the company’s app had violated Facebook rules, and demanded they send back a certified statement saying they had deleted all information harvested in this way. The company did so.

However, sources speaking to the New York Times and Observer appear to have – carefully and cautiously – contradicted that statement, saying they thought it was possible or likely some copies of the data, which they said had been poorly handled and sent without encryption, could still exist.

It’s this allegation which prompted the new action from Facebook of (temporarily, at least for now) suspending Cambridge Analytica from its services, until it has received assurances and evidence all this data was deleted. This is also Facebook’s explanation for suspending the on-record source of the stories from their services – he did, after all, admit to being behind a lot of this data use.

How powerful is all of this data?

It’s been said in some more breathless quarters of the internet that this is the “data breach” that could have “caused Brexit”. Given it was a US-focused bit of harvesting, that would be the most astonishing piece of political advertising success in history – especially as among the big players in the political and broader online advertising world, Cambridge Analytica are not well regarded: some of the people who are best at this regard them as little more than “snake oil salesmen”.

One of the key things this kind of data would be useful for – and what the original academic study it came from looked into – is finding what Facebook Likes correlate with personality traits, or other Facebook likes.

The dream scenario for this would be to find that every woman in your sample who liked “The Republican Party” also liked “Chick-Fil-A”, “Taylor Swift” and “Nascar racing”. That way, you could target ads at people who liked the latter three – but not the former – knowing you had a good chance of reaching people likely to appreciate the message you’ve got. This is a pretty widely used, but crude, bit of Facebook advertising.
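To make the idea concrete, here is a minimal sketch of that kind of co-occurrence analysis. The profiles are invented toy data, the function name `proxy_likes` is hypothetical, and nothing here is claimed to be Cambridge Analytica’s actual method: it simply measures, for each page, what share of its fans also like a target page.

```python
from collections import Counter


def proxy_likes(profiles, target, min_share=0.6):
    """Return pages whose fans mostly also like `target`.

    profiles: list of sets of page names, one set per user.
    share    = (users liking both the page and target) / (users liking the page)
    """
    page_fans = Counter()   # how many users like each page
    both = Counter()        # how many of those also like the target
    for likes in profiles:
        for page in likes:
            if page == target:
                continue
            page_fans[page] += 1
            if target in likes:
                both[page] += 1
    shares = {page: both[page] / n for page, n in page_fans.items()}
    return {page: s for page, s in shares.items() if s >= min_share}


# Invented toy sample, for illustration only.
sample = [
    {"The Republican Party", "Chick-Fil-A", "Nascar racing"},
    {"The Republican Party", "Chick-Fil-A", "Taylor Swift"},
    {"Chick-Fil-A", "Nascar racing"},
    {"Taylor Swift"},
    {"The Republican Party", "Nascar racing", "Taylor Swift"},
    {"Taylor Swift"},
]

if __name__ == "__main__":
    print(proxy_likes(sample, "The Republican Party"))
```

In this toy sample two thirds of the Chick-Fil-A and Nascar fans also like the target page, while only half of the Taylor Swift fans do, so only the first two clear the threshold. An advertiser could then aim at fans of those proxy pages who have not liked the target page itself.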

When people talk about it being possible Cambridge Analytica used this information to build algorithms which could still be useful after all the original data was deleted, this is what they’re talking about – and that’s possible, but it misses a much, much bigger bit of the picture.

So, everything’s OK then?

No. Look at it this way: the data we’re all getting excited about here is a sample of public profile information from 50 million users, harvested from 270,000 people.

Facebook itself, daily, has access to all of that public information, and much more, from a sample of two billion people – a sample around 40 times larger than the Cambridge Analytica one, and one much deeper and richer thanks to its real-time updating status.

If Facebook wants to offer sales based on correlations, for advertisers looking for an audience open to their message, its data would be infinitely more powerful and useful than a small (in big data terms) four-year-out-of-date bit of Cambridge Analytica data.

Facebook aren’t anywhere near alone in this world: every day your personal information is bought and sold, bundled and retraded. You won’t know the names of the brands, but the actual giants in this industry don’t deal in tens of millions of records, they deal in hundreds of millions, or even billions – one advert I saw today referred to a company which claimed real-world identification of 340 million people.

This is how lots of real advertising targeting works: people can buy up databases of thousands or millions of users, from all sorts of sources, and turn them into the ultimate custom audience – match the IDs of these people and show them this advert. Or they can do the tricks Cambridge Analytica did, but refined and with much more data behind them (there’s never been much evidence Cambridge Analytica’s model worked very well, despite their sales pitch boasts).
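The “match the IDs” step can be sketched in a few lines. This is an illustration only: the email addresses and user IDs are invented, the function names are hypothetical, and hashing identifiers before comparing them is an assumption of this sketch rather than something the reporting describes.

```python
import hashlib


def normalise_and_hash(email: str) -> str:
    """Lower-case and trim an email address, then hash it with SHA-256."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def match_audience(purchased_emails, platform_index):
    """Return platform user IDs whose hashed email appears in the purchased list.

    platform_index: dict mapping hashed email -> platform user ID (hypothetical).
    """
    hashes = {normalise_and_hash(e) for e in purchased_emails}
    return sorted(platform_index[h] for h in hashes if h in platform_index)


# Invented example data.
platform_users = {"alice@example.com": 101, "bob@example.com": 102, "carol@example.com": 103}
platform_index = {normalise_and_hash(e): uid for e, uid in platform_users.items()}
purchased = ["  Alice@Example.com", "carol@example.com", "dave@example.com"]

if __name__ == "__main__":
    print(match_audience(purchased, platform_index))
```

Anyone in the purchased list who also has an account ends up in the audience (here, users 101 and 103); anyone who does not is simply dropped.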

The media has a model for reporting on “hacks” or “breaches” – and for reporting on when companies in the spotlight have given evidence to public authorities – and most places have been following those well-trod routes.

But doing so is like doing forensics on the burning of a twig, in the middle of a raging forest fire. You might get some answers – but they’ll do you no good. We need to think bigger.