Predicting the text in redacted documents is close to reality

Releasing delicate information with big black bars all over it has kept secrets safe for years - but not for much longer, maybe.

For those with secrets they want to keep, redacting documents is a pretty important thing to get right. First, it’s necessary to understand how to redact documents - look to Southwark Council, which in February uploaded its controversial agreement with developer Lend Lease for the regeneration of the Heygate Estate in a form that let people copy and paste the text underneath the black bars.

But it’s also necessary to know which parts of a document to redact so that the context from the stuff left open doesn’t give the game away. There is always, however, information left behind. The choices made in how to block text - be it with other bits of paper, or black marker pen, or even by typing out new words and then covering those up - can reveal something about the person doing the redacting. Different agencies had different redaction standards at different times, which gives a further clue as to what technique is needed. Each typeface fits into the space under a bar in a limited number of contextually-relevant ways, as well.
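To make the typeface point concrete, here is a minimal sketch of how the width of a black bar alone can shrink a list of guesses. It is an illustration only, not the Declassification Engine's method: the character widths belong to an imaginary proportional font, and the names `CHAR_WIDTHS`, `text_width` and `candidates_that_fit` are hypothetical. Real figures would have to come from the metrics of the typeface actually used in the document.

```python
# Hypothetical advance widths (in points) for an imaginary proportional
# typeface. These numbers are assumptions made up for illustration.
CHAR_WIDTHS = {
    "i": 3.0, "l": 3.0, "t": 3.0, "j": 3.0, "f": 3.0,  # narrow letters
    "r": 4.0,
    "m": 9.0, "w": 9.0,                                # wide letters
    " ": 3.0,
}
DEFAULT_WIDTH = 6.0  # assumed width for every character not listed above


def text_width(text):
    """Return the rendered width of `text` under the assumed font metrics."""
    return sum(CHAR_WIDTHS.get(ch, DEFAULT_WIDTH) for ch in text.lower())


def candidates_that_fit(candidates, bar_width, tolerance=1.0):
    """Keep only the candidates whose width is within `tolerance` of the bar."""
    return [
        word for word in candidates
        if abs(text_width(word) - bar_width) <= tolerance
    ]


if __name__ == "__main__":
    guesses = ["Iran", "Iraq", "Libya", "Vietnam"]
    # A bar 19 points wide rules out the longer names but cannot
    # separate the two that happen to have the same width.
    print(candidates_that_fit(guesses, bar_width=19.0))
```

With these made-up widths, a 19-point bar leaves two candidates of identical width standing, which is why the surrounding context still has to do the rest of the work.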

In the New Yorker, William Brennan reports on The Declassification Engine, an intriguing attempt by a group of academics to use these clues to try and crack any redacted text. A snippet:

Together with a group of historians, computer scientists, and statisticians, [Columbia history professor Matthew] Connelly is developing an ambitious project called the Declassification Engine, which, among other things, employs machine-learning and natural language processing to study the semantic patterns in declassified text. The project’s goals range from compiling the largest digitized archive of declassified documents in the world to plotting the declassified geographical metadata of over a million State Department cables on an interactive global map, which the researchers hope will afford them new insight into the workings of government secrecy. Though the Declassification Engine is in its early stages, Connelly told me that the project has “gotten to the point where we can see it might be possible to predict content of redacted text. But we haven’t yet made a decision as to whether we want to do that or not.”

One of the things that jumps out here is the parallel between the "mosaic theory" - where "pieces of banal, declassified information, when pieced together, might provide a knowledgeable reader with enough emergent detail to uncover the information that remains classified" - and the argument made by critics of the NSA, who realise that mass collection of metadata, rather than the actual content of communications, is in many ways just as bad.

Redacted Iraq War info at a 2004 US Senate press conference (Photo: Getty)

Ian Steadman is a staff science and technology writer at the New Statesman. He is on Twitter as @iansteadman.


You are living in a Black Mirror episode and you don’t care

The Investigatory Powers Bill is likely to become law later this year, but barely anyone is resisting the dystopian surveillance it will bring.

“They’re all about the way we live now – and the way we might be living in 10 minutes’ time if we're clumsy,” explained Charlie Brooker when asked to describe the concept behind his science fiction series Black Mirror. When series three was released on Netflix last week, this sentiment was reiterated over and over. “Omg, it’s just like Instagram!!!!” squealed individuals in their masses after watching episode one, “Nosedive”, set in a world where everyone rates one another out of five after their interactions. The parallel with social media is easy, obvious, and intentional, but it doesn’t teach us much. The real ways in which our world is like a dystopian sci-fi are, in fact, much more boring.

There will be no suspenseful songs or dramatic jump cuts preluding the third reading of the Investigatory Powers Bill in the House of Lords next week. The “snoopers’ charter” is likely to become law after it passed through its House of Commons readings with a few amendments, with 444 MPs voting in favour and 69 against. In short, the Bill will give the government unprecedented surveillance powers, allowing them to intercept and collect your communications, collect a list of the websites you visit and search it without a warrant, and force your internet service provider to help them collect your data.

Even though this is highly comparable to the dark visions of the future offered by Black Mirror, no one cares. Though the Bill faced initial resistance when it was announced in 2015, it has passed through its readings relatively unscathed. Black Mirror should provide a prime opportunity to discuss issues around privacy, but people prefer to compare dystopias to things they already hate. Lord help us all if we take selfies or stare at a device which is simultaneously an encyclopaedia, a newspaper, a book, a map, a bank, a radio, a camera and a telephone for more than ten minutes.

Yet the Investigatory Powers Bill does hold many parallels to the last episode of Black Mirror series three, “Hated in the Nation”. In it, the government uses autonomous drones shaped like bees to spy on its people, and the drones are then hacked to murder hated public figures. “Ok! The government’s a c**t, we knew that already,” says DCI Karin Parke, moving on to the real issue – not that the government spies on its citizens, but that the spying device can be hacked by those naughty, naughty citizens themselves.

The hacker – Garrett Scholes – has programmed the bees to kill whoever gets the most votes on Twitter via the hashtag #DeathTo. Then, in a Jon-Ronson-worthy twist, he sets the bees on the people who used the hashtag in the first place. The actual, moral, wake-up-sheeple message of “Hated in the Nation”, then, is that we should be careful who we wish death upon on social media. But it is precisely this freedom that we should be protecting. Under the Investigatory Powers Bill, your emails and search history could be used to argue that you really want to kill Katie Hopkins, rather than that you were just blowing off steam.

Yet it’s hard to blame anyone for ignoring the Bill, which is off-putting not because it’s not an episode of Black Mirror, but because it is long and confusing. Breaking through the terminology is hard, even in the handy fact sheets provided, and the government can claim transparency while using alienating language and concepts.

“Some of the powers in the Bill are deeply intrusive, and with very little possible justification,” warned former MP Dr Julian Huppert last week, “the cost to all of our privacy is huge.” The good news is that you don’t have to worry about metal bees spying on you, and the bad news is that this is because the government will soon have permission to do it the easy way.



Amelia Tait is a technology and digital culture writer at the New Statesman.