Predicting the text in redacted documents is close to reality

Releasing sensitive information covered in big black bars has kept secrets safe for years, but perhaps not for much longer.

For those with secrets they want to keep, redacting documents is a pretty important thing to get right. First, it’s necessary to understand how to redact documents properly: look to Southwark Council, which in February uploaded its controversial agreement with developer Lend Lease for the regeneration of the Heygate Estate in a form that let people copy and paste the text underneath the black bars.

But it’s also necessary to know which parts of a document to redact, so that the context from the material left visible doesn’t give the game away. There is always, however, information left behind. The choices made in how to block text (with other bits of paper, with black marker pen, or even by typing out new words and then covering those up) can reveal something about the person doing the redacting. Different agencies had different redaction standards at different times, which gives a further clue as to which technique is needed. And in any given typeface, only a limited number of contextually relevant words will fit into the space under a bar.
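To make the typeface point concrete, here is a minimal sketch in Python of how the width of a bar can narrow down the candidates. It is an illustration only, not the Declassification Engine's method: the character widths in `CHAR_WIDTHS`, the candidate list, and the function names are all hypothetical, and real widths would have to come from the metrics of the actual font used in the document.

```python
# Hypothetical per-character widths (arbitrary units) for a proportional
# typeface. These numbers are invented for illustration; real values would
# be read from the font actually used in the document.
CHAR_WIDTHS = {
    "a": 6, "b": 6, "i": 3, "k": 5, "l": 3, "n": 6,
    "q": 6, "r": 4, "s": 5, "t": 4, "u": 6, "w": 9, "y": 5,
}
DEFAULT_WIDTH = 6  # assumed fallback for characters missing from the table


def rendered_width(word):
    """Return the total width of a word under the hypothetical width table."""
    return sum(CHAR_WIDTHS.get(ch, DEFAULT_WIDTH) for ch in word.lower())


def candidates_that_fit(candidates, bar_width, tolerance=1):
    """Keep only the candidate words whose width is within tolerance of the bar."""
    return [
        word for word in candidates
        if abs(rendered_width(word) - bar_width) <= tolerance
    ]


if __name__ == "__main__":
    countries = ["Iraq", "Iran", "Syria", "Kuwait", "Libya"]
    # A bar 19 units wide rules out three of the five candidates.
    print(candidates_that_fit(countries, bar_width=19))  # ['Iraq', 'Iran']
```

In this toy example the bar width alone cannot separate "Iraq" from "Iran"; that is where the surrounding context has to do the rest of the work.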

In the New Yorker, William Brennan reports on the Declassification Engine, an intriguing attempt by a group of academics to use these clues to try to crack any redacted text. A snippet:

Together with a group of historians, computer scientists, and statisticians, [Columbia history professor Matthew] Connelly is developing an ambitious project called the Declassification Engine, which, among other things, employs machine-learning and natural language processing to study the semantic patterns in declassified text. The project’s goals range from compiling the largest digitized archive of declassified documents in the world to plotting the declassified geographical metadata of over a million State Department cables on an interactive global map, which the researchers hope will afford them new insight into the workings of government secrecy. Though the Declassification Engine is in its early stages, Connelly told me that the project has “gotten to the point where we can see it might be possible to predict content of redacted text. But we haven’t yet made a decision as to whether we want to do that or not.”

One thing that jumps out here is the parallel between the "mosaic theory" (the idea that "pieces of banal, declassified information, when pieced together, might provide a knowledgeable reader with enough emergent detail to uncover the information that remains classified") and the argument made by critics of the NSA, who realise that the mass collection of metadata, rather than the actual content of communications, is in many ways just as bad.

Redacted Iraq War info at a 2004 US Senate press conference (Photo: Getty)

Ian Steadman is a staff science and technology writer at the New Statesman. He is on Twitter as @iansteadman.

