Predicting the text in redacted documents is close to reality

Releasing sensitive information with big black bars all over it has kept secrets safe for years - but maybe not for much longer.

For those with secrets they want to keep, redacting documents is a pretty important thing to get right. Firstly, it’s necessary to understand how redaction actually works - look to Southwark Council, which in February uploaded its controversial agreement with developer Lend Lease for the regeneration of the Heygate Estate in a form that let people copy and paste the text underneath the black bars.

But it’s also necessary to know which parts of a document to redact, so that the context left in the open doesn’t give the game away. There is always, however, information left behind. The choices made in how to block out text - be it with other bits of paper, with black marker pen, or even by typing out new words and then covering those up - can reveal something about the person doing the redacting. Different agencies had different redaction standards at different times, which gives a further clue as to which technique was used and what it will take to defeat it. And for a given typeface, only a limited number of contextually relevant words will fit into the space under each bar.
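To make that last point concrete, here is a minimal sketch of how width-matching might work. It is not the method of any real project; the character widths, the bar measurement and the candidate words are all invented for the example. The point is simply that a bar of a known width, in a known typeface, rules most words out.

```python
# Toy illustration only: given the measured width of a redaction bar and
# per-character widths for the document's typeface, keep the candidate words
# that would physically fit under it. The widths, tolerance and candidates
# below are invented for the example.

CHAR_WIDTHS = {  # hypothetical glyph widths, in points
    "a": 7.2, "b": 7.4, "c": 6.8, "e": 7.0, "g": 7.4, "i": 3.2, "l": 3.2,
    "m": 11.0, "n": 7.4, "o": 7.2, "r": 4.8, "s": 6.4, "t": 4.4, "u": 7.4,
}

def rendered_width(word: str) -> float:
    """Approximate the width of a word in this (made-up) typeface."""
    return sum(CHAR_WIDTHS.get(ch, 7.0) for ch in word.lower())

def words_that_fit(candidates, bar_width, tolerance=2.0):
    """Return the candidates whose rendered width is close to the bar's width."""
    return [w for w in candidates if abs(rendered_width(w) - bar_width) <= tolerance]

# Suppose context suggests a country name and the bar measures roughly 30 points:
print(words_that_fit(["cuba", "laos", "iran", "guatemala"], bar_width=30.0))
# -> ['cuba']  (the others are too narrow or too wide in this invented font)
```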

In the New Yorker, William Brennan reports on The Declassification Engine, an intriguing attempt by a group of academics to use these clues to try to crack any redacted text. A snippet:

Together with a group of historians, computer scientists, and statisticians, [Columbia history professor Matthew] Connelly is developing an ambitious project called the Declassification Engine, which, among other things, employs machine-learning and natural language processing to study the semantic patterns in declassified text. The project’s goals range from compiling the largest digitized archive of declassified documents in the world to plotting the declassified geographical metadata of over a million State Department cables on an interactive global map, which the researchers hope will afford them new insight into the workings of government secrecy. Though the Declassification Engine is in its early stages, Connelly told me that the project has “gotten to the point where we can see it might be possible to predict content of redacted text. But we haven’t yet made a decision as to whether we want to do that or not.”

One of the things that jumps out here is the parallel between the “mosaic theory” - the idea that “pieces of banal, declassified information, when pieced together, might provide a knowledgeable reader with enough emergent detail to uncover the information that remains classified” - and the argument of critics of the NSA, who realise that mass collection of the metadata of communications, rather than the actual data, is in many ways just as bad.
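To get a rough feel for what “studying the semantic patterns in declassified text” could look like in practice, here is a deliberately crude sketch that ranks candidate fill-ins for a single redacted word using word-context counts from a tiny invented corpus. It is a toy illustration, not the Declassification Engine’s actual approach, which Brennan’s piece does not describe in technical detail.

```python
# A deliberately crude sketch of guessing one redacted token from its context,
# in the spirit of (but far simpler than) the NLP approach described above.
# The "declassified corpus" below is invented; a real system would use vastly
# more text and a much richer statistical model.
from collections import Counter, defaultdict

corpus = [
    "the ambassador met the minister in secret",
    "the ambassador met the president in secret",
    "the attache met the president last week",
]

# Count how often each word appears between a given (previous word, next word) pair.
context_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev_tok, tok, next_tok in zip(tokens, tokens[1:], tokens[2:]):
        context_counts[(prev_tok, next_tok)][tok] += 1

def guess_redacted(prev_tok, next_tok):
    """Rank candidate fill-ins for a redacted token between two visible words."""
    return context_counts[(prev_tok, next_tok)].most_common()

# "the ambassador met the [REDACTED] in secret" -> guesses ranked by frequency
print(guess_redacted("the", "in"))
# -> [('minister', 1), ('president', 1)] with this tiny corpus
```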

Redacted Iraq War info at a 2004 US Senate press conference (Photo: Getty)

Ian Steadman is a staff science and technology writer at the New Statesman. He is on Twitter as @iansteadman.


7 problems with the Snooper’s Charter, according to the experts

In short: it was written by people who "do not know how the internet works".

A group of representatives from the UK’s Internet Services Providers’ Association (ISPA) headed to the Home Office on Tuesday to point out a long list of problems they had with the proposed Investigatory Powers Bill (that’s the Snooper’s Charter to you and me). Below are simplified summaries of their main points, taken from the written evidence submitted to the department after the meeting by Adrian Kennard of Andrews and Arnold, a small ISP.

The crucial thing to note is that these people know what they're talking about - they run the providers that would need to completely change their practices to comply with the bill if it passed into law. And their objections aren't based on cost or fiddliness - they're about how unworkable many of the bill's stipulations actually are.

1. The types of records the government wants collected aren’t that useful

The IP Bill places a lot of emphasis on “Internet Connection Records”: that is, a list of the domains you’ve visited, but not the specific pages viewed or messages sent.

But in an age of apps and social media, where we view vast amounts of information through single domains like Twitter or Facebook, this information might not even help investigators much, as connections can last for days, or even months. Kennard gives the example of a missing girl, used as a hypothetical case by the security services to argue for greater powers:

 "If the mobile provider was even able to tell that she had used twitter at all (which is not as easy as it sounds), it would show that the phone had been connected to twitter 24 hours a day, and probably Facebook as well… this emotive example is seriously flawed”

And these connection records are only going to get less relevant over time - an increasing number of websites, including Facebook and Google, now encrypt their traffic with “https”, which makes finding even the name of the website visited far more difficult.
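For a sense of why such records tell investigators so little, here is a rough sketch of the kind of entry an Internet Connection Record might reduce to for an always-on app. The record format and the timestamps are guesses for illustration, not the bill’s definition.

```python
# A rough sketch of why an Internet Connection Record for an always-on app is
# so uninformative. The record format is a guess, not the bill's definition,
# and the timestamps are invented.
from datetime import datetime

# What investigators would want: which pages were viewed, which messages sent.
# What an ICR would roughly hold: "this device talked to this service, between
# these times" - and for apps that stay connected, that is nearly continuous.
icr_log = [
    {"service": "twitter.com", "start": datetime(2015, 12, 1, 0, 0),
     "end": datetime(2015, 12, 4, 23, 59)},
    {"service": "facebook.com", "start": datetime(2015, 12, 1, 0, 0),
     "end": datetime(2015, 12, 4, 23, 59)},
]

for record in icr_log:
    duration = record["end"] - record["start"]
    print(f"{record['service']}: connected for {duration} "
          "(no pages, messages or accounts recorded)")
```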

2. …but they’re still a massive invasion of privacy

Even though these records may be useless when someone needs to be found or monitored, the retention of Internet Connection Records (ICRs) is still very invasive – and can actually yield more information than call records, which Theresa May has repeatedly claimed are the non-digital equivalent of ICRs.

Kennard notes: “[These records] can be used to profile them and identify preferences, political views, sexual orientation, spending habits and much more. It is useful to criminals as it would easily confirm the bank used, and the time people leave the house, and so on”. 

This information might not help find a missing girl, but could build a profile of her which could be used by criminals, or for over-invasive state surveillance. 
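As a sketch of the kind of inference Kennard is warning about, here is a toy example of profiling someone from nothing more than domain names and timestamps. The connection log, the list of bank domains and the inferences drawn are all invented for illustration.

```python
# A toy example of the profiling Kennard describes: bare domain names plus
# timestamps already leak habits. The connection log, the list of bank domains
# and the inferences drawn are all invented for illustration.
from datetime import datetime

connection_log = [
    ("barclays.co.uk", datetime(2015, 12, 1, 7, 55)),
    ("bbc.co.uk",      datetime(2015, 12, 1, 8, 20)),
    ("barclays.co.uk", datetime(2015, 12, 2, 7, 58)),
]

BANK_DOMAINS = {"barclays.co.uk", "hsbc.co.uk", "natwest.com"}

# Which bank does this person use?
banks_used = {domain for domain, _ in connection_log if domain in BANK_DOMAINS}

# When does activity start each morning? (A crude stand-in for "when they get
# up" or "when they leave the house".)
first_activity = {}
for domain, ts in connection_log:
    day = ts.date()
    if day not in first_activity or ts < first_activity[day]:
        first_activity[day] = ts

print("Likely bank:", banks_used)
print("First activity each day:",
      {str(d): t.strftime("%H:%M") for d, t in first_activity.items()})
```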

3. "Internet Connection Records" aren’t actually a thing

The “Internet Connection Record” referred to in the bill - a list of the domain names a user has visited - is actually a new term, derived from the telephony concept of a “Call Data Record”. Compiling such records is possible, but it won’t be an easy or automatic process.

Again, this strongly implies that those writing the bill are drawing on their knowledge of telephone surveillance rather than on an understanding of how the internet actually works. Kennard calls for the term to be removed, or at least for its “vague and nondescript nature” to be made clear in the bill.

4. The surveillance won’t be consistent and could be easy to dodge

In its meeting with the ISPA, the Home Office implied that smaller internet service providers won't be forced to collect these ICRs, as doing so would use up a lot of their resources. But this means those seeking to avoid surveillance could simply move to a smaller provider.

5. Conservative spin is dictating the way we view the bill 

May and the Home Office are keen for us to see the surveillance in the bill as passive: internet service providers must simply log the domains we visit, which will be looked at in the event that we are the subject of an investigation. But as Kennard notes, “I am quite sure the same argument would not work if, for example, the law required a camera in every room in your house”. This is a vast new power the government is asking for – we shouldn’t let it be played down.

6. The bill would allow our devices to be bugged

Or, in the jargon used in the draft bill, subjected to “equipment interference”. This could include surveillance of everything on a phone or laptop, or even turning on its camera or webcam to watch someone. The bill actually calls for “bulk equipment interference” – when surely, as Kennard notes, “this power…should only be targeted at the most serious of criminal suspects” at most.

7. The ability to bug devices would make them less secure

Devices can only be subject to “equipment interference” if they have existing vulnerabilities, which could also be exploited by criminals and hackers. If the security services know about these vulnerabilities, they should be telling the manufacturers about them. As Kennard writes, allowing equipment interference "encourages the intelligence services to keep vulnerabilities secret” so they don't lose surveillance methods. In doing so, though, they leave the wider population open to attack from cyber criminals.


So there you have it – a compelling soup of misused and made-up terms, and ethically concerning new powers. Great stuff.

Barbara Speed is a technology and digital culture writer at the New Statesman and a staff writer at CityMetric.