Predicting the text in redacted documents is close to reality

Releasing delicate information with big black bars all over it has kept secrets safe for years - but not for much longer, maybe.

For those with secrets they want to keep, redacting documents is a pretty important thing to get right. It’s necessary to understand how to redact documents, firstly - look to Southwark Council, which in February uploaded its controversial agreement with developer Lend Lease for the regeneration of the Heygate Estate in a form that let people copy and paste the text underneath the black bars.

But it’s also necessary to know which parts of a document to redact so that the context from the stuff left open doesn’t give the game away. There is always, however, information left behind. The choices made in how to block text - be it with other bits of paper, or black marker pen, or even by typing out new words and then covering those up - can reveal something about the person doing the redacting. Different agencies had different redaction standards at different times, which gives a further clue as to what technique is needed. Each typeface fits into the space under a bar in a limited number of contextually-relevant ways, as well.

In the New Yorker, William Brennan reports on The Declassification Engine, an intriguing attempt by a group of academics to use these clues to try and crack any redacted text. A snippet:

Together with a group of historians, computer scientists, and statisticians, [Columbia history professor Matthew] Connelly is developing an ambitious project called the Declassification Engine, which, among other things, employs machine-learning and natural language processing to study the semantic patterns in declassified text. The project’s goals range from compiling the largest digitized archive of declassified documents in the world to plotting the declassified geographical metadata of over a million State Department cables on an interactive global map, which the researchers hope will afford them new insight into the workings of government secrecy. Though the Declassification Engine is in its early stages, Connelly told me that the project has “gotten to the point where we can see it might be possible to predict content of redacted text. But we haven’t yet made a decision as to whether we want to do that or not.”

One of the things that jumps out in here is the parallel between the "mosaic theory" - where "pieces of banal, declassified information, when pieced together, might provide a knowledgeable reader with enough emergent detail to uncover the information that remains classified" - and critics of the NSA who realise that mass collection of metadata rather than the actual data of communications is, in many ways, just as bad.

Redacted Iraq War info at a 2004 US Senate press conference (Photo: Getty)

Ian Steadman is a staff science and technology writer at the New Statesman. He is on Twitter as @iansteadman.

Show Hide image

The internet makes writing as innovative as speech

When a medium acquires new functions, it will need to be adapted by means of creating new forms.

Many articles on how the internet has changed language are like linguistic versions of the old Innovations catalogue, showcasing the latest strange and exciting products of our brave new digital culture: new words (“rickroll”); new uses of existing words (“trend” as a verb); abbreviations (smh, or “shaking my head”); and graphic devices (such as the much-hyped “new language” of emojis). Yet these formal innovations are merely surface (and in most cases ephemeral) manifestations of a deeper change a change in our relationship with the written word.

I first started to think about this at some point during the Noughties, after I noticed the odd behaviour of a friend’s teenage daughter. She was watching TV, alone and in silence, while her thumbs moved rapidly over the keys of her mobile phone. My friend explained that she was chatting with a classmate: they weren’t in the same physical space, but they were watching the same programme, and discussing it in a continuous exchange of text messages. What I found strange wasn’t the activity itself. As a teenage girl in the 1970s, I, too, was capable of chatting on the phone for hours to someone I’d spent all day with at school. The strange part was the medium: not spoken language, but written text.

In 1997, research conducted for British Telecom found that face-to-face speech accounted for 86 per cent of the average Briton’s communications, and telephone speech for 12 per cent. Outside education and the (white-collar or professional) workplace, most adults did little writing. Two decades later, it’s probably still true that most of us talk more than we write. But there’s no doubt we are making more use of writing, because so many of us now use it in our social interactions. We text, we tweet, we message, we Facebook; we have intense conversations and meaningful relationships with people we’ve never spoken to.

Writing was not designed to serve this purpose. Its original function was to store information in a form that did not depend on memory for its transmission and preservation. It acquired other functions, of the social kind, among others; but even in the days when “snail mail” was less snail-like (in large cities in the early 1900s there were five postal deliveries a day), “conversations” conducted by letter or postcard fell far short of the rapid back-and-forth that ­today’s technology makes possible.

When a medium acquires new functions, it will need to be adapted by means of creating new forms. Many online innovations are motivated by the need to make written language do a better job of two things in particular: communicating tone, and expressing individual or group identity. The rich resources speech offers for these purposes (such as accent, intonation, voice quality and, in face-to-face contexts, body language) are not reproducible in text-based communication. But users of digital media have found ways to exploit the resources that are specific to text, such as spelling, punctuation, font and spacing.

The creative use of textual resources started early on, with conventions such as capital letters to indicate shouting and the addition of smiley-face emoticons (the ancestors of emojis) to signal humorous or sarcastic intent, but over time it has become more nuanced and differentiated. To those in the know, a certain respelling (as in “smol” for “small”) or the omission of standard punctuation (such as the full stop at the end of a message) can say as much about the writer’s place in the virtual world as her accent would say about her location in the real one.

These newer conventions have gained traction in part because of the way the internet has developed. As older readers may recall, the internet was once conceptualised as an “information superhighway”, a vast and instantly accessible repository of useful stuff. But the highway was a one-way street: its users were imagined as consumers rather than producers. Web 2.0 changed that. Writers no longer needed permission to publish: they could start a blog, or write fan fiction, without having to get past the established gatekeepers, editors and publishers. And this also freed them to deviate from the linguistic norms that were strictly enforced in print – to experiment or play with grammar, spelling and punctuation.

Inevitably, this has prompted complaints that new digital media have caused literacy standards to plummet. That is wide of the mark: it’s not that standards have fallen, it’s more that in the past we rarely saw writing in the public domain that hadn’t been edited to meet certain standards. In the past, almost all linguistic innovation (the main exception being formal or technical vocabulary) originated in speech and appeared in print much later. But now we are seeing traffic in the opposite direction.

Might all this be a passing phase? It has been suggested that as the technology improves, many text-based forms of online communication will revert to their more “natural” medium: speech. In some cases this seems plausible (in a few it’s already happening). But there are reasons to think that speech will not supplant text in all the new domains that writing has conquered.

Consider my friend’s daughter and her classmate, who chose to text when they could have used their phones to talk. This choice reflected their desire for privacy: your mother can’t listen to a text-based conversation. Or consider the use of texting to perform what politeness theorists call “face-threatening acts”, such as sacking an employee or ending an intimate relationship. This used to be seen as insensitive, but my university students now tell me they prefer it – again, because a text is read in private. Your reaction to being dumped will not be witnessed by the dumper: it allows you to retain your dignity, and gives you time to craft your reply.

Students also tell me that they rarely speak on the phone to anyone other than their parents without prearranging it. They see unsolicited voice calls as an imposition; text-based communication is preferable (even if it’s less efficient) because it doesn’t demand the recipient’s immediate and undivided attention. Their guiding principle seems to be: “I communicate with whom I want, when I want, and I respect others’ right to do the same.”

I’ll confess to finding this new etiquette off-putting: it seems ungenerous, unspontaneous and self-centred. But I can also see how it might help people cope with the overwhelming and intrusive demands of a world where you’re “always on”. (In her book Always On: Language in an Online and Mobile World, Naomi Baron calls it “volume control”, a way of turning down the incessant noise.) As with the other new practices I’ve mentioned, it’s a strategic adaptation, exploiting the inbuilt capabilities of technology, but in ways that owe more to our own desires and needs than to the conscious intentions of its designers. Or, to put it another way (and forgive me if I adapt a National Rifle Association slogan): technologies don’t change language, people do.

Deborah Cameron is Professor of Language and Communication at the University of Oxford and a fellow of Worcester College

This article first appeared in the 16 February 2017 issue of the New Statesman, The New Times