Predicting the text in redacted documents is close to reality

Releasing sensitive documents with big black bars over the secret parts has kept those secrets safe for years - but perhaps not for much longer.

For those with secrets they want to keep, redacting documents is an important thing to get right. First, it's necessary to know how to redact at all - look to Southwark Council, which in February uploaded its controversial agreement with developer Lend Lease for the regeneration of the Heygate Estate in a form that let people copy and paste the text underneath the black bars.

But it’s also necessary to know which parts of a document to redact, so that the context left visible doesn’t give the game away. Some information, however, is always left behind. The choices made in how to block out text - whether with other bits of paper, with black marker pen, or by typing out new words and then covering those up - can reveal something about the person doing the redacting. Different agencies used different redaction standards at different times, which gives a further clue as to which technique was used. And each typeface fits into the space under a bar in only a limited number of contextually-relevant ways.
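To see how little that last clue takes to exploit, consider a toy version of the problem: given the measured width of a black bar and the character widths of the document's typeface, candidate words can be ranked by how snugly they fit underneath it. The sketch below is purely illustrative - the per-character widths are invented, and a real attempt would use metrics extracted from the actual font.

```python
# Hypothetical sketch: ranking candidate words for a redacted span by how
# closely their rendered width matches the measured width of the black bar.
# The per-character widths here are invented for illustration; a real attempt
# would use the document typeface's actual metrics.

CHAR_WIDTHS = {c: 6 for c in "abcdefghijklmnopqrstuvwxyz"}
CHAR_WIDTHS.update({"i": 3, "l": 3, "j": 3, "m": 9, "w": 9, " ": 4})


def text_width(text):
    """Sum of per-character widths, in the same (made-up) units as the bar."""
    return sum(CHAR_WIDTHS[c] for c in text.lower())


def rank_candidates(bar_width, candidates, tolerance=2):
    """Return candidates whose rendered width falls within `tolerance` of the
    bar's width, sorted from best fit to worst."""
    fits = [(abs(text_width(word) - bar_width), word) for word in candidates]
    return [word for diff, word in sorted(fits) if diff <= tolerance]


# A bar 9 units wide admits "ill" and "in", but rules out wider words.
print(rank_candidates(9, ["ill", "bar", "in"]))
```

On its own this only narrows the field, which is exactly the point of the mosaic-style attacks described above: width constraints, combined with the surrounding context and language models of likely phrasing, shrink the candidate set further still.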

In the New Yorker, William Brennan reports on The Declassification Engine, an intriguing attempt by a group of academics to use these clues to try to crack redacted text. A snippet:

Together with a group of historians, computer scientists, and statisticians, [Columbia history professor Matthew] Connelly is developing an ambitious project called the Declassification Engine, which, among other things, employs machine-learning and natural language processing to study the semantic patterns in declassified text. The project’s goals range from compiling the largest digitized archive of declassified documents in the world to plotting the declassified geographical metadata of over a million State Department cables on an interactive global map, which the researchers hope will afford them new insight into the workings of government secrecy. Though the Declassification Engine is in its early stages, Connelly told me that the project has “gotten to the point where we can see it might be possible to predict content of redacted text. But we haven’t yet made a decision as to whether we want to do that or not.”

One of the things that jumps out here is the parallel between the "mosaic theory" - the idea that "pieces of banal, declassified information, when pieced together, might provide a knowledgeable reader with enough emergent detail to uncover the information that remains classified" - and the argument made by critics of the NSA: that mass collection of metadata, rather than the content of communications, is in many ways just as revealing.

Redacted Iraq War info at a 2004 US Senate press conference (Photo: Getty)

Ian Steadman is a staff science and technology writer at the New Statesman. He is on Twitter as @iansteadman.


“Predatory” journals are distorting the brave new world of open science

An outbreak of new journals in recent years threatens the potential benefits of open-access science

The modern, digital era of peer-reviewed science is changing the way high-quality research is being released. As soon as a study has been validated for accuracy, it’s almost immediately published online and covered by a dozen websites before the end of the working day. It can create a sense of collaboration, with more people finding ways to tackle serious challenges such as cancer and climate change. Or it can increase global competitiveness, with discoveries leading to new products and services.

However, there’s been a huge proliferation in recent years of new, obscure open-access journals, potentially undermining quality and verification. A new study published in BMC Medicine claims that such “predatory” journals are drastically altering the landscape for the worse, “preying” on both readers and the scientists who submit to them. (Incidentally, we can trust BMC Medicine on this. It’s one of many periodicals from BioMed Central, a well-respected subsidiary of the science publishing giant Springer Nature.)

The business model for journal publishers varies. Most are commercial businesses, charging authors a fee to have their papers scrutinised and published, while also charging individual readers or groups, such as universities, for access. Non-profit publishers, like PLOS, charge only the authors who submit manuscripts, releasing papers to the public after peer review.

This can sometimes become a long, arduous process, given that a journal’s reputation is at stake, especially when publishing high-profile research or claims. We don’t have to look too far back to remember the implosion at the Lancet, with Andrew Wakefield’s unsubstantiated claims of a link between the MMR vaccine and autism. Some people (especially friends across the pond) still believe this nonsense. With this in mind, you can see how and why readers can become the ultimate target for misleading declarations in journals.

But it’s understandable why there’s an enormous weight behind having research published. It allows an academic to improve their future job prospects and salary levels, all while giving their work the approval they seek. After all, an academic’s list of published work is an extension of their CV. Just look at any university lecturer’s online profile and you’ll see a string of links to their published research on the same page.

This pressure to publish as much work as possible has led to an explosion in the number of articles from open-access publishers carrying between 10 and 99 different journals. Only four years ago, the market was dominated by larger, long-established institutions which each carried 100 or more different journals, covering a range of scientific topics. The study also found that “predatory” journals increased their output from 53,000 open-access articles in 2010 to approximately 420,000 articles, across some 8,000 journals, in 2014.

What may also be contributing to the pressure of becoming a well-cited author is the article processing charge (APC) levied by “predatory” journals. Unsurprisingly, scientists want to save as much money as possible, and the average cost of publication in these journals is approximately $178. This is a far cry from the many hundreds of dollars charged by widely-known and respected journals. For example, Scientific Reports, a journal offered by the powerful Nature Publishing Group, charges $1,495 to process a manuscript, excluding taxes. By setting such low APCs, “predatory” publishers can make victims of well-intentioned researchers just as they do of readers, while making money from them at the same time.

Where exactly are these new journals coming from? Investigators Professor Bo-Christer Björk and Cenyu Shen of Helsinki’s Hanken School of Economics note that 27 per cent of “predatory” publishers are based in India, 17.5 per cent in North America and 26.8 per cent in locations impossible to determine. It’s also telling that many of these journals have the words “international” or “American” in their titles in order to project a misleading sense of importance and prestige – something the study highlights.

What separates well-known journals from “predatory” ones is the often lengthy, tedious process of getting a study published. You can usually see this at the top of a research paper, in the dates showing when it was submitted for review and when it was officially published.

However, even this can mask major flaws within the peer-review system. This was demonstrated by John Bohannon, a correspondent for Science, who deliberately submitted a fake paper riddled with major errors to hundreds of open-access journals. In the end, the made-up study was accepted by 157 journals and rejected by only 98. The story has its own Wikipedia page, so you know it’s true.

Creating hoaxes and half-truths about people or places is just part of everyday life on the internet. But this new (and reliable!) study shows a possible downside of the drive to push more science into the open. Perhaps it’s a small price to pay. Maybe we need to research it a bit more.