From Documents to Data: Help Build a Toolkit for the Rest of Us
From Tesseract to Tabula, there are dozens of open-source software libraries designed to help users work with unstructured data — or what non-nerds might call documents.
Web-based services like DocumentCloud, Open Calais and Overview have solved a few of the most common problems journalists face, but there is still a big hole to fill.
That’s the challenge our Knight/Mozilla fellow will tackle.
The project is not necessarily about invention so much as integration: taking stock of what’s available and building an easy-to-use, easy-to-deploy suite of tools to help journalists work with original source documents. Although the toolkit will be designed in a newsroom with journalists in mind, we think it will be just as useful for members of the public.
What problems might this toolkit address? Well, volume for one. It’s not uncommon for journalists to be confronted with huge numbers of documents — in some cases, millions of pages at a time.
The limits of services like DocumentCloud in handling that volume, along with the constraints of the documents’ own formats, place a hard ceiling on journalists’ ability to tag, share, analyze and surface newsworthy tidbits and patterns. This is particularly true for smaller newsrooms that don’t have in-house developers to work with these often complex and quirky libraries.
The fellow will be attached jointly to the two teams at The Times most involved in solving the documents-to-data dilemma: Interactive News and the computer-assisted reporting team. He or she will spend 10 months working on real stories with real reporters and editors. The end goal is to develop and, ultimately, release a document toolkit that real people can understand and use.
We’re excited about the potential for this kind of tool to transform investigations in large and small newsrooms, in crowdsourcing efforts and even in the academy, opening up in-depth document analysis to everyone.