Here’s a thought-provoking tweet forwarded to me by Jonathan Petters. Twitter user Kenny Jacoby found a giant 1,200-page document that he wanted to use, not just read. So what did he do? He banged on it for 8 hours in order to extract all the data from the PDF and convert it into a structured dataset. Good on him for doing that, but how many people have the skills and the time to be able to do that?
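To give a flavor of what that 8 hours of work looks like, here is a minimal sketch in Python, using entirely made-up sample rows standing in for text pulled out of a PDF. Even after the raw text is extracted, you still have to hand-write parsing rules like these, and real documents break them constantly:

```python
# A toy illustration (hypothetical data, not Jacoby's actual document):
# turning PDF-extracted text into a structured dataset means writing
# brittle, document-specific parsing rules by hand.
import csv
import io
import re

# Hypothetical lines, as a PDF text extractor might emit them.
raw_pages = [
    "Case No.  Agency            Date        Amount",
    "2018-001  Parks Dept.       05/01/2018  $1,250.00",
    "2018-002  Water Authority   05/03/2018  $842.50",
]

# One pattern per row layout; headers and footers simply fail to match.
ROW = re.compile(
    r"^(?P<case>\d{4}-\d{3})\s+(?P<agency>.+?)\s{2,}"
    r"(?P<date>\d{2}/\d{2}/\d{4})\s+\$(?P<amount>[\d,]+\.\d{2})$"
)

def parse_rows(lines):
    """Yield a dict for each line matching the row pattern."""
    for line in lines:
        m = ROW.match(line)
        if m:
            rec = m.groupdict()
            rec["amount"] = float(rec["amount"].replace(",", ""))
            yield rec

records = list(parse_rows(raw_pages))

# Write the structured result as CSV -- the machine-readable form the
# agency could have published in the first place.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["case", "agency", "date", "amount"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```

Multiply that by every table layout, footnote, and page-break quirk in 1,200 pages and the 8-hour figure starts to look optimistic. Publishing the data itself would make all of this unnecessary.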
And this, kind readers, is a perfect example of what we’ve been writing about here at FGI for some time. In the age of near-ubiquitous online access, it’s not enough for governments to publish PDFs; they need to provide more and better access for both humans and machines.
Information must be:
not just preserved, but discoverable [2.2.2]
not just discoverable, but deliverable [2.3.3]
not just deliverable as bits, but readable [2.2.1]
not just readable, but understandable [2.2.1]
not just understandable, but usable
(Numbers in brackets refer to sections of the OAIS standard.)
This is a nut that our friends at the Congressional Data Coalition are trying to crack. And some federal agencies are wading into this space as well — see for example Data.gov. But there are still far too many examples of this data/publication divide. It’s going to take a concerted effort by the public and watchdog groups like the Sunlight Foundation and OpenTheGovernment, along with librarians, to push for this change in the way we think about government information.
Spent about 8 hours writing and debugging code to scrape a hideous 1,200-page PDF into a structured dataset because a public agency refused to give up its raw data.
Don’t mess with me. pic.twitter.com/Eiz1SZl3Xx
— Kenny Jacoby (@kennyjacoby) May 28, 2018
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.