Here’s a thought-provoking tweet forwarded to me by Jonathan Petters. Twitter user Kenny Jacoby found a giant 1,200-page document that he wanted to use — not read! So what did he do? He banged on it for 8 hours in order to extract all the data from the PDF and convert it into a structured dataset. Good on him, but how many people have the skills and the time to do that?
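To give a flavor of what that kind of scraping work involves, here is a minimal sketch of the parsing step. The input lines, field names, and regex are all hypothetical — invented for illustration, not taken from Jacoby’s actual document or code — but the pattern (text extraction, then regex matching, then structured output) is the usual shape of this kind of job.

```python
import csv
import io
import re

# Hypothetical lines as they might come out of a PDF text extractor
# (pdftotext, pdfplumber, etc.). Real extractor output is far messier.
raw_lines = [
    "2017-03-04   Smith, Jane      Complaint filed        $1,250.00",
    "2017-05-19   Doe, John        Case dismissed           $300.00",
]

# One regex per row: ISO date, "Last, First" name, action text, dollar amount.
row_pattern = re.compile(
    r"(\d{4}-\d{2}-\d{2})\s+(.+?,\s*\S+)\s+(.+?)\s+\$([\d,]+\.\d{2})"
)

def parse_rows(lines):
    """Turn extractor output lines into structured records."""
    records = []
    for line in lines:
        m = row_pattern.search(line)
        if m:
            date, name, action, amount = m.groups()
            records.append({
                "date": date,
                "name": name,
                "action": action.strip(),
                # Strip thousands separators so the amount is numeric.
                "amount": float(amount.replace(",", "")),
            })
    return records

records = parse_rows(raw_lines)

# Write the structured data out as CSV — the dataset the PDF never offered.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["date", "name", "action", "amount"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```

Even this toy version shows why the work took hours: every table layout, page header, and line-wrap quirk in a 1,200-page PDF needs its own handling. If the agency had published the underlying data as CSV or JSON, none of this code would be necessary.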
And this, kind readers, is a perfect example of what we’ve been writing about here at FGI for some time. In the age of near-ubiquitous online access, it’s not enough for governments to publish PDFs, they need to provide more and better access for both humans and machines.
Information must be:
not just preserved, but discoverable [2.2.2]
not just discoverable, but deliverable [2.3.3]
not just deliverable as bits, but readable [2.2.1]
not just readable, but understandable [2.2.1]
not just understandable, but usable
(Numbers in brackets refer to sections of the OAIS standard.)
This is a nut that our friends at the Congressional Data Coalition are trying to crack, and some federal agencies are wading into this space as well — see, for example, Data.gov. But there are still far too many examples of this data/publication divide. It’s going to take a concerted effort by the public and by watchdog groups like the Sunlight Foundation and OpenTheGovernment, along with librarians, to push for this change in the way we think about government information.
Spent about 8 hours writing and debugging code to scrape a hideous 1,200-page PDF into a structured dataset because a public agency refused to give up its raw data.
Don’t mess with me. pic.twitter.com/Eiz1SZl3Xx
— Kenny Jacoby (@kennyjacoby) May 28, 2018
Hot off the presses from the National Academies is this prepublication version of a report, Frontiers in Massive Data Analysis. This is a really nice survey of much of the state of the art and of current issues in “big data.” Govt information librarians owe it to themselves to become well-versed in this area, as more and more researchers across many disciplines become interested in govt information as a corpus for larger-scale analysis (I’m already getting questions about corpus research!).
From Facebook to Google searches to bookmarking a webpage in our browsers, today’s society generates an enormous amount of data. Some internet-based companies such as Yahoo! are even storing exabytes (10^18 bytes) of data. Like these companies, scientific communities are also generating large amounts of data — mostly terabytes, and in some cases nearing petabytes — from experiments, observations, and numerical simulation. Indeed, the scientific community, along with the defense enterprise, has been a leader in generating and using large data sets for many years. The issue that arises with this new scale of data is how to handle it: sharing the data, enabling data security, working with different data formats and structures, dealing with highly distributed data sources, and more.
Frontiers in Massive Data Analysis presents the work of the Committee on the Analysis of Massive Data to make sense of the current state of data analysis for mining massive data sets, to identify gaps in current practice, and to develop methods to fill those gaps. The committee examines the frontiers of research enabling the analysis of massive data, including data representation and methods for keeping humans in the data-analysis loop. The report includes the committee’s recommendations, details on the types of data that make up massive data, and information on the seven computational giants of massive data analysis.