Can we rely on trying to ‘harvest’ the web? part 2
Recently, we posted here a link to David Rosenthal’s list of the problems we face in harvesting and preserving the Web.
Here is more on the same topic.
It is a 22-page PDF that presents, in some detail, an overview of the challenges of capturing web content. It was presented at The Future Web workshop, held in May as part of the 2012 International Internet Preservation Consortium General Assembly meeting (IIPC GA) hosted by the Library of Congress. The purpose of the paper was to provide a shared context for participants.
The paper enumerates those problems and also lists “Current Mitigation Strategies,” but, as Rosenthal pointed out, all of these strategies are aimed at capturing a “user experience,” and our ability to meet even that goal is limited.
A different question libraries should be asking is: how can they capture the content behind the user experience? The presentation is important, but even more important is the raw data that sites use to provide those experiences. This kind of information used to be instantiated in books, magazines, maps, pamphlets, and newspapers.
Today that “raw data” is stored in databases, XML files, GIS applications, and other data stores.
Web harvesting can do little more than capture a snapshot of how that information was presented at a given time in the past by a particular information provider.
Libraries should be capturing those raw data sources. By doing so, they will ensure that current and future users can actually use, analyze, and mine the data in new and interesting ways. Seeing how a past user might have viewed a web page at a particular moment will be of interest to some cultural historians, and is therefore certainly important, but it is only a very small part of what future users will expect from their libraries.
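To make the distinction concrete, here is a minimal sketch of the two approaches, assuming a hypothetical agency site that publishes the same statistics both as a rendered HTML page and as a JSON data service (the URLs and filenames below are invented for illustration):

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical URLs -- stand-ins for a real agency page and its data service.
PAGE_URL = "https://example.gov/reports/air-quality"   # rendered HTML page
DATA_URL = "https://example.gov/api/air-quality.json"  # underlying raw data

stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

# 1. Harvest-style capture: save the page as it was rendered at this moment.
#    This preserves one presentation of the data, not the data itself.
with urllib.request.urlopen(PAGE_URL) as resp:
    with open(f"snapshot-{stamp}.html", "wb") as f:
        f.write(resp.read())

# 2. Raw-data capture: save the structured records behind the page.
#    These records can be re-queried, re-plotted, and mined later.
with urllib.request.urlopen(DATA_URL) as resp:
    records = json.load(resp)
with open(f"data-{stamp}.json", "w") as f:
    json.dump(records, f, indent=2)
```

The first file records only how one provider presented the information at one moment in time; the second preserves the records themselves, which future users can analyze and combine with other sources.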
As the report says, in passing, “the classical model of web archiving is no longer sufficient for capturing, preserving, and re-rendering all the bytes of interest we care about.”
There’s a quick overview of the workshop, and lots more links, here.
Tags: Web harvesting