Can we rely on trying to 'harvest' the web? Part 2
Here is more on the same topic.
- IIPC Future of the Web Workshop - Introduction & Overview (May 17, 2012)
It is a 22-page PDF that presents a detailed overview of the challenges to capturing web content. It was presented at the Future of the Web workshop, held in May 2012 as part of the International Internet Preservation Consortium General Assembly (IIPC GA) hosted by the Library of Congress. The purpose of the paper was to provide a shared context for participants. Among the challenges it identifies:
- Database driven features and functions
- Complex/variable URI formats and inconsistent/variable link implementations
- Dynamically generated, ever-changing URIs
- Rich Media
- Scripted, incremental display and page loading mechanisms
- Scripted HTML forms
- Multi-sourced, embedded material
- Dynamic login/auth services: captchas, cross-site/social authentication, & user-sensitive embeds
- Alternate display based on user agent or other parameters
- Exclusions by convention
- Exclusions by design
- Server side scripts & remote procedure calls
- HTML5 "web sockets"
- Mobile publishing
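The "scripted, incremental display" problem in the list above can be shown in a minimal, self-contained sketch (the page, the `/api/articles/42` endpoint, and the field names are all invented): the HTML a harvester downloads contains only a placeholder, while the actual content is fetched by the browser's script engine at view time.

```python
# Sketch of a hypothetical script-driven page. The HTML below is all that a
# fetch-and-parse harvester receives; the article text would only appear
# after a browser executes the script and calls the (invented) API endpoint.
from html.parser import HTMLParser

PAGE = """
<html><body>
  <div id="article"><!-- populated by script --></div>
  <script>
    fetch('/api/articles/42').then(r => r.json())
      .then(d => document.getElementById('article').textContent = d.body);
  </script>
</body></html>
"""

class TextExtractor(HTMLParser):
    """Collects the visible text a non-scripting harvester would see."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
    def handle_data(self, data):
        if not self.in_script and data.strip():
            self.chunks.append(data.strip())

extractor = TextExtractor()
extractor.feed(PAGE)
harvested_text = " ".join(extractor.chunks)
print(repr(harvested_text))  # no article body: the content lives behind the API
```

The crawler's copy of the page is an empty shell; without replaying the script (and the service behind it), the archive preserves the container but not the content.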
The paper also lists "Current Mitigation Strategies" but, as Rosenthal pointed out, all of these are aimed at capturing a "user experience" -- and our ability to meet even that goal is limited:
But the clear message from the workshop is that the old goal of preserving the user experience of the Web is no longer possible. The best we can aim for is to preserve a user experience, and even that may in many cases be out of reach.
A different question libraries should be asking is: how can we capture the content behind the user experience? The presentation is important, but even more important is the raw data that sites use to provide those experiences. This kind of information used to be instantiated in books, magazines, maps, pamphlets, and newspapers.
Today that "raw data" is stored in databases, XML files, GIS applications, and other data stores.
Web harvesting can do little more than capture a snapshot of how that information was presented at a given time in the past by a particular information provider.
Libraries should be capturing those raw data sources. By doing so, libraries will ensure that current and future users can actually use, analyze, and mine the data in new and interesting ways. Seeing how a past user might have experienced a web page at a particular point in time will interest some cultural historians and is therefore certainly important. But it is only a very small part of what future users will expect from their libraries.
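The difference can be sketched in a few lines (the county names and population figures are invented): a harvested snapshot preserves one rendering of one query at one moment, while archived raw records support questions no one asked at capture time.

```python
# Contrast (invented data): preserving the raw records behind a hypothetical
# statistics site versus preserving only a rendered snapshot of one page.
import json
from statistics import mean

# The raw data store behind the (hypothetical) site.
raw_records = [
    {"county": "Adams", "year": 2010, "population": 18728},
    {"county": "Brown", "year": 2010, "population": 41513},
    {"county": "Adams", "year": 2011, "population": 18804},
    {"county": "Brown", "year": 2011, "population": 41999},
]

# What a web harvest preserves: one rendering, for one query, at one moment.
snapshot_html = "<h1>Adams County, 2010</h1><p>Population: 18,728</p>"

# With the raw records archived, a future user can ask a new question --
# here, the average population per county across all captured years.
by_county = {}
for rec in raw_records:
    by_county.setdefault(rec["county"], []).append(rec["population"])
averages = {county: mean(values) for county, values in by_county.items()}
print(json.dumps(averages))
```

The snapshot answers only the question that was on the page; the records answer that question and any other the data can support.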
As the report says, in passing, "the classical model of web archiving is no longer sufficient for capturing, preserving, and re-rendering all the bytes of interest we care about."
There's a quick overview of the workshop and lots more links here:
- Harvesting and Preserving the Future Web: Content Capture Challenges, by Nicholas Taylor, The Signal (June 1, 2012).