Can we rely on trying to ‘harvest’ the web?

Dr. David S. H. Rosenthal, Chief Scientist at LOCKSS, and Kris Carpenter Negulescu of the Internet Archive recently organized a workshop on the problems of harvesting and preserving the Web as it evolves from a collection of linked HTML documents into a programming environment whose primary language is JavaScript.
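The shift matters for harvesting because a crawler sees only the static markup a server returns, not the page a browser builds. A minimal sketch, using hypothetical markup of the kind a JavaScript-driven site might serve, shows how little of such a page a naive harvester can actually capture:

```python
from html.parser import HTMLParser

# Hypothetical page shell: the crawler's GET retrieves this markup,
# while the article text is assembled in the browser at view time.
PAGE = """<html><body>
<div id="app"></div>
<script>
fetch('/api/articles?id=42')
  .then(r => r.json())
  .then(a => { document.getElementById('app').textContent = a.body; });
</script>
</body></html>"""

class TextExtractor(HTMLParser):
    """Collects the visible text a static harvester would index."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script bodies are code, not content; skip them.
        if not self.in_script and data.strip():
            self.chunks.append(data.strip())

parser = TextExtractor()
parser.feed(PAGE)
print(parser.chunks)  # [] -- the static HTML contains no article text
```

An archive that stores only this HTML preserves an empty shell; replaying the page faithfully would also require capturing the `/api/articles` response the embedded script fetches.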

David and Kris, with help from staff at the Internet Archive, put together a list of 13 areas that are already causing problems for Web preservation:

Database driven features
Complex/variable URI formats
Dynamically generated URIs
Rich, streamed media
Incremental display mechanisms
Multi-sourced, embedded content
Dynamic login, user-sensitive embeds
User agent adaptation
Exclusions (robots.txt, user-agent, …)
Exclusion by design
Server-side scripts, RPCs
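The exclusions item is the easiest of these to make concrete. Python's standard-library `urllib.robotparser` shows how a robots.txt file can shut an archival crawler out of part or all of a site; the robots.txt content below is a hypothetical example, and `ia_archiver` is used as an illustrative crawler user-agent:

```python
from urllib import robotparser

# Hypothetical robots.txt of the kind a site might serve. Harvesters
# that honor it must skip the disallowed paths entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /private/

User-agent: ia_archiver
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A generic crawler may fetch ordinary pages but not search results.
print(rp.can_fetch("*", "https://example.com/about"))        # True
print(rp.can_fetch("*", "https://example.com/search?q=x"))   # False

# The named archival crawler is excluded from the entire site.
print(rp.can_fetch("ia_archiver", "https://example.com/about"))  # False
```

Per-user-agent rules like the last group mean a site can look perfectly harvestable to one crawler while being invisible to the archive's.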

Read more about this on David’s blog.

CC BY-NC-SA 4.0 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
