Can we rely on trying to ‘harvest’ the web?

Dr. David S.H. Rosenthal, Chief Scientist at LOCKSS, and Kris Carpenter Negulescu of the Internet Archive recently organized a workshop on the problems of harvesting and preserving the Web as it evolves from a collection of linked HTML documents into a programming environment whose primary language is JavaScript.
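
To make that shift concrete, here is a minimal sketch of a static link harvester, written with Python's standard `html.parser` module. The page content, the `LinkExtractor` class and the `extract_links` helper are hypothetical illustrations, not code from the workshop or from any real crawler. The harvester follows `<a href>` attributes it finds in the markup, so a link that only comes into existence when a script runs in the browser is never discovered.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags, as a simple static harvester might."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchors present in the markup are seen; scripts are never executed.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page: one static link, one link built by JavaScript at load time.
PAGE = """
<html><body>
<a href="/about.html">About</a>
<div id="nav"></div>
<script>
  var id = 42;
  var a = document.createElement("a");
  a.href = "/article?id=" + id;
  document.getElementById("nav").appendChild(a);
</script>
</body></html>
"""


def extract_links(html):
    """Return every href found in static <a> tags of the given HTML text."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    # The script-generated /article?id=42 link does not appear in the result.
    print(extract_links(PAGE))
```

A harvester that wants the second link would need to execute the page's scripts, for example in a headless browser, which is a far heavier operation than parsing markup.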

David and Kris, with help from staff at the Internet Archive, put together a list of 13 problem areas that already make Web preservation difficult:

Database driven features
Complex/variable URI formats
Dynamically generated URIs
Rich, streamed media
Incremental display mechanisms
Form-filling
Multi-sourced, embedded content
Dynamic login, user-sensitive embeds
User agent adaptation
Exclusions (robots.txt, user-agent, …)
Exclusion by design
Server-side scripts, RPCs
HTML5
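
As one illustration of the "Exclusions" item in the list, the sketch below uses Python's standard `urllib.robotparser` to show how a site's robots.txt can shut out a harvester by user-agent name while leaving the same pages open to other clients. The robots.txt content, the `example-archiver` agent name, the `example.org` URLs and the `allowed` helper are all assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the archiver is excluded from the whole site,
# every other agent is excluded only from /private/.
ROBOTS_TXT = """\
User-agent: example-archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(agent, url):
    """Return True if the robots.txt rules above let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("example-archiver", "other-bot"):
        for url in ("https://example.org/index.html",
                    "https://example.org/private/report.html"):
            print(agent, url, allowed(agent, url))
```

A harvester that honours these rules preserves nothing from such a site, however simple its pages are to crawl.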

Read more about this on David’s blog:

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
