Have you ever wondered why preserving web-published information is a complex task? Here are three examples of what makes “web harvesting” a difficult and inexact method of digital preservation.
- Keystone XL Pipeline: Final Supplemental Environmental Impact Statement (SEIS)
This prompts many questions. Do any of the PDF files themselves contain a list of all 94 PDF files? To preserve this title, what exactly should be preserved: this page plus all 94 PDFs, or just the PDFs? Should the CSS, JS, and image files also be preserved? How should the files we choose to preserve be linked and described? And how should we count what we have preserved: as one location, as a count of all the files preserved, as a count of the WARC files we create, or as one single title?
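At least the first step of this problem is mechanical: a harvester can enumerate the PDF links that a landing page exposes before deciding what to preserve. A minimal sketch using only the Python standard library, with a made-up HTML fragment standing in for the SEIS landing page (the URLs and structure are invented for illustration):

```python
from html.parser import HTMLParser

class PDFLinkCollector(HTMLParser):
    """Collect href targets ending in .pdf from an HTML page."""
    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    self.pdf_links.append(value)

# Hypothetical fragment standing in for the real landing page.
sample_html = """
<html><body>
<a href="/documents/seis-vol1.pdf">Volume 1</a>
<a href="/documents/seis-vol2.pdf">Volume 2</a>
<a href="/about.html">About this report</a>
</body></html>
"""

collector = PDFLinkCollector()
collector.feed(sample_html)
print(len(collector.pdf_links))  # how many PDFs a harvester would need to fetch
print(collector.pdf_links)
```

Note that such a crawl only finds what the page links to directly; it cannot answer the harder question of whether those links constitute the complete title.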
- The White House current third party (social media) pages / accounts
This page lists a total of 92 accounts that the White House uses on 26 non-.gov social media websites.
This raises many questions: How much information is duplicated? How much is unique? Do these sites simply point back to .gov websites? Is some of the information (such as videos) stored only on these “third party” sites? Do the terms of use of the third-party sites impose copyright or other restrictions that technically or legally limit what we can preserve? Even where a site merely points at more substantive information on a .gov website, would it be of value to preserve how the White House presents itself differently to different communities?
- Executive Order 13662
The above link is to an official, authenticated version of an Executive Order. The same executive order also appears in other formats on other websites. In total, ten different URLs apparently carry the same information (and the same metadata about the content).
This raises the question of whether all these versions really are the same and, if so, how to know that and whether to preserve them all. It also raises the question of whether and how to preserve the links to these different versions: if other web pages that we preserve link to them, should we ensure those links still resolve successfully in a preservation environment?
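Deciding whether several URLs “really are the same” usually starts with a fixity check: hash the bytes each URL serves and compare the digests, the same technique used for deduplication within WARC-based archives. A sketch, with hypothetical payloads standing in for the fetched copies:

```python
import hashlib

def content_digest(data: bytes) -> str:
    """SHA-256 digest of a document's bytes, a standard fixity check."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical payloads standing in for copies fetched from three of the URLs.
copy_a = b"Executive Order 13662 ... full text ..."
copy_b = b"Executive Order 13662 ... full text ..."
copy_c = b"<html>Executive Order 13662 ... full text ...</html>"

print(content_digest(copy_a) == content_digest(copy_b))  # True: byte-identical copies
print(content_digest(copy_a) == content_digest(copy_c))  # False: same text, different format
```

The limitation is visible in the third comparison: byte-level hashing treats a PDF and an HTML rendering of the same order as entirely different objects, so “sameness” at the content level still requires human or format-aware judgment.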
These kinds of questions raise still others. Should we have some sort of standards for resolving them? Should all preservation projects make the same decisions, or is it better to accept that different projects will preserve differently because of their own missions and scopes? And how can we compare what we preserve with the originals and with other preservation projects, so as to avoid unnecessary duplication, encourage comprehensiveness, and enable accurate discovery by users of our preservation repositories?
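One concrete way projects could compare holdings to avoid unnecessary duplication is to exchange inventories of content digests and take set differences. A hypothetical sketch (the digest values are invented, and real inventories would pair each digest with a URL and capture date):

```python
# Hypothetical per-file digest inventories from two preservation projects.
ours = {"sha256:0a1f", "sha256:4f2c", "sha256:9c3d"}
theirs = {"sha256:4f2c", "sha256:d07e"}

already_held_elsewhere = ours & theirs   # candidates for deduplication
unique_to_us = ours - theirs             # our distinctive contribution
coverage_gap = theirs - ours             # material we might still want to capture

print(sorted(already_held_elsewhere))
print(sorted(unique_to_us))
print(sorted(coverage_gap))
```

Even this simple comparison presupposes the standards questioned above: it only works if both projects agree on what a preservable unit is and how to fingerprint it.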