Our mission

Free Government Information (FGI) is a place for initiating dialogue and building consensus among the various players (libraries, government agencies, non-profit organizations, researchers, journalists, etc.) who have a stake in the preservation of and perpetual free access to government information. FGI promotes free government information through collaboration, education, advocacy and research.

Examples of challenges of web-harvesting for digital preservation

Have you ever wondered why preserving web-published information is a complex task? Here are three examples of what makes “web harvesting” a difficult and inexact method of digital preservation.

  • Keystone XL Pipeline: Final Supplemental Environmental Impact Statement (SEIS)
    To preserve this page intact, one would have to collect a total of 13 URLs: 7 images, 3 JavaScript files, 2 CSS style sheets, and the HTML page itself. The page itself, however, is only a list of links to the 94 files that make up a single title, the Keystone pipeline EIS.

    This prompts many questions. Do any of the PDF files themselves contain a list of all the PDF files? To preserve this title, would one have to preserve this page and all 94 PDFs? What should be preserved: this page plus the PDFs, or just the PDFs? Should the CSS, JavaScript, and image files be preserved? How should the files that we choose to preserve be linked and described? How should we count what we have preserved: as one location, as a count of all the files preserved, as a count of the WARC files we create, or as one single title?

  • The White House current third party (social media) pages / accounts
    This page lists a total of 92 accounts that the White House uses on 26 non-.gov, social media websites.

    This raises many questions: How much information is duplicated? How much is unique? Do these sites simply point back to .gov websites? Is some of the information (such as videos) only stored on these “third party” sites? Do the terms of use of the third-party sites impose restrictions, copyright, or other limitations that either technically or legally limit what we can preserve? Even in a case where a site points at the more substantive information on a .gov website, would it be of value to preserve how the White House presents itself differently to different communities?

  • Executive Order 13662
    The above link is to an official, authenticated version of an Executive Order. This same executive order also appears in other formats on other websites. A total of ten different URLs have (apparently) the same information (and metadata about the content).

    This raises the question of whether all these really are the same and, if so, how to know that and whether to preserve them all. It also raises the question of whether and how to preserve the links to these different versions: When other web pages that we preserve link to the different versions, should we make sure that those links still resolve in a preservation environment?

    Other copies: White House, Federal Register, Federal Register printer-friendly, GPO Federal Register PDF, GPO Federal Register HTML, GPO HTML, GPO MODS, GPO PREMIS, GPO ZIP
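
One narrow way to test whether several copies "really are the same" is to compare cryptographic digests of the harvested bytes. The sketch below is only an illustration: the file labels and contents are hypothetical stand-ins, and `group_by_digest` is a helper name invented here, not part of any preservation tool. Matching digests show that copies are byte-identical. Differing digests do not show that the content differs, since HTML, PDF, and XML renderings of the same order will never hash alike.

```python
import hashlib
from collections import defaultdict


def sha256_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def group_by_digest(copies: dict[str, bytes]) -> dict[str, list[str]]:
    """Map each distinct digest to the labels of the copies that share it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for label, data in copies.items():
        groups[sha256_digest(data)].append(label)
    return dict(groups)


# Hypothetical stand-in content; a real check would read the harvested files.
copies = {
    "copy-a.html": b"<p>Sample order text</p>",
    "copy-b.html": b"<p>Sample order text</p>",
    "copy-c.txt": b"Sample order text",
}

groups = group_by_digest(copies)
for digest, labels in groups.items():
    # Print a shortened digest next to the copies that share it.
    print(digest[:12], labels)
```

In this toy input the two HTML stand-ins fall into one group and the plain-text version into another, which is why a checksum can confirm exact duplicates but cannot by itself settle whether different formats carry the same information.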

These questions raise broader ones. Should we have some sort of standards so that we can know how to resolve them? Should all preservation projects make the same decisions, or would it be better to know that different projects will preserve differently because of their own missions and scopes? How can we compare what we preserve with the originals and with other preservation projects in order to avoid unnecessary duplication, encourage comprehensiveness, and enable accurate discovery by users of our preservation repositories?
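
Part of the comparison question could be approached by exchanging manifests between projects. The following is a minimal sketch under a strong assumption that the post does not make: that each project keeps a manifest mapping every harvested URL to a content digest. The function name, manifest layout, URLs, and digest values are all hypothetical.

```python
def compare_manifests(
    ours: dict[str, str], theirs: dict[str, str]
) -> dict[str, set[str]]:
    """Compare two manifests that each map a URL to a content digest.

    The comparison is by digest, not by URL, so the same file harvested
    from two different locations still counts as held by both projects.
    """
    our_digests = set(ours.values())
    their_digests = set(theirs.values())
    return {
        "held_by_both": our_digests & their_digests,
        "only_ours": our_digests - their_digests,
        "only_theirs": their_digests - our_digests,
    }


# Hypothetical manifests with placeholder digests.
project_a = {
    "https://example.gov/report.pdf": "aaa111",
    "https://example.gov/index.html": "bbb222",
}
project_b = {
    "https://mirror.example.org/report.pdf": "aaa111",
    "https://example.gov/appendix.pdf": "ccc333",
}

report = compare_manifests(project_a, project_b)
for category, digests in report.items():
    print(category, sorted(digests))
```

Comparing by digest rather than by URL is a design choice: it finds duplication across different addresses, but it treats two formats of one title as unrelated, so it supplements rather than replaces descriptive metadata.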

CC BY-NC-SA 4.0 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
