Some facts about the born-digital “National Collection”

We want to contribute a few facts and some context about the born-digital “National Collection” to help inform discussions of the priorities of GPO and FDLP libraries at the upcoming spring 2022 Depository Library Conference, as well as discussions surrounding the work of the all-digital FDLP task force.

We believe these facts lead to an unavoidable conclusion: GPO and FDLP need to state explicitly that dealing with unpreserved born-digital government information is a high priority.

Here are the facts.

Who produces born-digital government information?

We have been examining data from the 2020 End-of-Term (EOT20) crawl. We found (not surprisingly) that, by far, the most prominent types of born-digital content on the web are web pages (HTML files) and PDF files. We counted only unique web pages and PDF files from the government web in EOT20 and found more than 126 million web pages and more than 2.8 million PDF files, for a total of more than 129 million born-digital items. More than 80% of that content is from the executive branch.

What is GPO preserving?

GOVINFO: There are roughly 2 million PDFs in Govinfo. These items are secure and preserved in GPO’s certified trusted digital repository. By our count, 74% of the born-digital PDF content in Govinfo is from the judicial branch, 24% from the legislative branch, and only 2% from the executive branch. In other words, GPO devotes almost 3/4 of its born-digital preservation space to the judiciary, which produces only about 2% of all born-digital government information. Conversely, GPO devotes only 2% of its born-digital preservation space to the executive branch, which produces more than 80% of born-digital government information.
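The mismatch described above can be expressed as a simple ratio of preservation share to production share. The following is a minimal sketch using only the percentages quoted in this post; the variable and function names are ours, and the comparison sets shares of PDFs in Govinfo against shares of all born-digital items, so the ratios are rough indicators rather than precise measurements.

```python
# Percent of born-digital PDF content in Govinfo, by branch (figures from this post).
govinfo_pdf_share = {"judicial": 74, "legislative": 24, "executive": 2}

# Approximate percent of all born-digital government information produced,
# by branch. The post gives "about 2%" (judicial) and "more than 80%"
# (executive); no legislative figure is quoted here, so it is left out.
production_share = {"judicial": 2, "executive": 80}


def representation_ratio(branch):
    """Preservation share divided by production share.

    A value above 1 means the branch takes up more of Govinfo than its
    share of output would suggest; a value below 1 means less.
    """
    return govinfo_pdf_share[branch] / production_share[branch]


for branch in production_share:
    print(f"{branch}: {representation_ratio(branch):.3f}")
```

Because the executive share of production is "more than 80%", the executive ratio computed here is an upper bound.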

FDLP-WA: The FDLP Web Archive on the Internet Archive’s Archive-It servers had 211 “collections” or “websites” when we counted earlier this year. Most of the content of the FDLP-WA is from the executive branch (by our count, it includes only three congressional agencies and one judicial agency). GPO describes its web harvesting as targeted at small websites. By our count, using the EOT20 data, there are 23,666 “small” government websites, and altogether they contain only 0.06% of the public information posted on the government web. By contrast, 99% of the public information on the government web is hosted by 1,882 “large” websites, none of which GPO is targeting.

GPO also stores some copies of some cataloged web-based content on its permanent.fdlp.gov server. We do not have exact figures on the quantity of content stored, but we do know that, on average, GPO catalogs just over 19,000 titles a year. As a percentage of just the PDFs on the government web in 2020, that is less than 1% per year.
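The "less than 1% per year" figure can be checked with a quick calculation. This is a minimal sketch that uses the two numbers quoted in this post; the variable names are ours, and because the PDF count is stated as "more than 2.8 million", the result is an upper bound on the percentage.

```python
# Figures quoted in this post.
titles_cataloged_per_year = 19_000   # "just over 19,000 titles a year"
pdfs_on_gov_web_2020 = 2_800_000     # "more than 2.8 million", so a lower bound

# Cataloged titles per year as a percentage of the 2020 PDF count.
pct = 100 * titles_cataloged_per_year / pdfs_on_gov_web_2020
print(f"{pct:.2f}% of the PDFs found in EOT20")
```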

GPO has a few “digital access” partnerships (NASA, NLM, GAO and a couple of others), but there is only one digital preservation stewardship agreement: with the University of North Texas (UNT) Libraries. (A “digital access partner” and a “digital preservation steward” are different roles.)

Although we do not have data on how quickly content on the web is altered or removed, one study determined that 83% of the PDF files present in the 2008 EOT crawl were missing in the 2012 EOT crawl.

Conclusions

  • GPO is doing a good (though not comprehensive) job of preserving born-digital content from the judicial and legislative branches but, by our rough estimate, this accounts for only about 15% of born-digital government information.

  • GPO is preserving very, very little of the born-digital content of the executive branch, which is where about 80% of born-digital publishing is being done.

  • To ensure the preservation of this executive branch born-digital government information, GPO needs an active program to acquire and preserve it. The Depository Library Council (DLC) should issue a strong statement recognizing this huge gap in digital preservation and recommending that GPO prioritize developing plans to address it.

Authors

James A. Jacobs, University of California San Diego
James R. Jacobs, Stanford University

CC BY-NC-SA 4.0 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

