Author Archives: James A Jacobs
JCP and JCL house members
House Resolution 190, introduced on March 5, 2025, names Mike Carey (R, OH:15), Joseph Morelle (D, NY:25), and Julie Johnson (D, TX:32) to the Joint Committee of Congress on the Library, and Joseph Morelle (D, NY:25), Gregory Murphy (R, NC:3), Terri Sewell (D, AL:7), and Mary Miller (R, IL:15) to the Joint Committee on Printing.
The government information crisis is bigger than you think it is
[This post is adapted from our forthcoming book, Preserving Government Information: Past, Present, and Future.]
Today we want to clarify something important about preserving government information. There is a difference between the government changing a policy and the government erasing information, but the line between those two has blurred in the digital age.
When a new president is inaugurated, one expects new policies. The number of changes and the speed of change may vary for different administrations, but we expect that every administration will be different in some ways from its predecessor. After all, that is part of the reason we have elections. Also, information that the government publishes is updated all the time, not just when administrations change. Laws and regulations are added and amended and rescinded, new economic and environmental and census data are collected and published, government recommendations to the public (like the Department of Agriculture’s “food pyramid” guidance) are revised.
Changes in government information are normal in a democracy.
Because change is normal, it is essential to preserve government information – even “non-current” and “out of date” information – in order to document those changes. This is not a new idea, but a long-accepted principle of democracy. Citizens need a record of what a government’s stated values were and when they changed, what actions it took and when it took them, what data it collected and generated at specific points in time, and so forth. It is important to preserve even information that later proves to be inaccurate in order to document what the government knew and when it knew it.
Because published government information is the evidence for a democracy, its preservation is essential.
In the era in which government information was published in paper formats, preservation of that information relied on libraries. The information was distributed to FDLP libraries based on the needs of the communities that those libraries served. Beginning in 1962, Regional FDLs received and retained all the paper publications in the FDLP system. When new information superseded or replaced old information, the old information was not erased or discarded; it was preserved in Regional FDLs and in every FDL whose community valued that older information. In the print era, it was taken for granted that, once government information was released to the public, it would not be withdrawn or altered or lost.1
In the digital age, government publishing has shifted from the distribution of unalterable printed books to digital posts on government websites. Such digital publications can be moved, altered, and withdrawn at the flick of a switch. Publishing agencies are not required to preserve their own information, nor to provide free access to it.
Some digital government information is actively preserved by GPO, NARA, and the Library of Congress. Some government-collected data are preserved by law or by tradition. But the laws that allow this are weak and government preservation of government information suffers from large gaps. Non-government projects (notably the Internet Archive and the End-of-Term Archive) use web harvesting to attempt to acquire and store government information, but these projects are, by their nature, incomplete and their long-term guarantees of access are fragile. As a result of all this, the public can no longer assume that any given piece of government information will not be withdrawn or altered or lost.
The early actions of the incoming Trump administration (as well as the actions of the first Trump administration) have brought the vulnerability of digital information to the public’s attention (see our previous post “Federal information scrubbing has begun”) and the public is rightfully worried. That vulnerability is, however, not limited to this administration. Digital government information was being lost before President Trump.
The current crisis of imminent loss of information exists not only because government information is being changed, but because it is being erased. The erasure is possible because of the gaps in the current preservation infrastructure.
The scale of loss and alteration of information under Trump may prove to be unprecedented and certainly requires immediate short-term action. But librarians and archivists and citizens should use this current crisis to demand more than short-term solutions. A new distributed digital preservation infrastructure is needed for digital government information.
James A. Jacobs
James R. Jacobs
1. Even when information was withdrawn for some reason, there was a record of the withdrawals. (See this spreadsheet listing withdrawn documents 1981 – 2018, collated from GPO’s no-longer-published “Administrative Notes” newsletter.)
Maryellen Trautman
The great government documents librarian Maryellen Trautman died on November 17, 2024. She was a pioneer and a protector of government documents as a regional depository librarian in Oklahoma and as a government documents librarian at the National Archives and Records Administration. She was one of the founders of the American Library Association Government Documents Round Table (GODORT) in 1972. Among her many other accomplishments, she served on the Congressional Joint Committee on Printing’s Serial Set Committee where she supported the preservation of the Serial Set.
Government recommendations to preserve government information not preserved by government
James and I are writing a book on preserving government information. In the course of researching the book, we find ourselves hunting down government publications that we need but that are not available from the government or from any FDLP library. Each of these documents has its own explanation for why it is missing and each explanation tells a story about the gaps in preservation of government information.
This is one of those stories. Think of this as a long footnote to a future book.
In 2002, Congress established the Interagency Committee on Government Information (ICGI). One of its charges (more…)
Some facts about the born-digital “National Collection”
April 9, 2022
We want to contribute a couple of facts and context about the born-digital “National Collection” to help inform the discussions on the priorities of GPO and FDLP libraries at the upcoming spring 2022 Depository Library Conference as well as discussions surrounding the work of the all-digital FDLP task force.
We believe these facts lead to an unavoidable conclusion: GPO and FDLP need to explicitly make it a strong priority to deal with unpreserved born-digital government information.
Here are the facts.
Who produces born-digital government information?
We have been examining data from the 2020 End-of-Term crawl. We found (not surprisingly) that, by far, the most prominent types of born-digital content on the web are web pages (HTML files) and PDF files. We counted just unique web pages and PDF files from the government web in EOT20 and found more than 126 million web pages and more than 2.8 million PDF files for a total of more than 129 million born-digital items. More than 80% of that content is from the executive branch.
What is GPO preserving?
GOVINFO: There are roughly 2 million PDFs in Govinfo. These items are secure and preserved in GPO’s certified trusted digital repository. By our count, 74% of the born-digital PDF content in Govinfo is from the judicial branch, 24% from the legislative branch, and only 2% from the executive branch. In other words, GPO devotes almost 3/4 of its born-digital preservation space to the judiciary, which produces only about 2% of all born-digital government information. Conversely, GPO devotes only 2% of its born-digital preservation space to the executive branch, which produces more than 80% of born-digital government information.
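The mismatch described above can be expressed as a simple ratio of preservation share to production share. The sketch below is only an illustration using the rounded percentages quoted in this post; the dictionary layout and the name `representation_ratio` are our own hypothetical choices, and the production figures are approximate ("about 2%", "more than 80%").

```python
# Rounded percentages quoted in the post; the layout here is illustrative only.
# "preserved_pct" = share of born-digital PDF content in Govinfo.
# "produced_pct"  = approximate share of all born-digital government information.
shares = {
    "judicial": {"preserved_pct": 74, "produced_pct": 2},
    "executive": {"preserved_pct": 2, "produced_pct": 80},
}


def representation_ratio(preserved_pct, produced_pct):
    """Return preservation share divided by production share.

    A value above 1 means a branch takes up more of the preserved
    collection than its share of production; below 1 means less.
    """
    return preserved_pct / produced_pct


for branch, s in shares.items():
    ratio = representation_ratio(s["preserved_pct"], s["produced_pct"])
    print(f"{branch}: {ratio:.3f}")
```

On these rounded figures, the judicial branch's ratio is far above 1 and the executive branch's ratio is far below 1, which is the imbalance the paragraph describes.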
FDLP-WA: The FDLP Web Archive on the Internet Archive’s Archive-It servers had 211 “collections” or “websites” when we counted earlier this year. Most of the content of the FDLP-WA is from the executive branch (by our count, it includes only 3 congressional agencies and one judicial agency). GPO describes its web harvesting as targeted at small websites. By our count, using the EOT20 data, there are 23,666 “small” government websites and altogether they contain only 0.06% of the public information posted on the government web. By contrast, 99% of public information on the government web is hosted by 1,882 “large” websites, none of which GPO is targeting.
GPO also stores some copies of some cataloged web-based content on its permanent.fdlp.gov server. We do not have exact figures on the quantity of content stored, but we do know that, on average, GPO catalogs just over 19,000 titles a year. As a percentage of just the PDFs on the government web in 2020, that is less than 1% per year.
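The "less than 1% per year" figure can be checked with a quick calculation. This is a rough sketch using the rounded numbers quoted in this post; it assumes, for illustration only, that one cataloged title corresponds to one PDF, and it treats the EOT20 PDF count (stated as "more than 2.8 million") as exactly 2.8 million.

```python
# Rounded figures quoted in the post.
pdf_files_eot20 = 2_800_000         # "more than 2.8 million" unique PDFs (lower bound)
titles_cataloged_per_year = 19_000  # "just over 19,000 titles a year"

# Assumption for illustration: one cataloged title is counted as one PDF.
catalog_share_pct = titles_cataloged_per_year / pdf_files_eot20 * 100

print(f"Annual cataloging as a share of 2020 PDFs: {catalog_share_pct:.2f}%")
```

Because the PDF count is a lower bound, the true share would be slightly smaller than this calculation gives.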
GPO has a few “digital access” partnerships (NASA, NLM, GAO and a couple of others), but there’s only 1 digital preservation stewardship agreement: with University of North Texas (UNT) libraries (check out the difference between a “digital access partner” and a “digital preservation steward” here).
Although we do not have data on how quickly content on the web is altered or removed, one study determined that 83% of the PDF files present in the 2008 EOT crawl were missing in the 2012 EOT crawl.
Conclusions
GPO is doing a good (though not comprehensive) job of preserving born-digital content from the judicial and legislative branches but, by our rough estimate, this accounts for only about 15% of born-digital government information.
GPO is preserving very, very little of the born-digital content of the executive branch, which is where about 80% of born-digital publishing is being done.
To ensure the preservation of this executive branch born-digital government information, GPO needs an active program to acquire and preserve it. The Depository Library Council (DLC) should create a strong statement recognizing this huge gap in digital preservation and recommending that GPO prioritize developing plans for addressing it.
Authors
James A. Jacobs, University of California San Diego
James R. Jacobs, Stanford University