Tag Archives: future of government information in libraries
As a follow-up to our recent post, “Some facts about the born-digital ‘National Collection’,” we want to suggest some specific actions that GPO and FDLP libraries can take to do a better job of collecting and preserving born-digital content for the “National Collection”.
For context, our starting assumption is that GPO and FDLP have two connected priorities: preservation and user services. The two go hand-in-hand. To be “preserved,” content must be discoverable, deliverable, readable, understandable, and usable by people. Broadly speaking, this can be understood as “user services.” Addressing these priorities at scale will require innovative, collaborative approaches. Old solutions that do not scale will not work.
With regard to preservation, digital objects have to be under sufficient control of the preservationist to be preservable. As we pointed out in our previous post, the vast bulk of born-digital government Public Information is not being preserved by GPO or FDLP libraries. Worse, GPO and FDLP have no active plan to address that gap in preservation. While there are lots of projects to digitize historic paper documents in FDLs, there is no active project to acquire, describe, store, manage, and preserve (i.e., curate!) the bulk of born-digital content (the End of Term crawl notwithstanding). Regardless of what minor steps GPO is taking, the results are, at best, insignificant when compared to the scale of the problem. What is needed is a recognition of the huge gap in digital preservation and a specific plan for developing active strategies to address it. Waiting for agencies to deposit with GPO doesn’t work. Simply advertising GPO’s publishing services is not enough. GPO needs new strategies.
The two most important aspects of user services are “discovery” (providing tools that enable users to find the information they need) and “usability” (providing tools that enable users to use the content they discover). The two approaches GPO uses for discoverability (catalog records in the Catalog of Government Publications and a hierarchical presentation of agencies and publication types and dates in govinfo.gov) are woefully incomplete in the 21st century. One resembles a legacy card catalog and the other resembles a 1990s Yahoo!-like directory interface. Each has some utility, but they are not sufficient. GPO needs to work with FDLP libraries to develop new user-centric tools for discovery.
As for usability, GPO’s approach is still very document-centric, being designed to deliver one document at a time for reading. It should be evident to all that there are many more potential uses of government information than simply retrieving one document at a time. 21st-century users are more sophisticated and have more use-case needs than that. We believe that GPO should continue to provide the services it does through Govinfo, but it should supplement that work by developing programs, tools, and support for FDLs to develop new uses built on the specific use-case needs of Designated Communities of users and potential users. Doing that will have the additional benefit of helping drive collection development and preservation.
GPO already has policies in place that can be read to include the broader vision we offer here. For example, GPO’s Draft Strategic Plan Fiscal Years 2023 Through 2027, while explicitly mentioning digitizing paper collections, also includes the vague phrase “focus on adding new collections and filling the gaps in existing collections.” Although, in context, it seems to imply filling in gaps in paper/digitized collections, it could be taken as a broader mission to address the real preservation gap of new, born-digital content. Nevertheless, vague phrases are not enough. Policies and projects need to specifically address the massive and growing born-digital preservation gap with action plans.
Given our assumptions and priorities, here are some suggestions for steps GPO can take now.
- Publicly and explicitly acknowledge and publicize the born-digital preservation gap;
- Develop an aggressive, active strategy for gaining agreements with executive branch agencies to deposit their born-digital content with GPO. Work with Congress to provide funding to agencies for providing those deposits and to GPO for receiving and processing them;
- Develop an aggressive, active strategy to promote and enforce existing OMB A-130 policy (“making Government publications available to depository libraries through the Government Publishing Office regardless of format”) for depositing executive branch content with GPO. The policy exists, but OMB does nothing to enforce it. The strategy could include working with NARA, the Federal CIO Council, and the Federal Web Archiving Group (consisting of GPO, NARA, Library of Congress, the National Library of Medicine, the Smithsonian Institution, Department of Education, and Department of Health and Human Services) to support OMB enforcement of that policy and set new policies and regulations for preserving federal agency publications and data;
- Develop an aggressive, active strategy for the development of new tools for harvesting and processing Public Information and metadata, and for the processing of that harvested data for the automated generation of rich metadata for the description, management, preservation, discovery, delivery, and use of harvested data and metadata. Develop tools, workflows, and policies to help FDLs preserve born-digital government information. This can include identifying and acquiring unreported documents, new methods of selection to build digital collections, metadata creation, and the development of digital repositories connected by APIs and a robust system of stable Permanent Identifiers;
- Develop a plan for active, continual harvesting of born-digital content that remains undeposited by agencies with GPO. Develop new strategies for targeting content by document and file-type, use-case, and source. Develop workflows to allow FDLs and other libraries and harvesters to feed their web archiving activities into the National Collection through ingest or cooperative metadata creation, or both;
- Develop next-generation tools and methods for extracting digital objects and metadata from existing Web archives for inclusion in the National Bibliography;
- Develop an active plan for obtaining federal funding to fund libraries, agencies, and GPO to do this ongoing and critical work.
Now THAT’s an “all-digital FDLP”!
James A. Jacobs, University of California San Diego
James R. Jacobs, Stanford University
We want to contribute a couple of facts and context about the born-digital “National Collection” to help inform the discussions on the priorities of GPO and FDLP libraries at the upcoming spring 2022 Depository Library Conference as well as discussions surrounding the work of the all-digital FDLP task force.
We believe these facts lead to an unavoidable conclusion: GPO and FDLP need to explicitly make dealing with unpreserved born-digital government information a strong priority.
Here are the facts.
Who produces born-digital government information?
We have been examining data from the 2020 End-of-Term crawl (EOT20). We found (not surprisingly) that, by far, the most prominent types of born-digital content on the web are web pages (HTML files) and PDF files. We counted only unique web pages and PDF files from the government web in EOT20 and found more than 126 million web pages and more than 2.8 million PDF files, for a total of more than 129 million born-digital items. More than 80% of that content is from the executive branch.
What is GPO preserving?
GOVINFO: There are roughly 2 million PDFs in Govinfo. These items are secure and preserved in GPO’s certified trusted digital repository. By our count, 74% of the born-digital PDF content in Govinfo is from the judicial branch, 24% from the legislative branch, and only 2% from the executive branch. In other words, GPO devotes almost 3/4 of its born-digital preservation space to the judiciary, which produces only about 2% of all born-digital government information. Conversely, GPO devotes only 2% of its born-digital preservation space to the executive branch, which produces more than 80% of born-digital government information.
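To make that mismatch concrete, here is a small arithmetic sketch. It uses only the rounded percentages quoted above; the variable and function names are ours, and the resulting ratios are illustrative rather than precise measurements (the executive branch share of production is "more than 80%," so its ratio is, if anything, lower than shown).

```python
# Share of born-digital PDFs held in Govinfo, by branch (percent, from the text).
govinfo_share = {"judicial": 74, "legislative": 24, "executive": 2}

# Share of all born-digital government information produced (percent, from the text).
# Only the judicial and executive figures are quoted, so only those are compared.
production_share = {"judicial": 2, "executive": 80}


def representation_ratio(branch):
    """Ratio of a branch's share of Govinfo to its share of production.

    A value above 1 means the branch is over-represented in Govinfo
    relative to what it publishes; below 1 means under-represented.
    """
    return govinfo_share[branch] / production_share[branch]


for branch in ("judicial", "executive"):
    print(f"{branch}: {representation_ratio(branch):.3f}x")
```

On these rounded figures, the judiciary is represented in Govinfo at roughly 37 times its share of production, while the executive branch is represented at about one-fortieth of its share.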
FDLP-WA: The FDLP Web Archive on the Internet Archive’s Archive-It servers had 211 “collections” or “websites” when we counted earlier this year. Most of the content of the FDLP-WA is from the executive branch (by our count, it includes only three congressional agencies and one judicial agency). GPO describes its web harvesting as targeted at small websites. By our count, using the EOT20 data, there are 23,666 “small” government websites and altogether they contain only 0.06% of the Public Information posted on the government web. By contrast, 99% of Public Information on the government web is hosted by 1,882 “large” websites, none of which GPO is targeting.
GPO also stores some copies of some cataloged web-based content on its permanent.fdlp.gov server. We do not have exact figures on the quantity of content stored, but we do know that, on average, GPO catalogs just over 19,000 titles a year. As a percentage of just the PDFs on the government web in 2020, that is less than 1% per year.
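The "less than 1%" figure follows directly from the two numbers above. As a quick check, using the rounded counts from the text (the variable names are ours, and both inputs are approximations, so the result is only a rough rate):

```python
# Rounded figures quoted in the text.
titles_cataloged_per_year = 19_000   # "just over 19,000 titles a year"
eot20_pdf_count = 2_800_000          # "more than 2.8 million PDF files" in EOT20

# Annual cataloging as a percentage of the PDFs found on the government web in 2020.
rate_percent = titles_cataloged_per_year / eot20_pdf_count * 100

print(f"{rate_percent:.2f}% per year")  # prints "0.68% per year"
```

Note that this compares cataloged titles only against PDFs; measured against the full 129 million born-digital items, the percentage would be far smaller.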
GPO has a few “digital access” partnerships (NASA, NLM, GAO and a couple of others), but there is only one digital preservation stewardship agreement: with University of North Texas (UNT) libraries (check out the difference between a “digital access partner” and a “digital preservation steward” here).
Although we do not have data on how quickly content on the web is altered or removed, one study determined that 83% of the PDF files present in the 2008 EOT crawl were missing in the 2012 EOT crawl.
GPO is doing a good (though not comprehensive) job of preserving born-digital content from the judicial and legislative branches but, by our rough estimate, this accounts for only about 15% of born-digital government information.
GPO is preserving very, very little of the born-digital content of the executive branch, which is where about 80% of born-digital publishing is being done.
To ensure the preservation of this executive branch born-digital government information, GPO needs an active program to acquire and preserve it. The Depository Library Council (DLC) should create a strong statement recognizing this huge gap in digital preservation and recommending that GPO prioritize developing plans for addressing it.
James A. Jacobs, University of California San Diego
James R. Jacobs, Stanford University
The Government Publishing Office (GPO) recently released its updated document entitled GPO’s System of Online Access: Collection Development Plan (here are the 2016 and 2018 Plans for comparison) which is “revised annually to reflect content added to govinfo in the preceding fiscal year, in-process titles, and current priorities.” The Plan explains GPO’s designated communities for govinfo, the broad content areas that fall within scope of govinfo, and the various codes — basically Title 44 of the US Code and Superintendent of Documents policies (SODs) — which undergird GPO’s collection development activities. While there is no mention in this document of the “National Collection”, it describes the three major pillars of GPO’s permanent public access efforts as govinfo, the FDLP, and the Cataloging & Indexing program (which produces the bibliographic records for the Catalog of Government Publications (CGP)).
The central part of the Plan is where GPO defines the govinfo collection depth level — defined in Appendix A of the Plan as collection levels modified from the Research Libraries Group (RLG) Conspectus collection depth levels and going from Comprehensive, Research, Study or Instructional Support, Basic, Minimal, to Out of Scope — of the various public information products of the legislative, executive, and judicial branches of the US government.
Twitter and newspapers are buzzing with complaints about widespread problems with access to government information and data (see, for example, Wall Street Journal (paywall 😐), ZDNet News, Pew Center, Washington Post, Scientific American, TheVerge, and FedScoop, to name but a few).
Maybe when/if the government opens again, we should scrape the NIST and CSRC websites, put all those publications somewhere public. It’s worrying that *every single US cryptography standard* is now unavailable to practitioners.
— Matthew Green (@matthew_d_green) January 12, 2019
Matthew Green, a professor at Johns Hopkins, said “It’s worrying that every single US cryptography standard is now unavailable to practitioners.” He was responding to the fact that he could not get the documents he needed from the National Institute of Standards and Technology (NIST) or its branch, the Computer Security Resource Center (CSRC). The government shutdown is the direct cause of these problems.
Others who noticed the same problem started chiming in to the discussion Green started, noting that they couldn’t find the standards they needed in Google’s cache or the Wayback Machine, either. Someone else suggested that “Such documents should be distributed to multiple free and public repositories” and said that these documents are “Too important to have subject to a single point of failure.” Someone else said that he downloads personal copies of the documents he needs every month, but had missed one that he uses “somewhat often.” One lone voice wondered about “Federal Depository Libraries, of which I believe there is at least one in every state.” (James responded to that one, letting people know about the FDLP and End of Term crawl!)
There are at least two reasons why users cannot get the documents they need from government servers during the shutdown. In some cases, agencies have apparently shut off access to their documents. (This is the case for both NIST and CSRC.) In other cases, the security certificates of websites have expired — with no agency employees to renew them! — leaving whole websites either insecure or unavailable or both.
Regardless of who you (or your user communities) blame for the shutdown itself, this loss of access was entirely foreseeable and avoidable. It was foreseeable because it has happened before. It was avoidable because libraries can select, acquire, organize, and preserve these documents and provide access to them and services for them whether the government is open or shut down.
Some libraries probably do have some of these documents. But too many libraries have chosen to adopt a new model of “services without collections.” GPO proudly promotes this model as “All or Mostly Online Federal Depository Libraries.” GPO itself is affected by this model. Almost 20% of the PURLs in CGP point to content on non-GPO government servers. So, even though GPO’s govinfo database and Catalog of Government Publications (CGP) may still be up and running, during the shutdown GPO cannot ensure that all its “Permanent URLs” (PURLs) will work.
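A library could estimate its own exposure to this problem in much the same way. The following is a minimal sketch, assuming you already have a list of the URLs that your PURLs or catalog links resolve to. The list of GPO-controlled domains and the sample URLs are our own illustrative assumptions, not an official list or real catalog data, and this is not how the 20% figure above was computed.

```python
from urllib.parse import urlsplit

# Assumption: domains treated as GPO-controlled for this illustration.
GPO_DOMAINS = ("gpo.gov", "govinfo.gov", "fdlp.gov")


def is_gpo_hosted(url):
    """Return True if the URL's host is one of GPO_DOMAINS or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in GPO_DOMAINS)


def share_outside_gpo(target_urls):
    """Fraction of target URLs that point to servers GPO does not control."""
    if not target_urls:
        return 0.0
    outside = sum(1 for url in target_urls if not is_gpo_hosted(url))
    return outside / len(target_urls)


# Hypothetical PURL targets, made up for this example.
sample_targets = [
    "https://www.govinfo.gov/example/item-1",
    "https://www.govinfo.gov/example/item-2",
    "https://permanent.fdlp.gov/example/report.pdf",
    "https://www.gpo.gov/example/page",
    "https://csrc.nist.gov/example/publication",
]

share = share_outside_gpo(sample_targets)
print(f"{share:.0%} of sample targets are outside GPO's control")
```

Every link counted as "outside" is one that can fail whenever the hosting agency shuts down, reorganizes its site, or removes the file.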
This no-collections model means that libraries are too often choosing simply to point to collections over which they have no control, and we’ve known what happens “When we depend on pointing instead of collecting” for quite some time. When those collections go offline and users lose access, users begin to wonder why someone hasn’t foreseen this problem and put “all those publications somewhere public.”
The gap between what libraries could do to prevent the kind of loss of access the shutdown is causing and what they are actually doing is particularly glaring in the area of government information. Most federal government information is in the public domain and is available without technical or copyright restrictions or fees. There is nothing preventing libraries from building collections to support users except the will to do so.
Many library administrators are eager to proclaim that pointing to collections they do not control is the new role of libraries in the digital age. Those who promote this new model of services without collections then struggle to demonstrate the value of libraries to their user communities. This is difficult when those communities go directly to collections of information, bypassing libraries and, perhaps, wondering why libraries still exist at all.
This represents a failure by libraries to fulfill their role in society and in the digital information ecosystem.
When the shutdown ends, access will, presumably, be restored. In the wake of the many other problems caused by the shutdown (many of them immediate and even dangerous), this temporary loss of access to some government information may not seem pressing. But librarians should see this as another wake-up call. Hopefully, Depository Library Council’s recent recommendation regarding digital deposit will answer that call. Libraries should not focus on bemoaning the short-term problem. We should, instead, focus on making the next crisis impossible. We can do this by focusing on the long-term problems of digital collection development, preservation and access. The current crisis may be temporary, but when we rely only on the government to provide access to these important resources, access will remain vulnerable to the next crisis or misstep or conscious decision to cut off access. We need to recognize that government agencies do not always have the same priorities as our users.
Today, libraries cannot ensure long-term access to government information because they do not control it. But, if libraries select, acquire, organize, and preserve the government information that is vital to their user communities, then they can ensure long-term access to it. You will not have to persuade your users of the value of your library when you do what they value.
James A. Jacobs, University of California San Diego
James R. Jacobs, Stanford University
[UPDATE 1:30pm 09122018: The bill going forward in the Senate is S. 2944, NOT 2673. And S.2944 includes reference to the depository library program! I’ve updated the link below to the correct Senate bill. JRJ]
Heads up! There’s a bill at the beginning of the legislative process called “Preventing Additional Printing of Electronic Records Act of 2018” or the PAPER Act of 2018. Don’t you just love how Congress has to acronymize their bill titles?! This bill seeks to limit the printing of the Congressional Record, one of our most important Congressional publications, the official record of the proceedings and debates of the US Congress. It’s important to the Federal Depository Library Program to keep publishing the CR in paper for research utility and preservation purposes.
The House version mentions the FDLP, but the Senate version does not:
(d) Depository libraries
The Director of the Government Publishing Office shall furnish to the Superintendent of Documents as many daily and bound copies of the Congressional Record as may be required for distribution to depository libraries.
This bill is at the very beginning of the process, so it’s not time to get nervous. But the depository community ought to keep an eye on this bill in case it gathers momentum in the House and/or Senate.