Category Archives: Commentary
FGI’s recommendations for creating the “all-digital FDLP”
As a follow-up to our recent post, “Some facts about the born-digital ‘National Collection,’” we want to suggest some specific actions that GPO and FDLP libraries can take to do a better job of collecting and preserving born-digital content for the “National Collection.”
For context, our starting assumption is that GPO and FDLP have two connected priorities: preservation and user services. The two go hand-in-hand. To be “preserved,” content must be discoverable, deliverable, readable, understandable, and usable by people. Broadly speaking, this can be understood as “user services.” Addressing these priorities at scale will require innovative, collaborative approaches. Old solutions that do not scale will not work.
With regard to preservation, digital objects have to be under sufficient control of the preservationist to be preservable. As we pointed out in our previous post, the vast bulk of born-digital government Public Information is not being preserved by GPO or FDLP libraries. Worse, GPO and FDLP have no active plan to address that gap in preservation. While there are lots of projects to digitize historic paper documents in FDLs, there is no active project to curate the bulk of born-digital content, that is, to acquire, describe, store, manage, and preserve it (the End of Term crawl notwithstanding). Regardless of what minor steps GPO is taking, the results are, at best, insignificant when compared to the scale of the problem. What is needed is a recognition of the huge gap in digital preservation and a specific plan for developing active strategies to address it. Waiting for agencies to deposit with GPO doesn’t work. Simply advertising GPO’s publishing services is not enough. GPO needs new strategies.
The two most important aspects of user services are “discovery” (providing tools that enable users to find the information they need) and “usability” (providing tools that enable users to use the content they discover). The two approaches GPO uses for discoverability (catalog records in the Catalog of Government Publications and a hierarchical presentation of agencies and publication types and dates in govinfo.gov) are woefully incomplete in the 21st century. One resembles a legacy card catalog and the other resembles a 1990s Yahoo!-like directory interface. Each has some utility, but they are not sufficient. GPO needs to work with FDLP libraries to develop new user-centric tools for discovery.
As for usability, GPO’s approach is still very document-centric, being designed to deliver one document at a time for reading. It should be evident to all that there are many more potential uses of government information than simply retrieving one document at a time. 21st century users are more sophisticated and have more use-case needs than that. We believe that GPO should continue to provide the services it does through Govinfo, but it should supplement that work by developing programs, tools, and support for FDLs to develop new uses built on the specific use-case needs of Designated Communities of users — and potential users. Doing that will have the additional benefit of helping drive collection development — and preservation.
GPO already has policies in place that can be read to include the broader vision we offer here. For example, GPO’s Draft Strategic Plan Fiscal Years 2023 Through 2027, while explicitly mentioning digitizing paper collections, also includes the vague phrase “focus on adding new collections and filling the gaps in existing collections.” Although, in context, it seems to imply filling in gaps of paper/digitized collections, it could be taken as a broader mission to address the real preservation gap of new, born-digital content. Nevertheless, vague phrases are not enough. Policies and projects need to specifically address the massive and growing born-digital preservation gap with action plans.
Given our assumptions and priorities, here are some suggestions for steps GPO can take now.
- Publicly and explicitly acknowledge and publicize the born-digital preservation gap;
- Develop an aggressive, active strategy for gaining agreements with executive branch agencies to deposit their born-digital content with GPO. Work with Congress to provide funding to agencies for providing those deposits and to GPO for receiving and processing them;
- Develop an aggressive, active strategy to promote and enforce existing OMB A-130 policy (“making Government publications available to depository libraries through the Government Publishing Office regardless of format”) for depositing executive branch content with GPO. The policy exists, but OMB does nothing to enforce it. The strategy could include working with NARA, the Federal CIO Council, and the Federal Web Archiving Group (consisting of GPO, NARA, Library of Congress, the National Library of Medicine, the Smithsonian Institution, Department of Education, and Department of Health and Human Services) to support OMB enforcement of that policy and set new policies and regulations for preserving federal agency publications and data;
- Develop an aggressive, active strategy for the development of new tools for harvesting and processing Public Information and metadata, and for the processing of that harvested data for the automated generation of rich metadata for the description, management, preservation, discovery, delivery, and use of harvested data and metadata. Develop tools, workflows, and policies to help FDLs preserve born-digital government information. This can include identifying and acquiring unreported documents, new methods of selection to build digital collections, metadata creation, and the development of digital repositories connected by APIs and a robust system of stable Permanent Identifiers;
- Develop a plan for active, continual harvesting of born-digital content that remains undeposited by agencies with GPO. Develop new strategies for targeting content by document and file-type, use-case, and source. Develop workflows to allow FDLs and other libraries and harvesters to feed their web archiving activities into the National Collection through ingest or cooperative metadata creation, or both;
- Develop next-generation tools and methods for extracting digital objects and metadata from existing Web archives for inclusion in the National Bibliography;
- Develop an active plan for securing federal funding so that libraries, agencies, and GPO can do this ongoing and critical work.
Now THAT’s an “all-digital FDLP”!
Authors
James A. Jacobs, University of California San Diego
James R. Jacobs, Stanford University
Some facts about the born-digital “National Collection”
We want to contribute a couple of facts and context about the born-digital “National Collection” to help inform the discussions on the priorities of GPO and FDLP libraries at the upcoming spring 2022 Depository Library Conference as well as discussions surrounding the work of the all-digital FDLP task force.
We believe these facts lead to an unavoidable conclusion: GPO and FDLP need to explicitly state a strong priority of how to deal with unpreserved born-digital government information.
Here are the facts.
Who produces born-digital government information?
We have been examining data from the 2020 End-of-Term crawl. We found (not surprisingly) that, by far, the most prominent types of born-digital content on the web are web pages (HTML files) and PDF files. We counted just unique web pages and PDF files from the government web in EOT20 and found more than 126 million web pages and more than 2.8 million PDF files for a total of more than 129 million born-digital items. More than 80% of that content is from the executive branch.
What is GPO preserving?
GOVINFO: There are roughly 2 million PDFs in Govinfo. These items are secure and preserved in GPO’s certified trusted digital repository. By our count, 74% of the born-digital PDF content in Govinfo is from the judicial branch, 24% from the legislative branch, and only 2% from the executive branch. In other words, GPO devotes almost 3/4 of its born-digital preservation space to the judiciary, which produces only about 2% of all born-digital government information. Conversely, GPO devotes only 2% of its born-digital preservation space to the executive branch, which produces more than 80% of born-digital government information.
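The mismatch described above can be expressed as a simple ratio. The following is a minimal sketch, assuming only the rounded percentages quoted in this paragraph; the names `govinfo_pdf_share`, `production_share`, and `representation_ratio` are ours for illustration, and the two measures (share of PDFs in Govinfo versus share of all born-digital output) are not strictly like-for-like, so treat the result as an order-of-magnitude indicator.

```python
# Rounded percentages quoted in the paragraph above (approximations, not exact counts).
govinfo_pdf_share = {"judicial": 74, "executive": 2}   # share of born-digital PDFs in Govinfo
production_share = {"judicial": 2, "executive": 80}    # approximate share of all born-digital output


def representation_ratio(preserved_pct, produced_pct):
    """Preserved share divided by produced share; 1.0 would mean proportional coverage."""
    return preserved_pct / produced_pct


for branch in govinfo_pdf_share:
    ratio = representation_ratio(govinfo_pdf_share[branch], production_share[branch])
    print(f"{branch}: {ratio:.3f}")
```

On these figures the judiciary is represented at roughly 37 times its share of output, while the executive branch is represented at roughly one fortieth of its share. Because the post says the executive branch produces "more than 80%", the executive ratio is, if anything, an overestimate.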
FDLP-WA: The FDLP Web Archive on the Internet Archive’s Archive-It servers had 211 “collections” or “websites” when we counted earlier this year. Most of the content of the FDLP-WA is from the executive branch (by our count, it includes only 3 congressional agencies and one judicial agency). GPO describes its web harvesting as targeted at small websites. By our count, using the EOT20 data, there are 23,666 “small” government websites and altogether they contain only 0.06% of the Public Information posted on the government web. By contrast, 99% of Public Information on the government web is hosted by 1,882 “large” websites, none of which GPO is targeting.
GPO also stores some copies of some cataloged web-based content on its permanent.fdlp.gov server. We do not have exact figures on the quantity of content stored, but we do know that, on average, GPO catalogs just over 19,000 titles a year. As a percentage of just the PDFs on the government web in 2020, that is less than 1% per year.
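As a rough check on the "less than 1% per year" figure, here is a minimal sketch using only the rounded numbers quoted in this post. The variable and function names are ours for illustration, and both inputs are lower-bound approximations ("more than 2.8 million" and "just over 19,000").

```python
# Rounded figures quoted in the post (approximations, not exact counts).
pdfs_on_gov_web_2020 = 2_800_000    # "more than 2.8 million PDF files" in EOT20
titles_cataloged_per_year = 19_000  # "just over 19,000 titles a year"


def share_percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


catalog_rate = share_percent(titles_cataloged_per_year, pdfs_on_gov_web_2020)
print(f"{catalog_rate:.2f}% of EOT20 PDFs cataloged per year")  # prints 0.68%
```

Even if every cataloged title were a PDF from the 2020 government web, a year of cataloging would cover well under one percent of the PDFs alone, before counting the more than 126 million web pages.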
GPO has a few “digital access” partnerships (NASA, NLM, GAO and a couple of others), but there’s only 1 digital preservation stewardship agreement: with University of North Texas (UNT) libraries (check out the difference between a “digital access partner” and a “digital preservation steward” here).
Although we do not have data on how quickly content on the web is altered or removed, one study determined that 83% of the PDF files present in the 2008 EOT crawl were missing in the 2012 EOT crawl.
Conclusions
- GPO is doing a good (though not comprehensive) job of preserving born-digital content from the judicial and legislative branches but, by our rough estimate, this accounts for only about 15% of born-digital government information.
- GPO is preserving very, very little of the born-digital content of the executive branch, which is where about 80% of born-digital publishing is being done.
- To ensure the preservation of this executive branch born-digital government information, GPO needs an active program to acquire and preserve it. Depository Library Council (DLC) should create a strong statement recognizing this huge gap in digital preservation and recommending that GPO prioritize developing plans for addressing it.
Authors
James A. Jacobs, University of California San Diego
James R. Jacobs, Stanford University
FGI comment on GPO RFC re Regional Online Selections Draft Policy
Last fall, GPO announced a new Superintendent of Documents (SOD) draft policy statement “Regional Depository Libraries Online Selections.” GPO surveyed regional depository libraries and released the results of that survey in February, 2021. They’re also asking the wider library community and interested parties for comment DUE MAY 16, 2021.
FGI has submitted a comment regarding this proposed policy change. Below is the text of our comment. In short, this policy change could negatively impact the preservation of and long-term access to the National Collection. Our suggestion was to change the policy and add a “digital deposit” requirement:
“Regional depository libraries may select “online” as a format IF AND ONLY IF regionals participate in a “digital deposit” program and agree to receive, host, and provide access to digital FDLP publications.”
We hope others will submit comments BY MAY 16, 2021!
Thank you for requesting comments from the Federal Depository Library community for this proposed major policy change for regional library collection management.
Suggested edit of draft policy:
“Regional depository libraries may select “online” as a format IF AND ONLY IF regionals participate in a “digital deposit” program and agree to receive, host, and provide access to digital FDLP publications.”
We at FGI have 2 concerns regarding this proposed policy change.
The first concern has to do with the current practice described in the background section of the proposed SOD:
“…they [regionals] no longer are receiving all new and revised tangible versions for all titles through the FDLP. Nor are regional depository libraries necessarily retaining a printed or microfacsimile version of what they receive.”
According to 44 U.S. Code § 1912, Regional libraries are required to receive and “retain at least one copy of all Government publications either in printed or microfacsimile form.” How many regional libraries are no longer following the requirements of the statute? What is GPO doing to assure that the letter and spirit of Title 44 are being followed by regional libraries? Rather than codifying this bad behavior, GPO should be doing more to help regionals fulfill the requirements of the statute and assure the long-term viability of the FDLP for all of the libraries and the wider public that rely on regionals. Any proposed SOD should seek to correct this unfortunate situation.
Our second concern has to do with the proposed policy change itself.
“Regional depository libraries may select “online” as a format, without having to make a corresponding tangible selection, for titles or series accessible through GPO’s system of online access, a trusted digital repository, or from official digital preservation steward partners.”
One of the primary functions of regional libraries is to participate in the long-term preservation of US government publications. Indeed, retention (i.e., preservation) is written into 44 U.S. Code § 1912 itself. Selective libraries across the country rely heavily on this regional requirement to manage their FDLP collections.
The existing law is clear: “In addition to fulfilling the requirements for depository libraries” regional depositories must “retain at least one copy of all Government publications either in printed or microfacsimile form (except those authorized to be discarded by the Superintendent of Documents).” The only other mention in the law of the Superintendent being able to authorize discarding is for “superseded publications or those issued later in bound form which may be discarded as authorized by the Superintendent of Documents” (§ 1911).
As the Senate Report on the bill stated, “Complete document collections would thus be accessible to all the regular depositories within the State, enabling them to be more selective in the items they would request” (S. Rep. 1587, 87th Cong., 2d Sess. 1962). The legislative history is clear that the establishment of Regional Depositories was designed both to allow selectives to discard publications after five years and to ensure that all publications would be available from a Regional.
The law has not changed and this policy would contradict both the letter and intent of the law.
Although GPO continues to promulgate policies that wrongly equate “online access” with “deposit,” no change in the law allows this. We welcome online access and the efforts GPO is making to ensure preservation of digital government information, but, as GPO’s draft policy says, the policy is rooted in the past, in choices made twenty-five years ago. It would be wiser and more sustainable to base new decisions in the current and developing capabilities of FDLP libraries rather than on the past. We suggest that there is a better path that conforms to the existing law, enhances preservation, and improves access and use of digital government information. Our suggested edit looks to a future of GPO and FDLP libraries collaborating together to preserve and give access to the National Collection.
We suggest that, until Title 44 is changed, GPO should choose a simple and effective alternative that will accomplish more than GPO’s proposal.
We recommend a policy of allowing a regional depository to choose digital copies of government publications (instead of printed or microfacsimile) IF AND ONLY IF it agrees to actually receive, host, and provide access to those digital files. The SOD could do this by, for example, making regional selection and deposit format-agnostic or adding digital formats to the list of currently anachronistic “tangible” formats.
Our suggestion begins by respecting the existing law, which mandates that multiple copies of government publications be held for both preservation and access by libraries outside the government. For “access” our suggestion will allow libraries to provide digital services for specific designated communities. For preservation, it ensures against intentional or unintentional loss of access, corruption of content, or outright loss of information in the government’s care.
Our suggestion is also compatible with the work of the Digital Deposit Working Group of the Depository Library Council (on which James is participating), which is currently working on recommendations for digital deposit based on FDLP community feedback which would directly contradict GPO’s proposed regional policy. Our proposal looks to a future of digital deposit. Indeed, ten regional libraries are already receiving and preserving all content published in govinfo.gov through the LOCKSS-USDOCS program. Our proposal provides GPO the opportunity to create a policy that will lay a solid foundation for the digital FDLP, increase participation by FDLP libraries, and enhance services for the National Collection.
It has long been established that the preservation of born-digital government information is a challenging endeavor. It also should be clear that a one-size-fits-all model of “access” without digital services is inadequate in the digital age. GPO cannot and should not go it alone. GPO needs multiple partners to participate in digital preservation and in the provision of digital services.
GPO’s proposed SOD, rather than strengthening the long-term viability of the digital FDLP, erodes its very foundation by literally erasing the critical, legislatively-required job for which regionals were created. Any library or individual can do what the draft SOD suggests (point to govinfo.gov), but FDLP libraries could do so much more. They can complement what GPO does by providing official, legislatively-mandated, redundant preservation, and by providing enhanced digital services targeted to specific OAIS designated communities.
FGI’s guide to “unreported” FDLP publications
- Introduction
- Four easy steps to reporting “unreported” publications
- Strategies for finding “unreported” documents (more tips and tricks!)
- Historically “Unreported” materials of particular interest
- History of the problem
- Appendix: how to fill out the askGPO form
“Unreported” publications (which were, until recently, called “fugitive” publications) are those that are within scope of the Federal Depository Library Program (FDLP) but for various reasons have slipped through the cracks and not been collected and cataloged by the Government Publishing Office (GPO), distributed to FDLP libraries, or included in the “National Collection” (See a partial list of historically “unreported” publications below).
We here at FGI consider “unreported” publications to be the paramount problem facing the FDLP today. FDLP librarians, with their critical information skills and expertise about the structure and publishing activities of the federal government, are a vital piece of the solution to this vexing problem. The National Collection is at the core of what FDLP libraries have done for the last 200+ years, so “unreported” publications erode that very foundation. During the spring 2021 virtual Depository Library Conference, I challenged every FDLP librarian to search for, find, and report to GPO five “unreported” documents every month. I’d like to reiterate that challenge here on FGI. If every one of the 1100+ FDLP librarians were to find and report 5 documents each month, through this iterative process we’d soon put a dent in this existential “unreported” documents problem.
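To put a number on the challenge, here is a back-of-the-envelope sketch using only the figures in the paragraph above. The function name `yearly_reports` is ours for illustration, and 1,100 is a lower bound on the number of participating librarians.

```python
# Figures from the challenge above; 1100 is a lower bound ("1100+").
librarians = 1100
reports_per_month = 5
months_per_year = 12


def yearly_reports(people, per_month, months=12):
    """Total documents reported in a year if everyone meets the monthly target."""
    return people * per_month * months


total = yearly_reports(librarians, reports_per_month, months_per_year)
print(total)  # prints 66000
```

That is tens of thousands of reported documents per year, assuming full participation, which is the same order of magnitude as the "thousands, if not hundreds of thousands" of documents this guide says go uncollected annually.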
Four easy steps to reporting “unreported” publications
To that end, we’d like to share some simple steps for how to find and report “unreported” documents to GPO:
- Find an interesting federal document or information product like a report, data set, video, or slide deck (see the “strategies” section below for tips and tricks for finding documents);
- Search the Catalog of Government Publications (CGP) to see if GPO has cataloged it;
- If it’s NOT in the CGP, go to askGPO and fill in the “unreported document” form. See appendix for how to fill out the askGPO form;
- Rinse and repeat!
Strategies for finding “unreported” documents (more tips and tricks!)
- Read the news with an eye toward those news items and sources which cover federal policies; (See for example, https://federalnewsnetwork.com, https://www.govexec.com, https://www.washingtonpost.com, etc.)
- Set up Google search and news alerts for publications from your favorite agency(ies), especially the Inspector General offices of those agencies (Inspector General reports are an especially critical and long-standing type of “unreported” document! Only a portion are even posted publicly on Oversight.gov);
- Find and report documents you use to answer reference/research consultations;
- Bookmark and visit the publications and/or press release pages of your favorite agency(ies);
- Follow on social media your favorite agency(ies), heads of agencies, your state’s Congressional delegation, known people within the executive branch, and Federal watchdog groups. New publications are often announced on government social media accounts.
Historically “Unreported” materials of particular interest
- Agency Inspector General reports;
- Executive branch agency publications. See the LostDocs project for examples of documents that have been reported to GPO;
- Communication/Letters from members of Congress to executive branch agencies;
- Communication/Letters from federal officials to a Presidential administration;
- Public datasets;
- Congressional Research Service (CRS) reports* (*CRS reports were, until 2018, considered “privileged communication” between Congress and the Library of Congress and were therefore never released via the FDLP. Here’s the back story).
History of the problem
Since 1813, when the FDLP started, there have always been “unreported” documents which slipped through the cracks and were lost to the sands of time (until very recently, these were termed “fugitive” documents) [Footnote 1]. This problem has grown exponentially as executive agencies’ publishing operations have exploded, now that they can easily and freely distribute content online, and very few if any of them follow Title 44 regulations and send their documents to GPO as they are required to by law. Only a minuscule fraction of born-digital executive branch information is cataloged in the Catalog of Government Publications (CGP) or makes it into the “National Collection.” This means that every year thousands, if not hundreds of thousands, of Federal documents, datasets, maps, and other born-digital materials [Footnote 2] are never preserved and are lost to the fog of history as websites are updated and historical content removed [Footnote 3].
Depository librarians reporting found publications are a critical part of a holistic solution to the “unreported” documents problem. By identifying federal information resources that are important to their local constituents, librarians are making sure that these documents will be cataloged, captured, and made accessible to a wider audience. Reporting documents also adds to a National Collection pipeline for long-term access and helps to make sure that what is collected and preserved reflects the needs and interests of the wide-ranging communities and the public which libraries serve.
Many hands make light work. Won’t you join in the effort? Please contact us if you have questions or comments at freegovinfo AT gmail DOT com.
Footnotes
1. See “‘Issued for Gratuitous Distribution:’ The History of Fugitive Documents and the FDLP.” James R. Jacobs. Article in special issue of Against the Grain: “Ensuring Access to Government Information”, 29(6) December 2017/January 2018.
2. My back of the napkin estimate is that well over 1/2 of the “National Collection” is unreported! The executive branch is far and away the largest portion of the National Collection, and is almost completely “unreported.” See slide 5 of my 2018 Canadian Govinfo presentation for some context. Jim Jacobs’ chart cites the 2008 End of Term crawl for context on how many born-digital government publications are on the Web. The 2016 End of Term crawl nearly doubled the 2008 crawl and went from 160 million URLs to 310 million URLs harvested. I expect the 2020 End of Term crawl happening at the time of this post’s publication to far surpass 310 million!
3. FGI has written about “link rot,” “content drift,” and other issues which make it difficult to collect and preserve born-digital information.
Appendix: how to fill out the askGPO form
The askGPO form can be used for single documents or for reporting multiple documents, for example, those listed on an agency’s publications index page. See below for the steps to filling out the askGPO form. If a site is extremely large and/or complex (e.g., the Office of the Director of National Intelligence (ODNI) reports site), send the URL and description of the site to the GPO Web archiving team at FDLPwebarchiving AT gpo DOT gov.
- Log in to ask.gpo.gov (This will automatically fill in your contact information and depository library number in the form if you have used the system before);
- Click on “Federal Depository Library Program”;
- Select category “Fugitive Publications” (which will soon be changed to “unreported publications”);
- Choose single publication or multiple publications (there’s an Excel template if you prefer to collect multiple documents and submit them all at once!);
- Enter title, publishing agency, publication URL, format (other fields are not required). Use your best guess if you are not sure;
- Upload PDF file as attachment (not required but helpful for GPO staff to have the document “in hand” when cataloging);
- Add any additional context that you think may aid GPO staff;
- Do the reCAPTCHA “I’m not a robot” test;
- Submit the document(s)!
FGI’s comments and recommendations for the GPO draft report of the task force on an “all-digital” FDLP
October 12, 2022
[editor’s note 10/28/2022: we updated the text below about 100% of govinfo being published digitally in order to clarify where we got that number and why we use the 100% number rather than the 97% born-digital that is most frequently cited.]
We want to thank GPO Director Halpern for calling a “Task Force on a Digital FDLP” and for all of the members of the task force for diligently working through the many thorny issues regarding the future of the Federal Depository Library Program (FDLP). Director Halpern has requested public comment on the draft report until October 14, 2022. We at FGI are submitting the following as our public comments.
1. The task force was asked to study “the feasibility of an all-digital FDLP.” The group was charged to define the scope of an all-digital depository program and recommend how to implement and operate it.
Although the task force working groups concluded that the FDLP can and should go “all-digital,” the draft report was also consistent in noting that “all digital” does not mean everything will be available only in digital formats (pp. 7, 10). The final report should emphasize this point and state clearly that print remains a viable format for some of our most important government publications as well as an important access method. It should also recommend exploring future print opportunities like “print and distribute on demand” as an option for depository libraries.
2. The draft report has lots of good ideas but we suggest that some clarifications and reorganization will bring the findings and recommendations of the different working groups into better focus.
We suggest that the final report should begin with a clear “problem statement” that the report will then address. We suggest that this should have two points:
(Note: We base this on the research we have done examining the contents of GPO’s Govinfo repository and 2020 End-of-Term crawl data. We found that the great bulk of new digital Public Information is produced by the executive branch (90% of all government PDFs, aka “publications,” on the government web are published by the executive branch), but only 2% of the born-digital PDFs in GPO’s Govinfo repository are from the executive branch. Meanwhile LC’s web harvesting is relying on GPO and NARA to take care of the executive branch [https://www.loc.gov/acq/devpol/webarchive.pdf] and, by law, NARA is treating all executive branch web content as “records,” of which only 1-3% are typically preserved. For some more details, see our post “Some facts about the born-digital ‘National Collection’”.)
These two points would put the Task Force’s recommendations into context. The primary focus of actions designed to ensure “permanent no-fee public access to digital content” must focus on ensuring the preservation of that content. No digital system can ensure “access” unless that system has control over and preserves the content it intends to make accessible.
3. To address the problem statement, we recommend that the report create long-term goals from which all recommendations would flow. It would be persuasive and most helpful if the Task Force provided explicit connections between each recommendation and one or more of the goals, showing how the success of each recommendation can be evaluated in terms of the goals.
4. We believe that the draft report minimizes the need to preserve the existing national print collection. It emphasizes digitization of paper documents for access and accepts and proposes even looser rules for the discarding of paper collections without adequate safeguards for the preservation of the information in those collections in either paper or digital formats. Digitizing for the sake of better access is a noble objective, but preserving the born-digital content that is currently NOT being curated and in danger of loss is a much more urgent matter than enhancing access to already well-preserved paper collections.
5. We suggest that the final recommendations and action items be grouped or labeled in categories that will clarify their purpose and scope. For example:
6. Earlier this year we created a list of some specific long-term strategies which may be of use to the Task Force: “FGI’s recommendations for creating the ‘all-digital FDLP’”.
By reorganizing and refocusing the report on what is truly important — preservation first, access built on preserved content — the report will be clearer about the current status of preservation and access and how GPO and FDLP can contribute solutions to existing gaps and weaknesses.
Authors
James A. Jacobs, University of California San Diego
James R. Jacobs, Stanford University