In anticipation of this week’s Depository Library Council meeting, FGI suggests a focus on the biggest challenges facing long-term preservation and access.
The scope of the challenges we face is large and clear. The quantity of government information that is “born-digital” each year is literally orders of magnitude greater than the quantity of government publications accumulated over the entire 200+ year history of the FDLP (Born-Digital U.S. Federal Government Information: Preservation and Access). Although the redundancy of copies of the historical FDLP paper collections ensures at least their passive preservation, repeated calls for discarding those collections endanger both the preservation of that content and access to it. Consequently, the inadequacy of bibliographic records for those collections now poses a significant threat to their long-term preservation and access. In this time of proliferation of government information, it is essential for the FDLP to have a clear understanding of exactly what information exists, what is being preserved, and who is accepting responsibility for long-term preservation of and free access to government information.
Congressional support of government information programs is at an all-time low. Over the last decade, appropriations bills have steadily decreased budgetary support for the Government Publishing Office (GPO). This year, a House bill proposes a 9% cut to GPO’s budget, a cut that would negatively affect the maintenance and development of FDsys. Additionally, Congress has proposed shuttering the National Technical Information Service (NTIS), and Congressional pressure resulted in taking NASA technical reports offline. Anti-government sentiment is so strong that it is difficult for agencies to reliably maintain even essential basic services (including GPO’s own PURL server). Congress has even shut down the entire government more than once, and some in Congress continue to threaten future shutdowns. While GPO is doing a good job of preserving in FDsys most born-digital official Congressional information, preservation of the digital information of the Judicial and Executive branches is haphazard, uncoordinated and fragile. GPO has (with the acquiescence of FDLP libraries) arrogated to itself the job of being solely responsible for preservation of born-digital government information. This has actually weakened the infrastructure of preservation by changing a system that relied on 1200 partners to a system that depends on a single government agency. In this context, a single budget cut would mean the loss of a huge quantity of digital government information, were it not for the innovative, active cooperation of a hardy band of FDLP libraries that participate in the LOCKSS U.S. Documents project. (This project only serves as a backup of the information in FDsys and does not, currently, provide any means of making that information accessible.)
In this time of fragile support for government action, it is more essential than ever to reverse the twenty-year-old model of centralization and return to a model of shared responsibility with the participation of as many non-profit, service-oriented libraries as possible.
It is time for specifics and time for libraries that claim to value permanent free access to government information to step up and take on new digital-library responsibilities. GPO’s proposed “National plan for access to US government information” and “Federal Information Preservation Network (FIPNet)” have, so far, been vague outlines with few specifics. We at FGI propose that FDLP librarians and GPO use this week’s virtual Depository Library Council (DLC) meeting to 1) clarify the existing state of government information; and 2) specify an agenda for what is needed in order to have a successful national library plan for a sustainable system of government information collection, preservation and provision. We propose that DLC use this meeting to flesh out and expand the parameters of the discussion and more fully describe what needs to be done by the FDLP community. The following is our own take on these two ideas.
Clarifying the state of things
The current state of identification of government information is fragmented and incomplete. GPO uses the Catalog of Government Publications (CGP) to meet its legal requirement to maintain an “electronic directory of Federal electronic information” (44 U.S.C. 4101). But the CGP is incomplete. It is complemented by the historic Monthly Catalog, the 1909 Checklist, and GPO’s “shelflist” project. GPO’s online digital collections (which include FDsys, the Federal Depository Library Program Web Archive, and the Federal Depository Library Program Electronic Collection (FDLP/EC) Archive) provide additional, but still incomplete, pieces of the bibliographic puzzle. HathiTrust’s government documents registry project promises to better define the breadth and depth of the historic national bibliography, but it has a limited scope. These separate projects provide an incomplete and confusing picture, and they fail to provide any unified tool for managing long-term preservation and access. There are at least two areas in particular that require clarification:
- GPO’s PURL policies and actions. GPO uses PURLs to provide permanent URLs to digital resources, but it is not clear how GPO policies ensure accurate linkage of metadata to digital objects. For example, we understand that some PURLs point to agency web sites and some point to digital objects in permanent.access.gpo.gov. GPO should provide answers to the following questions:
- How does GPO deal with the metadata for information that changes (not just moves) on agency web sites?
- Are there clear policies that govern the creation of PURLs and how they are checked for accuracy over time?
- Is there metadata that clarifies the relationship between agency copies and GPO copies?
- Is there a reason that GPO forbids the Internet Archive from harvesting documents using PURLs?
- Are there policies that deal with versioning of digital documents?
- Has GPO compared the functionality of PURLs with the functionality of DOIs, including the possibility of pointing to multiple copies of the same item?
For reference, here are examples of existing PURLs:
- http://purl.fdlp.gov/GPO/gpo56804 (resolves to FDsys)
- http://purl.fdlp.gov/GPO/gpo56917 (resolves to .gov live site)
- http://purl.fdlp.gov/GPO/gpo55615 (resolves to permanent.access.gpo.gov)
Note: All three of the above PURLs returned the same error message in the Internet Archive’s Wayback Machine: “Page cannot be crawled or displayed due to robots.txt.”
We also checked the final landing pages for the above GPO PURLs in the Wayback Machine and got mixed results:
- http://www.gpo.gov/fdsys/pkg/PLAW-114publ4/pdf/PLAW-114publ4.pdf “Hrm. Wayback Machine doesn’t have that page archived.”
- http://www.gao.gov/assets/670/669627.pdf OK
- http://permanent.access.gpo.gov/gpo55615/63507.pdf “Page cannot be crawled or displayed due to robots.txt.”
- Questions about GPO Web Harvesting.
- What is the relationship between the Federal Depository Library Program Web Archive, which uses the Internet Archive’s Archive-It service, and GPO’s own Federal Depository Library Program Electronic Collection (FDLP/EC) Archive?
- Are there policies as to what materials go into each?
- What metadata exist to describe the holdings of these different archives?
- Are Web-harvested documents and Websites cataloged for the CGP?
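To make the PURL questions above more concrete, here is a minimal sketch in Python of how a library might sort the final landing pages of PURLs into the three kinds of targets mentioned in this post (FDsys, permanent.access.gpo.gov, and live agency sites) and build a lookup against the Internet Archive's Wayback availability endpoint. The function names and category labels are our own hypothetical choices, not anything GPO provides, and the sketch deliberately makes no network requests.

```python
from urllib.parse import urlparse, urlencode

# PURLs and landing pages quoted in this post (the FDsys URL is written
# here with plain hyphens). Pairing is taken from the examples above.
LANDING_PAGES = {
    "http://purl.fdlp.gov/GPO/gpo56804":
        "http://www.gpo.gov/fdsys/pkg/PLAW-114publ4/pdf/PLAW-114publ4.pdf",
    "http://purl.fdlp.gov/GPO/gpo56917":
        "http://www.gao.gov/assets/670/669627.pdf",
    "http://purl.fdlp.gov/GPO/gpo55615":
        "http://permanent.access.gpo.gov/gpo55615/63507.pdf",
}


def classify_target(url: str) -> str:
    """Return a hypothetical label for where a landing page lives."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "permanent.access.gpo.gov":
        return "gpo-permanent"
    if host in ("gpo.gov", "www.gpo.gov") and parsed.path.startswith("/fdsys/"):
        return "fdsys"
    if host.endswith(".gov"):
        return "agency-site"
    return "other"


def wayback_query(url: str) -> str:
    """Build (but do not send) a Wayback availability lookup for a URL."""
    return "https://archive.org/wayback/available?" + urlencode({"url": url})


if __name__ == "__main__":
    for purl, landing in LANDING_PAGES.items():
        print(purl, "->", classify_target(landing))
        print("   check:", wayback_query(landing))
```

A report like this, run regularly over every PURL, would show how many permanent links depend on live agency sites rather than on preserved copies, which is exactly the kind of information the questions above ask GPO to provide.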
What we need
In order to define the national bibliography, bring it under the control of GPO and FDLP libraries and accurately and successfully manage FDLP collections (paper, born-digital, and digitized) for the long-term, the FDLP (and the public!) need accurate, complete, up-to-date, unified metadata for all FDLP ‘publications.’ This includes:
- everything in FDsys
- everything with a PURL
- everything in permanent.access.gpo.gov
- everything in GPO’s Archive-It collection
- every Executive agency’s Website (ideally including the proposed development of ../publications and ../data directories on every agency site)
- every digital surrogate qualifying for the Digital Surrogate Seal of Approval to assure quality and completeness of digitizations
- Metadata should accurately link to specific digital objects.
- Metadata should have specific information about versions and editions and the relationship of GPO copies to agency copies of web-harvested information.
- Metadata should include an indication about who is providing preservation services.
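As an illustration of the three metadata requirements in the list above, here is a rough sketch in Python of what a single record and a completeness check might look like. Every field name here is a hypothetical choice of ours for the sake of the example; it is not an existing GPO, CGP, or HathiTrust schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PublicationRecord:
    """Hypothetical record for one FDLP 'publication'."""
    title: str
    object_url: Optional[str] = None        # link to the specific digital object
    version: Optional[str] = None           # version or edition statement
    web_harvested: bool = False             # was the GPO copy harvested from the Web?
    agency_copy_url: Optional[str] = None   # where the agency's own copy lives
    preserved_by: List[str] = field(default_factory=list)  # who preserves it


def missing_requirements(rec: PublicationRecord) -> List[str]:
    """List which of the requirements named above a record fails to meet."""
    gaps = []
    if not rec.object_url:
        gaps.append("link to a specific digital object")
    if not rec.version:
        gaps.append("version or edition information")
    if rec.web_harvested and not rec.agency_copy_url:
        gaps.append("relationship of GPO copy to agency copy")
    if not rec.preserved_by:
        gaps.append("who is providing preservation services")
    return gaps


if __name__ == "__main__":
    rec = PublicationRecord(
        title="Example harvested report",
        object_url="http://permanent.access.gpo.gov/gpo55615/63507.pdf",
        web_harvested=True,
    )
    for gap in missing_requirements(rec):
        print("missing:", gap)
```

Even a check this simple, run across a unified set of records, would give the FDLP community a running list of where the national bibliography falls short.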
What libraries can do to help
- Expand the HathiTrust registry through a process of collaborative cataloging and metadata creation. Even cataloging one document per month aids in the ongoing effort to thoroughly describe the national bibliography;
- Develop and participate in a new, digital Farmington Plan in which libraries divvy up, adopt, and track digital documents from executive agencies;
- Participate in fugitive hunting (see “Want to be a fugitive hunter?”);
- Develop tools for collaborative and targeted Web harvesting and community crowd-sourcing of Web crawl Quality Assurance (QA) (i.e., tuning Web harvests and checking to make sure they collect the targeted material);
- Manage historic collections with a more geographically holistic view toward collection access and preservation;
- Develop and participate in community-based projects for contacting executive agency CIOs/CTOs to advocate for ingest of agency publications and data into FDsys.
And the list could go on. Some of these tasks are large and expensive, but some of them can be done on a regular basis in as little as one hour per month. The FDLP needs all hands on deck. One thing is for sure: if FDLP libraries and librarians do not step up to the admittedly large task of continuing to build digital FDLP collections, we could see the end of the historic record.
We look forward to the coming conversations.
Jim Jacobs and James R. Jacobs