The Government Publishing Office has a brief proposal to omit some metadata in the Catalog of Government Publications (CGP). As written in the announcement, the proposed change in policy seems simple and obvious. GPO says that, for publications in govinfo, “historic URLs” (the original, source URL) and PURLs (the Persistent Uniform Resource Locators) are “identical.” GPO reasons that, since the two fields are “redundant,” the historic URLs are unnecessary:
“LSCM proposes to cease the inclusion of historic URLs only in catalog records for resources in govinfo.”
The announcement asks for feedback on how the policy change would affect “cataloging/metadata and other operations and processes.”
The proposal should be rejected for three reasons.
1. The premise is wrong
GPO’s premise is flat-out wrong. The information recorded in PURLs and “historic URLs” is neither identical nor redundant. The target of a PURL changes over time, while a “historic URL” stays the same. The two fields record different information even when their values happen to match: a PURL records the current location of a resource, and a “historic URL” records its original location. PURLs exist because URLs change.
(If GPO could guarantee that the public URL of govinfo items will never change, then it could just as easily advocate not including PURLs in CGP records. But that would, obviously, be crazy. PURLs are needed to account for the problem of “link rot.” Link rot is a well-documented problem, including for government information. GPO’s own digital repository has already had three base addresses over the years, each change requiring PURL redirects: access.gpo.gov, gpo.gov/fdsys, and govinfo.gov.)
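The difference between the two fields can be sketched in a few lines of Python. This is only an illustration, not a model of how CGP or GPO’s PURL server actually works: the `CatalogRecord` and `PurlResolver` classes, the `purl.example` address, and the document paths are all hypothetical. It shows that the two values can match on the day a record is created and diverge as soon as the resource moves.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CatalogRecord:
    """A simplified, hypothetical stand-in for a catalog record."""
    title: str
    historic_url: str  # original location, recorded once and never rewritten
    purl: str          # persistent identifier that redirects to the current location


@dataclass
class PurlResolver:
    """Toy redirect table mapping each PURL to its current target."""
    targets: dict = field(default_factory=dict)

    def retarget(self, purl: str, new_location: str) -> None:
        # Point the PURL at a new location; the PURL itself does not change.
        self.targets[purl] = new_location

    def resolve(self, purl: str) -> str:
        return self.targets[purl]


# Hypothetical record; the paths below are invented for illustration.
record = CatalogRecord(
    title="Example publication",
    historic_url="https://www.gpo.gov/fdsys/example/doc.pdf",
    purl="https://purl.example/gpo/0001",
)

resolver = PurlResolver()

# At cataloging time the PURL resolves to the original location.
resolver.retarget(record.purl, record.historic_url)
same_at_first = resolver.resolve(record.purl) == record.historic_url

# After a migration the PURL is redirected; the historic URL is untouched.
resolver.retarget(record.purl, "https://www.govinfo.gov/example/doc.pdf")
same_after_move = resolver.resolve(record.purl) == record.historic_url

print(same_at_first, same_after_move)  # True False
```

Once the redirect is updated, the only remaining record of where the document was first published is the historic URL.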
Because PURLs are necessary, the issue GPO should be addressing is whether or not the original URL is a valuable piece of metadata — even after it “rots” (changes). And the question GPO should be asking is not how omitting that information will affect “cataloging/metadata and other operations and processes,” but how it will affect users and long-term access to digital resources.
2. The historic URL is valuable to users
The original URL of govinfo resources is a valuable piece of metadata for users. They can use it to locate a copy of a document in other archives (e.g., the Internet Archive, the End of Term Archive). They can use it to compare copies when more than one institution has archived the same content (perhaps at different times, reflecting changes over time).
The historic URL is essential to retrieving copies in other archives. This would be vital if technical, administrative, legal, or budgetary changes interfered with GPO keeping PURLs up to date, keeping its PURL server online, or keeping its online copy available. It would also be vital if government information were intentionally withdrawn or privatized. There is no good reason to deny such essential information to users.
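As a rough sketch of why the original URL matters for this kind of lookup, the Python below builds Internet Archive Wayback Machine addresses from a historic URL. It assumes the Wayback Machine’s public replay pattern (`web.archive.org/web/<timestamp>/<url>`) and its availability endpoint (`archive.org/wayback/available`); the function names and the sample document path are hypothetical, and other archives such as the End of Term Archive have their own interfaces that are not covered here. No network request is made.

```python
from urllib.parse import urlencode

WAYBACK_REPLAY = "https://web.archive.org/web"
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def wayback_replay_url(historic_url: str, timestamp: str) -> str:
    """Address of the capture closest to `timestamp` (YYYYMMDD or a prefix)."""
    return f"{WAYBACK_REPLAY}/{timestamp}/{historic_url}"


def wayback_availability_url(historic_url: str, timestamp: str) -> str:
    """Query address asking whether a capture exists near `timestamp`."""
    query = urlencode({"url": historic_url, "timestamp": timestamp})
    return f"{WAYBACK_AVAILABILITY}?{query}"


# Hypothetical historic URL; the path is invented for illustration.
historic = "https://www.gpo.gov/fdsys/example/doc.pdf"

print(wayback_replay_url(historic, "20150101"))
print(wayback_availability_url(historic, "20150101"))
```

Both lookups are keyed on the address at which the document was originally captured. A user who has only a PURL cannot construct either address if the PURL has since been redirected or the PURL server is offline.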
Some resources in govinfo (such as the Federal Register) are available in other versions from other government sources. The CGP records should record the historic URL so that users can understand which versions are described and to which version the PURL points.
The govinfo website provides users with the URLs of resources, but does not provide PURLs. Most users of resources in GPO’s Trusted Digital Repository (TDR) will have those public govinfo URLs and not PURLs. Users who cite those resources and provide links to them will likely use those URLs, not PURLs. Having those URLs in CGP would help users find the PURL when those URLs change (as they inevitably will).
3. The proposal ignores the evolution of govinfo
The proposed policy change is notable in what it leaves out: the future. At its inception, the policy would, apparently, affect only content in GPO’s Trusted Digital Repository (TDR).[1] The proposal should explicitly define the scope of the policy.
Currently, as far as we can tell[2], the TDR contains two kinds of resources with CGP records: Born-digital items that are first published by GPO in either fdsys or govinfo (example CGP record: America’s water infrastructure needs and challenges); and digitized paper items that GPO has ingested into fdsys or govinfo (example CGP record: Internal revenue cumulative bulletin.).
But CGP includes historic URLs for other kinds of resources. For example:
- Items harvested by GPO into permanent.access.gpo.gov and into wayback.archive-it.org. (Example CGP record: HealthCare.gov.)
- Digitized paper items not ingested into gpo.gov/fdsys or govinfo. (Example CGP record: Conservation and full utilization of water.)
This prompts several questions that the policy proposal does not address: Will GPO’s TDR ever contain other resources? Will it evolve and expand over time? Will the policy affect other kinds of content if they become “resources in govinfo”?
This is important because the historic URLs in CGP records for resources that, apparently, are not affected by the new policy today, could be affected if they were ingested into the TDR (or if the policy is broader than we think it is, or if GPO interprets the policy differently in the future, etc.).
The “historic URLs” for that other content are even more important in some cases than they are for the content currently in the TDR.
The source URL of a harvested item establishes the provenance (origin) of the item archived. The standard for Trusted Digital Repositories requires TDRs to preserve information on the provenance of resources they archive (TDR 4.2.6.3). If GPO ever ingests harvested content into its TDR, it will need the historic URL. Having it in CGP records could provide a useful link between CGP and GPO’s repository.
Digitized paper resources present a special problem for users. More than one digitization project may digitize the same paper document or different editions or versions of the same title. Differences can occur because of the use of different source copies and because of different standards and levels of accuracy of different digitization processes. This is further complicated in CGP by the fact that some CGP records point to non-government websites (example: Joan of Arc saved France), some point to the TDR (example: An account of the receipts and expenditures of the United States), and some point to harvested content (example: Pots and pans for your kitchen). There are probably other permutations and combinations that we have not discovered. The “historic URL” (i.e., the original location of the resource) would help the user identify the source of the digitization and would enable the user to obtain a copy from that source or from an archive of that source.
Conclusion
The “historic URLs” in CGP provide information to users that PURLs do not. That information is useful to users because it will help them identify, understand, and locate copies of resources. “Historic URLs” may seem unnecessary to GPO today, but they will increase in value to users over time. Making a decision for “resources in govinfo” today fails to take into account what resources may be in GPO’s TDR in the future (including harvested content and digitizations). The proposal to drop historic URLs is short-sighted. Dropping historic URLs today would be a mistake that users would resent in the future.
GPO should clarify the scope of the policy and how it would be applied in the future and evaluate its effects on users and long-term access.
Authors:
James A. Jacobs, University of California San Diego
James R. Jacobs, Stanford University
Endnotes
[1] The announcement mentions getting TDR certification and refers to “resources in govinfo.” GPO does not make it clear, however, if “govinfo” is a reference to its digital repository or the website www.govinfo.gov. Without a clarification, the scope of the policy proposal is ambiguous. The website runs on the content management system Drupal. It appears that everything in the TDR is available through the website, but not everything listed on the website is available in the TDR (e.g., see the browse page that points to GPO partner resources). For the most part, though, the website appears to be the public interface to the TDR. As far as we know, GPO has not said if that will always be the case. Presumably, the website could be used to point at content in more than one repository (permanent.gpo.gov, for example), or GPO might want to ingest into the TDR the content it has harvested at wayback.archive-it.org, or GPO might replace Drupal with a different CMS. Any of these would result in a different scope of “resources in govinfo.”
[2] Since GPO has not yet shared its TDR certification report publicly, our evaluation is based on what we can infer from using the govinfo website.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.