In a recent post on the blog of the Web Science and Digital Libraries Research Group, Shawn Jones reports on research that is vital to all those interested in long-term access to government information.
- How well are the National Guideline Clearinghouse and the National Quality Measures Clearinghouse Archived? Shawn M. Jones, Web Science and Digital Libraries Research Group (July 15, 2018).
In the post, Jones reports on his research into how much of the content of two sites of the Agency for Healthcare Research and Quality (AHRQ), the National Guideline Clearinghouse and the National Quality Measures Clearinghouse, has been preserved by librarians and archivists. He chose these two sites because, as we noted last week, the Department of Health and Human Services (HHS) is deleting from those sites "a vast trove of medical guidelines that for nearly 20 years has been a critical resource for doctors, researchers and others in the medical community." The Sunlight Foundation, The Daily Beast, and the Gov-Info Tumblr describe the situation in detail:
- Notice of Removal on HHS’s Agency for Healthcare Research and Quality Clearinghouse Websites, Sunlight Foundation’s Web Integrity Project (July 12, 2018).
- Explained: The shutdown of the National Guideline Clearinghouse and the independent efforts to launch a replacement, Andrew Bergman, Sunlight Foundation (July 20, 2018).
- HHS Plans to Delete 20 Years of Critical Medical Guidelines Next Week, Jon Campbell, The Daily Beast (July 12, 2018).
- AHRQ shutting down National Guideline Clearinghouse & The National Quality Measures Clearinghouse on July 18, 2018, gov-info: The Government Info Librarian blog (July 10, 2018).
AHRQ says that the reason for the takedown is that "Funding to support AHRQ’s National Guideline Clearinghouse (NGC) ended on July 16, 2018."
Jones’ research documents the difficulty of preserving digital government information. In his post he takes the time to meticulously document what he did and how he did it, including links to Figshare and GitHub with his data and code, so that others can learn from and build on his experiences.
Although it is somewhat heartening to learn that a great deal (though not all) of this information has been captured, the lessons from this situation are, on balance, disheartening. Jones demonstrates that it can be very difficult to crawl a website and download its content accurately and completely, and his research reveals why:
- Content is hidden in databases rather than clearly listed, described, and inventoried.
- Content is made available in a variety of hard-to-preserve, incompatible formats without any information to establish its fixity, provenance, or authenticity.
- Capturing content from government websites can be either difficult and opaque (at best) or prohibited (at worst).
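To illustrate what "fixity" means in the second point above, here is a minimal sketch, assuming the captured content has already been saved to a local directory. The function names (`sha256_of`, `build_manifest`, `changed_files`) are hypothetical and are not taken from Jones's code; the idea is simply to record a checksum for every file at capture time so that later alteration or loss can be detected.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: Dict[str, str]) -> List[str]:
    """List manifest entries that are now missing or no longer match."""
    current = build_manifest(root)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )
```

A library would store the manifest alongside the collection and rerun `changed_files` periodically; an empty list means every file recorded at capture time is still present and unaltered. When the publishing agency supplies no checksums of its own, a manifest like this can only attest to the state of the copy at the moment it was captured.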
Jones draws several conclusions from his research, all of which are essential lessons for government information professionals:
- The importance of organizations like the Sunlight Foundation for identifying at-risk resources.
- The importance of web archives like the Internet Archive.
- The importance of using existing tools to ensure that important resources are preserved.
Jones also notes one important problem:
We do need to be concerned that so much of this content is preserved in one place, rather than spread across multiple archives.
He is referring to the Internet Archive, where he found the bulk of the content from the two websites. It is also important to remember that IA is a charity and has no permanent funding. (One small lesson from this is that we should all donate to IA whenever we can.)
He also speaks directly to those who wish to guarantee long-term access for their Designated Communities:
If a page is of value to you, you have an obligation to archive it and archive it in multiple archives. What web pages have you archived today, so that you, and others, can access their content long after the live site has gone away?
What about NARA?
One question Jones does not address is this: Is the content of these web sites covered by requirements for preservation with NARA? The Federal Records Act and related laws and regulations cover only a portion of the huge amount of information gathered, created, and disseminated by the government. Official executive agency “Records Schedules,” which are approved by the National Archives and Records Administration (NARA), define only a subset of that information as Records suitable for deposit with NARA. The implementation of those Records Schedules is subject to interpretation by executive agency political appointees, who may not always have preservation as their highest priority. Indeed, politics can affect access, and Jones documents how changes in administrations can make archiving government web sites more difficult: both web sites started excluding automated downloading of content (with robots.txt files) shortly after Tom Price was approved as the new Secretary of Health and Human Services. And the New York Times reports that some surgeons lobbied against the AHRQ when the guidelines suggested nonsurgical solutions.
The Sunlight Foundation says that “AHRQ is retaining it [the content], as required by federal records law.” Someday we will find out if NARA does get any of this content.
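The robots.txt exclusions mentioned above are easy to detect programmatically. The following is a minimal sketch using Python's standard `urllib.robotparser` module; the helper name `can_crawl`, the user-agent string, and the example.gov URLs are illustrative assumptions, not the actual rules or addresses of the AHRQ sites.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that blocks all automated crawlers.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, `can_crawl(BLOCK_ALL, "example-archiver", "https://example.gov/summaries/1")` returns `False`. Checking a site's robots.txt on a regular schedule is one way an archivist might notice that a previously crawlable site has started excluding automated downloads.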
Jones’ experience leads us to additional conclusions:
- The current method of government information "dissemination" is broken and unacceptable.
Digital publishing by the government has grown up haphazardly, without adequate standards for long-term preservation, usability, and access. Each agency has its own content management routines, and those routines change as management, Information Technology staff, and the underlying technology change, usually without regard for the long term. Expedience, short-term cost-efficiency, and short-term access drive IT decisions. The result is that information is not really disseminated at all; it is just "made available." Much of this is driven by an emphasis on e-government services and a focus on individual users. While such a focus is necessary and good, the resulting infrastructure is bad if it overlooks the needs of institutions that consume and use information in bulk and for different purposes. Such institutions need to be able to preserve content, not just use it and discard it; they need an easy way of identifying the content that they want to preserve and of knowing that they are getting everything that matches their needs; they need to be able to guarantee its authenticity and make it easy to determine if content has been altered; they need to be able to guarantee the content will be usable in the future. The current system of government information dissemination is broken because it fails these essential requirements.
- Capturing content is not enough.
Simply capturing the content is only half of archiving. A large part of the value of the websites was that the content was well indexed and searchable in ways that were designed to be useful to physicians and the public. Just having a bunch of files in the Internet Archive will not mean that they will be easily findable or usable — or found or used. In the digital age, more than ever, Collections and Services complement each other. Collections are useful when an institution develops Services for them; Services can only be developed for Collections that an organization controls. “Archiving” in the digital age must mean more than capturing files; it must include providing Services for the content — services designed for the communities that use that content. Without services, the job of archiving is only half done.
The solutions to this situation are not easy, but they are clear. We cannot rely on libraries alone, or the Internet Archive alone, or GPO or NARA by themselves to solve the problem of a broken dissemination system or ad-hoc “archiving.” A combination of approaches and solutions is necessary. This will be a process, not an overnight solution.
- Libraries have to take responsibility for long-term access to government information using the tools we have today.
Libraries cannot wait for the perfect solution and they cannot hope someone else will take on this responsibility. Government agencies will withdraw, alter, and move digital government information. Funding is fragile and inadequate for GPO and NARA. Jones shows that content can be successfully captured and preserved, but the problem is too big to rely on the few that are already working on preservation. It requires participation by a large community of libraries.
- Libraries need to add value to Collections.
If the first step is for libraries to gather content from government web sites and develop robust collections to ensure their long-term availability, the second step is to build services for those collections. In the paper-and-ink age, libraries always added value to their collections by selecting and organizing them and providing public services for them. In the digital age, communities need this kind of value-added service more than ever. Just as libraries have to go beyond pointing without collecting, libraries also must add services to collections they acquire — services designed for their own Designated Communities. Doing this will not only protect against loss of government information in the future; it will also add value to libraries today.
- Libraries should work with agencies to make agency content more easily archivable.
Agencies can make our job easier and they should. It is to their advantage to partner with libraries to preserve their valuable content. This can be done today with existing technologies such as sitemaps, RSS feeds, open formats, and digital signatures. Librarians and archivists who work with the content of specific agencies can help by letting agencies know how archive-friendly (or archive-unfriendly!) their web sites are and how to fix them; there’s even a site called ArchiveReady that can help. Of course, we can and should also encourage agencies to work with GPO, but we must recognize that GPO’s resources are limited and it will not be able to do everything. GPO needs our help.
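As one illustration of why sitemaps matter to archivists, here is a minimal sketch of how a published sitemap can be turned into an inventory of pages to capture. It assumes the agency publishes a standard sitemaps.org XML file; the function name `inventory_from_sitemap` and the example.gov URLs are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.gov/guidelines/1</loc><lastmod>2018-07-01</lastmod></url>
  <url><loc>https://example.gov/guidelines/2</loc></url>
</urlset>"""


def inventory_from_sitemap(xml_text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (url, last-modified date or None) for each sitemap entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default=None, namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc.strip(), lastmod.strip() if lastmod else None))
    return entries
```

Calling `inventory_from_sitemap(SAMPLE)` yields one entry per page, with the last-modified date where the sitemap provides it. A complete, current sitemap gives a crawler an explicit list of what exists and what has changed, instead of forcing it to discover content hidden behind database queries.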
- Congress and OMB need to pass laws and create regulations to enforce good archiving practices.
We need laws and regulations that require agencies to produce digital content as Preservable Digital Objects with public-facing, complete inventories of and access to all content. We believe some simple requirements in OMB Circular A-130 could do this quickly. Government agencies that provide research grants already require researchers to have a Data Management Plan for the long-term preservation of data collected with government research grant funding, and this has produced a boom in library activity in response. A modification of A-130 could simply require agencies to have a similar plan, an Information Management Plan, for information they disseminate at government expense. Librarians should advocate with the Senate and ALA that the revision of Title 44 of the U.S. Code should require this of OMB.
Government information specialists who want to demonstrate to their communities that their collections and services have value can do so immediately by preserving what they can today. Libraries can increase their efficiency in the near term by working directly with government agencies to help them make their valuable content more easily preservable. Libraries can prepare for the future by advocating for changes to government information policies that will fix a broken dissemination infrastructure. We urge you to work within your own library and with your library organizations and consortia to take action today.
We also urge you to follow the work of these groups and get involved: The Sunlight Foundation Web Integrity Project, the Environmental Data and Governance Initiative (EDGI), PEGI (Preservation of Electronic Government Information), Open the Government, and Demand Progress.
James A. Jacobs, University of California San Diego
James R. Jacobs, Stanford University
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.