(Editor’s note: this post is the second of two guest editorials on Libraries Network, a nascent collaborative effort of the Association of Research Libraries (ARL) spurred by the work of the DataRefuge project, End of Term crawl, and other volunteer efforts to preserve data and content from the .gov/.mil domain. The first post was addressed to libraries, the second to government agencies. Please leave a comment to tell us what you think! JRJ)
This moment in history provides us with a rare opportunity to go beyond short-term data rescue and set the much needed foundation for the long-term future of preservation of government information.
Awareness of risk. At the moment, more people than ever are aware of the risk of relying solely on the government to preserve its own information. This was not true even six months ago. This awareness goes far beyond government information librarians and archivists. It includes the communities that use government information (our Designated Communities!) and the government employees who devote their careers to creating this information. It includes our colleagues, our professional organizations, and library managers.
This awareness is documented in the many stories in the popular press this year about massive “data rescue” projects drawing literally hundreds of volunteers. It is also demonstrated by the number of people nominating seeds (URLs) and the number of seeds nominated for the current End of Term harvest. These have increased by nearly an order of magnitude or more over 2012.
Awareness of need for planning. But beyond the numbers, more people are learning first-hand that rescuing information at the end of its life-cycle can be difficult, incomplete, and subject to error and even loss. It is clear that last minute rescue is essential in early 2017. But it is also clear that, in the future, efficient and effective preservation requires planning. This means that government agencies need to plan for the preservation of the information they create at the beginning of the life-cycle of that information — even before it is actually created.
Opportunity to create demonstrable value. This awareness provides libraries with the opportunity to lead a movement to change government information policies that affect long-term preservation of and access to government information. By promoting this change, libraries will be laying the groundwork for future long-term preservation of information that their communities value highly. This provides an exceptional opportunity to work with motivated and inspired user communities toward a common goal. This is good news at a time when librarians are eager to demonstrate the value of libraries.
A model exists. And there is more good news. The model for a long-term government information policy not only exists, but libraries are already very familiar with it. In 2010, federal granting agencies like NSF, National Institutes of Health and Department of Energy started requiring researchers who receive Federal grants to develop Data Management Plans (DMPs) for the data collected and analyzed during the research process. Thus, data gathered at government expense by researchers must have a Plan to archive that data and make it available to other researchers. The requirement for DMPs has driven a small revolution of data management in libraries.
Ironically, there is no similar requirement for government agencies to develop a plan for the long-term management of information they gather and produce. There are, of course, a variety of requirements for managing government “Records” but there are several problems with the existing regulations.
Gaps in existing regulations. The Federal Records Act and related laws and regulations cover only a portion of the huge amount of information gathered and created by the government. In the past, it was relatively easy to distinguish between “publications” and “Records” but, in the age of digital information, databases, and transactional e-government, it is much more difficult to do so. Official executive agency “Records Schedules,” which are approved by the National Archives and Records Administration (NARA), define only a subset of information gathered and created by an agency as Records suitable for deposit with NARA. Further, the implementation of those Records Schedules is subject to interpretation by executive agency political appointees who may not always have preservation as their highest priority. This can make huge swaths of valuable information ineligible for deposit with NARA as Records.
Government data, documents, and publications that are not deemed official Records have no long-term preservation plan at all. In the paper-and-ink world, many agency publications that did not qualify as Records were printed by or sent to the Government Publishing Office (GPO) and deposited in Federal Depository Library Program (FDLP) libraries around the country (currently 1,147 libraries). Unfortunately, a perfect storm of policies and procedures has blocked FDLP libraries from preserving this huge class of government information. A 1983 court decision (INS v. Chadha, 462 U.S. 919, 952) makes it impossible to require agencies to deposit documents with GPO or FDLP. The 1980 Paperwork Reduction Act (44 U.S.C. §§ 3501–3521) and the Office of Management and Budget (OMB)’s Circular A-130 have made it more difficult to distribute government information to FDLP libraries. The shift to born-digital information has decentralized publishing and distribution, and virtually eliminated best practices of metadata creation and standardization. GPO’s Dissemination and Distribution Policy has severely limited the information it will distribute to FDLP libraries. Together, this “perfect storm” has reduced the deposit of this class of at-risk government information into FDLP libraries by ninety percent over the last twenty years.
The Solution: Information Management Plans. To plug the gaps in existing regulations, government agencies should be required to treat their own information with as much care as data gathered by researchers with government funding. What is needed is a new regulation that requires agencies to have Information Management Plans (IMPs) for all the information they collect, aggregate, and create.
We have proposed to the OMB a modification to their policy OMB Circular A-130: Managing Information as a Strategic Resource that would require every government agency to have an Information Management Plan.
Every government agency must have an “Information Management Plan” for the information it creates, collects, processes, or disseminates. The Information Management Plan must specify how the agency’s public information will be preserved for the long-term including its final deposit in a reputable, trusted, government (e.g., NARA, GPO, etc.) and/or non-government digital repository to guarantee free public access to it.
Many Benefits! We believe that such a requirement would provide many benefits for agencies, libraries, archives, and the general public. We think it would do more to enhance long-term public access to government information than changes to Title 44 of the US Code (which codified the “free use of government publications”) could do.
- It would make it possible to preserve information continuously without the need for hasty last-minute rescue efforts.
- It would make it easier to identify and select information and preserve it outside of government control.
- It would result in digital objects that are easier to preserve accurately and securely.
- It would make it easy for government agencies to collaborate with digital repositories and designated communities outside the government for the long-term preservation of their information.
- The scale of the resulting digital preservation infrastructure would provide an easy path for shared Succession Plans for Trusted Digital Repositories (TDRs) (Audit And Certification Of Trustworthy Digital Repositories [ISO Standard 16363]).
IMPs would provide these benefits through the practical response of vendors that provide software to government agencies. Those vendors would have an enormous market for flexible software solutions for the creation of digital government information and records that fit the different needs of different agencies for database management, document creation, content management systems, email, and so forth, while, at the same time, making it easy for agencies to output preservable digital objects and an accurate inventory of them ready for deposit as Submission Information Packages (SIPs) into TDRs.
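As a rough illustration of what “an accurate inventory” of digital objects might look like in practice, here is a minimal Python sketch that walks a directory and records each file's relative path, size, and SHA-256 checksum. The manifest layout and the function names (`sha256_of`, `build_inventory`) are our own assumptions for illustration; they are not drawn from OAIS, ISO 16363, or any agency or vendor system, and a real Submission Information Package would carry far more descriptive and provenance metadata.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(root):
    """List every file under root with its relative path, size, and checksum.

    The entry layout is hypothetical, chosen only for this example.
    """
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            entries.append({
                "path": path.relative_to(root).as_posix(),
                "bytes": path.stat().st_size,
                "sha256": sha256_of(path),
            })
    return entries


# Small demonstration on a throwaway directory with one made-up file.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "report.txt").write_text("example agency report\n", encoding="utf-8")
    print(json.dumps(build_inventory(demo_root), indent=2))
```

The point of the checksums is that a receiving repository could recompute them after transfer and confirm that what arrived matches what the agency listed.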
We believe this is a reasonable suggestion with a good precedent (the DMPs), but we would appreciate hearing your opinions. Is A‑130 the best target for such a regulation? What is the best way to propose, promote, and obtain such a new policy? What is the best wording for such a proposed policy?
We believe we have a singular opportunity of awareness and support for the preservation of government information. We believe that this is an opportunity, not just to preserve government information, but also to demonstrate the leadership of librarians and archivists and the value of libraries and archives.
James A. Jacobs, Librarian Emeritus, University of California San Diego
James R. Jacobs, Federal Government Information Librarian, Stanford University
(Editor’s note: this post was a guest editorial on Libraries Network, a nascent collaborative effort of the Association of Research Libraries (ARL) spurred by the work of the DataRefuge project, End of Term crawl, and other volunteer efforts to preserve data and content from the .gov/.mil domain. This is the first of 2 posts for the Libraries Network. The second one will be posted tomorrow. JRJ)
Now that so many have done so much good work to rescue so much data, it is time to reflect on our long-term goals. This is the first of two posts that suggest some steps to take.
The amount of data rescue work that has already been done by DataRefuge, ClimateMirror, Environmental Data and Governance Initiative (EDGI) projects and the End of Term crawl (EOT) 2016 is truly remarkable. In a very practical sense, however, this is only the first stage in a long process. We still have a lot of work to do to make all the captured digital content (web pages, data, PDFs, videos, etc.) discoverable, understandable, and usable. We believe that the next step is to articulate a long-term goal to guide the next tasks.
Of course, we do already have broad goals but up to now those goals have by necessity been more short-term than long-term. The short-term goals that have driven so much action have been either implicit (“rescue data!”) or explicit (“to document federal agencies’ presence on the World Wide Web during the transition of Presidential administrations” [EOT]). These have been sufficient to draw librarian-, scientist-, hacker-, and public volunteers who have accomplished a lot! But, as the EOT folks will remind us, most of this work is volunteer work.
The next stages will require more resources and long-term commitments. Notable next tasks include: creating metadata, identifying and acquiring DataRefuge’s uncrawlable data, and doing Quality Assurance (QA) work on content that has been acquired. This work has begun. The University of North Texas, for example, has created a pilot crowdsourcing project to catalog a cache of EOT PDFs and is looking for volunteers. This upcoming work is essential in order to make content we rescue and acquire discoverable and usable and to ensure that the content is preserved for the long-term.
As we look to the long-term, we turn to the two main international standards for long-term preservation: OAIS (Reference Model For An Open Archival Information System) and TDR (Audit And Certification Of Trustworthy Digital Repositories). Using the terminology of those standards, our current actions have focused on “ingest.” Now we have to focus on the other functions of a TDR: management, preservation, access, and use. We might say that what we have been doing is Data Rescue, but what we will do next is Data Preservation, which includes discovery, access, and use.
Given that, here is our suggestion for a long-term goal:
Create a digital government-information library infrastructure in which libraries collectively provide services for collections that are selected, acquired, organized, and preserved for specific Designated Communities (DCs).
Adopting this goal will not slow down or interrupt existing efforts. It focuses on “Designated Communities” and the life-cycle of information and, by doing so, it will help prioritize our actions. By doing this, it will help attract libraries to participate in the next stage activities. It will also make long-term participation easier and more effective by helping participants understand where their activities lead, what the outcomes will be, and what benefits they will get tomorrow by investing their resources in these activities today.
How does simply adopting a goal do all that?
First, by expressing the long-term goal in the language of OAIS and TDR it assures participants that today’s activities will ensure long-term access to information that is important to their communities.
Second, by putting the focus on the users of the information it demonstrates to our local communities that we are doing this for them. This will help make it practical to invest needed resources in the necessary work. The goal focuses on users of information by explicitly saying that our actions have been and will be designed to provide content and services for specific user groups (Designated Communities in OAIS terminology).
Third, by focusing on an infrastructure rather than isolated projects, it provides an opportunity for libraries to benefit more by participating than by not participating.
The key to delivering these benefits lies in the concept of Designated Communities. In the paper-and-ink world, libraries were limited in who they could serve. “Users” had to be local; they had to be able to walk into our buildings. It was difficult and expensive to share either collections or services, so we limited both to members of our funding institution or a geographically-local community. In the digital world, we no longer have to operate under those constraints. This means that we can build collections for Designated Communities that are defined by discipline or subject or by how a community uses digital information. This is a big change from defining a community by its institutional affiliation or by its members’ geographical proximity to an institution or to each other.
This means that each participating institution can benefit from the contributions of all participating institutions. To use a simple example, if ten libraries each invested the cost of developing collections and services for two DCs, all ten libraries (and their local/institutional communities) would get the benefits of twenty specific collections and services. There are more than one thousand Federal Depository Library Program (FDLP) libraries.
Even more importantly, this model means that the information-users will get better collections of the information they need and will get services that are tailored to how they look for, select, and use that information.
This approach may seem unconventional to government information specialists who are familiar with agency-based collections and services. The digital world allows us to combine the benefits of agency-based acquisitions with DC-based collections and services.
This means that we can still use the agency-based model for much of our work while simultaneously providing collections for DCs. For example, it is probably always more efficient and effective to identify, select, and acquire information by focusing on the output of an agency. It is certainly easier to ensure comprehensiveness with this approach. It is often easier to create metadata and do QA for a single agency at a time. And information content can be easily stored and managed using the same agency-based approach. And information stored by agency can be viewed and served (through use of metadata and APIs) as a single “virtual” collection for a Designated Community. Any given document, dataset, or database may show up in the collections of several DCs, and any given “virtual” collection can easily contain content from many agencies.
For example, consider how this approach would affect a Designated Community of economists. A collection built to serve economists would include information from multiple agencies (e.g., Commerce, Council of Economic Advisers, CBO, GAO, NEC, USDA, ITA, etc.). When one library built such a collection and provided services for it, every library with economists would be able to better serve their community of economists. And every economist at every institution would be able to more easily find and use the information she needs. The same advantages would be true for DCs based on kind of use (e.g., document-based reading; computational textual-analysis; GIS; numeric data analysis; re-purposing and combining datasets; etc.).
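The idea of storing content by agency while serving it by community can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the `Item` record, its fields, the subject tags, the sample identifiers and titles, and the `virtual_collection` helper are all hypothetical and do not describe any existing system or catalog.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Item:
    """One preserved object, stored under the agency that produced it."""
    identifier: str
    agency: str
    title: str
    subjects: frozenset = field(default_factory=frozenset)


# Made-up holdings, acquired and stored agency by agency.
HOLDINGS = [
    Item("cbo-001", "CBO", "Sample budget projection", frozenset({"economics"})),
    Item("usda-001", "USDA", "Sample crop price series", frozenset({"economics", "agriculture"})),
    Item("epa-001", "EPA", "Sample air quality dataset", frozenset({"climate"})),
]


def virtual_collection(holdings, subject):
    """Select the items relevant to one Designated Community, across all agencies."""
    return [item for item in holdings if subject in item.subjects]


# A "virtual" collection for economists draws on more than one agency.
economists = virtual_collection(HOLDINGS, "economics")
print([(item.agency, item.identifier) for item in economists])
```

Nothing is copied or moved to build the economists' view: the same stored item can appear in several communities' collections (the USDA item here would also appear for an agriculture community), which is the property the agency-based storage model is meant to preserve.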
We believe that adopting this goal will have several benefits. It will help attract more libraries to participate in the essential work that needs to be done after information is captured. It will provide a clear path for planning the long-term preservation of the information acquired. It will provide better collections and services to more users more efficiently and effectively than could be done by individual libraries working on their own. It will demonstrate the value of libraries to our local user-communities, our parent institutions, and funding agencies.
James A. Jacobs, Librarian Emeritus, University of California San Diego
James R. Jacobs, Federal Government Information Librarian, Stanford University
There is a lot of activity going on to ensure that government information on government servers does not get altered, deleted, or lost during the transition between administrations. As we have pointed out before, this is not a new issue even if the immediacy of the problem is more apparent than ever before.
Much of the effort going into these activities has to deal with the inherent problems of how federal government agencies create and disseminate information. There is, for example, no comprehensive inventory or national bibliography of government information. Agencies do not even provide inventories of their own information. This makes it hard to identify and select information for preservation. Also, some information is in databases or linked to Web applications that are not directly acquirable by the public. Finally, the "digital objects" that we can identify and acquire are often not easily preservable.
The inherent problem is that agencies are not addressing digital preservation up front. Librarians and Web archivists are left trying to solve the digital preservation problem too late in the life-cycle of information. We are trying to preserve information long after its creation and "distribution" — in the absence of early preservation planning by the agencies that created the information. This is understandable under current government information policies because most government agencies do not have a mission that includes either the long-term preservation of their information or free public access to it. The Federal Records Act [Public Law 81-754, 64 Stat. 578, TITLE V-Federal Records (64 Stat. 583)] and related laws and regulations only cover a portion of the huge amount of information gathered and created by the government. In addition, the preservation plans that do exist are subject to interpretation by political appointees who may not always have preservation as their highest priority.
What we need is a better approach to government information management that includes preservation planning at the beginning of the information life-cycle and that guarantees its long-term preservation and free public access to it even if the agency has no more need for it, or if Congress has no more funding for it, or if politicians no longer want it.
How can that be done?
At FGI, we believe that a long-term solution will require a change of government policy. That is why we have proposed a modification to OMB Circular A-130: Managing Information as a Strategic Resource that would require every government agency to have an Information Management Plan.
This seems to us to be a reasonable suggestion with a good precedent. The government agencies that provide research grants already require researchers to have a Data Management Plan for the long-term preservation of data collected with government research grant funding. A modification of A-130 would simply put the same requirement onto information produced at government expense by government agencies that the National Science Foundation (NSF) and other government funding agencies put onto the data produced by researchers with government funding.
Here is a draft of such a requirement:
Every government agency must have an “Information Management Plan” for the information it creates, collects, processes, or disseminates. The Information Management Plan must specify how the agency’s public information will be preserved for the long-term including its final deposit in a reputable, trusted, government (e.g., NARA, GPO, etc.) and/or non-government digital repository to guarantee free public access to it.
We believe that such a requirement would provide many benefits for agencies, libraries, archives, and the general public. It would make it possible to preserve information continuously without the need for hasty last-minute rescue efforts. It would make it easier to identify and select information and preserve it outside of government control. It would result in digital objects that are easier to preserve accurately and securely. It would accomplish many of these goals through the practical response of vendors that provide software to government agencies. Those vendors would have an enormous market for flexible software solutions for the creation of digital government information that fits the different needs of different agencies for database management, document creation, content management systems, email, and so forth, while, at the same time, making it easy for agencies to output preservable digital objects and an accurate inventory of them ready for deposit in Trusted Digital Repositories (Audit And Certification Of Trustworthy Digital Repositories [ISO Standard 16363]) for long-term preservation and access.
Perhaps most important for FDLP Libraries, we believe that this OMB requirement would provide a clear and practical opportunity for libraries to guarantee long-term free access to curated collections of government information to their Designated Communities. And this, we believe, will drive new funding and staffing to libraries and digital repositories.
James A. Jacobs and James R. Jacobs.
During this past week, there were many reports about the Trump administration’s actions that appear to be either removing information, or blocking information, or filtering scientific information through a political screen before allowing that information to be released.
How concerned should government information specialists be about these developments? Very.
What can we do? First, let’s be cautious but vigilant. As librarians, we are well aware that today’s information environment bombards us with fragments of news and demonstrably false news and speculation and premature interpretation of fragmentary speculation of unverified news. We should neither panic nor dismiss all this as noise. There is so much happening in so many areas of public policy right now that no one can keep up with everything; one thing that government information specialists can do is keep up with developments about access to government information so we can keep our colleagues and communities informed with accurate information.
We also need to evaluate what is happening critically. The Trump administration has attempted to normalize last week’s actions, saying, essentially, that removal of information and control of information is a normal part of a transition. On Tuesday of last week, for example, White House press secretary Sean Spicer addressed concerns about reports of censorship at the Environmental Protection Agency (EPA) by saying “I don’t think there’s any surprise that when there’s an administration turnover, we’re going to review the policies.” And on Wednesday, Doug Ericksen, the communications director for the President’s transition team at the EPA, said “Obviously with a new administration coming in, the transition time, we’ll be taking a look at the web pages and the Facebook pages and everything else involved here at EPA.” In short, this explanation is that the new administration is just updating and transitioning and making sure that information from agencies conforms to its new policies. This is “business as usual.” Nothing to see here; relax; move on. Even some govinfo librarians minimize the significance of what is going on.
This sounds reasonable on the surface. Indeed, since even the entire Executive Office of the President, which includes the Council of Economic Advisers and the National Security Council and the Office of Management and Budget, has been offline since Inauguration day and a temporary page asks us to “Stay tuned as we continue to update whitehouse.gov,” perhaps we should just be patient? Surely those will be back, right?
I think we need to realize that this actually is pretty odd behavior. And we need to help our communities who still need access to important policies (like OMB Circular A-130 [IA copy]) that are gone but, presumably, still in effect.
We need to be aware that this administration presents difficulties for the public in just figuring out what it is actually doing. It appears that the administration has reversed or modified some of its initial information policies or that they were incorrectly reported, and that these reversals — if they are reversals and if they are permanent — seem to have come about because of public outcry.
I think we need to take the administration’s actions seriously and let them know when they are doing something unacceptable or uninformed. We need to stand up for public access and transparency.
I suggest that it is our professional duty to address these issues. I suggest that the communities that our libraries serve expect and need us to do this. This administration is doing many troubling and controversial things, and not everyone can fight every battle. Ensuring long-term free access to government information should be a job responsibility for every government information librarian.
What can we do? What should we do? How can we best allocate our resources?
- We need to keep our library administrators informed. We can do that by putting Government Information on committee agendas and preparing accurate and well-informed briefings that address how political changes will affect the library’s ability to provide content and service and how they will affect library users’ ability to find and get and use government information.
- We need to talk to our user-communities. We need to provide them with accurate and well-informed information about how political changes are already affecting their ability to find and get and use government information. We need to provide alternate sources where necessary and update library guides and catalogs. We need to learn from them when they identify issues and problems and solutions.
- We need to keep our professional colleagues informed through local library meetings, informal communication, and professional activities.
- We can still contribute to the EOT. There are lots of things you can do.
- We can make the case for digital collections.
- We need to remind our administrators that when we depend on pointing instead of collecting we lose information.
- We need to remind them that even though preservation sites like obamawhitehouse.archives.gov and the 2016 EOT crawl are worthwhile and valuable, they still create the problem of link rot. We need to remind library administrators that pointing to remote collections that move is not a cheap way to provide good service. It is a time-consuming, never-ending task that is neither easier than nor as reliable as building local digital collections.
Sample of News Stories
by Dino Grandoni (Jan. 26, 2017)
By MICHAEL BIESECKER and SETH BORENSTEIN (Jan. 26, 2017)
I was honored last week to be part of a panel hosted by OpenTheGovernment and the Bauman Foundation to talk about the End of Term project. Other presenters included Jess Kutch at Coworker.org and Micah Altman, Director of Research at MIT Libraries. I talked about what EOT is doing, as well as some of the other great projects, including Climate Mirror, Data Refuge and the Azimuth backup project, working in concert/parallel to preserve federal climate and environmental data.
I thought the Q&A segment was especially interesting because it raised and answered some of the common questions and concerns that EOT receives on a regular basis. I also learned about a cool project called Violation Tracker, a search engine on corporate misconduct. And I was also able to talk a bit about the needs going forward, including the idea of “Information Management Plans” for agencies, similar to the idea of “Data Management Plans” for all federally funded research. I was heartened to know that there is interest in that as a wider policy advocacy effort!
The full recorded meeting can be viewed here from Bauman’s Adobe Connect account.
Here’s more information on the EOT crawl and how you can help.
Coalitions of government, university, and public interest organizations have been working to ensure as much information as possible is preserved and accessible, amid growing concern that important and sensitive government data on climate, labor, and other issues may disappear from the web once the Trump Administration takes office.
Last Thursday, OTG and the Bauman Foundation hosted a meeting of advocates interested in preserving access to government data, and individuals involved in web harvesting efforts. James Jacobs, a government information librarian at Stanford University Library who is working on the End of Term (EOT) web harvest – a joint project between the Internet Archive, the Library of Congress, the Government Publishing Office, and several universities – spoke about the EOT crawl, and explained the various targets of the harvest, including all .gov and .mil web sites, government social media accounts, and more.
Jess Kutch discussed efforts by Coworker.org with Cornell University to preserve information related to workers’ rights and labor protections, and other meeting attendees presented some of their own projects as well. Philip Mattera explained how Good Jobs First is using its Violation Tracker database to scrape and preserve government source material related to corporate misconduct.
Micah Altman, Director of Research at MIT Libraries, presented on the need for libraries and archives to build better infrastructure for the EOT harvest and other projects – including data portals, cloud infrastructure, and technologies that enhance discoverability – so that data and other government information can be made more easily accessible to the public.