It is January and time once again to review what last year brought to libraries and the FDLP and where we should put our energies in the coming year.
In 2016 GPO issued a series of policies that express its intentions to enhance both access to and preservation of government information. While we applaud GPO’s intentions, we are dismayed because the policies are fatally flawed and will endanger preservation and access rather than protect and sustain them.
The biggest threat to long-term free public access to government information is government control of that information. Regardless of the good intentions of the current GPO administration, and regardless of the hopes of government information librarians, GPO cannot guarantee long-term free public access to government information on its own.
There are many reasons for this, but they all boil down to the simple fact that, when digital government information is controlled solely by the government that created it, it is only as secure as the next budget, the next change in policy, and the next change in administration. We have written about this repeatedly here at FGI and elsewhere for sixteen years, so we will not repeat all of those arguments (philosophical, technical, legal, economic, and professional) here today. (For those who wish to catch up, please see the FGI Library or the selected links below.)
GPO has come a long way since its early attempts to deal with the shift from paper-and-ink publications to born-digital information. To its credit, GPO today emphasizes in its policies (including the new ones) its intent to preserve as much digital government information as it can through its own actions as government publisher, through harvesting agency content, and through partnerships with others. GPO has also wisely reversed an earlier policy and is now partnering with LOCKSS to create copies of FDsys in thirty-seven Federal depository libraries. Indeed, the LOCKSS partnership, which puts copies of FDsys/govinfo.gov in the hands of FDLP libraries, is the most positive step GPO has taken. The LOCKSS archives are not, however, publicly available, so this is only a first step.
These are good intentions and positive steps. But we must ask: Are these steps sufficient? We must ask not only how much good they will do if they succeed, but how much damage they will cause if they fail. Can GPO really guarantee long-term free public access to government information?
The simple answer to these questions is: No, GPO cannot guarantee long-term free public access to digital government information. Why? First, regardless of its current intentions, GPO does not have a legislative mandate for long-term preservation. The wording of the law (44 USC 4101) does not mention long-term preservation or specify any limitations on what can be excluded or discarded or taken offline. It is limited to providing “online access to” and “an electronic storage facility for” two titles (the Congressional Record and the Federal Register). Everything else is at the discretion of the Superintendent of Documents. Previous SoDs have had completely different priorities and those bad policies could easily return. Federal agencies may request that GPO include agency information, but GPO is only obliged to do so “to the extent practicable.” This means that GPO’s commitment to long-term preservation is subject to changes in GPO administrations. Further, regardless of the intentions of even the most preservation-minded GPO administration, it can only do what Congress funds it to do and there are ongoing and repeated efforts to reduce GPO funding and privatize it.
Second, GPO does not have a legislative mandate to provide free public access. In fact, the law (44 USC 4102) explicitly authorizes GPO to charge reasonable fees for access. GPO’s current intentions are noble, but, alas, they lack the legislative and regulatory foundation necessary to provide guarantees.
So, even if GPO policies are successful in the short-term, the policies make the preservation and long-term free access ecosystem vulnerable to budget shortfalls and political influence because they are designed to consolidate GPO’s control of that information.
The shortcomings of such an approach have become more apparent to more people after the recent presidential election. Scientists, scholars, historians, news organizations, politicians, and even some government information librarians have announced their fears that government information is at risk of being altered, lost, or intentionally deleted because of drastic policy changes and new leadership in the incoming presidential administration. (See a list of articles about this issue.)
To be clear, no one has suggested (yet) that information will be deleted from FDsys/govinfo.gov. And we are not predicting that the new President and his executive branch agencies will erase any valuable government information. We are simply saying that they have the authority to do so and, if we keep all our eggs in one GPO/government-agencies basket, they have the technical ability to do so. This is not a new problem. Agencies and politicians have a long history of attempting to privatize, withdraw, censor, and alter government information. From 1981 to 1998, Anne Heanue and the fine folks at the Washington Office of the American Library Association (ALA) published an amazing series called Less Access to Less Information by and about the U.S. Government that chronicled such efforts to restrict access to government information.
What is new to this problem is the ability of a government that controls access to that information to remove access with the flick of a switch. Here at FGI we have written about this specific problem again and again.
To make matters worse, by explicit intent and inevitable effect, the new GPO policies will further consolidate GPO power and control and further weaken every individual FDLP library and the FDLP system as a whole.
What can government information librarians do in 2017? How should we focus our resources and actions?
- Monitor implementation of the Discard Policy. If successful, the new Regional Discard Policy (along with GPO’s National Plan and other policies) will, by design, further shift both access and preservation away from the FDLP into GPO. In addition, although the Policy claims that the digital surrogates it will rely on will be “complete and unaltered,” it lacks procedures to ensure this. At this point, the best we can do is hold GPO (and the Regionals that will be discarding documents) to their claims and not let the policy do even more harm than it is designed to do.
- Participate in PEGI. A loose group of individuals and organizations met last spring and fall to organize an effort called Preservation of Electronic Government Information (PEGI). Watch for developments and opportunities to participate in actions developed by this group.
- Support the EOT Crawl. The 2016 End of Term (EOT) crawl is the third national End of Term crawl. The goal of the EOT crawls is to document the change in federal administrations by harvesting as much government information as possible from .gov, .mil, and other domains before and after the inauguration. Follow their activities, contribute “seeds” and databases that need to be harvested, and promote their activities and visibility within your own communities.
- Support changes to OMB A-130. The Office of Management and Budget’s Circular A-130 lays out regulations for “Managing Information as a Strategic Resource.” The government policies that have done the most to affect preservation of information collected with government funding have been those that required “Data Management Plans” of those who get government research grants. These policies have prompted the creation of many new positions and programs to support data preservation in libraries. Oddly, there is no parallel regulation that requires government agencies to guarantee the preservation of the information they create. FGI recommended amending A-130 to require every government agency to have an “Information Management Plan” for the public information it acquires, assembles, creates, and disseminates. We will continue to push for this change. Watch for opportunities to support it.
- Support changing SOD 301. GPO’s Dissemination/Distribution Policy for the Federal Depository Library Program (“SOD 301”) is the policy that allows GPO to deposit digital government information with FDLP libraries, but limits the deposit to only those “products” that are the least preservable and most difficult to access. This policy actively impedes preservation and access and is the policy that GPO uses to enforce its consolidation of power and control over digital government information within the scope of the FDLP. Demand that GPO change this policy to allow FDLP libraries to select and acquire all digital government information.
- Support a truly digital FDLP. GPO’s policies since the mid-1990s have systematically minimized the participation of FDLP libraries in both preservation and access. At a time in which it is more obvious than ever that GPO needs legislatively mandated partners to guarantee long-term free public access to government information, support a truly digital, collaborative FDLP that uses new methods to support the traditional values of the FDLP.
James A. Jacobs, UCSD
James R. Jacobs, Stanford
Back when I helped teach new data librarians about data, one of the themes my colleagues and I liked to repeat was that “data should tell a story.” By that we meant that raw facts are literally without meaning until we analyze them and understand the stories they tell. “Understanding” is more than facts. As John Seely Brown and Paul Duguid said in their book The Social Life of Information, “information” is something that we put in a database, but knowledge “is something we digest rather than merely hold. It entails the knower’s understanding and some degree of commitment. Thus while one person often has conflicting information, he or she will not usually have conflicting knowledge” (pp. 119-120).
In those early days of data-librarianship, the tools we had for finding and acquiring and using data were very primitive (and often expensive) compared to the tools available today. Today, one can download and install very sophisticated free software for statistical analysis, data visualization, and even data animation. And one can download enormous data time series directly from the web and do analysis on the fly.
One big source of data is, of course, the federal government. But we should not simply hope that the government will preserve and provide free access to its data. Libraries need to take action to ensure the long-term free availability of data.
I say all that as an introduction to an article that I recommend to you as a source of inspiration toward action, an example of what can be done with government data today, and a cautionary tale of how data can be manipulated to tell stories that appear “true” but which actually distort the story the data really tell.
- 2016 Will Be The Warmest Year, But This Is How Deniers Will Spin It. Peter Aldhous. (December 20, 2016).
Aldhous provides code for using R and ImageMagick and Adobe Illustrator to load data on the average global temperature for each month since January 1880 directly from the National Oceanic and Atmospheric Administration. He analyzes the data, animates it, and demonstrates how changing the timeline can make the data tell a false story.
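Aldhous works in R against the real NOAA series; as a rough, hypothetical illustration of the cherry-picking trick he exposes, here is a self-contained Python sketch using invented numbers (a synthetic warming trend with one outlier hot year standing in for something like 1998's El Niño spike):

```python
# Hypothetical sketch, not Aldhous's actual code or NOAA's actual data:
# how choosing the start of the timeline can flip an apparent trend.

def slope(ys):
    """Least-squares slope of ys against the index 0..n-1."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# Synthetic "temperature anomalies": steady warming of 0.01 per year,
# plus one unusually hot outlier year at index 70.
temps = [0.01 * year for year in range(100)]
temps[70] += 0.5

full_trend = slope(temps)            # whole record: clearly positive
cherry_picked = slope(temps[70:80])  # window anchored at the spike

print(full_trend > 0)     # the long record shows warming
print(cherry_picked < 0)  # the spike-anchored decade appears to "cool"
```

The point of the sketch is the one Aldhous makes with the real data: the same series yields opposite "stories" depending on where the analyst chooses to start the clock.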
A lot has been written about “fake news” in the last few months. Too much of that writing has (IMHO) muddled the differences between just-plain-lies and everything else that divides the country at the moment. But the basic issue of politicians who distort the truth because they are more interested in the zero-sum game of political power than they are in governance is an old one. When politicians do this consistently and with coordination and determination, all that distortion ends up in “the news” as if it were true, when really it is just spreading Fear, Uncertainty, and Doubt. As an article in Science reports, that process is starting in the new Congress with a renewed attack on the Census and the American Community Survey.
- Scientists fear pending attack on federal statistics collection By Jeffrey Mervis, Science (Jan 3, 2017).
As the Science story says, this is not a new attack, but part of “a broader attack on the survey that goes back several years.” Indeed, we covered it here. Read all about the false-facts, bent-facts, unsubstantiated speculation, and ideological faith-healing that typically are used to try to persuade people to support really, really bad policy ideas.
In a recent thread on the Govdoc-l mailing list about a Congressional Publications Hub (or “pub hub” — more of the thread here), one commenter said that the American Memory project’s digital surrogates of the pre-Congressional Record publications "probably aren’t salvageable" because the TIFFs were captured at 300 ppi resolution and then converted to 1-bit bitonal black and white, and that most of the text is too faded or pixelated to be accurately interpreted by optical character recognition (OCR) software. He concluded that this was "Kind of a shame."
It is indeed a "shame" that many of the American Memory Project’s "digital surrogates" probably are not salvageable. But the real shame is that we keep making the same mistakes with the same bad assumptions today that we did 10-15 years ago in regard to digitization projects.
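To see why the bitonal conversion is irreversible, consider this short Python sketch of global thresholding. The pixel values here are invented for illustration, not taken from the actual American Memory scans:

```python
# Hypothetical sketch: why reducing faded grayscale text to bitonal
# (1-bit) black and white destroys exactly the detail OCR needs.

def to_bitonal(pixels, threshold=128):
    """Global threshold: every pixel becomes pure black (0) or white (255)."""
    return [0 if p < threshold else 255 for p in pixels]

# One row of grayscale pixels (0 = black ink, 255 = blank paper).
crisp_ink = [20, 25, 30]     # dark ink, well below the threshold
faded_ink = [150, 160, 170]  # light gray ink, above the threshold

row = crisp_ink + faded_ink
binarized = to_bitonal(row)

print(binarized)  # [0, 0, 0, 255, 255, 255] -- the faded stroke vanished
```

The grayscale original still distinguishes faded ink (150-170) from blank paper (255); the bitonal copy cannot. No future improvement in OCR can recover a stroke that the 1-bit file no longer contains, which is why such surrogates are unsalvageable rather than merely awaiting better software.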
The mistake we keep making is thinking that we’ve learned our lesson and are doing things correctly today, that our next digitizations will serve future users better than our last digitizations serve current users. We are making a series of bad assumptions.
- We assume, because today’s digitization technologies are so much better than yesterday’s technologies, that today’s digitizations will not become tomorrow’s obsolete, unsalvageable rejects.
- We assume, because we have good guidelines for digitization (like those of the Federal Agencies Digital Guidelines Initiative (FADGI)), that the digitizations we make today will be the "best" by conforming to the guidelines.
- We assume, because we have experience of making "bad" digitizations, that we will not make those mistakes any more and will only make "good" digitizations.
Why are these assumptions wrong?
- Yes, digitization technologies have improved a lot, but that does not mean that they will stop improving. We will, inevitably, have new digitization techniques tomorrow that we do not have today. That means that, in the future, when we look back at the digitizations we are doing today, we will once again marvel at the primitive technologies and wish we had better digitizations.
- Yes, we have good guidelines for digitization, but we overlook the fact that they are just guidelines, not guarantees of perfection or even guarantees of future usability. Those guidelines offer a range of options for different starting points (e.g., different kinds of originals: color vs. B&W, images vs. text, old paper vs. new paper, etc.), different end-purposes (e.g., page-images and OCR require different specs), and different users and uses (e.g., searching vs. reading, reading vs. computational analysis). There is no "best" digitization format. There is only a guideline for matching a given corpus with a given purpose and, particularly in mass-digitization projects, the given corpus is not uniform and the end-point purpose is either unspecified or vague. And, too often, mass-digitization projects are compelled to choose a less-than-ideal, one-size-does-not-fit-all compromise standard in order to meet the demands of budget constraints rather than the ideals of the "best" digitization.
- Yes, we have experiences of past "bad" digitizations so that we could, theoretically, avoid making the same mistakes, but we overlook the fact that use-cases change over time, users become more sophisticated, and user technologies advance and improve. We try to avoid making past mistakes, but, in doing so, we make new mistakes. Mass digitization projects seldom "look forward" to future uses. They too often "look backward" to old models of use — to page-images and flawed OCR — because those are improvements over the past, not advances for the future. But those decisions are only "improvements" when we compare them to print — or, more accurately, when we compare physical access to and distribution of print with digital access and distribution over the Internet. When we compare those choices to future needs, they look like bad choices: page-images that are useless on higher-definition displays or smaller, hand-held devices; OCR that is inaccurate and misleading; digital text that loses the meaning imparted by the layout and structure of the original presentation; digital text that lacks markup for repurposing; and digital objects that lack the fine-grained markup and metadata necessary for accurate and precise search results finer than the volume or page level. (There are good examples of digitization projects that make the right decisions, but these are mostly small, specialized projects; mass digitization projects rarely if ever make the right decisions.) Worse, we compound previous mistakes when we digitize microfilm copies of paper originals, thus carrying over limitations from the last-generation technology.
So, yes, it is a shame that we have bad digitizations now. But not just in the sense of regrettable or unfortunate. More in the sense of humiliating and shameful. The real "shame" is that FDLP libraries are accepting the GPO Regional Discard policy that will result in fewer paper copies. That means fewer copies to consult when bad digitizations are inadequate, incomplete, or unusable as "surrogates"; and fewer copies to use for re-digitization when the bad digitizations fail to meet evolving requirements of users.
We could, of course, rely on the private sector (which understands the value of acquiring and building digital collections) for future access. We do this to save the expense of digitizing well and acquiring and building our own public domain digital collections. But by doing so, we do not save money in the long-run; we merely lock our libraries into the perpetual tradeoff of paying every year for subscription access or losing access.
UPDATE: Senate torture report to be kept from public for 12 years after Obama decision by Spencer Ackerman, The Guardian (12 December 2016). President Obama has agreed to preserve the report, but his decision ensures that the document remains out of public view for at least 12 years and probably longer. Obama’s decision prevents Republican Senator Richard Burr from destroying existing classified copies of the December 2014 report. Daniel Jones, a former committee staffer, criticized the preservation as inadequate. “Preserving the full 6,700-page report under the Presidential Records Act only ensures the report will not be destroyed,” Jones said. “It does little else.”
Declaring that the written history of the U.S. torture program is in jeopardy, two former United States Senators have called upon President Obama to take steps now to make it difficult for a future administration to erase the historical record.
- The Torture Report Must Be Saved by Carl Levin and Jay Rockefeller. New York Times (Dec. 9, 2016)
They suggest that President Obama can do this by declaring that the Senate Intelligence Committee’s full, classified 6,700-page report on torture is a "federal record." This will allow government departments and agencies that already possess the full report to retain it and it will make it more difficult for a future administration to destroy existing copies of the document.
Traditionally we think of "government documents" as those publications that government agencies produce for public consumption and "government records" as information that provides evidence of the operations of the agency. Government "records" are defined in 44 U.S. Code § 3301:
…includes all recorded information, regardless of form or characteristics, made or received by a Federal agency under Federal law or in connection with the transaction of public business and preserved or appropriate for preservation by that agency or its legitimate successor as evidence of the organization, functions, policies, decisions, procedures, operations, or other activities of the United States Government or because of the informational value of data in them…
The National Archives and Records Administration uses this definition to guide agencies in their records retention and disposition policies.
As Levin and Rockefeller point out, "the roughly 500-page summary of the Senate report [available as a free PDF and as a $29 print copy] that was declassified and made public at the end of 2014 is only a small part of the story. The full report remains classified." They say that the full report contains:
"information that leads to a more complete understanding of how this program happened, and how it became so misaligned with our values as a nation. Most important, the full report contains information that is critical to ensuring that these mistakes are never made again."
In 2014, the committee sent the report to the Obama administration. Senator Richard Burr of North Carolina has tried to recall the full report to prevent it from ever being widely read or declassified, and specifically asked that it "should not be entered into any executive branch system of records." So far, the Obama administration has not returned the report to Senator Burr. Levin and Rockefeller say that "Given the rhetoric of President-elect Trump, there is a grave risk that the new administration will return the Senate report to Senator Burr, after which it could be hidden indefinitely, or destroyed."