
Tag Archives: digitization

Our mission

Free Government Information (FGI) is a place for initiating dialogue and building consensus among the various players (libraries, government agencies, non-profit organizations, researchers, journalists, etc.) who have a stake in the preservation of and perpetual free access to government information. FGI promotes free government information through collaboration, education, advocacy and research.

FDLP Digitization Projects Registry is operational

Well, it’s been almost a year, but the FDLP digitization registry is back online, complete with a fresh new look!

via The Digitization Projects Registry is Now Available.

The FDLP Historical Collections

A recent question on the govdoc-l mailing list asked if GPO had ever officially defined the term “legacy collection” or “legacy document” and if the definition goes beyond something that has historical value or importance. I posted a short answer there. Here, I document and explain that brief response.

The term was introduced by the Association of Research Libraries (ARL) and by Superintendent of Documents Judy C. Russell in 2003. The phrase has been used almost exclusively in the documents community in the context of digitizing and discarding FDLP historical paper collections ever since.

Before 2003

Before 2003, documents and articles that discuss collections (even in the context of digitizing them) rarely if ever used the adjective “legacy” to describe FDLP collections. For example, a 2002 GODORT report on digitizing government information did not include the word “legacy” to describe the collections to be targeted for digitization.

I did not find any references to “legacy collections” in DttP or govdoc-l or Google Scholar or Library, Information Science & Technology Abstracts before 2003.

2003: Introduction of a New Term

Judy C. Russell, then GPO Superintendent of Documents, apparently introduced the term in late 2003 in an announcement of an agreement between ARL and GPO to “digitize a complete legacy collection.” GODORT mentioned it at ALA in January 2004 and GPO included it in its Strategic Vision for the 21st Century in December 2004.

Russell also referred to “legacy content” at the Center for Research Libraries Forum on “Building Blocks of a National Print Preservation Network.” And, in a 2005 Dissemination Implementation Plan, GPO referred to the “legacy collection of tangible U.S. Government publications held in libraries participating in the Federal Depository Library Program (FDLP).”

Adoption of the term

After 2004, the mentions of an FDLP “legacy collection” increased. Documents librarians adopted the phrase to refer to the paper (and sometimes microform, and, occasionally, even “tangible” digital) documents that GPO had actually deposited into FDLP libraries.

What is the FDLP “Legacy Collection?”

Russell described the “legacy collection” as “tangible items in your libraries” in her remarks to DLC in April 2004. She also said that the legacy collection of U.S. government documents consisted of “an estimated 2.2 million print publications totaling approximately 60 million pages.” A report of the 2004 GPO meeting of experts on digital preservation described the legacy collection as “U.S. government documents currently held in depositories, estimated to be about 2.2 million items (excluding microfiche).”

It is worth noting here that the term was applied to all paper; there was no singling-out of any documents that would have more historical value or importance. The key to inclusion within the definition of “legacy collection” was, apparently, that they were paper and were in FDLP libraries and were targets of digitization (and, as we will see in a moment, targets for discarding).


As noted above, the introduction of the phrase accompanied a plan to digitize the paper FDLP collections. GODORT referred to the initiative as “Digitizing Legacy Federal Documents Collections.” The GPO Strategic Vision described converting “printed legacy documents” into digital format. The Dissemination Implementation Plan enumerated priorities for digitization of the “Legacy Collection.” The purpose of the Experts Meeting was to address digitizing “the entire legacy collection of U.S. government documents.”

The government information community adopted that context along with the phrase. Every use of the phrase that I found was in the context of digitizing paper.

Why “Legacy”?

Why did ARL and Russell choose the term “legacy collection”? Since the use of the phrase was directly and explicitly tied to digitization of those collections, why not describe those collections as “analog materials” or “historical collections” or “paper collections” or even (ugh!) “tangible collections?” Why “legacy”?

The Merriam-Webster dictionary says that the word “legacy” was not used as an adjective until 1990. That use comes not from the libraries but from the computing world. It is used by IT managers to describe software or systems that are outdated and unwanted. Wikipedia says that it is often considered a “pejorative term” and is used to describe systems that are “potentially problematic.” And the New Oxford American Dictionary defines it as “software or hardware that has been superseded.” In practice, IT managers would like to stop supporting “legacy software” and discard it. Sound familiar?

It is, of course, possible that the choice of the term to describe the FDLP Historical Collections was not well thought out and no one intended to imply that the collections are problems that need to be discarded. But it is revealing that GPO’s own 2004 Strategic Vision statement not only used “legacy” to describe “printed documents,” but also said that GPO needed to reduce costs associated with the operation and maintenance of “stand alone, legacy computer systems.” This was not a mysterious, obscure word with an ambiguous meaning — even within the walls of GPO.

Legacy (adj.). Unwanted.

Thus, the use of the term “legacy” as an adjective to describe print FDLP collections reflects a particular attitude (one might even say a bias) about the FDLP Historical Collections. It defines the FDLP Historical Collections as out-of-date, unnecessary, and unwanted. Using this term pre-determines the fate of the collections. Those who use this term are expressly saying that they have already decided that they want to throw the collections away – even if they say that what they want is better access.

Using such terminology helps explain why the discussions about these collections have not focused on their intrinsic value, or their value to specific user communities, or the quality of the digital surrogates being used to replace (not supplement) them. Instead, the discussion has returned to a single question again and again and again: How many copies should we keep? – which is the wrong question.

Digitize and Discard

The phrase fits in well with ARL’s long-term advocacy of digitizing paper collections and then discarding them. See, for example, its 2008 report, in which it proposed “a small number of physical regional legacy collections,” and its 2010 report, in which it recommended that there should be “a distributed system for storage of print legacy collections that involves no more than 15 regionally distributed comprehensive print collections.” These recommendations to discard Historical Collections in order to reduce the number of paper copies in the FDLP are not supported with any evidence that such policies will either meet the needs of our communities or preserve the written record of the government.

Let me be clear. I am not an advocate of saving print collections for the sake of print collections. Tautologies are not useful for planning. But, in the same way, vague promises to enhance access through digitization are also not useful. Vague promises need to be backed up with procedures to minimize the risk of loss of information and long-term planning that provides adequate resources for preservation, access, and service. As James R. Jacobs and I have repeatedly argued (see endnotes), decisions about retention and discarding need to be premised on the needs of our communities and the ability of libraries to preserve and provide free access to the FDLP collections. Just labeling the collections as unwanted and out of date may be a clever way to try to persuade librarians to discard their collections without examining the outcomes of doing so. But labeling without evidence is not an application of Library and Information Science. It is rhetorical misdirection.

Libraries are free to digitize their collections (and they should!). If enhanced access is the goal, this can be done today without unnecessarily discarding a single document. But ARL and its supporters have been adamant that digitization must be linked to “flexibility … for the efficient management of the legacy collections” and to reducing the number of print copies by requiring only a “small number of physical regional legacy collections (print and microforms).” And some libraries are using digitization as an excuse and a technique for discarding.

A better term: FDLP Historical Collections

I suggest that librarians use the term “FDLP Historical Collections.”

“Historical” because these documents tell us something about the past. Indeed, these documents are also, in a very real sense, “historic” in that they are the unique official record of our democracy.

“Collections” (plural) because we have many separate collections – not one big one – and we do not have an accurate and complete inventory of holdings across all FDLP libraries that would allow us to call it a single “collection.”

Legacy (noun). Gift, Inheritance.

I think it is fine to use the word “legacy” as a noun when speaking of our historical collections because they have been handed down to us. They are more like a valuable inheritance than an unwanted copy of WordStar. Who will preserve and take care of this legacy? Only FDLP libraries have this as their mission. Only FDLP libraries are responsible for the stewardship of this legacy.

For us to discard those paper publications without ensuring the accurate and complete preservation of the information in them would be to discard a valuable inheritance and ignore our responsibility.


Words matter. Library professionals are supposed to be professional and should be clear and unambiguous when they choose their terminology. This is important when making plans for the future and it is even more important when the planning involves irreversible decisions. Librarians should reject the use of the term “legacy collection” when discussing the FDLP Historical Collections and challenge those who use it.

But choosing a different term is not enough. We should clearly articulate both the inherent value of the FDLP Historical Collections and their specific value to our designated communities.

The documents in the FDLP Historical Collections may not exist anywhere outside of FDLP libraries. Even Judy Russell had to admit that discarding paper collections without a clear preservation and access strategy can be a big mistake. In her remarks to ARL in 2003, Russell said:

Many years ago GPO turned over its historical collection to the National Archives and almost immediately we began to regret the absence of a tangible collection. We have decided to re-establish a comprehensive collection of tangible and electronic documents as a collection of last resort for the program, and the new organization will dedicate staff resources to that effort.

Unfortunately, there has, apparently, been little progress in rebuilding GPO’s paper collection as a Collection of Last Resort. Instead, GPO is actively promoting changes that will make it easier to discard more paper collections.

While individual documents or volumes may exist elsewhere, FDLP libraries have collections that put those individual documents in the context of their provenance. Although casual internet users may not understand the value of context and provenance, librarians do (or should) and researchers require it. Before FDLP libraries use digitization as an excuse and a technique for discarding these collections, librarians should insist on several essential criteria. My colleague James R. Jacobs has developed a preliminary checklist in his What Are We To Keep? (FAQ). Let’s think about that checklist and think carefully before we assign pejorative labels to our valuable legacy.


Association of Research Libraries. 2008. Future Directions for the Federal Depository Library Program (Dec 4, 2008).

Association of Research Libraries. 2010. Statement of Principles on the Federal Depository Library Program (October 2010).

Center for Research Libraries. 2004. Building Blocks of a National Print Preservation Network. Focus on Global Resources, Vol. 24, Num. 1 (Fall 2004).

Depository Library Council. 2004. Advice to the Public Printer (January 22, 2004).

Federal Depository Library Program. 2014. Future Roles and Opportunities: An FDLP Forecast Study Working Paper (March 28, 2014).

GODORT. 2002. Report: Digitization Of Government Information. Ad Hoc Committee on Digitization Of Government Information, Cathy Nelson Hartman Committee Chair (June 14, 2002).

GODORT. 2004. First Steering Committee Meeting Agenda. ALA Midwinter Conference, San Diego, Friday, January 9, 2004.

Jacobs, James A. and James R. Jacobs. 2013. The Digital-Surrogate Seal of Approval: a Consumer-oriented Standard. D-Lib Magazine (2013).

Jacobs, James A. 2015. “An alarmingly casual indifference to accuracy and authenticity.” What we know about digital surrogates. FreeGovInfo (March 1, 2015).

Jacobs, James A. 2015. Legacy collections. “Discussion of Government Document Issues” (25 Jun 2015).

Jacobs, James R. 2014. Why GPO’s proposed policy to allow Regionals to discard is a bad idea. FreeGovInfo (August 27, 2014).

Jacobs, James R. 2015. What are we to keep? Documents to the People (Spring 2015).

Jacobs, James R. 2015. What Are We To Keep? (FAQ). FreeGovInfo (April 30, 2015).

Rossmann, Brian W. 2005. Legacy Documents Collections: Separate the Wheat from the Chaff. DttP: Documents to the People Volume 33, No. 4 (Winter 2005).

Russell, Judith. 2003. Remarks by Judy Russell, 142nd ARL Membership Meeting, Federal Relations Luncheon (May 15, 2003).

Russell, Judy C. 2003. Information Dissemination Operations. Remarks by Judy C. Russell Superintendent of Documents Depository Library Conference/Fall Council Meeting October 20, 2003, Administrative Notes Vol. 24, no. 13 (November 15, 2003).

Russell, Judy C. 2004. Remarks of Superintendent of Documents Depository Library Conference St. Louis, Missouri (April 18, 2004).

U.S. Government Printing Office. 2004. A Strategic Vision for the 21st Century (Dec. 2004).

U.S. Government Printing Office. 2004. Report on the Meeting of Experts on Digital Preservation. U.S. Government Printing Office Washington, D.C. (March 12, 2004).

U.S. Government Printing Office. Office of Information Dissemination. 2005. Information Dissemination Implementation Plan: Priorities For Digitization Of Legacy Collection. Washington, D.C. (September 15, 2005).

What Are We To Keep? (FAQ)

This document is meant to accompany the article, “What are we to Keep?” by James R. Jacobs, Documents to the People (Spring 2015) p 13-19.


  • What is a Preservation Copy?

    Research that was prompted by JSTOR’s desire to determine how to guarantee that all of the printed material within its journals would remain available defined preservation copies as “clean copies that retain full information accuracy from the vantage point of the researcher” (Yano). Thus, when we think about “preservation copies,” we are looking to ensure that copies are available for the long term and that those copies are complete and accurate. “Information accuracy” implies a “perfect copy”: a copy that is as good as new. A preservation copy is, therefore, a “clean” copy that is quality-checked and repaired, if necessary, on a page-by-page basis.

  • Why do we need Preservation Copies?

    Even if we had perfect digital copies of paper documents, we would still need preservation paper copies for two reasons. First, there is evidence that digital documents degrade more rapidly than print material (Rosenthal), so it is necessary to have a paper copy that could be used to re-digitize. Second, digitization does not magically preserve paper; or, to put it another way, digital copies are not the same as print copies and may inherently lose information through the very act of reformatting to a new presentation.

  • Why do we need Access-Copies?

    Unless we have perfect, page-verified digitizations that are as complete, as accurate, and as easily usable as the original paper copies (Jacobs and Jacobs), users will inevitably need to go back to the original paper copy in order to get either the complete and accurate content or the functional usability of the original paper medium. Some libraries have already reported that digitization of paper copies has increased the demand for access to the paper copies. Additionally, some users and uses will require access to physical copies via Interlibrary Loan (ILL). ILL can only happen if there is a surplus of copies. As the number of copies approaches zero (scarcity), libraries will no longer be willing to lend through ILL. Therefore, it is imperative that there not be a dearth of geographically distributed copies.

  • Why do we need re-digitization copies?

    Unless we create perfect copies that adequately anticipate the future needs of users, we will need to create new digitizations in order to meet those future needs. (See “An alarmingly casual indifference to accuracy and authenticity” What we know about digital surrogates.)

What should I think about before discarding government documents?

“An alarmingly casual indifference to accuracy and authenticity.” What we know about digital surrogates

In a new article in Portal, Diana Kichuk examines the reliability and accuracy of digital text extracted from printed books in five digital libraries: the Internet Archive, Project Gutenberg, the HathiTrust, Google Books, and the Digital Public Library of America. She focuses particularly on the accuracy and utility of the digital text for reading in e-book formats and on the accuracy of metadata derived from extracted text.

This study, along with a couple of others cited below, is very relevant to the repeated calls by some within the Federal Depository Library Program to digitize and discard the historic FDLP paper collections. These studies, even though they do not focus on government publications, provide examples, data, and standards that should be reviewed before the depository community implements discarding policies that will have irreversible effects.

* * *

Kichuk’s article is well worth reading in its entirety as she identifies many problems with digital text created during digitization of paper books by OCR (Optical Character Recognition) technologies, and she gives specific examples. The two most important problems that she highlights are that digitized texts often fail to accurately represent the original, and that the metadata that is automatically created from such text is too often woefully inaccurate. These problems have real effects on libraries and library users. Readers will find it difficult to accurately identify and even find the books they are looking for in digital libraries and libraries will find it difficult to confidently attribute authenticity and provenance to digitized books.

Kichuk says that digitized text versions of print books are often unrecognizable as surrogates for the print book and it may be “misleading at best” to refer to them even as “equivalent” to the original. Although she only examined a small number of e-books (approximately seventy-five), she found “abundant evidence” of OCR problems that suggest to her the likelihood of widespread and endemic problems.

A 2012 report by the HathiTrust Research Center reinforces Kichuk’s findings. That study found that 84.9% of the volumes it examined had one or more OCR errors, 11% of the pages had one or more errors, and the average number of errors per volume was 156 (HathiTrust, Update on February 2012 Activities, March 9, 2012).

* * *

Most of the examples we have of current-generation digitization projects, particularly mass-digitization projects, provide access to digital “page images” (essentially pictures of pages) of books in addition to OCR’d digital text. So, to get a more complete picture of the state of digitization it is instructive to compare Kichuk’s study of OCR’d text to a study by Paul Conway of page images in the HathiTrust.

Fully one-quarter of the 1000 volumes examined by Conway contained at least one page image whose content was “unreadable.” Only 64.9% of the volumes examined were considered accurate and complete enough to be considered “reliably intelligible surrogates.” Presumably, that means more than 35% of the volumes examined were not reliable surrogates.

Conway’s study reinforces the findings of the Center for Research Libraries when it certified HathiTrust as a Trusted Digital Repository in 2011. (Full disclosure: I was part of the team that audited HT.) CRL said explicitly that, although some libraries will want to discard print copies of books that are in HT, “the quality assurance measures for HathiTrust digital content do not yet support this goal.”

Currently, and despite significant efforts to identify and correct systemic problems in digitization, HathiTrust only attests to the integrity of the transferred file, and not to the completeness of the original digitization effort. This may impact institutions’ workflow for print archiving and divestiture. (Certification Report on the HathiTrust Digital Repository).

* * *

Together, these reports provide some solid (if preliminary) data which should help libraries make informed decisions. Specifically, all these studies show that it would be risky to use digitized copies of FDLP historic collections as reliable surrogates for the original paper copies. That means it would be risky to discard original paper copies of documents simply because they had been digitized.

Although Conway suggests, as others have, that libraries (and users) may have to accept incomplete, inaccurate page images as a “new norm” and accept that they are not faithful copies, he also realizes that “questions remain about the advisability of withdrawing from libraries the hard-copy original volumes that are the sources of the surrogates.”

Kichuk goes further in her conclusions. She wisely envisions that the “uncorrected, often unreadable, raw OCR text” that most mass-digitization projects produce today will be inadequate for future, more sophisticated uses. She looks specifically to a future when users will want and expect ebooks created from digitized text. She warns that current digitization standards, coupled with insufficient funding, are not creating text that is accurate or complete enough to meet the needs of users in the near future. And she recognizes that librarians are not stepping up to correct this situation. She describes “an alarmingly casual indifference to accuracy and authenticity” of OCR’d text and says that this “willful blindness” to the OCR problem is suppressing any sense of urgency to remedy the problem.

She concludes from her small sample that there should be a more systematic review by the digital repository community prior to the development of a new digitized e-book standard, especially for metadata and text file formats.

I agree with Kichuk and Conway and CRL that more work needs to be done before libraries discard their paper collections. Librarians and their communities need to have a better understanding of the quality of page images and digitized text that digitization projects produce. With that in mind, James R. Jacobs and I addressed this very problem in 2013 and suggested a new standard for the quality of page images, which we call the “Digital-Surrogate Seal of Approval” (DSSOA):

Libraries that are concerned about their future and their role in the information ecosystem should look to the future needs of users when evaluating digitization projects.

FDLP libraries have a special obligation to the country to preserve the historic collections in their charge. It would be irresponsible to discard the complete, original record of our democracy and preserve only an incomplete, inaccurate record of it.

History of immigration: Map of US immigrant populations in 1903

The Slate Vault today highlighted a “data-packed” map of American immigration in 1903 from the annual report of the Commissioner-General of Immigration. The Vault always posts interesting and beautiful maps, images, etc. They also linked to a new-to-me site called Handsome Atlas that has some beautiful scans and visualizations of historic US atlases. Go and check them out.

But what they didn’t mention was that this Annual Report — technically titled the “Annual report of the Commissioner-General of Immigration to the Secretary of the Treasury for the fiscal year ended …” — is available in libraries around the country as it was distributed by the Federal Depository Library Program (FDLP) AND that the map “Race and occupation of immigrants by destination” is just one of the many maps, statistical tables, infographics, and photographs embedded in these annual reports. Stanford University Library, where I work, has the annual report going back to 1892!

And, yes, you can find this publication in Google Books, HathiTrust, and the Internet Archive, BUT you WON’T find any of the many foldout maps/infographics because they simply weren’t scanned.

A reader could use the map to see which proportion of the immigrant population of a state came from each of six “races or peoples”: “Teutonic,” “Keltic,” “Slavic,” “Iberic,” “Mongolic,” or Other. These designations echoed popular eugenic racial ideologies of the time, which used quasi-scientific theories to lump people into basic groups of origin understood to share common characteristics. The bars showing percentages of immigrants in each state color-code the newcomers according to “race or people,” so that these can be seen at a glance, then use text to explain which countries these “Mongolians” or “Slavics” came from.

The map was put together as part of an annual report made for the Commissioner-General of Immigration, and printed by the Government Printing Office in 1903.

via History of immigration: Map of United States immigrant populations in 1903.