Free Government Information (FGI) is a place for initiating dialogue and building consensus among the various players (libraries, government agencies, non-profit organizations, researchers, journalists, etc.) who have a stake in the preservation of and perpetual free access to government information. FGI promotes free government information through collaboration, education, advocacy and research.

“An alarmingly casual indifference to accuracy and authenticity.” What we know about digital surrogates

In a new article in Portal, Diana Kichuk examines the reliability and accuracy of digital text extracted from printed books in five digital libraries: the Internet Archive, Project Gutenberg, the HathiTrust, Google Books, and the Digital Public Library of America. She focuses particularly on the accuracy and utility of the digital text for reading in e-book formats and on the accuracy of metadata derived from extracted text.

This study, along with a couple of others cited below, is very relevant to the repeated calls by some within the Federal Depository Library Program to digitize and discard the historic FDLP paper collections. These studies, even though they do not focus on government publications, provide examples, data, and standards that the depository community should review before it implements discard policies that will have irreversible effects.

Kichuk’s article is well worth reading in its entirety, as she identifies many problems with digital text created during digitization of paper books by OCR (Optical Character Recognition) technologies, and she gives specific examples. The two most important problems that she highlights are that digitized texts often fail to accurately represent the original, and that the metadata that is automatically created from such text is too often woefully inaccurate. These problems have real effects on libraries and library users. Readers will find it difficult to accurately identify and even find the books they are looking for in digital libraries, and libraries will find it difficult to confidently attribute authenticity and provenance to digitized books.

Kichuk says that digitized text versions of print books are often unrecognizable as surrogates for the print book and it may be “misleading at best” to refer to them even as “equivalent” to the original. Although she only examined a small number of e-books (approximately seventy-five), she found “abundant evidence” of OCR problems that suggest to her the likelihood of widespread and endemic problems.

A 2012 report by the HathiTrust Research Center reinforces Kichuk’s findings. That study found that 84.9 percent of the volumes it examined had one or more OCR errors, 11 percent of the pages had one or more errors, and the average number of errors per volume was 156 (HathiTrust, Update on February 2012 Activities, March 9, 2012).

Most of the examples we have of current-generation digitization projects, particularly mass-digitization projects, provide access to digital “page images” (essentially pictures of pages) of books in addition to OCR’d digital text. So, to get a more complete picture of the state of digitization it is instructive to compare Kichuk’s study of OCR’d text to a study by Paul Conway of page images in the HathiTrust.

Fully one-quarter of the 1000 volumes examined by Conway contained at least one page image whose content was “unreadable.” Only 64.9% of the volumes examined were considered accurate and complete enough to be considered “reliably intelligible surrogates.” Presumably, that means more than 35% of the volumes examined were not reliable surrogates.

Conway’s study reinforces the findings of the Center for Research Libraries when it certified HathiTrust as a Trusted Digital Repository in 2011. (Full disclosure: I was part of the team that audited HT.) CRL said explicitly that, although some libraries will want to discard print copies of books that are in HT, “the quality assurance measures for HathiTrust digital content do not yet support this goal.”

Currently, and despite significant efforts to identify and correct systemic problems in digitization, HathiTrust only attests to the integrity of the transferred file, and not to the completeness of the original digitization effort. This may impact institutions’ workflow for print archiving and divestiture. (Certification Report on the HathiTrust Digital Repository).

Together, these reports provide some solid (if preliminary) data which should help libraries make informed decisions. Specifically, all these studies show that it would be risky to use digitized copies of FDLP historic collections as reliable surrogates for the original paper copies. That means it would be risky to discard original paper copies of documents simply because they had been digitized.

Although Conway suggests, as others have, that libraries (and users) may have to accept incomplete, inaccurate page images as a “new norm” and accept that they are not faithful copies, he also realizes that “questions remain about the advisability of withdrawing from libraries the hard-copy original volumes that are the sources of the surrogates.”

Kichuk goes further in her conclusions. She wisely envisions that the “uncorrected, often unreadable, raw OCR text” that most mass-digitization projects produce today will be inadequate for future, more sophisticated uses. She looks specifically to a future when users will want and expect e-books created from digitized text. She warns that current digitization standards, coupled with insufficient funding, are not creating text that is accurate or complete enough to meet the needs of users in the near future. And she recognizes that librarians are not stepping up to correct this situation. She describes “an alarmingly casual indifference to accuracy and authenticity” of OCR’d text and says that this “willful blindness” to the OCR problem is suppressing any sense of urgency to remedy the problem.

She concludes from her small sample that there should be a more systematic review by the digital repository community prior to the development of a new digitized e-book standard, especially for metadata and text file formats.

I agree with Kichuk and Conway and CRL that more work needs to be done before libraries discard their paper collections. Librarians and their communities need to have a better understanding of the quality of page images and digitized text that digitization projects produce. With that in mind, James R. Jacobs and I addressed this very problem in 2013 and suggested a new standard for the quality of page images, which we call the “Digital-Surrogate Seal of Approval” (DSSOA).

Libraries that are concerned about their future and their role in the information ecosystem should look to the future needs of users when evaluating digitization projects.

FDLP libraries have a special obligation to the country to preserve the historic collections in their charge. It would be irresponsible to discard the complete, original record of our democracy and preserve only an incomplete, inaccurate record of it.

Digital Surrogate Seal of Approval: a Consumer-oriented Standard

“The Digital-Surrogate Seal of Approval: a Consumer-oriented Standard.” James A. Jacobs, University of California San Diego and James R. Jacobs, Stanford University. D-Lib Magazine, March/April 2013, Volume 19, Number 3/4. Also available in the Stanford Digital Repository and the University of California Escholarship Repository.


We propose the “Digital-Surrogate Seal of Approval” (DSSOA) as a simple way of describing digital objects created from printed books and other non-digital originals as surrogates for the analog original. The DSSOA denotes that a digitization accurately and completely replicates the content and presentation of the original. It can be used to express an intended goal during the planning stages of digitization and to guarantee the quality of existing digital surrogates. The DSSOA Criteria can be used to evaluate individual digital objects or entire completed collections. DSSOA is independent of production technologies and methodologies and focuses instead on the perspective of consumers — including libraries that rely on digital surrogates.


The Digital-Surrogate Seal of Approval

[Update: The article is also available in the Stanford Digital Repository and the University of California Escholarship Repository.]

James and I are happy to announce that our new article appears in the current edition of D-Lib Magazine:

In the last few years, there have been a series of articles, reports and proposals that rely on the promises of digitization to address issues of physical space, cost control, access, and collection management for FDLP libraries. One of the reasons we created this Seal of Approval standard is to provide a clear, consistent way to help evaluate some of these promises of digitization.

There are those who continue to insist that we have too many copies of federal documents, that preserving those copies is too expensive, that GPO is being unreasonable when it does not allow libraries to discard materials, and so forth. Although proposals to digitize FDLP collections are often couched in terms of enhancing access, libraries can digitize and enhance access without discarding paper copies. The underlying motivation of such proposals is often explicitly to weed the paper collections and, when not explicit, it is always implied. These proposals raise many questions in our minds. For example:

  • Will digitizations include digital text as well as images and will the text be accurate and complete and (re)usable?
  • Will the digitizations be readable and usable on modern e-book devices?
  • Will digitizations create digital objects that are as good as the originals, or worse, or better?
  • Will digitizations be deposited into Trusted Digital Repositories to ensure their long-term preservation and access?
  • Will the library that contributes the original be in control of the digital copy, or will control be ceded to large mega-libraries?
  • Will the digitizations include adequate metadata for management, preservation, and discovery?
  • Will libraries develop and maintain discovery and delivery mechanisms that address the special requirements of federal documents?
  • Will libraries provide adequate digital services for the digital collections?
  • Will any cost savings be applied to collection management and services for these collections or will the cost savings be redirected to other collections and services?
  • Will there actually be cost savings if we adequately address the above questions?

But there is one other question that is more important than all of the above. The question we must ask first is: Are the digitizations accurate and complete? If they are not, the other questions become moot or irrelevant. The DS-SOA is intended to help us answer that question. The DS-SOA denotes that a digitization accurately and completely replicates the content and presentation of the original.

The standard is designed to be easily understood and usable, not just by digitization-specialists, but also by library administrators, collection managers, service providers, preservation officers, business managers, and others who are responsible for library collections and services. It is also meant to help communicate clearly to end users the accuracy and completeness of the digitizations libraries provide to them.

We believe that libraries fulfill a unique role in society, one that is different from that of producers, agencies, publishers, authors, and vendors. We believe that the value of libraries is dependent upon the collections we select, acquire, preserve, and maintain and the services that we provide for those collections. The FDLP collections are unique; they provide a primary-source, historical record of our democracy. The FDLP print collections are not “legacy collections,” as they are often called by those who wish to discard them (used as an adjective, “legacy” means “outdated” and “unwanted”). They are, however, our legacy: as a noun, “legacy” means bequest, heritage, endowment, gift, and birthright. The DS-SOA is a simple tool that libraries can use to ensure the value of their digital collections and communicate that value to library users. We believe that failure to ensure completeness and accuracy of our digital collections will reduce the value of libraries. We believe that replacing paper-and-ink books with digital copies without first ensuring and documenting that those copies are complete and accurate representations of the original would be tantamount to redacting the historical record of our democracy.