In a new article in Portal, Diana Kichuk examines the reliability and accuracy of digital text extracted from printed books in five digital libraries: the Internet Archive, Project Gutenberg, the HathiTrust, Google Books, and the Digital Public Library of America. She focuses particularly on the accuracy and utility of the digital text for reading in e-book formats and on the accuracy of metadata derived from extracted text.
- Kichuk, Diana. “Loose, Falling Characters and Sentences: The Persistence of the OCR Problem in Digital Repository E-Books.” Portal: Libraries and the Academy 15, no. 1 (2015): 59–91. doi:10.1353/pla.2015.0005.
This study, along with a couple of others cited below, is very relevant to the repeated calls by some within the Federal Depository Library Program to digitize and discard the historic FDLP paper collections. These studies, even though they do not focus on government publications, provide examples, data, and standards that are critical to review before the depository community implements discarding policies that will have irreversible effects.
Kichuk’s article is well worth reading in its entirety, as she identifies many problems with digital text created during digitization of paper books by OCR (Optical Character Recognition) technologies, and she gives specific examples. The two most important problems she highlights are that digitized texts often fail to accurately represent the original, and that the metadata automatically created from such text is too often woefully inaccurate. These problems have real effects on libraries and library users. Readers will find it difficult to accurately identify and even find the books they are looking for in digital libraries, and libraries will find it difficult to confidently attribute authenticity and provenance to digitized books.
Kichuk says that digitized text versions of print books are often unrecognizable as surrogates for the print book and it may be “misleading at best” to refer to them even as “equivalent” to the original. Although she only examined a small number of e-books (approximately seventy-five), she found “abundant evidence” of OCR problems that suggest to her the likelihood of widespread and endemic problems.
A 2012 report by the HathiTrust Research Center reinforces Kichuk’s findings. That study found that 84.9 percent of the volumes it examined had one or more OCR errors, 11 percent of the pages had one or more errors, and the average number of errors per volume was 156 (HathiTrust, Update on February 2012 Activities, March 9, 2012).
Most of the examples we have of current-generation digitization projects, particularly mass-digitization projects, provide access to digital “page images” (essentially pictures of pages) of books in addition to OCR’d digital text. So, to get a more complete picture of the state of digitization it is instructive to compare Kichuk’s study of OCR’d text to a study by Paul Conway of page images in the HathiTrust.
- Conway, Paul. “Preserving Imperfection: Assessing the Incidence of Digital Imaging Error in HathiTrust,” 2013.
Fully one-quarter of the 1000 volumes examined by Conway contained at least one page image whose content was “unreadable.” Only 64.9% of the volumes examined were considered accurate and complete enough to be considered “reliably intelligible surrogates.” Presumably, that means more than 35% of the volumes examined were not reliable surrogates.
Conway’s study reinforces the findings of the Center for Research Libraries when it certified HathiTrust as a Trusted Digital Repository in 2011. (Full disclosure: I was part of the team that audited HT.) CRL said explicitly that, although some libraries will want to discard print copies of books that are in HT, “the quality assurance measures for HathiTrust digital content do not yet support this goal.”
“Currently, and despite significant efforts to identify and correct systemic problems in digitization, HathiTrust only attests to the integrity of the transferred file, and not to the completeness of the original digitization effort. This may impact institutions’ workflow for print archiving and divestiture.” (Certification Report on the HathiTrust Digital Repository).
Together, these reports provide some solid (if preliminary) data which should help libraries make informed decisions. Specifically, all these studies show that it would be risky to use digitized copies of FDLP historic collections as reliable surrogates for the original paper copies. That means it would be risky to discard original paper copies of documents simply because they had been digitized.
Although Conway suggests, as others have, that libraries (and users) may have to accept incomplete, inaccurate page images as a “new norm” and accept that they are not faithful copies, he also realizes that “questions remain about the advisability of withdrawing from libraries the hard-copy original volumes that are the sources of the surrogates.”
Kichuk goes further in her conclusions. She wisely envisions that the “uncorrected, often unreadable, raw OCR text” that most mass-digitization projects produce today will be inadequate for future, more sophisticated uses. She looks specifically to a future when users will want and expect ebooks created from digitized text. She warns that current digitization standards, coupled with insufficient funding, are not creating text that is accurate or complete enough to meet the needs of users in the near future. And she recognizes that librarians are not stepping up to correct this situation. She describes “an alarmingly casual indifference to accuracy and authenticity” of OCR’d text and says that this “willful blindness” to the OCR problem is suppressing any sense of urgency to remedy the problem.
She concludes from her small sample that there should be a more systematic review by the digital repository community prior to the development of a new digitized e-book standard, especially for metadata and text file formats.
I agree with Kichuk and Conway and CRL that more work needs to be done before libraries discard their paper collections. Librarians and their communities need a better understanding of the quality of page images and digitized text that digitization projects produce. With that in mind, James R. Jacobs and I addressed this very problem in 2013 and suggested a new standard for the quality of page images, which we call the “Digital Surrogate Seal of Approval” (DSSOA):
- Jacobs, James A., and James R. Jacobs. “The Digital-Surrogate Seal of Approval: A Consumer-Oriented Standard.” D-Lib Magazine 19, no. 3/4 (March 2013). doi:10.1045/march2013-jacobs.
Libraries that are concerned about their future and their role in the information ecosystem should look to the future needs of users when evaluating digitization projects.
FDLP libraries have a special obligation to the country to preserve the historic collections in their charge. It would be irresponsible to discard the complete, original record of our democracy and preserve only an incomplete, inaccurate record of it.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Why are we referring to the OCR text as the object surrogate? Normally you would use the page image as the surrogate for preservation purposes. The text derived from OCR is simply an augmentation to aid machine-readability and is typically stored separately from your master files (either embedded in an access file or indexed in a repository). An end-user will likely never encounter the OCR text. He or she will read the page image. I don’t see how the OCR text would have any other application besides data mining and/or keyword indexing, really. You could certainly make the argument that inaccurate OCR negatively affects those operations and that will be a big problem in the future, but that’s a different discussion. You (or rather Kichuk) are drawing a comparison with the state of the paper copy. Just because your OCR is off doesn’t necessarily mean there is a flaw in your digital page image. The page image can still be 100% human readable, even if it’s only 84% computer readable. That happens all the time, especially if you’re capturing antiquated typefaces. And if that is the case, the digital image should be an acceptable surrogate for the paper; it should surpass it, even, since paper, after all, is 0% computer readable.
Thank you for your comment, Erik.
I hope you will not judge the value of Kichuk’s paper solely on my very brief summary of some of her arguments. She spells out her arguments in detail in her 34-page article. Although I do not wish to speak for Kichuk, if you read her complete paper you will find that she makes a point of saying that some digital repositories do, indeed, deliver “e-book” versions of digitized books using the OCR’d text. See her Table 2 on pages 65 and 66, for example. She also notes that “Not all of the repositories discussed include downloadable e-book text files derived from OCR.” But she explicitly “focuses on e-book text file formats and metadata, and the online reading experience.” She further notes that text versions of public domain digitized books are finding their way into online ebook stores like B&N and Amazon. Although you may not think digitized text is being or should be presented as a surrogate, that is exactly what is being done in some cases, and that is what Kichuk addresses in her article.
This is, actually, a very complex subject and no one–even in 34 pages–can present all the subtle details. Kichuk did, in my opinion, a very good job of digging into the details of the problem of libraries that do, indeed, present digitized-text ebooks to the public. She even quotes the policies and caveats that some libraries attach to those ebooks. (For example, she quotes the Project Gutenberg warning statement about the quality of its MOBI formats: “Some or all Project Gutenberg Mobipocket files may be buggy or may not work altogether” [page 74]).
It is also important to recognize that Kichuk criticizes the state of OCR for a second reason. She makes a detailed argument about the problem of relying (as is done in some cases) on bibliographic metadata that is created by automated processes that use incorrect and incomplete OCR’d text. This leads to misattribution, misidentification, and other problems of authenticity and provenance, issues that are particularly important to documents librarians. We are lucky that HT has a project that is working on this: http://www.hathitrust.org/usgovdocs_registry. That project even notes that “The quality of metadata for US federal materials is inconsistent, which has the potential to lead to false matching and/or metadata duplication,” and my guess is that some (though surely not all) of that problem relates to the problem that Kichuk identifies. Incidentally, the title of Kichuk’s paper comes from a paper by Geoffrey Nunberg. Nunberg, you may remember, famously criticized the state of metadata in the Google Books project. Kichuk notes that “It appears that Google Books reacted to Nunberg’s stinging criticisms with a series of one-off corrections instead of overhauling its metadata standards.”
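To make the mechanism concrete, here is a small sketch of my own (not Kichuk’s method, and not how the HT registry actually works); the titles, the OCR confusions, and the 0.95 threshold are all invented for illustration:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough character-level similarity between two normalized strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical catalog title vs. the same title harvested from raw OCR text.
catalog_title = "Statistical Abstract of the United States 1952"
ocr_title = "Statistica1 Ab8tract of thc Unitcd Statcs I952"  # typical OCR confusions

score = similarity(catalog_title, ocr_title)
print(f"similarity = {score:.2f}")

# A strict matcher treats these as two different works, so an automated registry
# ends up with duplicate (or misattributed) records for the same volume.
THRESHOLD = 0.95
print("match" if score >= THRESHOLD else "no match -> duplicate or misattributed record")
```

Loosen the threshold and you trade duplication for false matches, which is exactly the trade-off the registry note describes.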
I think I do understand your argument that “normally” many libraries do indeed think of digitized text as a complement or supplement to digital page images and not as a stand-alone product. To the extent that that is the case, it presents a slightly different set of issues that I do not think Kichuk addresses directly (again: no one can address every issue in a short article).
So, let me make some comments of my own here about that. Please do not attribute any of what follows to Kichuk; these are my own observations.
– As I said in my post, I think it very, very important for libraries and their communities to have a better understanding than we have now of the quality of page images and digitized text that digitization projects produce. If libraries want to provide inaccurate, incomplete, unchecked digital text, they should say so up front. (Kichuk does, in fact, suggest that we should call OCR text “RAW text” not full text as a starting place for being more transparent.)
– Libraries may be content to provide inaccurate OCR text because it provides so much better access than no digital text, but users will increasingly want to use that text for other purposes. Textual analysis, computational analysis, “distant reading,” dynamic corpus building, translation services, automatic subject classification, converting statistical tables into spreadsheets, and, yes, ebooks; these are just a few of the uses of digital texts that are already becoming more and more common every day. We will see much more in the future. There are technical work-arounds for inaccurate and incomplete text in many of these areas, but it is a simple fact that the better text we provide, the better those applications will be. We are not doing our users a favor when we make excuses for a lousy digital object by saying “we didn’t mean for you to use it that way.” I would suggest that libraries should be looking forward to the uses that our communities will want to make of digitizations in the future. When librarians compare current-generation digitizations with the past and say that this is better than what we had, they are looking backwards, not forwards.
– I don’t know about you, but I use digitized text every day in my work. You may not think I should be trying to do that, but when a library provides a digitized book and it has eye-readable page images, I want and expect good, accurate digital text as part of that file. I want to be able to accurately search and find text. You might be surprised at how often that simple task fails (see the small sketch after these points). Kichuk would not be surprised. I also want to cut and paste text, but that simple task is often either impossible or yields gibberish that is unusable. To me, this shows that the library does not care about the digital objects it is providing. That is not (or should not be) the response a library wants to get from its community of users.
– Finally, and most importantly, my main point above (and, let me say again, this was not Kichuk’s point) was that FDLP libraries should not be discarding paper copies of our historical documents collections just because they have been digitized, particularly when the digitizations we are creating are not really even good surrogates. This is why I made specific reference to the accuracy studies of page images in digital libraries. What we know about digital surrogates (to answer the rhetorical question in the title of this post) is that they are incomplete and inaccurate. As I said above, it would be irresponsible to discard the complete, original record of our democracy and preserve only an incomplete, inaccurate record of it.
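Here is the small sketch I promised above. It is my own toy illustration (the sample sentences are invented) of how even the simplest downstream use, an exact-match search, quietly fails on raw OCR while the page image underneath may be perfectly readable to a human:

```python
# Toy illustration: an exact-match search misses a hit in raw OCR text even
# though a human reading the page image would see the phrase plainly.
clean_text = "The committee shall report to Congress annually."
raw_ocr = "Tbe cornmittee sha11 report to C0ngress annua1ly."  # typical OCR confusions

query = "committee shall report"
print("found in clean text:", query in clean_text)  # True
print("found in raw OCR:   ", query in raw_ocr)     # False: the hit is silently lost
```

Every one of the uses listed above (corpus building, classification, translation, and so on) inherits this kind of silent loss.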
Hi Erik, You should also read Conway’s article (cited above), as he examined page images, not just OCR:
“Fully one-quarter of the 1000 volumes examined by Conway contained at least one page image whose content was “unreadable.” Only 64.9% of the volumes examined were considered accurate and complete enough to be considered “reliably intelligible surrogates.” Presumably, that means more than 35% of the volumes examined were not reliable surrogates.”
And, as Jim points out, *all* readers will in fact encounter OCR text either directly or indirectly. Kichuk, I think, did a pretty good job of showing how inaccurate OCR negatively affects the entire ecosystem, from metadata creation to user access.
Thank you both for addressing my comment (you are two different people, correct?). I’m sorry for the delay in replying. I neglected to check the “Notify me of followup comments via email” option on this comments widget. It seems safe to do that, though, since the comments on this site are every bit as informative as the posts and are thankfully free of spam. You run a tight ship!
I think the problem presented in that Conway article is the more concerning of the two, and kind of alarming too (35% unreliable!?). As OCR software continues to improve, and it has pretty steadily over the past 15 years, we could potentially overcome the issues raised in Kichuk’s article by just doing OCR better. If the page image is intact, the possibility of reading it better still exists. But if the page image itself is flawed or is unreadable, improvement is impossible, and that’s a real preservation problem.
I found the Kichuk article to be a perfectly sound piece of scholarship and her argument is correct. I might have left out the praise and admiration from my comment for the sake of brevity. Also, it’s hard not to troll just a little bit in blog comments, and if it provokes such an excellent excursus from James A. then I’d say it was well worth it.
But as I was saying, I don’t dispute Kichuk’s claim that lousy full text exists and that it’s ruining metadata. 95% accuracy is usually considered good OCR for a text, and that’s not very good at all. Also, I’ve tried and failed several times to extract metadata programmatically from a text set. It doesn’t work to my liking, and I call your work into question if it works for yours. My issue is how she frames her argument as a preservation matter. To my mind this is about access. Remember the old preservation vs. access, archivist vs. librarian divide? Well, it’s back! And in digital form. For preservation we have very clearly defined standards and best practices which have been in development over the better part of a generation. There are rules around image quality, metadata collection, file format usage, and emulation which, if followed, will ensure that a digital object will remain usable and available for enhancements well into the future. Any digital repository worth its salt ought to have master files on disc somewhere, not necessarily available for public download, that meet the acceptable preservation standards.
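Just to spell out what I mean about 95% not being very good, here is some back-of-the-envelope arithmetic; the page and volume sizes are round-number assumptions, not measurements:

```python
# Back-of-the-envelope: what 95% character accuracy means in practice.
accuracy = 0.95
chars_per_page = 2000    # assumption: a fairly dense printed page
pages_per_volume = 300   # assumption: a typical monograph

errors_per_page = (1 - accuracy) * chars_per_page
errors_per_volume = errors_per_page * pages_per_volume

print(f"~{errors_per_page:.0f} character errors per page")        # ~100
print(f"~{errors_per_volume:,.0f} character errors per volume")   # ~30,000
```

That is a lot of wrong characters for something we casually label “full text.”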
Now access is a different story. There are no rules around creation of derivative access files (your pdfs, epubs, and so forth). It’s just whatever your user community needs and expects. There is a lush and vigorous conversation to be had about what standards for good access ought to be. And I think Kichuk’s article belongs in that discussion. If researchers and government information librarians require better RAW text accuracy to carry out their work, then it is right and good that they ask for it. But they will have to argue that the benefits justify the costs, because perfection costs. Heck, near-perfection costs.
Preservation standards cannot be compromised upon. Access standards maybe can be. It depends.
Thanks for your follow up response, Erik, and for your kind comments. And, yes, there are two of us! (James A. = “jim” and James R. = “James”)
I agree that the figures Conway came up with are troubling, particularly when libraries discard books based on an (often very flawed) assumption that there is a complete and accurate digital surrogate (see CRL comment on HT above).
I agree completely with your comments about the need to focus on the needs of our user communities. But I wouldn’t call this about “access”; I would call it about the functionality of the digital objects. I wouldn’t call this a conflict of “preservation vs. access” either. We can preserve anything, even lousy digitizations, and we can provide “access” to any digital object, even lousy ones. I think we agree that deciding what functionality our communities need is the key to creating the right digital objects so that we can provide access to them and preserve them. But we also need to be aware of and plan for future functionality and needs.
What do I mean by “functionality”? Kichuk describes the functionality of “a better e-book” in order to reach the goal of “preservation of our print cultural heritage.” I don’t want to speak for her, but I think she is suggesting that we need digital surrogates that are as functional in the digital-tablet and ebook-reader age as the original print books were in the paper-and-ink age. That is a reasonable goal and a good idea. She focuses on the accuracy and completeness of OCR text as a way of doing that.
But what about some other already pretty common functionalities such as computational analysis, dynamic corpus building, and using statistical tables in spreadsheets (to repeat a few of the things I mentioned in my first comment)? We need digital text that is accurate and complete and machine-actionable for those things.
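To take just the spreadsheet example, here is a sketch of my own, with made-up numbers, showing why “machine-actionable” means the digits have to survive OCR intact:

```python
import csv, io

# Hypothetical OCR output of a small two-column statistical table.
# The last row contains typical OCR confusions ("l95O" for "1950", "l,234" for "1,234").
ocr_table = """\
Year  Reports_filed
1948  1,102
1949  1,187
l95O  l,234
"""

rows = (line.split() for line in ocr_table.strip().splitlines()[1:])
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["year", "reports_filed"])

for year, count in rows:
    try:
        writer.writerow([int(year), int(count.replace(",", ""))])
    except ValueError:
        # The garbled row cannot be converted to numbers and is lost; a different
        # confusion (say, 8 read as 3) would convert cleanly and silently corrupt the data.
        print(f"could not parse row: {year!r} {count!r}")

print(out.getvalue())
```

Either outcome, a dropped row or a silently wrong number, makes the resulting “spreadsheet” worse than useless for statistical work.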
So, I would go even further than Kichuk does and suggest that we need to anticipate the evolving needs of our communities. I worry that too many digitization projects evaluate the quality of digitizations by comparing them to print. In my opinion, this is looking backwards. When you compare digitization with paper it is easy to conclude that any digitization is functionally better than paper (remote access, keyword searching, etc.). But I think we should look to the future and ask, How can we create digital objects that will meet the future needs of our communities? We don’t need to predict the future to do this. In the case of text, we only need to create digital text objects that are accurate and complete and flexible. Doing this will make the digital objects we create today usable in the future even as technologies and community expectations evolve. We can’t assume that we can digitize once and be done, especially given the many flaws in our current digital surrogates.
I totally agree with you that, to do that, we should demonstrate the benefits and justify the costs. The simplest elevator pitch is: “Do it once and do it right.”
I also want to elaborate a bit on a technical issue you bring up. I agree that OCR continues to improve. (Although I think it needs to improve more than most of us admit. The “accuracy” counts we get from digitization software are themselves often inaccurate or misleading. They may, for example, simply omit Named Entities – rivers, people, places – and numeric values in their evaluation of accuracy. Think of how important those are in government documents!) But I disagree with you on a couple of conclusions you draw about that.
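Before I get to those, let me illustrate what I mean about misleading “accuracy” counts with a toy example of my own (not a description of any vendor’s actual metric): a word-accuracy score that quietly skips numeric tokens looks far better than one that counts every word.

```python
# Toy word-accuracy metric, computed two ways: counting every token, and
# skipping numeric tokens the way some evaluation schemes effectively do.
reference = "Appropriations for 1953 totaled 4,217 million dollars".split()
ocr_output = "Appropriations for 1938 totaled 4,2l7 million dollars".split()

def word_accuracy(ref, hyp, skip_numeric=False):
    pairs = [(r, h) for r, h in zip(ref, hyp)
             if not (skip_numeric and any(c.isdigit() for c in r))]
    return sum(r == h for r, h in pairs) / len(pairs)

print(f"counting all tokens:     {word_accuracy(reference, ocr_output):.0%}")       # 71%
print(f"skipping numeric tokens: {word_accuracy(reference, ocr_output, True):.0%}")  # 100%
```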
Although it is certainly technically true that someone could re-OCR badly OCR’d pages when the technology improves, I think it would be unwise for us to rely on this as reason to put up with bad OCR for two reasons.
A) Since the problems we see with OCR (and image quality) are created by choices made during digitization by projects that either did not have (or chose not to use) adequate resources to do the job right in the first place, it would (I think) be overly optimistic to believe that they will have (or be willing to spend) additional resources to fix the problems created by those original choices. I would be more optimistic about this if projects explicitly made such corrections and cleanup part of their long-term planning process, but how many do that?
B) Even if someone is willing to correct bad text, they may not be able to get better results with better software. Why? Because some OCR is bad because the original page image is of low quality and the OCR is already as good as it will get from that flawed image. (Why would a digitization project accept a low quality image? There are many reasons, but, usually, they all come down to cost. For example: to save money, a mass digitization project may use a single image standard even though different standards should be used for originals of different types, sizes, quality, damage, and so forth. Or, a project may choose a low-quality image standard intentionally in order to save money on hardware or software or even storage capacity. Or, a project may have no choice if it is digitizing from poorly-shot microfilm.)
I disagree a bit with your characterization of the “preservation vs. access” divide. We do indeed have some good standards for preserving digital bits. But, just as importantly, we also have some good “preservation” standards that are based on “access.” These are standards for creating digital objects that are designed to have certain kinds of functionality or intended uses. See, most particularly, the Federal Agencies Digitization Guidelines Initiative (FADGI), its most recent proposed standards for digital text, and its guidelines for the different “Content Categories & Digitization Objectives” for “Historical Printed Matter.” We could use more standards, as you suggest. (That is why James and I wrote the Digital Surrogate Seal of Approval standard.) But I think that the problem is not just a lack of agreement on how to make good digitizations; it is more often a “willful blindness” to the OCR problem and “an alarmingly casual indifference to accuracy and authenticity.”