Accuracy of Digitization
The Update on February 2012 Activities of the HathiTrust reports on research being done by the HathiTrust Research Center (HTRC) to quantify occurrences of Optical Character Recognition (OCR) errors in the HathiTrust corpus. OCR is the technology that converts a scanned image into text that can be searched and analyzed. Members of the HTRC examined about 256,000 non-Google-digitized volumes from HathiTrust using a clever algorithm that compared OCR text to a dictionary of known words. Using a supercomputer and a set of rules verified by a human expert, they found that 84.9 percent of the volumes examined (217,754 of 256,416) had one or more OCR errors and that 11 percent of the pages (7,745,034 of 69,297,000) had one or more errors. The average number of errors per volume was 156.
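To make the dictionary-comparison idea concrete, here is a minimal Python sketch. It is not the HTRC's algorithm, whose rules are not described here. It assumes a tiny hypothetical word list (`KNOWN_WORDS`) and flags any alphabetic token that is not in it, then counts affected pages and volumes. The function names and the sample data are illustrative only.

```python
import re

# Hypothetical, tiny word list for illustration; a real check would load
# a full dictionary and would also need rules for names, abbreviations, etc.
KNOWN_WORDS = {"the", "census", "reports", "population", "of", "each", "state"}

# Only alphabetic tokens are checked; digits and punctuation are skipped.
TOKEN_RE = re.compile(r"[A-Za-z]+")


def suspect_tokens(page_text, known_words=KNOWN_WORDS):
    """Return the alphabetic tokens on a page that are not in the word list."""
    return [t for t in TOKEN_RE.findall(page_text) if t.lower() not in known_words]


def summarize(volumes, known_words=KNOWN_WORDS):
    """Count suspected OCR errors per page and per volume.

    `volumes` maps a volume id to a list of page strings.
    """
    volumes_with_errors = pages_with_errors = total_errors = total_pages = 0
    for pages in volumes.values():
        volume_errors = 0
        for page in pages:
            n = len(suspect_tokens(page, known_words))
            total_pages += 1
            if n:
                pages_with_errors += 1
            volume_errors += n
        if volume_errors:
            volumes_with_errors += 1
        total_errors += volume_errors
    return {
        "volumes": len(volumes),
        "volumes_with_errors": volumes_with_errors,
        "pages": total_pages,
        "pages_with_errors": pages_with_errors,
        "errors_per_volume": total_errors / len(volumes) if volumes else 0.0,
    }


if __name__ == "__main__":
    sample = {
        "vol1": [
            "The census reports the population of each state",
            "Tlie census rcports the popnlation",  # simulated OCR errors
        ],
        "vol2": ["The population of each state"],
    }
    print(summarize(sample))
```

Note that the token pattern in this sketch skips digits entirely, so a misread number would never be flagged. A word list offers nothing to check a number against.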
As we at FGI have argued here before, we believe that it is essential to take OCR accuracy and error rates into account when digitizing paper collections. This is particularly important when digitizing books that contain statistical tables, since it is harder for current OCR technologies to accurately convert image scans to numbers than to convert them to text. It is also harder to evaluate the accuracy of such conversions: you can’t use a dictionary of known statistics the way you can use a dictionary of known words. (The HTRC study apparently did not examine the accuracy of statistical table conversions.) Because of the large volume of such information in government publications, this is a very important issue as we collectively try to digitize our paper collections, evaluate the accuracy and usability of the resulting digital copies, and determine how many paper copies we need to keep after digitization (see Achieving a collaborative FDLP future).
Tags: digitization, ocr