As we rely increasingly on digital texts for discovery, access, and use of government information, it is worth understanding the issues (and difficulties) involved in accurately extracting text from a variety of sources. Here is a thirteen-page paper that outlines them.
- Herceg, Paul M., and Catherine N. Ball, Reliable Electronic Text: The Elusive Prerequisite for a Host of Human Language Technologies, MITRE Technical Report (MITRE, 30 September 2010).
Electronic text is a prerequisite for text-processing applications such as indexing and search, named entity recognition (NER), and machine translation (MT). For example, text that is printed on paper must first be converted to electronic form to make it searchable, typically by scanning the pages and using Optical Character Recognition (OCR) to create electronic text.
Not all electronic text is created equal, however. Electronic text comes in a wide variety of containers, from Microsoft Office documents to Oracle database records. Electronic text comes in a wide variety of character set encodings, such as Chinese “Big 5” or Unicode UTF-8. Electronic text may be written in any of the world’s languages, including the special sublanguages of blogs, chat and forums. And from the end user’s perspective, even formatting aspects of the electronic text may be important: for example, if the use case is translating a document for a customer, and the customer requires the translated document to be formatted to look like the original.
From the application perspective, all aspects of electronic text potentially make a difference. Almost all products are language-specific, and will produce useful results only on supported languages. In terms of file types, some products accept Microsoft Word documents as input, while others may accept only plain text files. Some products may work best if all text is “normalized” to a single character set encoding such as UTF-8.
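To make the idea of "normalizing" to a single character set encoding concrete, here is a minimal Python sketch. It is not taken from the paper: the function name `normalize_to_utf8` is hypothetical, and it assumes the source encoding is already known (Big5 in this example), which in practice is often the hard part.

```python
def normalize_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding into UTF-8.

    Assumption: the caller already knows the source encoding.
    Decoding is strict, so bytes that are invalid in the source
    encoding raise UnicodeDecodeError instead of being silently mangled.
    """
    text = raw.decode(source_encoding, errors="strict")
    return text.encode("utf-8")


# The same two Chinese characters, stored as Big5 bytes.
big5_bytes = "中文".encode("big5")

# Normalize to UTF-8 so downstream tools see one consistent encoding.
utf8_bytes = normalize_to_utf8(big5_bytes, "big5")
```

The characters are the same in both cases, but the bytes differ: 4 bytes in Big5 versus 6 in UTF-8. A tool that expects UTF-8 and is handed the Big5 bytes would fail to decode them or produce garbage, which is why the encoding has to be matched to the application.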
The purpose of this paper is to explore factors to be considered when trying to match up text-containing files (in a variety of file formats, character set encodings, languages, etc.) with text-processing applications and use cases.
Hat tip to Info Docket!