Roll Call reports:
The National Archives could be just months away from starting a long-planned project to create for the first time a searchable digital log of the archives of Congress.
The project, which had been discussed for about six years, would essentially catalog Congressional records dating back to 1789 and create a database where researchers could search for specific topics.
Though it wouldn't digitize the records themselves, the database would point researchers to places within the expansive records where that topic is discussed.
"The idea is to take those various sources ... and to make it a state-of-the-art finding aid," Senate Archivist Karen Paul said.
+ Final Plan Approved Yesterday
+ Archivist of the United States David Ferriero Expressed Concern Over the Cost
+ Cost Estimate Should Be Provided in Six Months
+ Size? 500 Million Pages (200,000 Cubic Feet of Records)
+ Expected To Take Five Years to Complete
Building the Digital Smithsonian Libraries, by Erin Thomas, Smithsonian Libraries (May 13, 2011).
As you may already know, the Libraries has been busy digitizing scientific legacy literature for some time now as part of the global partnership that makes up the Biodiversity Heritage Library — the BHL recently published its 90,500th volume! But as you may not yet have noticed, the Libraries has also begun scanning select titles from our History, Art, and Culture collections.
The Libraries scans roughly 150 History, Art, and Culture titles each month, and those scans are freely available from the Smithsonian Collection at the Internet Archive. At press time, the collection holds 3,838 items!
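If you want to poke around that collection programmatically rather than in a browser, here's a minimal sketch using the internetarchive Python package to list items via the Archive's search API (the package and query are my own assumptions for illustration, not something the Libraries prescribes):

```python
# A minimal sketch, assuming the internetarchive Python package
# (pip install internetarchive). It queries the Internet Archive's
# search API for items in the Smithsonian collection.
from internetarchive import search_items

# Each search result is a dict carrying at least the item identifier;
# the identifier maps to a page at archive.org/details/<identifier>.
for result in search_items("collection:smithsonian"):
    print(result["identifier"])
```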
As we rely increasingly on digital texts for the discovery, access, and use of government information, it is worth understanding the issues (and difficulties) involved in accurately extracting text from a variety of sources. Here is a thirteen-page paper that outlines those issues.
- Herceg, Paul M., and Catherine N. Ball, Reliable Electronic Text: The Elusive Prerequisite for a Host of Human Language Technologies, Mitre Technical Report (Mitre, 30 September 2010).
Electronic text is a prerequisite for text-processing applications such as indexing and search, named entity recognition (NER), and machine translation (MT). For example, text that is printed on paper must first be converted to electronic form to make it searchable--typically, by scanning the pages and using Optical Character Recognition (OCR) to create electronic text.
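To make that scan-then-OCR step concrete, here is a minimal sketch in Python (my own illustration, not from the Mitre paper) using the open-source Tesseract engine via the pytesseract wrapper; the page image filename is hypothetical:

```python
# A minimal scan-to-text sketch, assuming Tesseract OCR, the
# pytesseract wrapper, and the Pillow imaging library are installed.
# "page-001.png" stands in for one scanned page image.
from PIL import Image
import pytesseract

# Convert the scanned page image into electronic text.
page = Image.open("page-001.png")
text = pytesseract.image_to_string(page, lang="eng")

# The resulting text can now be indexed, searched, or fed to
# downstream tools such as named entity recognition or MT.
print(text)
```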
Not all electronic text is created equal, however. Electronic text comes in a wide variety of containers, from Microsoft Office documents to Oracle database records. Electronic text comes in a wide variety of character set encodings, such as Chinese "Big 5" or Unicode UTF-8. Electronic text may be written in any of the world's languages, including the special sublanguages of blogs, chat, and forums. And from the end user's perspective, even formatting aspects of the electronic text may be important: for example, when the use case is translating a document for a customer who requires the translated document to be formatted to look like the original.
From the application perspective, all aspects of electronic text potentially make a difference. Almost all products are language-specific, and will produce useful results only on supported languages. In terms of file types, some products accept Microsoft Word documents as input, while others may accept only plain text files. Some products may work best if all text is "normalized" to a single character set encoding such as UTF-8.
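As an illustration of that normalization step (again my own sketch, not the paper's), here is one way to guess a legacy encoding such as Big 5 and re-encode the text to UTF-8 in Python, using the chardet library; the filenames are hypothetical:

```python
# A sketch of normalizing text from an unknown or legacy encoding
# (e.g., Big 5) to UTF-8, assuming the chardet library is installed.
import chardet

with open("legacy-input.txt", "rb") as f:
    raw = f.read()

# chardet returns its best guess, e.g. {'encoding': 'Big5', ...}.
guess = chardet.detect(raw)
text = raw.decode(guess["encoding"] or "utf-8", errors="replace")

# Re-encode to UTF-8 so every downstream tool sees one encoding.
with open("normalized-output.txt", "w", encoding="utf-8") as f:
    f.write(text)
```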
The purpose of this paper is to explore factors to be considered when trying to match up text-containing files (in a variety of file formats, character set encodings, languages, etc.) with text-processing applications and use cases.
Hat tip to Info Docket!
At the last couple of depository library council meetings, I've heard comments from documents librarians -- especially from librarians at smaller institutions -- that they'd love to participate in digitizing historic government documents, but for various reasons (lack of $$, staffing, time, technical infrastructure, etc.) could not undertake large-scale digitization projects.
Now there's a way for lots of libraries to chip in on the greater goal of increased access to historic government documents with very little $$ or infrastructure. We've previously mentioned BookLiberator and DIYbookscanner, two projects working on low-cost hardware solutions for digitizing books using off-the-shelf digital cameras and free open-source software called Book Scan Wizard.
But there was still one piece missing to make the whole workflow run smoothly for libraries and government documents collections of all sizes. That third piece of the puzzle just became a reality with yesterday's announcement that Book Scan Wizard has teamed up with the Internet Archive to provide automatic uploads of scans to the Internet Archive (directions and more information here). Hardware: check. Software: check. Digital infrastructure: check.
With the new version of Book Scan Wizard, or even by uploading directly to the Internet Archive, any PDF composed of images of book pages, or any organized zip file filled with images of book pages, will be automatically processed. The Internet Archive’s servers will then automatically perform optical character recognition (OCR) on the book and make PDF, EPUB, Kindle (MOBI), DAISY, DjVu, and plain-text copies of the entire book available for download by anyone, anywhere. You can see a sample book from this process to get a better idea. All this happens within a few hours of the book being uploaded. This is free OCR for anyone in the world.
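For the curious, here is roughly what that upload step looks like from Python using the internetarchive package (a sketch under my own assumptions: Book Scan Wizard handles this for you, and the identifier, file name, and metadata below are all hypothetical):

```python
# A sketch of uploading a scanned book to the Internet Archive with
# the internetarchive package (pip install internetarchive; requires
# an archive.org account and credentials set up via `ia configure`).
from internetarchive import upload

upload(
    "historic-gov-doc-1923",              # hypothetical unique item identifier
    files=["historic-gov-doc-1923.pdf"],  # PDF of page images, or a zip of images
    metadata={
        "mediatype": "texts",             # tells IA to run its book pipeline
        "collection": "opensource",       # or a curated collection you can write to
        "title": "Hypothetical Historic Government Document (1923)",
    },
)
# Within a few hours, IA's servers OCR the book and derive the PDF,
# EPUB, Kindle (MOBI), DAISY, DjVu, and plain-text copies described above.
```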
Now there's one last piece needed: scan on demand. This idea has already been put into practice by the Internet Archive's Open Library and their partnership with the Boston Public Library. What we need is to open up the Catalog of Government Publications (CGP) -- which will soon include over 1 million records from GPO's historic shelflist spanning the 1870s to 1992 -- similar to the way the BPL's scan-on-demand project (now retired, it seems) allowed users to request a scan of a public domain book directly from the Open Library catalog.
GPO could manage this scan-on-demand process (or allow libraries to pick and choose documents from the CGP), connect the bibliographic metadata from the historic shelflist, and upload the scans to both the Internet Archive and FDsys. The circle is complete. Am I missing anything? Would love to hear readers' thoughts.
Here's some good news on this stormy day (at least in NorCal). GPO and the Library of Congress are set to work together on better digital access for the historic United States Statutes at Large and the United States Constitution. Anyone want to add this to the Conan the Librarian Wikipedia page?
The U.S. Government Printing Office (GPO) and the Library of Congress (LOC) recently received approval from the Joint Committee on Printing (JCP) to proceed on two collaborative efforts. One project involves the digitization of some of our nation's most important legal and legislative documents and the other involves enhanced public online access to the Constitution of the United States: Analysis and Interpretation (CONAN).
The digitization project will include the public and private laws, and proposed constitutional amendments passed by Congress as published in the official Statutes at Large from 1951-2002. GPO and LOC will also work on digitizing official debates of Congress from the permanent volumes of the Congressional Record from 1873-1998. These laws and documents will be authenticated and available to the public on GPO’s Federal Digital System (FDsys) and the Library of Congress’s THOMAS legislative information system.
The other project will provide enhanced public online access to the Constitution of the United States: Analysis and Interpretation (CONAN), a Senate Document that analyzes Supreme Court cases relevant to the Constitution. The project involves creating an enhanced version of CONAN, where updates to the publication will be made available on FDsys as soon as they are prepared. In addition to more timely access to these updates, new online features will also be added, including greater ease of searching and authentication.
GPO authenticates the documents on FDsys by digital signature and these authenticated documents are also available on the Library’s THOMAS system. This signature assures the public that the document has not been changed or altered since receipt by GPO. This digital signature, viewed through the GPO Seal of Authenticity, verifies the document’s integrity and authenticity.
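The signature itself is a full public-key mechanism embedded in the PDF, but the tamper-detection idea underneath it can be sketched with a plain hash comparison (a simplified illustration of my own, not GPO's actual process; the file name and reference digest are hypothetical):

```python
# A simplified illustration of the integrity half of authentication:
# compare a document's cryptographic hash against a known-good value.
# GPO's real mechanism is a digital signature embedded in the PDF and
# surfaced through the Seal of Authenticity; this only shows the idea.
import hashlib

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical digest recorded when the document was received by GPO.
published_digest = "hypothetical-reference-digest"

if sha256_of("statutes-at-large-v65.pdf") == published_digest:
    print("Document unchanged since receipt.")
else:
    print("Document has been altered (or the reference digest is wrong).")
```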
I'm of two minds about this, as I am about similar digitization plans. On the one hand, the digitization of Smithsonian collections -- books, research reports, data, music, film and other sounds (like frog vocalizations!) -- could be a real boon for online access to some amazing materials.
On the other hand, this quote from the executive summary worries me:
To preserve our collections, the Smithsonian constantly battles the destructive forces of time and environment. Despite our best efforts, plastics discolor, wax cylinder recordings distort, and botanical specimens become brittle. Digitization offers a way to make objects — and the valuable information they contain — available without jeopardizing their integrity by handling or by exposure to the elements.
While they mention a "life cycle-management approach to digitization," there doesn't seem to have been serious thought given to the fact that digital objects degrade faster than physical objects, and that digital preservation is an ongoing and potentially more expensive effort. I worry that SI.edu will broker the same kind of disastrous deal that GAO did with Thomson-West, whereby a whole swath of public domain information was privatized.
I would call on SI.edu and ALL .gov agencies to insert a clause into ANY digitization contract stating that ALL digital files and metadata will be accessible via free and open sites. That means, where applicable, copies of all digital content would be ingested into GPO's FDsys, the Library of Congress, NARA, and/or publicly accessible non-profit sites (e.g., the UNT Digital Library or the Internet Archive). Please help us get this message across to your friends in the .gov sector. Public information should remain public!
As I suggested in my tweet a few minutes ago, wouldn't it be great if lots of depository libraries bought cheap book scanners like the Decapod (a Mellon-funded project), digitized government documents, and uploaded them to the Open Library? There are tons of records for government documents just waiting for the attachment of a digital file. And GPO could help by sharing its records from the Catalog of Government Publications (CGP) with the Open Library, where librarians and others could enhance them into more robust metadata (which could be fed back into the CGP!). Lots of libraries with Decapods make light work!
(Full disclosure: I'm on the board of QuestionCopyright, a 501(c)(3) non-profit which has its own book scanning hardware/software project called Book Liberator. BL developers are in close contact with Decapod folks. But I get no economic benefit from either Book Liberator or Decapod.)
Carl Malamud announced yesterday the inaugural meeting of the International Amateur Scanning League (IASL) (I'm already imagining cool swag!). Malamud is taking the FedFlix program to the streets! FedFlix, a joint venture between the National Technical Information Service and Public.Resource.Org, digitizes NTIS videos and makes them available on YouTube, the Internet Archive, and the public.resource.org Stock Footage Library.
Well, now a gang of volunteers -- including members of DC CopyNight and Smithsonian employees working on their own time -- is going to the National Archives and Records Administration (NARA) and copying over 1,500 DVDs to be uploaded to the net.
What makes this grassroots digitization effort so remarkable is that it has the full support of the government. Indeed, David Ferriero, the U.S. Archivist, joined me in the initial meeting where we taught volunteers how to rip DVDs!
Kudos to Malamud and the IASL!
And this makes me think that more libraries and librarians should be doing the same thing for govt documents. Why not set up your own scanning operation in your depository library (Book Liberator or DIY Book Scanner can show you how to digitize on the cheap!) and then deposit those scans into the Internet Archive's US Documents Collection (don't forget to follow FDLP digitization standards!)? Scans could also be ingested into FDsys (when they've got that capability working ;-)). So get to it; what are you waiting for?!
Carl Malamud posed this question over on Twitter: "What if our national cultural institutions all worked together on a common problem, attracted White House support?" In his post on the O'Reilly blog, "A National Scan Center: A Public Works Project", Malamud scopes out the issues and calls for the Library of Congress, the Smithsonian Institution, the Government Printing Office, the National Archives and Records Administration, and the National Technical Information Service to come together and make the compelling case for funding a five-year, $500 million effort to create a National Scan Center. Hear, hear, Carl!
In the U.S., we face a deluge of paperwork similar to the one we faced in the 1930s. A huge backlog of paper, microfiche, audio, video, and other materials sits throughout the federal government. Little money has come from Congress for digitization, and bureaucracies have resorted to a series of questionable public-private partnerships as a way of digitizing their materials. For example, the Government Accountability Office shipped 60 million pages of our Federal Legislative Histories (the record of each law from the initial bill through the hearings and conference reports) off to Thomson West, but didn't even get digital copies back. Another example is the recent failed effort by the Government Printing Office to digitize 60 million pages of the Federal Depository Library Program, an effort it tried to get through as a "zero dollar cost to the government" deal with the private sector.
There are no free lunches and there are no "no cost to the government" deals. The costs involve the government effort to supervise the contract, prepare the materials, and ship them, and in both the GAO and GPO cases, the government wasn't getting much back for its effort. What the government and the people usually get is a lien on the public domain, preventing the public from accessing these vital materials. Similar efforts are sprinkled throughout the government. I testified to Congress that I had learned that the National Archives was contemplating a scan of congressional hearings with LexisNexis under similar circumstances, and many may be aware of the questionable deal the Archives cut with Amazon where my favorite online superstore got de facto exclusive rights to 1,899 wonderful pieces of video.
[UPDATE: I spoke too soon. Seems that these are "early access" documents that "will be removed from this database, to be replaced by the fully edited version in the appropriate digital edition in the Rotunda American Founding Era collection."]
More than 200 years after they were written, some 5,000 previously unpublished documents of the founders of the United States — including Thomas Jefferson, John Adams and James Madison — are at long last available to the public at no cost.
The Documents Compass group of the Virginia Foundation for the Humanities at the University of Virginia has spent much of the last year proofreading and transcribing thousands of pages of letters and other papers.
The documents are now available online for free at the University of Virginia Press’ digital imprint called Rotunda...
...The online project is a federal pilot study that aims to expand public access to the papers of America’s founders. It is funded by a $250,000 grant from the National Historical Publications and Records Commission, which is a division of the National Archives.
[Thanks Resource Shelf!]