Open letter and petition to President Obama to create a federal scanning commission and digitize all .gov publications #FDLP
Submitted by jrjacobs on Tue, 2012-01-03 11:11.
John Podesta and Carl Malamud have written an open letter to President Obama (text below) asking for the creation of a Federal Scanning Commission and to greatly increase the pace of digitization of federal resources. They need 25,000 signatures on their petition by January 20, so your help would be greatly appreciated!
While I have some reservations about wholesale digitization that are glossed over in the letter -- I worry for example about the process and how current digitization methods basically destroy documents, how current OCR software is less than perfect, and about only making a digital equivalent to a paper document, NOT the ability to extract and re-use data and statistics etc. (to read more, see "Achieving a collaborative FDLP future") -- as Malamud says:
"Just imagine ... what if we could scan the contents of the FDLP, back issues of the CFR, the briefs before the Supreme Court? We'll never know if we can scan .gov unless we start asking the questions. Please help us get started!"
For that reason, I'm asking readers to sign the petition and forward it to your friends. A national effort is just what is needed. Librarians must advocate for and participate in this process!
December 21, 2011
The White House
1600 Pennsylvania Avenue
Washington, D.C. 20500
Dear Mr. President:
Locked in our federal vaults is a tremendous storehouse of information that if digitized would form a core for our digital public libraries in America with huge benefit for our country: cutting costs in the Federal government, creating jobs throughout America, and revolutionizing how we educate our citizens, how we practice the law, and how we create news, art, and scholarly works.
Imagine if the riches contained in the National Archives, Library of Congress, Smithsonian Institution, Government Printing Office, National Library of Medicine, National Agricultural Library, National Technical Information Service, and scores of other federal organizations were made available, becoming the core of a national effort to make access to knowledge a right for all Americans. The dream is a big one, but if we do not begin the questions of what it would take to get there, we will never start down that road. Today, we don't know what it would take.
We are not necessarily suggesting that the federal government immediately undertake an ambitious effort to scan the holdings of .gov, but if we ever hope to begin even a small piece of making available our past for use by our future, we should at least begin to scope out the size of the problem. We believe it would require a decade-long commitment to digitization to make our nation's cultural, scientific, educational, and historical resources available, but we can't even begin that discussion unless we know how big the problem is. Such an effort is indeed ambitious to contemplate, but we can only ask if we were able to put a man on the moon, why can't we launch the Library of Congress into cyberspace?
Over the last year, a number of efforts have sprung up to create comprehensive digital libraries. The European Union has created Europeana with a goal to “make a large part of the world's cultural heritage available to a large part of the world's population.” In the United States, efforts have included Google Books, the Hathi Trust, the Internet Archive, and the recently announced Digital Public Library of America, a planning initiative with a goal of “creating a large-scale digital public library that will make the cultural and scientific record available to all.”
No matter what the eventual shape of these efforts, we know that the holdings of the U.S. government will play a crucial role, a central part of our public domain. While there have been many well-intentioned efforts to digitize federal holdings, those efforts have been preliminary and tentative. Our national cultural and scientific organizations have never worked together to develop a coherent digitization strategy to scan at scale.
The PCAST report on Designing a Digital Future hits the nail on the head on investing in Networking and Information Technology (NIT), but does not address squarely the question of what it would take to digitize the holdings of our national institutions. The Presidential Memorandum on Managing Government Records discusses how to make record-keeping move into the modern age in the future, but does not address how to rescue the past and make it useful for Americans.
One way to begin is to convene governmental and non-governmental experts, perhaps in the form of a Presidential Commission, Interagency Task Force, or other mechanism. The “Federal Scanning Commission” would be tasked to answer 6 questions and deliver a report within a year:
- What are the holdings of our national institutions? How many images, documents, videos, and other objects are there?
- How long would it take to digitize these materials?
- How much would it cost given current technology? Is there directed research or are there economies of scale that would bring those costs down?
- What is the strategy for digital preservation of these materials? How will we avoid digital obsolescence?
- What is the strategy for identifying restrictions on use of the material? How does one identify and safeguard materials that have copyright restrictions, contain personally identifiable information, or contain classified materials?
- What are the economic and non-economic benefits of such an effort?
  - What are the cost savings to government?
  - What are the economic benefits? Would this effort enable industries that build on top of scientific and technical information, spur innovation in the legal marketplace, or enable our creative industries to create more effectively?
  - What are the non-economic benefits? Will such an effort lead to better STEM and other educational efforts? Will it promote a more informed citizenry and better access to justice?
To date, thinking about digitization has been piecemeal. Individual agencies have thought about the problem in terms of prototypes and pilots. Only the White House can bring these efforts together under one roof and begin to think in terms of a national digitization strategy for our federal government.
Bringing government agencies together with outside experts to solve a common problem related to our federal holdings has a precedent. When R. D. W. Connor was appointed as the first Archivist of the United States, he faced a herculean task, getting all the agencies of government to come together with a common vision of “safeguarding and preserving the records of our Government.” The idea of safeguarding and preserving the records of government was a new one, and Archivist Connor found “records mingled higgledy-piggledy with empty whiskey bottles.”
Archivist Connor appealed for help to President Roosevelt, asking for his assistance in forging a common vision among the agencies and for their cooperation. President Roosevelt formed a National Archives Council and convened the first meeting in the Cabinet Room, asking Secretary of State Cordell Hull to serve as chairman. By bringing the agencies together in one room, President Roosevelt made the dream of archiving the records of government a shared vision, and then made that vision real.
When Thomas Jefferson donated his books to create the cornerstone of the Library of Congress, his library contained a wealth of useful information, from an extensive collection on the law to books on agriculture, chemistry, surgery, and medicine. With this contribution, Jefferson saw to it that the government of the United States would play a central role in the increase and diffusion of knowledge. It is time now for us to lay the cornerstone for our own era, to anchor our digital age with the vast holdings of our government so that we may promote the useful arts and the progress of science.
We ask your help to achieve this 21st century dream, making the vast resources of our federal government available to all on the global Internet, making access to knowledge a right for all Americans and a defining contribution for our future.
John D. Podesta, Chair
Center for American Progress
Carl Malamud, President
Press Release from GPO:
GPO PARTNERS WITH UNIVERSITY OF IOWA TO PRESERVE HISTORIC COLLECTION
FOR IMMEDIATE RELEASE: October 5, 2011
WASHINGTON-The U.S. Government Printing Office (GPO) is partnering with the University of Iowa Libraries to preserve and make available in digital format historic Government-issued posters from pre-World War II to the 1990s. The University of Iowa Libraries is providing public access to the collection of 1,500 posters promoting services, programs, and initiatives by Government agencies. The University of Iowa Libraries is part of GPO's Federal Depository Library Program (FDLP), which provides public access to published information of all three branches of the Government through partnerships with more than 1,220 libraries nationwide. GPO is providing back-up support for the digital poster collection; in the event that the University of Iowa Libraries cannot provide access, GPO will make the digital poster collection available on the agency's Federal Digital System (FDsys), a one-stop site to authentic, published Government information.
Link to digital poster collection: http://digital.lib.uiowa.edu/gpc/index.php
"GPO's mission of keeping America informed is demonstrated through the agency's partnerships with more than 1,200 libraries nationwide and our joint effort to make Government information available to the public," said Public Printer Bill Boarman. "As GPO celebrates its 150th anniversary, we recognize the importance of history and preserving history for future generations and that is why GPO is thrilled to work with the University of Iowa Libraries to preserve and make available this historic collection."
"GPO is excited to partner with the University of Iowa Libraries to preserve and protect the future of this historic collection," said Superintendent of Documents Mary Alice Baish. "As we embrace the new digital age, GPO is seeking partnerships with libraries nationwide in order to safeguard historic collections and provide the American people with permanent public access to Government information."
"I'm delighted that this previously hidden collection is now available to anyone with an Internet connection. These posters often represent a graphic documentation of priorities of a given presidential administration or reflect social culture at a discrete point in time," said Marianne Mason, Federal Information Librarian University of Iowa Libraries. "Nearly all federal agencies, both past and present, have produced social marketing posters including the Works Project Administration, War Mobilization Office, EPA, Dept. of Interior and NASA. This visual collection has the potential to complement academic course work in public policy, history, communication studies, and health sciences and to enhance outreach activities to primary and secondary (K-12) students. The Government Printing Office has been a steadfast partner in preserving rich digital collections from many libraries and, happily, The University of Iowa has joined that partnership."
The National Archives is asking volunteer transcribers at Wikisource to turn paper and ink historical manuscripts into simple, searchable Web text.
- National Archives' first Wikipedian in residence to bring more holdings to the public, by Joseph Marks, NextGov (07/11/2011).
An interesting part of this story is the issue of the quality of scans.
A major barrier, McDevitt-Parks said, is the quality of the Archives' digitized files, the most important of which were scanned in the 1990s using early technology that makes them difficult to read online.
...Unfortunately, it's hard to [make the case to] go back and scan things that are already scanned when there are millions and millions of things that aren't in any digitized form at all.
I think there is an important lesson here. As we develop policies for our individual libraries today and plans for the FDLP of the future, we should always remember that digitization technologies improve over time and the uses we make of digital documents evolve over time. We should avoid choices that are merely good-enough today.
We should aim for choices that let us increase access and functionality over time, not ones that lock us into what we are technologically capable of today.
Roll Call reports:
The National Archives could be just months away from starting a long-planned project to create for the first time a searchable digital log of the archives of Congress.
The project, which had been discussed for about six years, would essentially catalog Congressional records dating back to 1789 and create a database where researchers could search for specific topics.
Though it wouldn't digitize the records themselves, the database would point researchers to places within the expansive records where that topic is discussed.
"The idea is to take those various sources ... and to make it a state-of-the-art finding aid," Senate Archivist Karen Paul said.
+ Final Plan Approved Yesterday
+ Archivist of the United States David Ferriero expressed concern over the cost
+ Cost Estimate Should Be Provided in Six Months
+ Size? 500 Million Pages (200,000 Cubic Feet of Records)
+ Expected To Take Five Years to Complete
Building the Digital Smithsonian Libraries, by Erin Thomas, Smithsonian Libraries (May 13, 2011).
As you may already know, the Libraries has been busy digitizing scientific legacy literature as part of the global partnership that makes up the Biodiversity Heritage Library for some time now — the BHL recently published its 90,500th volume! But as you may not have yet noticed, the Libraries has also begun scanning select titles from our History, Art, and Culture collections as well.
The Libraries scans roughly 150 History, Art, and Culture titles each month, and those scans are freely available from the Smithsonian Collection at the Internet Archive. At press time, the collection holds 3,838 items!
As we come to rely increasingly on digital texts for discovery, access, and use of government information, it is worth understanding the issues (and difficulties) in accurately extracting text from a variety of sources. Here is a thirteen-page paper that outlines them.
- Herceg, Paul M., and Catherine N. Ball, Reliable Electronic Text: The Elusive Prerequisite for a Host of Human Language Technologies, Mitre Technical Report (Mitre, 30 September 2010).
Electronic text is a prerequisite for text-processing applications such as indexing and search, named entity recognition (NER), and machine translation (MT). For example, text that is printed on paper must first be converted to electronic form to make it searchable--typically, by scanning the pages and using Optical Character Recognition (OCR) to create electronic text.
Not all electronic text is created equal, however. Electronic text comes in a wide variety of containers, from Microsoft Office documents to Oracle database records. Electronic text comes in a wide variety of character set encodings, such as Chinese "Big 5" or Unicode UTF-8. Electronic text may be written in any of the world's languages, including the special sublanguages of blogs, chat and forums. And from the end user's perspective, even formatting aspects of the electronic text may be important: for example, if the use case is translating a document for a customer, and the customer requires the translated document to be formatted to look like the original.
From the application perspective, all aspects of electronic text potentially make a difference. Almost all products are language-specific, and will produce useful results only on supported languages. In terms of file types, some products accept Microsoft Word documents as input, while others may accept only plain text files. Some products may work best if all text is "normalized" to a single character set encoding such as UTF-8.
The purpose of this paper is to explore factors to be considered when trying to match up text-containing files (in a variety of file formats, character set encodings, languages, etc.) with text-processing applications and use cases.
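The "normalized to a single character set encoding" step mentioned in the abstract can be illustrated with a short Python sketch. It is not taken from the Mitre paper: the function name `normalize_to_utf8` is hypothetical, and the sketch assumes the source encoding is already known, which in practice is often the hard part.

```python
def normalize_to_utf8(raw_bytes, source_encoding):
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    # errors="strict" makes a wrong encoding guess fail loudly instead of
    # silently producing garbled text.
    text = raw_bytes.decode(source_encoding, errors="strict")
    return text.encode("utf-8")


# Example: the same two Chinese characters stored as Big5, then normalized.
big5_bytes = "中文".encode("big5")
utf8_bytes = normalize_to_utf8(big5_bytes, "big5")
print(utf8_bytes.decode("utf-8"))
```

The bytes differ between the two encodings even though the text is identical, which is why a product that only accepts UTF-8 will misread or reject the Big5 file until it has been converted.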
Hat tip to Info Docket!
At the last couple of depository library council meetings, I've heard comments from documents librarians -- especially from librarians at smaller institutions -- that they'd love to participate in the digitization process of historic government documents, but for various reasons (lack of $$, staffing, time, technical infrastructure etc) could not undertake large scale digitization projects.
Now there's a way for lots of libraries to chip in on the greater goal of increased access to historic government documents with very little $$ or infrastructure. We've written before about BookLiberator and DIYbookscanner, two projects working on low-cost hardware solutions for digitizing books using off-the-shelf digital cameras and free open source software called Book Scan Wizard.
But there were still 2 pieces missing to make the whole workflow run smoothly for libraries and government documents collections of all sizes. The third piece to the puzzle just became a reality with yesterday's announcement that Book Scan Wizard had teamed up with the Internet Archive to provide automatic uploads of scans to the Internet Archive (directions and more information here). Hardware: check. Software: check. Digital infrastructure: check.
With the new version of Book Scan Wizard, or even through just uploading directly to the Internet Archive, any PDF composed of images of book pages or organized zip file filled with images of book pages will be automatically processed. The Internet Archive’s servers will then automatically perform optical character recognition (OCR) on the book and make a pdf, epub, kindle (mobi), daisy, djvu, and text file copy of the entire book available for download by anyone, anywhere. You can see a sample book from this process to get a better idea. All this happens within a few hours of the book being uploaded and then anyone can download it. This is free OCR for anyone in the world.
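For a library assembling scans by hand rather than through Book Scan Wizard, the "organized zip file filled with images of book pages" could be built with a few lines of Python. This is only a minimal sketch: the function name `pack_page_images`, the folder layout, and the zero-padded file names are assumptions made for illustration, not requirements stated by Book Scan Wizard or the Internet Archive, so check the Archive's upload directions for the naming it actually expects.

```python
import zipfile
from pathlib import Path


def pack_page_images(image_dir, zip_path,
                     extensions=(".jpg", ".jpeg", ".tif", ".tiff", ".png")):
    """Bundle page images from image_dir into one zip, in filename order.

    Returns the archive member names in the order they were written.
    """
    image_dir = Path(image_dir)
    # Sort by filename so zero-padded names (0001.jpg, 0002.jpg, ...)
    # keep the pages in reading order.
    pages = sorted(
        p for p in image_dir.iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
    if not pages:
        raise ValueError(f"no page images found in {image_dir}")

    # Image formats are already compressed, so store them as-is.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for page in pages:
            archive.write(page, arcname=page.name)
    return [page.name for page in pages]
```

Calling `pack_page_images("scans/my_document", "my_document.zip")` (both paths hypothetical) would produce a single file ready to upload, with non-image files such as notes left out.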
Now there's one last piece needed: scan on demand. This idea has already been put into practice by the Internet Archive's Open Library and their partnership with the Boston Public Library. What we need is to open up the Catalog of Government Publications (CGP) -- which will soon include over 1 million records from GPO's historic shelflist spanning the 1870s to 1992 -- similar to the way the BPL's scan on demand project (now retired it seems) allowed users to request a scan of a public domain book directly from the Open Library catalog.
GPO could manage this scan on demand process -- or allow libraries to pick and choose documents from the CGP -- connect the bibliographic metadata from the historic shelflist, and upload to both the Internet Archive and FDsys. The circle is complete. Am I missing anything? Would love to hear readers' thoughts.
Here's some good news on this stormy day (at least in NorCal). GPO and the Library of Congress are set to work together on better digital access for the historic United States Statutes at Large and the United States Constitution. Anyone want to add this to the Conan the Librarian Wikipedia page?
The U.S. Government Printing Office (GPO) and the Library of Congress (LOC) recently received approval from the Joint Committee on Printing (JCP) to proceed on two collaborative efforts. One project involves the digitization of some of our nation's most important legal and legislative documents and the other involves enhanced public online access to the Constitution of the United States: Analysis and Interpretation (CONAN).
The digitization project will include the public and private laws, and proposed constitutional amendments passed by Congress as published in the official Statutes at Large from 1951-2002. GPO and LOC will also work on digitizing official debates of Congress from the permanent volumes of the Congressional Record from 1873-1998. These laws and documents will be authenticated and available to the public on GPO’s Federal Digital System (FDsys) and the Library of Congress’s THOMAS legislative information system.
The other project will provide enhanced public online access to the Constitution of the United States: Analysis and Interpretation (CONAN), a Senate Document that analyzes Supreme Court cases relevant to the Constitution. The project involves creating an enhanced version of CONAN, where updates to the publication will be made available on FDsys as soon as they are prepared. In addition to more timely access to these updates, new online features will also be added, including greater ease of searching and authentication.
GPO authenticates the documents on FDsys by digital signature and these authenticated documents are also available on the Library’s THOMAS system. This signature assures the public that the document has not been changed or altered since receipt by GPO. This digital signature, viewed through the GPO Seal of Authenticity, verifies the document’s integrity and authenticity.
I'm of two minds about this and similar digitization plans. On the one hand, the digitization of Smithsonian collections -- books, research reports, data, music, film and other sounds (like frog vocalizations!) -- could be a boon to online access to some really amazing materials.
On the other hand, this quote from the executive summary worries me:
To preserve our collections, the Smithsonian constantly battles the destructive forces of time and environment. Despite our best efforts, plastics discolor, wax cylinder recordings distort, and botanical specimens become brittle. Digitization offers a way to make objects — and the valuable information they contain — available without jeopardizing their integrity by handling or by exposure to the elements.
While they mention a "life cycle-management approach to digitization," there doesn't seem to be a serious amount of thought given to the fact that digital objects degrade faster than physical objects, and that digital preservation is an ongoing and potentially more expensive effort. I worry that SI.edu will broker the same kind of disastrous deal that GAO did with Thomson-West whereby a whole swath of public domain information was privatized.
I would call on SI.edu and ALL .gov agencies to insert a clause into ANY digitization contract that ALL digital files and metadata will be accessible via free and open sites. That means that, where applicable, copies of all digital content would be ingested into GPO's FDsys, the Library of Congress, NARA, and/or publicly accessible non-profit sites (e.g., the UNT Digital Library or the Internet Archive). Please help us get this message across to your friends in the .gov sector. Public information should remain public!
As I suggested in my tweet a few minutes ago, wouldn't it be great if lots of depository libraries bought cheap book scanners like the Decapod (a Mellon-funded project), digitized government documents, and uploaded them to the Open Library? There are tons of records for government documents just waiting for the attachment of a digital file. And GPO could help by sharing their records from the Catalog of Government Publications (CGP) with the Open Library, where librarians and others could enhance them to create more robust metadata (which could be fed back into the CGP!). Lots of libraries with Decapods make light work!
(Full disclosure: I'm on the board of QuestionCopyright, a 501(c)(3) non-profit which has its own book scanning hardware/software project called Book Liberator. BL developers are in close contact with Decapod folks. But I get no economic benefit from either Book Liberator or Decapod.)