The Canadian government's Library and Archives Canada (LAC) announced more details of its digitization project. In a "digitization partnership" with Canadiana.org, a not-for-profit charitable organization, LAC will undertake a large-scale digitization project involving about 60 million images from numerous collections, including the indexing and description of millions of personal, administrative and government documents, as well as land grants, war diaries and photographs, and the transcription of millions of handwritten pages. This is a "10-year agreement."
- Library and Archives Canada and Canadiana.org partner on digitization, online publication of millions of images from archival microfilm collection. Library and Archives Canada (2013-08-29).
The announcement says that Canadians will have "access" regardless of where they live, at no charge.
I just received an old (historic NOT legacy) Department of Commerce publication off of the needs and offers list called "Commercial handbook of China" by Julean Arnold, commercial attaché (WorldCat record). It's actually a 1975 reprint of a 1919 publication. It's chock full of statistics relating to provinces, cities, and consular districts -- agriculture, minerals and mining, populations, exports and imports, revenues, transportation, ports and shipping facilities etc. In short, this is a gold mine of historic information and statistics from the Republic of China (pre-Communist China). The document was digitized and is available in HathiTrust as well as the Internet Archive (see book reader below).
However, in comparing the digitized version with the paper version in hand, I came upon several issues:
- there are 3 foldout maps that were not digitized. These maps are critical information on railway lines and treaty ports in China. The bibliographic record has a physical description including "2 v. fronts., plates, fold. map, tables, diagrs., fold. charts" but no content note mentioning that the maps were not digitized.
- As I mentioned, the document is chock full of statistical tables. Have you ever tried copying and pasting tabular data from a PDF? It's even worse when the tables are displayed in landscape rather than portrait. I've verified that the OCR fails on those pages.
- Lots of readability/usability issues: The table of contents is partially obscured in one copy and the tables are often blurred or faint. Also, HT is now using an OCR process where you can search but not copy or paste.
- Lastly, I find it ... uh... interesting that this book says here "Copyright: Public Domain, Google-digitized." But, if you want to download the whole book, you have to be an HT partner.
Does this digitized version increase access to this important historic material? Yes, indeed, it does. But I'm rather glad to have a bibliographic record in my catalog that links to the digital version AND points to the paper copy in our collection.
We have seen this happen before in the U.S. (See, for example: The NARA/TGN contract as a bad precedent and GAO *did* sell exclusive access to legislative history to Thomson West) and Canada (Help save the Library & Archives Canada), but this seems like a particularly bad, unjustifiable example of privatization of public information.
- Library and Archives Canada private deal would take millions of documents out of public domain, By Chris Cobb, OTTAWA CITIZEN (June 12, 2013).
Library and Archives Canada has entered a hush-hush deal with a private high-tech consortium that would hand over exclusive rights to publicly owned books and artifacts for 10 years.
...LAC is partnering with Canadiana.org in what is being billed as The Heritage Project -- digitizing 40 million images from more than 800 collections of publicly-held LAC material, much bought by Library and Archives over the years with taxpayers' money.
...Under the agreement, digital images will begin rolling back into the free public domain -- known as "open access" -- as the 10-year exclusive rights expire.
Hat tip to InfoDocket!
"The Digital-Surrogate Seal of Approval: a Consumer-oriented Standard." James A. Jacobs, University of California San Diego and James R. Jacobs, Stanford University. D-Lib Magazine, March/April 2013, Volume 19, Number 3/4. Also available in the Stanford Digital Repository and the University of California Escholarship Repository.
Some libraries, library organizations, and library managements believe they can "manage" their collections better by first digitizing historic collections of books and other paper and ink information sources and then weeding their collections of these materials. Such projects will reduce the number of copies held in the aggregate by all libraries (Lavoie, Schonfeld, Schottlaender, Yano). One problem that these projects often overlook is the subtle (and not so subtle) differences between the legal standing of paper and digital objects with regard to access and use. Too often, creators of digital objects attempt to impose copyright restrictions on the digital objects even if the originals were in the public domain. Additionally, digital objects are often encumbered with licenses and technological restrictions that limit how they can be used and who can use them. The digital objects are often just not as accessible or as usable as the original print. How bad would it be if we threw away our print collections in favor of digital collections that are less accessible and less usable?
Randal C. Picker, who is Leffmann Professor of Commercial Law at the University of Chicago Law School and Senior Fellow at the Computation Institute of the University of Chicago and Argonne National Laboratory, has written a paper and created a presentation on just this issue.
- Picker, Randal. 2013. Access and the Public Domain. Rochester, NY: University of Chicago Institute for Law & Economics. Coase-Sandor Institute For Law And Economics Working Paper No. 631.
- Picker: Access and the Public Domain (Fordham IP Talk), YouTube (Apr 6, 2013).
"This is a version of a talk that I gave at the Fordham IP Conference on April 5, 2013. It is based on my paper Access and the Public Domain, which was published in the San Diego Law Review."
In the paper, he considers how legal issues affect digitization projects such as The Internet Archive, JSTOR, Google Book Search, HathiTrust, and THOMAS.
His take-aways from the presentation are:
- Access rights and use rights are different animals and operate in different legal settings.
- Even though the public domain is coming online, the financing models for the projects will result in efforts to restrict use in a variety of ways.
- Perhaps a truly public public domain, something like the DPLA perhaps, is required to avoid the path of non-copyright control over the public domain.
Hat Tip: ARL Policy Notes.
Lavoie, Brian F., Constance Malpas, and J.D. Shipengrover. 2012. Print Management at “Mega-scale”: a Regional Perspective on Print Book Collections in North America. Dublin, OH: OCLC Research. http://www.oclc.org/research/publications/library/2012/2012-05.pdf (Accessed July 19, 2012).
Schonfeld, Roger C., and Ross Housewright. 2009. What to Withdraw: Print Collections Management in the Wake of Digitization. Ithaka S+R. http://www.sr.ithaka.org/research-publications/what-withdraw-print-colle....
Schottlaender, Brian E.C. et al. 2004. Collection Management Strategies In A Digital Environment, A Project Of The Collection Management Initiative Of The University Of California Libraries, Final Report to the Andrew W. Mellon Foundation. University of California, Office of the President, Office of Systemwide Library Planning. http://www.ucop.edu/cmi/finalreport/index.html.
Yano, Candace Arai, Z.J. Max Shen, and Stephen Chan. 2008. Optimizing the Number of Copies for Print Preservation of Research Journals. Berkeley, CA: University of California Berkeley, Industrial Engineering & Operations Research. http://www.ieor.berkeley.edu/~shen/webpapers/V.8.pdf.
James and I are happy to announce that our new article appears in the current edition of D-Lib Magazine:
- The Digital-Surrogate Seal of Approval: a Consumer-oriented Standard. by James A. Jacobs and James R. Jacobs. D-Lib Magazine, 2013, 19(3/4). DOI: http://dx.doi.org/10.1045/march2013-jacobs
In the last few years, there have been a series of articles, reports and proposals that rely on the promises of digitization to address issues of physical space, cost control, access, and collection management for FDLP libraries. One of the reasons we created this Seal of Approval standard is to provide a clear, consistent way to help evaluate some of these promises of digitization.
Ever heard the term "bit rot" or wondered what actually happens when electronic files go bad? The Atlas of Digital Damages is a collection of files with corrupted bits, so you can visually see what happens. The Atlas is a flickr album, or rather "a staging area for collecting visual examples of digital preservation challenges, failed renderings, encoding damage, corrupt data, and visual evidence documenting #FAILs of any stripe." So, in addition to viewing these examples, you too can contribute examples to help build the Atlas' collection. A blog post by Barbara Sierman, from the National Library of the Netherlands, first posed the question and well, folks ran with the idea and created this "crowd sourced effort" to document digital degradation. See, "Where is our atlas of digital damages?".
I discovered this nifty item while reading through the November Digital Preservation Newsletter from the Library of Congress (there are lots of great project updates and information in there, especially on the geospatial digital preservation front, so go check it out!).
And check out the LOCKSS project for digital preservation approaches and methods to prevent bit rot on a large scale.
[This post was nicely sent to us by our pal Kris Kasianovitz, International, State and Local Government Information Librarian at Stanford. If others want to send us items of interest, please send them to freegovinfo AT gmail DOT com. Thanks Kris!!]
When we think about the historical paper-and-ink collections that FDLP libraries have built over the last 200 years, we often wish we could make them more accessible through digitization. But we have to be careful when we think this way. One thing I have learned repeatedly as I have worked with digital information over the last twenty-five years is that, in the digital world, "access" and "preservation" have to go together. When we neglect either, we lose both.
Some recent writings have reinforced this old idea and are worth remembering:
- All Digital Objects are Born Digital Objects, by Trevor Owens, The Signal (May 15th, 2012).
There is no large red button that says "digitize" on it, we make decisions about what significant properties we want to record from a physical object and we work to ensure that those properties are recorded in the newly created digital object. When we talk about the scanner "digitizing" it's all too easy to forget the history of the creation of the digital object and we can easily forget that there are a range of individual and institutional authorial intentions that go into deciding what and how to digitize.
- Digitization is Different than Digital Preservation: Help Prevent Digital Orphans!, by Kristin Snawder, The Signal (July 15th, 2011).
Many institutions see the immediate value of having materials available electronically. This is valid reasoning. Many researchers no longer want to come and see the materials. They want access from the comfort of their own couch and fuzzy slippers. But, in the hurry to meet user expectations, institutions may scan large quantities of materials without having a solid plan for preserving the digital images into the future.
- Approaching Digitisation Through A Digital Preservation Perspective, by Alenka Kavčič-Čolić. Presented at the SEEDI (South-Eastern European Digitisation Initiative) 2012, Ljubljana, Slovenia.
Most libraries still conceive digitisation as a digital reproduction aimed to provide access to library materials only. The master files resulted from digitisation are usually not digitally preserved and the digital collections run the risk of being lost for the future.
The above examples are about short-term thinking and lack of planning when libraries aim for access without planning for preservation. The same mistake can be made the other way, too: when libraries plan for preservation without access. Paul Conway made this point more than 15 years ago:
For years, preservation simply meant collecting. The sheer act of pulling a collection of manuscripts from a barn, a basement, or a parking garage and placing it intact in a dry building with locks on the door fulfilled the fundamental preservation mandate of the institution. In this regard, preservation and access have been mutually exclusive activities often in constant tension. "While preservation is a primary goal or responsibility, an equally compelling mandate--access and use--sets up a classic conflict that must be arbitrated by the custodians and caretakers of archival records," states a fundamental textbook in the field (Ritzenthaler, Mary Lynn. Preserving Archives and Manuscripts. Chicago: Society of American Archivists, 1993. p. 1). Access mechanisms, such as bibliographic records and archival finding aids, simply provide a notice of availability and are not an integral part of the object.
In the digital world, the concept of access is transformed from a convenient byproduct of the preservation process to its central motif. The content, structure, and integrity of the information object assume center stage; the ability of a machine to transport and display this information object becomes an assumed end result of preservation action rather than its primary goal. Preservation in the digital world is not simply the act of preserving access but also includes a description of the "thing" to be preserved. In the context of this report, the object of preservation is a high-quality, high-value, well-protected, and fully integrated version of an original source document.
-- Paul Conway, Head, Preservation Department, Yale University Library. Preservation in the Digital World. Council on Library and Information Resources, Pub62 (March 1996).
The Update on February 2012 Activities of the HathiTrust reports on research being done by the HathiTrust Research Center (HTRC) to quantify occurrences of Optical Character Recognition (OCR) errors in the HathiTrust corpus. OCR is the technology that converts a scanned image to text that can be searched and analyzed. Members of the HTRC examined roughly 256,000 non-Google-digitized volumes from HathiTrust using a clever algorithm that compared OCR text to a dictionary of known words. Using a supercomputer and a set of rules that were verified by a human expert, they found that 84.9 percent of the volumes examined (217,754 of the 256,416) had one or more OCR errors and 11 percent of the pages (7,745,034 of the 69,297,000) had one or more errors. The average number of errors per volume was 156.
As we at FGI have argued here before, we believe that it is essential to take into account OCR accuracy and error rates when digitizing paper collections. This is particularly important when digitizing books that contain statistical tables since it is harder to use current OCR technologies to accurately convert image scans to numbers than it is to convert scans to text. It is also harder to evaluate the accuracy of such conversions; you can't use a dictionary of known statistics the way you can use a dictionary of known words. (The HTRC study did not, apparently, examine accuracy of statistical table conversions.) Because of the large volume of such information in government publications, this is a very important issue as we collectively try to digitize our paper collections, evaluate their accuracy and usability, and determine how many paper copies we need to keep after digitization (see Achieving a collaborative FDLP future).
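To make the dictionary idea concrete, here is a minimal sketch in Python of what a word-list check on OCR text might look like. It is not the HTRC's actual algorithm, which we have not seen; the tiny word list, the function names, and the sample sentence are all hypothetical and chosen only for illustration. It also shows why numbers are the hard case: a misread word can be flagged, but a misread figure cannot be looked up anywhere.

```python
import re

# Hypothetical, tiny word list for illustration only; a real check
# would load a full dictionary (and the HTRC rules were more elaborate).
KNOWN_WORDS = {"the", "exports", "of", "tea", "from", "shanghai", "rose", "to", "in"}

# A token made up only of digits, commas, and periods is treated as a number.
NUMBER_RE = re.compile(r"\d[\d,.]*")


def check_page(text, known_words=KNOWN_WORDS):
    """Return (suspect_words, unverifiable_numbers) for one page of OCR text."""
    suspect, numbers = [], []
    for raw in text.split():
        token = raw.strip(".,;:()\"'")  # drop surrounding punctuation
        if not token:
            continue
        if NUMBER_RE.fullmatch(token):
            # No dictionary of "known statistics" exists, so a number
            # can only be set aside for checking against the paper copy.
            numbers.append(token)
        elif token.lower() not in known_words:
            suspect.append(token)
    return suspect, numbers


def error_rate(items_with_errors, items_examined):
    """Percentage of examined items (volumes or pages) that had errors."""
    return 100.0 * items_with_errors / items_examined


# Made-up sample line with one OCR-style misreading ("exp0rts").
suspect, numbers = check_page("The exp0rts of tea from Shanghai rose to 12,345 in 1919.")
volume_rate = error_rate(217_754, 256_416)      # volume counts reported above
page_rate = error_rate(7_745_034, 69_297_000)   # page counts reported above
```

In this sketch the misread word ends up in `suspect`, while "12,345" and "1919" land in `numbers` with no way of telling whether they were read correctly. That is the gap we worry about for statistical tables.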
Earlier this month, we posted about the "Open letter and petition to President Obama to create a federal scanning commission and digitize all .gov publications". The petition closed on 1/20 and now David Ferriero, the Archivist of the US at the National Archives, has given the official NARA response. I'd say this is a positive first step, but much discussion is still needed. Please join the conversation over at the NARA Blog. I think documents librarians will be invaluable to this effort going forward!
Digitizing Federal Public Records
By David Ferriero
Thank you for signing a petition asking the Obama Administration to digitize all public records.
The Obama Administration believes increasing access to our collections by digitizing our records is a great idea. Our most recent efforts to do this ourselves as part of our OpenGov initiative, include the Citizen Archivist project, a Wikipedian in Residence, Tag it Tuesdays, and Scanathons. We are also moving forward on implementing the President’s recent Memorandum on Managing Government Records, which focuses on the need to update policies and practices for the digital age.
But all those things aren’t enough. Your petition, and the Yes We Scan effort broadly, calls for a national strategy, and even a Federal Scanning Commission, to figure out what it would take to digitize the holdings of many federal entities, from the Library of Congress to the Government Printing Office to the Smithsonian Institution.
These ideas bring up a host of questions that still need to be answered: What should the National Archives’ priorities be? Do we focus on preserving deteriorating paper records, still bound with red ribbons from two centuries ago? Do we make digital copies of Vietnam Era film footage? Should we focus on preserving those older paper records while citizens volunteer to digitize more recent, and better preserved, records?
The National Archives – which houses the Nation’s permanent records – is looking for your input to help answer these important questions on how we move forward. What are your thoughts on how the National Archives and other agencies should proceed? What questions should we be asking ourselves?
You can add your thoughts over on the National Archives blog, and I’m looking forward to having a longer discussion with the creators and signers of this petition on this important issue in the coming weeks– more details on that will follow.
Thank you again for your interest in this important issue. I’m looking forward to your ideas on how we can proceed with digitizing federal public records.
David Ferriero is the Archivist of the United States