Tag Archives: hathitrust
This is big news indeed. According to a press release, HathiTrust’s Federal Documents Registry — pulled together from the records of over 40 libraries — is now available as a beta release! Mike Furlough and Valerie Glenn gave a very good presentation of the project, methodology, etc. at IFLA last week.
A HUGE thanks to HathiTrust and especially to Valerie Glenn, who took on this immense project. Hopefully, it’ll not only help HathiTrust identify materials not yet digitized in its corpus, but also serve the 1100+ depository libraries and public information users. Hopefully the FDLP will be able to use the registry to put together a union list and holdings of all documents in order to further FDLP’s goal of preserving all historic FDLP documents. I know I’ll be comparing Stanford Library’s federal documents collection to the registry in order to find and fill gaps in our collections and also, hopefully, as a source of cataloging copy for documents not yet cataloged.
The Registry is intended to be a comprehensive source of metadata for the US federal documents corpus – material produced at government expense since 1789. While many potential use cases exist, an important use will be the identification of materials that have not yet been digitized and/or deposited into the HathiTrust repository. The Registry was conceived in 2012 as a mechanism to determine how far HathiTrust had progressed in meeting its goal of a comprehensive digital corpus, as outlined in the ballot initiative from the 2011 Constitutional Convention. In the fall of 2013, we issued a broad call for records, and thanks to the more than 40 libraries who responded we received more than 25 million records. With such a large aggregation of records, the project team needed to develop multiple approaches for detecting and grouping duplicate records (records describing the same work).
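The press release does not say how the project team actually detects duplicates, so the following is only a minimal sketch of one common approach: normalize a few descriptive fields and group records that share the resulting key. The `title`, `year`, and `id` field names and the sample records are assumptions made for illustration, not the Registry's real schema or method.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def match_key(record):
    """Build a comparison key from the hypothetical 'title' and 'year' fields."""
    return (normalize(record.get("title", "")), str(record.get("year", "")).strip())


def group_duplicates(records):
    """Return groups of two or more records that share the same match key."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return [group for group in groups.values() if len(group) > 1]


# Illustrative records only; the ids and field names are made up.
records = [
    {"id": "lib-a-1", "title": "Commercial Handbook of China.", "year": 1919},
    {"id": "lib-b-7", "title": "commercial handbook of  China", "year": "1919"},
    {"id": "lib-c-3", "title": "Statistical Abstract of the United States", "year": 1950},
]
duplicate_groups = group_duplicates(records)
```

Exact matching on a normalized key is cheap enough to run over millions of records, but it misses variant titles and reprints, which is presumably one reason a real project would need more than one approach.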
This document is meant to accompany the article, “What are we to Keep?” by James R. Jacobs, Documents to the People (Spring 2015) p 13-19.
- What is a Preservation Copy?
Research prompted by JSTOR’s desire to guarantee that all of the printed material within its journals would remain available defined preservation copies as “clean copies that retain full information accuracy from the vantage point of the researcher” (Yano). Thus, when we think about “preservation copies,” we want to ensure that copies are available for the long term and that those copies are complete and accurate. “Information accuracy” implies a “perfect copy”: a copy that is as good as new. A preservation copy is, therefore, a “clean” copy that is quality-checked and repaired, if necessary, on a page-by-page basis.
- Why do we need Preservation Copies?
Even if we had perfect digital copies of paper documents, we would still need paper preservation copies for two reasons. First, there is evidence that digital documents degrade more rapidly than print material (Rosenthal), so it is necessary to have a paper copy that could be used to re-digitize. Second, digitization does not magically preserve paper; or, to put it another way, digital copies are not the same as print copies and may inherently lose information by the very dint of reformatting to a new presentation.
- Why do we need Access-Copies?
Unless we have perfect, page-verified digitizations that are as complete, as accurate, and as easily usable as the original paper copies (Jacobs and Jacobs), users will inevitably need to go back to the original paper copy in order to get either the complete and accurate content or the functional usability of the original paper medium. Some libraries have already reported that digitization of paper copies has increased the demand for access to the paper copies. Additionally, some users and uses will require access to physical copies via interlibrary loan (ILL). ILL can only happen if there is a surplus of copies. As the number of copies approaches zero (scarcity), libraries will no longer be willing to lend through ILL. Therefore, it is imperative that there not be a dearth of geographically distributed copies.
- Why do we need re-digitization copies?
Unless we create perfect copies that adequately anticipate the future needs of users, we will need to create new digitizations in order to meet those future needs. (See “An alarmingly casual indifference to accuracy and authenticity” What we know about digital surrogates.)
What should I think about before discarding government documents?
1. In General
- Does the document have long-term historical value? If it is a recently published document, *will* it have historical value?
- Does the document include tabular data and statistics?
- Does the document include maps, fold-outs, color illustrations, and other non-textual content?
- Does the library have adequate metadata representation in the library’s catalog for the document?
- Is the document discoverable and accessible?
2. About Paper Copies
- How many other libraries are listed in the OCLC record as having a copy?
- Are there other copies in nearby FDLs?
- Are there MOUs for shared collections with nearby libraries/consortia in place?
3. About Digital Copies
- Does the digitization meet the requirements of the Digital Surrogate Seal of Approval (DSSOA)?
- Is the digital copy adequately cataloged?
- Does the digitization include digital full-text (aka OCR)?
- Is the full-text searchable for item-level discovery?
- Is the full-text searchable within an item?
- Can the digital text be accurately copied or extracted?
- How accurate is the digital text, particularly with regard to tabular numeric data, dates, and named people, places, and things?
- Does the digitized text preserve the original layout of the print text, particularly with regard to tables, footnotes, sidebars, and headers and footers?
- Is the document freely and publicly available in a trusted digital repository?
- Does your community have complete access and use rights to the digital copy?
- Has anyone checked the digital document page-by-page to assure its accuracy, legibility, usability, and searchability?
- Does your library have any control over the long-term availability of the document?
Ames, Eric. 2012. “So We Can Throw These Out Now, Right?”: What We Learned From Microfilming Newspapers and How It Shapes Our Digitization Strategy. The Baylor University Libraries Digital Collections Blog (August 23, 2012).
Center for Research Libraries. 2011. Certification Report on the HathiTrust Digital Repository (March 2011).
Jacobs, James A., and James R. Jacobs. 2013. “The Digital-Surrogate Seal of Approval: A Consumer-Oriented Standard.” D-Lib Magazine 19, no. 3/4 (March 2013). doi:10.1045/march2013-jacobs.
Kichuk, Diana. 2015. “Loose, Falling Characters and Sentences: The Persistence of the OCR Problem in Digital Repository E-Books.” Portal: Libraries and the Academy 15, no. 1 (2015): 59–91. doi:10.1353/pla.2015.0005.
Ladd, Ken. 2010. An Examination of the Failure Rate and Content Equivalency of Electronic Surrogates and the Implications for Print Equivalent Preservation. Evidence Based Library and Information Practice (2010) 5.4.
McEathron, Scott R. 2011. An Assessment of Image Quality in Geology Works from the HathiTrust Digital Library. Proceedings, Geoscience Information Society, Volume 41 (October 27, 2011).
Nadal, Jacob, and Annie Peterson. 2009. Scarce and Endangered Works: Using Network-Level Holdings Data in Preservation Decision Making and Stewardship of the Printed Record. Preprint, accepted for publication in ALCTS Monographs.
Schonfeld, Roger C., and Ross Housewright. 2009. Documents for a Digital Democracy: A Model for the Federal Depository Library Program in the 21st Century. Ithaka S+R (December 17, 2009).
Schonfeld, Roger C., and Ross Housewright. 2009. What to Withdraw: Print Collections Management in the Wake of Digitization. Ithaka S+R (September 29, 2009).
Yano, Candace Arai, Z.J. Max Shen, and Stephen Chan. 2008. Optimizing the Number of Copies for Print Preservation of Research Journals. Berkeley, CA: University of California Berkeley, Industrial Engineering & Operations Research (October 2008). [originally published at http://www.ieor.berkeley.edu/~shen/webpapers/V.8.pdf]
The Spring 2015 issue of Documents to the People (DttP) just arrived at my door. The feature article in this issue is titled “Thoughts on the National Collection” and was collaboratively written by myself, James R. Jacobs, along with Shari Laster, Aimee C. Quinn, and Barbie Selby. I’m posting my segment titled “What Are We to Keep?” as it was written under a Creative Commons Attribution-NonCommercial-Share-Alike CC BY-NC-SA license. The other pieces include: “Segmenting the Government Information Corpus” by Shari Laster; “Who is Responsible for Permanent Public Access?” by Aimee C. Quinn; and “Where Do We Go From Here?: Some Thoughts” by Barbie Selby. I’ll post the other segments if I get permission from my collaborators.
The question of “how many copies” of print documents the FDLP should collectively keep is the wrong question asked for the wrong reasons and trying to answer it will only lead to the wrong answers and irreparable loss of information. For me, even thinking about answering it raises more questions. How can we know how many copies to keep unless we specify the purposes for which we wish to keep them? What are those purposes? How will we know if we are meeting our goals? How will discarding paper benefit users? How can we be sure that we are not losing information when we discard paper copies if we do not have an inventory of the paper copies that exist? How can we implement a policy that is so vague that it doesn’t define things like “a requisite number of copies,” and how decisions will be made, and which apparently treats a born-digital XML document created by GPO and an indifferent digitization without OCR text and missing its maps and foldouts as of equal value?
Let’s be clear. We are talking about the records of our democracy. Loss of even a single page could damage the ability of historians, journalists, economists, and citizens to understand our history and hold our government accountable for its successes and its failures. We have those documents now in our libraries; there are not hundreds or even dozens of copies of these documents floating around in used bookstores or elsewhere. They are in our charge.
Also see the What Are We To Keep FAQ for further context and bibliography.
I just received an old (historic NOT legacy) Department of Commerce publication off of the needs and offers list called “Commercial handbook of China” by Julean Arnold, commercial attaché (WorldCat record). It’s actually a 1975 reprint of a 1919 publication. It’s chock full of statistics relating to provinces, cities, and consular districts — agriculture, minerals and mining, populations, exports and imports, revenues, transportation, ports and shipping facilities etc. In short, this is a gold mine of historic information and statistics from the Republic of China (pre-Communist China). The document was digitized and is available in HathiTrust as well as the Internet Archive (see book reader below).
However, in comparing the digitized version with the paper version in hand, I came upon several issues:
- There are 3 foldout maps that were not digitized. These maps contain critical information on railway lines and treaty ports in China. The bibliographic record has a physical description including “2 v. fronts., plates, fold. map, tables, diagrs., fold. charts” but no content note mentioning that the maps were not digitized.
- As I mentioned, the document is chock full of statistical tables. Have you ever tried copying and pasting tabular data from a PDF? It’s even worse when the tables are displayed in landscape rather than portrait. I’ve verified that the OCR fails on those pages.
- Lots of readability/usability issues: The table of contents is partially obscured in one copy and the tables are often blurred or faint. Also, HT now uses an OCR process in which you can search but not copy or paste.
- Lastly, I find it … uh… interesting that this book says here “Copyright: Public Domain, Google-digitized.” But, if you want to download the whole book, you have to be an HT partner.
Does this digitized version increase access to this important historic material? Yes, indeed, it does. But I’m rather glad to have a bibliographic record in my catalog that links to the digital version AND points to the paper copy in our collection.
Here are 2 recent items analyzing HathiTrust and Google Books for their efficacy in giving access to Federal government documents. The first is an article by Laura Sare (Texas A&M) that compares HathiTrust with Google Books. The second is a presentation by Brian Vetruba (Washington U in St Louis) at “Leveraging Your Strengths: Regional Government Documents Conference” at the Federal Reserve Bank of St. Louis on May 4, 2012.
A Comparison of HathiTrust and Google Books Using Federal Publications. Laura Sare. Practical Academic Librarianship: The International Journal of the SLA Academic Division. 2(1) 2012 p. 1-25. (attached below. Fair use claim)