
Tag Archives: digital collections


Lunchtime Listen: Future Crimes


Lunchtime Listen: Book Discussion on Future Crimes, C-SPAN, Book TV (February 25, 2015).

Marc Goodman talks about his book, Future crimes: Everything is connected, everyone is vulnerable, and what we can do about it (New York : Doubleday, 2015), about how criminals, corporations, and governments use technology to disrupt the lives of people around the world.

Although Goodman does not address the preservation of government information, his book provides useful context for the challenges of successfully protecting any large store of data. His analysis of the state of cybersecurity should make government information professionals question the wisdom of relying solely on individual government agencies to secure long-term access to essential government information.

A good alternative is to build digital FDLP collections in FDLP libraries. The LOCKSS Digital Federal Depository Library Program is one partial model for this: it provides duplicate copies of GPO’s FDsys content distributed across more than three dozen libraries using the proven technology of the LOCKSS system.

An additional and even better model would be for more FDLP libraries to build their own digital collections of federal government documents. By building separate collections tailored to the needs of their own (geographically unlimited) communities, libraries would add a layer of security: such collections would be separately funded, separately administered and managed, and separately secured using different technologies.

“An alarmingly casual indifference to accuracy and authenticity.” What we know about digital surrogates

In a new article in Portal, Diana Kichuk examines the reliability and accuracy of digital text extracted from printed books in five digital libraries: the Internet Archive, Project Gutenberg, the HathiTrust, Google Books, and the Digital Public Library of America. She focuses particularly on the accuracy and utility of the digital text for reading in e-book formats and on the accuracy of metadata derived from extracted text.

This study, along with a couple of others cited below, is very relevant to the repeated calls by some within the Federal Depository Library Program to digitize and discard the historic FDLP paper collections. These studies, even though they do not focus on government publications, provide examples, data, and standards that the depository community should review critically before implementing discarding policies that will have irreversible effects.

* * *

Kichuk’s article is well worth reading in its entirety, as she identifies many problems with digital text created during the digitization of paper books by OCR (Optical Character Recognition) technologies and gives specific examples. The two most important problems she highlights are that digitized texts often fail to accurately represent the original, and that the metadata automatically created from such text is too often woefully inaccurate. These problems have real effects on libraries and library users. Readers will find it difficult to accurately identify and even find the books they are looking for in digital libraries, and libraries will find it difficult to confidently attribute authenticity and provenance to digitized books.

Kichuk says that digitized text versions of print books are often unrecognizable as surrogates for the print book, and that it may be “misleading at best” to refer to them even as “equivalent” to the original. Although she examined only a small number of e-books (approximately seventy-five), she found “abundant evidence” of OCR problems, suggesting to her that such problems are widespread and endemic.

A 2012 report by the HathiTrust Research Center reinforces Kichuk’s findings. That study found that 84.9 percent of the volumes it examined had one or more OCR errors, 11 percent of the pages had one or more errors, and the average number of errors per volume was 156 (HathiTrust, Update on February 2012 Activities, March 9, 2012).

* * *

Most of the examples we have of current-generation digitization projects, particularly mass-digitization projects, provide access to digital “page images” (essentially pictures of pages) of books in addition to OCR’d digital text. So, to get a more complete picture of the state of digitization, it is instructive to compare Kichuk’s study of OCR’d text to a study by Paul Conway of page images in the HathiTrust.

Fully one-quarter of the 1,000 volumes examined by Conway contained at least one page image whose content was “unreadable.” Only 64.9 percent of the volumes examined were considered accurate and complete enough to be considered “reliably intelligible surrogates.” Presumably, that means more than 35 percent of the volumes examined were not reliable surrogates.

Conway’s study reinforces the findings of the Center for Research Libraries when it certified HathiTrust as a Trusted Digital Repository in 2011. (Full disclosure: I was part of the team that audited HT.) CRL said explicitly that, although some libraries will want to discard print copies of books that are in HT, “the quality assurance measures for HathiTrust digital content do not yet support this goal.”

Currently, and despite significant efforts to identify and correct systemic problems in digitization, HathiTrust only attests to the integrity of the transferred file, and not to the completeness of the original digitization effort. This may impact institutions’ workflow for print archiving and divestiture. (Certification Report on the HathiTrust Digital Repository).

* * *

Together, these reports provide some solid (if preliminary) data that should help libraries make informed decisions. Specifically, all of these studies show that it would be risky to use digitized copies of the FDLP historic collections as reliable surrogates for the original paper copies. That means it would be risky to discard original paper copies of documents simply because they had been digitized.

Although Conway suggests, as others have, that libraries (and users) may have to accept incomplete, inaccurate page images as a “new norm” and accept that they are not faithful copies, he also realizes that “questions remain about the advisability of withdrawing from libraries the hard-copy original volumes that are the sources of the surrogates.”

Kichuk goes further in her conclusions. She wisely envisions that the “uncorrected, often unreadable, raw OCR text” that most mass-digitization projects produce today will be inadequate for future, more sophisticated uses. She looks specifically to a future in which users will want and expect e-books created from digitized text. She warns that current digitization standards, coupled with insufficient funding, are not creating text that is accurate or complete enough to meet the needs of users in the near future. And she recognizes that librarians are not stepping up to correct this situation. She describes “an alarmingly casual indifference to accuracy and authenticity” of OCR’d text and says that this “willful blindness” to the OCR problem is suppressing any sense of urgency to remedy it.

She concludes from her small sample that there should be a more systematic review by the digital repository community prior to the development of a new digitized e-book standard, especially for metadata and text file formats.

I agree with Kichuk and Conway and CRL that more work needs to be done before libraries discard their paper collections. Librarians and their communities need a better understanding of the quality of the page images and digitized text that digitization projects produce. With that in mind, James R. Jacobs and I addressed this very problem in 2013 and suggested a new standard for the quality of page images, which we call the “Digital Surrogate Seal of Approval” (DSSOA):

Libraries that are concerned about their future and their role in the information ecosystem should look to the future needs of users when evaluating digitization projects.

FDLP libraries have a special obligation to the country to preserve the historic collections in their charge. It would be irresponsible to discard the complete, original record of our democracy and preserve only an incomplete, inaccurate record of it.

The Official Senate CIA Torture Report

Update


GPO has released an official version of the Senate CIA Report as Senate Report 113-288. The digital version is available on GPO’s Federal Digital System (FDsys):
http://www.gpo.gov/fdsys/pkg/CRPT-113srpt288/pdf/CRPT-113srpt288.pdf
The print version is available for purchase at GPO’s retail and online bookstore for $29.
http://bookstore.gpo.gov/products/sku/052-071-01571-0

This is a single-volume, 712-page version. It contains:

Letter of Transmittal to Senate from Chairman Feinstein — i
Foreword of Chairman Feinstein — iii
Findings and Conclusions — x
Executive Summary — 1
Additional Views of Senator Rockefeller — 500
Additional Views of Senator Wyden — 503
Additional Views of Senator Udall of Colorado — 506
Additional Views of Senator Heinrich — 510
Additional Views of Senator King — 512
Additional Views of Senator Collins — 515
Minority Views of Vice Chairman Chambliss, Senators Burr, Risch, Coats, Rubio, and Coburn — 520
Minority Views of Senator Coburn, Vice Chairman Chambliss, Senators Burr, Risch, Coats, and Rubio — 678
Minority Views of Senators Risch, Coats, and Rubio — 682

GPO Press Release:

FOR IMMEDIATE RELEASE: December 15, 2014

GPO RELEASES THE OFFICIAL DIGITAL & PRINT VERSIONS OF THE SENATE CIA REPORT

WASHINGTON – The U.S. Government Printing Office (GPO) makes available the official and authentic digital and print versions of the Report of the Senate Select Committee on Intelligence Committee Study of the Central Intelligence Agency’s Detention and Interrogation Program, together with a foreword by Chairman Feinstein and Additional and Minority Views (Senate Report 113-288).

This document comprises the declassified Executive Summary and Findings and Conclusions, including declassified additional and minority views. The full classified report will be maintained by the Committee and has been provided to the Executive Branch for dissemination to all relevant agencies.


The release of the Senate’s Study of the CIA’s Detention and Interrogation Program presents some interesting issues for government documents collections.

Issues

There are three separate documents, and they are easily findable on a number of different websites, but not all sites have all three documents, and the different copies of the individual documents are not the same.

The “official” copies are (at least today) listed on the home page of the Senate Committee’s website (see below), but are not listed on the Committee’s Publications page or its Press Releases page – perhaps because the report is not an official committee document with an assigned “Document” or “Report” number. Presumably it will not be in FDsys unless or until it gets an official Document or Report designation.

(Why isn’t it “official”? The report was initially intended to be a full committee report. In 2009 the Committee voted 14–1 to initiate the study, but later that year Republicans on the Committee withdrew from active participation in the study.)

My speculation is that the different PDF files you can find on the web differ slightly because each one was produced by scanning a paper copy with different software. I do not know whether the Committee distributed only a paper copy, but I do know that even its own PDF copy is (apparently) a scanned copy. (You can tell because, if you try to copy the text from the PDF, you will discover that it is badly OCR’d (optical character recognition) text. For example, the names of Senators are sometimes badly converted in the digital text: Chambliss becomes “CHAMBUSS” and Rubio becomes “Rvbio.”) The official copies were created using Adobe PDF Scan Library 3.1 and ScandAll PRO V2.0.12.
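
If you want to check a copy yourself, the short sketch below shows one way to inspect a PDF’s producer metadata and sample its embedded text. This is a minimal illustration, assuming the third-party pypdf Python library; the file name is hypothetical.

    # Minimal sketch: inspect a PDF's producer metadata and sample its text layer.
    # Assumes the pypdf library (pip install pypdf); the file name is hypothetical.
    from pypdf import PdfReader

    reader = PdfReader("senate-report.pdf")

    # The Producer/Creator fields often reveal the scanning and OCR software used.
    info = reader.metadata
    if info is not None:
        print("Producer:", info.producer)
        print("Creator:", info.creator)

    # Sample the first page's embedded text; garbled names (e.g., "CHAMBUSS")
    # are a sign of uncorrected OCR output.
    print(reader.pages[0].extract_text()[:500])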

Official Reports and Statements

The Senate Select Committee on Intelligence currently has links to three documents on its home page.

The CIA has its own responses to the report, currently listed on its Reports page.

Other official statements.

Unofficial Copies

A web search for the report’s title (“Committee Study of the Central Intelligence Agency’s Detention and Interrogation Program”) leads to many sites with copies. Many of these apparently come directly from the Committee site, but at least one news organization (the New York Times) evidently made its own scanned copy and digitized text version of the main report.

  • The New York Times has a PDF copy [108.4MB, 528 pages] and a plain text copy. The PDF version was created using Acrobat 11.0.9 Paper Capture Plug-in and Xerox WorkCentre 5150. Both are stored with an Amazon cloud service.

Timeline

ProPublica has created a useful timeline to put the report in perspective.

FDLP Library Actions

What can FDLP libraries (or any library) do to ensure that their users will be able to find and get unaltered, official copies in the future? Just relying on the web may not be adequate, secure, consistent, transparent, or guaranteed. There are several issues. The existing links to even the official documents may not be stable. The official digital copies are only digital surrogates of the original paper copy. There are already other, alternative digital surrogates available. The quality of the surrogates varies, and the links to those copies may also not be stable.

I suggest the following actions by libraries:

  • Get copies of the official digital versions directly from the Committee web site as soon as possible (see links above).
  • Create a digital “hash” or “checksum” of the documents you download. (See a list of various tools and a discussion of checksums for preservation, if you are unfamiliar with the concepts; a minimal sketch of the process appears after this list.)
  • Catalog your copies and include them in your OPAC or other official library inventory and discovery databases. Include adequate metadata that describes how, when, and where you got your copies.
  • Ideally, you should store your copies in a Trusted Digital Repository. Unfortunately, there are, as yet, very few certified TDRs. Short of that, be sure that you have copies stored in more than one geographic location and that you have a way of verifying over time (using the checksum) that the files you stored have not been altered or corrupted.
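
For anyone unfamiliar with the mechanics behind the second and fourth suggestions, here is a minimal sketch, in Python, of what creating and later verifying a SHA-256 checksum might look like, with simple provenance metadata recorded alongside it. The file names, URL, and manifest layout are illustrative assumptions, not a prescribed workflow.

    # Minimal sketch: record a SHA-256 checksum (plus provenance metadata) at
    # download time, then verify the stored file later. File names, the URL,
    # and the manifest layout are hypothetical.
    import hashlib
    import json

    def sha256_of(path, chunk_size=1 << 20):
        """Compute the SHA-256 digest of a file, reading it in chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    # At download time: record the hash along with how, when, and where you got the file.
    manifest = {
        "file": "senate-report.pdf",
        "source_url": "https://example.gov/report.pdf",  # hypothetical source
        "retrieved": "2014-12-15",
        "sha256": sha256_of("senate-report.pdf"),
    }
    with open("manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)

    # Later, and periodically: confirm the stored copy has not been altered.
    with open("manifest.json") as f:
        manifest = json.load(f)
    if sha256_of(manifest["file"]) != manifest["sha256"]:
        print("WARNING: file has been altered or corrupted!")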

Reflections on the end of a year and the beginning of a new year

 

James and I were both impressed to see that MLIS student Crystal Vicente posted a link on govdoc-l to her paper about the effects of the government shutdown on government information. It is great to see library students addressing the issues that confront the future of government information. We’d like to see more students posting their papers to promote engagement between students and practitioners. [Note: the link to Vicente’s paper has changed. Here is the correct link (as of 4/4/14) to Online electronic government information and the impact of the government shutdown on public access.]

Ms. Vicente’s paper prompted us to reflect on the issues that face government information and FDLP libraries. As we end one year and begin a new one, we offer these reflections on some of the big issues that still face us.

  1. e-government. We think it is important for librarians to differentiate between “e-government” and information dissemination by the government. The two terms may overlap in common usage, but we think it is useful to draw a distinction between e-government, which is a service that uses government information, and the information itself. We think it is useful to distinguish between interactions (or transactions) with the government (which are two-way communication) and the government disseminating information (which is one-way communication, an instantiation of the government’s work and processes). E-government and the provision of information services by the government on the web will surely increase over time, and this is a very important thing to monitor and evaluate, but we believe that librarians need to pay even closer attention to the (public) information that the government gathers, assembles, and creates and then uses to create such services.

    A citizen might go to a government website and submit a form to find out the current population of their city, the phone number of their Representative, or the current laws or regulations on a subject. This is an e-government information service driven by government information. Will the government preserve and make that information available after it is “out of date” or when it is not popular enough to warrant a full-fledged service? This isn’t hypothetical: we have already seen the Census Bureau remove whole Decennial Censuses, one of the biggest and most important collections of government information, from its American FactFinder service and announce its intention of continuing to do so as new data become available. (See: American FactFinder, American FactFinder Communications, AFF2 Expansion and Legacy Sunset, The Future of the Decennial Census: Where is it Going?, and IASST-L For users of the US Census Bureau’s American Factfinder: that “include archived products” checkbox does nothing, officially.)

    Services may come and go, but the information itself needs to be preserved, and free access to it and services for it will need to be provided for the long term — regardless of the short-term services the government provides with that information.

    The data behind such e-government information services can be (and, in our view, should be) acquired by libraries to ensure its long-term preservation and availability. In some cases, this will be more like acquiring databases than acquiring static documents — a task which has its own challenges, but is not without precedent. (See the long history of relations between ICPSR and libraries in the preservation of social science data and delivery of data services, and the current trend for libraries to be involved in “data management.”) E-government initiatives are less likely to worry about long-term, life-cycle data preservation and access, since many (most?) focus instead on currency of information and responses to individual queries. As we see more e-government services (such as healthcare.gov and online facilities for applying for driver’s licenses), this will become an even bigger issue, and the distinction between e-government transactions and the information behind such services will become even more important.

    E-government can also place a significant (and most often unfunded) burden on library staff. Meanwhile, the instantiation of the workings of government has never been more at risk as we deal with the many issues and difficulties surrounding born-digital government information.

    The availability of e-government information services should be seen not as an excuse for reducing library involvement in government information, but as evidence of the need for even more involvement in the long-term preservation of and access to information!

    (See also E-Gov: are we citizens or customers?)

  2. the single source problem. Although the recent government shutdown was a very visible and extreme reminder of the problems of relying on a single source for any information, complete shutdowns are not the only source of problems. When we rely on a single source for creation, preservation, and access to information, any decision made by that single source can potentially affect the availability or accuracy of the information. (See, for example: When we depend on pointing instead of collecting and The Technical is Political.) And when all we have to rely on is commercial sources for public information, it is a sign of the failure of libraries to do their job of ensuring long-term free public access to public information.

  3. discovery. The issue of information discovery is different from, but connected to, these other issues. When we rely on a single information provider (such as GPO or an information agency like BLS), we are limited to their interface and discovery tools and their choices of what to deliver and how. We are also limited by the “stove-piping” of information into silos. But, if libraries build collections along subject and discipline and user-community lines (instead of provenance or agency lines), one result will be new ways to discover and get information and new ways to combine information from government sources and non-government sources in ways that work better for users and that the government itself cannot do.

  4. authenticity. The question of authenticity is not a technical problem, though technology can help. The thing that makes it hardest for information to be forged or altered or damaged is replication with trusted parties. (See LOCKSS Preservation Principles and Who do you Trust? The Authentication Problem.)

  5. fugitives. We all know that fugitive documents are an issue, especially as agencies publish more and more of their information online and outside of the traditional FDLP. We would hope that in 2014 and beyond, the community will put more energy toward fugitives. We need to collectively figure out which agencies are best and which are worst at complying with Title 44. We need to identify the current community activities that deal with fugitives (see, for example, Lost Docs), and gaps where more needs to be done. This also relates to the above issue of e-government services that provide short-term access to information, but that do not instantiate it for long-term preservation and access.

  6. bottlenecks. We should also explore what bottlenecks to access exist besides government shutdowns. This gets into government secrecy (“less access to less information” territory), the digital divide, and the “information deluge.” We used to talk about “drinking from the firehose” — but the issue of finding the right information amidst too many Google search results is becoming even more important. (Howard Rheingold calls this “crap detection.”) This affects librarians as well as the public. We need to select, describe, and preserve the information our communities need and then curate those collections and create digital tools, services, and educational materials that will help our communities find and use the government information they need.

  7. finding solutions. We believe that the government information community needs to act in concert to develop solutions for digital information access and preservation. There are many possible solutions, and they should all be explored. At FGI, we’ve long advocated for digital deposit to fend off the single-point-of-failure issue (see When we depend on pointing instead of collecting), and FDLs have put energy toward building digital collections (e.g., lockss-usdocs, archive-it, EEMs). But we think there are other avenues to explore besides end-of-the-publication-cycle solutions. What about things like “adopt a federal agency,” where one FDL or a group of FDLs liaises with an agency to assure that its publications and data make their way into the FDLP? Or FDLP libraries advocating that federal agencies get together to collaboratively build information architectures that facilitate preservation rather than obfuscate it? In other words, are there any beginning-of-the-information-lifecycle ideas that could be explored?

  8. beyond docs…. Finally, how do the issues we’re dealing with in government information carry over and affect other library collections, services, and policies in general? To what extent is government information the canary in the library coal mine (see, e.g., What’s love got to do with it? further thoughts on libraries and collections #lovegate)?

 

There is plenty to work on! Here’s to a happy, healthy, collectible and preservable 2014!

— James and Jim

Freedom Summer online archive now available

Great news: there is now a digital archive providing access to the historically important “Freedom Summer,” a seminal moment in the US civil rights movement. The Wisconsin Historical Society has just released its 1964 Freedom Summer Project collection. Not only are there 25,000 manuscripts and key documents, but there are also finding aids to help users access the information and instructional materials for teachers.


Dear colleagues,

We’ve just released an online collection of 25,000 manuscripts related to the 1964 Mississippi Freedom Summer project. It’s free and open to anyone for non-profit educational purposes at

www.wisconsinhistory.org/freedomsummer

Besides thousands of archival documents from COFO, CORE, and SNCC and papers from dozens of individual activists, the site includes a downloadable PowerPoint about Freedom Summer and a PDF sourcebook of key documents for teachers.

I’d be grateful if you’d forward this note to colleagues and educators who might be interested. As the 50th anniversary of Freedom Summer approaches, we want teachers, students, historians, librarians, museum curators, the media, and anyone else to use these primary sources in their 50th anniversary programming.

We’ll be adding a few thousand more pages this year, so please “like” us on Facebook and follow along:

www.facebook.com/WHS.Freedom.Summer.collection?fref=ts

Best wishes,

Michael Edmonds

Deputy Director,
Library-Archives Division
Wisconsin Historical Society
