Note: This post is a longer response to recent comments left on the LJ article written by Jim A. Jacobs and Melody Kelly.
Recently, Library Journal published a short article that Melody Kelly and I wrote (The Future of the FDLP: From Conversation to Confrontation). In such a short opinion piece, we did not have the space to document our arguments and expand on our reasoning. Since the article was posted, it has received a few comments. So we are using this post to expand on our LJ piece a bit and re-post here our replies to those comments.
As we said in LJ, the discussions about shared regional depositories have morphed from a conversation into a confrontation. This was our main point. As Daniel Cornwall said here recently, just as GPO needs to work with FDLP, so all FDLP libraries should be working with GPO, not working against it. We are concerned that the tactics adopted by ARL and many of the association’s prominent members are more confrontational than cooperative and that such tactics may harm GPO at a time when it needs our support.
Some of the issues involved are complex and confusing and we imagine that many non-FDLP librarians — and even some in the FDLP community — find it difficult to follow the sometimes arcane, legal details of the arguments on both sides. This is particularly true when attempts are made outside the current GPO rules to redefine the scope or procedures of an existing Regional.
Because the FDLP is based upon federal law in Title 44 of the US Code, any changes in the role of the regional depository libraries would require legislative revision and would therefore be slow. As the FDLP community knows from past experience, legislative revisions to Title 44 must be carefully orchestrated, may take several Congressional sessions, will require the support of the entire FDLP community working with GPO and Congress, and are probably unlikely. But there is much that can be done, and done now, by individual libraries, by groups of libraries, and by GPO, that will move us all forward toward our shared goals of permanent no-fee access to government information regardless of format.
Moving forward together.
In comments to the LJ article, Bill Sudduth and John Burger said that the article has inaccuracies, innuendos, and fabrications. Mr. Sudduth says that we raise “narrow issues” and straw-man arguments and implies that we have irrelevant standards of “satisfaction.” Strong words! But we believe that both Mr. Sudduth and Mr. Burger badly misinterpret our concerns. Here is our reply to their comments:
The comments of Mr. Burger and Mr. Sudduth imply that we object to the goals of the ASERL proposal, but we do not. We share their goals. We have expressed concerns about whether or not the ASERL plan will accomplish its goals. They have not addressed what we see as legitimate concerns.
As government information librarians who have for years been urging GPO and FDLP libraries to move to a digital FDLP, we support the goals of ASERL to improve bibliographic control of our collections and to provide digital access to older printed materials and we share the frustration of not being able to move forward more quickly. But as digital librarians who have created and managed government-information digital-library projects and who have a combined professional experience of over 40 years of designing, building, supervising, and evaluating digital library and digital preservation projects of all kinds, we also question some of the specific means ASERL is proposing to reach those goals.
The good news is that ASERL can accomplish the majority of its goals (create better inventories of their collections, increase and enhance cataloging, digitize documents to provide “additional access points”) without GPO approval. They can also build their Centers of Excellence under existing FDLP procedures for Selective Housing Agreements.
The only thing they cannot do without GPO’s (or, indeed, the Joint Committee on Printing’s) consent is weed the existing print collections in the regional depositories. Although Mr. Burger says that ASERL does not “advocate” “wholesale weeding” or replacing tangible copies with digital surrogates, we believe the plan will permit just that. Indeed, the ASERL Implementation Plan explicitly allows for “the region” to have “at least two complete cataloged sets of print publications.” To us, this explicitly permits a reduction from twelve copies to two. Although the Plan allows for the possibility of retaining more than two, it does not require more than two, nor does it justify how only two copies might be adequate.
We are further led to this conclusion because for years we have read how ARL libraries advocate weeding their collections, minimizing the number of paper copies, and using digital surrogates to replace paper. (See for example Burger, et al., ASERL’s Virtual Storage/Preservation Concept; ARL, Future Directions for the Federal Depository Library Program; Russell, Remarks by Judy Russell, 142nd ARL Membership Meeting; and Ithaka S+R, Documents for a Digital Democracy.) Now that ASERL has a concrete proposal that could apparently do just that, we are concerned that the plan lacks the necessary safeguards that would ensure it can meet the needs of users for both paper and authenticated digital copies.
We suggest that ASERL continue to move forward now on its project goals that do not require GPO’s approval, and simultaneously work with GPO and the entire FDLP community to resolve the legitimate, long-term issues regarding digitization and national collection preservation. This will benefit not just ASERL but all FDLP libraries and all users of government information, both now and in the future.
Concerns about weeding.
As noted above, we are concerned that the ASERL proposal in particular will result in weeding from regional depositories. One of the issues we raised in LJ was that we do not yet have adequate information to justify reducing the number of paper copies in the FDLP system. We believe that we need to be cautious about weeding and should determine how many paper copies are needed to ensure both preservation and access. There are studies that address this issue (e.g., Schonfeld What to Withdraw, Schottlaender, Yano) but we believe there are two reasons that they do not provide adequate information to apply them to our FDLP paper collections.
First, these studies mostly focus on substituting digital surrogates for paper journal articles. While scholarly journals are a relatively homogeneous body of literature about which we can make generalizations, government publications present a very heterogeneous body of literature about which it is difficult to generalize. By focusing on journals, they do not address the particular qualities of government publications and our ability to produce even adequate digital copies of them. These qualities include:
- We lack adequate bibliographic control and granularity of descriptive information for much of our collections making it more difficult to control, preserve, and provide access to digitized collections;
- Government publications come in a wide variety of sizes and shapes and bindings (and non-bindings) and types and have many serials and multi-volume sets and looseleaf updates making them difficult to digitize adequately at a reasonable cost (GPO, 2004);
- Many publications are old and have brittle and yellowed paper which will be more difficult to accurately digitize;
- Many publications have tables of statistical information which is difficult or expensive to digitize accurately. (See below for more information about this);
- Many publications also have charts, graphs, photographs, drawings, models, and other types of images and too little is known about how to digitize them accurately.
Second, these studies rely on having a perfect digital copy as a preservation copy. For reasons we explain below (“concerns about digitization”), we believe it is premature to assume we can create such digital copies for government publications.
When we consider weeding our valuable collections we need to consider access as well as preservation. It is not just about keeping an emergency copy-of-last-resort in a vault or about Title 44 legal requirements. It is about keeping an adequate number of working, usable, loanable copies geographically near their users. We do not think this is a controversial position. Even Mr. Burger says libraries need to retain an adequate number of paper copies for direct user examination. So the issue is, How do we determine what “an adequate number” is?
To summarize, we believe we should be cautious about weeding and discarding paper copies. We do not believe this is a contentious issue, but we do believe that any given project that involves
“Many years ago GPO turned over its historical collection to the National Archives and almost immediately we began to regret the absence of a tangible collection. We have decided to re-establish a comprehensive collection of tangible and electronic documents as a collection of last resort for the program, and the new organization will dedicate staff resources to that effort.” (Russell, 2003)
Concerns about digitization.
Another concern we expressed in the LJ article was that we do not yet have enough research to guarantee we can make accurate, usable digital copies of government publications.
We are particularly concerned with the accuracy of Optical Character Recognition (OCR) of the many statistical tables in government publications. The key study of this issue was done at Yale (Green, Linden) and demonstrates the difficulty and expense of accurately digitizing statistical tables. Shafait has studied the difficulty of just locating tables during scanning. Bicknese noted the difficulty of OCR scanning of tables and so excluded tables from a sample in testing the accuracy of Making of America OCR scans. In trying to determine OCR accuracy, Blando found that many tables were illegible and generally useless. In a 2004 study, Joseph examined quality of images and figures in journals scanned by Elsevier after removing paper copy journals to a remote location. He found that 73.6% of the issues had at least one image with unacceptable quality. In a test of OCR accuracy for searching only, the Harvard LDI team excluded 19th century materials because they assumed that failure rates would be higher for them because of the lower contrast of older printed pages. Schonfeld, in a study of “What to Withdraw,” notes that too little is known about the share of the intellectual content of charts, graphs, data tables, photographs, drawings, models, and other types of images that is captured by existing scanning and format standards. And Tanner looked at the OCR accuracy of “groups of numbers” in scanned online newspapers, and found the accuracy only 64.1% for 19th century newspapers and only 59.3% for 17th and 18th century newspapers.
(It is important to understand that digital images of statistical tables — if legible to the eye — will probably serve the needs of many users and are welcome as an additional access point to government information. Our concern here is not about adding an access point that is at least as good as, but no better than, the paper copies. That can be done now without the approval of GPO. As noted above, we do have concerns about that, too, because even that low standard often results in illegible images (GPO 2004, Joseph) which means that users will need to have paper copies to consult and libraries will need paper copies to re-digitize to fix errors. But we are expressing a separate concern here.)
Our concern here is with the accuracy of the actual numbers if a user wanted to (for example) cut and paste those numbers. Our concern is that the tables contain a wealth of metadata about the numbers themselves and this information is rarely captured and associated with the number in a usable way. Searching for statistical information can be greatly hindered without accurate information about the numeric information.
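To make the concern about numeric accuracy concrete, here is a minimal sketch of how one might score the "groups of numbers" in an OCR transcription against a hand-keyed transcription of the same table. It is an illustration only, not the method used in any of the studies cited above. The function names, the regular expression, and the position-by-position comparison are all assumptions made for this example.

```python
import re

# Assumption: a "number group" is a run of digits that may contain
# commas or periods (e.g. "1,204" or "78.5"). Real tables need more care.
NUMBER_GROUP = re.compile(r"\d[\d,.]*")


def number_groups(text):
    """Return the number groups found in a block of text, in order."""
    return NUMBER_GROUP.findall(text)


def number_group_accuracy(truth_text, ocr_text):
    """Share of number groups in the trusted text that OCR reproduced exactly.

    Groups are compared position by position, which is naive: one dropped
    or inserted group misaligns everything after it. Returns None when the
    trusted text contains no numbers at all.
    """
    truth = number_groups(truth_text)
    ocr = number_groups(ocr_text)
    if not truth:
        return None
    matches = sum(1 for t, o in zip(truth, ocr) if t == o)
    return matches / len(truth)


# Hypothetical table row: the OCR misread the zero in "350" as a letter O.
truth_row = "1,204 350 78.5"
ocr_row = "1,204 35O 78.5"
accuracy = number_group_accuracy(truth_row, ocr_row)
```

Even this toy score says nothing about whether a number is still tied to the right row label, column heading, unit, or footnote, which is the metadata problem described above.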
In short, if all we want is to make simple images of our documents more available, then almost any legible scan will suffice. But, if we want to do more than create surrogates for paper, if we want to turn paper into usable and re-usable, machine-actionable digital objects, if we want to go beyond replicating paper in the digital world, we can only do so if we have a reliable way of creating accurate, functional digital objects — and we do not yet have a single, reliable, affordable way of doing that (Green). That means that any scanning we do today will have to be re-done sometime in the future with new technologies that are capable of better, more accurate conversions of documents to usable digital objects. (This is not a hypothetical concern; we have already seen that the National Archives digitized documents as recently as the 1990s using then-current technology that makes them difficult to read online. [Marks]) And that means that we need to account for digitizing at least twice (and possibly multiple times) and we have to make sure we have enough usable print copies around for those re-digitizations. Saving enough copies to ensure we can do this is essential because several digitization methods, particularly high-quality digitization, effectively destroy the documents and older, more brittle documents are at a greater risk of destruction with any digitization.
In summary, we believe that digitization projects should account for preserving an adequate number of paper copies for re-digitization with better technologies in the future and for direct user examination when digitization is flawed or inaccurate. And we believe that, as we work toward a fully digital FDLP, we must aim for fully functional digital objects and not be satisfied with simple digital surrogates of paper.
Concerns about authentication.
Being able to authenticate digital documents is very important and not something that can be ignored until later or done retrospectively. It should be relatively easy for GPO to work with the FDLP community to create a Superintendent of Documents Policy Statement (“SOD”) that would provide authentication credentials in the form of Preservation Description Information (Consultative Committee for Space Data Systems) when an FDLP library digitizes a legally deposited paper document (GPO 2004). Such a SOD could specify chain of custody and provenance through a documentable workflow, consistent metadata standards, and a digital signature. It could also specify that participating libraries would sign on as “digital partners” with the GPO and be recognized as such. Standards such as these would provide documented assurance to users of the authenticity of digitized documents and would provide the participating FDL host institution a framework for a continuing commitment to funding and staffing to ensure these digital collections are maintained as technologies evolve.
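As a rough illustration of what such credentials could look like in practice, the sketch below assembles a small record with reference, provenance, and fixity information for one scanned file and then checks it later. Everything in it is hypothetical: the field names, the function names, and the sample values are ours and are not drawn from any GPO policy statement. It also uses a keyed hash (HMAC) as a stand-in for a digital signature. A real policy would more likely call for public-key signatures, so that anyone can verify a record without holding the signing key.

```python
import hashlib
import hmac
import json


def build_pdi_record(scan_bytes, sudoc_number, library, workflow_steps, signing_key):
    """Assemble a hypothetical PDI-style record for one digitized document."""
    record = {
        # Reference: which deposited document this scan represents.
        "reference": sudoc_number,
        # Provenance: who digitized it and by what documented steps.
        "provenance": {"digitized_by": library, "workflow": list(workflow_steps)},
        # Fixity: a digest that changes if the file changes.
        "fixity": {
            "algorithm": "sha256",
            "digest": hashlib.sha256(scan_bytes).hexdigest(),
        },
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    # Stand-in for a digital signature (assumption: a shared secret key).
    record["signature"] = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return record


def verify_pdi_record(scan_bytes, record, signing_key):
    """Check that neither the record nor the scanned file has been altered."""
    unsigned = {k: v for k, v in record.items() if k != "signature"}
    payload = json.dumps(unsigned, sort_keys=True).encode("utf-8")
    expected = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    signature_ok = hmac.compare_digest(expected, record.get("signature", ""))
    fixity_ok = hashlib.sha256(scan_bytes).hexdigest() == record["fixity"]["digest"]
    return signature_ok and fixity_ok


# Hypothetical usage with made-up values.
key = b"example-key-held-by-the-digital-partner"
scan = b"bytes of the scanned document"
record = build_pdi_record(
    scan, "EXAMPLE 1.2:3", "Example Depository Library",
    ["pulled from shelf", "scanned", "quality checked"], key,
)
is_authentic = verify_pdi_record(scan, record, key)
```

The record has to be created at the moment of digitization, by the library that holds the deposited paper copy. This is why authentication cannot easily be added retrospectively.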
Moving forward now.
As noted above, there are many things that libraries can do today.
- ASERL, for example, can enhance bibliographic access and improve its inventory of its existing collections without asking permission from GPO;
- Libraries can scan documents to provide an additional access point today and many are doing so (see the Digitization Projects Registry at the FDLP web site);
- Existing procedures allow libraries to house documents in shared facilities using “Selective Housing Agreements” (GPO, 2011);
- GPO can work with the FDLP community to add and modify Superintendent of Documents Policy Statements to account for digitization, authentication, and digital deposit; and
- The FDLP community can use existing studies that examine replacing print with digital surrogates as a starting point and contribute to this knowledge base by systematically researching the issues specific to our older, paper collections.
We can do all this without turning GPO into an adversary. To repeat what Daniel Cornwall said here on this same issue:
We at FGI yield to no one in our desire for a fully functional digital FDLP. We have been advocating that for seven years and are just as anxious — if not more so — as anyone to move forward to the next phase of the FDLP. Over the years, we have criticized GPO and its policies — and will continue to do so when we believe such criticism is warranted. But our intention has always been to challenge GPO to do more and to do better, and to work with FDLP libraries, not work against them or arrogate responsibilities from them. Just as we want GPO to work with FDLP libraries, we want FDLP libraries to work with, not in opposition to, GPO. We at FGI believe that the future of the FDLP will be most secure if GPO and FDLP libraries work together to a common end.
We all in the FDLP community share a common set of goals and beliefs about the value of government information. We can do this together.
— Melody Specht Kelly – Documents Librarian University of North Texas Libraries 1974 – 2001; Associate Dean of Libraries, 2001 – 2009. Adjunct Professor College of Information, 1984 – present: “Government Information Services.”
— James A. Jacobs. Librarian Emeritus University of California San Diego 2006- present; Center For Research Libraries, Technical consultant for digital library certification and long-lived repositories 2008- present; Instructor ICPSR Summer Program “Providing Social Science Data Services: Strategies for Design and Operation,” 1990- present; Data Services Librarian, University of California San Diego 1985-2006.
Association of Research Libraries, “Future Directions for the Federal Depository Library Program.” (December 2008).
Bicknese, Douglas A., Measuring the Accuracy of the OCR in the Making of America, Ann Arbor, Mich.: University of Michigan, School of Information (1998).
Blando, Luis R., Junichi Kanai, Thomas A. Nartker, and Juan Gonzalez, “Prediction of OCR Accuracy.”
Burger, John, Paul M. Gherman, and Flo Wilson, ASERL’s Virtual Storage/Preservation Concept, ACRL Twelfth National Conference, Minneapolis, MN (April 2005).
Consultative Committee for Space Data Systems, Reference Model for an Open Archival Information System, (OAIS) CCSDS 650.0-B-1 BLUE BOOK, January 2002. CCSDS Secretariat (2002).
Green, Ann, Sandra K. Peterson, and Julie Linden. Supporting Economic Development Research: A Collaborative Project to Create Access to Statistical Sources Not Born Digital, A Report to the Andrew W. Mellon Foundation. New Haven, CT: Yale University (2005).
Joseph, Lura E., “Image and Figure Quality: A Study of Elsevier’s Earth and Planetary Sciences Electronic Journal Back File Package” Library Collections, Acquisitions, and Technical Services 30 (September 2006), 162-168.
LDI Project Team. Harvard University Library, Measuring Search Retrieval Accuracy of Uncorrected OCR: Findings from the Harvard-Radcliffe Online Historical Reference Shelf Digitization Project. (August 2001).
Linden, Julie, and Ann Green, “Don’t Leave the Data in the Dark”, D-Lib Magazine, 12 (2006).
Marks, Joseph. “National Archives’ first Wikipedian in residence to bring more holdings to the public”, NextGov (07/11/2011).
Russell, Judith. Remarks by Judy Russell, 142nd ARL Membership Meeting, Federal Relations Luncheon (May 15, 2003).
Russell, Judith C. Preservation And Authentication Of Government Information: Are We Ready For The 21st Century?, IS&T Archiving Conference, San Antonio, Texas, (April 23, 2004) The Society for Imaging Science and Technology.
Schonfeld, Roger C., and Ross Housewright, Documents for a Digital Democracy, Ithaka S+R (December 17, 2009).
Schonfeld, Roger C., and Ross Housewright, What to Withdraw: Print Collections Management in the Wake of Digitization, Ithaka S+R, (September 29, 2009).
Schottlaender, Brian E.C., Gary S. Lawrence, Cecily Johns, Claire Le Donne, and Laura Fosbender, Collection Management Strategies In A Digital Environment, A Project Of The Collection Management Initiative Of The University Of California Libraries, Final Report to the Andrew W. Mellon Foundation. University of California, Office of the President, Office of Systemwide Library Planning (January 2004).
Shafait, Faisal, and Ray Smith. “Table Detection in Heterogeneous Documents”, 9th IAPR Workshop on Document Analysis Systems, DAS’10. Boston, MA, USA, (June 2010).
Tanner, Simon, Trevor Muñoz, and Pich Hemy Ros, “Measuring Mass Text Digitization Quality and Usefulness”, D-Lib Magazine, 15 (2009).
U.S. Government Printing Office, Office of the Superintendent of Documents. Legal Requirements & Program Regulations of the Federal Depository Library Program. Washington, D.C. U.S. Government Printing Office (June 2011).
U.S. Government Printing Office. Report on the Meeting of Experts on Digital Preservation: Metadata Specifications, Washington, D.C.: U.S. Government Printing Office (14 June 2004).
Yano, Candace Arai, Z.J. Max Shen, and Stephen Chan, Optimizing the Number of Copies for Print Preservation of Research Journals, Berkeley, CA: University of California Berkeley, Industrial Engineering & Operations Research (October 2008).
Update: Jim Jacobs made some small changes to the above item for clarity on January 2, 2012.