This week’s posting, our third on the National Security Archive, describes an area of activity that may be of particular interest to the librarians among this site’s readers – the methodology we use to put together the Archive’s flagship publication series, the “Digital National Security Archive.” DNSA is a constantly growing, highly curated collection of declassified documentation covering topics in the history of U.S. foreign policy from the 1940s to the present. It is published by ProQuest. Today’s blog is written by staff Indexer Stacey Chambers, on behalf of the National Security Archive’s Production Team.
* * * *
The word “production” may evoke machinery and factory workers churning out widgets, but at the National Security Archive, the word refers to the careful analysis and description of documents for publication in the Digital National Security Archive (DNSA) and its print and microfiche counterparts.
DNSA currently consists of 38 aggregated, large-scale publications on a wide array of subjects – from nuclear history to the Cuban missile crisis to the Soviet invasion of Afghanistan; from the U.S. intelligence community to the military uses of space; and from U.S. policy toward its great Cold War rivals – the USSR and China – to America’s relations with a host of other countries: Japan, Korea, South Africa, Nicaragua, El Salvador, the Philippines and elsewhere. The collections average about 2,500 documents apiece, and altogether total more than 650,000 pages of declassified records to date. The front matter – along with the cataloging and indexing that our staff of three trained librarians produces – amounts to as much as 1,200 pages of printed text per set.
We in the National Security Archive’s Production Team do type mass quantities of words and characters into database records – but the data in question is the result of extensive research, editing, and review – and before that, the product of the expert selection, legwork, and meticulous preliminary cataloging of project analysts and their assistants.
Indeed, the life of a DNSA collection, or “set,” begins with an analyst. By the time the Production Team enters the picture, an analyst will have already spent three to five years (sometimes more) identifying and amassing a large pool of documents – arduously obtained through Freedom of Information Act requests, research at relevant archives and repositories, and occasional donations – before carefully choosing the records that will go into the final publication. (Each project has an advisory board of outside experts who are consulted about set content.) The analyst’s team will have also begun or completed work on an introductory essay, chronology, glossaries, and other invaluable front matter.
Drawing upon this information, a project analyst delivers a detailed briefing about the subject matter of the collection to the Production Team, to provide the context and scope that are so important to have in mind as we start our phase of the process. To supplement this briefing, indexers also take a “reading period” to absorb the content of analyst-recommended and well-indexed books that we consult continually.
Next, the analyst-selected documents – boxes of paper, or scanned PDFs viewable through specially designed repository software (originally produced by Captaris/Alchemy) – arrive in the Production Team’s seventh-floor office at Gelman Library, already crowded with boxes of documents, reference works, and views of goings-on outside nearby George Washington University buildings. There, Production Team members each claim batches of documents and set about cataloging them.
In doing so, we typically capture the following information from a document: its date (or estimated “circa date”); title (or “constructed title,” for those documents lacking a formal or descriptive title); originating agency; document type and number; highest official classification; length; excisions and annotations; bricks-and-mortar location or Web site from which the document was originally retrieved, if not through FOIA; keywords; and personal names and organizations cited. We also record information about the physical state of a document – for example, if it is missing pages or is difficult to read – and construct a brief, one-paragraph abstract, or précis, to describe the document’s content. We have refined these metadata fields over the course of 20 years of producing these sets and obtaining feedback from users.
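To make the fields just described concrete, here is a minimal sketch of what one catalog record might look like as a simple data structure. The field names only approximate the metadata listed above; they are not the Production Team’s actual Cuadra STAR schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative only: these fields approximate the metadata described above;
# they are not the actual Cuadra STAR database schema.
@dataclass
class CatalogRecord:
    date: str                               # exact date or estimated "circa date"
    title: str                              # formal title or a "constructed title"
    originating_agency: str                 # may be "Origin Unknown" if unverifiable
    document_type: str
    document_number: Optional[str]
    classification: str                     # highest official classification
    length_pages: int
    excisions_annotations: Optional[str]
    source: str                             # FOIA, archive location, or Web site
    keywords: List[str] = field(default_factory=list)
    names_cited: List[str] = field(default_factory=list)
    organizations_cited: List[str] = field(default_factory=list)
    physical_notes: Optional[str] = None    # e.g., missing pages, hard to read
    abstract: str = ""                      # brief one-paragraph précis of content
```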
To populate the foregoing fields in our Cuadra STAR bibliographic database, we consult a range of reference sources: Wikipedia as a starting point (we love its articles’ External Links sections); the State Department’s Office of the Historian Web pages; prized copies of decades-old State Department and Defense Department telephone books; and the wealth of subscription reference databases freely available to us through GWU’s Gelman Library.
Further, we consult the Getty Thesaurus of Geographic Names to verify geographic terms, and the Library of Congress Authorities to verify personal names [e.g., the Korean name “Ahn, S.H. (Se Hee)”] and sometimes subjects – though we primarily base the concepts contained in our authority file on the United Nations Bibliographic Information System. (Our internally generated authority file currently approaches 56,000 entries.) Where the UNBIS is lacking for our purposes – particularly in military and intelligence terminology – we may consult military branch Web pages, hard copies of specialized encyclopedias, old volumes of the Europa World Year Book, or other authoritative sources to establish or update terms, while adhering to such general guidelines as our in-house cataloging manual, the Chicago Manual of Style, and of course, the common sense of a user’s perspective.
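As a rough illustration of authority control – not the Archive’s actual software or data – the sketch below maps variant or misspelled forms to a single authorized heading so that every record uses consistent vocabulary; all of the entries are invented.

```python
# Toy authority file: variant or misspelled forms map to one authorized
# heading. The entries are invented for illustration; the Archive's real
# authority file, based largely on UNBIS, approaches 56,000 entries.
AUTHORITY_FILE = {
    "ahn se hee": "Ahn, S.H. (Se Hee)",
    "ahn, s. h.": "Ahn, S.H. (Se Hee)",
    "non-proliferation, nuclear": "Nuclear nonproliferation",
}

def authorized_form(term: str) -> str:
    """Return the authorized heading for a term, or the term unchanged if no
    entry exists yet (a candidate for a new authority record)."""
    return AUTHORITY_FILE.get(term.strip().lower(), term)

print(authorized_form("Ahn Se Hee"))  # -> "Ahn, S.H. (Se Hee)"
```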
Despite the excellent resources at our disposal, we often encounter unknowns in cataloging – but find that solving a mystery in one document frequently solves those in others. For example, in many cases, information about the agency that produced a document is missing, but by examining the document’s context, FOIA or repository information, format, and the analyst’s expertise, we can reach an educated conclusion.
Similarly, we may also safely deduce the real name of a person to whom a misspelled name refers. In one fun example of why human indexing is indispensable, indexers working on the collection “The Kissinger Telephone Conversations: A Verbatim Record of U.S. Diplomacy: 1969-1977” encountered a memorandum of a telephone conversation in which the secretary who was secretly transcribing Henry Kissinger’s words recorded what she heard as “Nelson’s tongue,” when in fact the parties were talking about Mao Zedong! In such cases, we document in an internal-memo field the process we used to arrive at our decision. However, when too few context clues leave us with too many doubts to verify the facts, we must resort to entering a field value of “Origin Unknown,” or to leaving non-required fields blank.
The same principle applies to the document-level indexing and abstracting part of the Production Team’s work: one archival document often informs another. For example, a memorandum or quickly jotted note may not expressly state its subject, but through our understanding of the context in which it was created, we add the words – in the subject or abstract field – that render documents on similar subjects retrievable. Nobody sits down at a meeting and declares, “Now let’s talk about human rights”; they just do it. So, our job is to grasp the context and determine the subject, especially when it is not explicitly stated.
On occasions when we cannot resolve a quandary ourselves, we may turn directly to the project analyst for answers – including to question whether a particular document belongs in a set. For example, in the forthcoming collection “Argentina, 1975-1990: The Making of U.S. Human Rights Policy,” analyst Carlos Osorio confirmed that even though a 1979 briefing memorandum did not mention Argentina specifically, the document should be retained because it showed how U.S. policymakers were shifting their attention to Central America.
Throughout the four to six months we typically spend producing a set, we will have taken several steps along the way to preserve quality: held regular terms meetings to keep the set’s vocabulary consistent and usable; reviewed printouts of completed catalog records against the documents; addressed any outstanding copyright issues; and edited abstracts multiple times.
Once we have finished all indexing, abstracting, and reviewing, we proceed to a final quality-control phase – involving lots of coffee and reading aloud from oversized sheets of paper – to ensure that records are in their proper order and to resolve any errors, inconsistencies, or outstanding issues. The process begins with assigning each record a sequence number, dictated by the records’ arrangement by date and then alphabetically by a series of descending elements. Indexers then work in pairs to verify the accuracy of each data element. When we’re done, we repeat the process to ensure that the correct catalog record is assigned to the document it describes. From the outside this process may appear tedious or even torturous, but it is needed to deliver a clean, finished product to our co-publisher, ProQuest.
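Conceptually, that sequencing step is a multi-level sort. The sketch below, with invented field names and a title used as the tie-breaker, illustrates the idea only; it is not the team’s actual procedure.

```python
# Illustrative only: sort records chronologically, break ties alphabetically,
# then assign sequence numbers. Field names and tie-breaking rules are
# assumptions, not the Production Team's actual rules.
records = [
    {"date": "1979-06-11", "title": "Briefing Memorandum"},
    {"date": "1979-06-11", "title": "Airgram"},
    {"date": "1976-03-24", "title": "Cable"},
]

records.sort(key=lambda r: (r["date"], r["title"].lower()))

for seq, rec in enumerate(records, start=1):
    rec["sequence_number"] = seq
    print(seq, rec["date"], rec["title"])
```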
Meanwhile, the Production Team will have also taken time to create and review hierarchical cross references among the set’s records, so that users are appropriately redirected – not only to broader or narrower terms used in the set, but also from commonly used acronyms or plain-language terms to the set’s controlled vocabulary. Also during this wrap-up stage, the Production Director and Publications Director will have been generating last-minute lists to be checked, editing the set’s vital front matter in collaboration with the project analyst and Research Director, and tending to other publishing demands ... that is, until the analyst calls with “a few more essential documents … ”
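The cross references mentioned above might be pictured roughly like this toy table – invented entries only – in which acronyms and plain-language entry terms redirect users to the set’s controlled vocabulary, alongside broader and narrower relationships.

```python
# Toy cross-reference structures with invented entries; not the DNSA's
# actual controlled vocabulary.
USE_FOR = {
    "SALT": "Strategic arms limitation talks",   # acronym -> preferred term
    "Disappearances": "Forced disappearances",   # plain language -> preferred term
}

HIERARCHY = {
    "Human rights": {"narrower": ["Forced disappearances", "Political prisoners"]},
    "Forced disappearances": {"broader": ["Human rights"]},
}

def resolve(term: str) -> str:
    """Follow a 'use for' reference to the preferred term, if one exists."""
    return USE_FOR.get(term, term)

print(resolve("SALT"))                        # -> Strategic arms limitation talks
print(HIERARCHY["Human rights"]["narrower"])  # narrower terms under a broad heading
```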
The National Security Archive
The Working Group on the Future of Bibliographic Control, as it examined technology for the future, wrote that the library community’s data carrier, MARC, is “based on forty-year-old techniques for data management and is out of step with programming styles of today.” The Working Group called for a format that will “accommodate and distinguish expert-, automated-, and self-generated metadata, including annotations (reviews, comments, and usage data).” The Working Group agreed that MARC has served the library community well in the pre-Web environment, but something new is now needed to implement the recommendations made in the Working Group’s seminal report. In its recommendations, the Working Group called upon the Library of Congress to take action. In recommendation 3.1.1, the members wrote:
“Recognizing that Z39.2/MARC are no longer fit for the purpose, work with the library and other interested communities to specify and implement a carrier for bibliographic information that is capable of representing the full range of data of interest to libraries, and of facilitating the exchange of such data both within the library community and with related communities.”
This same theme emerged from the recent test of the Resource Description and Access (RDA) conducted by the National Agricultural Library, the National Library of Medicine, and the Library of Congress. Our 26 test partners also noted that, were the limitations of the MARC standard lifted, the full capabilities of RDA would be more useful to the library community. Many of the libraries taking part in the test indicated that they had little confidence RDA changes would yield significant benefits without a change to the underlying MARC carrier. Several of the test organizations were especially concerned that the MARC structure would hinder the separation of elements and ability to use URLs in a linked data environment.
With these strong statements from two expert groups, the Library of Congress is committed to developing, in collaboration with librarians, standards experts, and technologists, a new bibliographic framework that will serve the associated communities well into the future. Within the Library, staff from the Network Development and Standards Office (within the Technology Policy directorate) and the Policy and Standards Division (within the Acquisitions and Bibliographic Access directorate) have been meeting with Beacher Wiggins (Director, ABA), Ruth Scovill (Director, Technology Policy), and me to craft a plan for proceeding with the development of a bibliographic framework for the future.
We at the Library are committed to finding the necessary funding to support this initiative, and we expect to work with diverse and wide-ranging partners in completing the task. Even at the earliest stages of the project, we believe two types of groups are needed: an advisory committee that will articulate and frame the principles and ideals of the bibliographic framework, and a technical committee that has the in-depth knowledge to establish the framework itself.
- Read the Complete Summary and Bibliographic Framework Initiative General Plan
- The complete summary and plan are available as a PDF (10 pages)
This isn't new information, but I don't think we've mentioned it here before. UNT, one of the participants in the End-of-Term Web Archive project (EOT), which aimed to capture the entirety of the federal government's public Web presence before and after the 2009 change in presidential administrations, is hosting a project to investigate innovative solutions to issues around web archives. The issues include being able to identify and select materials in accord with collection development policies, and being able to characterize archived materials using common metrics in order to communicate the scope and value of these materials to administrators.
The project will use 10 librarian Subject Matter Experts who will classify the EOT collection according to the Superintendent of Documents (SuDocs) Classification System. The project will also develop a set of metrics to enable characterization of materials in Web archives in units of measurement familiar to libraries and their administrations.
- Classification of the End-of-Term Archive: Extending Collection Development to Web Archives, University of North Texas Libraries (21 April 2011).
- End-of-Term Web Archive
This just in: GPO is gearing up to facilitate cooperative cataloging projects in FDLP libraries. This is great news for all those uncataloged, bound-with, unanalyzed series that every agency seems to have (yeah, I'm talking to you, Department of Agriculture Bulletin!). This push to catalog will make depository collections much more findable and usable!
Many libraries in the FDLP have voiced interest in establishing cooperative cataloging partnerships with GPO to exchange cataloging records, work together to catalog older materials, or enhance existing cataloging records to meet current cataloging standards. GPO has created guidelines for the establishment of partnerships that have a cataloging component. Federal depository libraries that are interested in possible cataloging partnerships are encouraged to review the guidelines and contact the partnership coordinator.
A GPO staffer has asked that I post the notice below about a pilot MARC record distribution project to "ensure the automatic dissemination of bibliographic records to FDLP libraries." I hope libraries will volunteer to help out with this project, as it seems like a significant step for GPO to take. We've talked for a while about collaborative cataloging of govt information; while this is primarily a "push" project, perhaps it could be the first step toward GPO opening up the cataloging workflow to depository libraries (many hands make light work, right?!) and lead to other data-sharing opportunities (XML, OAI, RSS, APIs, etc.) both within the FDLP and with the public. This could be a significant piece of the FDLP ecosystem.
Calling all depositories! FREE Records! FREE Records!
GPO is looking for libraries who wish to take part in the Cataloging Record Distribution Pilot. Applications are being accepted now through January 11, 2010.
Federal depository libraries will be chosen to participate in this pilot program in which GPO bibliographic records will be distributed from GPO’s Integrated Library System (ILS) to these libraries. GPO will be accepting a group of 30 – 35 FDLP libraries to participate.
GPO is looking for a mixture of different library sizes and types. Of that group, GPO would like some current MARCIVE subscribers, as well as some non-subscribers. GPO is also aiming to select a variety of libraries that use a diverse group of ILS vendors.
Visit the Cataloging Record Distribution Pilot Web page for more information on the project, including details on how to apply and an informational FAQ sheet on the details of the project.
We started the LostDocs blog back in September 2009 to collect e-mail receipts for items that were reported to GPO as "fugitive documents" -- agency documents that should have made it into the Federal Depository Library Program and/or the Catalog of Government Publications.
In the process of running this blog, we have identified 40 documents reported since April 2008 that were cataloged by GPO after being reported as "fugitive documents." These fall into the "found documents" category of our blog.
You can find our list of 40 (and counting) cataloged fugitives here. This spreadsheet will be updated whenever we identify new GPO cataloging for items that had been reported as fugitive documents.
The results are interesting and somewhat disturbing, but not definitive.
The 40 items were cataloged in times varying from three days to 524 days. The mean cataloging time was 213 days. The median cataloging time was 184 days or about six months.
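For readers who want to see how such summary figures are produced, the calculation is straightforward. The sketch below uses Python's statistics module on a short, invented list of cataloging times, since the 40 underlying values are not reproduced in this post (they live in the linked spreadsheet).

```python
import statistics

# Invented placeholder values, in days, just to show how mean and median
# are computed; these are NOT the actual 40 LostDocs cataloging times.
cataloging_days = [3, 45, 90, 150, 184, 200, 260, 350, 524]

print("mean:", round(statistics.mean(cataloging_days)))   # arithmetic average
print("median:", statistics.median(cataloging_days))      # middle value
print("range:", min(cataloging_days), "to", max(cataloging_days), "days")
```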
If the cataloging times above were typical of all documents reported through the LostDocs process, we think this would be a major problem for GPO, one that would require some serious soul searching and dialog about how this result could be changed and what tradeoffs and/or extra community involvement would be required.
We are NOT making the claim that these cataloging times are typical for reported fugitive documents. We honestly do not know what is typical. Jim Jacobs, FGI's resident data librarian, had this to say about our sample of cataloged documents:
As for sample size and relevance: the number of items in the sample can't tell us the significance or accuracy of the results. We'd have to know two other things: the size of the universe (of all reported lost docs), and the accuracy of the sample. Since the sample was self-selected (by those reporting) rather than random, and since we don't know if the sample is 1% or 85% of all submitted lostdocs, we can't claim that the findings necessarily reflect the status of the whole universe. (does that make sense? If only people w/ long waits reported to us, our sample does not accurately reflect all lostdocs.)
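Jim's point about self-selection can be illustrated with a toy simulation; all numbers below are invented and have nothing to do with actual LostDocs data.

```python
import random
import statistics

random.seed(42)

# Invented population of cataloging waits (in days) for ALL lost docs.
population = [random.randint(3, 524) for _ in range(2000)]

# Self-selected sample: everyone with a long wait reports, plus only a
# small fraction of those with shorter waits.
reported = [d for d in population if d > 150 or random.random() < 0.10]

print("population median:", statistics.median(population))
print("self-selected sample median:", statistics.median(reported))
# The reported sample overstates the typical wait, which is why sample size
# alone cannot tell us whether the 40 cataloged LostDocs are representative.
```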
When we first thought about making LostDocs reports available to the community at large, we approached GPO with a partnering opportunity. We would maintain the blog and offer them the opportunity to note on the blog whether something was out of scope for the CGP or already in the catalog. In return, we asked them to modify their LostDocs form so that when they received a report, the blog would automatically get a copy. If this partnership had been accepted, we would know the two facts Jim cited above that are needed to tell us whether our results are typical. GPO declined our partnership proposal, citing their workload. We're not questioning that they are overworked.
We do feel that the results above deserve further investigation. Perhaps GPO could prepare a report on documents cataloged as a result of fugitive reports over the past few years. Unless they've discarded the e-mail receipts (which would be defensible), they have the dates of when documents were reported. The CGP lists when an item was first added to the CGP. They could have an intern make a semester project of putting the two together and then posting the results to fdlp.gov.
If they have tossed previous e-mail receipts, they could start saving them for a year starting in January 2010 and do the analysis we propose above in 2011. But in either case we feel the analysis should be done. If it confirms our results then it will be good ammunition in Congress to procure more cataloging staff or to start cataloging collaborations with FDLP members. If the GPO analysis concludes that items reported to lost docs are in fact cataloged in a timely manner, then that will help build trust with the documents community and motivate more people to report fugitive documents. Either way it is a win-win for GPO.
My apologies to anyone who relied on my post "Cataloging Gets Results in Alaska." Revised data has forced me to retract my claim. Please see details at http://freegovinfo.info/node/1940.
But don't be afraid to share information and new ideas. Sometimes we're going to be wrong. That's just the nature of the game. But we as a community are stronger when we share information and admit our mistakes as well as celebrate our successes.
Library Services and Content Management is continually working to improve the Catalog of U.S. Government Publications and the services it provides. One of the upcoming services that we are excited about is the creation of a login page for depository libraries that will enable them to take advantage of a range of authenticated services not otherwise available. These include:
- Selective dissemination of information. This will give depositories the ability to direct the system to send emails when resources in a particular area of interest are cataloged. Depository libraries will be able to set up notifications by item number or by SuDocs stem, for example;
- “Save records to local PC.” Currently the options are to email up to twenty records at a time to a defined email address, or to search, retrieve, and download up to one thousand records from the CGP per session;
- RSS feeds;
- Retained preferences that will persist across sessions;
- Links to FDLP-related pages including the FDLP Desktop and the Federal Depository Library Directory.
We are anticipating a demonstration of the FDLP login page at the Fall Conference and a subsequent December release of this functionality.
Also on the agenda is an enhanced Federal Depository Library Directory. We would like to ask for input from users for improvements we could make to the FDLD to enhance the user experience. Please submit suggestions through AskGPO at http://gpo.custhelp.com/cgi-bin/gpo.cfg/php/enduser/ask.php. Use the category Federal Depository Libraries, subcategory Catalog of U.S. Government Publications, then CGP Enhancements/Suggestions.
Libraries need to provide attractive and exciting discovery tools to draw patrons to the valuable resources in their catalogs. The authors conducted a pilot project to explore the free version of Google Earth as such a discovery tool for Portland State Library's digital collection of urban planning documents. They created eye-catching placemarks with links to parts of this collection, as well as to other pertinent materials like books, images, and historical background information. The detailed how-to part of this article is preceded by a discussion about discovery of library materials and followed by possible applications of this Google Earth project. In Calhoun's report to the Library of Congress, it becomes clear that staff time and resources will need to move from cataloging traditional formats, like books, to cataloging unique primary sources, and then providing access to these sources from many different angles. "Organize, digitize, expose unique special collections" (Calhoun 2006).
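To make the placemark idea concrete, here is a minimal, hypothetical sketch of a Google Earth placemark (KML) whose balloon links to a digital-collection item, written out from Python. The URL, coordinates, and wording are invented, and the authors' actual workflow may well have differed (for example, building placemarks by hand inside Google Earth).

```python
# Minimal sketch: write a KML placemark whose description balloon links to a
# collection item. The URL, coordinates, and text are invented placeholders.
placemark = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Downtown plan (example)</name>
    <description><![CDATA[
      <a href="https://example.org/collection/item123">View the digitized plan</a>
    ]]></description>
    <Point><coordinates>-122.6765,45.5231,0</coordinates></Point>
  </Placemark>
</kml>
"""

with open("example_placemark.kml", "w", encoding="utf-8") as f:
    f.write(placemark)
```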
My pal Jenna Freedman, the Lower East Side Librarian, started subscribing to the Library of Congress Subject Headings weekly list (probably out of love for libraries' great good friend and cataloger extraordinaire Sandy Berman!). This week she came across a strange one that I hope our readers can expound on in the comments, especially since I'm not a cataloger.
150 Electronic government information [May Subd Geog]
* 450 UF Electronic government publications [EARLIER FORM OF HEADING]
* 550 BT Government publications
Is this LC's documentation of a move away from government publications as the instantiation of our government's work toward e-government and government information as transaction? Should we be worried about this change in the heading? Is this just semantics? Is there a cataloger in the house?
The other one that I found strange was:
(C) 150 Global cooling [Not Subd Geog]
450 UF Cooling, Global
550 BT Global temperature changes
Is that some sort of Newspeak?!?!