Inside the "Digital National Security Archive"
This week’s posting, our third on the National Security Archive, describes an area of activity that may be of particular interest to the librarians among this site’s readers – the methodology we use to put together the Archive’s flagship publication series, the “Digital National Security Archive.” DNSA is a constantly growing, highly curated collection of declassified documentation covering topics in the history of U.S. foreign policy from the 1940s to the present. It is published by ProQuest. Today’s blog is written by staff Indexer Stacey Chambers, on behalf of the National Security Archive’s Production Team.
* * * *
The word “production” may evoke machinery and factory workers churning out widgets, but at the National Security Archive, the word refers to the careful analysis and description of documents for publication in the Digital National Security Archive (DNSA) and its print and microfiche counterparts.
DNSA currently consists of 38 aggregated, large-scale publications on a wide array of subjects – from nuclear history to the Cuban missile crisis to the Soviet invasion of Afghanistan; from the U.S. intelligence community to the military uses of space; and from U.S. policy toward its great Cold War rivals – the USSR and China – to America’s relations with a host of other countries: Japan, Korea, South Africa, Nicaragua, El Salvador, the Philippines and elsewhere. The collections average about 2,500 documents apiece, and altogether total more than 650,000 pages of declassified records to date. The front matter – along with the cataloging and indexing that our staff of three trained librarians produces – amounts to as much as 1,200 pages of printed text per set.
We in the National Security Archive’s Production Team do type mass quantities of words and characters into database records – but the data in question is the result of extensive research, editing, and review – and before that, the product of the expert selection, legwork, and meticulous preliminary cataloging of project analysts and their assistants.
Indeed, the life of a DNSA collection, or “set,” begins with an analyst. By the time the Production Team enters the picture, an analyst will have already spent three to five years (sometimes more) identifying and amassing a large pool of documents – arduously obtained through Freedom of Information Act requests, research at relevant archives and repositories, and occasional donations – before carefully choosing the records that will go into the final publication. (Each project has an advisory board of outside experts who are consulted about set content.) The analyst’s team will have also begun or completed work on an introductory essay, chronology, glossaries, and other invaluable front matter.
Drawing upon this information, a project analyst delivers a detailed briefing about the subject matter of the collection to the Production Team, to provide the context and scope that are so important to have in mind as we start our phase of the process. To supplement this briefing, indexers also take a “reading period” to absorb the content of analyst-recommended and well-indexed books that we consult continually.
Next, the analyst-selected boxes of paper documents (or scanned PDFs viewable through specially designed repository software, originally produced by Captaris/Alchemy) arrive in the Production Team’s seventh-floor office at Gelman Library, a space already crowded with boxes of documents, reference works, and views of the goings-on outside nearby George Washington University buildings. Production Team members each claim batches of documents and set about cataloging them.
In doing so, we typically capture the following information from a document: its date (or estimated “circa date”); title (or “constructed title,” for those documents lacking a formal or descriptive title); originating agency; document type and number; highest official classification; length; excisions and annotations; bricks-and-mortar location or Web site from which the document was originally retrieved, if not through FOIA; keywords; and personal names and organizations cited. We also record information about the physical state of a document – for example, if it is missing pages or is difficult to read – and construct a brief, one-paragraph abstract, or précis, to describe the document’s content. We have refined these metadata fields over the course of 20 years of producing these sets and obtaining feedback from users.
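The fields described above can be pictured as a simple record structure. The sketch below is purely illustrative – the field names and sample values are invented for this post, not the Archive’s actual database schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CatalogRecord:
    """Illustrative sketch of one catalog record; names are not the real schema."""
    date: Optional[str]            # exact date, or an estimated "circa" date
    title: str                     # formal title, or a constructed title
    originating_agency: str        # "Origin Unknown" when it cannot be verified
    document_type: str             # e.g. "Memorandum", "Cable"
    classification: str            # highest official classification
    length_pages: int
    has_excisions: bool = False
    source: str = "FOIA"           # or the archive/repository/Web site of origin
    keywords: list[str] = field(default_factory=list)
    names_cited: list[str] = field(default_factory=list)
    abstract: str = ""             # brief, one-paragraph précis

# A hypothetical record, loosely modeled on the 1979 memorandum mentioned below:
rec = CatalogRecord(
    date="1979",
    title="[Briefing Memorandum on Central America]",
    originating_agency="Department of State",
    document_type="Memorandum",
    classification="Secret",
    length_pages=3,
)
```

Fields like the internal memo documenting a cataloging decision would sit alongside these in the same record.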
To populate the foregoing fields in our Cuadra STAR bibliographic database, we consult a range of reference sources, from Wikipedia as a starting point – we love its articles’ External Links section – to the State Department’s Office of the Historian Web pages and prized copies of decades-old State Department and Defense Department telephone books – to the wealth of subscription reference databases freely available to us through GWU’s Gelman Library.
Further, we consult the Getty Thesaurus of Geographic Names to verify geographic terms, and the Library of Congress Authorities to verify personal names [e.g., the Korean name “Ahn, S.H. (Se Hee)”] and sometimes subjects – though we primarily base the concepts contained in our authority file on the United Nations Bibliographic Information System. (Our internally generated authority file currently approaches 56,000 entries.) Where the UNBIS is lacking for our purposes – particularly in military and intelligence terminology – we may consult military branch Web pages, hard copies of specialized encyclopedias, old volumes of the Europa World Year Book, or other authoritative sources to establish or update terms, while adhering to such general guidelines as our in-house cataloging manual, the Chicago Manual of Style, and of course, the common sense of a user’s perspective.
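The fallback from UNBIS to specialized sources amounts to a priority-ordered lookup. Here is a toy sketch of that idea – the source names, entries, and function are all invented for illustration:

```python
# Toy priority-ordered lookup across term-authority sources.
# All entries here are invented examples, not real authority records.
AUTHORITY_SOURCES = [
    ("UNBIS", {"human rights": "Human rights"}),
    ("specialized", {"sigint": "Signals intelligence"}),
]

def establish_term(raw: str) -> tuple[str, str]:
    """Return (authorized term, source), falling through sources in order."""
    key = raw.strip().lower()
    for name, table in AUTHORITY_SOURCES:
        if key in table:
            return table[key], name
    # No source covers it: establish a new in-house term instead
    return raw.strip().capitalize(), "in-house"

# A military/intelligence term not in UNBIS falls through to a later source:
print(establish_term("SIGINT"))
```

In practice the “sources” are human consultations of Web pages, encyclopedias, and yearbooks rather than dictionaries in memory, but the precedence logic is the same.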
Despite the excellent resources at our disposal, we often encounter unknowns in cataloging – but find that solving a mystery in one document frequently solves those in others. For example, in many cases, information about the agency that produced a document is missing, but by examining the document’s context, FOIA or repository information, and format, and by drawing on the analyst’s expertise, we can reach an educated conclusion.
Similarly, we may also safely deduce the real name of a person to whom a misspelled name refers. In one fun example of why human indexing proves indispensable, while working on the collection “The Kissinger Telephone Conversations: A Verbatim Record of U.S. Diplomacy: 1969-1977,” indexers encountered a memorandum of a telephone conversation in which the secretary secretly transcribing Henry Kissinger’s words recorded what she heard as “Nelson’s tongue,” when in fact the speakers were discussing Mao Zedong! In such cases, we document in an internal-memo field the process we used to arrive at our decision. However, when too few context clues leave us with too many doubts to verify facts, we must resort to entering a field value of “Origin Unknown,” or to leaving non-required fields blank.
The same principle applies to the document-level indexing and abstracting part of the Production Team’s work: one archival document often informs another. For example, a memorandum or quickly jotted note may not expressly state its context, but through our understanding of the context in which it was created, we add the words – in the subject or abstract field – that render documents on similar subjects retrievable. Nobody sits down at a meeting and declares, “Now let’s talk about human rights”; they just do it. So, our job is to grasp the context and determine the subject, especially when it is not explicitly stated.
On occasions when we cannot resolve a quandary ourselves, we may turn directly to the project analyst for answers – including to question whether a particular document belongs in a set. For example, in the forthcoming collection “Argentina, 1975-1990: The Making of U.S. Human Rights Policy,” analyst Carlos Osorio confirmed that even though a 1979 briefing memorandum did not mention Argentina specifically, the document should be retained because it showed how U.S. policymakers were shifting their attention to Central America.
Throughout the four to six months we typically spend producing a set, we will have taken several steps along the way to preserve quality: held regular terms meetings to maintain the consistency and usability of the set’s vocabulary; reviewed printouts of completed catalog records against the documents; addressed any outstanding copyright issues; and edited abstracts multiple times.
Once we have finished all indexing, abstracting, and reviewing, we proceed to a final quality-control phase involving lots of coffee and reading aloud from oversized sheets of paper, to ensure that records are in their proper order and to resolve any errors, inconsistencies, or outstanding issues. The process begins with assigning each record a sequence number, dictated by the records’ correct arrangement: chronologically by date, then alphabetically by a series of descending elements. Then indexers work in pairs to verify the accuracy of each data element. When we’re done, we repeat the process to ensure that the correct catalog record is assigned to the document it describes. This process may from the outside appear tedious or even torturous, but it is needed to deliver a clean, finished product to our co-publisher, ProQuest.
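The ordering rule behind sequence numbers – date first, then alphabetical tie-breakers – is easy to sketch. The records below are invented examples (loosely themed on the Cuban missile crisis set), and the single title tie-breaker stands in for the full series of descending elements:

```python
# Minimal sketch of sequence-number assignment: sort chronologically by date,
# then alphabetically by tie-breaking elements (one illustrative field here).
records = [
    {"date": "1962-10-22", "title": "Radio-TV Address on Cuba"},
    {"date": "1962-10-16", "title": "Off-the-Record Meeting on Cuba"},
    {"date": "1962-10-16", "title": "CIA Memorandum on Soviet MRBMs"},
]

# ISO-style dates sort correctly as strings; the tuple key applies the
# tie-breaker only when dates match.
ordered = sorted(records, key=lambda r: (r["date"], r["title"]))
for seq, rec in enumerate(ordered, start=1):
    rec["seq"] = seq
```

Estimated “circa” dates and undated documents would need extra handling that this sketch omits.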
Meanwhile, the Production Team will have also taken time to create and review hierarchical cross references among the set’s records, so that users are appropriately redirected – not only to broader or narrower terms used in the set, but also from commonly used acronyms or plain-language terms to the set’s controlled vocabulary. Also during this wrap-up stage, the Production Director and Publications Director will have been generating last-minute lists to be checked, editing the set’s vital front matter in collaboration with the project analyst and Research Director, and tending to other publishing demands ... that is, until the analyst calls with “a few more essential documents … ”
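The cross references described above behave like a “see” mapping from acronyms and plain-language terms to controlled-vocabulary headings. A minimal sketch, with invented entries:

```python
# Sketch of "see" cross references redirecting common terms to the set's
# controlled vocabulary; these entries are illustrative, not from a real set.
SEE_REFERENCES = {
    "FOIA": "Freedom of Information Act",
    "spy satellites": "Reconnaissance satellites",
}

def resolve(term: str) -> str:
    """Redirect a query term to its controlled-vocabulary heading, if one exists."""
    return SEE_REFERENCES.get(term, term)

print(resolve("FOIA"))  # redirected to the controlled heading
print(resolve("Cuba"))  # already a heading; passes through unchanged
```

Broader/narrower-term references form a hierarchy rather than a flat map, but the redirection principle is the same.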
The National Security Archive