Building the 2008 End of Term Web Archive
This is the first in a series of guest posts from members of the End of Term (EOT) project. We look forward to blogging here this month and telling you more about our efforts to archive and preserve U.S. government websites.
The 2008-2009 End of Term Web Archive contains over 3,300 U.S. Federal Government websites harvested between September 2008 and November 2009 during the transition from the Bush to Obama administrations. The recent DttP article “It Takes a Village to Save the Web” (mentioned in the introduction post yesterday) focuses in detail on the collaborative work that took place at every level, including site nomination, harvesting, data transfer, preservation, analysis and access. This post will briefly highlight some of the more remarkable aspects of building the 2008-2009 archive.
The first is that the collaboration came together so quickly and effectively. The group formed in response to the announcement in 2008 that NARA would not be archiving the .gov domain during this transitional period. With no funding and very little notice, the organizations involved were able to set up a means to identify and archive an extremely large body of content. The harvesting process, for example, was carried out across institutions over the course of the year. The Internet Archive (IA) ran the bulk of the crawls, while the California Digital Library (CDL), the University of North Texas and the Library of Congress filled in the gaps when the IA could not run crawls and targeted selected areas of content in particular depth.
The 16-terabyte body of data that you’re able to browse, search and display at the End of Term Web Archive represents the entirety of what all four organizations harvested, and is just one of three copies of that content. Once the capture phase of the project was complete, each organization transferred its share of content to other project partners to build the complete archive. An August 2011 FGI post, “Archiving the End of Term Bush Administration,” points to an article that details the content transfer aspect of this project. The access copy is held at the Internet Archive (with an access gateway provided by CDL), a preservation copy is held at the Library of Congress, and a copy of the data for research and analysis is held at the University of North Texas.
Another noteworthy aspect of this project is the number of new, emerging and experimental technologies it either generated or made use of. The Nomination Tool, which will be described in more detail in a future post, was built by the University of North Texas to support selection work for this project, and remains a valuable resource for the web archiving community.
The most significant technical work on the End of Term data was conducted by the University of North Texas and the Internet Archive in the Classification of the End-of-Term Archive: Extending Collection Development to Web Archives (EOTCD) project, which used link graph analysis and other methods to explore the potential for automatic classification of the content. The research identified particular algorithms that hold promise for automatically detecting topically related content across disparate agency sites. This project also evaluated what kinds of metrics might be meaningful as libraries continue to expand their collections to include web harvested material. New reports and findings continue to be posted to the EOTCD project site as of June 2012.
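For readers unfamiliar with link graph analysis, here is a minimal sketch of the general idea in Python. It is not the EOTCD project's method, which this post does not describe in detail. It assumes a hypothetical list of (source URL, target URL) pairs extracted from crawled pages, builds a host-level graph, and uses Jaccard similarity of outbound links as one simple, illustrative signal that two sites might be topically related. The function names and example hosts are invented for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def build_host_graph(links):
    """Map each source host to the set of other hosts it links to."""
    graph = defaultdict(set)
    for source_url, target_url in links:
        src = urlparse(source_url).hostname
        dst = urlparse(target_url).hostname
        # Ignore unparseable URLs and links that stay within one host.
        if src and dst and src != dst:
            graph[src].add(dst)
    return graph


def jaccard(a, b):
    """Share of linked-to hosts that two sites have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical link pairs; a real crawl would yield many millions of these.
links = [
    ("http://a.example.gov/page1", "http://data.example.gov/"),
    ("http://a.example.gov/page2", "http://stats.example.gov/"),
    ("http://b.example.gov/", "http://data.example.gov/x"),
    ("http://b.example.gov/", "http://stats.example.gov/y"),
    ("http://c.example.gov/", "http://other.example.gov/"),
]

graph = build_host_graph(links)
similarity_ab = jaccard(graph["a.example.gov"], graph["b.example.gov"])
similarity_ac = jaccard(graph["a.example.gov"], graph["c.example.gov"])
```

In this toy data, sites a and b link to the same hosts and score as similar, while a and c share no outbound links. Real classification work would combine signals like this with many others.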
Finally, the public access version of the End of Term Web Archive draws on innovative work at the Internet Archive and on experimentation with integrating web harvested content with more traditional digital library tools. The Internet Archive has developed a means to extract metadata records from multiple sources of data around this archive. In this case, Dublin Core records were generated and loaded into XTF, the digital library discovery system developed at CDL. These records power the faceted browsing available through the archive's site list.
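To give a sense of what a generated Dublin Core record might look like, here is a rough sketch in Python. It is an illustration only, not the Internet Archive's extraction process or the exact layout XTF was configured to index. The `record` wrapper element, the `build_dc_record` function, the choice of fields and the example site values are all assumptions; only the Dublin Core element namespace is standard.

```python
import xml.etree.ElementTree as ET

# Standard namespace for the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(site):
    """Build a simple Dublin Core record from a dict of site metadata.

    The <record> wrapper is a hypothetical container, not an XTF requirement.
    """
    record = ET.Element("record")
    for field in ("title", "identifier", "publisher", "subject", "coverage"):
        values = site.get(field, [])
        if isinstance(values, str):
            values = [values]
        # Repeatable fields such as subject become one element per value.
        for value in values:
            element = ET.SubElement(record, f"{{{DC_NS}}}{field}")
            element.text = value
    return record


# Hypothetical metadata for one archived site.
example_site = {
    "title": "Example Agency",
    "identifier": "http://www.example.gov/",
    "subject": ["Executive branch", "Example topic"],
}

xml_text = ET.tostring(build_dc_record(example_site), encoding="unicode")
print(xml_text)
```

Repeatable fields such as subject are what make faceted browsing possible: each distinct value can become a facet that groups the sites sharing it.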
Not only were these organizations able to act quickly and collaboratively to respond to a significant transition in the government information landscape, but the lessons learned are informing other broad collaborations, advances in web archive collection development, and opportunities to integrate the discovery of content from multiple archives.
Web Archiving Service Manager
California Digital Library
Follow us on Twitter: @eotarchive