The State of the Federal Web Report issued in late 2011 noted that Federal agencies planned to eliminate or merge several hundred domains, as part of the President’s Campaign to Cut Waste. The goal was to reduce outdated, redundant, and inactive domains. As part of this work, the .gov Task Force overseeing the process asked members of the National Digital Stewardship Alliance (NDSA) to archive and preserve all .gov Executive branch domains slated to be decommissioned or merged. NDSA members immediately agreed that an important step in this process was to preserve the content of these sites as part of our national digital heritage – instead of simply eliminating them.
Rather than start a separate, standalone project, we chose to launch a collaborative crawl under the auspices of the End of Term Web Archive project (EOT). Although the EOT project has primarily focused on transitions occurring at the end of administrative terms, part of the goal of the project is to document changes in all online presences of the US Federal government during key periods of transition, regardless of when or under what circumstances they occur. So, a comprehensive harvest, using a targeted list of domains supplied by the .gov Task Force and a general list of all Executive branch domains downloaded from data.gov, began on Saturday, October 8, 2011. The crawl concluded on November 5, 2011, and encompassed 46,278,384 captures and roughly 13 TB of compressed data.
Here’s a general outline of the sequence of events of the Fall 2011 crawl:
- Agencies identified recommended actions for domains in their Interim Progress Reports and Web Inventory
- The .gov Task Force collected a list of outgoing .gov domains and shared it with the NDSA
- Internet Archive crawled outgoing sites and the full suite of Executive branch domains (note: for some resources it took several weeks to crawl sites in their entirety)
- GSA eliminated domains after they were archived
The End of Term Web Archive project, including the archival capture of Executive Branch domains last fall, is not meant in any way to satisfy agency records management obligations. The domains are archived solely for the purpose of preservation and posterity. Agencies separately discuss records management obligations and handle those processes independently. However, we do make every effort to replicate resources in their entirety – at least what can be supported by available tools, techniques and best practices. Some portion of every web site is housed server-side, and that subset of content and/or user experience cannot be archived and replicated using traditional web crawler/capture software, which depends on files being downloaded to the client.
The biggest challenge of this project, however, was not Web 2.0/Web 3.0 server-side rendering or content serving. The biggest limiting factor was time. When we archive resources, there is a big difference between visiting and sampling a web resource using a set of scoping rules and guidelines versus going out and attempting to “drain” a site, i.e., replicate it soup to nuts as fast as the server can respond to your requests. Some of these resources house thousands to tens of thousands of PDF files, videos, and/or other network-intensive resources. Most servers are also programmed to meter how fast they respond to requests from the same IP address or IP address range, so we have to wait appropriate intervals between requests in order to avoid being ignored or blacklisted by an automated process. There are ways to parallelize capture, but without dedicated funding, few institutions are able to marshal those kinds of resources on a volunteer basis.
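To make the time constraint concrete, here is a minimal sketch in Python of how waiting between requests to the same server shapes a crawl schedule. It is an illustration only, not the crawler configuration used for this project: the function names, the example hosts and paths, and the 2-second delay are all assumptions chosen for the example. It plans start times rather than fetching anything, so it runs without network access.

```python
from urllib.parse import urlsplit


def schedule_fetches(urls, delay_seconds=2.0):
    """Return (start_time, url) pairs, spacing same-host requests by delay_seconds.

    delay_seconds is an assumed politeness interval, not a value from the project.
    """
    next_allowed = {}  # host -> earliest time the next request to it may start
    plan = []
    for url in urls:
        host = urlsplit(url).hostname
        start = next_allowed.get(host, 0.0)
        plan.append((start, url))
        next_allowed[host] = start + delay_seconds
    # Stable sort: requests with equal start times keep their input order.
    return sorted(plan, key=lambda item: item[0])


def drain_seconds(file_count, delay_seconds=2.0):
    """Minimum time spent waiting to request file_count files from one host."""
    return max(file_count - 1, 0) * delay_seconds


# Hypothetical hosts and paths, for illustration only.
urls = [
    "http://agency-a.example/report1.pdf",
    "http://agency-a.example/report2.pdf",
    "http://agency-b.example/video.mp4",
]
for start, url in schedule_fetches(urls):
    print(f"t={start:4.1f}s  {url}")
```

Under these assumptions, requests to different hosts can be interleaved, but requests to a single host queue up behind one another. With a 2-second interval, a host with 10,000 files needs at least 19,998 seconds of waiting, about five and a half hours, before any download time is counted.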
The End of Term project is built on the collaborative best efforts of a network of partners who share a passion for preservation of online government.
For more information about the streamlining of agency website management, please visit www.usa.gov/WebReform.shtml. This effort is now part of the larger Digital Government Strategy.
For more information on the End of Term Web Archive project, please visit http://eotarchive.cdlib.org, and follow us @eotarchive.
Kris Carpenter Negulescu
Director Web Group