Challenges of Site Identification for the 2012 End of Term Web Archive

This post in our series is about the difficulty of selecting websites and building a list of “seed URLs.” Seed URLs are the starting points that crawlers use to capture the web content you want.

Part of the difficulty of building a seed list for the End of Term capture is that the federal web space is large. How large? In June 2011, the Office of Management and Budget made federal websites a target in its effort to improve transparency in providing government information, particularly by reducing “duplicative” websites that create confusion. OMB’s Jeffrey Zients wrote, “There are nearly 2,000 top-level Federal .gov domains; within these top-level domains, there are thousands of websites, sub-sites, and micro sites, resulting in an estimated 24,000 websites of varying purpose, design, navigation, usability, and accessibility.” A “State of the Web” survey published in December 2011 reported, “The .gov Web Inventory self-reported 1,489 domains and an estimated 11,013 websites from 56 agencies.”

This report goes on to describe the terminology used. Domains are registered .gov (or .mil, or even .com, as the case may be) names on the Internet, in the form www.agencyname.gov. Most agencies (some much more than others) also use sub-domains, which differ from the domain by having a different prefix in place of “www” (for example, project.agencyname.gov).

While domains are registered through the General Services Administration and easily tracked, sub-domains are not. The term “website” is even more nebulous, described as “hosted content … which has a unique homepage and global navigation.” As a result, the .gov website numbers are considered a “general estimate.”

It isn’t just the size of the federal web space that makes the End of Term effort a challenge. There are also variations in how the different branches of the federal government are managed and tracked. The Library of Congress archives the legislative branch websites through a legislative branch crawl run on a monthly basis, so a list of seed URLs for that branch (each of which may be anything from a domain to a sub-domain to a particular website or part of a website) is assiduously maintained. In other words, the situation for that branch of the federal government is in good shape. (It doesn’t hurt that it is a relatively small branch of government.) There is no such regular effort organized for the judicial branch sites, and they aren’t under GSA or OPM, so a reliable seed list for the judicial branch is not so easy to come by. That is why judicial branch seed URL nominations are a priority for the EOT project.

The executive branch runs into problems because the OMB lists do not include most .mil, .org, .com, or other top-level domain types sometimes used by federal agencies. The executive branch .gov domains are closely tracked and available in a list at data.gov. However, sub-domains, which put a different prefix in front of the domain name, are not tracked there. A crawler may not recognize that “xyz.govagency.gov” is part of “www.govagency.gov” and so will not capture it; xyz.govagency.gov should therefore have its own seed. Identifying these sub-domains as separate seeds can be particularly important on large sites, such as NASA.gov.
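To make the sub-domain problem concrete, here is a minimal sketch in Python of how one might check a list of discovered URLs against a seed list and flag sub-domains that lack their own seed. This is an illustration only, not the tooling the End of Term project uses. The function names are hypothetical, the Facebook URL is invented sample data, and the sketch assumes a "domain" is simply the last two labels of the hostname, which holds for names like govagency.gov but would need a public suffix list to be reliable in general.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def registered_domain(host):
    """Approximate the registered domain as the last two labels.

    Assumption for illustration: fine for 'xyz.govagency.gov', but not
    for multi-part suffixes; a real tool would consult a public suffix list.
    """
    return ".".join(host.split(".")[-2:])


def missing_subdomain_seeds(seed_urls, discovered_urls):
    """List hosts that share a domain with a seed but have no seed themselves."""
    seed_hosts = {host_of(u) for u in seed_urls}
    seed_hosts.discard("")
    seed_domains = {registered_domain(h) for h in seed_hosts}

    missing = set()
    for url in discovered_urls:
        host = host_of(url)
        if not host or host in seed_hosts:
            continue  # unusable URL, or host already has a seed
        if registered_domain(host) in seed_domains:
            missing.add(host)  # same agency domain, different sub-domain
    return sorted(missing)


if __name__ == "__main__":
    seeds = ["http://www.govagency.gov/"]
    found = [
        "http://www.govagency.gov/about",
        "http://xyz.govagency.gov/reports/2012",
        "http://www.facebook.com/govagency",  # hypothetical third-party page
    ]
    print(missing_subdomain_seeds(seeds, found))
```

In this sketch, xyz.govagency.gov is reported as needing its own seed, while the third-party page is ignored because its domain is not on the seed list at all. Pages like that have to be nominated and scoped separately.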

Much more common now are social media or quasi-social media .com sites where a federal agency represents itself. The State Department, for example, has a “presence” on Facebook, Flickr, Google+, Tumblr, Twitter, and YouTube. All of these can and should be scoped separately.

Complicating things further, federal agencies of all sizes, but particularly smaller bodies, can use third-party hosting solutions of varying types. Some House committees use a commercial company to provide their streaming and downloadable video. An example is the House Ways and Means Committee’s use of Granicus, which is linked from the committee’s website along with links to its Facebook, Twitter, and YouTube pages.

When I first began doing research for this blog post, my impression was that the situation was getting easier as the GSA leads an effort for federal “web reform.” However, given the extent of social media use and the growth of third-party hosting, that optimism is likely misplaced.

For now, the End of Term project can use your assistance!

Michael Neubert
Supervisory Digital Projects Specialist
Library of Congress

