
Author Archives: eotarchive

Our mission

Free Government Information (FGI) is a place for initiating dialogue and building consensus among the various players (libraries, government agencies, non-profit organizations, researchers, journalists, etc.) who have a stake in the preservation of and perpetual free access to government information. FGI promotes free government information through collaboration, education, advocacy and research.

Election 2012 Web Archive

This is the final post in our series on the collaborative End of Term and Election web archives. This post focuses on the Election 2012 Web Archive, particularly on the challenges we are facing in building an engaged community of web site nominators.

The idea for the 2012 Election Web Archive grew out of conversations with the End of Term project partners (Internet Archive, California Digital Library, Library of Congress, the University of North Texas Libraries and the U.S. Government Printing Office) about Harvard’s potential participation in the project. Librarians at the Harvard Kennedy School (HKS – Harvard’s graduate school of government) suggested that a logical focus for them would be the upcoming presidential election. In the past, presidential elections generated a lot of enthusiasm among faculty and students, who frequently requested both printed and online election-related resources from the library. They hoped to harness some of this enthusiasm.

Since the Library of Congress has been collecting in this area for many years, the group decided to collaboratively collect web sites about the 2012 election as a sister collection to the 2012-2013 End of Term collection. The partners decided to distribute selection responsibilities so that they could each focus on areas of particular interest at their institution. Curators from the Library of Congress would focus on official campaign sites produced by presidential, congressional and gubernatorial candidates, using their in-house tools for tracking nominations.

The Harvard curators would focus on web sites produced by non-profit organizations, academic institutions, fact-checking organizations and some individuals, including blogs, tweets and YouTube videos, using the nomination tool created by UNT for the 2008-2009 project. Examples of these web sites include http://factcheck.org, http://dailykos.com and http://campaignmoney.com/. Initial selections were made by library staff, and plans were made to engage faculty, students and staff in relevant academic areas such as Democracy, Politics and Institutions, and from HKS research centers such as the Institute of Politics; the Shorenstein Center on the Press, Politics and Public Policy; the Center for Public Leadership; and the Ash Center for Democratic Governance and Innovation.

Nomination of web sites began in December 2011 and will continue until the election. The Internet Archive began crawling these sites in January 2012 and will continue weekly crawls until sometime after the 2012 election. After each crawl, detailed reports are distributed to the entire group, highlighting any problematic web sites, for example sites that could not be collected because of robots.txt exclusion files. The campaign sites nominated by Library of Congress curators will be crawled separately as part of the Library of Congress Web Archives, with the ultimate goal of providing a shared interface for researchers to access all of the campaign and other election-related sites.
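To illustrate the kind of robots-based exclusion mentioned above, here is a minimal Python sketch (not the crawler or reporting software actually used by the project) of how a crawler can consult a site's robots.txt before fetching; the two example URLs are simply sites named in this post.

```python
from urllib import robotparser
from urllib.parse import urlparse

def is_crawlable(url: str, user_agent: str = "*") -> bool:
    """Return True if the host's robots.txt permits fetching this URL."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()  # fetch and parse robots.txt
    except OSError:
        return True  # assumption: treat an unreachable robots.txt as no restriction
    return rp.can_fetch(user_agent, url)

if __name__ == "__main__":
    # Two of the sites mentioned in this post, used purely as examples.
    for seed in ["http://factcheck.org", "http://dailykos.com"]:
        print(seed, "crawlable:", is_crawlable(seed))
```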

As soon as crawls were underway, HKS librarians focused on soliciting help in nominating web sites to include in the collection. Previous efforts to engage the community of faculty and students included direct emails, in-person conversations, an advertisement in the student newspaper, and posts to the HKS LinkedIn page, but these produced few nominations. The campaign also included contacting staff and librarians at other public policy schools, which so far has not resulted in additional nominations. At the start of the semester in September, HKS librarians will publicize the project on the Journalist’s Resource web site, and possibly on The Monkey Cage web site. In addition, the group decided to publish the URL of the nomination tool (which does not require an account to use) directly in public articles about the project. As was learned in the 2008-2009 project, making the tool as accessible as possible matters more than locking it down against possible misuse. In the event that any inappropriate URLs are submitted, they can be removed from the list of sites to crawl.

Although to date it has been challenging to engage a broad group of nominators for this project, we remain optimistic that as we move closer to the election it will be easier to spark interest and participation in the project.

How can you help? If you would like to nominate a web site for the Election 2012 Web Archive, visit the nomination tool and start entering URLs. If you have suggestions for us or any questions, please contact us at eotproject@loc.gov, here on this blog, or on Twitter @eotarchive.

Andrea Goethals and Wendy Marcus Gogel
Harvard Library

Keely Wilczek
Harvard Kennedy School Library

US Executive Branch Closure Crawl

The State of the Federal Web Report issued in late 2011 noted that Federal agencies planned to eliminate or merge several hundred domains, as part of the President’s Campaign to Cut Waste. The goal was to reduce outdated, redundant, and inactive domains. As part of this work, the .gov Task Force overseeing the process asked members of the National Digital Stewardship Alliance (NDSA) to archive and preserve all .gov Executive branch domains slated to be decommissioned or merged. NDSA members immediately agreed that an important step in this process was to preserve the content of these sites as part of our national digital heritage – instead of simply eliminating them.

Rather than start a separate, standalone project, we chose to launch a collaborative crawl under the auspices of the End of Term Web Archive project (EOT). Although the EOT project has primarily focused on transitions occurring at the end of administrative terms, part of the goal of the project is to document changes in all online presences of the US Federal government during key periods of transition, regardless of when or under what circumstances they occur. So, a comprehensive harvest, using a targeted list of domains supplied by the .gov Task Force and a general list of all Executive branch domains downloaded from data.gov, began on Saturday, October 8, 2011. The crawl concluded on November 5, 2011 and encompassed 46,278,384 captures and approximately 13 TB of compressed data.

Here’s a general outline of the sequence of events of the Fall 2011 crawl:

  • Agencies identified recommended actions for domains in their Interim Progress Reports and Web Inventory
  • The .gov Task Force collected a list of outgoing .gov domains and shared those with the NDSA
  • Internet Archive crawled outgoing sites and the full suite of Executive branch domains (note: for some resources it took several weeks to crawl sites in their entirety)
  • GSA eliminated domains after they were archived
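As a rough illustration of how the two domain lists mentioned above might be combined into a single deduplicated seed list, here is a small Python sketch; the file names are hypothetical, and this is not the tooling actually used for the crawl.

```python
def normalize(domain: str) -> str:
    """Lowercase a domain, stripping whitespace, any scheme prefix, and trailing slashes."""
    d = domain.strip().lower()
    for prefix in ("http://", "https://"):
        if d.startswith(prefix):
            d = d[len(prefix):]
    return d.rstrip("/")

def load_domains(path: str) -> set:
    """Read one domain per line from a text file, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return {normalize(line) for line in fh if line.strip()}

if __name__ == "__main__":
    targeted = load_domains("outgoing_domains.txt")       # hypothetical file name
    all_exec = load_domains("executive_gov_domains.txt")  # hypothetical file name
    seeds = sorted(targeted | all_exec)
    print(f"{len(seeds)} unique seed domains")
```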

The End of Term Web Archive project, including the archival capture of Executive Branch domains last Fall, is not meant in any way to satisfy agency records management obligations. The domains are archived solely for the purpose of preservation and posterity. Agencies discuss records management obligations separately and handle those processes independently. However, we do make every effort to replicate resources in their entirety – at least to the extent supported by available tools, techniques and best practices. Some portion of every web site is generated server-side, and that subset of content and user experience cannot be archived and replicated using traditional web crawler/capture software, which depends on files being downloaded to the client.

The biggest challenge of this project, however, was not Web 2.0/Web 3.0 server-side rendering or content serving. The biggest limiting factor was time. When we archive resources, there is a big difference between visiting and sampling a web resource using a set of scoping rules and guidelines and attempting to “drain” a site, i.e. replicate it soup to nuts as fast as the server can respond to requests. Some of these resources house thousands to tens of thousands of PDF files, videos, and other network-intensive resources. Most servers also meter how fast they respond to requests from the same IP address or IP address range, so we have to wait appropriate intervals between requests to avoid being ignored or blacklisted by an automated process. There are ways to parallelize capture, but without dedicated funding, few institutions are able to marshal those kinds of resources on a volunteer basis.
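As a minimal sketch of the per-host “politeness” delay described above (and emphatically not the capture software used for these crawls), the following Python fragment waits a fixed interval between requests to the same host; the five-second delay is an assumed value.

```python
import time
import urllib.request
from urllib.parse import urlparse

PER_HOST_DELAY = 5.0  # seconds between requests to one host; an assumed value
_last_request = {}    # host -> time of most recent request

def polite_fetch(url: str) -> bytes:
    """Fetch a URL, sleeping first if the same host was hit too recently."""
    host = urlparse(url).netloc
    last = _last_request.get(host)
    if last is not None:
        elapsed = time.monotonic() - last
        if elapsed < PER_HOST_DELAY:
            time.sleep(PER_HOST_DELAY - elapsed)
    _last_request[host] = time.monotonic()
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()
```

Real crawlers also honor Crawl-delay hints and back off on errors; this sketch only shows the basic pacing idea.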

The End of Term project is built on the collaborative best efforts of a network of partners who share a passion for preservation of online government.

For more information about the streamlining of agency website management, please visit www.usa.gov/WebReform.shtml. This effort is now part of the larger Digital Government Strategy.

For more information on the End of Term Web Archive project, please visit http://eotarchive.cdlib.org, and follow us @eotarchive.

Kris Carpenter Negulescu
Director Web Group
Internet Archive

Challenges of Site Identification for the 2012 End of Term Web Archive

This post in our series is about the difficulty of selecting web sites and building a list of “seed URLs.” Seed URLs are the starting points from which crawlers capture the web content you want to archive.

Part of the difficulty of building a seed list for the End of Term capture is that the federal web space is large. How large? In June 2011, the Office of Management and Budget made federal websites a target for improving transparency in providing government information, particularly by reducing “duplicative” websites that create confusion. OMB’s Jeffrey Zients wrote that, “There are nearly 2,000 top-level Federal .gov domains; within these top-level domains, there are thousands of websites, sub-sites, and micro sites, resulting in an estimated 24,000 websites of varying purpose, design, navigation, usability, and accessibility.”

A “State of the Web” survey published in December 2011 reported that, “The .gov Web Inventory self-reported 1,489 domains and an estimated 11,013 websites from 56 agencies.” This report goes on to describe the terminology used: domains are registered .gov (or .mil, or even .com, as the case may be) names on the Internet, in the form www.agencyname.gov. Most agencies (some much more than others) also use sub-domains, which differ from the registered domain by substituting another host name for “www” (for example, project.agencyname.gov).

While domains are registered through the General Services Administration and easily tracked, sub-domains are not. The term “website” is even more nebulous, described as “hosted content … which has a unique homepage and global navigation.” As a result, the .gov website numbers are considered a “general estimate.”

It isn’t just the “bigness” of the federal web space that makes the End of Term effort a challenge; there are also differences in how the branches of the federal government are managed and tracked. The Library of Congress archives legislative branch websites through a legislative branch crawl run on a monthly basis, so for that effort a list of seed URLs (which may be anything from a domain to a sub-domain to a particular website or part of a website) is assiduously maintained. In other words, that branch of the federal government is in good shape. (It doesn’t hurt that it is a relatively small branch of government.) There is no such regular effort organized for the judicial branch sites, and they aren’t overseen by GSA or OPM, so a reliable seed list for the judicial branch is not so easy to come by, which is why judicial branch seed URL nominations are a priority for the EOT project.

The Executive branch runs into problems because the OMB lists do not include most .mil, .org, .com, or other top-level domains sometimes used by federal agencies. The executive branch .gov domains are closely tracked and available in a list at data.gov. However, sub-domains with different host names added to those domains are not tracked there. Crawlers can get derailed: they do not recognize that “xyz.govagency.gov” is part of “www.govagency.gov” and will not capture it, so xyz.govagency.gov should have its own seed. Identifying these sub-domains as separate seeds is particularly important on large sites such as NASA.gov.
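Here is a simplified Python illustration of that scoping problem (the agency host names are hypothetical): a crawl scoped to the seed's exact host treats links to other sub-domains of the same agency as out of scope, which is why those sub-domains need seeds of their own.

```python
from urllib.parse import urlparse

def in_host_scope(url: str, seed: str) -> bool:
    """True only if the URL is on exactly the same host as the seed."""
    return urlparse(url).netloc.lower() == urlparse(seed).netloc.lower()

# Hypothetical agency hosts, mirroring the example in the text above.
seed = "http://www.govagency.gov/"
links = [
    "http://www.govagency.gov/reports/2011.pdf",  # same host: in scope
    "http://xyz.govagency.gov/data/index.html",   # different sub-domain: out of scope
]
for link in links:
    print(link, "in scope:", in_host_scope(link, seed))
# The second link falls out of scope, which is why xyz.govagency.gov needs its own seed.
```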

Much more common now are social media or quasi-social-media .com sites where a federal agency represents itself; the State Department, for example, has a “presence” on Facebook, Flickr, Google+, Tumblr, Twitter, and YouTube. All of these can and should be scoped separately.

Complicating things further, federal agencies of all sizes, but particularly smaller bodies, can use third-party hosting solutions of varying types. Some House committees use a commercial company to provide their streaming and downloadable video. An example is the House Ways & Means Committee’s use of Granicus, which is linked to from the House Ways and Means Committee website along with links to its Facebook, Twitter, and YouTube pages.

When I first began doing research for this blog post, my impression was that the situation was getting easier as the GSA leads an effort for federal “web reform.” However, given the extent of social media use and the growth of third-party hosting, that optimism is likely misplaced.

For now, the End of Term project can use your assistance!

Michael Neubert
Supervisory Digital Projects Specialist
Library of Congress

2012 End of Term Web Archive: Call for Volunteers

In our first post, we talked about how the End of Term partner institutions joined together in 2008 and 2009 to archive the U.S. government web at the end of the Bush administration.

The End of Term project team has resumed for a new 2012-2013 archive, and we need help identifying websites for collection, particularly those that might be most at-risk of change or deletion at the end of the current presidential term.

What you can do to help

The project team has access to some lists of U.S. Federal government domains and will use those as a baseline list of URLs to crawl. These include lists of legislative branch domains (Senator, Representative, legislative committee and leadership web presences), executive branch domains, and domains found in directories such as www.USA.gov and www.uscourts.gov. However, these lists are often not comprehensive, and we need help identifying URLs to archive.

Nominations of any U.S. Federal government domains or URLs are welcome, though there are a few topic areas that we particularly need assistance identifying, including but not limited to:

* Judicial Branch websites
* Important content or subdomains on very large websites (such as NASA.gov) that might be related to current Presidential policies
* Government content on non-government domains (.com, .edu, etc.)

Volunteers may contribute as much time and effort as they are able, whether that means nominating one website or five hundred.

Project participants may also want to search the 2008-2009 archive: if we missed something in that earlier archive, it’s likely we don’t know about it this time around, either! Please nominate any U.S. government URLs you feel would be important to archive.

For the last project, we pre-loaded our nomination tool with URLs and then had volunteers help vote things in or out of scope. This time we’re trying a different approach: the nomination tool has not been pre-loaded, so whatever our volunteers add will be more clearly identified as priority URLs. Websites recommended by volunteers will be prioritized for more frequent and in-depth collection during the course of the project.

How to Nominate URLs

To contribute a URL to this project, simply visit the Nomination Tool (developed by the University of North Texas Libraries) and start entering URLs. Volunteers are asked to submit some basic metadata about the site they are nominating (title, branch of government, agency) and to provide some information about themselves.

Project Timeframe

Internet Archive will begin a baseline crawl near the end of August, with our other partners crawling various aspects of the government web (depending on their interests) in the Fall of 2012. Websites nominated by our volunteers will be crawled in depth beginning in November and continuing at least through February. Depending on the outcome of the election, the project team will determine how much and how often to crawl in 2013. Access to the archive will come later. For more details about our proposed schedule, visit the EOT 2012 page on our website.

If you have any questions, please contact us at eotproject@loc.gov, here on this blog, or on Twitter @eotarchive.

Abbie Grotke
Web Archiving Team Lead
Library of Congress

Cathy Hartman
Associate Dean
University of North Texas Libraries

Building the 2008 End of Term Web Archive

This is the first in a series of guest posts from members of the End of Term (EOT) project. We look forward to blogging here this month and telling you more about our efforts to archive and preserve U.S. government websites.

The 2008-2009 End of Term Web Archive contains over 3,300 U.S. Federal Government websites harvested between September 2008 and November 2009 during the transition from the Bush to Obama administrations. The recent DttP article “It Takes a Village to Save the Web” (mentioned in the introduction post yesterday) focuses in detail on the collaborative work that took place at every level, including site nomination, harvesting, data transfer, preservation, analysis and access. This post will briefly highlight some of the more remarkable aspects of building the 2008-2009 archive.

The first is that the collaboration came together so quickly and effectively. The group formed in response to the announcement in 2008 that NARA would not be archiving the .gov domain during this transitional period. With no funding and very little notice, the organizations involved were able to set up a means to identify and archive an extremely large body of content. The harvesting process, for example, was carried out across institutions over the course of the year, and while the bulk of the crawls were run by the Internet Archive (IA), the California Digital Library, the University of North Texas and the Library of Congress were able to fill in the gaps at times when the IA could not run crawls, and were also able to target selected areas of content in particular depth.

The 16 terabyte body of data that you’re able to browse, search and display at the End of Term Web Archive represents the entirety of what all four organizations harvested, and is just one of three copies of that content. Once the capture phase of the project was complete, each organization transferred its share of content to other project partners to build the complete archive. An August 2011 FGI post, “Archiving the End of Term Bush Administration,” points to an article that details the content transfer aspect of this project. The access copy is held at the Internet Archive (with an access gateway provided by CDL), a preservation copy is held at the Library of Congress, and a copy of the data for research and analysis is held at the University of North Texas.

Another noteworthy aspect of this project is the number of new, emerging and experimental technologies it either generated or made use of. The Nomination Tool, which will be described in more detail in a future post, was built by the University of North Texas to support selection work for this project, and remains a valuable resource for the web archiving community.

The most significant technical work on the End of Term data was conducted by the University of North Texas and the Internet Archive, as they used link graph analysis and other methods to explore the potential for automatic classification of the content in the Classification of the End-of-Term Archive: Extending Collection Development to Web Archives (EOTCD) project. The research identified particular algorithms that hold promise for automatically detecting topically related content across disparate agency sites. This project also evaluated what kinds of metrics might be meaningful as libraries continue to expand their collections to include web harvested material. New reports and findings continue to be posted to the EOTCD project site as of June 2012.
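As a purely illustrative sketch of the first step in that kind of link graph analysis (this is not the EOTCD project's code, and the sample links are made up), page-to-page links recorded during a crawl can be reduced to a weighted host-level graph that classification or related-site detection methods can then work from.

```python
from collections import Counter
from urllib.parse import urlparse

# Made-up (source_url, target_url) pairs standing in for links recorded during a crawl.
page_links = [
    ("http://www.epa.gov/air", "http://www.data.gov/"),
    ("http://www.epa.gov/air", "http://www.noaa.gov/climate"),
    ("http://www.noaa.gov/climate", "http://www.data.gov/"),
]

# Collapse page-level links into a weighted host-level link graph.
host_edges = Counter(
    (urlparse(src).netloc, urlparse(dst).netloc) for src, dst in page_links
)
for (src_host, dst_host), weight in sorted(host_edges.items()):
    print(f"{src_host} -> {dst_host}: {weight} link(s)")
```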

Finally, the public access version of the End of Term Web Archive has also drawn on innovative work at the Internet Archive and experimentation with integrating web-harvested content with more traditional digital library tools. The Internet Archive developed a means to extract metadata records from multiple sources of data around this archive. In this case, Dublin Core records were generated and loaded into XTF, the digital library discovery system developed at CDL. These records provide the faceted browse access via the archive’s site list.
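As a hedged illustration of what generating a simple Dublin Core record for one archived site could look like (this is not the Internet Archive's actual pipeline, and the field values are invented), the following Python fragment uses only the standard library.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def dublin_core_record(title: str, identifier: str, date: str) -> str:
    """Serialize a few Dublin Core elements for one archived site as XML."""
    record = ET.Element("record")
    for tag, value in (("title", title), ("identifier", identifier), ("date", date)):
        ET.SubElement(record, f"{{{DC_NS}}}{tag}").text = value
    return ET.tostring(record, encoding="unicode")

# Invented field values; a real record would be derived from crawl metadata.
print(dublin_core_record(
    title="Environmental Protection Agency",
    identifier="http://www.epa.gov/",
    date="2008-09-15",
))
```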

Not only were these organizations able to act quickly and collaboratively to respond to a significant transition in the government information landscape, but the lessons learned are informing other broad collaborations, advances in web archive collection development, and opportunities to integrate the discovery of content from multiple archives.

Tracy Seneca
Web Archiving Service Manager
California Digital Library

Follow us on Twitter: @eotarchive
