Tag Archives: EOT

Our mission

Free Government Information (FGI) is a place for initiating dialogue and building consensus among the various players (libraries, government agencies, non-profit organizations, researchers, journalists, etc.) who have a stake in the preservation of and perpetual free access to government information. FGI promotes free government information through collaboration, education, advocacy and research.

A Long-Term Goal For Creating A Digital Government-Information Library Infrastructure

(Editor’s note: this post was a guest editorial on Libraries Network, a nascent collaborative effort of the Association of Research Libraries (ARL) spurred by the work of the DataRefuge project, End of Term crawl, and other volunteer efforts to preserve data and content from the .gov/.mil domain. This is the first of 2 posts for the Libraries Network. The second one will be posted tomorrow. JRJ)

Now that so many have done so much good work to rescue so much data, it is time to reflect on our long-term goals. This is the first of two posts that suggest some steps to take.

The amount of data rescue work that has already been done by DataRefuge, ClimateMirror, Environmental Data and Governance Initiative (EDGI) projects and the End of Term crawl (EOT) 2016 is truly remarkable. In a very practical sense, however, this is only the first stage in a long process. We still have a lot of work to do to make all the captured digital content (web pages, data, PDFs, videos, etc.) discoverable, understandable, and usable. We believe that the next step is to articulate a long-term goal to guide the next tasks.

Of course, we do already have broad goals but up to now those goals have by necessity been more short-term than long-term. The short-term goals that have driven so much action have been either implicit (“rescue data!”) or explicit (“to document federal agencies’ presence on the World Wide Web during the transition of Presidential administrations” [EOT]). These have been sufficient to draw librarian-, scientist-, hacker-, and public volunteers who have accomplished a lot! But, as the EOT folks will remind us, most of this work is volunteer work.

The next stages will require more resources and long-term commitments. Notable next tasks include: creating metadata, identifying and acquiring DataRefuge’s uncrawlable data, and doing Quality Assurance (QA) work on content that has been acquired. This work has begun. The University of North Texas, for example, has created a pilot crowdsourcing project to catalog a cache of EOT PDFs and is looking for volunteers. This upcoming work is essential in order to make content we rescue and acquire discoverable and usable and to ensure that the content is preserved for the long-term.

As we look to the long term, we turn to the two main international standards for long-term preservation: OAIS (Reference Model For An Open Archival Information System) and TDR (Audit And Certification Of Trustworthy Digital Repositories). In the terminology of those standards, our current actions have focused on “ingest.” Now we have to focus on the other functions of a TDR: management, preservation, access, and use. We might say that what we have been doing is Data Rescue, but what we will do next is Data Preservation, which includes discovery, access, and use.

Given that, here is our suggestion for a long-term goal:

Create a digital government-information library infrastructure in which libraries collectively provide services for collections that are selected, acquired, organized, and preserved for specific Designated Communities (DCs). 

Adopting this goal will not slow down or interrupt existing efforts. Because it focuses on “Designated Communities” and the life-cycle of information, it will help prioritize our actions and help attract libraries to participate in the next-stage activities. It will also make long-term participation easier and more effective by helping participants understand where their activities lead, what the outcomes will be, and what benefits they will get tomorrow by investing their resources in these activities today.

How does simply adopting a goal do all that?

First, by expressing the long-term goal in the language of OAIS and TDR it assures participants that today’s activities will ensure long-term access to information that is important to their communities.

Second, by putting the focus on the users of the information it demonstrates to our local communities that we are doing this for them. This will help make it practical to invest needed resources in the necessary work. The goal focuses on users of information by explicitly saying that our actions have been and will be designed to provide content and services for specific user groups (Designated Communities in OAIS terminology).

Third, by focusing on an infrastructure rather than isolated projects, it provides an opportunity for libraries to benefit more by participating than by not participating.

The key to delivering these benefits lies in the concept of Designated Communities. In the paper-and-ink world, libraries were limited in who they could serve. “Users” had to be local; they had to be able to walk into our buildings. It was difficult and expensive to share either collections or services, so we limited both to members of our funding institution or a geographically-local community. In the digital world, we no longer have to operate under those constraints. This means that we can build collections for Designated Communities that are defined by discipline or subject or by how a community uses digital information. This is a big change from defining a community by its institutional affiliation or by its members’ geographical proximity to an institution or to each other.

This means that each participating institution can benefit from the contributions of all participating institutions. To use a simple example, if ten libraries each invested the cost of developing collections and services for two DCs, all ten libraries (and their local/institutional communities) would get the benefits of twenty specific collections and services. There are more than one thousand Federal Depository Library Program (FDLP) libraries, so the same arithmetic applied to even a fraction of them would yield far more.

Even more importantly, this model means that the information-users will get better collections of the information they need and will get services that are tailored to how they look for, select, and use that information.

This approach may seem unconventional to government information specialists who are familiar with agency-based collections and services. The digital world allows us to combine the benefits of agency-based acquisitions with DC-based collections and services.

This means that we can still use the agency-based model for much of our work while simultaneously providing collections for DCs. For example, it is probably always more efficient and effective to identify, select, and acquire information by focusing on the output of an agency. It is certainly easier to ensure comprehensiveness with this approach. It is often easier to create metadata and do QA for a single agency at a time. Information content can also be easily stored and managed using the same agency-based approach. And information stored by agency can be viewed and served (through use of metadata and APIs) as a single “virtual” collection for a Designated Community. Any given document, dataset, or database may show up in the collections of several DCs, and any given “virtual” collection can easily contain content from many agencies.

For example, consider how this approach would affect a Designated Community of economists. A collection built to serve economists would include information from multiple agencies (e.g., Commerce, Council of Economic Advisers, CBO, GAO, NEC, USDA, ITA, etc.). When one library built such a collection and provided services for it, every library with economists would be able to better serve its community of economists. And every economist at every institution would be able to more easily find and use the information she needs. The same advantages would be true for DCs based on kind of use (e.g., document-based reading; computational textual-analysis; GIS; numeric data analysis; re-purposing and combining datasets; etc.).
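The idea of storing content by agency while serving it as DC-based “virtual” collections can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not a description of any existing system: the `Item` record, the identifiers, the subject tags, and both function names are hypothetical, and a real infrastructure would rely on much richer metadata and on APIs rather than in-memory lists.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """A hypothetical catalog record; all field names are assumptions."""
    identifier: str
    agency: str
    subjects: frozenset


def store_by_agency(items):
    """Group records by issuing agency, mirroring agency-based storage."""
    store = defaultdict(list)
    for item in items:
        store[item.agency].append(item)
    return store


def virtual_collection(store, dc_subjects):
    """Select, across all agencies, every item tagged with a subject the DC cares about."""
    wanted = set(dc_subjects)
    matches = [
        item
        for agency_items in store.values()
        for item in agency_items
        if item.subjects & wanted
    ]
    return sorted(matches, key=lambda item: item.identifier)


# Made-up records for illustration only.
items = [
    Item("cbo-0001", "CBO", frozenset({"economics", "budget"})),
    Item("usda-0001", "USDA", frozenset({"economics", "agriculture"})),
    Item("usda-0002", "USDA", frozenset({"agriculture", "gis"})),
    Item("gao-0001", "GAO", frozenset({"oversight"})),
]

store = store_by_agency(items)
economists = virtual_collection(store, {"economics"})
gis_users = virtual_collection(store, {"gis", "agriculture"})

print([item.identifier for item in economists])
print([item.identifier for item in gis_users])
```

In this sketch nothing is copied or moved: the items stay filed under their agencies, and each DC collection is just a view computed from metadata. The record `usda-0001` appears in both views, which matches the point that one document may belong to several DC collections.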


We believe that adopting this goal will have several benefits. It will help attract more libraries to participate in the essential work that needs to be done after information is captured. It will provide a clear path for planning the long-term preservation of the information acquired. It will provide better collections and services to more users more efficiently and effectively than could be done by individual libraries working on their own. It will demonstrate the value of libraries to our local user-communities, our parent institutions, and funding agencies.

James A. Jacobs, Librarian Emeritus, University of California San Diego
James R. Jacobs, Federal Government Information Librarian, Stanford University

End-of-term crawl ongoing. Please help us do QA!

The End of Term 2016 collection is still going strong, and we continue to receive email from interested folks about how they can help. Much of the content for the EOT crawl has already been collected and some of it is publicly accessible already through our partners. Last month we posted about ways to help the collection process. At this point volunteers are encouraged to help check the archive to see if content has been archived (i.e., do quality assurance (QA) for the crawls).

Here’s how you can help us assure that we’ve collected and archived as thoroughly and completely as possible:

Step 1: Check the Wayback Machine

Search the Internet Archive to see if the URL has already been captured. Please note this is not a specific End of Term collection search and does not include ALL content archived by the End of Term partners, but will be helpful in identifying whether something has been preserved already.

You may type in specific URLs or domains or subdomains, or try a simple keyword search (in Beta!).
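For volunteers who want to check many URLs rather than typing them in one at a time, the lookup can be scripted. The sketch below assumes the Internet Archive's public Wayback availability endpoint (`https://archive.org/wayback/available`) and assumes its JSON response carries an `archived_snapshots.closest` entry; please verify both against the Internet Archive's current documentation before relying on them. The function names are ours, not part of any End of Term tooling. As with the manual search, a hit here only shows that the Wayback Machine holds some capture, not that the End of Term partners have archived the page completely.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for the Wayback availability lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(target):
    """Build the lookup URL for a page, domain, or subdomain."""
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode({"url": target})


def closest_snapshot(payload):
    """Return (snapshot_url, timestamp) from a decoded response, or None if no capture is reported."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"], closest["timestamp"]
    return None


def check(target, timeout=30):
    """Query the endpoint over the network and return the closest capture, if any."""
    with urllib.request.urlopen(availability_url(target), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

Calling `check("www.stlouisfed.org")` would return either a snapshot URL with its timestamp or `None`. A `None` result is a reasonable cue to move on to the Nomination Tool check in Step 2.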

1a: Help Perform Quality Assurance

If you do find a site or URL you were looking for, please check whether it was captured completely. A simple way to do this is to click around the archived page: try the navigation, links on the page, images, etc. We need help identifying parts of the sites that the crawlers might have missed, for instance specific documents or pages that you are looking for but that we haven’t archived. Please note that crawlers are not perfect and cannot archive some content. IA has a good FAQ about the challenges crawlers face.

If you do discover something is missing, you can still nominate pages or documents for archiving using the link in step 3 below.

Step 2: Check the Nomination Tool

Check the Nomination Tool to see if the URL or site has been nominated already. There are a few ways to do this:

Step 3: Nominate It!

If you don’t see the URL you were looking for in any of those searches, please nominate it here.

There are a few plugins and bookmarklets to help you nominate via your browser, e.g., this one created by Matt Price for an event at the University of Toronto, and others available at the bottom of this page.

Questions? Please contact the End of Term project at eot-info AT archive DOT org.

Attend the FGI virtual EOT seed nomination sprint. Help make and preserve .gov history!

If you’ve been waiting for your chance to make history: now’s the time!

Please join us for the FGI virtual End of Term Project Web archiving nomination sprint on Wednesday 11 January 2017 from 9AM – 11AM Pacific / 12 noon – 2PM EST. During that time, we’ll set up a virtual conference room and give a brief presentation of the End of Term crawl and the ins and outs of nominating seeds. Volunteers will then be on hand to answer your questions, suggest agencies for deep exploration, and take information about databases and other resources that are tricky to capture with traditional web archiving. RSVP TODAY!

If you’re new to the End of Term Project, it’s a collaborative project to collect and preserve public United States government web sites prior to the end of the current presidential administration on January 20, 2017. Working together, the Library of Congress, California Digital Library, University of North Texas Libraries, Internet Archive, George Washington University Libraries, Stanford University Libraries, and the U.S. Government Publishing Office (GPO) are conducting a broad, comprehensive Web harvest of the .gov/.mil domain based on prioritized lists of URLs, including social media. As it did in 2008 and 2012 (previous harvests are accessible here), the project’s goal is to document federal agencies’ presence on the World Wide Web during the transition of Presidential administrations, to enhance the existing archival Internet collections, and to give the public access to archived digital government information.

This sprint to nominate seeds is a big part of making it happen! Hundreds of volunteers and institutions are already involved in the effort. We hope you’ll join the conversation and the fun. There may even be a few (completely non-monetary) prizes for top contributors.

You can pre-register here. We’ll contact you as the date gets closer with access information for the virtual conference.

The final deadline to nominate URLs prior to Inauguration Day is Friday, January 13th, so even if you can’t sprint with us, keep the nominations coming! Questions? Email us at admin AT freegovinfo DOT com.

2016 End of Term (EOT) crawl and how you can help

[Editor’s note: Updated 12/15/16 to include updated email address for End-of-Term project queries (eot-info AT archive DOT org), and information about robots.txt (#1 below) and databases and their underlying data (#5 below). Also updated 12/22/16 with note about duplication of efforts and how to dive deeply into an agency’s domain at the bottom of #1 section. jrj]

Here at FGI, we’ve been tracking the disappearance of government information for quite some time (and librarians have been doing it for longer than we have; see ALA’s long running series published from 1981 until 1998 called “Less Access to Less Information By and About the U.S. Government.”). We’ve recently written about the targeting of NASA’s climate research site and the Department of Energy’s carbon dioxide analysis center for closure.

But ever since the NY Times ran the story “Harvesting Government History, One Web Page at a Time” last week, there has been renewed worry and interest from the library- and scientific communities as well as the public in archiving government information. And there’s been increased interest in the End of Term (EOT) crawl project. Though there’s increased worry about the loss of government information with the incoming Trump administration, it’s important to note that the End of Term crawl has been going on since 2008, with both Republican and Democratic administrations, and will go on past 2016. EOT is working to capture as much of the .gov/.mil domains as we can, and we’re also casting our ‘net to harvest social media content and government information hosted on non-.gov domains (e.g., the St Louis Federal Reserve Bank at www.stlouisfed.org). We’re running several big crawls right now (you can see all of the seeds we have here as well as all of the seeds that have been nominated so far) and will continue to run crawls up to and after the Inauguration as well. We strongly encourage the public to nominate seeds of government sites so that we can be as thorough in our crawling as possible.