Document of the day: NSA’s guide to the internet

February 1, 2017
Filed under: Doc of the day, post 

This just came through my Twitter feed from @MuckRock. Thanks to a FOIA request that shook it loose from the notoriously difficult NSA, we now have access to the NSA’s 2007 Untangling the Web: A Guide to Internet Research. It kind of reads like a Terry Pratchett novel if Pratchett were having a psychotic/psychedelic episode. As MuckRock notes, “you don’t have to go very far before this takes a hard turn into ‘Dungeons and Dragons campaign/Classics major’s undergraduate thesis’ territory.” Read on; you’ll thank me later!

And if you’re interested, I collected and cataloged a copy for our library. The original NSA link to the document no longer resolves (and it was put up just last year!!), but there’s an archived copy in the Wayback Machine.

The NSA has a well-earned reputation for being one of the tougher agencies to get records out of, making those rare FOIA wins all the sweeter. In the case of Untangling the Web, the agency’s 2007 guide to internet research, the fact that the records in question just so happen to be absolutely insane are just icing on the cake – or as the guide would put it, “the nectar on the ambrosia.”

via The NSA’s guide to the internet is the weirdest thing you’ll read today.

End-of-term crawl ongoing. Please help us do QA!

January 31, 2017
Filed under: post 

The End of Term 2016 collection is still going strong, and we continue to receive email from interested folks asking how they can help. Much of the content for the EOT crawl has already been collected, and some of it is publicly accessible through our partners. Last month we posted about ways to help the collection process. At this point, volunteers are encouraged to do quality assurance (QA) for the crawls, that is, to check the archive and confirm that content has actually been archived.

Here’s how you can help us ensure that we’ve collected and archived as thoroughly and completely as possible:

Step 1: Check the Wayback Machine

Search the Internet Archive to see if the URL has already been captured. Please note this is not a specific End of Term collection search and does not include ALL content archived by the End of Term partners, but will be helpful in identifying whether something has been preserved already.

You may type in specific URLs or domains or subdomains, or try a simple keyword search (in Beta!).
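If you would rather script this check than type URLs in one at a time, here is a minimal sketch in Python. It assumes the Internet Archive’s public availability endpoint at archive.org/wayback/available, which this post does not mention, and the function names are invented for illustration. Like the manual search, it looks at the general Wayback Machine, not at a specific End of Term collection.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Internet Archive's public "availability" endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(target_url, timestamp=None):
    """Build the query URL; `timestamp` (e.g. YYYYMMDD) asks for the capture closest to that date."""
    params = {"url": target_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode(params)


def closest_snapshot(payload):
    """Return the closest capture from a decoded JSON response, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest
    return None


def check_wayback(target_url, timestamp=None):
    """Hypothetical helper: query the endpoint over the network and return the closest capture."""
    query = build_availability_url(target_url, timestamp)
    with urllib.request.urlopen(query, timeout=30) as response:
        payload = json.load(response)
    return closest_snapshot(payload)
```

A result of `None` from `check_wayback` suggests that no capture was found, which makes the URL a candidate for the Nomination Tool check in Step 2 and nomination in Step 3.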

1a: Help Perform Quality Assurance

If you do find the site or URL you were looking for, please check whether it was captured completely. A simple way to do this is to click around the archived page: follow the navigation, links on the page, images, etc. We need help identifying parts of sites that the crawlers might have missed, for instance specific documents or pages that you are looking for but that we haven’t archived. Please note that crawlers are not perfect and cannot archive some content. IA has a good FAQ about the challenges crawlers face.

If you do discover something is missing, you can still nominate pages or documents for archiving using the link in step 3 below.
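Clicking through every link by hand gets tedious on large pages. As a rough aid, the sketch below lists the link and image targets on a page so that each one can be checked in turn. It uses only Python’s standard library; the class and function names are invented for this example, and it only reads HTML that you hand it, so fetching the archived page is left to you.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect link and image targets from one HTML page (illustrative helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.targets = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a":
            value = attributes.get("href")
        elif tag == "img":
            value = attributes.get("src")
        else:
            value = None
        # Skip missing values and in-page anchors such as "#top".
        if value and not value.startswith("#"):
            absolute = urljoin(self.base_url, value)
            if absolute not in self.targets:
                self.targets.append(absolute)


def links_to_check(html_text, base_url):
    """Return the de-duplicated, absolute link and image URLs found in `html_text`."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return collector.targets
```

Each URL returned by `links_to_check` is one more thing to look up in the archive; anything that turns out to be missing can be nominated as described above.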

Step 2: Check the Nomination Tool

Check the Nomination Tool to see if the URL or site has been nominated already. There are a few ways to do this:

Step 3: Nominate It!

If you don’t see the URL you were looking for in any of those searches, please nominate it here.

There are a few plugins and bookmarklets to help you nominate via your browser, e.g., this one created by Matt Price for an event at the University of Toronto, and others available at the bottom of this page.

Questions? Please contact the End of Term project at eot-info AT archive DOT org.

2016 End of Term (EOT) crawl and how you can help

December 9, 2016
Filed under: post 

[Editor’s note: Updated 12/15/16 to include updated email address for End-of-Term project queries (eot-info AT archive DOT org), and information about robots.txt (#1 below) and databases and their underlying data (#5 below). Also updated 12/22/16 with note about duplication of efforts and how to dive deeply into an agency’s domain at the bottom of #1 section. jrj]

Here at FGI, we’ve been tracking the disappearance of government information for quite some time. Librarians have been doing it for longer than we have; see ALA’s long-running series “Less Access to Less Information By and About the U.S. Government,” published from 1981 until 1998. We’ve recently written about the targeting of NASA’s climate research site and the Department of Energy’s carbon dioxide analysis center for closure.

But ever since the NY Times ran the story “Harvesting Government History, One Web Page at a Time” last week, there has been renewed worry about, and interest in, archiving government information from the library and scientific communities as well as the public. There’s also been increased interest in the End of Term (EOT) crawl project. Although worry about the loss of government information has grown with the incoming Trump administration, it’s important to note that the End of Term crawl has been going on since 2008, under both Republican and Democratic administrations, and will go on past 2016. EOT is working to capture as much of the .gov/.mil domains as we can, and we’re also casting our net to harvest social media content and government information hosted on non-.gov domains (e.g., the St Louis Federal Reserve Bank at www.stlouisfed.org). We’re running several big crawls right now (you can see all of the seeds we have here as well as all of the seeds that have been nominated so far) and will continue to run crawls up to and after the Inauguration. We strongly encourage the public to nominate seeds of government sites so that we can be as thorough in our crawling as possible.


Status of the Wayback Machine

May 26, 2011
Filed under: post 

Roy updates us on the status of the Wayback Machine with an example from the White House:

  • Back to the Wayback Machine, Roy Tennant, Library Journal (May 18th, 2011).

    But that means that any claims to be “archiving the web” should be taken with a grain of salt. Maybe say “archiving the parts of the web that matter” or “ignoring what doesn’t matter so much”.

And don’t forget Archive-It, the web archiving service from the Internet Archive.

Through a user-friendly web interface, Archive-It partners can catalog, manage, and browse their archived collections using web archiving tools developed at the Internet Archive. Collections are hosted at the Internet Archive data center and are accessible to the public, including full-text search.

  • Our mission

    Free Government Information (FGI) is a place for initiating dialogue and building consensus among the various players (libraries, government agencies, non-profit organizations, researchers, journalists, etc.) who have a stake in the preservation of and perpetual free access to government information. FGI promotes free government information through collaboration, education, advocacy and research.