[Editor’s note: Updated 12/15/16 to include updated email address for End-of-Term project queries (eot-info AT archive DOT org), and information about robots.txt (#1 below) and databases and their underlying data (#5 below). Also updated 12/22/16 with note about duplication of efforts and how to dive deeply into an agency’s domain at the bottom of #1 section. jrj]
Here at FGI, we’ve been tracking the disappearance of government information for quite some time (and librarians have been doing it for longer than we have; see ALA’s long-running series “Less Access to Less Information By and About the U.S. Government,” published from 1981 until 1998). We’ve recently written about the targeting of NASA’s climate research site and the Department of Energy’s carbon dioxide analysis center for closure.
But ever since the NY Times ran the story “Harvesting Government History, One Web Page at a Time” last week, there has been renewed worry and interest from the library and scientific communities, as well as the public, in archiving government information. There’s also been increased interest in the End of Term (EOT) crawl project. Though there’s increased worry about the loss of government information with the incoming Trump administration, it’s important to note that the End of Term crawl has been going on since 2008, under both Republican and Democratic administrations, and will go on past 2016. EOT is working to capture as much of the .gov/.mil domains as we can, and we’re also casting our ‘net to harvest social media content and government information hosted on non-.gov domains (e.g., the St Louis Federal Reserve Bank at www.stlouisfed.org). We’re running several big crawls right now (you can see all of the seeds we have here as well as all of the seeds that have been nominated so far) and will continue to run crawls up to and after the Inauguration. We strongly encourage the public to nominate seeds of government sites so that we can be as thorough in our crawling as possible.
Over the last week, several people have contacted the EOT project wondering what they can do to help. We’ve given advice and tips to several groups who’ve already put events together to work on this, including the folks at the University of Toronto’s Technoscience Research Unit, who are putting together a Guerrilla Archiving Event: Saving Environmental Data from Trump; the Fellows at the UPenn Program in the Environmental Humanities; and Professor Debbie Rabina at Pratt Institute’s School of Information, who’s holding *another* event (she and her government information students have done several nominating klatches over the years!) on Tuesday December 13 from 10am – 1pm at the New York Academy of Medicine, 1216 Fifth Avenue at 103rd Street, New York, NY.
Since there’s been an increasing amount of interest and people from all over the country contacting us (AND all of the folks working on EOT have day jobs :-)) I thought it’d be helpful to share more widely the recommendations and advice we’ve been sharing with those groups that have already contacted us. If you, the reader, want to help as an individual, or want to organize a group to help EOT be as thorough as we can, here’s what you can do:
1) Seed nomination: Nominate sites you feel are important. Include both top-level domains (e.g., epa.gov) and subdomains (e.g., nepis.epa.gov). Here’s the nomination tool.
The usual areas of concern for us are .gov content not on .gov domains, social media, and anything buried deep on some of the larger sites. If you’re working as a group, you might want to pick a topic to focus on, but we’re happy to accept any and all nominations you come up with. One way to do this is to run Google searches for topic(s) of interest and include the .gov search parameter (e.g., “environment site:*.gov”). That restricts the search to the .gov domain for that keyword, and you’ll quickly find the govt sites of interest to you. Don’t worry about whether your nominated site has already been nominated. We’ll de-duplicate our list of seeds.
[Added 12/15/16] When you find a site, check to see if there’s a robots.txt exclusion or block on Web crawlers. You can do this by appending “/robots.txt” to the site’s address; e.g., EPA’s site www3.epa.gov has a robots.txt file at www3.epa.gov/robots.txt. Some sites, like NASA’s technical reports server, put restrictions on systematic downloading. Luckily, NASA has set up an auxiliary harvesting mechanism using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), but not all agencies do this. If you find a site that blocks crawlers, please email the project at eot-info AT archive DOT org and let us know. We’ll try to harvest the content another way if possible.
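For nominators comfortable with a little scripting, the robots.txt check can be roughed out with Python's standard `urllib.robotparser` module. This is only a sketch: the sample robots.txt text and the agency.example.gov addresses are made up for illustration, and the "*" user agent is an assumption (the post does not say which user agent the EOT crawlers announce).

```python
from urllib.robotparser import RobotFileParser


def crawl_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt content for illustration only. A real file would be
# downloaded from an address such as www3.epa.gov/robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /reports/
"""

# The first URL falls under the Disallow rule; the second does not.
print(crawl_allowed(SAMPLE_ROBOTS, "https://agency.example.gov/reports/1.pdf"))
print(crawl_allowed(SAMPLE_ROBOTS, "https://agency.example.gov/about.html"))
```

A URL that comes back as blocked is a good candidate to email to the project rather than simply nominate.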
[Added 12/22/16] One thoughtful seed nominator asked the EOT project how to ensure that various projects don’t duplicate efforts. The #DataRefuge and Azimuth projects are targeting environmental and climate data specifically, but that of course runs across several agencies including EPA, NOAA, Interior and others. The EOT volunteers haven’t kept a list of what other gatherings are doing, or tracked whether they’re targeting specific topics or agencies.
However, the nomination tool lets you put in a url and query the EOT seed list to see if that url already exists in the system. This is not perfect, of course, because a lower-down url (e.g., nepis.epa.gov) may not be listed even though the top-level url (epa.gov) definitely is. In that case the lower-down url is still within our crawl scope and will be crawled, even if the nomination tool makes it look as though nepis.epa.gov is not on our list.
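If a group keeps its own working list of seeds, the "is this url already covered by a top-level seed?" question can be approximated in a few lines of Python. This is a sketch under a simplifying assumption: it treats a url as covered when its host equals a seed's host or is a subdomain of it. The actual scoping rules used by the EOT crawlers are not spelled out here, and the function names are hypothetical.

```python
from urllib.parse import urlparse


def host_of(url: str) -> str:
    """Extract the lowercase hostname, tolerating urls without a scheme."""
    if "://" not in url:
        url = "http://" + url
    return (urlparse(url).hostname or "").lower()


def covered_by_seed(url: str, seeds: list[str]) -> bool:
    """True if url's host equals a seed host or is a subdomain of one.

    Assumption: a leading "www." on a seed is ignored, so a seed of
    www.epa.gov is treated the same as epa.gov.
    """
    host = host_of(url)
    for seed in seeds:
        seed_host = host_of(seed).removeprefix("www.")
        if seed_host and (host == seed_host or host.endswith("." + seed_host)):
            return True
    return False


seeds = ["epa.gov"]
print(covered_by_seed("nepis.epa.gov", seeds))
print(covered_by_seed("https://www.stlouisfed.org/", seeds))
```

Being covered by a top-level seed does not make a nomination wasted effort: deep urls can still help the crawl reach buried content.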
If there’s a group working to nominate seeds, it might make sense to divvy up the work by federal agency and commission. Then do some Google searches that limit the keyword to that .gov domain (e.g., “water site:epa.gov” or “legislation site:*.gov”). Another good search to facilitate a dive into an agency would be “database site:*.gov” (see section #2 below for more on databases), which will bring up the databases running on each of the .gov domains and let you dig in to see whether you can nominate the lower-level urls and documents within them. You can also link several limiters together, like “site:*.gov filetype:pdf”.
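For a group splitting up agencies, it can be handy to generate the searches ahead of time and hand each volunteer a list. The Python sketch below only builds query strings and search links; the agency domains and keywords are example inputs, and the helper names are hypothetical.

```python
from urllib.parse import urlencode


def build_query(keyword: str = "", site: str = "*.gov", filetype: str = "") -> str:
    """Combine a keyword with site: and filetype: limiters into one query."""
    parts = [keyword] if keyword else []
    parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


def search_url(query: str) -> str:
    """Turn a query string into a Google search link that can be shared."""
    return "https://www.google.com/search?" + urlencode({"q": query})


def queries_for_agencies(domains, keywords):
    """One query per (agency domain, keyword) pair, for dividing up work."""
    return [build_query(k, site=d) for d in domains for k in keywords]


# Example inputs only; swap in the agencies and topics your group picks.
for q in queries_for_agencies(["epa.gov", "noaa.gov"], ["water", "database"]):
    print(q, "->", search_url(q))
```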
2) Deep web databases: Depending on the number of people at your event and their technical skill level, you can also expand beyond top-level .gov sites to include hidden databases. See, for example, EPA’s National Service Center for Environmental Publications (NSCEP) (http://epa.gov/nscep). It’s not enough to crawl the top level of NSCEP, because many of its documents are housed in a queryable database. I was able to find the page listing all of its publications (https://nepis.epa.gov/EPA/html/pubs/pubtitle.html) and nominated that url as well.
So it’d be great to nominate those lower-level urls that point to document lists. If you find databases that are only accessible via keyword search, you could nominate some example urls, but also email them to us (the project email is eot-info AT archive DOT org) to make us aware of them. We can then try to reverse engineer the crawl to get those hidden documents. Using NEPIS as an example, if you can find a page which lists all documents (https://nepis.epa.gov/EPA/html/pubs/pubtitle.html), then nominate that one. If there’s no publications index page, then having some example urls could help us get all of them.
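When you do find an index or search-results page, a short script can pull out the document links so you have concrete example urls to nominate or email. Here is a minimal Python sketch using only the standard library. The HTML snippet and the agency.example.gov base address are invented for illustration, and the list of file extensions is an assumption you would adjust per site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute form of every <a href="..."> on a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, href))


def document_links(html: str, base_url: str, extensions=(".pdf", ".txt")):
    """Return links on the page that end in one of the given extensions."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [u for u in collector.links if u.lower().endswith(extensions)]


# Invented page content for illustration only.
SAMPLE_HTML = """
<a href="reports/a.pdf">Report A</a>
<a href="/data/b.PDF">Report B</a>
<a href="about.html">About</a>
"""

for link in document_links(SAMPLE_HTML, "https://agency.example.gov/pubs/index.html"):
    print(link)
```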
3) Checking Wayback: Another tip for figuring out what to nominate is to use the Internet Archive’s Wayback Machine to check whether the site you want to nominate has been crawled by the Internet Archive in the past AND MORE IMPORTANTLY whether the Wayback Machine crawled the site in its entirety. For example, I recently learned that the Law Library of Congress had digitized and posted the historically important publication called the United States Statutes at Large from 1789 – 1950, so I checked the Wayback Machine to make sure the site had been archived. I found that the site had been archived but that the individual PDFs had not. However, by analyzing the urls, I was able to nominate (to both the Wayback Machine AND the EOT crawl!) the url www.loc.gov/law/help/statutes-at-large WITHOUT the trailing “/” (so NOT http://www.loc.gov/law/help/statutes-at-large/), which caused Wayback to collect all of the PDFs as well as the site itself. So it’s important to analyze and nominate good solid urls.
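Checking a handful of urls by hand is fine, but for a long list the Internet Archive's Wayback availability endpoint (archive.org/wayback/available) can be queried from a script. The Python sketch below builds the request and reads the JSON reply. The sample reply is a made-up illustration of the response shape, not real data, and the helper names are hypothetical. Note that this only says whether a capture exists for one exact url, so to confirm that individual PDFs were captured you would check each PDF url, not just the landing page.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(url: str) -> str:
    """Build the availability request for one url."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": url})


def closest_snapshot(response_text: str):
    """Return the closest snapshot record from a reply, or None if absent."""
    data = json.loads(response_text)
    snapshot = data.get("archived_snapshots", {}).get("closest")
    if snapshot and snapshot.get("available"):
        return snapshot
    return None


def slash_variants(url: str):
    """Both forms of a url, without and with the trailing slash."""
    bare = url.rstrip("/")
    return [bare, bare + "/"]


def check_wayback(url: str):
    """Ask the live service about one url (requires network access)."""
    with urlopen(availability_url(url), timeout=30) as response:
        return closest_snapshot(response.read().decode("utf-8"))


# Made-up reply, used here so the parsing can be shown without the network.
SAMPLE_REPLY = json.dumps({
    "url": "agency.example.gov/report.pdf",
    "archived_snapshots": {"closest": {
        "available": True, "status": "200", "timestamp": "20161201000000",
        "url": "http://web.archive.org/web/20161201000000/http://agency.example.gov/report.pdf",
    }},
})

print(closest_snapshot(SAMPLE_REPLY)["timestamp"])
print(slash_variants("www.loc.gov/law/help/statutes-at-large/"))
```

Calling `check_wayback` on each entry from `slash_variants` is one way to spot cases like the Statutes at Large example, where the two forms of a url may not have been captured the same way.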
4) FTP sites: The EOT crawl will no doubt do a good job with http/https content in the .gov domain. We’ve got seed lists from the last 2 EOT crawls, official lists from GSA and a number of nominated urls (see the reports at http://digital2.library.unt.edu/nomination/eth2016/reports/). However, FTP (File Transfer Protocol) is not currently part of the EOT project (though we’re talking about including FTP’d files). If one looks, one will find that EPA, NASA, NOAA, Census and other executive agencies have a lot of content and data sets being served out via FTP, and much of that content, especially the content dealing with the environment, is being targeted for budget cuts if not complete dismantling (e.g., NASA’s climate research site and the Department of Energy’s carbon dioxide analysis center, as noted above). Both of these sites host a lot of critical data sets on FTP servers, but are at risk of Stephen Harper-like behavior (scrubbing of environmental data from Canadian govt sites etc).
Several of us have been discussing the best ways to go about an FTP harvest. We don’t want to attack this issue blindly, so we could really use input, especially from research communities of interest, about which FTP sites are most important to you. Here’s a list of some important things we don’t know:
- Which datasets are in the sole custody of the US government, and which already have mirrors elsewhere?
- Which of the sole custody datasets are high priority for mirroring outside the US government?
- How big are they?
- How can they be accessed?
- Which of the high priority datasets are already targeted by scientists?
So if you can find FTP sites that are critical to you (e.g., ftp.census.gov), or for that matter sites where robots.txt blocks crawlers or databases that can’t easily be crawled, send them to us (remember, the project email is eot-info AT archive DOT org). Include as much information as you can discern about the site (number/size of files etc); it will help us focus on information of critical importance. Or better yet, for any FTP data sites serving out environmental/climate data, submit them via the form on the #DataRefuge project. #DataRefuge is collecting important data sets in one place, and EOT will no doubt use that list as well to do what we can on our end.
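To answer the "how big are they?" question for an FTP directory, something like the following Python sketch with the standard `ftplib` module may help. It makes several assumptions: the server allows anonymous login, it supports the MLSD listing command (not all servers do), and only one directory level is counted, so a full inventory would need to recurse into subdirectories. The sample entries are invented so the counting logic can be shown without touching a real server.

```python
from ftplib import FTP


def summarize(entries):
    """Count files and total bytes from (name, facts) pairs.

    The pairs have the shape produced by ftplib's FTP.mlsd(): a file name
    plus a dict of facts such as "type" and "size".
    """
    file_count = 0
    total_bytes = 0
    for _name, facts in entries:
        if facts.get("type") == "file":
            file_count += 1
            total_bytes += int(facts.get("size", 0))
    return file_count, total_bytes


def summarize_ftp_dir(host: str, path: str = "/"):
    """Summarize one directory on a live server (requires network access)."""
    with FTP(host, timeout=30) as ftp:
        ftp.login()  # anonymous login, an assumption about the server
        return summarize(ftp.mlsd(path, facts=["type", "size"]))


# Invented listing for illustration only.
SAMPLE_ENTRIES = [
    ("stations.csv", {"type": "file", "size": "100"}),
    ("archive", {"type": "dir"}),
    ("readings.zip", {"type": "file", "size": "2048"}),
]

files, size = summarize(SAMPLE_ENTRIES)
print(f"{files} files, {size} bytes")
```

The file count and byte total are exactly the kind of detail worth including when you email a site to the project or submit it to #DataRefuge.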
5) Databases and their underlying data: I’m guessing that both EJSCREEN and BEACON *applications* will be difficult/impossible for Heritrix to harvest and serve out from within the archive. But if you can find the underlying data, that *could* be harvested either via the End of Term crawl (if you nominate the url where the data resides) or through nominating the url (if it’s FTP) to the #DataRefuge project. EOT will also be using the rapidly growing #DataRefuge spreadsheet, so nominating data sets there will help both projects.
6) Get the word out and let us know: Lastly, please send email to your popular listservs, let them know about the EOT project, what you’re doing to help out, and ask that others do the same thing at their universities. And by all means, tweet @EOTarchive to let us know what you’re doing too!
OK, I think that’s it for now. If you and your volunteer groups come up with other strategies, please email us (eot-info AT archive DOT org) and we’ll update this strategies list.