
Tag Archives: End of term archive

Our mission

Free Government Information (FGI) is a place for initiating dialogue and building consensus among the various players (libraries, government agencies, non-profit organizations, researchers, journalists, etc.) who have a stake in the preservation of and perpetual free access to government information. FGI promotes free government information through collaboration, education, advocacy and research.

These Advocates Want to Make Sure Our Data Doesn’t Disappear

Here’s another story about data rescue and the preservation of government information, this time from PC Magazine UK. The last Data Refuge event was held in Denton, TX in May, and the 2016 End of Term crawl has finished its collection work and will soon make its 200TB of data publicly accessible. Even so, there remains much interest, and not a little worry, about the collection and preservation of government information and data. And with stories continuing to come out about US government agencies scrubbing or significantly altering their websites (e.g., this one from the Guardian entitled “Another US agency deletes references to climate change on government website”), this issue will not be going away any time soon.

“Somewhere around 20 percent of government info is web-accessible,” said Jim (sic.) Jacobs, the Federal Government Information Librarian at Stanford University Library. “That’s a fairly large chunk of stuff that’s not available. Though agencies have their own wikis and content management systems, the only time you find out about some of it is if someone FOIAs it.”

To be sure, a great deal of information was indeed captured and now resides on non-government servers. Between Data Refuge events and projects such as the 2016 End-of-Term Crawl, over 200TB of government websites and data were archived. But rescue organizers began to realize that piecemeal efforts to make complete copies of terabytes of government agency science data could not realistically be sustained over the long term—it would be like bailing out the Titanic with a thimble.

So although Data Rescue Denton ended up being one of the final organized events of its kind, the collective effort has spurred a wider community to work in concert toward making more government data discoverable, understandable, and usable, Jacobs wrote in a blog post.

via Feature: These Advocates Want to Make Sure Our Data Doesn’t Disappear.

Tweets of Congress, tweets of Trump archived and downloadable in bulk

The recently launched Tweets Of Congress is collecting and publishing daily archives of tweets by congressional representatives, caucuses, and committees. The site only got up and running last week, so there are daily archives starting June 21, 2017. There’s also the Trump Twitter Archive, which has collected more than 30,000 of @realDonaldTrump’s tweets and makes them searchable and downloadable in bulk.

But this points to a larger issue: the US government’s use of commercial social media sites and tools to communicate with the public. This time around, the 2016 End of Term crawl included 9,000+ social media accounts (scraped from the .gov social media registry API), of which 44% were Facebook, 37% Twitter, and 10% YouTube accounts. We also collected ~130 TB of .gov FTP sites that agencies use to serve out their collected data sets.

Tweets of Congress is my attempt to collate the entirety of Congress’ daily Twitter output using an automated process that checks Twitter on a fixed interval. Archives are available on this site and in JSON form. You can find JSON datasets linked in posts or in this site’s Github repo. Due to size constraints, archives will be limited at some tbd point. This site is open-source, so feel free to fork or whatever to your heart’s content. For any issues or other feedback, file an issue in the repo or send me an email.

via About – Tweets of Congress.

HT Data Is Plural 2017.06.28 edition. Don’t forget to subscribe to Jeremy Singer-Vine’s Data Is Plural weekly newsletter!

End-of-term crawl ongoing. Please help us do QA!

The End of Term 2016 collection is still going strong, and we continue to receive email from interested folks about how they can help. Much of the content for the EOT crawl has already been collected, and some of it is publicly accessible through our partners. Last month we posted about ways to help the collection process. At this point volunteers are encouraged to help check the archive to see if content has been archived (i.e., do quality assurance (QA) for the crawls).

Here’s how you can help us assure that we’ve collected and archived as thoroughly and completely as possible:

Step 1: Check the Wayback Machine

Search the Internet Archive to see if the URL has already been captured. Please note this is not a specific End of Term collection search and does not include ALL content archived by the End of Term partners, but will be helpful in identifying whether something has been preserved already.

You may type in specific URLs or domains or subdomains, or try a simple keyword search (in Beta!).
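If you have many URLs to check, the lookup in Step 1 can also be scripted. The following is a minimal sketch, assuming the Internet Archive's public Wayback availability endpoint (`https://archive.org/wayback/available`) and its JSON response shape. The function names are hypothetical and are not part of any End of Term tooling. As noted above, a hit or miss here reflects the Wayback Machine only, not everything archived by the End of Term partners.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Internet Archive's public availability endpoint.
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the request URL asking whether `url` has a capture."""
    params = {"url": url}
    if timestamp:
        # YYYYMMDD[hhmmss]: ask for the capture closest to this date.
        params["timestamp"] = timestamp
    return WAYBACK_AVAILABILITY + "?" + urlencode(params)


def closest_snapshot(payload):
    """Return (snapshot_url, timestamp) from a parsed response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


def check(url, timestamp=None):
    """Network call: look up `url` and return its closest capture, if any."""
    with urlopen(availability_query(url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `check("epa.gov", "20170119")` would return the capture nearest to that date, or `None` if the Wayback Machine reports no capture; a `None` result is a prompt to continue with the steps below, not proof that the page is unarchived.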

1a: Help Perform Quality Assurance

If you do find a site or URL you were looking for, please check whether it was captured completely. A simple way to do this is to click around the archived page: try the navigation, links on the page, images, etc. We need help identifying parts of sites that the crawlers might have missed, for instance specific documents or pages that you are looking for but that we haven’t archived. Please note that crawlers are not perfect and cannot archive some content. IA has a good FAQ about the challenges crawlers face.

If you do discover something is missing, you can still nominate pages or documents for archiving using the link in step 3 below.
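For larger pages, clicking through every link by hand gets tedious. One way to make the QA pass more systematic is to list the links and images on the live page first, then look each one up in the Wayback Machine. The sketch below uses only the Python standard library. The class and function names are hypothetical, and `www.example.gov` is a placeholder, not a real target of the crawl.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) URLs from <a href> and <img src> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name != wanted or not value or value.startswith("#"):
                continue  # skip other attributes and same-page anchors
            # Resolve relative paths and drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(self.base_url, value))
            is_web = absolute.startswith(("http://", "https://"))
            if is_web and absolute not in self.links:
                self.links.append(absolute)


def links_to_check(html, base_url):
    """Return the de-duplicated list of URLs a QA volunteer should verify."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

Each URL returned by `links_to_check` is a candidate to look up in the archive; anything that turns out to be missing can be nominated as described in Step 3.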

Step 2: Check the Nomination Tool

Check the Nomination Tool to see if the URL or site has been nominated already. There are a few ways to do this:

Step 3: Nominate It!

If you don’t see the URL you were looking for in any of those searches, please nominate it here.

There are a few plugins and bookmarklets to help nominate via your browser, e.g., this one created by Matt Price for an event at University of Toronto, and others available at the bottom of this page.

Questions? Please contact the End of Term project at eot-info AT archive DOT org.

Panel on End-of-term crawl and the collection of vulnerable government information

I was honored last week to be part of a panel hosted by OpenTheGovernment and the Bauman Foundation to talk about the End of Term project. Other presenters included Jess Kutch at Coworker.org and Micah Altman, Director of Research at MIT Libraries. I talked about what EOT is doing, as well as some of the other great projects, including Climate Mirror, Data Refuge and the Azimuth backup project, working in concert/parallel to preserve federal climate and environmental data.

I thought the Q&A segment was especially interesting because it raised and answered some of the common questions and concerns that EOT receives on a regular basis. I also learned about a cool project called Violation Tracker, a search engine on corporate misconduct. And I was also able to talk a bit about the needs going forward, including the idea of “Information Management Plans” for agencies, similar to the idea of “Data Management Plans” for all federally funded research. I was heartened to know that there is interest in that as a wider policy advocacy effort!

The full recorded meeting can be viewed here from Bauman’s Adobe Connect account.

Here’s more information on the EOT crawl and how you can help.

Coalitions of government, university, and public interest organizations have been working to ensure as much information as possible is preserved and accessible, amid growing concern that important and sensitive government data on climate, labor, and other issues may disappear from the web once the Trump Administration takes office.

Last Thursday, OTG and the Bauman Foundation hosted a meeting of advocates interested in preserving access to government data, and individuals involved in web harvesting efforts. James Jacobs, a government information librarian at Stanford University Library who is working on the End of Term (EOT) web harvest – a joint project between the Internet Archive, the Library of Congress, the Government Publishing Office, and several universities – spoke about the EOT crawl, and explained the various targets of the harvest, including all .gov and .mil web sites, government social media accounts, and more.

Jess Kutch discussed efforts by Coworker.org with Cornell University to preserve information related to workers’ rights and labor protections, and other meeting attendees presented some of their own projects as well. Philip Mattera explained how Good Jobs First is using its Violation Tracker database to scrape and preserve government source material related to corporate misconduct.  

Micah Altman, Director of Research at MIT Libraries, presented on the need for libraries and archives to build better infrastructure for the EOT harvest and other projects – including data portals, cloud infrastructure, and technologies that enhance discoverability – so that data and other government information can be made more easily accessible to the public.

via Volunteers work to preserve access to vulnerable government information, and you can help | OpenTheGovernment.org.

Attend the FGI virtual EOT seed nomination sprint. Help make and preserve .gov history!

If you’ve been waiting for your chance to make history: now’s the time!

Please join us for the FGI virtual End of Term Project Web archiving nomination sprint on Wednesday 11 January 2017 from 9AM – 11AM Pacific / 12 noon – 2PM EST. During that time, we’ll set up a virtual conference room and give a brief presentation on the End of Term crawl and the ins and outs of nominating seeds. Volunteers will then be on hand to answer your questions, suggest agencies for deep exploration, and take information about databases and other resources that are tricky to capture with traditional web archiving. RSVP TODAY!

If you’re new to the End of Term Project, it’s a collaborative project to collect and preserve public United States government web sites prior to the end of the current presidential administration on January 20, 2017. Working together, the Library of Congress, California Digital Library, University of North Texas Libraries, Internet Archive, George Washington University Libraries, Stanford University Libraries, and the U.S. Government Publishing Office (GPO) are conducting a thorough Web harvest of the .gov/.mil domain based on prioritized lists of URLs, including social media. As it did in 2008 and 2012 (previous harvests are accessible here), the project’s goal is to document federal agencies’ presence on the World Wide Web during the transition of Presidential administrations, to enhance the existing archival Internet collections, and to give the public access to archived digital government information.

This sprint to nominate seeds is a big part of making it happen! Hundreds of volunteers and institutions are already involved in the effort. We hope you’ll join the conversation and the fun. There may even be a few (completely non-monetary) prizes for top contributors.

You can pre-register here. We’ll contact you as the date gets closer with access information for the virtual conference.

The final deadline to nominate URLs prior to Inauguration Day is Friday, January 13th, so even if you can’t sprint with us, keep the nominations coming! Questions? Email us at admin AT freegovinfo DOT com.
