
Tag Archives: End of term archive

Our mission

Free Government Information (FGI) is a place for initiating dialogue and building consensus among the various players (libraries, government agencies, non-profit organizations, researchers, journalists, etc.) who have a stake in the preservation of and perpetual free access to government information. FGI promotes free government information through collaboration, education, advocacy and research.

The EPA’s Website after a year of climate change censorship

Here’s a good article from Time Magazine, “Here’s What the EPA’s Website Looks Like After a Year of Climate Change Censorship,” which accurately reports how the Trump Administration and EPA Administrator Scott Pruitt have changed, skewed or deleted government information from the EPA Website for crass political purposes. For more in-depth analysis of information scrubbing from federal websites, see the work of the Environmental Data and Governance Initiative (EDGI), especially their reports “Changing the Digital Climate” and “The EPA Under Siege”.

According to former government officials and EPA staffers, the level of scrutiny is without precedent. In the hands of an administration that has eschewed facts for their alternative cousins, the agency’s site is increasingly unmoored from its scientific core.

“In my experience, new administrations might come in and change the appearance of an agency website or the way they present information, but this is an unprecedented attempt to delete or bury credible scientific information they find politically inconvenient,” Heather Zichal, a senior fellow at the Atlantic Council’s Global Energy Center, and previously President Barack Obama’s top White House adviser on energy and climate change, tells TIME.

The EPA’s site is now riddled with missing links, redirecting pages and buried information. Over the past year, terms like “fossil fuels”, “greenhouse gases” and “global warming” have been excised. Even the term “science” is no longer safe.

Christine Todd Whitman, the EPA Administrator under George W. Bush, says the overhaul is “to such an extreme degree that [it] undermines the credibility of the site”…

Of the more than 25,000 web pages tracked by the Environmental Data and Governance Initiative (EDGI) since Trump’s election, they say the EPA’s have been hit hardest. One section, which provided local communities with resources for combating climate change, disappeared for months only to resurface heavily redacted, including just 175 of its 380 pages.

via The EPA’s Website After a Year of Climate Change Censorship | Time.

2016 End of Term Web Archive is now available

The 2016 End of Term .gov/.mil web crawl is now available! We collected approximately 300TB of government websites, which includes over “70 million html pages, over 40 million PDFs and, towards the other end of the spectrum and for semantic web aficionados, 8 files of the text/turtle mime type,” as well as about 100TB of public data via .gov FTP file servers! Thanks to everyone who participated in the project and to the thousands(!) of seed nominators, both individuals and those who came in via DataRefuge and EDGI tools and public events.

The End of Term Web Archive contains federal government websites (.gov, .mil, etc) in the Legislative, Executive, or Judicial branches of the government. Websites that were at risk of changing (i.e., whitehouse.gov) or disappearing altogether during government transitions were captured. Local government websites, or any other site not part of the federal government domain were out of scope.

The mystery of the suspended U.S. government Twitter accounts

Here’s a strange story unfolding. Our End of Term project friend Justin Littman from George Washington University was doing some maintenance on the official US government Twitter accounts that had been captured for the End of Term crawl and noticed that a number of .gov Twitter accounts had been suspended. Account suspension happens when an account is sending spam or has been hacked or compromised in some way. I’ll let Justin explain below, but I’ll be really interested to find out how the folks running the U.S. Digital Registry are going to respond.

…When collecting a large number of Twitter accounts, the list of accounts requires occasional maintenance, as sometimes Twitter accounts are deleted or protected. It’s understandable how U.S. government accounts would be expected to change over time as agencies and initiatives change. However, when I was doing maintenance earlier today, I noticed something odd: a number of the accounts were suspended, not deleted or protected.

Curious, I exported the tweets from some of the suspended accounts. Really odd – the tweets were in Russian.

Then I checked back in the U.S. Digital Registry. The U.S. Digital Registry is supposed to be the authoritative list of the official U.S. government social media accounts…

…Still, there are some immediate take-aways:

  • While the U.S. Digital Registry is a very important service for promoting trust and transparency in the U.S. government and invaluable for those of us attempting to archive the web presence of the U.S. government, it desperately needs a scrubbing and quality control processes put into place.
  • The U.S. government needs to take full advantage of verified status on Twitter (i.e., the blue check), perhaps even requiring it.
  • Twitter needs to deal with the problem of recycled screen names. A person or organization should be able to delete an account without the fear of being impersonated. In particular, for organizations such as government agencies, this is critical.

via Suspended U.S. government Twitter accounts • Social Feed Manager.
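
The maintenance step Justin describes (separating routine churn such as deleted or protected accounts from the unexpected suspended ones) could be sketched roughly as follows. This is only an illustrative sketch, not his actual workflow or Social Feed Manager code: the record shape, the status labels, and the function name `triage_accounts` are all assumptions made for the example.

```python
from collections import defaultdict

# Assumed status labels; they mirror the cases named in the post
# (deleted, protected, suspended) plus a hypothetical "active".
ROUTINE_STATUSES = ("deleted", "protected")


def triage_accounts(accounts):
    """Sort account records into keep / drop / review buckets.

    `accounts` is an iterable of dicts with the hypothetical keys
    "screen_name" and "status".
    """
    by_status = defaultdict(list)
    for account in accounts:
        by_status[account["status"]].append(account["screen_name"])

    # Deleted or protected accounts are expected churn: drop them from the list.
    drop = sorted(
        name for status in ROUTINE_STATUSES for name in by_status.get(status, [])
    )
    # Suspended accounts are the odd case and deserve a manual look.
    review = sorted(by_status.get("suspended", []))
    keep = sorted(by_status.get("active", []))
    return {"keep": keep, "drop": drop, "review": review}


# Made-up sample records, for illustration only.
sample_accounts = [
    {"screen_name": "example_agency", "status": "active"},
    {"screen_name": "example_old_program", "status": "deleted"},
    {"screen_name": "example_private", "status": "protected"},
    {"screen_name": "example_recycled", "status": "suspended"},
]

result = triage_accounts(sample_accounts)
```

Keeping suspended accounts in a separate review bucket, rather than dropping them with the rest, is what would surface recycled screen names like the ones described above.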

These Advocates Want to Make Sure Our Data Doesn’t Disappear

Here’s another story about data rescue and the preservation of government information, this time from PC Magazine UK. Though the last Data Refuge event was in Denton, TX in May and the 2016 End of Term crawl has finished its collection work and will soon have its 200TB of data publicly accessible, there remains much interest, and no small amount of worry, about the collection and preservation of government information and data. And with stories continuing to come out (e.g., this one from the Guardian entitled “Another US agency deletes references to climate change on government website”) about US government agencies scrubbing or significantly altering their Websites, this issue will not be going away any time soon.

“Somewhere around 20 percent of government info is web-accessible,” said Jim (sic.) Jacobs, the Federal Government Information Librarian at Stanford University Library. “That’s a fairly large chunk of stuff that’s not available. Though agencies have their own wikis and content management systems, the only time you find out about some of it is if someone FOIAs it.”

To be sure, a great deal of information was indeed captured and now resides on non-government servers. Between Data Refuge events and projects such as the 2016 End-of-Term Crawl, over 200TB of government websites and data were archived. But rescue organizers began to realize that piecemeal efforts to make complete copies of terabytes of government agency science data could not realistically be sustained over the long term—it would be like bailing out the Titanic with a thimble.

So although Data Rescue Denton ended up being one of the final organized events of its kind, the collective effort has spurred a wider community to work in concert toward making more government data discoverable, understandable, and usable, Jacobs wrote in a blog post.

via Feature: These Advocates Want to Make Sure Our Data Doesn’t Disappear.

Tweets of Congress, tweets of Trump archived and downloadable in bulk

The recently launched Tweets Of Congress is collecting and publishing daily archives of tweets by congressional representatives, caucuses, and committees. The site only got up and running last week, so the daily archives start on June 21, 2017. There’s also the Trump Twitter Archive, which has collected more than 30,000 of @realDonaldTrump’s tweets and makes them searchable and downloadable in bulk.

But this points to a larger issue: the US government’s use of commercial social media sites and tools to communicate with the public. This time around, the 2016 End of Term crawl included 9,000+ social media accounts (scraped from the .gov social media registry API), of which 44% were Facebook, 37% Twitter, and 10% YouTube accounts. We also collected ~130 TB of .gov FTP sites that agencies use to serve out their collected data sets.

Tweets of Congress is my attempt to collate the entirety of Congress’ daily Twitter output using an automated process that checks Twitter on a fixed interval. Archives are available on this site and in JSON form. You can find JSON datasets linked in posts or in this site’s Github repo. Due to size constraints, archives will be limited at some tbd point. This site is open-source, so feel free to fork or whatever to your heart’s content. For any issues or other feedback, file an issue in the repo or send me an email.

via About – Tweets of Congress.
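
To make the idea of a fixed-interval check feeding daily JSON archives more concrete, here is a minimal sketch. It is not the Tweets of Congress code: the tweet record shape, the function name `add_poll_results`, and the sample data are assumptions, and the actual fetch from Twitter is left out entirely. Because consecutive polls can return the same tweet, the sketch skips ids it has already seen.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def add_poll_results(archive, seen_ids, tweets):
    """Merge one poll's tweets into a per-day archive, skipping repeats.

    Each tweet is a dict with the hypothetical keys "id", "screen_name",
    "text" and "created_at" (an ISO 8601 string with a UTC offset).
    """
    for tweet in tweets:
        if tweet["id"] in seen_ids:
            continue  # already archived by an earlier poll
        seen_ids.add(tweet["id"])
        created = datetime.fromisoformat(tweet["created_at"]).astimezone(timezone.utc)
        archive[created.date().isoformat()].append(tweet)
    return archive


# Made-up results from two consecutive polls; tweet 2 appears in both.
poll_1 = [
    {"id": 1, "screen_name": "example_rep_a", "text": "first",
     "created_at": "2017-06-21T14:05:00+00:00"},
    {"id": 2, "screen_name": "example_rep_b", "text": "second",
     "created_at": "2017-06-21T23:30:00+00:00"},
]
poll_2 = [
    {"id": 2, "screen_name": "example_rep_b", "text": "second",
     "created_at": "2017-06-21T23:30:00+00:00"},
    {"id": 3, "screen_name": "example_rep_a", "text": "third",
     "created_at": "2017-06-22T00:10:00+00:00"},
]

archive = defaultdict(list)
seen = set()
add_poll_results(archive, seen, poll_1)
add_poll_results(archive, seen, poll_2)

# One JSON document per day, ready to be written out as a daily archive.
daily_json = {day: json.dumps(items, indent=2) for day, items in archive.items()}
```

Bucketing by UTC date is a design choice made for this sketch; a real archive might prefer a local timezone for its day boundaries.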

HT Data Is Plural 2017.06.28 edition. Don’t forget to subscribe to Jeremy Singer-Vine’s Data Is Plural weekly newsletter!
