Tag Archives: Web harvesting
If you’ve been waiting for your chance to make history: now’s the time!
Please join us for the FGI virtual End of Term Project Web archiving nomination sprint on Wednesday 11 January 2017 from 9AM – 11AM Pacific / 12 noon – 2PM EST. During that time, we’ll set up a virtual conference room and give a brief presentation on the End of Term crawl and the ins and outs of nominating seeds. Volunteers will then be on hand to answer your questions, suggest agencies for deep exploration, and take information about databases and other resources that are tricky to capture with traditional web archiving. RSVP TODAY!
If you’re new to the End of Term Project, it’s a collaborative project to collect and preserve public United States government web sites prior to the end of the current presidential administration on January 20, 2017. Working together, the Library of Congress, California Digital Library, University of North Texas Libraries, Internet Archive, George Washington University Libraries, Stanford University Libraries, and the U.S. Government Publishing Office (GPO) are conducting a broad, comprehensive Web harvest of the .gov/.mil domain based on prioritized lists of URLs, including social media. As it did in 2008 and 2012 (previous harvests are accessible here), the project’s goal is to document federal agencies’ presence on the World Wide Web during the transition of Presidential administrations, to enhance the existing archival Internet collections, and to give the public access to archived digital government information.
This sprint to nominate seeds is a big part of making it happen! Hundreds of volunteers and institutions are already involved in the effort. We hope you’ll join the conversation and the fun. There may even be a few (completely non-monetary) prizes for top contributors.
You can pre-register here. We’ll contact you as the date gets closer with access information for the virtual conference.
The final deadline to nominate URLs prior to Inauguration Day is Friday, January 13th, so even if you can’t sprint with us, keep the nominations coming! Questions? Email us at admin AT freegovinfo DOT com.
[Editor’s note: Updated 12/15/16 to include updated email address for End-of-Term project queries (eot-info AT archive DOT org), and information about robots.txt (#1 below) and databases and their underlying data (#5 below). Also updated 12/22/16 with note about duplication of efforts and how to dive deeply into an agency’s domain at the bottom of #1 section. jrj]
Here at FGI, we’ve been tracking the disappearance of government information for quite some time (and librarians have been doing it for longer than we have; see ALA’s long-running series “Less Access to Less Information By and About the U.S. Government,” published from 1981 until 1998). We’ve recently written about the targeting of NASA’s climate research site and the Department of Energy’s carbon dioxide analysis center for closure.
But ever since the NY Times last week ran the story “Harvesting Government History, One Web Page at a Time”, there has been renewed worry and interest from the library and scientific communities, as well as the public, in archiving government information. There’s also been increased interest in the End of Term (EOT) crawl project. Though there’s increased worry about the loss of government information with the incoming Trump administration, it’s important to note that the End of Term crawl has been going on since 2008, under both Republican and Democratic administrations, and will go on past 2016. EOT is working to capture as much of the .gov/.mil domains as we can, and we’re also casting our net to harvest social media content and government information hosted on non-.gov domains (e.g., the St Louis Federal Reserve Bank at www.stlouisfed.org). We’re running several big crawls right now (you can see all of the seeds we have here as well as all of the seeds that have been nominated so far) and will continue to run crawls up to and after the Inauguration. We strongly encourage the public to nominate seeds of government sites so that we can be as thorough in our crawling as possible.
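If you’re pulling together a list of candidate seeds before nominating them, it can help to separate the sites that already fall under .gov/.mil from government information hosted elsewhere, since the latter is easier to miss in a domain-wide crawl. Here is a minimal Python sketch of that sorting step. The function names and the sample list are our own illustration, not part of the EOT nomination tool, and the check only looks at the host name.

```python
from urllib.parse import urlsplit


def is_gov_or_mil(url: str) -> bool:
    """Return True when the URL's host name ends in .gov or .mil."""
    host = (urlsplit(url).hostname or "").lower()
    return host.endswith((".gov", ".mil"))


def split_seeds(urls):
    """Sort candidate seeds into .gov/.mil hosts and everything else.

    Hypothetical helper for illustration only; the "other" list is where
    government information on non-.gov domains would end up.
    """
    gov_mil, other = [], []
    for url in urls:
        (gov_mil if is_gov_or_mil(url) else other).append(url)
    return gov_mil, other


# Sample candidates (illustrative); the second is the non-.gov example above.
candidates = [
    "https://www.nasa.gov/",
    "https://www.stlouisfed.org/",
]
gov_mil, other = split_seeds(candidates)
print("On .gov/.mil:", gov_mil)
print("Hosted elsewhere:", other)
```

Anything that lands in the second list is worth a second look and an explicit nomination, since a host-name test like this cannot tell whether a non-.gov site actually carries government information.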
Interesting and informative post by our friend, Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress:
- Introducing the Federal Web Archiving Working Group, by Michael Neubert, The Signal digital preservation blog of the Library of Congress (February 23, 2015).
Michael comments on the huge amount of born-digital federal government information that is being lost every day:
Today most information that federal government agencies produce is created in electronic format and disseminated over the World Wide Web. Few federal agencies have any legal obligation to preserve web content that they produce long-term and few deposit such content with the Government Publishing Office or the National Archives and Records Administration–such materials are vulnerable to being lost.
He goes on to describe how staff of the GPO, NARA, and the Library of Congress are now meeting monthly to discuss their web harvesting projects.
Managers and staff involved in web archiving from these three agencies have now met five times and have plans to continue meeting on a monthly basis during the remainder of 2015. At the most recent meeting we added a representative from the National Library of Medicine. So far we have been learning about what each of the agencies is doing with harvesting and providing access to federal web sites and why–whether it is the result of a legal mandate or because of other collection development policies. We expect to involve representatives of other federal agencies as seems appropriate over time.
They hope to develop “a shared collective development strategy, if only informally.”
The New Yorker has an interesting piece by Jill Lepore, “What the Web Said Yesterday,” which explores the issue of internet preservation and highlights the important work being done by the Internet Archive. Vint Cerf, Chief Internet Evangelist at Google, says we need “digital vellum” or the “twenty-first century will become an informational black hole.”
The International Internet Preservation Consortium is working in this space (in fact, the 2015 IIPC general assembly will be held at Stanford University!), but I believe there’s a need for more libraries, and especially more subject librarians, to be working in this space. That’s why I announced a virtual discussion session on digital collection development for government information librarians on Wednesday January 21st at 9am PST / 12 noon EST. I’ll be on IRC (irc.freenode.net) in the #FDLP channel. I hope you’ll pop in to discuss how to do digital collection development, fugitive hunting, web harvesting, etc.
The average life of a Web page is about a hundred days. Strelkov’s “We just downed a plane” post lasted barely two hours. It might seem, and it often feels, as though stuff on the Web lasts forever, for better and frequently for worse: the embarrassing photograph, the regretted blog (more usually regrettable not in the way the slaughter of civilians is regrettable but in the way that bad hair is regrettable). No one believes any longer, if anyone ever did, that “if it’s on the Web it must be true,” but a lot of people do believe that if it’s on the Web it will stay on the Web. Chances are, though, that it actually won’t. In 2006, David Cameron gave a speech in which he said that Google was democratizing the world, because “making more information available to more people” was providing “the power for anyone to hold to account those who in the past might have had a monopoly of power.” Seven years later, Britain’s Conservative Party scrubbed from its Web site ten years’ worth of Tory speeches, including that one. Last year, BuzzFeed deleted more than four thousand of its staff writers’ early posts, apparently because, as time passed, they looked stupider and stupider. Social media, public records, junk: in the end, everything goes.
Web pages don’t have to be deliberately deleted to disappear. Sites hosted by corporations tend to die with their hosts. When MySpace, GeoCities, and Friendster were reconfigured or sold, millions of accounts vanished. (Some of those companies may have notified users, but Jason Scott, who started an outfit called Archive Team—its motto is “We are going to rescue your shit”—says that such notification is usually purely notional: “They were sending e-mail to dead e-mail addresses, saying, ‘Hello, Arthur Dent, your house is going to be crushed.’ ”) Facebook has been around for only a decade; it won’t be around forever. Twitter is a rare case: it has arranged to archive all of its tweets at the Library of Congress. In 2010, after the announcement, Andy Borowitz tweeted, “Library of Congress to acquire entire Twitter archive—will rename itself Museum of Crap.” Not long after that, Borowitz abandoned that Twitter account. You might, one day, be able to find his old tweets at the Library of Congress, but not anytime soon: the Twitter Archive is not yet open for research. Meanwhile, on the Web, if you click on a link to Borowitz’s tweet about the Museum of Crap, you get this message: “Sorry, that page doesn’t exist!”