Tag Archives: Internet archive
Drop everything and watch this presentation from the 2017 Code4Lib conference, which took place in Los Angeles March 6-9, 2017. Heck, watch the entire proceedings, because there is a bunch of interesting and thoughtful stuff going on in the world of libraries and technology! But in particular, check out Matt Zumwalt’s presentation “How the distributed web could bring a new Golden Age for Libraries.” After submitting his talk, he changed the title to “Storing data together: the movement to decentralize data and how libraries can lead it” because of the DataRefuge movement.
Zumwalt (aka @FLyingZumwalt on Twitter) works at Protocol Labs, one of the primary developers of the InterPlanetary File System (IPFS). Grok their tagline: “HTTP is obsolete. It’s time for the distributed, permanent web!” He has spent much of his spare time over the last 9 months working with groups like EDGI, DataRefuge, and the Internet Archive to help preserve government datasets.
Here’s what Matt said in a nutshell: The Web is precarious. But using peer-to-peer distributed network architecture, we can “store data together”: we can collaboratively preserve and serve out government data. This resonates with me as an FDLP librarian. What if a network of FDLP libraries actually took this on? This isn’t some far-fetched, sci-fi idea. The technologies and infrastructures are already there. Over the last 9 months, researchers, faculty and public citizens around the country have already gotten on board with this idea. Libraries just have to get together and agree that it’s a good thing to collect/download, store, describe and serve out government information. Together we can do this!
Matt’s talk starts at 3:07:41 of the YouTube video below. Please watch it, let his ideas sink in, share it, start talking about it with your colleagues and administrators in your library, and get moving. Government information could be the great test case for the distributed web and a new Golden Age for Libraries!
This presentation will show how the worldwide surge of work on distributed technologies like the InterPlanetary File System (IPFS) opens the door to a flourishing of community-oriented librarianship in the digital age. The centralized internet, and the rise of cloud services, has forced libraries to act as information silos that compete with other silos to be the place where content and metadata get stored. We will look at how decentralized technologies allow libraries to break this pattern and resume their missions of providing discovery, access and preservation services on top of content that exists in multiple places.
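To make the idea of content that exists in multiple places a bit more concrete, here is a minimal Python sketch of content addressing, the general technique behind systems like IPFS. It is an illustration only: it uses a plain SHA-256 hex digest as the identifier rather than the actual IPFS content identifier format, and the `Peer` class and the `fetch` function are hypothetical names, not part of any IPFS library.

```python
import hashlib
from typing import Dict, List, Optional


def content_address(data: bytes) -> str:
    """Return a SHA-256 hex digest, used here as a stand-in content identifier."""
    return hashlib.sha256(data).hexdigest()


class Peer:
    """A hypothetical library node that stores copies keyed by content address."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        """Store the data and return the address derived from its bytes."""
        address = content_address(data)
        self.store[address] = data
        return address


def fetch(address: str, peers: List[Peer]) -> Optional[bytes]:
    """Return the first copy whose hash matches the requested address."""
    for peer in peers:
        data = peer.store.get(address)
        # Re-hash the copy so a corrupted or altered copy is rejected.
        if data is not None and content_address(data) == address:
            return data
    return None
```

Because the address is derived from the bytes themselves, it does not matter which library serves the copy, and anyone who receives it can verify that it has not been altered.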
This is an amazing offer from Brewster Kahle and the Internet Archive. Kahle just wrote a letter to the House Committee on the Judiciary’s Subcommittee on Courts, Intellectual Property and the Internet stating unequivocally that they will “archive and host — for free, forever, and without restriction on access to the public — all records contained in PACER.” The “Public Access to Court Electronic Records” or PACER system is the supposedly publicly accessible system of federal court records that charges exorbitant fees to download, for all intents and purposes blocking meaningful access to those records. But with this letter, the whole system could become actually accessible, for free and in perpetuity!
By this submission, the Internet Archive would like to clearly state to the Judiciary Committee, as well as to the Administrative Office of the U.S. Courts and the Judicial Conference of the United States, that we would be delighted to archive and host — for free, forever, and without restriction on access to the public — all records contained in PACER…
In order to recognize the vision of universal free access to public court records, the Federal Judiciary would essentially have to do nothing. We are experts at “crawling” online databases in an efficient and careful fashion that does not burden those systems. We are already able to comprehensively crawl PACER from a technical perspective, but the resulting fees would be astronomical. The Federal Judiciary has a Memorandum of Understanding with both the Executive Office for U.S. Trustees and with the Government Printing Office that gives each entity no-fee access for the public benefit. The collection we would provide to the public would be far more comprehensive than the GPO’s current court opinion program, although I must laud that program for providing a digitally-authenticated collection of many opinions.
By making federal judicial dockets available in this manner, the Federal Judiciary would enable free and unlimited public access to all records that exist in PACER, finally living up to the name of the program. In today’s world, public access means access on the Internet. Public access also means that people can work with big data without having to pass a cash register for each document.
The OpenGov Foundation just released their “Statement on Internet Archive Offer to Deliver Free and Perpetual Public Access to PACER” in which they said:
“The vital public information in PACER is the property of the American people. Public information, from laws to court records, should never be locked away behind paywalls, never be stashed behind arbitrary barriers and never be covered in artificial restrictions. Forcing Americans to pay hard-earned money to access public court records is no better than forcing them to pay a poll tax.
“The Internet Archive’s offer to archive and deliver unrestricted public access to PACER for free and forever is the best possible Valentine’s Day gift to the American people. The Internet Archive is proposing a cost-effective and innovative public-private partnership that will finally fix a clear injustice. There is no reason to do anything but accept this offer in a heartbeat.”
This just came through my Twitter feed from @MuckRock. Through a FOIA request which shook it loose from the notoriously difficult NSA, we now have access to NSA’s 2007 Untangling the Web: a guide to Internet research. It kind of reads like a Terry Pratchett novel if Terry was having a psychotic/psychedelic episode. As MuckRock notes, “you don’t have to go very far before this takes a hard turn into ‘Dungeons and Dragons campaign/Classics major’s undergraduate thesis’ territory.” Read on, you’ll thank me later!
And if you’re interested, I collected and cataloged a version for our library. The original NSA link to the document no longer resolves (and it was put up just last year!!), but there’s an archived copy in the Wayback Machine.
The NSA has a well-earned reputation for being one of the tougher agencies to get records out of, making those rare FOIA wins all the sweeter. In the case of Untangling the Web, the agency’s 2007 guide to internet research, the fact that the records in question just so happen to be absolutely insane are just icing on the cake – or as the guide would put it, “the nectar on the ambrosia.”
The End of Term 2016 collection is still going strong, and we continue to receive email from interested folks about how they can help. Much of the content for the EOT crawl has already been collected and some of it is publicly accessible already through our partners. Last month we posted about ways to help the collection process. At this point volunteers are encouraged to help check the archive to see if content has been archived (i.e., do quality assurance (QA) for the crawls).
Here’s how you can help us assure that we’ve collected and archived as thoroughly and completely as possible:
Step 1: Check the Wayback Machine
Search the Internet Archive to see if the URL has already been captured. Please note this is not a specific End of Term collection search and does not include ALL content archived by the End of Term partners, but will be helpful in identifying whether something has been preserved already.
You may type in specific URLs or domains or subdomains, or try a simple keyword search (in Beta!).
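If you have more than a handful of URLs to check, the lookup in Step 1 can be scripted. The sketch below is a minimal Python example, assuming the Internet Archive’s public availability endpoint at `https://archive.org/wayback/available` and a JSON reply with an `archived_snapshots.closest` entry; please confirm both against the Internet Archive’s own API documentation before relying on it. The function names are made up for illustration. As noted above, this checks the Wayback Machine in general, not the End of Term collection specifically.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(target: str) -> str:
    """Build the lookup URL for one target URL or domain."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": target})


def closest_snapshot(response: dict) -> Optional[dict]:
    """Pull the closest available capture out of a parsed reply, if any."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest
    return None


def check_wayback(target: str, timeout: float = 10.0) -> Optional[dict]:
    """Ask the Wayback Machine whether it holds a capture of the target."""
    with urlopen(availability_url(target), timeout=timeout) as reply:
        return closest_snapshot(json.load(reply))
```

Calling `check_wayback` with a URL you care about returns either the closest capture record or `None`. A `None` result only means the Wayback Machine has no capture it can report; one of the End of Term partners may still hold the content, so continue with the steps below.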
1a: Help Perform Quality Assurance
If you do find a site or URL you were looking for, please check whether it was captured completely. A simple way to do this is to click around the archived page: try the navigation, links on the page, images, etc. We need help identifying parts of the sites that the crawlers might have missed, for instance specific documents or pages that you are looking for but that we haven’t archived. Please note that crawlers are not perfect and cannot archive some content. IA has a good FAQ about the challenges crawlers face.
If you do discover something is missing, you can still nominate pages or documents for archiving using the link in step 3 below.
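For pages with many links, it can help to build a checklist first. The following is a rough Python sketch, not an End of Term tool: it pulls the link and image addresses out of a page’s HTML so that each one can be looked up in the Wayback Machine. The names `LinkCollector` and `links_to_check` are hypothetical, and it only looks at `a` and `img` tags, so scripts, stylesheets and content loaded dynamically are not covered.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) addresses from <a href> and <img src>."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.page_url = urldefrag(base_url)[0]
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name != wanted or not value:
                continue
            # Resolve relative addresses and drop "#fragment" parts.
            absolute = urldefrag(urljoin(self.base_url, value))[0]
            if not absolute.startswith(("http://", "https://")):
                continue  # skip mailto:, javascript:, and similar
            if absolute == self.page_url or absolute in self.links:
                continue  # skip the page itself and duplicates
            self.links.append(absolute)


def links_to_check(html: str, base_url: str) -> List[str]:
    """Return the addresses on a page that are worth checking in the archive."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Each address in the returned list can then be searched for in the Wayback Machine as in Step 1, and anything missing can be nominated.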
Step 2: Check the Nomination Tool
Check the Nomination Tool to see if the URL or site has been nominated already. There are a few ways to do this:
- View all reports here
- Check the list of everything nominated here, or search here.
- You can also check our bulk lists here
Step 3: Nominate It!
If you don’t see the URL you were looking for in any of those searches, please nominate it here.
Questions? Please contact the End of Term project at eot-info AT archive DOT org.
I was honored last week to be part of a panel hosted by OpenTheGovernment and the Bauman Foundation to talk about the End of Term project. Other presenters included Jess Kutch at Coworker.org and Micah Altman, Director of Research at MIT Libraries. I talked about what EOT is doing, as well as some of the other great projects, including Climate Mirror, Data Refuge and the Azimuth backup project, working in concert/parallel to preserve federal climate and environmental data.
I thought the Q&A segment was especially interesting because it raised and answered some of the common questions and concerns that EOT receives on a regular basis. I also learned about a cool project called Violation Tracker, a search engine on corporate misconduct. And I was also able to talk a bit about what the needs are going forward, including the idea of “Information Management Plans” for agencies, similar to the idea of “Data Management Plans” for all federally funded research. I was heartened to know that there is interest in that as a wider policy advocacy effort!
The full recorded meeting can be viewed here from Bauman’s Adobe Connect account.
Here’s more information on the EOT crawl and how you can help.
Coalitions of government, university, and public interest organizations have been working to ensure as much information as possible is preserved and accessible, amid growing concern that important and sensitive government data on climate, labor, and other issues may disappear from the web once the Trump Administration takes office.
Last Thursday, OTG and the Bauman Foundation hosted a meeting of advocates interested in preserving access to government data, and individuals involved in web harvesting efforts. James Jacobs, a government information librarian at Stanford University Library who is working on the End of Term (EOT) web harvest – a joint project between the Internet Archive, the Library of Congress, the Government Publishing Office, and several universities – spoke about the EOT crawl, and explained the various targets of the harvest, including all .gov and .mil web sites, government social media accounts, and more.
Jess Kutch discussed efforts by Coworker.org with Cornell University to preserve information related to workers’ rights and labor protections, and other meeting attendees presented some of their own projects as well. Philip Mattera explained how Good Jobs First is using its Violation Tracker database to scrape and preserve government source material related to corporate misconduct.
Micah Altman, Director of Research at MIT Libraries, presented on the need for libraries and archives to build better infrastructure for the EOT harvest and other projects – including data portals, cloud infrastructure, and technologies that enhance discoverability – so that data and other government information can be made more easily accessible to the public.