Internet archive

Lunchtime listen: not your grandfather's web anymore

Not Your Grandfather's Web Any More, a project briefing from the Coalition for Networked Information (CNI) spring 2013 member meeting by David S.H. Rosenthal of LOCKSS and Kris Carpenter Negulescu of the Internet Archive, is now available on CNI's video channels:

YouTube: http://youtu.be/uIqU2Cr2Kjs
Vimeo: http://vimeo.com/66175352

What are the practical and theoretical archiving problems posed by the newer parts of the Web, like social media, scientific workflows and Web services? How can the challenges of these latest developments be met, if at all? This presentation reports on the results of a workshop held at the Library of Congress under the auspices of the International Internet Preservation Consortium, where practitioners of Web archiving reviewed these questions. More information about this talk, including presentation slides, is available on the CNI site.

“Let a thousand Jon Stewarts bloom.” Internet Archive launches TV broadcast news search and borrow

Congratulations to the Internet Archive for launching its TV News Search & Borrow service. This is truly an amazing endeavor with a growing collection of "every morsel of news produced in the last three years by 20 different channels, encompassing more than 1,000 news series that have generated more than 350,000 separate programs devoted to news." Search results will be in 30-second clips, and if someone wants a copy of the entire program, a DVD will be sent on loan. NY Times has more. I'm sending it in to be cataloged right now!!

Today the Internet Archive launches TV News Search & Borrow. This service is designed to help engaged citizens better understand the issues and candidates in the 2012 U.S. elections by allowing them to search closed captioning transcripts to borrow relevant television news programs.

The Internet Archive works to preserve the published works of humankind. Inspired by Vanderbilt University’s Television News Archive project, the Internet Archive collects and preserves television news. Like library collections of books and newspapers, this accessible archive of TV news enables anyone to reference and compare statements from this influential medium.

The collection now contains 350,000 news programs collected over 3 years from national U.S. networks and stations in San Francisco and Washington D.C. The archive is updated with new broadcasts 24 hours after they are aired. Older materials are also being added.

Gary's Thursday Roundup: NLRB, Internet Archive, Ancestry.com, U.S. Census, and Much More (17 Items)

Hello From DC (I mean Shakeytown, it Was My First Quake) Everyone.

As we prepare for our next event around here and elsewhere along the east coast, I thought it might be a good time to share a mountain of news, new resources, and other goodies with all of you.

The material comes from posts Shirl Kennedy and I made to our INFOdocket.com site. This is just a small amount of what we post seven days a week. Plus, we also provide FullTextReports.com. New reports are listed in the left rail. (Thanks Jim and James.)

We both hope you find an item or two of interest in the following update. More very soon. (-:

1. Hurricane Irene: FEMA’s National Situation Daily Update Available Online & Natl. Hurricane Center Mobile Resources

2. New Web Site: Feds Launch Performance.gov, Now Publicly Accessible

3. Acquisitions: Bloomberg is Buying BNA for $990 Million

4. US Department of Labor Improves Enforcement Databases Including Visualization/Animation Tools

5. U.S. History: “Rare Footage Unearthed Online”

6. New From the Internet Archive: “Understanding 9/11: A Television News Archive”

7. “Google Forfeits $500 Million Generated by Online Ads & Prescription Drug Sales by Canadian Online Pharmacies”
The full text of the statement from the USDOJ and FDA

8. Washington Post Op/Ed: “Don’t Kill America’s Databook” (U.S. Census Statistical Abstract)

9. NLRB — Acting General Counsel Releases Report on Social Media Cases

10. Back to School 2011-2012: Facts About Schools, Students and Teachers From the U.S. Census

11. 1940 U.S. Census to be Free on Ancestry.com

12. Government Information: GPO Releases API For FederalRegister.gov (Formal Announcement)

13. Teen Dating Violence: A Literature Review and Annotated Bibliography
From the Federal Research Division of the Library of Congress

14. Update: More Digitized Historic U.S. Government Economic and Banking Documents and Reports via FRASER

15. A Look at a Few Resources Using U.S. Department of Agriculture Open Data

16. Cook County, IL: New online database lets anyone see who has outstanding warrants

17. Federal Agencies Take Action to Digitally Document Nearly 50 Endangered Languages

Status of the Wayback Machine

Roy updates us on the status of the Wayback Machine with an example from the White House:

  • Back to the Wayback Machine, Roy Tennant, Library Journal (May 18th, 2011).

    But that means that any claims to be "archiving the web" should be taken with a grain of salt. Maybe say "archiving the parts of the web that matter" or "ignoring what doesn’t matter so much".

And, don't forget Archive-It, the web archiving service from Internet Archive.

Through a user-friendly web interface, Archive-It partners can catalog, manage, and browse their archived collections using web archiving tools developed at the Internet Archive. Collections are hosted at the Internet Archive data center and are accessible to the public, including full-text search.

book scan wizard + internet archive = DIY public domain digital book repository

At the last couple of depository library council meetings, I've heard comments from documents librarians -- especially from librarians at smaller institutions -- that they'd love to participate in the digitization process of historic government documents, but for various reasons (lack of $$, staffing, time, technical infrastructure etc) could not undertake large scale digitization projects.

Now there's a way for lots of libraries to chip in on the greater goal of increased access to historic government documents with very little $$ or infrastructure. We've previously mentioned BookLiberator and DIYbookscanner, two projects working on low-cost hardware solutions for digitizing books using off-the-shelf digital cameras and free open source software called Book Scan Wizard.

But there were still 2 pieces missing to make the whole workflow run smoothly for libraries and government documents collections of all sizes. The third piece to the puzzle just became a reality with yesterday's announcement that Book Scan Wizard had teamed up with the Internet Archive to provide automatic uploads of scans to the Internet Archive (directions and more information here). Hardware: check. Software: check. Digital infrastructure: check.

With the new version of Book Scan Wizard, or even through just uploading directly to the Internet Archive, any PDF composed of images of book pages or organized zip file filled with images of book pages will be automatically processed. The Internet Archive’s servers will then automatically perform optical character recognition (OCR) on the book and make a pdf, epub, kindle (mobi), daisy, djvu, and text file copy of the entire book available for download by anyone, anywhere. You can see a sample book from this process to get a better idea. All this happens within a few hours of the book being uploaded and then anyone can download it. This is free OCR for anyone in the world.
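For anyone assembling the "organized zip file" by hand rather than through Book Scan Wizard, here is a minimal sketch of one way to do it. It assumes that "organized" simply means the page images sort into reading order by filename; the function name `bundle_pages`, the `page_0001` naming pattern, and the list of image extensions are all hypothetical choices for illustration, not requirements documented by the Internet Archive.

```python
import zipfile
from pathlib import Path

# Assumption: these are the image types coming off the camera or scanner.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def bundle_pages(scan_dir, zip_path, prefix="page"):
    """Zip the page images in scan_dir in sorted order.

    Each image is stored under a zero-padded sequential name
    (page_0001.jpg, page_0002.jpg, ...) so that the archive's
    filename order matches the reading order. Returns the list
    of names written to the archive.
    """
    pages = sorted(
        p for p in Path(scan_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    if not pages:
        raise ValueError(f"no page images found in {scan_dir}")

    names = []
    # Page images are already compressed, so store them as-is.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for number, page in enumerate(pages, start=1):
            name = f"{prefix}_{number:04d}{page.suffix.lower()}"
            archive.write(page, arcname=name)
            names.append(name)
    return names
```

Calling `bundle_pages("scans/my_document", "my_document.zip")` would produce a single file ready to upload. The renaming step matters only if your camera's filenames don't already sort in page order; check the upload directions linked above for whatever the Archive actually expects.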

Now there's one last piece needed: Scan on demand. This idea has already been put into practice by the Internet Archive's Open Library and their partnership with the Boston Public Library. What we need is to open up the Catalog of Government Publications (CGP) -- which will soon include over 1 million records from GPO's historic shelflist spanning 1870s - 1992 -- similar to the way the BPL's scan on demand project (now retired it seems) allowed users to request a scan of a public domain book directly from the Open Library catalog.

GPO could manage this scan on demand process -- or allow libraries to pick and choose documents from the CGP -- connect the bibliographic metadata from the historic shelflist, and upload to both the Internet Archive and FDsys. The circle is complete. Am I missing anything? Would love to hear readers' thoughts.

Holiday gift idea: a piece of the public domain

Carl Malamud's FedFlix project is a joint venture with the National Technical Information Service (NTIS) whereby he takes NTIS videos, digitizes them and uploads them to the Internet Archive.

Well now he's expanding FedFlix to include public domain videos from the National Archives. He's released 41 videos into the public domain in this way, but has put together an Amazon Wish List in order to expand public access to public domain video content from the National Archives. If you see anything you'd like to buy for the public domain, they'll take your DVD and upload the video to YouTube, the Internet Archive, and to public.resource.org's own rsync/ftp public domain stock footage library. So why not add a gift of the public domain to your favorite person's/people's stockings this year? We'll all be glad we did!

UPDATE 12/25/09: The wish list has been fulfilled. You can watch all of the donated NARA videos on YouTube, Internet Archive, or public.resource.org's bulk server. Thanks Carl!

[HT BoingBoing!]

Tools of Change: BookServer

This is a presentation that is definitely worth seeing!

  • Web of Books, Peter Brantley, Presentation at Tools of Change Frankfurt introducing the BookServer architecture. Brief description of history, motivations, and technical outline.

    "Creating a new architecture using common, open standards that permits people to find, buy, acquire, and read books from any source, on any device, using many different ebook applications."

Internet Archive proposal for mass digitization

I had known that the Internet Archive had submitted a response to the GPO's RFP for mass digitization. A friend just sent me the link to the proposal submitted to GPO (embedded below and here's the link to the proposal and supporting documents).

As you can probably guess, we've been pulling for the Archive to get the bid, not least because the Archive is a 501(c)(3) non-profit library and we've stated on more than one occasion that privatization of public domain government information is a very bad idea. But we've also been heartened by the quality of the Archive's scans to date, and by their openness and willingness to collaborate on their processes and on data access and sharing. Those qualities certainly come through in their proposal for mass digitization -- not to mention the fact that they've actually made their proposal public!

While the award has not been officially announced, we really hope that the Archive wins the award. Perhaps GPO will name them as an official depository library and work with them not only on the "legacy" collection (there needs to be a better description of the deep and rich collections of depository libraries than the somewhat pejorative "legacy" :-| ) but on digital deposit of government documents going forward.

--that is all.


Brewster Kahle on Google Book Digitization and the Future of Libraries

Of all the things I have read about the Google book digitization project and its consequences, this is one of the best. Listen to the interview (Lunchtime Listen!) or read the transcript.

This is relevant to government documents since so many are in the project. The way they are treated and controlled by Google and Google's contracts and licenses and agreements will have lasting impact on long-term, free, public access.

Kahle highlights two things that, for me, are very important. First, at least some of the participating libraries are relying solely on Google and its restrictions and are not even getting digital copies from Google although they could.

BREWSTER KAHLE: Let's take the out of copyright, the stuff that's really--it's public domain, meaning belongs to the public. It's lived long enough to become part of the public sphere. But there are perpetual restrictions that the libraries must perform, that if they get these digital copies back, they must put up restrictions on use, such that they cannot be accessible by the general public.

AMY GOODMAN: Who can they be accessed by?

BREWSTER KAHLE: People on campus can use them, for the out-of-copyright works, but just on campus. And otherwise, they have to put up restrictions. And what's turning out is a lot of these libraries aren't even bothering to get copies back, because what can they use them for? I mean, in the future, people are going to want to have access to as many books as possible. And what Google is doing is pulling these together for many libraries to build a great collection. Terrific. But the bits and pieces that are going back to these libraries don't make up a great collection. And what they can do with them is very, very limited. So these libraries aren't, in many cases, even bothering to get the digital copies back.

Second, when Kahle asked if Google would share copies of digitized books with the Internet Archive, Google refused.

AMY GOODMAN: Conceivably, Google could give you the digitized copies, is that right?

BREWSTER KAHLE: Yes, Google could, but they have refused.

AMY GOODMAN: Why?

BREWSTER KAHLE: They say that they've paid for the work. They want to be the place that people go to get them. So they are going to be the proprietors of the public domain.

Although Google claims its mission is "to organize the world's information and make it universally accessible and useful," it would be more accurate to say its mission is to make money controlling the world's information.

Economist Interview with Brewster Kahle of Internet Archive

The Economist has an online article "The Internet's Librarian" that is also in the March 5th, 2009 print edition.

...the founder of the Internet Archive explains what has driven him for more than a decade. “We are trying to build Alexandria 2.0,” says Mr Kahle with a wide-eyed, boyish grin. Sure, and plenty of people are trying to abolish hunger, too.

It would be easy to dismiss Mr Kahle as an idealistic fruitcake, but for one thing: he has an impressive record when it comes to setting lofty goals and then lining up the people and technology needed to get the job done. “Brewster is a visionary who looks at things differently,” says Carole Moore, chief librarian at the University of Toronto. “He is able to imagine doing things that everyone else thinks are impossible. But then he does them.”

This is probably my favorite quote:

“Come back when you have a warrant,” reads the floor mat underneath his office recliner. It was a gift from the Electronic Frontier Foundation (an activist group on whose board Mr Kahle sits) after Mr Kahle refused to hand over information about one of the Internet Archive’s users to the Federal Bureau of Investigation in 2007.

I only wish more interviews with Brewster would discuss the plethora of government documents that are in Internet Archive. It's a valuable resource and it keeps growing!
