digital preservation

MetaArchive publishes guide to distributed digital preservation

Please check out the new book published by the MetaArchive Cooperative called A Guide to Distributed Digital Preservation. It's both timely and handy.

[Full disclosure: the book is primarily about LOCKSS and specifically mentions the project that I'm working on, LOCKSS-USDOCS. FGI and I receive no compensation from the sales of the book.]

Announcement: publication of A Guide to Distributed Digital Preservation

Authored by members of the MetaArchive Cooperative, A Guide to Distributed Digital Preservation is the first of a series of volumes from the Educopia Institute describing successful collaborative strategies and articulating specific new models that may help cultural memory organizations work together for their mutual benefit.

This volume is devoted to the broad topic of distributed digital preservation, a still-emerging field of practice for the cultural memory arena. Replication and distribution hold out the promise of indefinite preservation of materials without degradation, but establishing effective organizational and technical processes to enable this form of digital preservation is daunting. Institutions need practical examples of how this task can be accomplished in manageable, low-cost ways.

This guide is written with a broad audience in mind that includes librarians, archivists, scholars, curators, technologists, lawyers, and administrators. Readers may use this guide to gain both a philosophical and practical understanding of the emerging field of distributed digital preservation, including how to establish or join a network.

Readers may access A Guide to Distributed Digital Preservation as a freely downloadable pdf and/or as a print publication for purchase. Please visit http://www.metaarchive.org/GDDP to download or order the book.

******

The MetaArchive Cooperative provides low-cost, high-impact preservation services to help ensure the long-term accessibility of the digital assets of universities, libraries, museums, and other cultural memory organizations. In addition to preserving members' digital content in a distributed digital preservation network, the Cooperative also offers consulting and education services to institutions that seek training in digital preservation planning, policy creation, and implementation, including setting up and running Private LOCKSS Networks (http://www.lockss.org).

For more information, please contact Program Manager Katherine Skinner (katherine.skinner@metaarchive.org).

Lunchtime Listen: How are we ensuring the longevity of digital documents?

Please check out the spring 2009 plenary at Coalition for Networked Information (CNI) by David Rosenthal, chief scientist of the LOCKSS program. He presents a "contrarian view" of digital preservation. The issues he raises are definitely important to think about for those of us working to preserve digital govt information/documents for the long term.

How Are We Ensuring the Longevity of Digital Documents? from CNI Video Editor on Vimeo.

Lost Conversations, Lost Decisions, Lost History

Oliver Bell summarizes the issue of new forms of government communication and the need for new ways of preserving them:

Work needs to begin on archiving standards that will retain the information that is driving decisions today and as technology plays an increasingly larger role in the business of government archiving standards needs to be a core part of systems design, not a problem that we try and solve after the fact.

One important element of this issue is Title 44 of the US Code that defines what GPO and the FDLP can handle. Its definitions limit what we can archive within those boundaries of the FDLP. But... if we had digital deposit, GPO could deposit official Title-44-approved content in FDLP digital libraries and those libraries could combine that content with non-Title-44 (Gov-2.0) content. GPO can't do this because of the limits of Title 44, but individual FDLP libraries have the flexibility to build their own collections combining Title 44 content with other content. We can do this today without changing the law. But we need digital deposit to make those collections rich and useful.

Do zombies care about digital preservation?

I guess this is the day of big thoughts :-) “How secure is our civilization’s accumulated knowledge?” This is the question mulled over by Richard Heinberg in his essay Our Evanescent Culture and the Awesome Duty of Librarians. “Ultimately,” Heinberg writes,

the entire project of digitized cultural preservation depends on one thing: electricity. As soon as the power goes off, access to the Internet goes down. CDs and DVDs become meaningless plastic disks; e-books become inscrutable and useless; digital archives become as illegible as cuneiform tablets — or more so. Altogether, digitization represents a huge bet on society’s ability to keep the lights on forever . . . . It’s ironic to think that the cave paintings of Lascaux may be far more durable than the photos from the Hubble space telescope. Altogether, if the lights were to go out now, in just a century or two the vast majority of our recently recorded knowledge would be gone or inaccessible.

This isn't just an idle thought experiment, with GPO, FDLP libraries, Google, Internet Archive, etc. actively engaged in and/or planning for massive digitization of the historic record of the US govt.

Matt Cardin of The Teeming Brain expounds on Heinberg's essay and makes some good points in "Zombies, Digital Media, and Cultural Preservation in the New Dark Age".

[Thanks for the tweet about this, ArchivesNext!]

What is wrong with this picture?

The Executive Office of the President is looking for a private company to archive some of its presidential records. The records in question are those published by EOP on publicly-accessible web sites including social networking sites like Facebook and Twitter.

Does this strike anyone else as indicative of a major problem with how the government is approaching records retention and preservation? Will they just call it "outsourcing" and hope that magic word makes all the questions go away?

Wouldn't it be easier, more comprehensive, and more authoritative for the government to capture its own records on the way to social networking sites? Its workflow of a) posting to commercial services, b) paying contractors to extract content from those servers, and c) reformatting it for preservation sounds like very bad planning to me.

Surely we need governments to start taking their own communications seriously and create preservable records rather than try to re-create records in order to preserve them. The idea of creating hard-to-preserve records using commercial services and having to go through extra steps to re-capture and re-format the information is doing two things wrong instead of one thing right.

By the way, the solicitation also asks for capturing "information posted by non-EOP persons on publicly-accessible web sites where the EOP maintains a presence both comments posted on pages created by EOP and messages sent to EOP accounts on those web sites" which brings up a different set of questions. Some people are upset by the idea of capturing the public's publicly-posted comments. I don't really follow that train of reasoning. The comments are voluntarily posted as public comments, but the government shouldn't save them? Archivists are faced with big challenges! (See also: Sustaining the Future of Archives. Hat tip to George!)

Internet Archive proposal for mass digitization

I had known that the Internet Archive had submitted a response to the GPO's RFP for mass digitization. A friend just sent me the link to the proposal submitted to GPO (embedded below and here's the link to the proposal and supporting documents).

As you can probably guess, we've been pulling for the Archive to get the bid, not least because the Archive is a 501(c)(3) non-profit library and we've stated on more than one occasion that privatization of public domain government information is a very bad idea. But also, we've been heartened by the quality of the Archive's scans to date, and by their openness and willingness to be collaborative in their processes and data access and sharing. Those qualities certainly come through in their proposal for mass digitization -- not to mention the fact that they've actually made their proposal public!

While the award has not been officially announced, we really hope that the Archive wins the award. Perhaps GPO will name them as an official depository library and work with them not only on the "legacy" collection (there needs to be a better description of the deep and rich collections of depository libraries than the somewhat pejorative "legacy" :-| ) but on digital deposit of government documents going forward.

--that is all.


Recent report on Link Rot

A recent report evaluating a two-year web harvesting project found that 14.3 percent of the original URLs of all titles harvested from the Web and archived during the first year of the project had become inactive within at least one year of harvesting. (p.33)

The report is an evaluation of the Legal Information Archive of The Chesapeake Project, which was designed to preserve born-digital legal information published directly to the Web. The project was implemented in early 2007 by the Georgetown Law Library and the State Law Libraries of Maryland and Virginia.

The report notes that more than 95 percent of the titles in the sample were PDF files. Of these titles, 8.2 percent were found to have inactive original URLs in 2008 and 14.1 percent in 2009. (p.35)

Ten percent of government (.gov) URLs became inactive in the first year and an additional three percent became inactive in the second year. (p.34)

The report concludes:

More than 4,300 digital items, representing nearly 1,900 titles, have been harvested from the Web and archived, and roughly 14 percent of these titles have already been removed from their original locations on the Web, demonstrating the importance and effectiveness of the project’s efforts. Moreover, the project’s access figures demonstrate both the broad, international reach of the project’s efforts, as well as the successful selection of high-interest and high-use materials by project participants.

Alaska State Library Archiving Governor Palin’s Resignation Announcement and End of Term Website

Alaska Governor Sarah Palin’s resignation announcement earlier this month and the transition of power to Lieutenant Governor Sean Parnell gave the Alaska State Library a great chance to preserve this "at risk" content. 

Using Archive-It and the manual "start on demand" feature inside the web application, the Alaska State Library crawled Governor Palin and Lt. Governor Parnell's web sites on the eve of the transition of power and was able to capture valuable information that is now offline and no longer accessible.

The Alaska State Library’s Alaska Governor/Lt. Governor Web Sites collection was originally conceived to archive these government websites over time. Once Sarah Palin left office, the governor’s website changed to reflect Sean Parnell as governor, and the lieutenant governor’s website changed to reflect Craig Campbell as lieutenant governor. Thus all of the information on former Governor Palin’s website as well as speeches and press releases from Sean Parnell’s time as lieutenant governor are no longer available on the live web.

The foresight of the staff of the Alaska State Library and the availability of the Archive-It web archiving service made it possible to preserve the final changes to these "at risk" websites before they were taken offline.

Ten Great Government Web Sites

GCN's list of "great" .gov web sites this year includes GPO's FDsys.

  • Great .Gov Web Sites SPECIAL REPORT: "10 sites that take online government to the next level" by Joab Jackson, Government Computer News (Jul 27, 2009)

Other sites GCN lists include: data.gov, The California Metropolitan Transportation Commission's Transit.511.org, the U.S. State Department, the State of Utah, and Science.gov.

While the description of FDsys in the GCN article has no new information for those who have been following its development for years, its presence in the list is notable and important for at least two reasons. First, it is the only one of the ten that emphasizes permanence and long term access.

Second, it is revealing to see the technologies that GCN lists for each site. Every site on the list is noted for use of technologies that provide good access and rich content. These include the current batch of usual suspects, from Adobe Flash and Microsoft Silverlight, to RSS and Cascading Style Sheets; from Wikipedia and Twitter, to Google Keyhole Markup Language and ArcGIS. But only FDsys also includes technologies that are specifically designed for long-term preservation and for authenticating content: The Reference Model for an Open Archival Information System, and "Digital signatures."

Now if we could just combine that with digital deposit into FDLP libraries, we'd be able to multiply the technical guarantees of long-term free public access to government information by the number of participating FDLP libraries.

Team Digital Preservation saves the day!

Digital Preservation Europe is in the process of creating a series of short animations introducing and explaining digital preservation problems and solutions for the general public. Their first one is cute. Future ones will be released on their YouTube channel.

Keeping up with data rot

Here's an interesting segment from CBS Sunday Morning from March 1, 2009 in which NY Times Technology writer David Pogue talks about data preservation. There are some great pictures from the Computer History Museum in Silicon Valley -- and don't miss Pogue's interview with the museum's curator Dag Spicer. One of the most memorable quotes is from Don Menerick (sp?), Archivist @ NY Public Library:


"...there's a consensus that as the ability to store more and more data, the data itself has become less and less reliable."



ChangeTracker

ProPublica.org has launched ChangeTracker, a new tool that watches pages on whitehouse.gov, recovery.gov and financialstability.gov, "so you don’t have to"!

When the White House adds or deletes anything— say a blog post, or executive order—ChangeTracker will let you know.

The latest changes are listed on the ChangeTracker website, or you can sign up to get alerts via their RSS Feed, Twitter, or email.

They also have a guide to show you how to make a tracker for your own website.
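For the curious, here is a rough idea of what a page tracker does under the hood. This is only a minimal sketch, not ProPublica's actual code: it assumes you have already saved two text snapshots of a page (fetching and scheduling are left out), and the function names `fingerprint` and `describe_change` are made up for illustration. It uses a hash to spot whether anything changed and Python's standard `difflib` to list what was added or removed.

```python
import difflib
import hashlib


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of a page snapshot (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def describe_change(old: str, new: str) -> list[str]:
    """List added ('+ ') and removed ('- ') lines between two snapshots.

    Returns an empty list when the snapshots are identical.
    """
    # Cheap check first: identical fingerprints mean nothing changed.
    if fingerprint(old) == fingerprint(new):
        return []
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    # Keep only additions and removals; drop unchanged lines and '? ' hints.
    return [line for line in diff if line.startswith(("+ ", "- "))]


if __name__ == "__main__":
    # Made-up snapshots standing in for two saved copies of a page.
    yesterday = "Welcome\nBlog post A"
    today = "Welcome\nBlog post A\nBlog post B"
    for change in describe_change(yesterday, today):
        print(change)
```

A real tracker would also need to fetch pages on a schedule, store each snapshot, and send out the alerts; the comparison step is the only part sketched here.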

PowerPoint to the people

Thanks to Carl Malamud for the heads-up about this new article by Fred Kaplan at Slate entitled "PowerPoint to the People: The urgent need to fix federal archiving policies." Basically, according to Kaplan, the National Archives (NARA) are a mess. "Electronic records ... are generally not disposed of in accordance with federal regulations," and NARA is "still unable to accept M$ Word docs and PowerPoint slides." This endemic problem has been known for a long time. Kaplan links to a 2003 NARA study titled "Records Maintenance and Disposition in Headquarters Air Force Offices," (report written in January 2005) that had been FOIA'd -- oddly all of the recommendations were blacked out?! This shows that government agencies are not following protocols for digital records management and the agencies charged with preservation are not doing enough to either build solid preservation systems or work with agencies to help them get into the 21st century. Is this really the "end of history?"

["retweeted" from Carl Malamud!]

Are Federal Record-keeping Laws Out of Step With Modern Communications?

All the President's IMs: Are Federal Record-keeping Laws Out of Step With Modern Communications?, by MICHAEL C. DORF, FindLaw, Jan. 12, 2009.

Dorf argues that our federal record-keeping laws are out of step with the ways in which people now communicate.

Forecast of Worldwide Information Growth

The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth Through 2011, An IDC White Paper - sponsored by EMC, John F. Gantz, Project Director, March 2008.

This report estimates the dimensions of the digital information explosion. With figures like 281 billion gigabytes (the size of the "digital universe" in 2007, which is a million times the amount of digital data hosted by the Library of Congress in 2008 -- see Berman, Francine. Got data?: a guide to data preservation in the information age. Commun. ACM 51, no. 12 (2008): 50-56) and estimates like "By 2011, the digital universe will be 10 times the size it was in 2006" the report has sobering implications for digital preservation. In fact, it notes that:

As forecast, the amount of information created, captured, or replicated exceeded available storage for the first time in 2007. Not all information created and transmitted gets stored, but by 2011, almost half of the digital universe will not have a permanent home.
