digital preservation

New GPO Preservation Librarian talks to WaPo

A few days ago, the Washington Post interviewed David Walls, GPO's new -- and first! -- preservation librarian. We're really excited to work with GPO and Mr Walls on building our public digital govt information preservation architecture.


David Walls is overseeing the transition at the GPO to digital archiving

By Lisa Rein. Friday, July 30, 2010; B03

The U.S. Government Printing Office provides Americans with permanent access to government information, printing about 2 billion pages every year.

As it celebrates its 150th anniversary this year, it has hired its first preservation librarian to oversee, among other things, the transition to digital archiving. David Walls comes to Washington from Yale University, where he worked as a preservation librarian for 12 years.

Walls, 47, just finished his fourth month on the job.

Q. How did you get interested in library preservation?

I volunteered years ago in the rare-books collection at the Baylor University library in Texas. I got bitten by the bug then. It's a very small field and a young one. You could probably put every preservation person in the U.S. in one large hotel ballroom. Most people who do this work are in academic settings or private libraries, but there are government libraries, too, beyond the Library of Congress. You've got the National Library of Medicine, for example.

Why did the office create a position for a preservationist?

We're in an era of digital publications being produced all over government. We're continuing to supply printed copies of the Federal Register and other publications, but most every federal agency is producing things with only digital content. If you get on almost every federal Web site you'll click on things that, in a previous age, would have been produced in a report or a book.

The GPO is updating a digital system we rolled out last year to disseminate and authenticate all of this government information. If you go to http://www.fdsys.gov, you'll see the Federal Register, the new health-care law, the financial reform law, congressional bills, the president's budget on there. And a lot more, of course. We're designing a new server that's more robust. The federal digital system is part of our mission.

Where were government documents such as legislation and the federal budget preserved in the paper-only days?

This is an organization that for 150 years has been distributing publications to various libraries across the country. It's called the Federal Depository Library Program. There are 1,220 of these libraries in the U.S. and Guam, usually departments or units within other libraries. Georgetown Law Library is one. They may specialize in saving Supreme Court briefs or statutes at large. A library in Missouri might preserve Small Business Administration publications or Fish and Wildlife documents.

Some libraries accept everything the government puts out and keep it forever. Others select which stuff to keep. Right now, we're reaching out to this community to do a basic review of our operational plan, to look at our content and develop a set of preservation services to offer the libraries, digitally as well as on paper.

Some publications the libraries carry are old enough to become brittle. If we're about permanent access to government information, what is our plan for reaching out to these libraries to make sure that happens? My job is to provide some leadership and act as a facilitator.

What is digital security, and why is it important?

We need to make sure the information that the printing office disseminates is secure. So right now, we're doing an internal audit project to make sure our digital repository is trusted. It'll be preserved according to modern standards. Think of it as a bank audit. A bank has to go through an audit to make sure it's a trusted repository of money. The same is true for us.

What are the challenges of carrying out the agency's mission without paper?

The paper publication had a physical form, so there was some intellectual control over what it was and where you could find it. You could sit there for quite a long time without worrying about it becoming obsolete. You weren't going to go into the library one day and find out that a publication was inaccessible because it was in a different file format.

With digital, you have the whole issue of how do you know it's authentic? That all the information is there? The digital publication requires almost constant vigilance.

GPO Establishes First Preservation Librarian Position

FOR IMMEDIATE RELEASE: July 14, 2010 No. 10-23
MEDIA CONTACT: GARY SOMERSET 202.512.1957, 202.355.3997 cell gsomerset@gpo.gov

GPO ESTABLISHES FIRST PRESERVATION LIBRARIAN POSITION

WASHINGTON-The U.S. Government Printing Office (GPO) is continuing its commitment to preserving the documents of our democracy by establishing the agency's first preservation librarian position. GPO's preservation librarian will be tasked with updating the Federal Depository Library Program (FDLP) collection management plan for the preservation of federal government documents. David Walls will serve as GPO's first preservation librarian; he is a member of the American Library Association (ALA) and comes to the agency from Yale University where he worked as a preservation librarian for 12 years. While at Yale, Walls established practices for the digital conversion of library and special collection materials.

Digital preservation is an ongoing initiative for GPO. In 2009, the agency launched GPO's Federal Digital System (FDsys), a content management system, preservation repository and advanced search engine that provides the public with permanent public access to federal government information. GPO is also a member of LOCKSS (Lots of Copies Keep Stuff Safe), a worldwide digital preservation alliance that collaborates with libraries and organizations on preservation initiatives.

Link to FDsys: www.fdsys.gov

"David's experience and expertise in preservation will be an asset to GPO and its mission of Keeping America Informed," said Acting Superintendent of Documents Ric Davis. "This is an important position for the agency as we work with the library community on the continuing transition to a primarily electronic FDLP, and ensure that the content can be migrated in the future to guarantee current and permanent public access to federal government information."

The GPO is the federal government's primary centralized resource for gathering, cataloging, producing, providing, authenticating, and preserving published U.S. government information in all its forms. GPO is responsible for the production and distribution of information products and services for all three branches of the federal government. In addition to publication sales, GPO makes government information available at no cost to the public through GPO's Federal Digital System (www.fdsys.gov) and through partnerships with approximately 1,220 libraries nationwide participating in the Federal Depository Library Program. For more information, please visit www.gpo.gov. Follow GPO on Twitter http://twitter.com/USGPO and on YouTube http://www.youtube.com/user/gpoprinter.

A New Push for the Office of Technology Assessment

Some good news for those who value public policy based on well-informed science: "the possibility of reconstituting OTA itself is gaining new momentum."

Steven notes that there is a comprehensive archive of OTA publications from 1972-1995 available on the Federation of American Scientists web site.

There is also, of course, the "OTA Legacy" collection at the University of North Texas Libraries' "CyberCemetery."

How Much Digital Information?

Since 2007, on behalf of EMC Corporation, IDC has been sizing what it calls the Digital Universe, or the amount of digital information created and replicated in a year. The newest report is now available:

These reports estimate the size of everything digital. IDC looks at the installed base of devices or applications that could capture or create digital information and estimates (based on their research and "other sources") how much information was created in a year. They also estimate the number of times a unit of information is replicated. They include devices such as mobile phones, bar code readers, and video games as well as cameras, scanners, email, office applications, databases, GPS, medical imaging, and lots more. A lot of this is estimates and I found it hard to tell how much was gathered evidence and how much was speculation (see their methodology in the first IDC Digital Universe paper, published in 2007). To me this means that the figures they come up with may not be very accurate. Predictions of the future based on these estimates are, I think, very speculative.

Nevertheless, I've been following these ever since I noticed that Fran Berman quoted an earlier report and referred to 2007 as the "cross-over" year: the year in which more digital data was created than there was data storage to host it. (Berman, Francine, Got data?: a guide to data preservation in the information age, Commun. ACM, 51 (2008), 50-56.)

Even if you don't believe that the IDC numbers are 100% accurate, the general ideas that they promote are probably not that far off the mark. Some of those ideas:

  • Last year, despite the global recession, the Digital Universe set a record growing by 62% to nearly 800,000 petabytes.
  • The average file size is getting smaller. The number of things to be managed is growing twice as fast as the total number of gigabytes.
  • The growth of the Digital Universe is like a perpetual tsunami. How will we find the information we need when we need it?
  • How will we know what information we need to keep, and how will we keep it?

That last item is my favorite. Regardless of exactly how much digital information is created each year, regardless of how much storage space we have, regardless of the fact that a lot of the "digital universe" that IDC describes is throw-away information that no one would think is worth keeping, we are still faced with Lots of Stuff and we need to figure out What to Preserve. That, I believe, is the next big challenge for digital preservation.

One way to face that challenge is to rely on producers to decide what to save. If a government agency produces something digital, allow that agency (or GPO, or LoC, or NARA, or OMB or OPM, or your favorite TLA) to decide for you if that information is worth saving.

Another way to face the challenge is to rely on a few big organizations. That is: pool our resources and outsource preservation to a few big organizations that will do this for us. Some of the same players pop up here: LoC and NARA, for example, but there are also organizations like Portico, and the Internet Archive, and ICPSR.

Both of the above solutions hope that someone else will take into account the needs of all possible users and make the right decisions. That model can work for some classes of information with appropriate governance and decision-making structures in place.

But, I believe, the lesson from the IDC report is that the "digital universe" is so large that we should not assume that any single solution will be enough. There is just too much information and there are too many decisions to make about what is worth saving. While information producers and a few big preservation organizations can do a lot, they cannot do everything. And, their size alone will constrain their decisions. It will be harder for big organizations to respond to the needs of smaller communities of interest.

What is the alternative? I think that we need (what shall we call them...?) Libraries. Public Libraries, Special Libraries, College and University Libraries, and School Libraries. These can work together or independently. They can address the needs of their particular communities of interest. This will accomplish three things:

  1. It will aid preservation by making the preservation community bigger. This will not only increase redundancy, but will also help ensure that there is less chance that a single system or financial or governance failure will mean a loss of all information.
  2. It will help deal with the scale of the preservation problem (as identified by the IDC report). With more players and more stakeholders, there will be more voices and more variety in the decision-making process when we collectively decide what to save. This will mean, for example, that a group of School Libraries working together on digital preservation could ensure that an item of essential use to K-12 will be saved even if no university saves it. And vice versa.
  3. It will help users find and use the information they need. Today, it seems that everyone understands what librarians have always known: that there is a lot of information in the world. It used to take a library degree to get an appreciation of all the sources of information in the world. Today, everyone that uses the Web has that same appreciation. It seems like every day there is another newspaper article or blog posting about how great it is to have access to "everything." But the "everything" people see on the Web is really only a subset of everything and it only appears to be "everything" because there is so much in this very large subset of everything. And, when your only option is to search "everything" you quickly discover that that is not always the best way to find just what you want. (Even Google has segmented information into categories like movies, blogs, books, and scholarly information.) Having community-of-interest collections will enable libraries to build user-interfaces that work best for those communities and that provide access to the information those communities most want.
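The redundancy argument in the first point can be made concrete with a back-of-the-envelope sketch. It assumes, purely for illustration, that each library's copy is lost independently and with the same probability over some period. Real failures are often correlated, which is exactly why variety in systems, funding, and governance matters. The function name is hypothetical.

```python
def probability_all_copies_lost(per_copy_loss: float, copies: int) -> float:
    """Chance that every copy is lost, assuming independent, identical failures.

    This independence assumption is an idealization made for illustration only.
    """
    if not 0.0 <= per_copy_loss <= 1.0:
        raise ValueError("per_copy_loss must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # With independent failures, all copies are lost only if each one is lost.
    return per_copy_loss ** copies
```

Under that idealized assumption, a hypothetical one-in-ten chance of losing any single copy becomes roughly a one-in-a-thousand chance of losing all three copies held by three different libraries.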

Libraries won't replace "everything" collections. They will complement each other and unfocused "everything" collections. They will enrich us all and help ensure that we will preserve what needs to be preserved as the "digital universe" expands more rapidly than we could otherwise deal with.

See also:
Forecast of Worldwide Information Growth.
How much Information.
Citizens in the Dark? Government Information in the Digital Age.

Open Library redesign and proposal for collaborative digitizing of documents

The Open Library announced yesterday that their redesigned site is now available with lots of new features and functionalities.

As I suggested in my tweet a few minutes ago, wouldn't it be great if lots of depository libraries bought cheap book scanners like the Decapod (a Mellon-funded project), digitized government documents and uploaded them to the Open Library? There are tons of records for government documents just waiting for the attachment of a digital file. And GPO could help by sharing their records from the Catalog of Government Publications (CGP) with the Open Library, where librarians and others could enhance them to make more robust metadata (which could be fed back into the CGP!). Lots of libraries with Decapods make light work!

(Full disclosure: I'm on the board of QuestionCopyright, a 501(c)(3) non-profit which has its own book scanning hardware/software project called Book Liberator. BL developers are in close contact with Decapod folks. But I get no economic benefit from either Book Liberator or Decapod.)

Special issue on Technology and digital preservation

Library Hi Tech, Volume 28, issue 2 (2010) is a special issue on "Technology and digital preservation." Preprints of articles are now available, though subscription may be required to access some. A few that you may find interesting:

  • Economics, sustainability, and the cooperative model in digital preservation, by Mr. Tyler O. Walters, Dr. Katherine Skinner. "The authors provide an examination of the emerging field of digital preservation and its economics. They consider in detail the cooperative model and the path it provides toward sustainability as well as how it fosters participation by cultural memory organizations and their administrators, who are concerned about what digital preservation will ultimately cost and who will pay."
  • "Land of the lost": a discussion of what can be preserved through digital preservation, by Mr. David Pearson, Mr. Nicholas del Pozo, Mr. Andrew Stawowczyk Long. "...proposes the concept of preservation intent: a clear articulation of a commitment to preserve an object, the specific elements of that object that should be preserved, and a clear time line for the duration of preservation. It investigates these concepts through simple and practical examples."
  • Keeping It Simple: The Alabama Digital Preservation Network (ADPNet), by Mr. Aaron Trehub, Mr. Thomas C. Wilson. "The purpose of this paper is to present a brief overview of the current state of Distributed Digital Preservation (DDP) networks in North America and to provide a detailed technical, administrative, and financial description of a working, self-supporting DDP network: the Alabama Digital Preservation Network (ADPNet). The authors view ADPNet in a comparative perspective with other Private LOCKSS Networks (PLNs) and argue that the Alabama model represents a promising approach to DDP for other states and consortia."

There is much more. A very rich and informative issue.

MetaArchive publishes guide to distributed digital preservation

Please check out the new book published by the MetaArchive Cooperative called A Guide to Distributed Digital Preservation. It's both timely and handy.

[Full disclosure: the book is primarily about LOCKSS and specifically mentions the project that I'm working on, LOCKSS-USDOCS. FGI and I receive no compensation from the sales of the book.]

Announcement: publication of A Guide to Distributed Digital Preservation

Authored by members of the MetaArchive Cooperative, A Guide to Distributed Digital Preservation is the first of a series of volumes from the Educopia Institute describing successful collaborative strategies and articulating specific new models that may help cultural memory organizations work together for their mutual benefit.

This volume is devoted to the broad topic of distributed digital preservation, a still-emerging field of practice for the cultural memory arena. Replication and distribution hold out the promise of indefinite preservation of materials without degradation, but establishing effective organizational and technical processes to enable this form of digital preservation is daunting. Institutions need practical examples of how this task can be accomplished in manageable, low-cost ways.

This guide is written with a broad audience in mind that includes librarians, archivists, scholars, curators, technologists, lawyers, and administrators. Readers may use this guide to gain both a philosophical and practical understanding of the emerging field of distributed digital preservation, including how to establish or join a network.

Readers may access A Guide to Distributed Digital Preservation as a freely downloadable pdf and/or as a print publication for purchase. Please visit http://www.metaarchive.org/GDDP to download or order the book.

******

The MetaArchive Cooperative provides low-cost, high-impact preservation services to help ensure the long-term accessibility of the digital assets of universities, libraries, museums, and other cultural memory organizations. In addition to preserving members' digital content in a distributed digital preservation network, the Cooperative also offers consulting and education services to institutions that seek training in digital preservation planning, policy creation, and implementation, including setting up and running Private LOCKSS Networks (http://www.lockss.org).

For more information, please contact Program Manager Katherine Skinner (katherine.skinner@metaarchive.org).

Lunchtime Listen: How are we ensuring the longevity of digital documents?

Please check out the spring 2009 plenary at Coalition for Networked Information (CNI) by David Rosenthal, chief scientist of the LOCKSS program. He presents a "contrarian view" of digital preservation. The issues he raises are definitely important to think about for those of us working to preserve digital govt information/documents for the long term.

How Are We Ensuring the Longevity of Digital Documents? from CNI Video Editor on Vimeo.

Lost Conversations, Lost Decisions, Lost History

Oliver Bell summarizes the issue of new forms of government communication and the need for new ways of preserving them:

Work needs to begin on archiving standards that will retain the information that is driving decisions today and as technology plays an increasingly larger role in the business of government archiving standards needs to be a core part of systems design, not a problem that we try and solve after the fact.

One important element of this issue is Title 44 of the US Code that defines what GPO and the FDLP can handle. Its definitions limit what we can archive within those boundaries of the FDLP. But... if we had digital deposit, GPO could deposit official Title-44-approved content in FDLP digital libraries and those libraries could combine that content with non-Title-44 (Gov-2.0) content. GPO can't do this because of the limits of Title 44, but individual FDLP libraries have the flexibility to build their own collections combining Title 44 content with other content. We can do this today without changing the law. But we need digital deposit to make those collections rich and useful.

Do zombies care about digital preservation?

I guess this is the day of big thoughts :-) “How secure is our civilization’s accumulated knowledge?” This is the question mulled over by Richard Heinberg in his essay Our Evanescent Culture and the Awesome Duty of Librarians. “Ultimately,” Heinberg writes,

the entire project of digitized cultural preservation depends on one thing: electricity. As soon as the power goes off, access to the Internet goes down. CDs and DVDs become meaningless plastic disks; e-books become inscrutable and useless; digital archives become as illegible as cuneiform tablets — or more so. Altogether, digitization represents a huge bet on society’s ability to keep the lights on forever . . . . It’s ironic to think that the cave paintings of Lascaux may be far more durable than the photos from the Hubble space telescope. Altogether, if the lights were to go out now, in just a century or two the vast majority of our recently recorded knowledge would be gone or inaccessible.

This isn't just an idle thought experiment, with GPO, FDLP libraries, Google, Internet Archive, etc. actively engaged in and/or planning for massive digitization of the historic record of the US govt.

Matt Cardin of The Teeming Brain expounds on Heinberg's essay and makes some good points in "Zombies, Digital Media, and Cultural Preservation in the New Dark Age".

[Thanks for the tweet about this, ArchivesNext!]

What is wrong with this picture?

The Executive Office of the President is looking for a private company to archive some of its presidential records. The records in question are those published by EOP on publicly-accessible web sites including social networking sites like Facebook and Twitter.

Does this strike anyone else as indicative of a major problem with how the government is approaching records retention and preservation? Will they just call it "outsourcing" and hope that magic word makes all the questions go away?

Wouldn't it be easier, more comprehensive, and more authoritative for the government to capture its own records on the way to social networking sites? Its workflow of a) posting to commercial services, b) paying contractors to extract content from those servers and c) reformatting it for preservation sounds like very bad planning to me.

Surely we need governments to start taking their own communications seriously and create preservable records rather than try to re-create records in order to preserve them. The idea of creating hard-to-preserve records using commercial services and having to go through extra steps to re-capture and re-format the information is doing two things wrong instead of one thing right.
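One way to picture "capturing records on the way" is a publish step that writes the preservable record first and only then hands the text to the commercial service. The following is a minimal sketch under that assumption. The function, the record fields, and the `post` callback are all hypothetical and do not represent any EOP or vendor system.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def archive_then_post(text, service, archive_dir, post):
    """Write a local record of `text`, then call `post(text)` to publish it.

    `post` is a caller-supplied function (hypothetical here) that sends the
    text to the external service. Returns the path of the archived record.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    record = {
        "text": text,
        "service": service,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "sha256": digest,
    }
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    # Name the file after the content digest so identical posts map to one record.
    record_path = archive_dir / f"{digest}.json"
    record_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    post(text)  # publish only after the record has been written locally
    return record_path
```

The design choice is simply ordering: the record exists in the agency's own hands before the commercial service ever sees the content, so nothing has to be extracted or re-created later.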

By the way, the solicitation also asks for capturing "information posted by non-EOP persons on publicly-accessible web sites where the EOP maintains a presence both comments posted on pages created by EOP and messages sent to EOP accounts on those web sites" which brings up a different set of questions. Some people are upset by the idea of capturing the public's publicly-posted comments. I don't really follow that train of reasoning. The comments are voluntarily posted as public comments, but the government shouldn't save them? Archivists are faced with big challenges! (See also: Sustaining the Future of Archives. Hat tip to George!)

Internet Archive proposal for mass digitization

I had known that the Internet Archive had submitted a response to the GPO's RFP for mass digitization. A friend just sent me the link to the proposal submitted to GPO (embedded below and here's the link to the proposal and supporting documents).

As you can probably guess, we've been pulling for the Archive to get the bid, not least because the Archive is a 501(c)(3) non-profit library and we've stated on more than one occasion that privatization of public domain government information is a very bad idea. But also, we've been heartened by the quality of the Archive's scans to date, their openness and willingness to be collaborative in their processes and data access and sharing. Those qualities certainly come through in their proposal for mass digitization -- not to mention the fact that they've actually made their proposal public!

While the award has not been officially announced, we really hope that the Archive wins the award. Perhaps GPO will name them as an official depository library and work with them not only on the "legacy" collection (there needs to be a better description of the deep and rich collections of depository libraries than the somewhat pejorative "legacy" :-| ) but on digital deposit of government documents going forward.

--that is all.


Recent report on Link Rot

A recent report evaluating a two-year web harvesting project found that 14.3 percent of the original URLs of all titles harvested from the Web and archived during the first year of the project had become inactive within at least one year of harvesting. (p.33)

The report is an evaluation of the Legal Information Archive of The Chesapeake Project, which was designed to preserve born-digital legal information published directly to the Web. The project was implemented in early 2007 by the Georgetown Law Library and the State Law Libraries of Maryland and Virginia.

The report notes that more than 95 percent of the titles in the sample were PDF files. Of these titles, 8.2 percent were found to have inactive original URLs in 2008 and 14.1 percent in 2009. (p.35)

Ten percent of government (.gov) URLs became inactive in the first year and an additional three percent became inactive in the second year. (p.34)

The report concludes:

More than 4,300 digital items, representing nearly 1,900 titles, have been harvested from the Web and archived, and roughly 14 percent of these titles have already been removed from their original locations on the Web, demonstrating the importance and effectiveness of the project’s efforts. Moreover, the project’s access figures demonstrate both the broad, international reach of the project’s efforts, as well as the successful selection of high-interest and high-use materials by project participants.
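To make the kind of measurement reported above concrete, here is a minimal sketch of how inactive-URL percentages could be tallied by top-level domain from a list of link-check results. It is not the Chesapeake Project's method. The function name, the input layout, and any sample URLs are hypothetical, and the link checking itself is assumed to have happened elsewhere.

```python
from collections import defaultdict
from urllib.parse import urlparse


def inactive_rate_by_tld(checks):
    """Return {top-level domain: percent of checked URLs that were inactive}.

    `checks` is an iterable of (url, is_active) pairs produced by some
    earlier link-checking step that is not shown here.
    """
    totals = defaultdict(int)
    inactive = defaultdict(int)
    for url, is_active in checks:
        host = urlparse(url).hostname or ""
        # Take the last dot-separated label of the host, e.g. "gov" or "org".
        tld = host.rsplit(".", 1)[-1] if "." in host else host
        totals[tld] += 1
        if not is_active:
            inactive[tld] += 1
    return {tld: 100.0 * inactive[tld] / totals[tld] for tld in totals}
```

Run once a year against the same list of original URLs, a tally like this would show how the inactive share for each domain grows over time.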

Alaska State Library Archiving Governor Palin’s Resignation Announcement and End of Term Website

Alaska Governor Sarah Palin’s resignation announcement earlier this month and the transition of power to Lieutenant Governor Sean Parnell gave the Alaska State Library a great chance to preserve this "at risk" content. 

Using Archive-It and the manual "start on demand" feature inside the web application, the Alaska State Library crawled Governor Palin and Lt. Governor Parnell's web sites on the eve of the transition of power and was able to capture valuable information that is now offline and no longer accessible.

The Alaska State Library’s Alaska Governor/Lt. Governor Web Sites collection was originally conceived to archive these government websites over time. Once Sarah Palin left office, the governor’s website changed to reflect Sean Parnell as governor, and the lieutenant governor’s website changed to reflect Craig Campbell as lieutenant governor. Thus all of the information on former Governor Palin’s website as well as speeches and press releases from Sean Parnell’s time as lieutenant governor are no longer available on the live web.

The foresight of the staff of the Alaska State Library and the availability of the Archive-It web archiving service made it possible to preserve the final changes to these "at risk" websites before they were taken offline.

Ten Great Government Web Sites

GCN's list of "great" .gov web sites this year includes GPO's FDsys.

  • Great .Gov Web Sites SPECIAL REPORT: "10 sites that take online government to the next level" by Joab Jackson, Government Computer News (Jul 27, 2009)

Other sites GCN lists include: data.gov, The California Metropolitan Transportation Commission's Transit.511.org, the U.S. State Department, the State of Utah, and Science.gov.

While the description of FDsys in the GCN article has no new information for those who have been following its development for years, its presence in the list is notable and important for at least two reasons. First, it is the only one of the ten that emphasizes permanence and long term access.

Second, it is revealing to see the technologies that GCN lists for each site. Every site on the list is noted for use of technologies that provide good access and rich content. These include the current batch of usual suspects, from Adobe Flash and Microsoft Silverlight, to RSS and Cascading Style Sheets; from Wikipedia and Twitter, to Google Keyhole Markup Language and ArcGIS. But only FDsys also includes technologies that are specifically designed for long-term preservation and for authenticating content: The Reference Model for an Open Archival Information System, and "Digital signatures."

Now if we could just combine that with digital deposit into FDLP libraries, we'd be able to multiply the technical guarantees of long-term free public access to government information by the number of participating FDLP libraries.
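As a rough illustration of why multiple copies strengthen those guarantees, the sketch below compares SHA-256 digests of the same document as held by several libraries and flags any copy that disagrees with the majority. This is a plain checksum comparison, not a digital signature scheme, and it is not how FDsys or LOCKSS actually work. The function name and the holder names a caller passes in are hypothetical.

```python
import hashlib
from collections import Counter


def find_divergent_copies(copies):
    """Return (majority_digest, sorted names of holders whose copy differs).

    `copies` maps a holder name to the bytes of its copy of one document.
    """
    if not copies:
        raise ValueError("at least one copy is required")
    digests = {name: hashlib.sha256(data).hexdigest() for name, data in copies.items()}
    # The most common digest is treated as the reference version.
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    divergent = sorted(name for name, d in digests.items() if d != majority_digest)
    return majority_digest, divergent
```

With only one copy there is nothing to compare against, and with an even split the "majority" here is arbitrary. The more independent holders there are, the more likely it is that a damaged or altered copy stands out and can be repaired from the others.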
