preservation

A guide to preservation

Francine Berman, director of the San Diego Supercomputer Center, has a new article on digital preservation worth reading:

In the article, Berman lists four trends. Number 4 is particularly relevant to us:

Trend 4. Increasing commercialization of digital data storage and services. The 2006 introduction of Amazon Simple Storage Solutions (www.amazon.com/gp/browse.html?node=16427261) was a high-profile example of the trend toward commercialization of data storage and data services. Today, there is considerable activity in the private sector around data storage and services for the consumer; for example, we share and store digital photos through Flickr, employ Apple’s Time Capsule for regular personal computer backup, and use LexisNexis for online legal services.

The commercialization of data storage and services contributes an important component of the data CI [cyberinfrastructure] environment needed to harness the potential of our information-rich world. However, private-sector storage and services are not the solution to all digital data needs. For some digital data considered to be “in the public interest” (such as census data, official records, critical scientific data collections, and a variety of irreplaceable data), a greater level of trust, monitoring, replication, and accountability is required to minimize the likelihood of loss or damage and ensure the data will be there for a very long time. For such community data sets, stewardship by a trusted entity (such as libraries, archives, museums, universities, and institutional repositories), whose mission is the public good rather than profit, is generally required.

Affirmative Disclosure of Government Information

John Wonderlich, a Program Director of the Sunlight Foundation and a great friend of libraries, has posted some useful suggestions over at The Sunlight Foundation Blog:

Obama and Affirmative Disclosure.

I really like John's concept of "affirmative disclosure." I think we could go even further by explicitly addressing the problems of long-term preservation caused by the shift to e-government.

I am starting from the assumption that society needs a reliable way to preserve an accurate, complete historical record. Unfortunately, the systems we have in place today make it difficult, and in some cases impossible, to guarantee that we will preserve a record that is either complete or accurate.

Consider, for example, the recent case where researchers at the University of Illinois discovered that the White House removed original documents from its web site, altered them, and replaced them with backdated modifications that appear to be originals but are not.

Also consider the project of the Library of Congress, the California Digital Library, the University of North Texas Libraries, the Internet Archive and the U.S. Government Printing Office to try to capture web pages of the current administration by performing a "comprehensive crawl of the .gov domain."

These examples illustrate the problem of preserving the historical record.

The first shows how the historical record can easily be lost and altered (intentionally or unintentionally -- it doesn't matter which) by lack of accurate metadata (dates, versioning). The second shows the sad state of current preservation: the best record we will have of the government web will be a single, incomplete snapshot of the end of an eight-year administration. (Harvesting is imperfect and incomplete: links can break, embedded content can be lost, databases can prohibit or inhibit crawls of their content, and crawls can only save a snapshot of dynamic sites.)

In essence, the government has made a major change in information policy by changing the technology of information dissemination and has done so without really examining the implications of the change or even acknowledging that a policy has changed.

What was the policy change? In the old policy, the role of government was to collect and assemble and edit and create information and then instantiate it in publications and distribute those instantiations to the public. At that point the role of preservation was in the hands of libraries (mostly FDLP libraries) and archives. But, in the new policy, the government does not actively distribute, but "posts" information on web sites where it is subject to alteration and removal without ever being instantiated anywhere. It is up to the public, consumer groups, individuals, libraries, and special projects to identify when information is posted or changed and then attempt to preserve that information. While that may succeed sometimes, the approach has two fatal flaws. First, it is ad hoc and therefore will almost certainly be incomplete at best. Second, it puts the responsibility of instantiation in the wrong hands: not those who create the information (the government) but those who "discover" the information. The government essentially is renouncing its responsibility to actively, affirmatively create a preservable instance of the information it creates.

While some agencies (e.g., GPO, EIA) are saying that it is now their role to preserve information, other agencies (e.g., NARA) are actually narrowing their role in long-term preservation (notice that NARA is not participating in the ".gov crawl" and says explicitly that "most web records do not warrant permanent retention").

So, let's explicitly expand the idea of "affirmative disclosure" to include "active deposit." By that I mean that the government should be required to actively inform and distribute to the public notifications (metadata) and documents (data) every time a "document" is created or modified or superseded. "Deposit" could be accomplished with existing technology (e.g., RSS, APIs, OAI-PMH) and should be required to include dates and version information.
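
To make this concrete, here is a minimal sketch in Python of what a single deposit notification might carry. The field names (identifier, version, supersedes, and so on) and the example identifier are hypothetical illustrations, not part of any existing standard or agency system. The point is only that each notice includes a date, version information, and a checksum that lets a library verify what it received.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_deposit_notice(identifier, version, content, issued, supersedes=None):
    """Build a deposit notice (metadata) for one version of a document.

    All field names here are hypothetical, chosen only for illustration.
    """
    return {
        "identifier": identifier,      # stable name for the document
        "version": version,            # increases with every change
        "supersedes": supersedes,      # earlier version replaced, if any
        "issued": issued.isoformat(),  # when this version was released
        "sha256": hashlib.sha256(content).hexdigest(),  # fixity check
        "size_bytes": len(content),
    }


def verify_deposit(notice, content):
    """Return True if the received bytes match the checksum in the notice."""
    return hashlib.sha256(content).hexdigest() == notice["sha256"]


if __name__ == "__main__":
    original = b"List of members, as first released."
    notice = make_deposit_notice(
        "example-agency/list-001", 1, original,
        issued=datetime(2008, 11, 1, tzinfo=timezone.utc),
    )
    print(json.dumps(notice, indent=2))
    print(verify_deposit(notice, original))
```

A revised document would get a new notice with a higher version number and a "supersedes" value pointing at the version it replaces, so the earlier instance stays identifiable instead of being silently overwritten.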

This is the right way to do this because it recognizes the appropriate roles for the different participants in the life cycle of information: government agencies create information products that are preservable and libraries and others preserve those products outside the .gov domain.

White House documents found to be altered

Researchers at the University of Illinois say they have found evidence on the White House Web site that suggests "a pattern of revision and removal from the public record that spans several years, from 2003 through at least 2005. Instead of issuing a series of revised lists with new dates, or maintaining an updated master list while preserving copies of the old ones, the White House removed original documents, altered them, and replaced them with backdated modifications that only appear to be originals."

Once again, our reliance on government websites for current information fails to preserve the historical record and yields an incomplete, unverifiable, and even altered record.

We need government to instantiate information and actively deposit those instantiations outside the dot-gov realm (e.g., with FDLP libraries) to help guarantee a complete and accurate record.
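
One reason independent copies matter is that whoever holds an earlier copy can detect a silent revision. The following is a minimal sketch in Python, assuming two hypothetical snapshots of a site stored as simple mappings from URL to page content; it is not the method the Illinois researchers used, only an illustration of how comparing checksums exposes altered, removed, and added documents.

```python
import hashlib


def digest(content):
    """Return the SHA-256 checksum of a document's bytes."""
    return hashlib.sha256(content).hexdigest()


def compare_snapshots(earlier, later):
    """Compare two snapshots (dicts mapping URL -> bytes).

    Returns the URLs whose content changed, disappeared, or newly appeared.
    """
    report = {"altered": [], "removed": [], "added": []}
    for url, content in earlier.items():
        if url not in later:
            report["removed"].append(url)
        elif digest(content) != digest(later[url]):
            report["altered"].append(url)
    report["added"] = [url for url in later if url not in earlier]
    for urls in report.values():
        urls.sort()
    return report


if __name__ == "__main__":
    # Hypothetical content, for illustration only.
    snapshot_2003 = {
        "/list.html": b"Members: A, B, C",
        "/report.html": b"Draft report",
    }
    snapshot_2005 = {
        "/list.html": b"Members: A, B",  # same URL, different content
        "/news.html": b"New page",
    }
    print(compare_snapshots(snapshot_2003, snapshot_2005))
```

This only works if someone outside the publishing agency kept the earlier snapshot, which is the argument for deposit.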

Harvesting .gov

Harvest time, By William Jackson, GCN, 10/27/08.

A nice article about the end-of-administration web harvest.

See also: Library Partnership Saves Government Sites.

Where is that official government information?

According to a press release, the Committee on House Administration has adopted new rules that "permit Members to post content on outside websites so long as the content is for 'official purposes'...."

On the one hand, this is a welcome relief from the rules the House was using, which seemed more appropriate for the nineteenth century than the twenty-first.

On the other hand, it will make the job of identifying, authenticating, and preserving official government information that much more difficult.

John Wonderlich reports that the new rules say that Members of the House may post "official content" outside of .gov:

In addition to their official (house.gov) Web site, a Member may maintain another Web site(s), channel(s) or otherwise post material on third-party Web sites.

With official government information migrating to YouTube and other dot-coms and without deposit of official government information in depository libraries, even web harvesting projects will have little hope of being comprehensive.

NYT: Federal Files Blip Into Oblivion

This is a pretty good popular-press overview of the problems of digital preservation of government information and some of the steps being taken to address the problems.

Sample of the problems:

The Achilles' heel of record-keeping is people.

In an effort to save money, federal agencies are publishing fewer reports on paper and posting more on the Web.

The Web site of the Environmental Protection Agency lists more than 50 "broken links" that once connected readers to documents on depletion of the ozone layer of the atmosphere.

At least 20 documents have been removed from the Web site of the United States Commission on Civil Rights. They include a draft report highly critical of the civil rights policies of the Bush administration.

93 percent of [top officials surveyed at NASA] were violating federal requirements for preserving e-mail correspondence.

"Most Web records do not warrant permanent retention," because they do not have "long-term historical value," the [National] Archives said.

Alarmed at the possible loss of White House e-mail messages, the House passed a bill in July that would require agencies to preserve more electronic records. ... Republican opponents said the requirements would be onerous and costly. Mr. Bush has threatened to veto the bill, saying it could "interfere with a president's ability to carry out his or her constitutional and statutory responsibilities."

See also: Citizens in the Dark? Government Information in the Digital Age.

Citizens in the Dark? Government Information in the Digital Age. SAA 2008.

These are my speaker-notes for the presentation, "Citizens in the dark?
Government Information In the Digital Age," which I gave on Friday
Aug 29, 2008, at the meeting of the Acquisitions and Appraisal Section
of the Society of American Archivists Convention in San Francisco.
The theme of the convention was "Archival R/Evolution & Identities."

This is not a transcript of what I actually said, but an outline from which
I spoke. There are sentence fragments and inconsistent capitalization and
other less-than-final-draft editing. I hope that this is useful to you in
spite of these distractions.

I do include the points I tried to make and a bit of the verbiage and all
of the links I have.

- Jim Jacobs.

---
The title of this presentation is:

Citizens in the dark? Government Information In the Digital Age

We are seeing a fundamental change in the way governments communicate
with citizens. These changes are NOT caused by technology, although
they are enabled by technologies. They are driven and determined by
economic, political, and social issues.

The solutions, therefore, are not technological either, although they
will be enabled by technology. The solutions are economic, social,
and political.

Abby Smith, Director of Programs at the Council on Library and Information
Resources, in a CLIR report on "authenticity" in a digital age, summed
this up quite nicely:

Interestingly, the scholar-participants suggested that technological
solutions to the problem [of establishing the authenticity of a digital
object] will probably emerge that would obviate the need for trusted
third parties. Such solutions may include, for example, embedding
texts, documents, images, and the like with various warrants (e.g.,
time stamps, encryption, digital signatures, and watermarks). The
technologists replied with skepticism, saying that there is no
technological solution that does not itself involve the transfer of
trust to a third party. Encryption -- for example, public key
infrastructure (PKI) -- and digital signatures are simply means of
transferring risk to a trusted third party. Those technological
solutions are as weak or as strong as the trusted third party. To
devise technical solutions to what is, in their view, essentially a
social challenge is to engender an "arms race" among hackers and their
police.

Abby Smith, "Digital Authenticity in Perspective," in "Authenticity in a
Digital Environment," Council on Library and Information Resources,
Publication 92 (May 2000).
http://www.clir.org/pubs/reports/pub92/smith.html

"Trust" is a social issue, not a technological one.

-----------------------------------------------------------------------
As we look at the technological changes, and at the way governments are
using and not using, adopting and avoiding, and in general coping with
these changes, I think we all see trends.

The agenda of this conference reflects these trends and changes,
with its theme of Revolution and Evolution, with sessions on
everything from
- digital repositories
- born digital materials,
- digitization
- digital manuscripts
- e-mail
- e-records
- e-discovery
- the "e-tiger"

And of course, representatives from NARA, LC, and GPO are here to discuss
their projects.

I assume that all of us are familiar, at least in a general way, with
many of the difficulties of digital archiving. Things like:

- format obsolescence
- media deterioration
- content that is tied to a particular operating system or application
- the need for new kinds of metadata

and

- emulation and migration strategies.

So, I will not cover those today.

What I do want to do is to give you a (perhaps) slightly different
perspective and some (possibly) different ideas and approaches to these
challenges and bring up some issues that I believe have not had enough
attention yet.

THE PAST
-----------------------------------------------------------------------

1. In the past, government information archiving was straightforward

a) We knew and could fairly easily define and identify records

b) We could (again, in a fairly straightforward way) identify
responsibility for record creation, scheduling, retention, deposit,
preservation, access, etc.

c) We could establish procedures to get things done: predictable,
definable, etc.

So... in the past, we had a pretty clear path of preservation:

- of what we wanted to preserve and
- of how to preserve it and
- of who was responsible at each stage from record creation through
retention and disposition and preservation.

We could define and identify what we wanted to preserve and seek and
possibly fund the preservation.

We may not have always been 100% effective, there may have been failures,
gaps, short-funding, recalcitrant agencies, mistakes, etc. but we at least
knew what we were doing and where the gaps were...

THE PRESENT
-----------------------------------------------------------------------
A lot has changed, perhaps everything. Here are four areas
of fundamental change that affect our ability to archive the
complete historical record of governments:

1) WHAT. While to some extent we can still define and identify records,
the job of doing so is much less clear. There may be some things that
we cannot get a hold on to define as records. There may be things that
are part of the record which the government does not even possess. Or for
which it lacks licensing or copyright permission to possess or copy.

2) WHO. Even to the extent that we can identify (broadly) what we want
to preserve, it may be hard to identify who is responsible and
difficult to create adequate, implementable, schedules for
preservation.

3) HOW. Even if we can do all that, digital preservation itself is
difficult and it is very hard to move from a quick-moving,
service-oriented, bureaucratic, day-to-day, digital environment, to an
environment of digital preservation.

4) ACCESS. While preservation without access is not preservation at all,
"access" is a very different process than preservation.

It seems to me that the very processes that make it *easier* for a
current end-user to find and use digital information make it *harder*
for the archivist to preserve that same information and ensure its
usability in the future.

EXAMPLES
-----------------------------------------------------------------------

Let's look at some examples

EMAIL (1)
------------------------------------------------------------------------
E-mail certainly provides good examples of the "recalcitrant agency" problem.
But I want to emphasize some other issues that will plague us even if we
solve that one.

An article in Technology Review gave several good examples.

One related a story that Allen Weinstein tells about how he discovered
in his FBI files a newspaper clipping with a note hand-written on it by
J. Edgar Hoover.

If that same communication happened today, it would most likely happen
in an email with, perhaps, an attachment of the article or, worse, a
link to the article.

Even if we had in place all the new laws and regulations that are being proposed
to ensure that we can actually save email, would we have a complete record? Or
would we have a partial record with a key part missing? And would we be able
to find or identify that part? Would we be permitted to archive it?

Talbot, David. "The Fading Memory of the State." Technology Review, July
2005.
http://www.technologyreview.com/printer_friendly_article.aspx?id=14583&c....

EMAIL (2)
------------------------------------------------------------------------
Another problem with email is the difficulty in knowing what to preserve.
The simplest algorithm for preservation of email is to preserve everything, but
that means preserving a great many trivial, unimportant messages that would not
normally be scheduled for retention in any rational universe.

RECORD OF INFORMATION USED IN DECISION MAKING
------------------------------------------------------------------------
Another example from that same Technology Review article:

The mistaken bombing in 1999 of the Chinese embassy in Belgrade.
U.S. officials blamed the error on outdated maps used in targeting.

Today's planners would use GIS software to zoom and pan, and run
calculations about the topography to make a targeting decision.

Would the software preserve the decision making process?

There are layers of challenges here:
- the data used (spatial data, databases of locations, topography, etc.)
- the software used to analyze and use the spatial data
- the code behind that software that has its own algorithms for implementing
particular user-analyses
- the actual use by the end-users, the trail of how they used the software
to analyze the data

These are difficult things to archive!

When decisions are based on computer models working on dynamic databases,
will we be able to preserve for future historians the state of the database
and the algorithms built into the models?

PUBLIC DOCUMENTS
------------------------------------------------------------------------
When we think about the preservation of the historical record, we have to
include public documents as well as private communications and decision-making
records.

In the past, "public documents" meant "publications" that were widely distributed
to the public and depository libraries.

Today, it means web sites.

As you probably know the Library of Congress recently announced a big
project with several partners to crawl the .gov domain at the end of the
current presidential administration. This will harvest a lot of digital
content that might otherwise disappear with the change of administrations.

http://www.loc.gov/today/pr/2008/08-139.html

An article about this project discusses the software being used and some
of the issues.

Quint, Barbara. "Consortium--Minus NARA--Archiving Bush Administration
Websites." Information Today NewsBreaks, August 28, 2008.
http://newsbreaks.infotoday.com/nbReader.asp?ArticleId=50486.

See also a discussion about NARA's role at the ArchivesNext blog:

http://www.archivesnext.com/?p=137

And some more on NARA here:

http://freegovinfo.info/taxonomy/term/189

Web harvesting is not a definitive solution, though.

When we compare web harvesting with active deposit by the government of
documents in depository libraries we can get a glimpse at the scope of
the preservation problem we face.

Web harvesting puts the onus on the harvesters.

It releases the government from the obligation of actively depositing information.

It is a step back in time. It means that archivists and librarians have
less control over their own selection and acquisition and that agencies have
less responsibility.

WHERE IS GOVINFO? (ON THE WEB...?)
------------------------------------------------------------------------
A study by the Center for Democracy and Technology late last year
examined

"Why Important Government Information Cannot Be Found Through
Commercial Search Engines"
http://www.cdt.org/righttoknow/search/

The reasons for the failure of search engines to adequately index
government web sites are the same reasons that make it difficult for
web harvesting to be successful. If we can't find the information,
we cannot harvest it.

We cannot preserve what we cannot save.

BEYOND .GOV AND .MIL
------------------------------------------------------------------------
One problem (and a rapidly growing one) is that not all government information is
on the .GOV and .MIL domains.

Here are some examples:

- TWITTER.

twitter.com is the very popular "micro blogging" site, where people post
very short entries about what they are doing, where they are having lunch,
and so forth. Did you know that many government agencies twitter?
Among them:
- the White House Communications Office
- the Department of Health & Human Services: Office on Women's Health
- and more than 60 others.
- 20 or 30 members of Congress

http://twitter.pbwiki.com/USGovernment
http://www.sourcewatch.org/index.php?title=Members_of_Congress_who_Twitt...

- YOUTUBE

The military is actively using YouTube to post videos

http://articles.latimes.com/2007/may/01/world/fg-cyberwar1
U.S. military offers up its side of the Iraq war on YouTube
Los Angeles Times, by Alexandra Zavis, May 1, 2007, in print edition A-4

One military YouTube channel says that:

"Video clips document action as it appeared to personnel on the ground and
in the air as it was shot."
http://www.youtube.com/profile?user=MNFIRAQ

- NASA posts videos on YouTube and iTunes

- NOT JUST FEDERAL...

I'm mostly giving examples of the federal government today, but the trends
stretch across all levels of government.

For example:

- the PRINCE WILLIAM COUNTY SERVICE AUTHORITY IN VIRGINIA is posting
videos on YouTube.
http://www.fcw.com/print/22_25/technology/153418-1.html?type=pf

- the STATE OF CALIFORNIA has a YouTube channel
http://www.youtube.com/californiagovernment

- and GOVERNOR SCHWARZENEGGER posts to Twitter.
http://twitter.com/schwarzenegger

HOUSE MEMBERS
- According to GovTech magazine, more than 100 House members have
multimedia pages and YouTube links on their Web sites
http://www.govtech.com/gt/241670

FLICKR / LC

- FLICKR. I'm sure you read about the success the Library of Congress
has had by posting photos on flickr.com
http://www.flickr.com/photos/library_of_congress/

QIK.com

qik.com allows users to stream video live from their cell phones.

Congressman John Culberson of Texas is a big fan and has his own qik
channel where he streams and posts interviews, meetings and more.
http://qik.com/johnculberson

Is this official government information? or political? or both?

While these examples may strike you as "not official" or "non-governmental"
the point here is that the environment for distribution of information is
changing rapidly and we must keep up with the changes. If Culberson's qik
site is not "official," for example, we need a way of appraising it as such
and a way of differentiating it from the next channel that appears
that *is* official.

HYBRID SITES WITH MIXED MESSAGES
------------------------------------------------------------------------
The military is providing us with lots of examples of archiving problems.
These involve issues of provenance, use-rights, copyright, and just plain
finding and getting information.

This is an extension of what I call the "copyright poison-pill" in which
copyrighted material appears in an otherwise non-copyrighted government
publication and creates confusion over the rights of libraries and archives
to save, reproduce, and display any or all of such materials. We see this
today in the way the Google book project has blocked access to most
government publications because they "might" be covered by copyright.

DEPARTMENT OF DEFENSE
------------------------------------------------------------------------
The DOD "official website of Multi-National Force - Iraq" is a .com site,
not a .mil.

http://www.mnf-iraq.com/

There we find
- links to other commercial sites with streaming video without download
links

- web links designed to be clever (with javascript and hidden urls)
but which add an additional level of difficulty in identifying and
bookmarking links and downloading pages.

Another DOD site that provides video clips is .mil but a lot of the content
is actually hosted by .coms

DODvClips.mil

- While this is a .mil domain, it is actually operated by the Intel
Corporation and is hosted and maintained by a commercial organization
known as The FeedRoom or Globix Corporation

- While you can download video, you are bound by an END-USER LICENSE
AGREEMENT, in which Intel claims all proprietary rights to the content
and videos on the site.

- Those who try to harvest the content from this site will find
that the site instructs robots that they must not save copies of
videos or even web pages.
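
Sites usually give such instructions to robots in a robots.txt file,
and a well-behaved harvester checks that file before saving anything.
Here is a minimal sketch in Python using the standard library's
urllib.robotparser. The robots.txt text, the user-agent name, and the
example.gov URLs are hypothetical stand-ins for illustration; they are
not the actual contents of the DODvClips site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that tells every robot to stay out entirely.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /
"""


def may_harvest(robots_txt, url, user_agent="example-archive-bot"):
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A blanket "Disallow: /" blocks the video and the page that links to it.
    print(may_harvest(EXAMPLE_ROBOTS, "http://example.gov/video/clip1"))
    print(may_harvest(EXAMPLE_ROBOTS, "http://example.gov/index.html"))
```

A harvester that honors these instructions simply skips the content,
which is one way material on such a site never enters an archive.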

HORMUZ
------------------------------------------------------------------------
Here is another example of DOD video problems.

In January of 2008, the Pentagon broadcast a video of the "Strait of
Hormuz incident" in which an unidentified voice says, apparently to a US
battleship, "You will ... explode."

It was much in the news. (I found more than 1000 items on LexisNexis over
about a 4 week period.)

In January, two defense department web sites linked to a video of the
incident and one labeled it as "From Defense Department Video." One
of those pages still exists.
http://www.defenselink.mil/transcripts/transcript.aspx?transcriptid=4116

But by June 2008, that URL linked to a chef doing a promo for his show
called "Grill Sergeant" and searches for "hormuz" turned up zero hits.

Last week, when I checked again, the link I got was to a 15 second ad for
"the pentagon channel" that said:
"Embrace accountability for all that you do -- for everything in your
area of responsibility."

A shorter version of the Hormuz video, complete with the "you will explode"
quote, is available here:
http://www.defenselink.mil/dodcmsshare/briefingslide%5C320%5C080107-D-65...

Background information, including why it is hard to know what goes missing, is here:
Documenting the Government -- Strait of Hormuz edition
http://freegovinfo.info/node/1567

But, this is more than the tale of a broken link.

- The link was not to a .mil site, but to a commercial site (FEEDROOM again).

- The video was provided only as a streaming video and no download was
available.

So, here we have a critical piece of the historical record, with no
indication of who filmed it or edited it or posted it or took it down.

And we have no easy way to preserve this video and no guarantee that
anyone will or can take the responsibility for doing so.

WHAT MAKES A WEBSITE OFFICIAL?
------------------------------------------------------------------------
How do we determine what makes a website official?

One document I found is explicit, but vague. It says that

an "official website" includes any website hosted on the .mil domain,
but also "any website PUBLISHED or SPONSORED by a military command but
hosted on a commercial server."
http://www.mnf-iraq.com/images/stories/For_The_Troops/bloggers_policy.pd...

Unfortunately, this creates a cascade of problems.

- Will archivists overlook these sites because they are not .mil or
.gov sites?

- Upon finding them, can we identify who is actually responsible for
the content? (Were they Published or Sponsored by the government?)

- If we find the site and identify it as something that is
government-generated, are we allowed to archive it?

STANDARDS BY ANY OTHER NAME
------------------------------------------------------------------------
One of the biggest problems digital archivists face is that of file
formats. When formats are tied to particular software or operating
systems or operating environments, it creates barriers to preservation.

"Standards" that work well for the end-user (and the service provider)
one year may be exactly the wrong standard for the archivist.

We can see an example of user-friendly, archive-unfriendly at the EPA.

EPA
------------------------------------------------------------------------
The EPA has a nice site that has videos, audios, podcasts, and more.

But they have chosen the "Flash based" video format as a "standard."
This is indeed a common format for streaming video, but it adds additional
layers of difficulty for anyone wanting to preserve the videos by
downloading them.

http://www.epa.gov/multimedia/

Feds set sights on small screen
By Wade-Hahn Chan FCW August 11, 2008
http://www.fcw.com/print/22_25/technology/153418-1.html?type=pf

DISAPPEARING PUBLICATIONS
------------------------------------------------------------------------
There have not been any substantial, comprehensive studies of what gets
withdrawn from the web by government agencies.

(See "Chronology of Disappearing Government Information" Data collected
through May 8, 2002, Compiled by Barbara Miller for ALA/GODORT
Education Committee With special assistance of Karrie Peterson, for an
example of one attempt.
http://www.library.okstate.edu/Govdocs/chronchart.doc )

We are left with anecdotes about things disappearing or being
withdrawn and random discoveries of something here today and gone
tomorrow.

Anyone who works with government agencies for very long will encounter,
as I have over the years, as many "policies" as there are individuals
who administer those policies.

So we sometimes see agencies that are very careful about keeping older
documents online and others that express the opinion that "No one wants
last year's (or last month's) report."

Here is a recent example:
------------------------------------------------------------------------
AT: http://www.mnf-iraq.com/

We find a link to issue 6 of a "newspaper" but no links to, or indication
of, earlier issues being available.
http://www.mnf-iraq.com/images/Unit_Newsletters/080826_aam_al-binaa_engl...

E-GOV
------------------------------------------------------------------------
I have left for last the concept of "E-government" -- not because it is
less important, but because it is emerging and something to watch.

E-government is intended to transform the way government communicates with
citizens and business and itself.

To the extent that it creates communications that are faster, more
accurate, and more convenient, it is a Good Thing.

But, for us, it, again, fundamentally transforms the role of government.

In the past, the role of government in information dissemination ended at
the point of dissemination. Governments would collect and create and
assemble and edit and publish information products and distribute them to
the public and to libraries.

But today, the government is taking on a new, continuing role.

With e-government, governments are saying we must go to them to get our
information today, and tomorrow, and forever.

As governments move to e-government, we are going to increasingly see
government information provided as "transactions" as opposed to
"instantiations."

Here is a simple example:

I can call 411 and get a phone number: that's a transaction and is a
big improvement over having to locate and use a bulky telephone book
which may not even be current.

Lots of kinds of government information lend themselves to this kind of
transaction delivery and make for better, more accurate, more timely
service.

But, if I am a journalist and I want to look at a directory of all
employees in a department, or if I'm an historian and want to see who
was in a particular office last year (or 10 or 50 years ago), or if I'm
a demographer and I want to do a surname or given-name analysis of an
agency's employees, then a current, up-to-date
one-transaction-at-a-time system won't help me at all. I need an
instantiation of the information from one or more time periods.
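
The difference can be sketched in a few lines of Python. Everything here
is hypothetical and made up for illustration: the DIRECTORY contents, the
names, and the lookup and snapshot functions do not describe any real
agency system.

```python
import copy
from datetime import date

# Hypothetical staff directory; the names and offices are invented.
DIRECTORY = {
    "Ada Example": "Office of Records",
    "Ben Sample": "Office of Budget",
}

def lookup(name):
    """Transaction: answer one question about the directory as it is right now."""
    return DIRECTORY.get(name)

def snapshot(as_of):
    """Instantiation: a dated, complete copy that can be kept and re-used."""
    return {"as_of": as_of.isoformat(), "entries": copy.deepcopy(DIRECTORY)}

# Keep an instantiation, then let the live directory change.
saved = snapshot(date(2008, 1, 1))
del DIRECTORY["Ben Sample"]
DIRECTORY["Cy Placeholder"] = "Office of Budget"

print(lookup("Ben Sample"))            # the live service no longer lists him
print(saved["entries"]["Ben Sample"])  # the saved copy still does
```

The live lookup can only answer today's question; the saved copy is the
only thing a historian or demographer could analyze later.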

THE CENSUS
------------------------------------------------------------------------
Let me give you a concrete example: The Census

Every 10 years the federal government takes a population and housing census.

Through the government's American FactFinder web site, the Census bureau
delivers a transaction-based service where you can find census facts and
tables.
http://factfinder.census.gov/

But, in addition, the Bureau makes the raw, anonymized census data
available for downloading and has deposited the data in the largest social
science data archive in the U.S. (ICPSR at the University of Michigan).
http://www.icpsr.umich.edu/cocoon/ICPSR/SERIES/00166.xml

What this means for us is that we can preserve the census. There is an
instantiation of the census in a format that we can preserve over time.
That instantiation is what is behind American FactFinder, but it is a
preservable form of that information.

This means that we can preserve the data
- without crawling a web site
- even if the census bureau budget is cut and it takes data offline

It also means that the raw data are available for uses and re-uses
beyond the transactions that the bureau makes available.

This is a model for making government information available and preservable
and usable and re-usable for the long-term.

It is important to note that this model benefits users today, not just in
the future. Transaction-interfaces offer a limited number of possible uses
of the underlying information. When the raw data are available, users
can analyze, use, and re-use the data in many ways not provided by the
transaction-interface.

Clifford Lynch has written eloquently about the need that scholars
have to get access to the raw information in the realm of scholarly
literature (Clifford A. Lynch, "Open Computation: Beyond
Human-Reader-Centric Views of Scholarly Literatures," Open Access: Key
Strategic, Technical and Economic Aspects, Neil Jacobs Ed., Oxford: Chandos
Publishing, 2006, pp. 185-193.).
http://www.cni.org/staff/cliffpubs/OpenComputation.htm

Governments may not like the idea of doing this, though. They may want to
keep control and may want to do so under the guise of "accuracy." (E.g.,
"Last year's phone book isn't accurate anymore. We don't want copies of it
out in the world confusing people.") Indeed, we hear that very argument
from some who still argue that the people should not have free open access
to Congressional Research Service reports. Local governments in particular
may also see information as an "asset" and wish to charge for access or
use of it.

And the private sector may not like the idea of raw information being
freely distributed because they want to control access so they can charge
for it. (Indeed we see something like that with CRS reports!)

It may be a challenge to get governments to understand this concept and,
once they do, to embrace distribution.

WHAT CAN WE DO?
------------------------------------------------------------------------

There is no single solution. And we should not expect any single entity or
agency or archive to "solve" the problems.

We need a multifaceted approach to preserving the historical record.

Here are some general approaches that I hope will guide you in your
local environments.

1) Do you have influence over the creation of information? Then make
sure that the information is created with preservation in mind. Talk
to creators about providing an instantiation of information in addition
to transaction-based access. Advocate free and open access. Insist on
open formats (e.g., ODF http://opendocument.xml.org/) rather than
proprietary formats.

2) Identify your partners in your organization.
- IT depts. They may have tools that will help you do your job. They may
be able to do things differently that would enable preservation, but they
haven't thought of them.

- Managers who want information access in the near term. Managers may not
think of long-term access and usability, but they usually do understand the
benefits of having their own information usable in the near-term (1 to 5 years).
If you can *guarantee* something will be usable in 5 years, you can probably
guarantee that you are going to be able to preserve it for longer periods.

3) Identify other partners
- The Internet Archive is doing a lot right now to preserve information
on the web and you can work with them to have them do preservation for you.
http://www.archive.org/index.php
http://www.archive.org/create/
http://www.archive-it.org/

- Look for others locally and regionally with whom you can collaborate.
Universities may want to collaborate with governments and vice-versa, for
example.

4) Are you a partner?
Even if you are in an archive that clearly has no responsibility for
preservation of (say) the records of an agency, you may still have the
opportunity (and obligation) to collect information that is relevant to
and even part of the complete historical record, either because of your
own archival mandates (you have personal records of a government
official, soldier, or elected official) or because of your constituency
(users at a university who need the complete record for historical
analysis).

The library model of having many copies dispersed over many institutions
has worked well for preserving and authenticating published materials, and
it may work in the archival environment as well when we are no longer tied
to a single copy of record. Software already exists to help with this:
Lots of Copies Keep Stuff Safe
http://www.lockss.org/lockss/Home
http://www.clockss.org/clockss/Home
http://lockss-docs.stanford.edu/
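
As a rough illustration of why many dispersed copies help with
authentication, here is a minimal Python sketch that compares copies held
by several institutions and flags the one that disagrees with the
majority. This is not the LOCKSS protocol; the audit function, the
library names, and the sample text are all hypothetical.

```python
import hashlib
from collections import Counter

def digest(data: bytes) -> str:
    """SHA-256 fingerprint of one copy."""
    return hashlib.sha256(data).hexdigest()

def audit(copies: dict) -> tuple:
    """Return (majority digest, sorted list of holders whose copy disagrees)."""
    digests = {holder: digest(data) for holder, data in copies.items()}
    majority, _ = Counter(digests.values()).most_common(1)[0]
    suspects = sorted(h for h, d in digests.items() if d != majority)
    return majority, suspects

# Three hypothetical holders of the same document; one copy has been altered.
copies = {
    "library_a": b"Annual report, 2007",
    "library_b": b"Annual report, 2007",
    "library_c": b"Annual report, 2OO7",  # letter O substituted for zero
}
majority, suspects = audit(copies)
print(suspects)
```

With a single copy of record there is nothing to compare against; with
several, a damaged or altered copy stands out.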

I want to close with a quote from that same Technology Review article
that I quoted earlier.

In it, computer scientist Robert F. Sproull of Sun Microsystems
Laboratories, who chaired a National Academy of Sciences panel that
advised NARA, said:

"If you become obsessed with getting the technical solution, you will
never build an archive."

The challenges we face are as much political, sociological, and
economic, as technological.

In Case You Didn't Already Know...

...the U.S. is not the leader in e-Government...at least according to a study released last week by the Brookings Institution. We do rank third, but we are "falling behind other countries in broadband access, public-sector innovation and implementation of the latest interactive tools to federal Web sites".

Two other articles I read this morning also got me thinking about where we stand as a nation with digital government information: "Old-school Recordkeeping Meets the Digital Age" and "Government Data and the Invisible Hand". The first article made me feel quite frustrated with our lack of digital preservation progress, especially after reading this quote:

"...lacking a statutory prescription for maintaining electronic records, most agencies print and file [records] as they would paper documents, according to a recent investigation by the Government Accountability Office...Under current regulations, NARA does not require agencies to maintain records in their native formats. So for now, many agencies still print e-mail messages and file the paper versions. Although the filing process is relatively easy, the practice has a major weakness: It eliminates the searchability of digital documents". (Gee, ya think?!)

Envisioning all those emails being printed by government agency employees makes me think of Google's April Fool's joke: the "Google Paper" service!

I hope the next President and his administration will take the issue of e-government and digital preservation/authentication very seriously. Obama and McCain have touched on the issue a bit, including Obama's vague vision of online government transparency:

"I want people to be able to know, today, this issue is going on...Today, President Obama talked about his proposal for $4,000 student college-tuition credits. It’s going to be going to this congressional committee, these are the key leaders in the House and Senate who are going to be deciding on the bill, here are the groups that support it, you should contact your congressman. The more that we can enlist the American people to stay involved, that’s the only way we can move an agenda forward."

The second article touches on this issue as well, and urges the next Presidential administration to "embrace the potential of Internet-enabled government transparency [by reducing] the federal role in presenting important government information to citizens". A profound statement, but read the rest of their argument as stated in the abstract:

"Today, government bodies consider their own websites to be a higher priority than technical infrastructures that open up their data for others to use. We argue that this understanding is a mistake. It would be preferable for government to understand providing reusable data, rather than providing websites, as the core of its online publishing responsibility.

Rather than struggling, as it currently does, to design sites that meet each end-user need, we argue that the executive branch should focus on creating a simple, reliable and publicly accessible infrastructure that exposes the underlying data. Private actors, either nonprofit or commercial, are better suited to deliver government information to citizens and can constantly create and reshape the tools individuals use to find and leverage public data. The best way to ensure that the government allows private parties to compete on equal terms in the provision of government data is to require that federal websites themselves use the same open systems for accessing the underlying data as they make available to the public at large".

This makes sense if you think of it from the context of all the mashups, RSS feeds, and other interactivity with web content that exists. The rest of the article makes some other interesting points and counterarguments, such as

"A government data provider can provide a digital signature alongside each data item. A third party site that presents the data can offer a copy of the signature along with the data, allowing the user to verify the authenticity of the data item, by verifying the digital signature, without needing to visit the government site directly".

Easier said than done? Is the "digital signature" they talk about the same as GPO Digital Authentication?
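
To make the verification step a little more concrete, here is a much simplified Python sketch. It is a checksum comparison, not a public-key digital signature: a real signature would also prove who issued the data, which a bare checksum cannot do. The function names and the sample text are hypothetical, and the sketch assumes the agency publishes a fingerprint for each data item.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """SHA-256 fingerprint of a data item."""
    return hashlib.sha256(data).hexdigest()

def matches_published(data: bytes, published: str) -> bool:
    """True if a third-party copy has the fingerprint the agency published."""
    return fingerprint(data) == published

# Hypothetical data item; the fingerprint is assumed to come from the agency.
original = b"Fact sheet text as issued by the agency"
published = fingerprint(original)

print(matches_published(original, published))                 # unmodified copy
print(matches_published(original + b" (edited)", published))  # altered copy
```

A user who obtains the fingerprint from a trusted source could check a third-party copy without fetching the document itself from the government site.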

We are making some progress in e-Government and digital preservation of government information but we need to do better. Like Obama said, we can start by contacting our congressmen to voice our concerns and suggestions for improvement on e-Gov initiatives and digital preservation...because I don't know about you, but I sure don't want the government to use "Google Paper".

Library Partnership Saves Government Sites

Library Partnership Preserves End-of-Term Government Web Sites, The Library of Congress, News Releases, August 14, 2008.

The Library of Congress, the California Digital Library, the University of North Texas Libraries, the Internet Archive and the U.S. Government Printing Office today announced a collaborative project to preserve public United States Government web sites at the end of the current presidential administration ending January 19, 2009. This harvest is intended to document federal agencies' online archive during the transition of government and to enhance the existing collections of the five partner institutions.

...The project will also call upon government information specialists -- including librarians, political and social science researchers, and academics -- to assist in the selection and prioritization of web sites to be included in the collection, as well as identifying the frequency and depth of the act of collecting. The Government Printing Office will lend expertise to the curation process along with libraries in its Federal Depository Library Program. A tool has been designed by the project team and developed by the University of North Texas to facilitate the collaborative work of these specialists, and will be made available to participants in August 2008.

See also: Project will preserve Bush administration Web sites, By Jill R. Aitoro, NextGov (08/15/08).

The Federal Government Must Reimagine Its Role As An Information Provider

Here is a pre-print (not-final version) of a paper with fascinating ideas about distribution of government information:

They say that "the federal government must reimagine its role as an information provider" and more specifically, that the next administration should...

...reduce the federal role in presenting important government information to citizens. Today, government bodies consider their own websites to be a higher priority than technical infrastructures that open up their data for others to use. We argue that this understanding is a mistake. It would be preferable for government to understand providing reusable data, rather than providing websites, as the core of its online publishing responsibility.

While the paper does not address preservation and long term access explicitly, it does suggest that the government should provide a "permanent location" with a permanent URL for "each piece of government data." It also implies (I think) that something like LOCKSS will ensure authenticity and permanent access ("As long as there is vigorous competition between third party sites, we expect most citizens will be able to find a site provider they trust.") I believe that oversimplifies the problem and relies too much on hope and not enough on a social commitment to preservation through public funding of memory organizations.

Thanks and a tip of the hat to Joshua Tauberer (GovTrack.us) for pointing to this article. He describes and comments on the paper in a post on the Open House Project blog: (Government Data and the Invisible Hand June 6th, 2008 by Joshua Tauberer).

The White House: Off Limits to Historians?

Meredith Fuchs, the general counsel of the National Security Archive at George Washington University, writes that the Bush administration's hostility towards public access to and preservation of records combined with changes in technology that have transformed the way in which we all communicate are leading to a situation in which "primary sources on the most important decisions and activities in the government may be lost, destroyed, or closed to the public." [emphasis added]

  • The White House: Off Limits to Historians? by Meredith Fuchs, Passport: The Newsletter of the Society for Historians of American Foreign Relations (5-1-08), posted at History News Network on Thursday, May 8, 2008.

[O]ver the last seven years there have been a series of moves by the current administration that may ensure that the records of the White House and the federal offices and agencies that work closely with the White House will not be available to historians.

Agencies not complying with record preservation policies

Agencies not complying with record preservation policies, By Jill R. Aitoro, NextGov, April 24, 2008.

At the hearing, Linda Koontz, director of information management issues at the Government Accountability Office, released preliminary results from an ongoing GAO study of how four agencies managed e-mail and electronic records. ...Koontz said the agencies print and then file e-mails, but about half of senior officials were not following these procedures, and the e-mails for these officials were maintained in e-mail systems that lacked record-keeping capabilities, such as the ability to group the e-mails using a classification system.

The House is considering the Electronic Communications Preservation Act, which would strengthen policies for preservation of government records including White House e-mails.

Gary Stern, general counsel for NARA said that the legislation's potential cost to agencies could be "astronomical," and noted the bill's requirement that the National Archives would maintain authority over the White House's electronic records might be unconstitutional.

Patrice McDermott, director of OpentheGovernment.org, said:

"I understand the constitutional issues, and I don't have a good answer for that.... But one of the concerns is that there is no way to enforce accountability [of] records management in the White House. We understand it's a difficult dance [for NARA]. They're there at the invitation of the White House in many cases, but there needs to be some way for the outside community to hold the White House accountable."

NARA will NOT harvest at end of current administration

According to a post on .govwatch (The National Archives Is Quietly Destroying Millions of Documents April 08, 2008 by Coby Logen), a recent memo at the National Archives and Records Administration says:

After considering our other records management program priorities for FY 2008, availability of harvested web content at other "archiving" sites (e.g., www.archive.org), and the resources required for conducting and preserving a government-wide web snapshot, NARA has determined that we will not conduct a web harvest or snapshot at the end of the current Administration.

Logen says that "Not capturing federal web sites now may mean losing millions of web pages authored under the Bush administration when leadership changes in January 2009."

John Wonderlich at the Sunlight Foundation comments that "The fact that digital preservation is done by others outside NARA isn't an excuse for NARA to abdicate their responsibility, but an argument that they should be capable of fulfilling it." (Digital Preservation Under Threat? by John Wonderlich on April 9, 2008)

This seems yet another example of the government saying it cannot and therefore it won't. (The NARA/TGN contract as a bad precedent). Call it the Katrina of digital preservation?

The New York Times summed up the underlying issue nicely yesterday: "In Storing 1's and 0's, the Question Is $" (By John Schwartz, New York Times, April 9, 2008). It is not a technological issue; it is an issue of funding and policy and control. (See: The Technical is Political.)

Can we identify or verify or prevent government website scrubbing?

One issue we at FGI are concerned about is that, when government information is not officially distributed to depository libraries and when official digital government information is available only from government-controlled web servers, then that information can (intentionally or unintentionally) be deleted or altered leaving historians, journalists, economists and other citizens with no clear, complete record of government activities.

From time to time there are stories about government websites being "scrubbed," i.e., of information being removed from them, but it is often difficult to determine if these stories are accurate. Since stories like this are often (perhaps, usually) published to make a political point, discussion of them often revolves around the political issue rather than the issue of the integrity and permanence of government information in the larger sense.

One such story this week gives us an opportunity to at least quickly and superficially examine the existence of a problem, if not its extent:

Perr says that a flash animation and a paragraph on tax cuts, which were on the White House Jobs and Economic Growth web page (also referred to as the "Economy & Budget Policies in Focus" web page) on March 16, 2008, were removed and no longer available on March 20. The animation said:

  • "18,000 jobs created in December 2007,"
  • "Over 8.3 million new jobs created since August 2003"
  • "Unemployment rate remains low at 5%."
  • "President Bush's actions are moving our economy forward"

And the deleted paragraph read:

President Bush Continues To Call On Congress To Further Reduce Economic Uncertainty By Making His Tax Relief Permanent.

President Bush believes the most important action to ensure the long-term health of our economy is to make sure the tax relief that is now in place is made permanent. The 2001 and 2003 tax cuts are set to expire in less than three years. If Congress allows that to happen, 116 million taxpayers will see their taxes go up by $1,800 on average, and we will see an end to many of the measures that have helped our economy grow – including the 10 percent individual income tax bracket, reductions in the marriage penalty, the expansion of the child tax credit, and reduced rates on regular income, capital gains, and dividends.

Perr discovered that MSN has a cached copy of that page (dated 3/8/2008) that includes the animation and text. This morning, I used WebCite to make a copy of the MSN copy. (The WebCite copy does not do a good job of retaining the layout of the original, but the Flash animation is there and viewable as is the text paragraph and should remain there even after MSN removes its cached copy.)

I checked the Internet Archive, but the most recent snapshot of www.whitehouse.gov/infocus/economy/ as of this morning is June 7, 2007. I did some Google searching and was not able to locate the Flash animation, but I was able to locate a series of nearly identical ones:

If Google is an accurate way to judge the content of whitehouse.gov, it would appear that the White House has maintained earlier versions of this animation but has not preserved this more recent one. But we do not know how accurate or comprehensive or current Google is.

I also browsed the White House News releases for March 2008 page, because it appeared that similar information had migrated to various "Fact Sheets." Indeed, the text paragraph is in the March 7, 2008 Fact Sheet: Taking Responsible Action to Keep Our Economy Growing. I was not able to find a link to the animations, however.

This brings me to the question: "Can we identify or verify or prevent government website scrubbing?" My own tentative conclusions are:

  • We cannot prevent the government from changing its own websites, so we cannot prevent "scrubbing."
  • We can verify that a site has changed, but currently our tools are limited to a) commercial web crawlers (like Google, MSN, and the Internet Archive), b) individuals who regularly monitor websites, and c) web crawlers created by libraries using their own tools or those provided by others (such as Archive-it).
  • While tools exist to monitor changes in a web site (e.g., Change Detection), I don't believe that we can use these to look for significant (e.g., loss-of-information) alterations.
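
As a sketch of what a loss-of-information check might look like, assume that two saved text snapshots of the same page are available. The short Python example below lists paragraphs that appear in the older snapshot but not in the newer one. The function names and the sample text are hypothetical, and deciding whether a removal is significant would still take human judgment.

```python
def paragraphs(text):
    """Split text on blank lines and drop empty pieces."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def removed_paragraphs(old_text, new_text):
    """Paragraphs present in the old snapshot but missing from the new one."""
    kept = set(paragraphs(new_text))
    return [p for p in paragraphs(old_text) if p not in kept]

# Invented snapshots of one page, taken on two different days.
old_snapshot = "Page title\n\nA paragraph about tax policy.\n\nContact information."
new_snapshot = "Page title\n\nContact information."

print(removed_paragraphs(old_snapshot, new_snapshot))
```

This only works if someone saved the earlier snapshot, which is exactly the collecting role that is missing today.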

What conclusions can we draw from all this? Since we do not know how commercial indexers such as Google and MSN work and what their criteria are and since they do not have preservation as a mission, we can hardly rely on them. While this particular example may be trivial in itself, it demonstrates that government information in the digital age, the "e-government" age, is volatile and fragile and that we do not have a system in place that is as reliable for digital content as the FDLP libraries were for non-digital content. While it is hard to imagine a system that would be robust enough to catch every single digital bit of government information from every agency for all time, it is possible to imagine a system that would capture much more than we do now.

That leads me to a conclusion that we at FGI have long advocated: Libraries should be building collections of digital government information and GPO should facilitate this by depositing government information in FDLP libraries. If libraries created collections that could be text-mined by scholars and researchers, it would be possible to better audit, analyze, and preserve government information and make it more difficult for information to be scrubbed without being discovered and exposed. Indeed, it would remove, to some extent, the motivation to "scrub" if it was well known that the information was preserved and easily discoverable.

The question we should be asking ourselves is: How much are we losing every day? The task is too big for any one library or any one government agency (i.e., GPO). And it is not a task that commercial entities like Google and MSN are likely to take on.

A Wiki Grows at EPA

The February 4, 2008 issue of Government Computer News carries an interesting interview:

Molly O'Neill | EPA the Web 2.0 way
GCN Interview By Joab Jackson
http://www.gcn.com/print/27_3/45741-1.html?topic=&CMP=OTC-RSS

The article talks about some of the EPA's experiments with web 2.0 technologies including wikis. One of the wikis arose out of the Puget Sound Information Challenge:

So we decided to use the mashup camp as our staging area for the wiki. We had a form on the wiki site that you could download, fill out and send in. We also sent up an e-mail address and a phone number.

It was a little scary because we hadn’t told anyone about this beforehand. What if no one contributed? That wasn’t a problem — we had so many people interested and providing useful information.

We had people building applications. National librarians were culling data for library resources. We had people help organize it. The interesting thing was to watch how many hits we were getting through social networking. People took my e-mail and sent it to other people, who sent it off to even more people. We had a blog from Germany weigh in. We had over 17,000 page views and 175 good contributions.

We learned a lot, and we delivered something as well — in fact, several of us are going to Seattle to meet with the council to talk about these tools. They have to write a strategic plan, so maybe they could write a strategic plan with the wiki online. Instead of spending months trying to gather data, they could do it a lot faster using social networking.

Wikis are interesting animals as government documents. While they are very changeable, wikis carry their own version control. Think about what implications that might have if you think a wiki is worth saving for preservation. Would you try to copy every version? Take a snapshot once a month? Or decide it was ephemera you didn't need? We'd like to know what you think. If you'd like to see EPA's Puget Sound wiki for yourself, please visit http://pugetsound.epageo.org/.

As a tool for quickly gathering community input, I think EPA is onto something, especially if most contributors are identified. It would become easier to distinguish special interest group input from regular community input. Or at least the potential is there.

Aside from the wiki, the interview has a great insight from Ms. O'Neill that I think has relevance to the library community. She is asked "Why do you think federal agencies have such a hard time disseminating information on the Web?" and the last part of her answer is:

But the third reason is that we tend to organize data in a way that it makes sense to us. Although this is changing a little bit now, at EPA we still primarily organize our data by how we are organized as an agency. People outside the agency don’t think of things that way. They get frustrated because they want all the information about a subject, like climate change or environmental indicators. So where do they go? We’re doing a lot to improve search on our site. When you do a search on the main page, it will give you folder options. When you type in “waste water,” it will organize by folder topics like stormwater or industrial effluent.

This is both warning and opportunity for libraries. The warning is that we also tend to organize data in a way that makes sense to us in databases (catalogs) that make sense to us but not to users. But the good news is that one of the ways we organize materials is by subject. And documents librarians are very good about searching across agency boundaries for materials. It's one of the many ways we add value to government information.
