preservation

The Federal Government Must Reimagine Its Role As An Information Provider

Here is a pre-print (not-final version) of a paper with fascinating ideas about distribution of government information:

They say that "the federal government must reimagine its role as an information provider" and more specifically, that the next administration should...

...reduce the federal role in presenting important government information to citizens. Today, government bodies consider their own websites to be a higher priority than technical infrastructures that open up their data for others to use. We argue that this understanding is a mistake. It would be preferable for government to understand providing reusable data, rather than providing websites, as the core of its online publishing responsibility.

While the paper does not address preservation and long term access explicitly, it does suggest that the government should provide a "permanent location" with a permanent URL for "each piece of government data." It also implies (I think) that something like LOCKSS will ensure authenticity and permanent access ("As long as there is vigorous competition between third party sites, we expect most citizens will be able to find a site provider they trust.") I believe that oversimplifies the problem and relies too much on hope and not enough on a social commitment to preservation through public funding of memory organizations.

Thanks and a tip of the hat to Joshua Tauberer (GovTrack.us) for pointing to this article. He describes and comments on the paper in a post on the Open House Project blog: (Government Data and the Invisible Hand June 6th, 2008 by Joshua Tauberer).

The White House: Off Limits to Historians?

Meredith Fuchs, the general counsel of the National Security Archive at George Washington University, writes that the Bush administration's hostility towards public access to and preservation of records combined with changes in technology that have transformed the way in which we all communicate are leading to a situation in which "primary sources on the most important decisions and activities in the government may be lost, destroyed, or closed to the public." [emphasis added]

  • The White House: Off Limits to Historians? by Meredith Fuchs, Passport: The Newsletter of the Society for Historians of American Foreign Relations (5-1-08), posted at History News Network on Thursday, May 8, 2008.

[O]ver the last seven years there have been a series of moves by the current administration that may ensure that the records of the White House and the federal offices and agencies that work closely with the White House will not be available to historians.

Agencies not complying with record preservation policies

Agencies not complying with record preservation policies, By Jill R. Aitoro, NextGov, April 24, 2008.

At the hearing, Linda Koontz, director of information management issues at the Government Accountability Office, released preliminary results from an ongoing GAO study of how four agencies managed e-mail and electronic records. ...Koontz said the agencies print and then file e-mails, but about half of senior officials were not following these procedures, and the e-mails for these officials were maintained in e-mail systems that lacked record-keeping capabilities, such as the ability to group the e-mails using a classification system.

The House is considering the Electronic Communications Preservation Act, which would strengthen policies for preservation of government records including White House e-mails.

Gary Stern, general counsel for NARA, said that the legislation's potential cost to agencies could be "astronomical," and noted the bill's requirement that the National Archives would maintain authority over the White House's electronic records might be unconstitutional.

Patrice McDermott, director of OpentheGovernment.org, said:

"I understand the constitutional issues, and I don't have a good answer for that.... But one of the concerns is that there is no way to enforce accountability [of] records management in the White House. We understand it's a difficult dance [for NARA]. They're there at the invitation of the White House in many cases, but there needs to be some way for the outside community to hold the White House accountable."

NARA will NOT harvest at end of current administration

According to a post on .govwatch (The National Archives Is Quietly Destroying Millions of Documents April 08, 2008 by Coby Logen), a recent memo at the National Archives and Records Administration says:

After considering our other records management program priorities for FY 2008, availability of harvested web content at other "archiving" sites (e.g., www.archive.org), and the resources required for conducting and preserving a government-wide web snapshot, NARA has determined that we will not conduct a web harvest or snapshot at the end of the current Administration.

Logen says that "Not capturing federal web sites now may mean losing millions of web pages authored under the Bush administration when leadership changes in January 2009."

John Wonderlich at the Sunlight Foundation comments that "The fact that digital preservation is done by others outside NARA isn't an excuse for NARA to abdicate their responsibility, but an argument that they should be capable of fulfilling it." (Digital Preservation Under Threat? by John Wonderlich on April 9, 2008)

This seems yet another example of the government saying it cannot and therefore it won't. (The NARA/TGN contract as a bad precedent). Call it the Katrina of digital preservation?

The New York Times sums up the underlying issue nicely yesterday: "In Storing 1's and 0's, the Question Is $" (By John Schwartz, New York Times, April 9, 2008). It is not a technological issue; it is an issue of funding and policy and control. (See: The Technical is Political.)

Can we identify or verify or prevent government website scrubbing?

One issue we at FGI are concerned about is that, when government information is not officially distributed to depository libraries and when official digital government information is available only from government-controlled web servers, then that information can (intentionally or unintentionally) be deleted or altered leaving historians, journalists, economists and other citizens with no clear, complete record of government activities.

From time to time there are stories about government websites being "scrubbed," i.e., of information being removed from them, but it is often difficult to determine if these stories are accurate. Since stories like this are often (perhaps, usually) published to make a political point, discussion of them often revolves around the political issue rather than the issue of the integrity and permanence of government information in the larger sense.

One such story this week gives us an opportunity to at least quickly and superficially examine the existence of a problem, if not its extent:

Perr says that a flash animation and a paragraph on tax cuts, which were on the White House Jobs and Economic Growth web page (also referred to as the "Economy & Budget Policies in Focus" web page) on March 16, 2008, were removed and no longer available on March 20. The animation said:

  • "18,000 jobs created in December 2007,"
  • "Over 8.3 million new jobs created since August 2003"
  • "Unemployment rate remains low at 5%."
  • "President Bush's actions are moving our economy forward"

And the deleted paragraph read:

President Bush Continues To Call On Congress To Further Reduce Economic Uncertainty By Making His Tax Relief Permanent.

President Bush believes the most important action to ensure the long-term health of our economy is to make sure the tax relief that is now in place is made permanent. The 2001 and 2003 tax cuts are set to expire in less than three years. If Congress allows that to happen, 116 million taxpayers will see their taxes go up by $1,800 on average, and we will see an end to many of the measures that have helped our economy grow – including the 10 percent individual income tax bracket, reductions in the marriage penalty, the expansion of the child tax credit, and reduced rates on regular income, capital gains, and dividends.

Perr discovered that MSN has a cached copy of that page (dated 3/8/2008) that includes the animation and text. This morning, I used WebCite to make a copy of the MSN copy. (The WebCite copy does not do a good job of retaining the layout of the original, but the Flash animation is there and viewable, as is the text paragraph, and both should remain there even after MSN removes its cached copy.)

I checked the Internet Archive, but the most recent snapshot of www.whitehouse.gov/infocus/economy/ as of this morning is June 7, 2007. I did some Google searching and was not able to locate the Flash animation, but I was able to locate a series of nearly identical ones:

If Google is an accurate way to judge the content of whitehouse.gov, it would appear that the White House has maintained earlier versions of this animation but has not preserved this more recent one. But we do not know how accurate or comprehensive or current Google is.

I also browsed the White House News releases for March 2008 page, because it appeared that similar information had migrated to various "Fact Sheets." Indeed, the text paragraph is in the March 7, 2008 Fact Sheet: Taking Responsible Action to Keep Our Economy Growing. I was not able to find a link to the animations, however.

This brings me to the question: "Can we identify or verify or prevent government website scrubbing?" My own tentative conclusions are:

  • We cannot prevent the government from changing its own websites, so we cannot prevent "scrubbing."
  • We can verify that a site has changed, but currently our tools are limited to a) commercial web crawlers (like Google, MSN, and the Internet Archive), b) individuals who regularly monitor websites, and c) web crawlers created by libraries using their own tools or those provided by others (such as Archive-It).
  • While tools exist to monitor changes in a web site (e.g., Change Detection), I don't believe that we can use these to look for significant (e.g., loss-of-information) alterations.
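To make the "loss-of-information" idea concrete, here is a minimal sketch of how two saved text snapshots of a page could be compared to list paragraphs that were in the earlier copy but are gone from the later one. It assumes we already hold both snapshots as plain text with blank lines between paragraphs; the function name `removed_paragraphs` is hypothetical and is not part of any of the tools mentioned above.

```python
import difflib


def removed_paragraphs(old_text, new_text):
    """Return paragraphs found in the old snapshot but not in the new one.

    Assumption: both snapshots are plain text, paragraphs separated by
    blank lines. An edited paragraph is also reported, because its old
    wording no longer appears.
    """
    old_paras = [p.strip() for p in old_text.split("\n\n") if p.strip()]
    new_paras = [p.strip() for p in new_text.split("\n\n") if p.strip()]

    # Compare the two snapshots paragraph by paragraph.
    matcher = difflib.SequenceMatcher(a=old_paras, b=new_paras, autojunk=False)

    removed = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" = dropped outright; "replace" = old wording replaced.
        if tag in ("delete", "replace"):
            removed.extend(old_paras[i1:i2])
    return removed
```

A comparison like this only flags candidates; someone still has to judge whether a removed paragraph was a routine update or a significant deletion, and it is only as good as the snapshots someone thought to save.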

What conclusions can we draw from all this? Since we do not know how commercial indexers such as Google and MSN work and what their criteria are and since they do not have preservation as a mission, we can hardly rely on them. While this particular example may be trivial in itself, it demonstrates that government information in the digital age, the "e-government" age, is volatile and fragile and that we do not have a system in place that is as reliable for digital content as the FDLP libraries were for non-digital content. While it is hard to imagine a system that would be robust enough to catch every single digital bit of government information from every agency for all time, it is possible to imagine a system that would capture much more than we do now.

That leads me to a conclusion that we at FGI have long advocated: Libraries should be building collections of digital government information and GPO should facilitate this by depositing government information in FDLP libraries. If libraries created collections that could be text-mined by scholars and researchers, it would be possible to better audit, analyze, and preserve government information and make it more difficult for information to be scrubbed without being discovered and exposed. Indeed, it would remove, to some extent, the motivation to "scrub" if it was well known that the information was preserved and easily discoverable.

The question we should be asking ourselves is: How much are we losing every day? The task is too big for any one library or any one government agency (i.e., GPO). And it is not a task that commercial entities like Google and MSN are likely to take on.

A Wiki Grows at EPA

The February 4, 2008 issue of Government Computer News carries an interesting interview:

Molly O'Neill | EPA the Web 2.0 way
GCN Interview By Joab Jackson
http://www.gcn.com/print/27_3/45741-1.html?topic=&CMP=OTC-RSS

The article talks about some of the EPA's experiments with web 2.0 technologies including wikis. One of the wikis arose out of the Puget Sound Information Challenge:

So we decided to use the mashup camp as our staging area for the wiki. We had a form on the wiki site that you could download, fill out and send in. We also sent up an e-mail address and a phone number.

It was a little scary because we hadn’t told anyone about this beforehand. What if no one contributed? That wasn’t a problem — we had so many people interested and providing useful information.

We had people building applications. National librarians were culling data for library resources. We had people help organize it. The interesting thing was to watch how many hits we were getting through social networking. People took my e-mail and sent it to other people, who sent it off to even more people. We had a blog from Germany weigh in. We had over 17,000 page views and 175 good contributions.

We learned a lot, and we delivered something as well — in fact, several of us are going to Seattle to meet with the council to talk about these tools. They have to write a strategic plan, so maybe they could write a strategic plan with the wiki online. Instead of spending months trying to gather data, they could do it a lot faster using social networking.

Wikis are interesting animals as government documents. While they are very changeable, wikis carry their own version control. Think about what implications that might have if you think a wiki is worth saving for preservation. Would you try to copy every version? Take a snapshot once a month? Or decide it was ephemera you didn't need? We'd like to know what you think. If you'd like to see EPA's Puget Sound wiki for yourself, please visit http://pugetsound.epageo.org/.

As a tool for quickly gathering community input, I think EPA is onto something, especially if most contributors are identified. It would become easier to distinguish special interest group input from regular community input. Or at least the potential is there.

Aside from the wiki, the interview has a great insight from Ms. O'Neill that I think has relevance for the library community. She is asked "Why do you think federal agencies have such a hard time disseminating information on the Web?" and the last part of her answer is:

But the third reason is that we tend to organize data in a way that it makes sense to us. Although this is changing a little bit now, at EPA we still primarily organize our data by how we are organized as an agency. People outside the agency don’t think of things that way. They get frustrated because they want all the information about a subject, like climate change or environmental indicators. So where do they go? We’re doing a lot to improve search on our site. When you do a search on the main page, it will give you folder options. When you type in “waste water,” it will organize by folder topics like stormwater or industrial effluent.

This is both warning and opportunity for libraries. The warning is that we also tend to organize data in a way that makes sense to us in databases (catalogs) that make sense to us but not to users. But the good news is that one of the ways we organize materials is by subject. And documents librarians are very good about searching across agency boundaries for materials. It's one of the many ways we add value to government information.

Documenting the Government -- Strait of Hormuz edition

The recent encounter between U.S. warships and Iranian boats in the Strait of Hormuz provides an opportunity to reflect on the role of depository libraries in the digital age.

Background

The incident occurred on January 6. On January 7, Navy Vice Adm. Kevin J. Cosgriff briefed Pentagon reporters via video teleconference from his headquarters in Manama, Bahrain. On the defenselink web site, a transcript of that news conference includes a link to a page with a video of the incident. The video includes a voice saying "You will explode [unintelligible] minutes." In an undated post, apparently from January 7 or 8, on the U.S. Central Command web site, there is a story that includes a link to the same video and a still photo from it labeled "From Defense Department Video."

The incident received widespread news coverage. This morning, a search on news.google.com for "hormuz iran explode" retrieved over 2000 stories.

The initial stories were followed by accusations from Iran that the video was faked and second thoughts and analyses.

According to a story on TelegraphTV (The Pentagon have released amateur footage of the alleged encounter between the US Navy and Iranian vessels) the "amateur footage" released by the Pentagon was "filmed by a crew member of the USS Hopper" and "the audio and video recordings were made separately but have been put together in a compilation which showed more than 20 minutes of the alleged confrontation."

You can see copies of the 4 minute and 20 second video here:

Government Information Issues.

While the government has hundreds if not thousands of videos online, most go unnoticed by most Americans. But this video got extremely high media attention at a time when newspapers are running stories with headlines like "Bush's Iran war plot."

One analysis of the news coverage said:

ABC's Jonathan Karl quoting a Pentagon official as saying the Iranian boats "were a heartbeat from being blown up".

Bush administration officials seized on the incident to advance the portrayal of Iran as a threat and to strike a more threatening stance toward Iran. National Security Adviser Stephen Hadley declared Wednesday that the incident "almost involved an exchange of fire between our forces and Iranian forces". President George W. Bush declared during his Mideast trip Wednesday that there would be "serious consequences" if Iran attacked U.S. ships and repeated his assertion that Iran is "a threat to world peace".
- Official Version of Naval Incident Starts to Unravel Analysis by Gareth Porter, IPS, Jan 10, 2008.

This prompts me to ask several questions related to the role of the FDLP and libraries that wish to preserve and provide access to government information.

  1. Is the video an official government publication? You can ask this question from the point of view of FDLP and Title 44 and wonder if it is "within scope of GPO's information dissemination programs." GPO is careful to not get information that is not within its scope (see: Web Publication Harvesting). But we can also ask, Is an "amateur video" by a crewman on a warship "official"? When does audio, apparently from a warship radio, become a public document? Does a label of "From Defense Department Video" on a still photograph make the video official? Does anyone happen to know what that official-looking number (080107-D-6570C-001) attached to the defenselink version of the video means?
  2. Who assembled the video and audio? This question is, of course, related to the one above. The video consists of 4 or 5 separate shots and the "explode" audio over a black screen. Was it assembled officially and if so, by whom?
  3. What is the provenance of the video and audio? Were the separate shots and the audio recorded by one individual or several?
  4. Will the video be deposited with the National Archives? or GPO? or FDLP? And, the converse of this question is, of course, Will it disappear from the .mil web sites? Will take-down notices be filed with copies on Youtube? Will its "official" status later become "unofficial"? Will the military claim the video is owned by a private individual and the military has no rights to it?
  5. Has any digital library saved a copy of this video? There are copies of the video and excerpts of it all over the net. My search for the "official" version took some time and I came across news media excerpts, copies of news media versions, and other copies. But none of the places I found the video (including the "official" versions on the .mil sites) have any library-role of guaranteeing long-term preservation and free public access. There once was a time when more ephemeral documents distributed by the government might have some life in newspapers and the stories about them. But, in an age of video, are any libraries saving copies of significant public documents like this video? Or are they hoping that someone else (ABC? CNN? DOD? GPO?) will do that for them? And who is in control of those copies? Daniel has put a copy in the Internet Archive: http://www.archive.org/details/DodFootageOfJan62008IranianEncounter, but even the IA has a policy that web site owners "can voluntarily restrict access to their material."
  6. Where is the "20 minute" compilation cited in several news stories? Perhaps it is somewhere on the web, but I did not find it. If anyone has found it please let me know.

I believe that libraries should be asking these questions in general, not just of the highly-visible items like the Hormuz video. In fact, if anything, the Hormuz video will probably be saved somewhere because, like toothpaste out of a tube, it is hard to put something back once it's been released on the net. But libraries are the only places that will preserve the things that are not high profile today but which will have great value tomorrow. Unless libraries create explicit policies to select, acquire, organize, and preserve digital information, much will be lost -- whether it is "within the scope" of FDLP or not.

CLIR Seeks Public Comment on White Paper

 Preservation in the Age of Large-Scale Digitization By Oya Rieger


The Council on Library and Information Resources (CLIR) seeks public comment on a white paper examining preservation issues relevant to large-scale digitization projects such as those being done by Google, Microsoft, and the Open Content Alliance. The paper, Preservation in the Age of Large-Scale Digitization, was written by Oya Rieger, Interim Assistant University Librarian for Digital Library and Information Technologies at Cornell University Library. It is available at http://www.clir.org/activities/details/mdpres.html.

The paper identifies issues that will influence the availability and usability, over time, of the digital books being created by large-scale digitizing projects, and considers the relationship of these new resources to our print collections. It concludes with a set of recommendations for rethinking a preservation strategy.

In issuing this paper, CLIR aims to stimulate discussion among stakeholders and to generate productive thinking about collaborative approaches to enduring access. To this end, CLIR invites those who submit comments to indicate whether they would like their comments posted publicly on our Web site. CLIR will make public only those comments accompanied by permission to post (let us know if the comments are to be anonymous or signed), and all such comments will be moderated. Comments received without permission to post will be shared only with CLIR staff and the author.

Public comment is sought through Friday, October 5. Please address comments to Kathlin Smith (ksmith@clir.org). CLIR will issue a final print and electronic report later this fall.

 

Karen G. Schneider on LOCKSS

Karen explains LOCKSS software clearly and succinctly -- its technology, costs, benefits, and purpose. She compares LOCKSS and Portico, too.

[Libraries] have never before owned so little of the content they manage. LOCKSS offers one solution... [T]he use of LOCKSS for preserving local born-digital content -- with a free download, plus one morning's worth of time -- is certainly worth a spin around the block.

Wikipedia Scanner and government information

There's a very interesting article in Wired about a data mining tool developed to discover instances of whitewashing (e.g. editing in one's self-interest; presumably inappropriately) of Wikipedia entries. As has been noted before, Wikipedia has no authority control over the entries and is therefore particularly subject to self-serving or highly partisan edits. Now a clever grad student has developed a tool to identify those instances based on the version tracking built into wikis. While it doesn't necessarily identify a particular person, just knowing that, as described in the article, someone at Diebold HQ removed negative information about Diebold voting machines is adequate because it forces Diebold to prove they weren't the ones to make the changes. In short, it provides accountability by making use of the Wikipedia equivalent of the historical record.
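As a rough illustration of how a tool like this could attribute edits, the sketch below assumes that anonymous wiki edits are recorded under the editor's IP address and that we have a table of network ranges registered to organizations. Both the table (`ORG_NETWORKS`, filled with documentation-only address ranges and made-up names) and the function `attribute_edits` are hypothetical; this is not the Wikipedia Scanner's actual code.

```python
import ipaddress

# Hypothetical table of organizations and their registered address ranges.
# These are reserved documentation ranges, not real assignments.
ORG_NETWORKS = {
    "Example Agency": ipaddress.ip_network("192.0.2.0/24"),
    "Example Corp": ipaddress.ip_network("198.51.100.0/24"),
}


def attribute_edits(edits, org_networks=ORG_NETWORKS):
    """Group anonymous edits by the organization whose range they came from.

    Assumption: each edit is a dict whose "editor" field holds either an
    IP address (anonymous edit) or a username (logged-in edit).
    """
    results = {}
    for edit in edits:
        try:
            addr = ipaddress.ip_address(edit["editor"])
        except ValueError:
            continue  # a username, not an IP address, so nothing to match
        for org, network in org_networks.items():
            if addr in network:
                results.setdefault(org, []).append(edit)
    return results
```

A match like this points to a network, not a person, which is consistent with the article's point that the tool narrows responsibility to an organization rather than naming an individual.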

I mention this story because I think that this kind of activity is going to be increasingly important in determining what constitutes a real and/or official government publication. Traditionally, you held a government accountable by getting official documentation of its activities and holding on to it for comparison with other official documentation. However, government information published electronically has made this a lot harder because of the changeable nature of digital files. A longstanding concern of government information librarians with respect to electronic government information has been how to know when changes have been made, what the changes consisted of and who made them.

In this respect, the surging popularity of web 2.0-style tools may be a great boon for government information. These tools -- wikis, online collaborative software like Google Documents or Zoho and so on -- derive their value from their ability to be shared. Government agency personnel are no different from anyone else: they've got work to do, a limited patience with messing around with how to do it and a desire to take the path of least resistance. So, for government employees, i.e. the folks creating government information, there's just as much reason to use these kinds of software as there is for me right now writing this post.

And that means that neither the historical record nor legal accountability is necessarily lost, although it will entail expanding the definition of preservation of the historical record to include methods of acting on databases (creating data mining software to run against databases) in addition to the collection of objects (finding that last copy of a Serial Set volume) and any other activities that may become necessary as technology evolves.

As with everything, the possibilities are not limitless. The Wikipedia Scanner was developed in cooperation with Wikipedia and required a full download of the whole database. Allowing that level of access is an option that individual agencies could turn on or off and certainly some agencies would never allow those levels of access to their publications. However, the agencies unlikely to play well with others in this scenario probably already don't provide much access to their information. For agencies that would be amenable to this kind of datamining, a benefit would be not just automated archiving (which the version tracking amounts to), but no-cost-to-the-agency management of those archives since they'll be allowing others to do it for them.

Government Information in Legacy Formats

Government Information in Legacy Formats: Scaling a Pilot Project to Enable Long-Term Access, by Gretchen Gano and Julie Linden, D-Lib Magazine (July/August 2007) Volume 13 Number 7/8.

The Yale Library pilot project described here has served not only as a means for analyzing and documenting aspects of a CD-ROM migration approach, but also as a launching pad for a community-wide consideration of a large-scale, distributed project to migrate this legacy collection and ensure permanent public access to government information distributed on CD-ROMs.

Web-at-risk: preserving govt and political information

Valerie Glenn, University of Alabama Libraries nee University of North Texas, has an article out in the current First Monday entitled, "Preserving Government and Political Information: The Web–at–Risk Project" that talks about ... wait for it ... Web harvesting!

It's based on her talk at the 2007 WebWise Conference on Libraries and Museums in the Digital World. In fact the whole issue of First Monday 12(7) is dedicated to selected papers from WebWise. Valerie's article covers the what and why of Web harvesting, gives some sample collections, tools, and services and talks a little about some of the overarching issues involved in Web harvesting. There's more information on the Web-at-risk wiki.

Besides Valerie's article, there are podcasts of all of the sessions from WebWise07 where you'll hear the likes of Liz Bishoff, Günter Waibel, Steve Puglia, Deanna B. Marcum etc.

And if you haven't heard of First Monday you owe it to yourself to get over to that link and check out all their past issues. Or look at Best Mondays, their most read -- or at least most accessed -- articles.

OSTI using Archive-It for E-Prints

The Energy Department's Office of Scientific and Technical Information (OSTI) is using the Internet Archive's Archive-It service to "provide uninterrupted access to more than a million online research papers from OSTI's E-print Network."

  • EPrint Network Special Collection
    This collection provides searching of more than 1 million scientific e-prints. The E-print Network is a deep Web source of scientific and technical information created by researchers active in a wide range of fields, including chemistry, biology and life sciences, materials science, nuclear sciences and engineering, energy research, and computer and information technologies. Information customers can use E-print Network to browse scientific Web sites, find scientific societies, receive alerts and search and access scientific e-prints, the documents circulated electronically to facilitate peer exchange and scientific advancement. OSTI leads development and adaptation of new capabilities for preservation and dissemination of research important to the U.S. Department of Energy (DOE).

See also: OSTI archives scientific data on the Web, by Trudy Walsh, GCN, 06/29/07.

"Without a way to periodically archive this material, important science content within this ever-growing, ever-changing online, e-print environment could disappear," said Walter Warnick, director of OSTI.

When one copy is not enough...

When the University of Illinois Library joined others in the Google Book Search Library Project, one thing the librarians there knew was that relying on a single digital repository would not fit their preservation or access needs.

The libraries are taking a long-term view on the preservation and availability of the newly digitized materials, too. Rather than leaving the storage to Google alone, as some partners have done, they're creating at least one, and probably two, shared digital repositories of their own to archive and manage the information and content to be added in the future.

The idea is that if Google disappears, or changes its business, the information will remain accessible with no hassles.

"We know we'll be here," Sandler [Mark Sandler, the director of the committee's Center for Library Initiatives] said.

See: Google partnership will help preserve UI collection, by Greg Kline, The [Champaign-Urbana and East Central Illinois] News-Gazette, Sunday, June 17, 2007.

Lost webpages?

CBS reported that the Alabama Department of Homeland Security has a website that listed groups they consider possible terrorists. However, after the agency received complaints from people, some Web pages were removed from the list.

The original list of the groups that the agency considered possible terrorists was:

  • Environmentalists
  • Anti-Genetics (those opposed to genetically-altered crops)
  • Animal Rights
  • Anti-Abortion
  • Anti-Nuclear
  • Anti-War
  • Pro-Gay Right

This is a classic example of the abuse that can occur when a politically motivated governmental body controls information without any sort of vetting process. I am wondering if any library or individuals were able to capture the removed websites. Or will this be a permanently lost document? I did a quick, not thorough search in the WayBack Machine but no luck.

Update 5/29/2007

Thanks to Valerie for pointing out this page has been saved by a library/archives agency.
