Can we identify or verify or prevent government website scrubbing?
One issue we at FGI are concerned about is that, when government information is not officially distributed to depository libraries and official digital government information is available only from government-controlled web servers, that information can (intentionally or unintentionally) be deleted or altered, leaving historians, journalists, economists and other citizens with no clear, complete record of government activities.
From time to time there are stories about government websites being “scrubbed,” i.e., of information being removed from them, but it is often difficult to determine if these stories are accurate. Since stories like these are often (perhaps usually) published to make a political point, discussion of them often revolves around the political issue rather than the issue of the integrity and permanence of government information in the larger sense.
One such story this week gives us an opportunity to at least quickly and superficially examine the existence of a problem, if not its extent:
- White House Scrubs Web Site on the Economy, Perrspectives [Jon Perr], March 20, 2008.
Perr says that a flash animation and a paragraph on tax cuts, which were on the White House Jobs and Economic Growth web page (also referred to as the “Economy & Budget Policies in Focus” web page) on March 16, 2008, were removed and no longer available on March 20. The animation said:
- “18,000 jobs created in December 2007,”
- “Over 8.3 million new jobs created since August 2003”
- “Unemployment rate remains low at 5%.”
- “President Bush’s actions are moving our economy forward”
And the deleted paragraph read:
President Bush Continues To Call On Congress To Further Reduce Economic Uncertainty By Making His Tax Relief Permanent.
President Bush believes the most important action to ensure the long-term health of our economy is to make sure the tax relief that is now in place is made permanent. The 2001 and 2003 tax cuts are set to expire in less than three years. If Congress allows that to happen, 116 million taxpayers will see their taxes go up by $1,800 on average, and we will see an end to many of the measures that have helped our economy grow – including the 10 percent individual income tax bracket, reductions in the marriage penalty, the expansion of the child tax credit, and reduced rates on regular income, capital gains, and dividends.
Perr discovered that MSN has a cached copy of that page (dated 3/8/2008) that includes the animation and text. This morning, I used WebCite to make a copy of the MSN copy. (The WebCite copy does not do a good job of retaining the layout of the original, but the Flash animation is there and viewable, as is the text paragraph, and both should remain there even after MSN removes its cached copy.)
I checked the Internet Archive, but the most recent snapshot of www.whitehouse.gov/infocus/economy/ as of this morning is June 7, 2007. I did some Google searching and was not able to locate the Flash animation, but I was able to locate a series of nearly identical ones:
- 30 Straight Months of Job Gains Unemployment rates are below the …
- 29 Straight Months of Job Gains Unemployment rates are below the …
- 24 Straight Months of Job Gains Unemployment rate drops to lowest …
- 20 Straight Months of Job Gains Unemployment rate drops to lowest …
- 92000 Jobs Created in July 2007 Unemployment Rate Remains Low at …
If Google is an accurate way to judge the content of whitehouse.gov, it would appear that the White House has maintained earlier versions of this animation but has not preserved this more recent one. But we do not know how accurate or comprehensive or current Google is.
I also browsed the White House News releases for March 2008 page, because it appeared that similar information had migrated to various “Fact Sheets.” Indeed, the text paragraph is in the March 7, 2008 Fact Sheet: Taking Responsible Action to Keep Our Economy Growing. I was not able to find a link to the animations, however.
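The Internet Archive check described above can also be scripted. What follows is a minimal sketch, not a tool I used for this post: it assumes the Internet Archive's public availability endpoint (https://archive.org/wayback/available) and the JSON layout it currently returns, both of which could change. The function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Internet Archive's public availability endpoint.
WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build a query URL asking for the snapshot closest to a date."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDD, e.g. "20080320"
    return WAYBACK_API + "?" + urllib.parse.urlencode(params)


def closest_snapshot(response_text):
    """Return (timestamp, url) of the closest snapshot, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def lookup(page_url, timestamp=None):
    """Query the live service (requires network access)."""
    with urllib.request.urlopen(availability_url(page_url, timestamp)) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))
```

Calling lookup("www.whitehouse.gov/infocus/economy/", "20080320") would report the snapshot nearest that date, if any. A large gap between the date requested and the date returned is the same coverage problem I ran into by hand.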
This brings me to the question: “Can we identify or verify or prevent government website scrubbing?” My own tentative conclusions are:
- We cannot prevent the government from changing its own websites, so we cannot prevent “scrubbing.”
- We can verify that a site has changed, but currently our tools are limited to a) commercial web crawlers (like Google, MSN, and the Internet Archive), b) individuals who regularly monitor websites, and c) web crawlers created by libraries using their own tools or those provided by others (such as Archive-It).
- While tools exist to monitor changes in a website (e.g., Change Detection), I don’t believe that we can use these to look for significant (e.g., loss-of-information) alterations.
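To make the last point concrete, here is a rough sketch of what a loss-of-information check might look like if we had two saved text copies of a page. It is an illustration under stated assumptions, not an existing tool: the function names and the 0.8 similarity cutoff are invented for the example, and it presumes someone has already captured both copies as plain text. It lists paragraphs from the older copy that have no close match in the newer one.

```python
import difflib
import re


def split_paragraphs(text):
    """Split plain text into whitespace-normalized, non-empty paragraphs."""
    parts = re.split(r"\n\s*\n", text)
    return [" ".join(p.split()) for p in parts if p.strip()]


def removed_paragraphs(old_text, new_text, threshold=0.8):
    """Return paragraphs of old_text with no close match in new_text.

    threshold is an arbitrary similarity cutoff between 0 and 1,
    chosen for illustration rather than tested on real pages.
    """
    new_paras = split_paragraphs(new_text)
    removed = []
    for para in split_paragraphs(old_text):
        # Best similarity between this old paragraph and any new one.
        best = max(
            (difflib.SequenceMatcher(None, para, candidate).ratio()
             for candidate in new_paras),
            default=0.0,
        )
        if best < threshold:
            removed.append(para)
    return removed
```

Even a check like this only says that text is gone from one page. It cannot tell whether the text moved elsewhere (as the tax paragraph did, into a Fact Sheet), and it says nothing about non-text content such as a Flash animation, so a person would still have to judge whether a change is significant.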
What conclusions can we draw from all this? Since we do not know how commercial indexers such as Google and MSN work and what their criteria are and since they do not have preservation as a mission, we can hardly rely on them. While this particular example may be trivial in itself, it demonstrates that government information in the digital age, the “e-government” age, is volatile and fragile and that we do not have a system in place that is as reliable for digital content as the FDLP libraries were for non-digital content. While it is hard to imagine a system that would be robust enough to catch every single digital bit of government information from every agency for all time, it is possible to imagine a system that would capture much more than we do now.
That leads me to a conclusion that we at FGI have long advocated: Libraries should be building collections of digital government information and GPO should facilitate this by depositing government information in FDLP libraries. If libraries created collections that could be text-mined by scholars and researchers, it would be possible to better audit, analyze, and preserve government information and make it more difficult for information to be scrubbed without being discovered and exposed. Indeed, it would remove, to some extent, the motivation to “scrub” if it were well known that the information was preserved and easily discoverable.
The question we should be asking ourselves is: How much are we losing every day? The task is too big for any one library or any one government agency (i.e., GPO). And it is not a task that commercial entities like Google and MSN are likely to take on.