Digital deposit

Demystifying Digital Deposit: What It Is and What It Could Do for the Future of the FDLP

At the Fall Depository Library Council Meeting in Arlington, VA, Rebecca Blakeley gave a presentation that she and I wrote on "Demystifying Digital Deposit: What It Is and What It Could Do for the Future of the FDLP." Although a PDF version of the presentation is available on the FDLP web site, it only has the slides, not the text of the presentation.

The complete, original PowerPoint file, including the "speaker notes" with the complete text of the presentation, is available on SlideShare:

Digital Deposit

This page is to collect information on digital deposit.

Kentucky Shows How to Publish and Deposit Government Documents

The State of Kentucky has developed a best-practices manual for publishing -- and depositing -- government documents digitally.

The Kentucky Department for Libraries and Archives (KDLA) has been the official repository for Kentucky state agency publications since 1958.

...Kentucky state agencies are required to send their publications to KDLA. The Public Records Division (PRD) and State Library Services (SLS) at KDLA work together not only to provide access to the valuable information contained in state agency publications, but also to preserve the publications for future generations.

The handbook says that "Electronic publications should be forwarded in Adobe Portable Document Format (PDF)." We would love to see all U.S. government publications in PDF format deposited in FDLP libraries. It would be a great first step toward digital deposit of all government information.

Comment on article: Depository Library Program in 2023

A recent article reports on a survey of ARL library directors and their vision of their libraries' roles in the depository library program:

The survey asked directors to choose among several future scenarios for the FDLP and their role in the provision of government information. The authors are explicit about their intentions, saying that "the study neither directly addresses whether the depository program itself will exist fifteen years hence nor offers a vision of what future will emerge after 2023." They also note that the survey explicitly focused on the question "how many libraries want to remain in the depository library program and what role do they intend to play?" This focus predetermines the outcome of the survey somewhat. It doesn't tell us what the FDLP should be or how libraries could have a role in ensuring long-term, free access to government information. Instead we get a lot of information about what directors worry most about: money and resources.

The authors point out that no other study has systematically surveyed library directors for their perspective on the FDLP. This is particularly interesting given the rumors, gossip, and scuttlebutt going around about how many university librarians want to get rid of their depository collections, don't trust their depository librarians, and see depository status as costing more than it is worth.

The study reinforces some of those stereotypes and provides some evidence that some ARL library directors do indeed think that way. Sample quotes: "Several directors look forward to a time when they can 'dump the print.'" "Although some directors believe they have 'forward-thinking' documents librarians, others feel the opposite. As the director of a regional depository explains, 'the more that directors know about the program and a library's responsibilities, the less likely documents librarians can bluff about the legal obligations and seek to maintain the status quo.'" "The burden of participation in the program, including that of cost, is a recurring theme." "The directors I talk to all want to get rid of the [depository] collection and drop out of the program as soon as possible."

Not surprisingly, the directors who think that way are apparently part of the minority (13% that chose "scenario 1") who believe that libraries should withdraw from the depository program or that the program will simply wither away.

What the survey documents for the first time, however, is how much value ARL directors put in government information and digital collections. Many of the directors see government information as essential to their academic communities and have serious concerns about how to ensure its availability. Fully half the respondents envision (scenarios 3 and 4) some sort of digital collections as part of their responsibility -- either in partnership with GPO or separate from GPO if GPO does not provide adequate leadership.

While this survey is very interesting and provides much food for thought, it is far from the final word on the future of the FDLP, GPO, or government information. It leaves many questions unanswered and raises other questions. For example:

  • The survey's use of the term "digital depository" is confusing at best and misleading at worst. One of the "scenarios" presented to directors in the survey describes "digital deposit" as the library providing "a digital feed of government information resources to its Web site, thereby becoming a portal for access to e-government information resources. The library receives, but does not create, digital content." We wonder how directors interpreted this. Did they think that "receiving" digital content meant getting copies of digital files that they would keep in a digital collection? Or did they think that "providing a feed" and "becoming a portal" was a passive job of pointing to content at GPO or elsewhere? The article does not make this clear, and we have to guess that directors may not have provided responses that we can interpret consistently. (And we would have to ask the authors, whose work we respect, why they chose the outdated word, indeed the outdated concept, "portal"? Does anyone really believe that users want or will use "portals" anymore?)
  • Another term that is used in a confusing way in the article (at least I was confused by it) is the term "dark archive." We normally associate this with digital archives such as Portico (which archives digital copies of journal articles but is "dark" because no one can see the articles unless a particular kind of event -- such as a publisher going out of business -- allows the archive to make articles available). In this article, the authors use "dark archive" in that sense but they also use it to refer to print collections that have copies of last resort. Was this confusing to the surveyed directors? Did different interpretations skew their answers?
  • Some of this confusion is evidently apparent to the authors. When they analyzed the directors' comments, they discovered that there was some "imprecision" by directors in choosing a scenario. Some were unable to place their institution fully in one of the provided scenarios. There were many reasons for this, but it makes it harder for us to interpret and understand the results.
  • The survey did not specifically present a scenario of real digital deposit in which GPO sends (i.e., deposits) authentic digital files to depository libraries. As noted above, the survey focused on two different but related questions: who wants to remain in the FDLP and what role do they intend to play. Combining those two questions may have further muddied the responses and left out options (e.g., true digital deposit).
  • One theme mentioned several times in the article is the need for a shared digital archive of digitized materials similar to the JSTOR model. To me, this seems to be an indication that the directors value digital information, see a need for a trusted repository in addition to GPO, and would support shared responsibilities for such an archive. This should spark some good discussions at the next DLC meeting.
  • The survey seems to perpetuate and even reinforce misleading concepts about the permanent availability of digital government information. Although the authors acknowledge that "government entities often do not retain all resources permanently on their homepages, and content can be difficult to find and can be subject to removal, redacting, or alteration", they also passively quote directors who say they will rely on search engines and other libraries and government web sites to provide government information for them. There are certainly some libraries (even among ARL libraries) that will not have large digital collections of government information, but the survey does an injustice by passing along these comments without follow-up questions to those directors about who will ensure access.
  • Another questionable idea that came out of the survey was about staffing. Several directors said "they would cease to employ separate, dedicated government documents librarians. They assume the specialized knowledge will be passed to reference librarians." Shouldn't ARL directors be thinking about the need for new skills to manage digital deposit and digital preservation and digital access to locally held files? Shouldn't they be concerned about the special skills that will be needed to locate government information and provide reference service for it if they do not have a collection that they control?

In summary, the article provides much to discuss and good opportunities for further research. It also provides some clear evidence that the rumors that ARL directors want to dump their depository collections and drop their depository status are well founded, but that these directors are in the minority. Most ARL directors highly value government information and are looking for smart, efficient ways to ensure long-term access to digital collections.

"Chat with GPO" Session on Authentication

Today I attended the "Chat with GPO" OPAL session, which focused on authentication and authentication for FDLP partners. Ted Priebe, GPO's Director of Library Planning & Development (LPD), and Lisa Russell, the Manager of LPD's Content Management unit, presented material and answered questions.

Basically, LSCM wants to partner with Federal Depository Libraries and find ways to authenticate content hosted by the FDL partners. The digital signatures of authentication will indicate partnership with the FDL institution and the contact information for that institution. This is great news, especially for those FDLs also interested in hosting digital content in partnership with GPO.

The authentication session is archived on the GPO OPAL site.

Response to Public Printer

We at FGI would like to thank Robert C. Tapella, the Public Printer of the United States, for his response to our comments on his letter to President Obama regarding open government.

Mr. Tapella's response has some information that should be very encouraging and heartening to the depository library community. It also leaves some issues troublingly unaddressed.

Bulk Data Access to Legislative Information

First, it is wonderful to know that GPO is working with the Library of Congress, Congressional Research Service, the Law Library of Congress, and the Senate and House on the issue of access to bulk legislative data!

That news is important and significant. It is also very encouraging because it marks a new direction for dissemination of government information. Taken to its logical conclusion, this would mean that we will have a new route to obtaining government information. No longer will we be limited to information presented as web pages through government-built interfaces. No longer will we have to hope that web scraping will find all the information we want to gather or preserve. Raw information -- once locked in the dark web of government databases -- will be, potentially, available for libraries and others to download and repurpose.

Unfortunately, we can't look for this right away. Congress has only asked for a report, not action. The report itself is due "within 120 days of the release of Legislative Information System 2.0." Presumably that is a reference to a new version of the LIS that is currently only available within the legislative branch. I have not seen an announcement of a date for the release of a new version of the LIS, so it is not clear even when we can expect the report.

Nevertheless, it is certainly good to hear directly from Mr. Tapella that the task force working on this report will develop "a position on access to bulk data" and even intends to "work on making bulk data accessible."

It is somewhat ironic that this long, drawn-out process itself demonstrates the need for bulk data access. Although there have been calls for bulk data access for years, it literally took a legislative directive to get GPO and LOC and CRS to take the tentative steps they are taking now: to "develop a position" and "work on" the problem. Such passivity and long delays are, perhaps, inherent in a large, bureaucratic system, but they are crippling when it comes to keeping up with technological changes. This demonstrates why it is essential for the government to provide easy, free, reliable access to the raw information of government: doing so will enable others -- who can more quickly adopt new technologies -- to provide better access to that information faster than the government can.

What about Non-Legislative Data?

It is also unfortunate that the task force is only looking at bulk delivery of legislative information. Will it take another legislative directive to get GPO to "develop a position" on bulk access to other data? See Bulk Data Downloads: A Breakthrough in Government Transparency (by Tim O'Reilly, O'Reilly Radar, Mar 4, 2009) for a short list of other data for which we need bulk access.

Will GPO Support Collections in FDLP Libraries or Just Backups?

Mr. Tapella's statement does not indicate that GPO has yet grasped the difference between 'backups' and digital deposit.  GPO's focus is apparently still on making sure that its own collection is functional rather than facilitating digital collections in FDLP libraries. The "geographically dispersed content repository" described by Mr. Tapella is only "our backup" designed to ensure GPO's "continuity of operations" if GPO's own data repository becomes inoperable. This is a good and necessary feature but it is only a backup for GPO and has nothing to do with digital deposit.

Although Mr. Tapella points out that FDsys supports "repositories that can accept data much like libraries today accept tangible publications distributed from GPO," it seems clear that this generic design is intended as providing "backups" and would require "enhancements" to include bulk data access. This is a GPO-centric way of thinking. This is still a long way from GPO having a "position" on digital deposit and even further from "working on" making it possible.

Until GPO understands that it needs to support digital deposit so that FDLP libraries can build their own digital collections with their own functionality, FDLP libraries will not be partners in preservation and access; they will be, at best, little more than a backup for GPO.

APIs are not Digital Deposit

Mr. Tapella repeats the advantages of APIs, but fails to address the need for digital deposit. Providing APIs is not the same thing as providing digital deposit. As we said in our original comment, APIs are not magic. Each is a design for access and the product of choices made by the designer. Each has its own constraints built in. But don't take our word for it; read what developers say about the constraints of using existing government APIs:

We love APIs! We think they are great! We want more! We are so very glad that GPO will support them at last! But, please, Mr. Tapella, understand that APIs and a web site are only two of the three parts of a complete access system. Bulk data access is essential and we'd like to hear that GPO is planning for it now.

OAIS is not Digital Deposit

We are so very happy that FDsys is based on OAIS. It is something we have long advocated. But, again, Mr. Tapella, please understand that telling us about your preservation system and your intentions to preserve information does not reassure us that everything will be preserved and freely available to everyone forever. As we pointed out in our original comments, regardless of your intentions and the quality of your system, GPO may not always have the funding, resources, or mandate to provide free, permanent, public access to all government information and we therefore cannot rely on it alone to do so. And no single digital archive or repository can ever be as secure and safe as multiple archives. We need digital deposit to guarantee preservation and free access.

The GPO-centric approach to preservation and access is like a medieval town that stores all of its grain in one barn. When lightning strikes, the whole town goes hungry. In this day and age of $200 terabyte hard drives, peer-to-peer networks, and successful preservation systems like LOCKSS, it concerns us greatly that you still don't understand the need to have many collaborators working together to ensure long-term, free, public access.

Good News?

There are a couple of sentences in Mr. Tapella's reply that make me optimistic that GPO is on a path to change and does understand this need for collaborators. He says:

We need help from you and others in the community to help define future enhancements to access and data distribution. We see APIs as a one of the methods to provide advanced access tools, and realize that this is just one part of the ultimate solution.

To me, this says two important things: First, "data distribution" is on the GPO agenda, at least nominally; second, APIs are just one part of a bigger, ultimate, solution.  This gives me hope for more. I hope I'm not reading too much into this.

Army Journal removal highlights need for digital deposit

According to Secrecy News, the Army has pulled the unclassified Military Intelligence Professional Bulletin from the open web:

The former MIPB website states that “The MIPB is now being hosted on the Intelligence Knowledge Network (IKN). (AKO account required).” AKO (Army Knowledge Online) accounts can only be obtained by military and contractor personnel.

The MIPB, which is unclassified, has long been available on the world wide web and has even been sold commercially. Back issues from 1995 to 2005 are available online from the FAS website, though no longer from the Army.

In addition to being sold commercially, this journal was also distributed through the Federal Depository Library Program until 2006, according to its entry in GPO's Catalog of Government Publications at http://catalog.gpo.gov. After 2006, it went online only and access was through a PURL.

As of today, that PURL directed folks to the takedown page. Libraries that depended on the "official repository" of the Army for post-2005 issues were out of luck. If these digital copies had instead been deposited with depository libraries, access might have gone on unhindered, unless the Army had asked GPO to have depositories destroy their electronic archives of MIPB. But even then, the fact that multiple digital copies of MIPB existed would have triggered GPO's public process laid out in ID 72: Withdrawal of Federal Information Products from GPO’s Information Dissemination (ID) Programs. With that public process and the fact that prior issues were widely available, I think that the MIPB archive would have been safe. Instead, the Army as "The Official Repository" has made the online archives go away until FAS gets a response to its FOIA request.

Or maybe it will come sooner. The fact that MIPB had a PURL indicates that GPO may have been archiving it. But can they now post their copy of the archive? Do they need to consult the Army first? What if the Army says no?

Has anyone contacted GPO Help on this issue yet? What kind of a response have you gotten? Be sure to be kind to GPO, as the decision on document withdrawals rests with the agency, in this case the Army. Don't blame Ric Davis if the Army nixes an FDLP restoration of the 2005-2009 MIPB archive.

It's cases like these, where decisions are made with the flip of a switch and without a public process, that make us wary of the Official Single Repository of Federal Publications, no matter who the federal agency is. Sunlight and good decision making require digital deposit outside the federal government.

Public Printer's Letter to President Obama Regarding Open Government

The Public Printer recently released GPO's letter to the President regarding open government (PDF) (Robert C. Tapella, Public Printer, March 9, 2009). Since it specifically mentions FreeGovInfo, we feel the need to comment and contextualize a bit.

On the one hand, it's great that GPO is reaching out publicly to offer infrastructural help with the government transparency initiative. We're happy to assist in any way we can. We hope FDLP libraries will join GPO in such efforts.

On the other hand, FGI has always argued for a geographically dispersed system of local, official digital repositories, so we cannot support GPO’s goal 1 to make FDsys the official repository for Federal Government publications -- unless it includes a network of distributed repositories modeled after the Federal Depository Library Program (FDLP). What we can support is FDsys as the official distribution channel for federal government publications.

It's not a trivial distinction. "Repository" means that GPO assumes sole responsibility for preservation, a role not specified in legislation. "Distribution channel" means GPO continues its solid century and a half record of distributing information to other institutions which will continue their solid century and a half record of preserving government information for future use while making sure it remains freely available over the internet. Since digital deposit is currently #2 on The Sunlight Foundation's Our Open Government List (OOGL) of top ideas for the President's open government initiative, we can only assume that the public -- or at least those that are most interested in government transparency -- agrees that a geographically dispersed system is a key ingredient in government transparency.

We also believe it is important in discussions of transparency to plan for preservation of and long-term access to information. If, in concentrating on short-term access and on information-as-service, we fail to consider long-term access and instantiation of information for long-term preservation, we will inevitably lose information -- and that would be bad for transparency.

Incomplete Access

We commend and support GPO for building APIs into FDsys. It is heartening and encouraging to see that GPO is publicly and officially proclaiming that "access" means more than providing a web site. But APIs and a web site are only two of the three parts of a complete access system. GPO has yet to acknowledge or even mention the third part of access: the provision of unfiltered bulk data access to government information.

A GPO web site can provide a human-friendly interface for the public and APIs can provide a computer-program-friendly way of querying, fetching, and using information. But, even taken together, these two access points provide only the government-approved, government-designed, government-hosted view of government information.

The problem with these government-only views of government information is that they are limited. No single provider (government or non-government) can provide unlimited access points or views or interfaces.

APIs are not magic. Each is a design for access and the product of choices made by the designer. Each has its own constraints built in. For example, an API might be tied to a particular agency or department, which would limit cross-agency utility. Or an API might be generalized to work across agencies or departments and thus lose rich access to agency-specific information content or structure.

One way to overcome these limitations is for the government to provide bulk data access. This means allowing the public to download raw content in bulk. Where web sites provide one "page" at a time and APIs can provide one or many "facts" at a time, bulk data access provides the raw information so that users can build their own collections, interfaces, and APIs.

This could improve access in ways that GPO could never hope to do all by itself. Imagine, for example, an agricultural library building a digital collection that contains agricultural reports, data, and audiovisual content from the Department of Agriculture, the EPA, the SBA, and NOAA combined with reports, maps, and GIS data from state and local government agencies and other content from its own institutional repository or university press. Then imagine that specialized digital collection having its own state-specific, agriculture-specific API and web site and bulk data access. Then imagine that these repositories are part of the rapidly expanding cloud and you get a sense of a rich government information ecology.

Such scenarios are possible, but only if GPO and other government agencies make raw content easily, freely available in bulk for use and re-use and re-purposing. Providing only government web sites and government APIs without bulk data downloads and the ability for others to build collections for specific or general purposes will provide only a tiny fraction of open usability and transparency that we could have. There is nothing standing in the way of this happening today except the will of government agencies to make it happen.
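To make the distinction concrete, here is a minimal sketch, in Python, of what a library could do once it holds raw records locally. Everything in it is an assumption made for illustration: the three sample records, their field names ("id", "agency", "title", "text"), and the function names are hypothetical and do not reflect any GPO or FDsys format. The point is only that a local copy lets the library decide what to index and how to search it, across agencies.

```python
from collections import defaultdict

# Hypothetical records standing in for a bulk download. The field names
# are illustrative assumptions, not a GPO or FDsys schema.
BULK_RECORDS = [
    {"id": "usda-001", "agency": "USDA", "title": "Soil survey report",
     "text": "soil erosion and crop yields"},
    {"id": "epa-014", "agency": "EPA", "title": "Watershed study",
     "text": "pesticide runoff and soil quality"},
    {"id": "noaa-203", "agency": "NOAA", "title": "Seasonal outlook",
     "text": "drought forecast for crop regions"},
]


def build_index(records):
    """Map each lowercase word to the set of record ids that contain it."""
    index = defaultdict(set)
    for record in records:
        words = (record["title"] + " " + record["text"]).lower().split()
        for word in words:
            index[word].add(record["id"])
    return index


def search(index, term):
    """Return the sorted ids of records containing the term, from any agency."""
    return sorted(index.get(term.lower(), set()))


index = build_index(BULK_RECORDS)
print(search(index, "soil"))  # records from two different agencies
```

A real collection would need far more than a word index, but the design choice is the same: with bulk data, the interface is built by whoever holds the copy rather than fixed by the original publisher.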

Incomplete Preservation

The Public Printer's letter glosses over the problems of long-term access and preservation.

Let's be as clear as we can: we cannot and should not rely solely on GPO for long-term preservation and free access. The shift to digital does not change the methodology for long-term preservation and access. On the contrary, the tenuousness of digital information means that a distributed methodology is even more vital.

We cannot rely solely on GPO because the GPO Electronic Information Access Enhancement Act of 1993 does not even mention permanent access, nor does it guarantee that access will always be free. Indeed, the law specifically allows GPO to charge for access and even for use of its "directory" of information. The law also covers only "appropriate publications distributed by the Superintendent of Documents" -- effectively excluding huge bodies of born-digital information from the scope of what GPO is allowed to handle. Regardless of GPO's intentions, there is no existing legislative mandate for GPO to provide free, permanent, public access to government information and we therefore cannot rely on it alone to do so.

We should not rely solely on GPO because no single digital archive or repository can ever be as secure and safe as multiple archives, libraries, and repositories. Even if GPO had a legislative mandate to provide permanent preservation and access (which it does not), and even if anyone could guarantee that GPO would always get adequate funding so that it never had to withdraw anything or charge for access for anything (which no one can), it would still be impossible to guarantee that GPO would never lose any information. The nature of digital information is that it can easily be corrupted, altered, lost, or destroyed. It can become unreadable or unusable without constant attention. Relying on any single entity is simply not as safe as relying on multiple organizations. It is more than a truism that Lots of Copies Keep Stuff Safe -- safer than backups and "mirror sites." But this is about more than redundant copies. It is also about relying on different organizations because they have different funding sources, different constituencies, different technologies, and different collections. No single digital collection can ever be as safe as multiple, reliable digital collections.
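The value of multiple independent copies can be illustrated with a small Python sketch. It is not the LOCKSS protocol, only a simplified stand-in for the general idea: each holder computes a checksum of its copy, and a copy whose checksum disagrees with the majority is flagged as damaged. The holder names, the sample content, and the function names are hypothetical.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 checksum of a copy as a hex string."""
    return hashlib.sha256(data).hexdigest()


def audit_copies(copies: dict) -> tuple:
    """Return (majority digest, sorted holders whose copy disagrees with it)."""
    digests = {holder: digest(data) for holder, data in copies.items()}
    # With a tie, Counter picks the digest it saw first; a real system
    # would need a more careful rule than this simplified vote.
    majority, _ = Counter(digests.values()).most_common(1)[0]
    damaged = sorted(h for h, d in digests.items() if d != majority)
    return majority, damaged


# Hypothetical holders of the same document; one copy is corrupted.
copies = {
    "library_a": b"Report text, 2009 edition",
    "library_b": b"Report text, 2009 edition",
    "library_c": b"Report text, 2009 editio\x00",  # simulated corruption
}
majority, damaged = audit_copies(copies)
print(damaged)
```

Note what happens with a single holder: the audit has nothing to compare against, so corruption or alteration goes undetected. That is the practical difference between one repository and many.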

The good news

The good news is that there are existing organizations that can start working on this right away. Nothing prevents GPO and the existing FDLP libraries from implementing a digital depository system in which GPO enables FDLP libraries to download bulk data and build local digital collections.

There are existing technologies to facilitate this. The U.S. Government Documents Private LOCKSS Network is preserving "harvested" government information. Peer-to-peer (P2P) networks (like Napster and BitTorrent) have become increasingly popular because more and more people and some businesses have begun to realize that "distributed files" equals faster access and better preservation. (A geographically dispersed system of local, official digital repositories would be, for all intents and purposes, a P2P network.) Open source software for building digital repositories is widely available and increasingly easy to use.

Summary

APIs are good. They are a necessary part of adequate government information access. But digital distribution is also essential because only digital distribution will enable FDLP libraries and others to build new APIs, to de-ghettoize government information by better integrating it with non-government information, and to ensure long-term, free, public access and usability of government information.

The Intersection of Education, Technology, and Open Content

In a couple of recent posts, Lev Gonick, who is the CIO at Case Western Reserve University, has noted that we have "an educational economy that makes information abundant confronting an educational delivery system built for a time in which information was scarce."

His description of the educational economy and the educational delivery system struck me as analogous to the situation we face with government information. We live in an environment where government information is abundant and gains value by being distributed and reusable. In this environment it is incredibly inexpensive to distribute information, yet governments too often treat it as if it were scarce and expensive to deliver.

It is ironic, for example, that GPO refuses to deposit ninety percent or more of government information in FDLP libraries because it is digital (SOD 301, Superintendent Of Documents Policy Statement, "Dissemination/Distribution Policy for the Federal Depository Library Program" Effective Date: June 1, 2006) and then wonders why libraries find it hard to justify being a depository library.

Imagine a system closer to what Gonick describes. Imagine a system that recognizes that digital information is different from paper and ink information: both more valuable (because it is more easily used and re-used) and less expensive to distribute. Imagine an approach that is a more modern, more appropriate response to digital information than what we have now. Go further and imagine what the depository system would look like if it adopted the vision that Carl Malamud proposes.

Carl says all government information should be available in three ways (all for free):

  1. as bulk data for downloading and repurposing;
  2. through an API for querying, retrieving, embedding in other web sites;
  3. as better official web sites aimed at end users.

(Carl outlines these in his interview with Timothy M. O'Brien, February 24, 2009, and in his Rebooting the Federal Register document.)

Lev Gonick expands on what truly open information could mean to communities. He contrasts "the largely proprietary learning economy that exists now" with the new environment of "more and more open educational resources." He sees these open resources as creating new opportunities that were not available when we could only rely on proprietary, closed, scarce information resources.

Gonick's specific ideas actually sound a lot like the kind of collaborative, civic-centered services that John Shuler has long advocated and is describing here. Specifically, Gonick describes "a university-led 'connected cities' project" in which "we could invite different communities within our cities (children, schools, professionals, unions, educators, artists, elected officials, and so forth) to communicate with others in this new connected Web." He continues:

They might share oral histories and multimedia presentations about their communities with one another. Or they might participate in formal educational and research exchanges. Scientists could discuss research on sustainability, for instance, in ways that connect to high-school students seeking to learn about ecology and the economics of recycling. We can and we should leverage our universities’ ability to create powerful networks of technology and learners to create binding partnerships that matter.

The oceans that once separated us are now made smaller by the technology that we have helped invent and deploy. Deepening the linkages within and between our communities and across our cities is a 21st-century challenge worthy of great universities.

But this is not just about technology enabling sharing. It is also about having something to share. In order to do this, of course, we will need to guarantee free access to robust, preservable, re-usable collections of information. We could do that by hoping that GPO will always get the funding to do it for us and that it will do it right and meet all the needs of all communities equally well forever. We could hope that the government (GPO, OMB, Congress, etc.) will never privatize information or withdraw information, or alter information. Or, we could take on the task ourselves as depository libraries in the FDLP by demanding digital deposit. Then we could begin building digital collections for different communities-of-interest, world-wide. Libraries could then not only do interesting things with the information that they manage for their communities, but they could also facilitate others re-using the information.

Obama’s Inaugural Speech: visualized, video-searchable

President Obama's inaugural speech has generated some interesting examples of how technology can be applied to government information when the information is freely available for use and re-use and not locked into government databases or proprietary formats. It is a small piece of text with a lot of public interest and high visibility and, therefore, ripe for these kinds of demonstrations and experiments. Of course, to make use of the information, we have to actually have a copy of it. Imagine what would happen if all government information was actually distributed in open formats to libraries so that we could build collections that were index-able, search-able, visually browsable, and analyzable in interesting ways. Imagine freeing government information from its .gov silos and integrating it with non-government information in digital collections created for particular virtual communities of interest. Imagine the future of digital collections that are as easily re-usable as this small bit of text.

Check out these examples!

  • Inaugural Words: 1789 to the Present, New York Times. "A look at the language of presidential inaugural addresses. The most-used words in each address appear in [an] interactive chart..., sized by number of uses. Words highlighted in yellow were used significantly more in this inaugural address than average."
  • Visual of the Inaugural Address, ProPublica. [Compare this to the NYT version. Stop words matter!]
  • Search Inside Obama’s Inaugural Speech. Delve Networks. "We invite you to experience President Obama’s inaugural speech using our search inside technology. To do this, type what you’re looking for into the player searchbar above. A heatmap will show you where information related to your topic appears in the speech. You can move your mouse over the heatmap to see the matches. Click to jump to that place in the speech."
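The note about stop words is easy to see for yourself once you actually hold a copy of the text. Here is a minimal sketch in Python, assuming a made-up sample sentence (not the inaugural address) and a deliberately tiny, hypothetical stop-word list. It shows how the "most-used word" changes depending on whether common function words are filtered out, which is one reason two visualizations of the same speech can look so different.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = frozenset({"the", "and", "of", "do"})

# Made-up sample text; substitute any text you hold a copy of.
SAMPLE = "the nation and the people of the nation do the work of the nation"


def top_words(text, n=3, stop_words=frozenset()):
    """Return the n most frequent words, skipping any in stop_words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


# Without filtering, a function word dominates the count.
print(top_words(SAMPLE, n=1))
# With filtering, a content word rises to the top.
print(top_words(SAMPLE, n=1, stop_words=STOP_WORDS))
```

None of this is possible if the text is only reachable through someone else's search box, which is the point of the examples above.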

Affirmative Disclosure of Government Information

John Wonderlich, a Program Director of the Sunlight Foundation and a great friend of libraries, has posted some useful suggestions over at The Sunlight Foundation Blog:

Obama and Affirmative Disclosure.

I really like John's concept of "affirmative disclosure." I think we could go even further by explicitly addressing the problems of long-term preservation caused by the shift to e-government.

I am starting from the assumption that society needs a reliable way to preserve an accurate, complete historical record. Unfortunately, the systems we have in place today make it difficult, and in some cases impossible, to guarantee that we will preserve a record that is either complete or accurate.

Consider, for example, the recent case where researchers at the University of Illinois discovered that the White House removed original documents from its web site, altered them, and replaced them with backdated modifications that appear to be originals but are not.

Also consider the project of the Library of Congress, the California Digital Library, the University of North Texas Libraries, the Internet Archive and the U.S. Government Printing Office to try to capture web pages of the current administration by performing a "comprehensive crawl of the .gov domain."

These examples illustrate the problem of preserving the historical record.

The first shows how the historical record can easily be lost and altered (intentionally or unintentionally -- it doesn't matter which) by lack of accurate metadata (dates, versioning). The second shows the sad state of current preservation: the best record we will have of the government web will be a single, incomplete snapshot of the end of an eight-year administration. (Harvesting is imperfect and incomplete: links can break, embedded content can be lost, databases can prohibit or inhibit crawls of their content, and crawls can only save a snapshot of dynamic sites.)

In essence, the government has made a major change in information policy by changing the technology of information dissemination and has done so without really examining the implications of the change or even acknowledging that a policy has changed.

What was the policy change? In the old policy, the role of government was to collect and assemble and edit and create information and then instantiate it in publications and distribute those instantiations to the public. At that point the role of preservation was in the hands of libraries (mostly FDLP libraries) and archives. But, in the new policy, the government does not actively distribute, but "posts" information on web sites where it is subject to alteration and removal without ever being instantiated anywhere. It is up to the public, consumer groups, individuals, libraries, and special projects to identify when information is posted or changed and then attempt to preserve that information. While that may succeed sometimes, the approach has two fatal flaws. First, it is ad hoc and therefore will almost certainly be incomplete at best. Second, it puts the responsibility of instantiation in the wrong hands: not those who create the information (the government) but those who "discover" the information. The government essentially is renouncing its responsibility to actively, affirmatively create a preservable instance of the information it creates.

While some agencies (e.g. GPO, EIA) are saying that it is now their role to preserve information, other agencies (e.g., NARA) are actually narrowing their role in long-term preservation (notice that NARA is not participating in the ".gov crawl" and says explicitly that "most web records do not warrant permanent retention").

So, let's explicitly expand the idea of "affirmative disclosure" to include "active deposit." By that I mean that the government should be required to actively inform and distribute to the public notifications (metadata) and documents (data) every time a "document" is created or modified or superseded. "Deposit" could be accomplished with technology (e.g., RSS, APIs, OAI and OAI-PMH, etc.) and should be required to include dates and version information.
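To make "active deposit" a little more concrete, here is a minimal sketch in Python of what a single deposit notification might carry. Everything in it is an assumption for illustration: the `DepositNotice` record, its field names, and the three event names are hypothetical, not part of RSS, OAI-PMH, or any GPO system. The only points it tries to capture are the ones made above: every creation, modification, or supersession produces a notice, and every notice includes a date and version information. The checksum is an extra assumption, added so a receiving library could confirm which exact bytes the notice describes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical event vocabulary mirroring "created or modified or superseded".
VALID_EVENTS = {"created", "modified", "superseded"}


@dataclass(frozen=True)
class DepositNotice:
    """Hypothetical metadata record sent alongside a deposited document."""
    identifier: str
    version: int
    event: str
    timestamp: str                      # ISO 8601, timezone-aware
    sha256: str                         # checksum of the deposited bytes
    previous_version: Optional[int] = None


def make_notice(identifier, content, version, event, when, previous_version=None):
    """Build a notice, refusing records that lack a usable date or event."""
    if event not in VALID_EVENTS:
        raise ValueError(f"unknown event: {event!r}")
    if when.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return DepositNotice(
        identifier=identifier,
        version=version,
        event=event,
        timestamp=when.isoformat(),
        sha256=hashlib.sha256(content).hexdigest(),
        previous_version=previous_version,
    )


def to_json(notice):
    """Serialize a notice so it could travel in a feed or API response."""
    return json.dumps(asdict(notice), sort_keys=True)


# Example: a made-up document is modified, producing version 2.
notice = make_notice(
    "example-doc-001", b"text of version 2", 2, "modified",
    datetime(2009, 1, 20, 12, 0, tzinfo=timezone.utc), previous_version=1,
)
print(to_json(notice))
```

A record like this is what was missing in the backdated-documents case described above: with a dated, versioned notice for each change, a replacement could not silently pass for the original.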

This is the right way to do this because it recognizes the appropriate roles for the different participants in the life cycle of information: government agencies create information products that are preservable and libraries and others preserve those products outside the .gov domain.

Digital Deposit: Lack of storage space is no excuse

This past weekend I was at my local Costco and saw not one, but two brands of 1 Terabyte (1000 GB) drives selling at around $300. I also saw a 500 GB (1/2 TB) drive for $130. All of the drives were USB friendly, meaning you could take one off the shelf, plug it into a USB port, and have all that storage available to you.

What can you store in a Terabyte? According to an FBI article on digital forensics, plenty:

"a terabyte is equivalent to about 250 million pages of text, which would stack 10 miles high if printed on both sides of the page."

Surely that's enough space for even smaller libraries that are considering telling the Government Printing Office that they would like PDFs ("access derivatives") delivered to them based on their profiles.
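For anyone who wants to check the numbers, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted above (1 TB treated as 1000 GB, 250 million pages per terabyte, $300 and $130 price tags) and assumes decimal units throughout. The function names are just illustrative helpers.

```python
# Figures taken from the post; decimal units (1 GB = 10**9 bytes) are assumed.
BYTES_PER_GB = 10**9
TERABYTE_BYTES = 1000 * BYTES_PER_GB
PAGES_PER_TERABYTE = 250_000_000   # FBI estimate quoted above

# Implied size of one page of plain text.
bytes_per_page = TERABYTE_BYTES / PAGES_PER_TERABYTE


def cost_per_gb(price_dollars, capacity_gb):
    """Dollars per gigabyte for a drive of the given price and size."""
    return price_dollars / capacity_gb


def pages_that_fit(capacity_gb):
    """Approximate pages of text a drive holds, using the FBI ratio."""
    return int(capacity_gb * BYTES_PER_GB / bytes_per_page)


print(f"Implied bytes per page: {bytes_per_page:.0f}")
print(f"1 TB drive:   ${cost_per_gb(300, 1000):.2f} per GB")
print(f"500 GB drive: ${cost_per_gb(130, 500):.2f} per GB")
print(f"Pages on the 500 GB drive: {pages_that_fit(500):,}")
```

The FBI ratio works out to about 4,000 bytes per page, which describes plain text. PDFs with images will be much larger per page, so treat the page counts as an upper bound rather than a forecast.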

I admit, space isn't the only issue. But it's the objection I've heard most often and I honestly believe that technology has taken it away.

Cuts in LC budget threaten NDIIPP

And this is *exactly* why we need a distributed system of digital deposit, collection, preservation and access.

"Cuts Impact Digital Work At Library Of Congress", National Journal's Technology Daily, Sep 11, 2007, PM edition, by Aliya Sternstein

Budget cuts this year and a paltry funding outlook for fiscal 2008 are frustrating digitization efforts at the Library of Congress, according to Library employees. Meanwhile, Democrats and Republicans disagree on how much money to supply the program in the future.

The National Digital Information Infrastructure and Preservation Program, established by Congress in 2000, devises means of finding, saving and providing long-term access to cultural resources that exist only in electronic format. But $47 million -- half of the program's funding -- was rescinded in fiscal 2007 to support other critical library programs.

Another Example of Access Bad, Ownership Good

Update 8/19/2007

It turns out that this story is both more complex and more unnerving than I present here. So while I am leaving up my original post as a historical record that I am fallible, I suggest you stop reading right now and go to my colleague Jim Jacobs's very well researched pieces:

 

 

But the best part about Jim's story is that it doesn't end with hopeless fear. He suggests concrete actions you can take to ensure the gov't will not "googlize" our info away.  - Daniel


 

The BBC is reporting "Google is shutting down its premium video service, leaving users who have bought or rented content unable to view their videos in the future."

These people paid to access their content and now they not only can't have it, but according to the BBC, they won't be getting refunds, either.

If these same users had bought DVDs from a company that either went under or stopped selling them, these would be happy people watching their programs. But because they bought into an access model that assured them content would be available 24/7 on a third-party server, they have nothing but their memories.

I'm positive Google didn't start up the service with the intention of shutting it down. They had every intention of being good to their users. But that was something that, for whatever reason, they could not live up to.

Can you see where I'm going with this? The Government Printing Office wants us to accept a centralized model where we point to content housed in FDSys rather than depositing digital documents with the nation's Federal Depository Libraries. They assure us, just like Google assured their subscribers, that their servers will always be available to us on today's usage terms. But they can't really commit future generations of public servants any more than Google could keep its promise to its subscribers. And Google has WAY more money than GPO ever will.

Try to take some time this week to tell your Congress members that you want to see the decentralized depository system of the future, not another monolithic model that can be unplugged at will.

UPDATE: We have updated the information about this issue and its implications in a two part article:

 

David Rosenthal says "Do it for Preservation!"

David Rosenthal is a member of Stanford's LOCKSS development team who maintains a blog about his professional work. It is well worth reading and deserves a place in everyone's list of RSS feeds.

In a June 10, 2007, posting on reasons to preserve e-journals, David explains that multiple, independently hosted copies of government publications are a good thing because they are TAMPER EVIDENT:

The goal of the FDLP was to provide citizens with ready access to their government's information. But, even though this wasn't the FDLP's primary purpose, it provided a remarkably effective preservation system. It created a large number of copies of the material to be preserved, the more important the material, the more copies. These copies were on low-cost, durable, write-once, tamper-evident media. They were stored in a large number of independently administered repositories, some in different jurisdictions. They are indexed in such a way that it is easy to find some of the copies, but hard to be sure that you have found them all.

Preserved in this way, the information was protected from most of the threats to which stored information is subject. The FDLP's massive degree of replication protected against media decay, fire, flood, earthquake, and so on. The independent administration of the repositories protected against human error, incompetence and many types of process failures. But, perhaps most important, the system made the record tamper evident.

Winston Smith in "1984" was "a clerk for the Ministry of Truth, where his job is to rewrite historical documents so that they match the current party line". George Orwell wasn't a prophet. Throughout history, governments of all stripes have found the need to employ Winston Smiths and the US government is no exception. Government documents are routinely recalled from the FDLP, and some are re-issued after alteration.

An illustration is Volume XXVI of Foreign Relations of the United States, the official history of the US State Department. It covers Indonesia, Malaysia, Singapore and the Philippines between 1964 and 1968. It was completed in 1997 and underwent a 4-year review process. Shortly after publication in 2001, the fact that it included official admissions of US complicity in the murder of at least 100,000 Indonesian "communists" by Suharto's forces became an embarrassment, and the CIA attempted to prevent distribution. This effort became public, and was thwarted when the incriminating material was leaked to the National Security Archive and others.

The important property of the FDLP is that in order to suppress or edit the record of government documents, the administration of the day has to write letters, or send US Marshals, to a large number of libraries around the country. It is hard to do this without attracting attention, as happened with Volume XXVI. Attracting attention to the fact that you are attempting to suppress or re-write history is self-defeating. This deters most attempts to do it, and raises the bar of desperation needed to try. It also ensures that, without really extraordinary precautions, even if an attempt succeeds it will not do so without trace. That is what tamper-evident means. It is almost impossible to make the record tamper-proof against the government in power, but the paper FDLP was a very good implementation of a tamper-evident record.
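The same tamper-evident property can carry over to digital deposit. Below is a minimal sketch in Python, assuming each library holds its own copy of a document and is willing to compare checksums with the others. The library names and document bytes are made up, and the simple majority comparison is only an illustration of the idea; it is not the LOCKSS protocol or anything GPO has implemented.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def digest(content: bytes) -> str:
    """SHA-256 checksum of one library's copy."""
    return hashlib.sha256(content).hexdigest()


def find_divergent_copies(copies: Dict[str, bytes]) -> List[str]:
    """Return the holders whose copy differs from the most common version.

    This makes alteration evident, not impossible: it only reports that
    copies disagree, and it assumes most holders still have the original.
    """
    digests = {holder: digest(content) for holder, content in copies.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(h for h, d in digests.items() if d != majority_digest)


# Made-up example: three independently administered copies, one altered.
copies = {
    "library_a": b"Volume text as originally published.",
    "library_b": b"Volume text as originally published.",
    "library_c": b"Volume text as later revised.",
}
print(find_divergent_copies(copies))
```

With only one copy on one server, there is nothing to compare against, so a change leaves no trace. The more independent holders there are, the more of them would have to be altered before a change could go unnoticed.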

You'll notice that David refers to the depository program in the past tense. He does so because, like GPO itself, he sees the Future Digital System (FDSys) as an inevitable total replacement:

It should have become evident by now that I am using the past tense when describing the FDLP. The program is ending and being replaced by FDSys. This is in effect a single huge web server run by the GPO on which all government documents will be published. The argument is that through the Web citizens have much better and more immediate access to government information than through an FDLP library. That's true, but FDSys is also Winston Smith's dream machine, providing a point-and-click interface to instant history suppression and re-writing.

David thinks this is a bad thing, GPO assures us it is a good thing, but both assume this is where we are going.

But it doesn't have to be this way. We in the FDLP are definitely "Not Dead Yet!" We have a vital role to play in continuing to preserve the tangible materials entrusted to our care. Further, hundreds of new tangible titles are being shipped each month by GPO to the 1,200-plus federal depository libraries.

And while the depository community hasn't exactly leaped up and embraced its responsibility to preserve federal electronic publications, individual libraries like the University of North Texas and the New Mexico State Library have. Together with others who have held views on preservation similar to David's for years, these libraries will help build the depository system of the future.

Or we can sit back and let Winston Smith control our government information. If you are a government information specialist, it's up to you.
