Public Printer's Letter to President Obama Regarding Open Government
The Public Printer recently released GPO's letter to the President regarding open government (PDF) (Robert C. Tapella, Public Printer, March 9, 2009). Since it specifically mentions FreeGovInfo, we feel the need to comment and contextualize a bit.
On the one hand, it's great that GPO is reaching out publicly to offer infrastructural help with the government transparency initiative. We're happy to assist in any way we can. We hope FDLP libraries will join GPO in such efforts.
On the other hand, FGI has always argued for a geographically dispersed system of local, official digital repositories, so we cannot support GPO’s goal 1 to make FDsys the official repository for Federal Government publications -- unless it includes a network of distributed repositories modeled after the Federal Depository Library Program (FDLP). What we can support is FDsys as the official distribution channel for federal government publications.
It's not a trivial distinction. "Repository" means that GPO assumes sole responsibility for preservation, a role not specified in legislation. "Distribution channel" means GPO continues its solid century-and-a-half record of distributing information to other institutions, which will continue their solid century-and-a-half record of preserving government information for future use while making sure it remains freely available over the internet. Since digital deposit is currently #2 on The Sunlight Foundation's Our Open Government List (OOGL) of top ideas for the President's open government initiative, we can only assume that the public -- or at least those who are most interested in government transparency -- agrees that a geographically dispersed system is a key ingredient in government transparency.
We also believe it is important in discussions of transparency to plan for preservation of and long-term access to information. If, in concentrating on short-term access and on information-as-service, we fail to consider long-term access and instantiation of information for long-term preservation, we will inevitably lose information -- and that would be bad for transparency.
Incomplete Access
We commend and support GPO for building APIs into FDsys. It is heartening and encouraging to see that GPO is publicly and officially proclaiming that "access" means more than providing a web site. But APIs and a web site are only two of the three parts of a complete access system. GPO has yet to acknowledge or even mention the third part of access: the provision of unfiltered bulk data access to government information.
A GPO web site can provide a human-friendly interface for the public and APIs can provide a computer-program-friendly way of querying, fetching, and using information. But, even taken together, these two access points provide only the government-approved, government-designed, government-hosted view of government information.
The problem with these government-only views of government information is that they are limited. No single provider (government or non-government) can provide unlimited access points or views or interfaces.
APIs are not magic. Each is a design for access and the product of choices made by the designer. Each has its own constraints built in. For example, an API might be tied to a particular agency or department, which would limit cross-agency utility. Or an API might be generalized to work across agencies or departments and thus lose rich access to agency-specific information content or structure.
One way to overcome these limitations is for the government to provide bulk data access. This means allowing the public to download raw content in bulk. Where web sites provide one "page" at a time and APIs can provide one or many "facts" at a time, bulk data access provides the raw information so that users can build their own collections, interfaces, and APIs.
This could improve access in ways that GPO could never hope to achieve all by itself. Imagine, for example, an agricultural library building a digital collection that contains agricultural reports, data, and audiovisual content from the Department of Agriculture, the EPA, the SBA, and NOAA combined with reports, maps, and GIS data from state and local government agencies and other content from its own institutional repository or university press. Then imagine that specialized digital collection having its own state-specific, agriculture-specific API and web site and bulk data access. Then imagine that these repositories are part of the rapidly expanding cloud and you get a sense of a rich government information ecology.
Such scenarios are possible, but only if GPO and other government agencies make raw content easily, freely available in bulk for use and re-use and re-purposing. Providing only government web sites and government APIs without bulk data downloads and the ability for others to build collections for specific or general purposes will provide only a tiny fraction of the open usability and transparency that we could have. There is nothing standing in the way of this happening today except the will of government agencies to make it happen.
Incomplete Preservation
The Public Printer's letter glosses over the problems of long-term access and preservation.
Let's be as clear as we can: we cannot and should not rely solely on GPO for long-term preservation and free access. The shift to digital does not change the methodology for long-term preservation and access. On the contrary, the tenuousness of digital information means that a distributed methodology is even more vital.
We cannot rely solely on GPO because the GPO Electronic Information Access Enhancement Act of 1993 does not even mention permanent access, nor does it guarantee that access will always be free. Indeed, the law specifically allows GPO to charge for access and even for use of its "directory" of information. The law also covers only "appropriate publications distributed by the Superintendent of Documents" -- effectively excluding huge bodies of born-digital information from the scope of what GPO is allowed to handle. Regardless of GPO's intentions, there is no existing legislative mandate for GPO to provide free, permanent, public access to government information and we therefore cannot rely on it alone to do so.
We should not rely solely on GPO because no single digital archive or repository can ever be as secure and safe as multiple archives, libraries, and repositories. Even if GPO had a legislative mandate to provide permanent preservation and access (which it does not), and even if anyone could guarantee that GPO would always get adequate funding so that it never had to withdraw anything or charge for access for anything (which no one can), it would still be impossible to guarantee that GPO would never lose any information. The nature of digital information is that it can easily be corrupted, altered, lost, or destroyed. It can become unreadable or unusable without constant attention. Relying on any single entity is simply not as safe as relying on multiple organizations. It is more than a truism that Lots of Copies Keep Stuff Safe -- safer than backups and "mirror sites." But this is about more than redundant copies. It is also about relying on different organizations because they have different funding sources, different constituencies, different technologies, and different collections. No single digital collection can ever be as safe as multiple, reliable digital collections.
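To make the "lots of copies" argument concrete, here is a minimal sketch of how independent holders of the same file could detect and repair a damaged copy by comparing checksums. It is only an illustration of the general idea, not the actual LOCKSS protocol; the holder names and the `audit_copies` and `repair` functions are hypothetical.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 checksum of one copy of a document."""
    return hashlib.sha256(data).hexdigest()


def audit_copies(copies: dict[str, bytes]) -> tuple[str, list[str]]:
    """Compare checksums across holders.

    Returns the checksum held by a strict majority of holders and the
    sorted names of holders whose copy disagrees with it.
    """
    digests = {holder: digest(data) for holder, data in copies.items()}
    majority, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) / 2:
        # Without a strict majority nobody can tell which copy is intact.
        raise ValueError("no majority: cannot tell which copy is intact")
    damaged = sorted(h for h, d in digests.items() if d != majority)
    return majority, damaged


def repair(copies: dict[str, bytes], damaged: list[str]) -> dict[str, bytes]:
    """Replace each damaged copy with the content of an agreeing holder."""
    good = next(data for holder, data in copies.items() if holder not in damaged)
    return {h: (good if h in damaged else data) for h, data in copies.items()}


if __name__ == "__main__":
    # Hypothetical holders; library_c's copy has one corrupted byte.
    copies = {
        "library_a": b"hearing transcript",
        "library_b": b"hearing transcript",
        "library_c": b"hearing transcr1pt",
    }
    _, damaged = audit_copies(copies)
    print("damaged copies:", damaged)
    copies = repair(copies, damaged)
    print("damaged after repair:", audit_copies(copies)[1])
```

Note what happens with fewer copies: a single holder always "agrees" with itself, so silent corruption goes unnoticed, and two holders that disagree cannot tell which one is right. Three or more independent copies are what make detection and repair possible.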
The good news
The good news is that there are existing organizations that can start working on this right away. Nothing prevents GPO and the existing FDLP libraries from implementing a digital depository system in which GPO enables FDLP libraries to download bulk data and build local digital collections.
There are existing technologies to facilitate this. The U.S. Government Documents Private LOCKSS Network is preserving "harvested" government information. Peer-to-peer (P2P) networks (like Napster and BitTorrent) have become increasingly popular because more and more people and some businesses have begun to realize that "distributed files" equals faster access and better preservation. (A geographically dispersed system of local, official digital repositories would be, for all intents and purposes, a P2P network.) Open source software for building digital repositories is widely available and increasingly easy to use.
Summary
APIs are good. They are a necessary part of adequate government information access. But digital distribution is also essential because only digital distribution will enable FDLP libraries and others to build new APIs, to de-ghettoize government information by better integrating it with non-government information, and to ensure long-term, free, public access and usability of government information.
The points you make about
The points you make about the need for a "geographically dispersed system of local, official digital repositories" apply analogously to the administration of a state publications depository library program. Presently the Montana State Library is the sole repository for digital state publications. We promote the creation of digital-only depository libraries by hosting an FTP site from which any library can download MARC records that link to our full-text publications. Nonetheless, we recognize the need for other institutions besides ourselves to have possession of the digital files, so we are looking at different models for distribution, including LOCKSS. We believe that not only do publishing state agencies need to provide the Montana State Library with their digital state publications; we in turn, as a state library, need to find ways to distribute these digital files just as we have historically distributed print state publications.
Models for distributed systems needed
Thanks Jim, I couldn't agree more. On p.109 of the report "Managing and Sustaining A State Government Publications Program in California: A Report on the Existing Situation and Recommendations for Action" (2004) there's a copy of a bar-napkin kind of drawing that I drew to map out what the CA state depository system *ought* to look like. It's a combination of social and technological, which I think is an important point to mention. No digital system is going to be a purely technological solution, so building the social aspects into the system (collaboration on cataloging, collection building, public access, etc.) is critical.
Please keep us posted as you move forward with your Montana digital depository. You're on the front edge and working models will be extremely important as others begin to build their systems.
RSS for catalog records
James,
Thanks for alerting me to that report on state pubs in California. Fascinating reading. Your napkin model has me wondering how we at the state library could help state agencies create RSS for their new pubs so that we can bring those pubs into our workflow. It's too much to expect publishing agencies to submit their publications to us even via our current online submission site. If you have any ideas on a relatively easy way for us to assist state agencies in setting up RSS for their pubs, I'd love to hear from you.
Thanks.
Jim
RSS could be really helpful
Hi Jim. There are two ways to look at this:
1) Agencies could create publications sites with built-in RSS feeds that you could monitor. Many web publishing platforms/content management systems (CMS) these days create RSS as a matter of course. At FGI, we're using Drupal, which has RSS all over the place - for each user's blog, the blog as a whole, taxonomy terms, pages, searches, etc. It might be an easier sell to at least get state agencies to create distinct publications sections on their Websites, and if they did that with more up-to-date Web publishing platforms, so much the better.
2) Even if agencies are not using tools that automatically generate RSS, there are tools out there to create RSS for sites that don't have it. So your staff could create RSS feeds for agency publication pages to be notified when new content is added to an agency page of interest. For example, there's the Firefox plugin called Update Scanner. Dapper is another tool that does that. I described the process of building RSS for sites without it in a comment a while back. In this scenario, you'd just have to create the RSS and have staff able to periodically check the feeds for new content.
I'd be interested to know if anyone's using RSS in a similar manner. Please drop a comment here if you are.
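Either way, the monitoring side can be quite small. Here is a rough sketch, using only the Python standard library, of how staff might pick out the new items in an agency's RSS 2.0 feed. The sample feed, its URLs, and the `new_items` function are all made up for illustration; a real workflow would fetch the feed over the network and keep the list of already-seen items in a file or database.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed standing in for an agency's publications page.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Agency Publications</title>
    <item>
      <title>Annual Report 2008</title>
      <link>http://agency.example/pubs/annual-2008.pdf</link>
      <guid>http://agency.example/pubs/annual-2008.pdf</guid>
    </item>
    <item>
      <title>Water Quality Survey</title>
      <link>http://agency.example/pubs/water-survey.pdf</link>
      <guid>http://agency.example/pubs/water-survey.pdf</guid>
    </item>
  </channel>
</rss>"""


def new_items(feed_xml: str, seen: set[str]) -> list[dict[str, str]]:
    """Return feed items not in `seen`, and record them in `seen`.

    Items are identified by their <guid>, falling back to <link>.
    """
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        link = item.findtext("link", default="").strip()
        key = item.findtext("guid", default="").strip() or link
        if key and key not in seen:
            title = item.findtext("title", default="").strip()
            fresh.append({"title": title, "link": link})
            seen.add(key)
    return fresh


if __name__ == "__main__":
    # Pretend the annual report was already brought into the workflow.
    already_seen = {"http://agency.example/pubs/annual-2008.pdf"}
    for pub in new_items(SAMPLE_FEED, already_seen):
        print(pub["title"], "->", pub["link"])
```

Run on a schedule, something like this would hand catalogers a short list of new publications to review instead of asking agencies to submit each one.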
Response to James Jacobs' FreeGovInfo Comments
Thank you for your thorough comments on our letter recently submitted to President Obama regarding Open Government. GPO is committed to providing free and perpetual access to authentic Federal publications. Long-term access, preservation of content, and authentication of content are difficult issues, and we feel that FDsys has been developed in a way that will support our efforts to deliver on these needs. But we cannot do this alone. We need to continue our collaborative approach with you and others in the community to clearly define the requirements, develop options, evaluate the performance of options through pilots as required, and eventually implement solutions.
We have recently been called upon by Congress, in the joint explanatory statement on H.R. 1105, to work with the Library of Congress, including the Congressional Research Service and the Law Library of Congress, to discuss access to bulk data. Specifically, the language is as follows:
Public Access to Legislative Data.-There is support for enhancing public access to legislative documents, bill status, summary information, and other legislative data through more direct methods such as bulk data downloads and other means of no-charge digital access to legislative databases. The Library of Congress, Congressional Research Service, and Government Printing Office and the appropriate entities of the House of Representatives are directed to prepare a report on the feasibility of providing advanced search capabilities. This report is to be provided to the Committees on Appropriations of the House and Senate within 120 days of the release of Legislative Information System 2.0.
To address this request, a Legislative branch task force has been assembled consisting of representatives from the offices of the Secretary of the Senate, the Clerk of the House, the Library of Congress, Congressional Research Service, the Law Library of Congress, and GPO. This task force has already met and is working to develop a position on access to bulk data. We will look to this work and the review by Congress to help guide our work on making bulk data accessible.
In parallel with the Legislative branch task force, we believe that work on the development of Application Programming Interfaces (API) specifications allowing the implementation of applications to access data in FDsys will be beneficial in providing more advanced access to information.
FDsys has been designed to support geographically dispersed content repositories. The dispersed repository within GPO is our backup or failover site. This is being designed and implemented to be a part of our continuity of operations plan (COOP) for GPO. This COOP instance of FDsys will provide full failover capability in the event the production instance of the system, including the data repository, is inoperable. Other instances envisioned in the design of the system are repositories that can accept data from FDsys, much like libraries today accept tangible publications distributed from GPO. Aside from the basic access tools of a website, FDsys today is configured to provide ZIP files that include text, PDF, MODS metadata, and PREMIS metadata files. Enhancements to this functionality could include bulk data.
We agree that it is important in discussions of transparency to plan for preservation of and long-term access to information. The FDsys design follows the reference model for an Open Archival Information System (OAIS), and has Archival Information Packages (AIP) specifically designed to support preservation. AIPs are stored in separate repositories within the system and are self-describing, allowing multiple content management systems the capability to access this data, if required, and reconstruct information for access. Additionally, the PREMIS metadata has been structured based on current industry best practices to provide critical information about the data to support preservation processes in the future. We have taken the challenge to develop a robust and reliable system to preserve information, and feel that FDsys is well positioned to meet the needs. Utilizing OAIS, operationally separate AIP stores, failover instances, and industry best practices will help ensure information in FDsys will be well preserved and available for long-term access.
We believe GPO is on a good path. FDsys is a well-designed system that provides simple and advanced access to authentic Federal publications. Enhancements to access capabilities and distribution of data can be accommodated within the design aspects of the system. We need help from you and others in the community to define future enhancements to access and data distribution. We see APIs as one of the methods to provide advanced access tools, and realize that this is just one part of the ultimate solution. We will also continue our work with the Legislative branch task force and look forward to guidance from Congress on bulk data downloads and other means of no-charge digital access to legislative databases.
Thank you again for your detailed comments. We look forward to working with you further in the future.
Thank you
Mr. Tapella--- Thank you for taking the time both to address the language in HR 1105 and to tell us about how you are doing so here. It's also very encouraging to read about the coming technological improvements in FDsys and what you hope to accomplish.
I would, though, like to encourage GPO and this task force to keep in mind what the role of the public should be in guiding the development of services aimed squarely at them. At the least, keep us informed about what is happening with bulk data, please. The provision in HR 1105 that you noted didn't appear out of nowhere. It stemmed from a great deal of work and collaboration between the public and congressional staff in the Open House Project. We're all eagerly watching what you do but can't tell that the GPO is being responsive to public needs unless you tell us.
Thanks and all the best,
Josh Tauberer
GovTrack.us