Month of January, 2010
With the January 2010 Lost Docs Report and Appeal, we have come to the last of our "saved receipts" with which we first seeded the blog. This means that starting February 1, 2010, every single posting to the Lost Docs Blog will be a receipt submitted during that month or during the last week of the preceding month. That means that if everyone who sent in a lost document report to GPO also sent it to firstname.lastname@example.org, we would have an accurate record of the volume of document reports provided to GPO. We hope you will help make this happen.
Now on to the January 2010 Lost Docs Report and Appeal
Thanks to the continued generosity of documents librarians, we posted 85 reports of fugitive documents submitted to GPO. About two thirds of these items were reported during December 2009/January 2010.
Of these 85 reported items, 11 items have been cataloged by GPO. You can view this list by visiting lostdocs.freegovinfo.info/category/found/ and looking at the postings with January 2010 dates. We are appreciative of these new records.
In our view, three of the items reported to GPO and posted to the blog in January were either out of scope for the Catalog of Government Publications (CGP) or were already in the catalog. You can view these items by visiting lostdocs.freegovinfo.info/category/false/ and looking for items with January 2010 dates.
There were two items added to the "E-Version Needs Cataloging" category. You can view these items by visiting http://lostdocs.freegovinfo.info/category/catalog-eversion and looking for items with January 2010 dates. If your library has either of these documents, please consider adding an 856 field to the record(s) so your patrons will be able to link to the electronic version(s) through your catalog.
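For libraries that batch-edit their MARC records, here is a minimal, hypothetical sketch of adding an 856 field pointing to the electronic version, assuming Python and the pymarc library (pre-5.x, flat-subfield API); the file names and the URL are placeholders, and your ILS may offer a simpler built-in way to do this.

```python
# A minimal sketch, assuming the pymarc library (pre-5.x flat-subfield API).
# The record file names and the URL below are placeholders, not real values.
from pymarc import MARCReader, MARCWriter, Field

with open("bib_record.mrc", "rb") as infile, open("bib_record_856.mrc", "wb") as outfile:
    reader = MARCReader(infile)
    writer = MARCWriter(outfile)
    for record in reader:
        # 856 40 $u <URL> $z <public note>: electronic location of the online version
        record.add_field(
            Field(
                tag="856",
                indicators=["4", "0"],
                subfields=["u", "http://purl.example.gov/placeholder",
                           "z", "Online version"],
            )
        )
        writer.write(record)
    writer.close()
```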
If you like the concept of a public listing of fugitive documents reported to GPO, there are a number of easy ways to help us:
- If you report a fugitive document to GPO, send your e-mailed receipt to email@example.com. We welcome any item reported to GPO in the past month. It is best if you can send us the receipt the same day you get it from GPO. Some e-mail programs support auto-forwarding; if yours does, please consider auto-forwarding items whose subject contains "lostdocs submission."
- Visit the blog at lostdocs.freegovinfo.info and comment on the listed items. Comments might address questions such as: Did your library receive the item? Did you find it in the CGP? Do you think the item is out of scope for the CGP? Did you report the item as well?
- Post the blog link to your website or share it on Facebook, Twitter, or other social media.
- Subscribe to the blog feed at lostdocs.freegovinfo.info/feed/ or, better yet, incorporate the feed into your website or blog.
White House bars agencies from posting some statistics, by Aliya Sternstein, NextGov (01/27/2010).
According to this article, datasets posted to data.gov by the Nuclear Regulatory Commission, the Peace Corps, the Agriculture Department's Food Safety and Inspection Service, the Interior Department's Bureau of Reclamation, and the Social Security Administration have been removed by the Office of Management and Budget "because they raised privacy, security or other concerns."
The article is based on work done by OpenTheGovernment.org, which is tracking agency participation in the Open Government Directive here.
In my last post, I described the possibility of a systematic approach to data validation. A key feature of such an approach must be its availability to all who are responsible for data – and of special importance, its capacity to support efficient and timely use by creators or managers of data. Bill Michener (UNM), leader of one of the currently funded DataNet projects, has published a chart describing the problem of “information entropy” [SEE: WK Michener, “Meta-information concepts for ecological data management,” Ecological Informatics 1 (2006): 4]. Within recent memory, I have heard an ecologist say that were it not possible to generate minimally necessary metadata “in 8 minutes,” he would not do it. Leaving aside -- for now -- the possibility of applying sticks and/or carrots (i.e. law and regulations, norms and incentives), it seems clear that a goal of applications development should be simplicity and ease of use.
[ Within the realm of ecology, a good set of guidelines to making data effectively available was recently published – these guidelines are well worth reviewing and make specific reference to the importance of using "scripted" statistical applications (i.e. applications that generate records of the full sequence of transformations performed on any given data). This recommendation complements the broader notion -- mentioned in my last post -- of using work flow mechanisms like Kepler to document the full process and context of a scientific investigation. SEE “Emerging Technologies: Some Simple Guidelines for Effective Data Management,” Bulletin of the Ecological Society of America, April 2009, 205-214. http://www.nceas.ucsb.edu/files/computing/EffectiveDataMgmt.pdf ]
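To make the "scripted" idea concrete, here is a minimal sketch of a pipeline in which every transformation is applied through a single helper, so the full processing history travels with the resulting data. The input file, field names, and transformations are invented for illustration, not drawn from any particular study.

```python
# A minimal sketch of a "scripted" transformation pipeline: every step applied
# to the data is recorded, so the processing history travels with the result.
# The input file, field names, and transformations are hypothetical.
import csv
import json
from datetime import datetime, timezone

def apply_step(rows, description, func, log):
    """Apply one transformation and append a timestamped record of it to the log."""
    log.append({"step": description,
                "applied_at": datetime.now(timezone.utc).isoformat()})
    return func(rows)

log = []
with open("plots_raw.csv", newline="") as f:          # hypothetical input file
    rows = list(csv.DictReader(f))

rows = apply_step(rows, "drop records with missing biomass",
                  lambda rs: [r for r in rs if r.get("biomass")], log)
rows = apply_step(rows, "convert biomass from g to kg",
                  lambda rs: [{**r, "biomass_kg": float(r["biomass"]) / 1000} for r in rs], log)

# The output carries both the cleaned records and the record of how they were produced.
with open("plots_clean.json", "w") as f:
    json.dump({"processing_history": log, "records": rows}, f, indent=2)
```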
As a sidebar, it is worth noting that virtually all data are “dynamic” in the sense that they may be and are extended, revised, reduced etc. For purposes of publication – or for purposes of consistent citation and coherent argument in public discourse – it is essential that the referent instance of data or “version” of a data set be exactly specified and preserved. (This is analogous to the practice of "time-stamping" the citation of a Wikipedia article...)
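As an illustration of what "exact specification" of a referent data version might involve, here is a small sketch (invented file name and dataset title) that records a content hash, byte count, and retrieval timestamp – the same idea as time-stamping a Wikipedia citation.

```python
# A small sketch of pinning the exact "version" of a data set being cited:
# a content hash plus a retrieval timestamp. File name and title are invented.
import hashlib
import json
from datetime import datetime, timezone

def citation_record(path, title):
    sha256 = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha256.update(chunk)
            size += len(chunk)
    return {
        "title": title,
        "file": path,
        "sha256": sha256.hexdigest(),   # identifies this exact instance of the data
        "size_bytes": size,
        "retrieved": datetime.now(timezone.utc).isoformat(),
    }

print(json.dumps(citation_record("county_unemployment_2009.csv",
                                 "County unemployment rates, 2009 release"), indent=2))
```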
Lest we be distracted by the brightest lights of technology, we should acknowledge that we now have available to us, on our desktops, powerful visualization tools. The development of Geographic Information Systems (GIS) has made it possible to present any and all forms of geo-referenced data as maps. Digital imaging and animation tools give us tremendous expressive power – which can greatly increase the persuasive, polemical effects of any data. (For just two instances among many possible, have a look at presentations at the TED meetings [SEE: http://www.ted.com/ ] or have a look at Many Eyes [SEE: http://manyeyes.alphaworks.ibm.com/manyeyes/ ].) But, these tools notwithstanding, there is always a fundamental obligation to provide for full, rigorous and public validation of data. That is, data must be fit for confident use.
Unanticipated uses of resources are one of the most interesting aspects of resource sharing on the Web. (At the American Museum of Natural History, we made a major investment in developing a comprehensive presentation of the American Museum Congo Expedition (1909-1915) – our site included 3-D presentation of stereopticon slides and one of the first documented uses of the site was by a teacher in Amarillo, Texas who was teaching Joseph Conrad – we received a picture of her entire class wearing our 3-D glasses.) It seems highly unlikely to me that we can anticipate or even should try to anticipate all such uses.
In the early 1980s, I taught Boolean searching to students at the University of Washington, and I routinely advised against attempts to be overly precise in search formulation – my advice was – and is – to allow the user to be the last term in the search argument.
An important corollary to this concept is the notion that metadata creation is a process, not an event – and by “process” I mean an iterative, learning process. Clearly some minimally adequate set of descriptive metadata is essential for discovery of data, but our applications must also support continuing development of metadata. Social, collaborative tools are ideal for this purpose. (I will not pursue this point here, but I believe that a combination of open social tagging and tagging by “qualified” users -- perhaps using applications that can invoke well-formed ontologies – holds our best hope for comprehensive metadata development.)
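One way to picture that combination is a tag record that distinguishes open social tags from tags applied by "qualified" users and, where possible, binds a tag to a term in a published ontology. A minimal sketch follows; all identifiers, names, and URIs are invented.

```python
# A minimal sketch of a metadata tag record that supports both open social tagging
# and tagging by "qualified" users bound to ontology terms. All values are invented.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Tag:
    label: str                           # free-text tag as entered by the user
    tagger: str                          # who applied the tag
    qualified: bool = False              # True if the tagger has a recognized credential
    ontology_term: Optional[str] = None  # URI of a controlled term, if one was matched

@dataclass
class DatasetMetadata:
    identifier: str
    title: str
    tags: List[Tag] = field(default_factory=list)

record = DatasetMetadata(
    identifier="doi:10.0000/example",
    title="Stream temperature observations (hypothetical)",
    tags=[
        Tag("water temperature", tagger="anonymous-user-17"),
        Tag("water temperature", tagger="j.smith", qualified=True,
            ontology_term="http://purl.example.org/vocab/water_temperature"),
    ],
)
print(record)
```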
The folks at Communia, the European Thematic Network on the digital public domain, have laid out a clear, concise, easy-to-understand Public Domain Manifesto calling for the preservation and strengthening of the public domain and calling on cultural heritage organizations (including libraries!) to ensure that works in the Public Domain are available to all of society. Please read the manifesto and consider signing on.
On a side note, this isn't the first manifesto on the block. Also check out the Charter for Innovation, Creativity and Access to Knowledge and Columbia Professor of Law Eben Moglen's dotCommunist Manifesto (coming out of the Free Software movement). These three together show that there's a significant number of people around the world who think the public domain is something too important to cultures to let fall by the wayside. The Public Domain Manifesto's general recommendations give a sense of its scope:
- The term of copyright protection should be reduced.
- Any change to the scope of copyright protection (including any new definition of protectable subject-matter or expansion of exclusive rights) needs to take into account the effects on the Public Domain.
- When material is deemed to fall in the structural Public Domain in its country of origin, the material should be recognized as part of the structural Public Domain in all other countries of the world.
- Any false or misleading attempt to misappropriate Public Domain material must be legally punished.
- No other intellectual property right must be used to reconstitute exclusivity over Public Domain material.
- There must be a practical and effective path to make available 'orphan works' and published works that are no longer commercially available (such as out-of-print works) for re-use by society.
- Cultural heritage institutions should take upon themselves a special role in the effective labeling and preserving of Public Domain works.
- There must be no legal obstacles that prevent the voluntary sharing of works or the dedication of works to the Public Domain.
- Personal non-commercial uses of protected works must generally be made possible, for which alternative modes of remuneration for the author must be explored.
As previously discussed, “free” and “open” dissemination of data are primary values and fundamental premises of democracy. Data buried behind money walls, or impeded or denied to users by any of a variety of obstacles or “modalities of constraint” (Lawrence Lessig’s phrase), cannot be “effective”. But even when freely and/or openly available, data can be essentially useless.
So what do we mean by “effective”? One possible definition of “statistics” is: “technology for extracting meaning from data in the context of uncertainty”. In the scientific context – and I have been arguing that all data are or should be treated as “scientific” – if data are to be considered valid, they must be subject to a series of tests respecting the means by which meaning is extracted...
By my estimation, these tests, in logical order, are as follows (a rough sketch of how they might travel with a data set, in code, follows the list):
-- Are the data well defined and logically valid within some reasoned context (for example, a scientific investigation – or as evidentiary support for some proposition)?
-- Is the methodology for collecting the data well formed (this may include selection of appropriate equipment, apparatus, recording devices, and software)?
-- Is the prescribed methodology competently executed? Are the captured data integral and is their integrity well specified?
-- To what transformations have primary data been subject?
-- Can each stage of transformation be justified in terms of logic, method, competence and integrity?
-- Can the lineages and provenances of original data be traced back from a data set in hand?
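Here is the sketch promised above: the tests expressed as a structured checklist that could be carried along with a data set. The field names and the example answers are my own invention, not a standard.

```python
# A rough sketch of the validation tests above expressed as a structured checklist
# that could travel with a data set. Field names and answers are invented.
from dataclasses import asdict, dataclass, field
from typing import List
import json

@dataclass
class ValidationCheck:
    question: str
    satisfied: bool
    notes: str = ""

@dataclass
class ValidationRecord:
    dataset: str
    checks: List[ValidationCheck] = field(default_factory=list)

record = ValidationRecord(
    dataset="hypothetical-survey-2009",
    checks=[
        ValidationCheck("Data well defined and logically valid in a reasoned context?", True),
        ValidationCheck("Collection methodology well formed?", True,
                        notes="instrument calibration documented"),
        ValidationCheck("Methodology competently executed; integrity specified?", True),
        ValidationCheck("Transformations of primary data documented?", False,
                        notes="processing scripts not yet archived"),
        ValidationCheck("Each transformation justified (logic, method, competence, integrity)?", False),
        ValidationCheck("Lineage and provenance traceable to original data?", True),
    ],
)
print(json.dumps(asdict(record), indent=2))
```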
The Science Commons [SEE: “Protocol for Implementing Open Access Data” http://www.sciencecommons.org/projects/publishing/open-access-data-protocol/] envisions a time when “in 20 years, a complex semantic query across tens of thousands of data records across the web might return a result which itself populates a new database” and, later in the protocol, imagines a compilation involving 40,000 data sets. Just the prospect of proper citation for the future “meta-analyst” researcher suggests an overwhelming burden.
So, of course, even assuming that individual data sets can be validated in terms of the tests I mention above, how are we to manage this problem of confidence/assurance of validity in this prospectively super-data-rich environment?
(Before proceeding to this question, let’s parenthetically ask how these tests are being performed today. I believe that they are accomplished through a less than completely rigorous series of “certifications” – most basically, various aspects of the peer review process assure that the suggested tests are satisfied. Within most scientific contexts, research groups or teams of scientists develop research directions and focus on promising problems. The logic of investigation, methodology and competence are scrutinized by team members, academic committees, institutional colleagues (hiring, promotion, and tenure processes), by panels of reviewers – grant review groups, independent review boards, editorial boards -- and ultimately by the scientific community at large after publication. Reviews and citation are the ultimate validations of scientific research. In government, data are to some extent or other "certified" by the body or agency responsible.)
If we assume a future in which tens of thousands of data sets are available for review and use, how can any scientist proceed with confidence? (My best assumption, at this point, is that such work will proceed with a presumption of confidence – perhaps little else?)
Jumping ahead, even in a world where confidence in the validity of data can be assured, how can we best assure that valid data are effectively useful?
A year ago in Science, a group of bio-medical researchers raised the problem of adequate contextualization of data [SEE: I. Sim, et al., “Keeping Raw Data in Context” [letter], Science, v. 323, 6 Feb 2009, p. 713]. Specifically, they suggested:
“a logical model of clinical study characteristics in which all the data elements are standardized to controlled vocabularies and common ontologies to facilitate cross-study comparison and synthesis.” While their focus was on clinical studies in the bio-medical realm, the logic of their argument extends to all data. We already have tools available to us that can specify scientific work flows to a very precise degree. [SEE for example: https://kepler-project.org/ ] It seems entirely possible to me that such tools can be used – in combination with well-formed ontologies built by consensus within disciplinary communities – to systematize the descriptions of scientific investigation and data transformation, and moreover – in combination with socially collaborative applications – to support a systematic process of peer review and evaluation of such work flows.
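As a toy sketch of the "standardized to controlled vocabularies" step, the snippet below maps free-text data element names to controlled terms. The vocabulary, URIs, and element names are invented placeholders, not any published ontology.

```python
# A toy sketch of standardizing free-text data element names to a controlled
# vocabulary so studies can be compared. The vocabulary and names are invented.
CONTROLLED_VOCAB = {
    "systolic bp": "http://vocab.example.org/term/systolic_blood_pressure",
    "sbp": "http://vocab.example.org/term/systolic_blood_pressure",
    "age at enrollment": "http://vocab.example.org/term/age_at_enrollment",
}

def standardize(element_name):
    """Return (controlled URI, True) if the element maps to a known term, else (name, False)."""
    key = element_name.strip().lower()
    if key in CONTROLLED_VOCAB:
        return CONTROLLED_VOCAB[key], True
    return element_name, False

for raw in ["SBP", "Age at Enrollment", "favorite color"]:
    term, mapped = standardize(raw)
    print(f"{raw!r} -> {term}" + ("" if mapped else "  (no controlled term; needs review)"))
```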
OK -- so WHAT ABOUT GOVERNMENT INFORMATION??? We’re just government document librarians or just plain citizens trying to make well-informed decisions about policy? Stay tuned…
You almost certainly have seen at least one story in the past week about "Open Government" and the release of new data. Reporters have slowly been picking up on a massive release of information spurred by President Obama's Open Government Directive. (See: New 'high value' data posted to data.gov.)
Below are a few announcements and stories that you may find of interest.
But in addition to all the data released this week, there was a new policy that will potentially affect the usability of government information in the future. In the December 8, 2009 memo (Open Government Directive [pdf] Memorandum For The Heads Of Executive Departments And Agencies, M-10-06, Peter R. Orszag, Director, Office of Management and Budget) that implemented the President's Open Government Initiative, OMB specifically mandates open file formats.
To increase accountability, promote informed participation by the public, and create economic opportunity, each agency shall take prompt steps to expand access to information by making it available online in open formats.
And, OMB defines open formats as:
An open format is one that is platform independent, machine readable, and made available to the public without restrictions that would impede the re-use of that information.
This is big news for two reasons. First, it should lead the government away from proprietary formats, which are hard to preserve, hard to re-use, and typically either require proprietary software, run only on specific platforms, or both. Think: documents in ODF format rather than Microsoft Word. Second, the directive mandates formats "without restrictions [on] re-use." Think: no DRM (and no licensing restrictions!).
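As a purely illustrative sketch of "platform independent, machine readable" in practice, here is the same small set of records written to both CSV and JSON, two open formats that any platform and most programming languages can read without proprietary software. The agency and dataset names are invented.

```python
# A hypothetical illustration of publishing the same records in two open,
# machine-readable formats (CSV and JSON) using only Python's standard library.
import csv
import json

records = [
    {"agency": "Example Agency", "dataset": "inspections", "records": 1245},
    {"agency": "Example Agency", "dataset": "recalls", "records": 87},
]

with open("datasets.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["agency", "dataset", "records"])
    writer.writeheader()
    writer.writerows(records)

with open("datasets.json", "w") as f:
    json.dump(records, f, indent=2)
```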
As the ODF Alliance noted back in December when the OMB memo was released, much of government information is still released in "documents" which are not ideal for re-use of information even when the document formats are open. But, this is still an important, essential step:
Like it or not, government bureaucracies are still very document-centric and there is a lot of government “data” stored in documents, the challenge being how to provide easy access to this data.
...With today's announcement, the Obama Administration has taken an important step on open government data and acknowledged the role open formats play in this regard. For document-centric governments, an open document format remains essential to delivering on this promise.
-- Obama Administration To Require Government Agencies to Make Information Available in Open Formats. ODF Alliance, December 08, 2009.
Open formats will help libraries that want to preserve digital government information by making it easier and less costly to do so.
Here are some of the announcements about releases of new government data:
- Open Government Initiative. White House.
- Another Milestone In Making Government More Accessible and Accountable. White House.
- U.S. Government, OSTP, Open New Troves of Data to the Public
- Justice Department Announces Release of New Information Online as Part of President’s Open Government Initiative
- How "Open Gov" Datasets Affect Parents and Consumers. White House.
If you'd like to hone your skills at locating and reporting fugitive documents, check out this e-mail from GPO:
From: Announcements from the Federal Depository Library Program [mailto:GPO-FDLP-L@LISTSERV.ACCESS.GPO.GOV] On Behalf Of FDLP Listserv
Sent: Thursday, January 21, 2010 12:40 PM
To: GPO-FDLP-L@LISTSERV.ACCESS.GPO.GOV
Subject: Chat with GPO: Helping GPO Identify Fugitive Publications

On Thursday, January 28, 2010 at 1:30 PM EST, Joe McClane, Manager of GPO's Content Acquisitions, and Linda Nainis, GPO's Acquisitions Librarian, will discuss how documents librarians can help GPO identify fugitive publications. The presentation will feature a 30-minute slideshow that explains how GPO staff find fugitive documents and ways the community can help GPO improve the researching and processing of new documents. Time will be allocated at the end of the session for questions.

Space is limited to the first 100 participants on a first-come basis. GPO recommends arriving at least 10 minutes early in order to reserve your spot and test your connection.

Connect to the GPO OPAL Room: <http://www.conference321.com/masteradmin/room.asp?id=rs38bb0e4b3a5a>

For more information on GPO's OPAL implementation and OPAL requirements, visit: <http://www.fdlp.gov/outreach/onlinelearning/68-opal>

_________________________________

If you have questions or comments, please use the askGPO help service at: <http://www.gpoaccess.gov/help>. When submitting a question, please choose the category "Federal Depository Libraries" and the appropriate subcategory, if any, in order to ensure that your question is routed to the correct area.
If you have an interest in identifying fugitive publications, I strongly encourage you to attend this OPAL session. The better the reports GPO receives, the faster any given item can be cataloged. This benefits everyone. Hope to see you there.
No doubt folks have seen at least one of the growing number of video remixes of Hitler in the bunker. Well, here's a new one from Critical Commons that highlights digital scholarship, open courseware, and fair use. Nicely done.
Critical Commons provides information about current copyright law and its alternatives in order to facilitate the writing and dissemination of best practices and fair use guidelines for scholarly and creative communities. Critical Commons also functions as a showcase for innovative forms of electronic scholarship and creative production that are transformative, culturally enriching and both legally and ethically defensible. At the heart of Critical Commons is an online tool for viewing, tagging, sharing, annotating and curating media within the guidelines established by a given community. Our goal is to build open, informed communities around media-based teaching, learning and creativity, both inside and outside of formal educational environments.
Government Posting Wealth of Data to Internet, By THE ASSOCIATED PRESS, New York Times (January 22, 2010).
The Obama administration on Friday is posting to the Internet a wealth of government data from all Cabinet-level departments, on topics ranging from child car seats to Medicare services.
...Under a Dec. 8 White House directive, each department must post online at least three collections of "high-value" government data that never have been previously disclosed.
...All the new data collections will be added to the government's Web site, data.gov.
The United Kingdom has its own version of data.gov, and it has the added cachet of being promoted and advised by Sir Tim Berners-Lee.
This site seeks to give a way into the wealth of government data. [T]his means it needs to be: easy to find; easy to licence; and easy to re-use. We are drawing on the expertise and wisdom of Sir Tim Berners-Lee and Professor Nigel Shadbolt to publish government data as RDF – enabling data to be linked together.
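For a sense of what "publish government data as RDF" looks like in practice, here is a small sketch using the rdflib Python library (assumed installed). The namespace, resource URI, and values are invented placeholders, not actual data.gov.uk identifiers.

```python
# A small sketch of expressing a data point as RDF triples and serializing it as
# Turtle, using the rdflib library. Namespace, resource, and values are invented.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

EX = Namespace("http://data.example.gov.uk/def/")  # placeholder namespace

g = Graph()
g.bind("ex", EX)

observation = URIRef("http://data.example.gov.uk/id/road-accidents/2009/region-A")
g.add((observation, RDFS.label, Literal("Road accidents, Region A, 2009")))
g.add((observation, EX.region, Literal("Region A")))
g.add((observation, EX.year, Literal(2009)))
g.add((observation, EX.accidentCount, Literal(1432)))

# Because the data are triples, another data set that refers to the same resource
# URI can be linked to this one without any prior coordination.
print(g.serialize(format="turtle"))
```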
- Tim Berners-Lee unveils government data project, BBC (21 January 2010).
Web founder Sir Tim Berners-Lee has unveiled his latest venture for the UK government, which offers the public better access to official data.
A new website, data.gov.uk, will offer reams of public sector data, ranging from traffic statistics to crime figures, for private or commercial use.