Recently, when I have spoken about “data as evidence” in several academic settings, a recurring question has come up. Essentially it concerns the fact that dishonest people, acting in bad faith, will generate false, badly formed, or misleading data and propose it as evidence in support of predetermined (i.e. prejudiced, pre-judged) positions. To this day, parties or groups grounded in “values” or “beliefs” assumed a priori – values that are non-negotiable, in fact not subject to discussion – dominate our political landscape. One has only to watch the response of some of the Republican Congressional caucus to President Obama’s discussion with them this past week to see clear evidence of this. A fundamental tenet for these believers is that compromise with any other set of beliefs represents moral “relativism” – which is equivalent to amorality (if not immorality).
I believe that much of the trouble we experience in contemporary civil discourse can be traced to a confusion, conscious or otherwise, of the distinction between “Church” (the institutionalization of religious belief) and “State” (government based on trust in a diverse and tolerant community). From the time of European settlement of this continent we have had problems in separating Church and State [See LoC for an excellent summary history] and, concomitantly, in maintaining the distinction between empirical knowledge as a basis for public policy and commitments to “truth” based in belief. The former can be understood as “objective and invariant” (as discussed previously), the latter as subjective and highly variable – John Searle’s phrase “first-person ontology” applies well here.
With objective, scientifically based knowledge, we have the opportunity of arriving – through investigation and discourse – at common agreement (within some bounds of reasonable, relative probability). With contending “truths” based in belief, we face the very real possibility of violence and conflict – consider Northern Ireland or South Asia. It is wrong and misguided to characterize the separation of Church and State as somehow inimical to one system of belief or another.
Separation of Church and State is fundamental to a diverse and inclusive society and protects religious freedom and the right of individual conscience. Without separation – and religious tolerance (as clearly expressed in the Bill of Rights) – a change in political power may result in murder. We are all too familiar – elsewhere in the world — with the consequences of confusing government and religion.
And so we must return to the problem of objectivity and pragmatism in civil discourse. Today we are faced with a range of a priori values – beliefs that are considered “true” and above debate. On the right, the most fundamental of these a priori tenets is that “government is bad” [the Reagan/Thatcher formulation being: “Government is not the solution to our problem; government is the problem.”], coupled with the corollary that raising funds to support government (taxing) is bad. Aside from being fundamentally subversive (!) of the common welfare, this is also impractical and nonsensical. But I would also argue that the left holds similar a priori values – i.e. that government is good and corporations are bad.
All forms of human organization are subject to corruption and abuse – certainly this is true of government at all levels – but it is absolutely true of corporate governance and also true of private-sector non-profit governance. I believe that the most stable and sustainable principle for our American system of democracy is justice based in the common value of fairness, and this value demands commitment to tolerant civil discourse embodying both rationality and science. It will be protected by an ongoing commitment to transparency and accountability in governance of all sectors: for-profit, not-for-profit, and public. (In recent years we have all seen flagrant examples of abuse in all three sectors. Journalism and publishing under First Amendment protections, together with free, open and effective access to data and information, have been essential to the process of transparency and accountability.)
The previously mentioned GRI [SEE: http://www.globalreporting.org/ ] – and similar initiatives working for transparency, accountability and rigorous standards of evidence – present a clear alternative to organizational business-as-usual.
(As an aside, I will here note that expressions of anger – verbal or physical – as a part of political discourse – for example shouts of “You lie!” — are signs of impotence, sure evidence of the abandonment of civil discourse, of the rational intention of serving the common welfare.)
As custodians of knowledge, as teachers and as advocates, librarians have always been primary defenders of fair and equitable access to knowledge for the common good. The World Wide Web is a technical fulfillment of the most basic ethos of librarianship. For the first time in human history, we have the technological means of sharing knowledge worldwide. But the existence of a global network does not assure that all people will have access, nor does it assure that what flows across the network will be effectively useful in informing public discourse for the largest number of people.
We, librarians, have an obligation, in all our interactions, to support the broadest possible access for all – freely, openly and effectively. We must maintain critical sensitivity to the practical usefulness of resources provided over global networks, teach critical and evaluative skills, and assist wherever possible in interpreting and refining available resources.
In my last post, I described the possibility of a systematic approach to data validation. A key feature of such an approach must be its availability to all who are responsible for data – and, of special importance, its capacity to support efficient and timely use by creators or managers of data. Bill Michener (UNM), leader of one of the currently funded DataNet projects, has published a chart describing the problem of “information entropy” [SEE: WK Michener, “Meta-information concepts for ecological data management,” Ecological Informatics 1 (2006): 4]. Within recent memory, I have heard an ecologist say that were it not possible to generate minimally necessary metadata “in 8 minutes,” he would not do it. Leaving aside – for now – the possibility of applying sticks and/or carrots (i.e. law and regulations, norms and incentives), it seems clear that a goal of applications development should be simplicity and ease of use.
[Within the realm of ecology, a good set of guidelines to making data effectively available was recently published – these guidelines are well worth reviewing, and they make specific reference to the importance of using “scripted” statistical applications (i.e. applications that generate records of the full sequence of transformations performed on any given data). This recommendation complements the broader notion – mentioned in my last post – of using workflow mechanisms like Kepler to document the full process and context of a scientific investigation. SEE: “Emerging Technologies: Some Simple Guidelines for Effective Data Management,” Bulletin of the Ecological Society of America, April 2009, 205-214. http://www.nceas.ucsb.edu/files/computing/EffectiveDataMgmt.pdf ]
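To make the idea of a “scripted” analysis concrete, here is a minimal sketch – not from the guidelines themselves, and with entirely illustrative names – of how every transformation applied to raw data can be an explicit, replayable step that leaves a record of the full lineage from raw values to reported figures:

```python
# A minimal sketch of a "scripted" analysis: each transformation is an
# explicit function, and a decorator logs what was done and how many records
# went in and out, so the full sequence of transformations is recoverable.
# All names here are illustrative, not from any published guideline.

transformation_log = []

def logged_step(description):
    """Decorator recording each transformation applied to the data."""
    def wrap(fn):
        def inner(data):
            result = fn(data)
            transformation_log.append(
                {"step": description, "n_in": len(data), "n_out": len(result)}
            )
            return result
        return inner
    return wrap

@logged_step("drop records with missing values")
def drop_missing(data):
    return [x for x in data if x is not None]

@logged_step("convert from millimeters to meters")
def mm_to_m(data):
    return [x / 1000.0 for x in data]

raw = [1200, None, 950, 1830]          # hypothetical field measurements (mm)
clean = mm_to_m(drop_missing(raw))

print(clean)                           # [1.2, 0.95, 1.83]
for entry in transformation_log:       # the replayable processing record
    print(entry)
```

The point is not the particular mechanism but that the script itself, rather than an analyst’s memory, carries the record of what was done to the data.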
As a sidebar, it is worth noting that virtually all data are “dynamic” in the sense that they may be – and are – extended, revised, reduced, etc. For purposes of publication – or for purposes of consistent citation and coherent argument in public discourse – it is essential that the referent instance or “version” of a data set be exactly specified and preserved. (This is analogous to the practice of “time-stamping” the citation of a Wikipedia article…)
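One simple way to pin down the referent instance of a dynamic data set – a sketch, assuming JSON-serializable records and using a content hash as the “version” identifier – is to serialize the snapshot deterministically, hash it, and cite the hash together with a retrieval timestamp:

```python
# A sketch of a citable "version" of a dynamic data set: the same records
# always produce the same hash, and any later revision produces a new one.
import hashlib
import json
from datetime import datetime, timezone

def cite_snapshot(records):
    # sort_keys makes the serialization deterministic, so the hash is stable
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(canonical).hexdigest()
    return {
        "sha256": digest,
        "retrieved": datetime.now(timezone.utc).isoformat(),
    }

data_v1 = [{"site": "A", "count": 14}, {"site": "B", "count": 9}]
citation = cite_snapshot(data_v1)

# The hash component is reproducible from the records alone.
assert cite_snapshot(data_v1)["sha256"] == citation["sha256"]
```

Anyone holding the cited hash can later verify that the data set they retrieve is exactly the instance that was argued from.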
Lest we be distracted by the brightest lights of technology, we should acknowledge that we now have available to us, on our desktops, powerful visualization tools. The development of Geographic Information Systems (GIS) has made it possible to present any and all forms of geo-referenced data as maps. Digital imaging and animation tools give us tremendous expressive power – which can greatly increase the persuasive, polemical effect of any data. (For just two instances among many possible, have a look at presentations from the TED meetings [SEE: http://www.ted.com/ ] or at Many Eyes [SEE: http://manyeyes.alphaworks.ibm.com/manyeyes/ ].) But, these tools notwithstanding, there is always a fundamental obligation to provide for full, rigorous and public validation of data. That is, data must be fit for confident use.
Unanticipated uses of resources are one of the most interesting aspects of resource sharing on the Web. (At the American Museum of Natural History, we made a major investment in developing a comprehensive presentation of the American Museum Congo Expedition (1909-1915) – our site included 3-D presentation of stereopticon slides and one of the first documented uses of the site was by a teacher in Amarillo, Texas who was teaching Joseph Conrad – we received a picture of her entire class wearing our 3-D glasses.) It seems highly unlikely to me that we can anticipate or even should try to anticipate all such uses.
In the early 1980s, I taught Boolean searching to students at the University of Washington, and I routinely advised against attempts to be overly precise in search formulation – my advice was – and is – to allow the user to be the last term in the search argument.
An important corollary to this concept is the notion that metadata creation is a process, not an event – and by “process” I mean an iterative, learning process. Clearly some minimally adequate set of descriptive metadata is essential for discovery of data, but our applications must also support continuing development of metadata. Social, collaborative tools are ideal for this purpose. (I will not pursue this point here, but I believe that a combination of open social tagging and tagging by “qualified” users – perhaps using applications that can invoke well-formed ontologies – holds our best hope for comprehensive metadata development.)
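The two-layer tagging idea can be sketched at toy scale – all field names here are hypothetical, and a real system would of course add provenance, ontology lookups, and access control:

```python
# A sketch of metadata-as-process: a record accumulates open social tags and
# curator ("qualified") tags as separate layers, so later discovery tools can
# weight or filter the layers differently while keeping both searchable.
from collections import defaultdict

class MetadataRecord:
    def __init__(self, identifier):
        self.identifier = identifier
        self.tags = defaultdict(set)          # layer name -> set of tags

    def tag(self, layer, term):
        """Add a tag under a named layer; re-adding is harmless."""
        self.tags[layer].add(term)

    def discovery_terms(self):
        # Both layers are exposed for search, but kept distinct.
        return {layer: sorted(terms) for layer, terms in self.tags.items()}

rec = MetadataRecord("congo-expedition-1909")   # hypothetical identifier
rec.tag("social", "3-D")
rec.tag("social", "Joseph Conrad")
rec.tag("curator", "AMNH Congo Expedition (1909-1915)")
```

The record grows over time as users and curators contribute – which is exactly the iterative process the paragraph above describes.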
As previously discussed, “free” and “open” dissemination of data are primary values – fundamental premises of democracy. Data buried behind pay walls, or impeded or denied to users by any of a variety of obstacles or “modalities of constraint” (Lawrence Lessig’s phrase), cannot be “effective”. But even when freely and/or openly available, data can be essentially useless.
So what do we mean by “effective”? One possible definition of “statistics” is: “technology for extracting meaning from data in the context of uncertainty”. In the scientific context – and I have been arguing that all data are or should be treated as “scientific” – if data are to be considered valid, they must be subject to a series of tests respecting the means by which meaning is extracted…
By my estimation, these tests in logical order are:
— Are the data well defined and logically valid within some reasoned context (for example, a scientific investigation – or as evidentiary support for some proposition)?
— Is the methodology for collecting the data well formed (this may include selection of appropriate equipment, apparatus, recording devices, and software)?
— Is the prescribed methodology competently executed? Are the captured data integral and is their integrity well specified?
— To what transformations have primary data been subject?
— Can each stage of transformation be justified in terms of logic, method, competence and integrity?
— Can the lineages and provenances of original data be traced back from a data set in hand?
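The tests above can be read as a checklist attached to a data set’s metadata. A sketch, with entirely hypothetical field names, of how such a checklist might be evaluated mechanically:

```python
# A sketch of the six validation tests as a machine-checkable report.
# The field names ("research_context", "collection_protocol", etc.) are
# invented for illustration; a real schema would be community-defined.

def validation_report(dataset):
    checks = {
        "definition":      bool(dataset.get("research_context")),
        "methodology":     bool(dataset.get("collection_protocol")),
        "execution":       dataset.get("protocol_followed", False),
        "integrity":       dataset.get("checksum_verified", False),
        "transformations": all(
            step.get("justification")
            for step in dataset.get("transformations", [])
        ),
        "provenance":      bool(dataset.get("source_lineage")),
    }
    return checks, all(checks.values())

example = {
    "research_context": "stream temperature survey",
    "collection_protocol": "hourly logger readings",
    "protocol_followed": True,
    "checksum_verified": True,
    "transformations": [{"step": "unit conversion", "justification": "C to K"}],
    "source_lineage": ["raw logger files, 2008 field season"],
}
checks, valid = validation_report(example)
```

A report like this does not replace peer judgment, but it makes explicit which of the tests a given data set can and cannot document.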
The Science Commons [SEE: “Protocol for Implementing Open Access Data” http://www.sciencecommons.org/projects/publishing/open-access-data-protocol/] envisions a time when “in 20 years, a complex semantic query across tens of thousands of data records across the web might return a result which itself populates a new database” and, later in the protocol, imagines a compilation involving 40,000 data sets. Just the prospect of proper citation for the future “meta-analyst” researcher suggests an overwhelming burden.
So, of course, even assuming that individual data sets can be validated in terms of the tests I mention above, how are we to manage this problem of confidence/ assurance of validity in this prospectively super-data-rich environment?
(Before proceeding to this question, let’s parenthetically ask how these tests are performed today. I believe that they are accomplished through a less than completely rigorous series of “certifications” – most basically, various aspects of the peer review process assure that the suggested tests are satisfied. Within most scientific contexts, research groups or teams of scientists develop research directions and focus on promising problems. The logic of investigation, methodology and competence are scrutinized by team members, academic committees, institutional colleagues (hiring, promotion, and tenure processes), by panels of reviewers – grant review groups, independent review boards, editorial boards – and ultimately by the scientific community at large after publication. Reviews and citation are the ultimate validations of scientific research. In government, data are to some extent or other “certified” by the agency or body responsible.)
If we assume a future in which tens of thousands of data sets are available for review and use, how can any scientist proceed with confidence? (My best assumption, at this point, is that such work will proceed with a presumption of confidence – perhaps little else?)
Jumping ahead, even in a world where confidence in the validity of data can be assured, how can we best assure that valid data are effectively useful?
A year ago in Science, a group of bio-medical researchers raised the problem of adequate contextualization of data [SEE: I. Sim, et al., “Keeping Raw Data in Context” [letter], Science v. 323, 6 Feb 2009, p. 713]. Specifically, they suggested:
“a logical model of clinical study characteristics in which all the data elements are standardized to controlled vocabularies and common ontologies to facilitate cross-study comparison and synthesis.” While their focus was on clinical studies in the bio-medical realm, the logic of their argument extends to all data. We already have tools available to us that can specify scientific workflows to a very precise degree [SEE for example: https://kepler-project.org/ ]. It seems entirely possible to me that such tools can be used – in combination with well-formed ontologies built by consensus within disciplinary communities – to systematize the descriptions of scientific investigation and data transformation, and moreover – in combination with socially collaborative applications – to support a systematic process of peer review and evaluation of such workflows.
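At toy scale, the standardization idea looks something like the following sketch – the vocabularies and field names are invented, far smaller than any real clinical ontology, and meant only to show why controlled terms enable mechanical cross-study comparison:

```python
# A sketch of controlled-vocabulary standardization: a study description is
# accepted only if each field's value comes from an agreed vocabulary, so
# records from different studies can be compared term-for-term.
# Vocabularies and fields are hypothetical examples.

CONTROLLED_VOCAB = {
    "design": {"randomized", "observational", "case-control"},
    "outcome_unit": {"mmHg", "mg/dL", "score"},
}

def conforms(study):
    """True only if every controlled field uses an agreed term."""
    return all(
        study.get(field) in allowed
        for field, allowed in CONTROLLED_VOCAB.items()
    )

a = {"design": "randomized", "outcome_unit": "mmHg"}
b = {"design": "RCT", "outcome_unit": "mmHg"}   # free-text synonym: rejected
```

The free-text synonym in the second record is exactly what defeats cross-study synthesis: a human reader knows “RCT” means “randomized,” but a query across forty thousand data sets cannot.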
OK – so WHAT ABOUT GOVERNMENT INFORMATION??? Suppose we’re just government document librarians, or just plain citizens trying to make well-informed decisions about policy? Stay tuned…
In this case, the Open Knowledge Foundation has gone a long way toward clarification…
SEE specifically: http://opendefinition.org/
From the “Open Knowledge Definition” home page:
“In the simplest form the definition can be summed up in the statement that ‘A piece of knowledge is open if you are free to use, reuse, and redistribute it’. “
In detail, the definition suggests [for the sake of clarity, I have here deleted some material]:
“The term knowledge is taken to include:
1. Content such as music, films, books
2. Data be it scientific, historical, geographic or otherwise
3. Government and other administrative information
“Software is excluded despite its obvious centrality because it is already adequately addressed by previous work.
“The term ‘work’ will be used to denote the item of knowledge at issue.
“The term ‘package’ may also be used to denote a collection of works. Of course such a package may be considered a work in itself.
“The term ‘license’ refers to the legal license under which the work is made available. Where no license has been made this should be interpreted as referring to the resulting default legal conditions under which the work is available.”
“A work is ‘open’ if its manner of distribution satisfies the following conditions:
- Access: The work shall be available as a whole and at no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The work must also be available in a convenient and modifiable form.
- Redistribution: The license shall not restrict any party from selling or giving away the work either on its own or as part of a package made from works from many different sources. The license shall not require a royalty or other fee for such sale or distribution.
- Reuse: The license must allow for modifications and derivative works and must allow them to be distributed under the terms of the original work. The license may impose some form of attribution and integrity requirements: see principle 5 (Attribution) and principle 6 (Integrity) below.
- Absence of Technological Restriction: The work must be provided in such a form that there are no technological obstacles to the performance of the above activities. This can be achieved by the provision of the work in an open data format, i.e. one whose specification is publicly and freely available and which places no restrictions monetary or otherwise upon its use.
- Attribution: The license may require as a condition for redistribution and re-use the attribution of the contributors and creators to the work. If this condition is imposed it must not be onerous. For example if attribution is required a list of those requiring attribution should accompany the work.
- Integrity: The license may require as a condition for the work being distributed in modified form that the resulting work carry a different name or version number from the original work.
- No Discrimination Against Persons or Groups: The license must not discriminate against any person or group of persons.
- No Discrimination Against Fields of Endeavor: The license must not restrict anyone from making use of the work in a specific field of endeavor. For example, it may not restrict the work from being used in a business, or from being used for military research.
- Distribution of License: The rights attached to the work must apply to all to whom the work is redistributed without the need for execution of an additional license by those parties.
- License Must Not Be Specific to a Package: The rights attached to the work must not depend on the work being part of a particular package. If the work is extracted from that package and used or distributed within the terms of the work’s license, all parties to whom the work is redistributed should have the same rights as those that are granted in conjunction with the original package.
- License Must Not Restrict the Distribution of Other Works: The license must not place restrictions on other works that are distributed along with the licensed work. For example, the license must not insist that all other works distributed on the same medium are open.”
Thomas Jefferson said:
“If nature has made any one thing less susceptible than all others of exclusive property, it is the action of the thinking power called an idea.” And also –as noted in a previous blog — “The field of knowledge is the common property of all mankind.”
A century or so later, US Supreme Court Justice Louis Brandeis (in the 1918 decision International News Service v. Associated Press) wrote: “…the general rule of law is, that the noblest of human productions—knowledge, truths ascertained, conceptions, and ideas—become, after voluntary communication to others, free as the air to common use.” Jumping ahead a few decades, Stewart Brand famously declared at the 1984 Hackers Conference – “Information wants to be free…” (while also noting that information is “expensive,” creating an inevitable tension…)
When we call for “free” access to knowledge resources (used here to stand for data, information and knowledge [for some working definitions SEE: Moritz, Building the Biodiversity Commons, Appendix 3, http://www.dlib.org/dlib/june02/moritz/06moritz.html ]), we are saying that access to knowledge should not be a privilege granted only to those who can afford the current market price. Knowledge should not be placed behind “pay walls”. To assert this right to free access is to urge that, as a national and global community, our common welfare demands free access to knowledge.
The creation of mechanisms of impedance to the free flow of knowledge carries tremendous societal costs. Consider the “transaction costs” entailed every time any writer or researcher must contact an author or publisher simply for permission to use a resource. (I estimate that at the American Museum of Natural History, early in the last decade, we invested about $25,000 just to perform due diligence to secure our right to freely disseminate our own scientific publications.) Consider the transaction costs implied by the “inter-library loan” industry.
Add to transaction costs the possibility that additional charges may be assessed before an article can actually be used. As an independent researcher, without a current institutional base, I was forced to pay Nature/Macmillan US $32 for access to the “Commonwealth of Science” (1941) article cited in a previous blog. (Could I have found ways to circumvent this? Yes, of course – but that is hardly the point; I intend to act in good faith, as do most people.) Consider the plight of any teacher ambitious enough to seek out original source materials – or of any public school student or parent, for that matter… Consider the costs associated with health care information…
Having asserted this right to access, we are obliged to address the question of cost and of fair compensation for the creation of knowledge. Since the era of Ronald Reagan and Margaret Thatcher, the Anglo-American polity has been in a kind of thrall – a few years ago, when I proposed an alternative system of public compensation for knowledge creation, a colleague – very highly placed in a professional scientific society – asked me with incredulity: “you mean pay for it with taxes?”
It has become almost an a priori article of faith that public investment is somehow bad (except, I cannot help adding, when required to bail out major financial institutions and ensure exorbitant bonuses for financial executives).
Market fundamentalists (and “casino economists”! SEE: J.M. Keynes) notwithstanding, the United States has always depended upon public investment to ensure the viability of our economy. Whether by investment in postal service, energy, public schools/libraries/museums, the Interstate Highway System, the National Science Foundation or the Internet, it has been public investment that has created the infrastructure for our economic success and for innovation. And it has been the economic opportunities created for individuals by public investment that have continued to draw to our shores the ambitious and the energetic, the innovative and the productive. It is the rich diversity of America’s population that is our greatest asset and that holds our greatest hope for meeting 21st-century challenges.
Freeing “government information” is a fine starting point but all knowledge must be free…
I believe that we need an Andrew Carnegie for the 21st Century – assets that have been locked away behind pay walls should be placed in the public domain. And we need new paradigms that sustainably support and fairly compensate research and intellectual work but require release of knowledge products for free public use. (The open access publishing model suggests one such strategy…)
Next, “open”? And effective?