As previously discussed, “free” and “open” dissemination of data are primary values and fundamental premises for democracy. Data buried behind money walls, or impeded or denied to users by any of a variety of obstacles or “modalities of constraint” (Lawrence Lessig’s phrase), cannot be “effective”. But even when freely and openly available, data can be essentially useless.
So what do we mean by “effective”? One possible definition of “statistics” is: “technology for extracting meaning from data in the context of uncertainty”. In the scientific context – and I have been arguing that all data are or should be treated as “scientific” – if data are to be considered valid, they must be subject to a series of tests respecting the means by which meaning is extracted…
By my estimation, these tests in logical order are:
— Are the data well defined and logically valid within some reasoned context (for example, a scientific investigation, or as evidentiary support for some proposition)?
— Is the methodology for collecting the data well formed (this may include selection of appropriate equipment, apparatus, recording devices, and software)?
— Is the prescribed methodology competently executed? Are the captured data integral and is their integrity well specified?
— To what transformations have primary data been subject?
— Can each stage of transformation be justified in terms of logic, method, competence and integrity?
— Can the lineages and provenances of original data be traced back from a data set in hand?
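The tests above can be read as a checklist that follows a data set back through its lineage. As a rough illustration (the class and field names here are my own invention, not any standard), one might sketch the idea like this:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Transformation:
    """One stage of transformation applied to primary data (tests 4 and 5)."""
    description: str   # what was done to the data
    method: str        # the logic/method behind the transformation
    justified: bool    # can this stage be justified?

@dataclass
class Dataset:
    name: str
    definition: str                   # test 1: well defined, logically valid?
    collection_method: Optional[str]  # test 2: methodology well formed?
    competently_executed: bool        # test 3: competently executed, data integral?
    transformations: List[Transformation] = field(default_factory=list)
    source: Optional["Dataset"] = None  # test 6: link back toward original data

def trace_provenance(ds: Dataset) -> List[str]:
    """Walk the provenance chain from a data set in hand back to the original."""
    chain = []
    while ds is not None:
        chain.append(ds.name)
        ds = ds.source
    return chain

def passes_tests(ds: Dataset) -> bool:
    """Every stage in the lineage must satisfy the tests, not just the last one."""
    while ds is not None:
        if not (ds.definition and ds.collection_method and ds.competently_executed):
            return False
        if not all(t.justified for t in ds.transformations):
            return False
        ds = ds.source
    return True
```

The point of the sketch is only that validity is a property of the whole chain: a derived data set passes only if every upstream stage passes as well.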
The Science Commons [SEE: “Protocol for Implementing Open Access Data” http://www.sciencecommons.org/projects/publishing/open-access-data-protocol/] envisions a time when “in 20 years, a complex semantic query across tens of thousands of data records across the web might return a result which itself populates a new database” and, later in the protocol, imagines a compilation involving 40,000 data sets. Just the prospect of proper citation for the future “meta-analyst” researcher suggests an overwhelming burden.
So, of course, even assuming that individual data sets can be validated in terms of the tests I mention above, how are we to manage this problem of confidence/assurance of validity in this prospectively super-data-rich environment?
(Before proceeding to this question, let’s parenthetically ask how these tests are being performed today. I believe that they are accomplished through a less than completely rigorous series of “certifications” – most basically, various aspects of the peer review process assure that the suggested tests are satisfied. Within most scientific contexts, research groups or teams of scientists develop research directions and focus on promising problems. The logic of investigation, methodology, and competence are scrutinized by team members, academic committees, institutional colleagues (hiring, promotion, and tenure processes), and by panels of reviewers – grant review groups, independent review boards, editorial boards – and ultimately by the scientific community at large after publication. Reviews and citation are the ultimate validations of scientific research. In government, data are to some extent or other “certified” by the body or agency responsible.)
If we assume a future in which tens of thousands of data sets are available for review and use, how can any scientist proceed with confidence? (My best assumption, at this point, is that such work will proceed on a presumption of confidence – perhaps little else?)
Jumping ahead, even in a world where confidence in the validity of data can be assured, how can we best assure that valid data are effectively useful?
A year ago in Science, a group of bio-medical researchers raised the problem of adequate contextualization of data [SEE: I. Sim, et al., “Keeping Raw Data in Context” [letter], Science, v. 323, 6 Feb 2009, p. 713]. Specifically, they suggested:
“a logical model of clinical study characteristics in which all the data elements are standardized to controlled vocabularies and common ontologies to facilitate cross-study comparison and synthesis.” While their focus was on clinical studies in the bio-medical realm, the logic of their argument extends to all data. We already have tools available to us that can specify scientific workflows to a very precise degree. [SEE for example: https://kepler-project.org/ ] It seems entirely possible to me that such tools can be used – in combination with well-formed ontologies built by consensus within disciplinary communities – to systematize the descriptions of scientific investigation and data transformation, and moreover, in combination with socially collaborative applications, to support a systematic process of peer review and evaluation of such workflows.
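To make the standardization idea concrete: the gain from controlled vocabularies is that two studies describing the same thing in different words become directly comparable. A toy sketch (the vocabulary and term names below are invented for illustration, not drawn from any actual ontology):

```python
# Hypothetical controlled vocabulary mapping free-text study descriptors
# onto standard terms, so records from different studies can be compared.
CONTROLLED_VOCAB = {
    "outcome": {
        "heart attack": "myocardial_infarction",
        "mi": "myocardial_infarction",
        "stroke": "cerebrovascular_accident",
    },
    "design": {
        "rct": "randomized_controlled_trial",
        "randomized trial": "randomized_controlled_trial",
        "case-control": "case_control_study",
    },
}

def standardize(record: dict) -> dict:
    """Map each data element onto the controlled vocabulary where possible;
    unmapped values pass through unchanged."""
    out = {}
    for element, value in record.items():
        vocab = CONTROLLED_VOCAB.get(element, {})
        out[element] = vocab.get(value.strip().lower(), value)
    return out

# Two studies that recorded the same facts in different words...
study_a = standardize({"outcome": "Heart Attack", "design": "RCT"})
study_b = standardize({"outcome": "MI", "design": "randomized trial"})
# ...become identical records once standardized, enabling cross-study synthesis.
assert study_a == study_b
```

Real ontology work is of course far richer than a lookup table, but this is the mechanism in miniature: consensus vocabularies turn free-text descriptions into comparable data elements.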
OK – so WHAT ABOUT GOVERNMENT INFORMATION??? What if we’re just government document librarians, or just plain citizens trying to make well-informed decisions about policy? Stay tuned…
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.