

Statistical reality check

As noted here earlier, Sunlight Labs has announced the three finalists in its “Apps for America 2” competition. One of them is DataMasher, which enables users to “have a little fun” with government data “by creating mashups to visualize them in different ways and see how states compare on important issues. Users can combine different data sets in interesting ways and create their own custom rankings of the states.”

A post about this on Slashdot prompted a reply, “Lies, Damned Lies, and DataMasher,” which worries that “in practice DataMasher would end up mostly generating a lot of bad information.” The reply continues:

The site as it exists now seems to encourage you to think about issues in a really simplistic way (with a simple arithmetic combination of two numbers on a state by state basis) that’s going to mislead more often than inform. The devil is always in the spurious correlations, and DataMasher just doesn’t give you ability to get at that sort of thing (nor do most people have the understanding of statistics anyway).

…Statistics are extremely useful in determining public policy, but only if used carefully. There’s already so much bad use of statistics in our public policy debates, and DataMasher seems perfectly designed (unintentionally, I’m sure) to exacerbate the problem.
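To make the concern concrete, here is a minimal sketch in Python of the kind of "simple arithmetic combination of two numbers on a state by state basis" the reply describes. It is not DataMasher's code, and every name and figure in it is an assumption: the four states, their populations, and the "libraries" and "arrests" counts are invented purely for illustration, and `mash`, `per_capita`, and `pearson` are hypothetical helpers.

```python
import math

# Hypothetical figures, invented for illustration only; not real state data.
states = {
    "State A": {"population": 10_000_000, "libraries": 500, "arrests": 40_000},
    "State B": {"population": 5_000_000, "libraries": 260, "arrests": 19_000},
    "State C": {"population": 1_000_000, "libraries": 45, "arrests": 4_500},
    "State D": {"population": 500_000, "libraries": 30, "arrests": 1_800},
}


def mash(data, numerator, denominator):
    """Divide one measure by another for each state and rank, highest first."""
    ratios = {
        name: row[numerator] / row[denominator] for name, row in data.items()
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


def per_capita(data, field):
    """Return the given measure divided by population, one value per state."""
    return [row[field] / row["population"] for row in data.values()]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Raw counts: both measures grow with population, so they move together.
raw_r = pearson(
    [row["libraries"] for row in states.values()],
    [row["arrests"] for row in states.values()],
)

# Per-capita rates: the shared dependence on population is removed.
rate_r = pearson(per_capita(states, "libraries"), per_capita(states, "arrests"))

if __name__ == "__main__":
    print("Libraries per resident, ranked:", mash(states, "libraries", "population"))
    print("Correlation of raw counts:", round(raw_r, 3))
    print("Correlation of per-capita rates:", round(rate_r, 3))
```

With these invented numbers, the raw counts of libraries and arrests are strongly positively correlated simply because both track population, while the per-capita rates are negatively correlated. Neither result says anything about cause and effect, and a single ranked ratio hides that kind of dependence entirely.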

I am very sympathetic to this argument, but I would add a caveat: any tool can be misused or used badly. Decades ago, some statisticians were upset when commercial software such as SAS and SPSS was introduced, because it allowed anyone to run a regression without knowing what a regression was, how to do the math, or whether the variables being regressed made sense together. While it is certainly true that the design of a tool can encourage misuse, it is also true, I think, that even well-designed tools can be used badly, and that even bad tools may be better than no tools because they can encourage imagination, exploration, and curiosity. Those can lead to better, more informed questions and analysis.

For libraries and other information service providers there is another side to this story. As tools like DataMasher become more available and easier to use, they make our jobs more complex, not easier. Rather than pointing to a reliable book of statistics, compiled by government statisticians and published by the government, we now have ‘raw’ sources that require more understanding and skill to use and interpret accurately and responsibly. Where once we tried to make sure that the people we helped read the footnotes and table headers so that they understood the statistics, we are now faced with helping people use raw data and produce their own statistics. Every library will have to decide what level of service to provide in situations like this. No library should avoid addressing the service implications of new sources of information, no matter how good or bad those sources are.

CC BY-NC-SA 4.0 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
