Death and Taxes - A Graphical Visualization of the Federal Budget
Death and Taxes is a large representational graph and poster of the federal budget. It contains over 500 programs and departments and almost every program that receives over 200 million dollars annually. The data is straight from the president's 2009 budget request and will be debated, amended, and approved by Congress to begin the fiscal year. All of the item circles are proportional in size to their spending totals and the percentage change from 2008 is included to spot trends and disproportion.
Recently this graphic was posted on "The Gavel," the blog of the Speaker of the House (What 3.6 Million Jobs Lost Over 13 Months Looks Like, by "Karina," February 6th, 2009). It shows the number of jobs lost (and recovered) in the recessions of 1990, 2001, and 2008. By overlaying the three periods, each starting at its peak employment month, and showing the change in employment month by month, it gives a startling comparison that highlights the severity of the current situation. It also implies that we will have a long wait before we return to our previous peak.
I was curious about this graph and did a little follow-up that I share below. Data librarians may find this a bit tedious, but for those who have never used raw data, it may be useful as an illustration of the difference between "data" (the raw numbers that you put into statistical software) and "statistics" (the human-viewable tables and graphs that we see in publications).
Unfortunately, as is often the case with statistical information, the source given for the graphic is incomplete: simply "Bureau of Labor Statistics." I could not find the graphic itself on the bls.gov site, so I assume that the chart was constructed from BLS data, specifically, the Current Population Survey or the Current Employment Statistics Survey. These two surveys count employment differently -- one is a survey of individuals and the other is a survey of employers.
There is a similar, but not identical, chart ("Percent change in total nonfarm employment, from beginning of recession") in the January 2009 (released February 6, 2009) Current Employment Statistics Highlights, Monthly (Bureau of Labor Statistics), so my guess is that someone at the Speaker's office built the chart from the raw CES data.
Just out of curiosity, I went to the CES "Most Requested Statistics" webpage and downloaded "Total Nonfarm Employment - CES0000000001" for 1990 through the end of 2008. Raw data suitable for analysis even look "raw," not at all like a statistical table:
1990,Jan,109151
1990,Feb,109396
1990,Mar,109611
1990,Apr,109651
1990,May,109800
1990,Jun,109817
1990,Jul,109775
1990,Aug,109567
1990,Sep,109485
1990,Oct,109324
1990,Nov,109180
1990,Dec,109120
1991,Jan,109001
1991,Feb,108695
1991,Mar,108535
1991,Apr,108324
1991,May,108196
1991,Jun,108283
...
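For readers who have not handled raw data before, here is a minimal sketch of how rows like those above could be read into a program. It is only an illustration in Python, not the Excel and Stata workflow used for this post; the names RAW, parse_ces, records, and peak are made up for the example.

```python
import csv
import io

# A few rows copied from the download shown above: year, month, employment level.
RAW = """1990,Jan,109151
1990,Feb,109396
1990,Mar,109611
1990,Apr,109651
1990,May,109800
1990,Jun,109817
"""

def parse_ces(text):
    """Return a list of (year, month, employment) tuples from year,month,value rows."""
    rows = []
    for year, month, value in csv.reader(io.StringIO(text)):
        rows.append((int(year), month, int(value)))
    return rows

records = parse_ces(RAW)

# The month with the highest employment level among the rows read.
peak = max(records, key=lambda r: r[2])
print(peak)
```

Once the rows are in this form, finding the peak month of a given stretch of data is a one-line operation, which is exactly the kind of thing a printed table cannot do for you.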
Of course, it is relatively easy, using statistical software, to construct tables and graphs from raw data. Here, for example, is a published statistical table with essentially the same raw information (but from CPS, not CES) that I downloaded. (See the full table from Employment from the BLS household and payroll surveys: summary of recent trends, February 6, 2009).
By using the raw data to create a graph, one can tell a story that has more impact than just a table of numbers. It is relatively easy to get these data into statistical software. I used Excel and Stata to create a small time-series data file. I organized it by month (from month "1" to month "48"), with each row of the data file having data for all three recessions. The first row has data for the first month of each of the three recessions, the second row has data for the second month, and so on. The CES data give employment totals in thousands (so 138152 means roughly 138 million jobs). For example, the employment for 2008:
138152 138080 137936 137814 137654 137517 137356 137228 137053 136732 136352 135755 135178
I had to compute a new variable for each recession: the cumulative number of jobs lost. So, for example, 2008:
138152      0
138080    -72
137936   -216
137814   -338
137654   -498
137517   -635
137356   -796
137228   -924
137053  -1099
136732  -1420
136352  -1800
135755  -2397
135178  -2974
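The computed variable is simply each month's level minus the level in the first (peak) month. A small sketch in Python, offered as an illustration rather than as the Stata commands actually used; the names levels_2008, change_from_peak, and jobs_lost_2008 are invented for the example:

```python
# Employment levels for the 2008 series, copied from the post.
levels_2008 = [138152, 138080, 137936, 137814, 137654, 137517, 137356,
               137228, 137053, 136732, 136352, 135755, 135178]

def change_from_peak(levels):
    """Subtract the first (peak-month) value from every value in the series."""
    peak = levels[0]
    return [level - peak for level in levels]

jobs_lost_2008 = change_from_peak(levels_2008)
print(jobs_lost_2008)
```

The same function applies unchanged to the 1990 and 2001 series, since each one is stored starting at its own peak month.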
The first 12 months with all three recessions (v1, v2, v3) and the computed variables (1990, 2001, 2008) look like this:
month      v1    1990      v2    2001      v3    2008
    1  109817       0  132530       0  138152       0
    2  109775     -42  132500     -30  138080     -72
    3  109567    -250  132219    -311  137936    -216
    4  109485    -332  132175    -355  137814    -338
    5  109324    -493  132047    -483  137654    -498
    6  109180    -637  131922    -608  137517    -635
    7  109120    -697  131762    -768  137356    -796
    8  109001    -816  131518   -1012  137228    -924
    9  108695   -1122  131193   -1337  137053   -1099
   10  108535   -1282  130901   -1629  136732   -1420
   11  108324   -1493  130723   -1807  136352   -1800
   12  108196   -1621  130591   -1939  135755   -2397
Here is a complete tab-separated-values version of the data file I constructed. Then I used Stata to build a graph and it looks very much like the one at the Speaker's Blog.
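For anyone who wants to try this without Excel or Stata, a file with the same layout could be assembled along these lines. This is a rough Python sketch under my own assumptions, not the procedure used to make the file linked above; only the first four months are included, and the names series, build_rows, to_tsv, and tsv_text are hypothetical.

```python
import csv
import io

# First four months of each recession, copied from the table above.
series = {
    "1990": [109817, 109775, 109567, 109485],
    "2001": [132530, 132500, 132219, 132175],
    "2008": [138152, 138080, 137936, 137814],
}

def build_rows(series):
    """One row per month since the peak: month number, then level and
    change-from-peak for each recession in turn."""
    n_months = min(len(levels) for levels in series.values())
    rows = []
    for i in range(n_months):
        row = [i + 1]
        for levels in series.values():
            row += [levels[i], levels[i] - levels[0]]
        rows.append(row)
    return rows

def to_tsv(series):
    """Render the rows as tab-separated text with a header line."""
    header = ["month"]
    for i, name in enumerate(series, start=1):
        header += [f"v{i}", name]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(build_rows(series))
    return out.getvalue()

tsv_text = to_tsv(series)
print(tsv_text)
```

Because every recession starts at row 1 with its own peak month, any plotting tool that draws one line per computed column will overlay the three downturns the way the Speaker's graphic does.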
Of course, when one tells one story, one leaves out other stories. This graphic doesn't show that the starting points of the recessions were different:
1990: 109 million
2001: 132 million
2008: 138 million
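One way to account for the different starting points is to express each loss as a percentage of that recession's peak, which is the approach the BLS chart mentioned earlier takes. Here is a small illustrative sketch in Python using the month-12 values from my table; the names peaks, month_12, and percent_change are made up for the example.

```python
# Peak-month employment and the level in month 12 of the table, copied from the post.
peaks = {"1990": 109817, "2001": 132530, "2008": 138152}
month_12 = {"1990": 108196, "2001": 130591, "2008": 135755}

def percent_change(level, peak):
    """Change from the peak, expressed as a percentage of the peak."""
    return 100.0 * (level - peak) / peak

for name in peaks:
    print(name, round(percent_change(month_12[name], peaks[name]), 2))
```

Measured this way, the 1990 and 2001 recessions had each lost about 1.5 percent of peak employment by month 12, while 2008 had lost about 1.7 percent. The 2008 line is still the worst of the three, but the gap looks smaller than it does in raw job counts.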
Open re-usable government information
One could use the raw data to tell a lot of different stories and analyze the data in many different ways. And that brings me to the connection between all this and why we need to be sure that government information is not just "free as in beer" but also "free as in open."
It is important for statistical agencies to publish statistics to help us understand their raw data. But, it is also essential that they provide us with the raw data so that we can better understand their statistics and do our own analyses. Most of the statistical agencies of the U.S. government do an excellent job of making their raw data easily available. In fact, the rest of government would do well to use statistical agencies as a model for instantiating their information in usable and re-usable formats (in addition to any publishing and presentation of their data/information) so that the information, whether it is text or images or video or sound or numbers, can be used, reused, analyzed, stored, and preserved.
President Obama's inaugural speech has generated some interesting examples of how technology can be applied to government information when the information is freely available for use and re-use and not locked into government databases or proprietary formats. It is a small piece of text with a lot of public interest and high visibility and, therefore, ripe for these kinds of demonstrations and experiments. Of course, to make use of the information, we have to actually have a copy of it. Imagine what would happen if all government information were actually distributed in open formats to libraries so that we could build collections that were index-able, search-able, visually browsable, and analyzable in interesting ways. Imagine freeing government information from its .gov silos and integrating it with non-government information in digital collections created for particular virtual communities of interest. Imagine the future of digital collections that are as easily re-usable as this small bit of text.
Check out these examples!
- Inaugural Words: 1789 to the Present, New York Times. "A look at the language of presidential inaugural addresses. The most-used words in each address appear in [an] interactive chart..., sized by number of uses. Words highlighted in yellow were used significantly more in this inaugural address than average."
- Visual of the Inaugural Address, ProPublica. [Compare this to the NYT version. Stop words matter!]
- Search Inside Obama’s Inaugural Speech. Delve Networks. "We invite you to experience President Obama’s inaugural speech using our search inside technology. To do this, type what you’re looking for into the player searchbar above. A heatmap will show you where information related to your topic appears in the speech. You can move your mouse over the heatmap to see the matches. Click to jump to that place in the speech."
Happy new year to all of you. Whether you are on vacation and peeking at the news, or reading this as you just get back to work, here is something interesting and fun to see:
Wattenberg is a computer scientist and new media artist. He is the founding manager of IBM’s Visual Communication Lab, which researches new forms of visualization and how they can enable better collaboration.
Check out his many projects (e.g., Name Voyager, the Baby Name Wizard with data from the Social Security Administration, or history flow, visualizing the editing history of Wikipedia pages, or Many-Eyes, an experiment in open, public data visualization and analysis).
This is another good example of what interesting things can be done when we have complete access to information. When the raw data are free, we can do so much more than the single views of data provided by government agencies.
Read more about Wattenberg here:
He creates ways of seeing information, by Billy Baker, Boston Globe, December 29, 2008.
In my first post, I wrote about making information useful for ordinary people. It's been a pleasure and an honor to guest blog here for the past month, and as the month of October is nearly gone, I figure it seems fitting to come back to this subject as my reign as "Blogger of the Month" comes to an end.
Large numbers in particular are difficult to comprehend, and the world of government information is full of them: earmarks range from hundreds of thousands to tens of millions of dollars, Barack Obama's fundraising totals have eclipsed six hundred million dollars, and the $700 billion bailout package had pundits scrambling to describe things that cost $700 billion. The difficulty of explaining just how big some of these numbers are reached an absurd end when CNN presented a calculation of how many McDonald's apple pies could be purchased for each US citizen with such a sum.
Among the most useful ways of putting information in context that I've seen, for government information or anything else, are the sparklines at watchdog.net:
These graphics show the statistics of each lawmaker in context, as well as the general shape of the distribution across Congress as a whole. A request for $147 million in earmarks may sound like a lot, but seeing that it puts the congressperson outside of the top 100 provides useful and much-needed context. The shape of the line also shows whether there is a smooth trend or a sharp jump, with a small handful of lawmakers raising or spending drastically more than others.
Hopefully more and more presentations of government information will follow the lead of the terrific watchdog.net and attempt to surround information with relative context so that government information isn't simply available, but understandable.
The Words They Used, by MATTHEW ERICSON, New York Times, September 4, 2008. "The words that speakers used at the two political conventions show the themes that the parties have highlighted."
This is a bubble graph of number of times words were used per 25,000 words spoken and a list of which speakers used which words. Ericson has done a good job of looking at phrases as well as individual words, of combining similar words and phrases, and of noting phrases that have very little or no use by one or both parties. Another good example of how, when we have access to the "raw data" (as opposed to transaction-based, search-and-retrieve, one-page-at-a-time access), the data can be used, re-used, and analyzed.
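The "per 25,000 words" figure is a simple normalization that lets speeches of different lengths be compared. A minimal sketch of the arithmetic in Python follows; it is my own illustration, not the method the New York Times used, and the function name rate_per_25000 and the sample sentence are invented.

```python
import re
from collections import Counter

def rate_per_25000(text, term):
    """Occurrences of `term` per 25,000 words of `text`
    (case-insensitive, single whole words only)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    count = Counter(words)[term.lower()]
    return count * 25000 / len(words)

# Made-up sample text: 8 words, 2 of them "change".
sample = "change we can believe in change is coming"
print(rate_per_25000(sample, "change"))
```

This handles only single words. Counting phrases and merging similar words, as Ericson did, would take more care, but it is only possible at all when the full text is available.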
What would it be like if we had true open access to large quantities of government text? We would be able to do much more than retrieve a page of the Congressional Record and read it. Researchers would be able to analyze the text and create new, innovative ways of discovering, browsing, searching, and reading text-based information.
Clifford Lynch has written eloquently about this in the realm of scholarly literature (Clifford A. Lynch, "Open Computation: Beyond Human-Reader-Centric Views of Scholarly Literatures," Open Access: Key Strategic, Technical and Economic Aspects, Neil Jacobs Ed., Oxford: Chandos Publishing, 2006, pp. 185-193.).
I was reminded of these issues this morning when looking at Visualization Strategies: Text & Documents on Tim Showers Web Design Blog (August 20th, 2008). Tim lists more than a dozen examples of techniques and tools. One of my favorites is the visualization of the 2008 Democratic primary debates offered by the New York Times. You can hear the debate, search for keywords and see where they appear, browse a transcript, and more.
Shouldn't we have free, open access to large bodies of all government texts (not just search-and-retrieve access to bits and pieces) so that we can easily create corpora that can be indexed, browsed, and analyzed?
Thanks and a tip of the hat to Tim Dennis!