Back when I helped teach new data librarians about data, one of the themes my colleagues and I liked to repeat was that “data should tell a story.” By that we meant that raw facts are literally without meaning until we analyze them and understand the stories they tell. “Understanding” is more than facts. As John Seely Brown and Paul Duguid said in their book The Social Life of Information, “information” is something that we put in a database, but knowledge “is something we digest rather than merely hold. It entails the knower’s understanding and some degree of commitment. Thus while one person often has conflicting information, he or she will not usually have conflicting knowledge” (pp. 119-120).
In those early days of data-librarianship, the tools we had for finding and acquiring and using data were very primitive (and often expensive) compared to the tools available today. Today, one can download and install very sophisticated free software for statistical analysis, data visualization, and even data animation. And one can download enormous data time series directly from the web and do analysis on the fly.
One big source of data is, of course, the federal government. But we shouldn’t just hope that the government will preserve and provide free access to its data. Libraries need to take action to ensure long-term free availability of data.
I say all that as an introduction to an article that I recommend to you as a source of inspiration toward action, an example of what can be done with government data today, and a cautionary tale of how data can be manipulated to tell stories that appear “true” but which actually distort the story the data really tell.
- 2016 Will Be The Warmest Year, But This Is How Deniers Will Spin It. Peter Aldhous. (December 20, 2016).
Aldhous provides code for using R, ImageMagick, and Adobe Illustrator to load data on the average global temperature for each month since January 1880 directly from the National Oceanic and Atmospheric Administration. He analyzes the data, animates it, and demonstrates how changing the timeline can make the data tell a false story.
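To make the point about timelines concrete, here is a minimal sketch in Python. It is not Aldhous's R code and it does not use NOAA data: the series is invented (a steady warming trend plus one unusually warm year), and the function names, the trend of 0.01 per year, and the 1998 spike are all assumptions chosen for illustration. It fits a straight line to the full record and then to a short window that starts on the warm year.

```python
# Synthetic illustration only: these numbers are invented, not NOAA data.

def synthetic_anomalies(first_year=1880, last_year=2016,
                        trend_per_year=0.01, spike_year=1998, spike_size=0.3):
    """Return {year: anomaly}: a steady upward trend plus one unusually warm year."""
    series = {}
    for year in range(first_year, last_year + 1):
        value = trend_per_year * (year - first_year)
        if year == spike_year:
            value += spike_size  # hypothetical one-off warm year
        series[year] = value
    return series


def trend_slope(series, start, end):
    """Ordinary least-squares slope (anomaly units per year) for start..end inclusive."""
    years = [y for y in sorted(series) if start <= y <= end]
    values = [series[y] for y in years]
    mean_x = sum(years) / len(years)
    mean_y = sum(values) / len(values)
    numerator = sum((x - mean_x) * (v - mean_y) for x, v in zip(years, values))
    denominator = sum((x - mean_x) ** 2 for x in years)
    return numerator / denominator


data = synthetic_anomalies()
full_record = trend_slope(data, 1880, 2016)    # the whole timeline
cherry_picked = trend_slope(data, 1998, 2012)  # window that starts on the warm year

print(f"Full record 1880-2016: {full_record:.4f} per year")
print(f"Window 1998-2012:      {cherry_picked:.4f} per year")
```

The underlying trend is identical in both fits. Only the choice of start and end year differs, and that alone is enough to make the short window look far flatter than the full record.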
How a complex network of bills becomes a law: GovTrack introduces new data analysis of text incorporation
Here’s a fascinating new way to look at US Congressional legislation from our friends at GovTrack.us. As Josh Tauberer explains, GovTrack’s new service “Enacted Via Other Measures,” their new data analysis of text incorporation, will now provide connections between bills — when a bill has at least about 33% of its provisions incorporated into one or more enacted bills — in order to show “how a complex network of bills becomes a law.”
No longer will legislative trackers be limited to the 6 stages in becoming a law described on Congress.gov, or even the 13 steps described by this handy infographic by Mike Wirth and Suzanne Cooper-Guasco (“How Our Laws Are Made”, First place award in the Design for America contest, 2010). Now we’ll be able to see the various pieces of bills that make it into other bills.
This is an amazing new looking glass into the legislative process. Thanks GovTrack.us!
This new analysis literally doubles our insight.
Only about 3% of bills will be enacted through the signature of the President or a veto override. Another 1% are identical to those bills, so-called “companion bills,” which are easily identified (see CRS, below). Our new analysis reveals almost another 3% of bills which had substantial parts incorporated into an enacted bill in 2015–2016. To miss that last 3% is to be practically 100% wrong about how many bills are being enacted by Congress.
And there may be even more than that, which we’ll find out as we tweak our methodology in the future.
There are so many new questions to answer:
- Who are the sources of these enacted provisions?
- How often is this cut-and-paste process cross-partisan?
- What provisions were removed from a bill to be enacted?
- Is cut-and-paste more frequent today than in the past?
A lot has been written about “fake news” in the last few months. Too much of that writing has (IMHO) muddled the differences between just-plain-lies and everything else that divides the country at the moment. But the basic issue of politicians who distort the truth because they are more interested in the zero-sum game of political power than they are in governance is an old one. When politicians do this consistently and with coordination and determination, all that distortion ends up in “the news” as if it were true, when really it is just spreading Fear, Uncertainty, and Doubt. As an article in Science reports, that process is starting in the new Congress with a renewed attack on the Census and the American Community Survey.
- Scientists fear pending attack on federal statistics collection By Jeffrey Mervis, Science (Jan 3, 2017).
As the Science story says, this is not a new attack, but part of “a broader attack on the survey that goes back several years.” Indeed, we covered it here. Read all about the false-facts, bent-facts, unsubstantiated speculation, and ideological faith-healing that typically are used to try to persuade people to support really, really bad policy ideas.
In a recent thread on the Govdoc-l mailing list about a Congressional Publications Hub (or “pub hub” — more of the thread here), one commenter said that the American Memory Project’s digital surrogates of the pre-Congressional Record publications "probably aren’t salvageable" because the TIFFs were captured at 300 ppi resolution and then converted to 2-bit bitonal black and white, and that most of the text is too faded or pixelated to be accurately interpreted by optical character recognition (OCR) software. He concluded that this was "Kind of a shame."
It is indeed a "shame" that many of the American Memory Project’s "digital surrogates" probably are not salvageable. But the real shame is that we keep making the same mistakes with the same bad assumptions today that we did 10-15 years ago in regard to digitization projects.
The mistake we keep making is thinking that we’ve learned our lesson and are doing things correctly today, that our next digitizations will serve future users better than our last digitizations serve current users. We are making a series of bad assumptions.
- We assume, because today’s digitization technologies are so much better than yesterday’s technologies, that today’s digitizations will not become tomorrow’s obsolete, unsalvageable rejects.
- We assume, because we have good guidelines (like Federal Agencies Digital Guidelines Initiative (FADGI)) for digitization, that the digitizations we make today will be the "best" by conforming to the guidelines.
- We assume, because we have experience of making "bad" digitizations, that we will not make those mistakes any more and will only make "good" digitizations.
Why are these assumptions wrong?
- Yes, digitization technologies have improved a lot, but that does not mean that they will stop improving. We will, inevitably, have new digitization techniques tomorrow that we do not have today. That means that, in the future, when we look back at the digitizations we are doing today, we will once again marvel at the primitive technologies and wish we had better digitizations.
- Yes, we have good guidelines for digitization but we overlook the fact that they are just guidelines not guarantees of perfection, or even guarantees of future usability. Those guidelines offer a range of options for different starting points (e.g., different kinds of originals: color vs. B&W, images vs. text, old paper vs. new paper, etc.) and different end-purposes (e.g., page-images and OCR require different specs) and for different users and uses (e.g. searching vs reading, reading vs. computational analysis). There is no "best" digitization format. There is only a guideline for matching a given corpus with a given purpose and, particularly in mass-digitization projects, the given corpus is not uniform and the end-point purpose is either unspecified or vague. And, too often, mass-digitization projects are compelled to choose a less-than-ideal, one-size-does-not-fit-all, compromise standard in order to meet the demands of budget constraints rather than the ideals of the "best" digitization.
- Yes, we have experiences of past "bad" digitizations so that we could, theoretically, avoid making the same mistakes, but we overlook the fact that use-cases change over time, users become more sophisticated, user-technologies advance and improve. We try to avoid making past mistakes, but, in doing so, we make new mistakes. Mass digitization projects seldom "look forward" to future uses. They too often "look backward" to old models of use — to page-images and flawed OCR — because those are improvements over the past, not advances for the future. But those decisions are only "improvements" when we compare them to print — or more accurately, comparing physical access to and distribution of print vs digital access to and distribution over the Internet. When we compare those choices to future needs, they look like bad choices: page-images that are useless on higher-definition displays or smaller, hand-held devices; OCR that is inaccurate and misleading; digital text that loses the meaning imparted by layout and structure of the original presentation; digital text that lacks markup for repurposing; and digital objects that lack fine-grained markup and metadata that are necessary for accurate and precise search results finer than volume or page level. (There are good examples of digitization projects that make the right decisions, but these are mostly small, specialized projects; mass digitization projects rarely if ever make the right decisions.) Worse, we compound previous mistakes when we digitize microfilm copies of paper originals thus carrying over limitations from the last-generation technology.
So, yes, it is a shame that we have bad digitizations now. But not just in the sense of regrettable or unfortunate. More in the sense of humiliating and shameful. The real "shame" is that FDLP libraries are accepting the GPO Regional Discard policy that will result in fewer paper copies. That means fewer copies to consult when bad digitizations are inadequate, incomplete, or unusable as "surrogates"; and fewer copies to use for re-digitization when the bad digitizations fail to meet evolving requirements of users.
We could, of course, rely on the private sector (which understands the value of acquiring and building digital collections) for future access. We do this to save the expense of digitizing well and acquiring and building our own public domain digital collections. But by doing so, we do not save money in the long-run; we merely lock our libraries into the perpetual tradeoff of paying every year for subscription access or losing access.
The first thing the new Republican-led Congress did was attempt to kill the Office of Congressional Ethics — thankfully the public uproar forced them to withdraw the plan *for now*.
Now Congress is set to put into place a terrible new law called the Regulations from the Executive in Need of Scrutiny Act (REINS Act). This bill, which has died in the last 2 Congresses (and for good reason), would essentially cripple executive agencies and their ability to create regulations to apply laws passed by Congress.
Currently, executive agencies develop regulations to apply new laws through careful study, research, and discussion with experts and the public — all proposed regulations are published in the Federal Register and the public is given a chance to comment in order to assist policy experts in writing solid regulations.
If REINS passes (and it has passed the House twice but died in the Senate each time!), any regulation issued by an executive agency (like the EPA) that has an “annual economic impact of $100 million or more,” which is less than 0.0006 percent of the U.S. economy, will have to be approved by Congress within “70 session days” or it will not go into effect. Essentially, Congress gets a “pocket veto”: if it does not affirmatively give its blessing within 70 session days, the proposed regulation dies. REINS basically guts the system of checks and balances among our 3 branches of government upon which the US government rests.
The Federal government already works very slowly (when it works), but with one stroke of a pen, our government will be permanently crippled. Please, please, please contact your members of Congress and let them know this is a BAD idea!!
The incoming House majority plans to schedule a vote on the Regulations from the Executive in Need of Scrutiny Act (REINS Act) soon after new members are sworn in next Tuesday. A top priority of the U.S. Chamber of Commerce, the leading lobby group for big business, REINS would fundamentally alter the federal government in ways that could hobble federal agencies during periods when the same party controls Congress and the White House — and absolutely cripple those agencies during periods of divided government.
Many federal laws delegate authority to agencies to work out the details of how to achieve relatively broad objectives set by the law itself. The agencies do so by drafting regulations that interpret and elaborate upon these statutes and which have the force of law. REINS, however, effectively strips agencies of much of this authority.