Tag Archives: open computation
Happy new year to all of you. Whether you are on vacation and peeking at the news, or reading this as you just get back to work, here is something interesting and fun to see:
Wattenberg is a computer scientist and new media artist. He is the founding manager of IBM’s Visual Communication Lab, which researches new forms of visualization and how they can enable better collaboration.
Check out his many projects (e.g., Name Voyager, the Baby Name Wizard with data from the Social Security Administration, or history flow, visualizing the editing history of Wikipedia pages, or Many-Eyes, an experiment in open, public data visualization and analysis).
This is another good example of what interesting things can be done when we have complete access to information. When the raw data are free, we can do so much more than the single views of data provided by government agencies.
Read more about Wattenberg here:
He creates ways of seeing information, by Billy Baker, Boston Globe, December 29, 2008.
2008 Society of American Archivists Convention: “Citizens in the Dark? Government Information in the Digital Age”
These are my speaker-notes for the presentation, “Citizens in the dark?
Government Information In the Digital Age,” which I gave on Friday
Aug 29, 2008, at the meeting of the Acquisitions and Appraisal Section
of the Society of American Archivists Convention in San Francisco.
The theme of the convention was “Archival R/Evolution & Identities.”
This is not a transcript of what I actually said, but an outline from which
I spoke. There are sentence fragments and inconsistent capitalization and
other less-than-final-draft editing. I hope that this is useful to you in
spite of these distractions.
I do include the points I tried to make and a bit of the verbiage and all
of the links I have.
– Jim Jacobs.
The title of this presentation is:
Citizens in the dark? Government Information In the Digital Age
We are seeing a fundamental change in the way governments communicate
with citizens. These changes are NOT caused by technology, although
they are enabled by technologies. They are driven and determined by
economic, political, and social issues.
The solutions are, therefore, not technological either, although they
will be enabled by technology. The solutions are economic, social,
Abby Smith, Director of Programs at the Council on Library and Information
Resources, in a CLIR report on “authenticity” in a digital age, summed
this up quite nicely:
Interestingly, the scholar-participants suggested that technological
solutions to the problem [of establishing the authenticity of a digital
object] will probably emerge that would obviate the need for trusted
third parties. Such solutions may include, for example, embedding
texts, documents, images, and the like with various warrants (e.g.,
time stamps, encryption, digital signatures, and watermarks). The
technologists replied with skepticism, saying that there is no
technological solution that does not itself involve the transfer of
trust to a third party. Encryption — for example, public key
infrastructure (PKI) — and digital signatures are simply means of
transferring risk to a trusted third party. Those technological
solutions are as weak or as strong as the trusted third party. To
devise technical solutions to what is, in their view, essentially a
social challenge is to engender an “arms race” among hackers and their
Abby Smith, “Digital Authenticity in Perspective.” in “Authenticity in a
Digital Environment,” Council on Library and Information Resources,
Publication 92. (May 2000).
“Trust” is a social issue, not a technological one.
As we look at the technological changes, the way governments are using
and not using, adopting and avoiding, and in general coping with these
technological changes, I think we all see trends.
The agenda of this conference reflects these trends and changes,
with its theme of Revolution and Evolution, with sessions on
– digital repositories
– born digital materials,
– digital manuscripts
– the “e-tiger”
And of course, representatives from NARA, LC, and GPO are here to discuss
I assume that all of us are familiar, at least in a general way, with
many of the difficulties of digital archiving, things like:
– format obsolescence
– media deterioration
– content that is tied to a particular operating system or application
– the need for new kinds of metadata
– emulation and migration strategies.
So, I will not cover those today.
What I do want to do is to give you a (perhaps) slightly different
perspective and some (possibly) different ideas and approaches to these
challenges and bring up some issues that I believe do not have enough
1. In the past, government information archiving was straightforward
a) We knew and could fairly easily define and identify records
b) We could (again, in a fairly straightforward way) identify
responsibility for record creation, scheduling, retention, deposit,
preservation, access, etc.
c) We could establish procedures to get things done. predictable,
So… in the past, we had a pretty clear path of preservation:
– of what we wanted to preserve and
– of how to preserve it and
– of who was responsible at each stage from record creation through
retention and disposition and preservation.
We could define and identify what we wanted to preserve and seek and
possibly fund the preservation.
We may not have always been 100% effective, there may have been failures,
gaps, short-funding, recalcitrant agencies, mistakes, etc. but we at least
knew what we were doing and where the gaps were…
A lot has changed, perhaps everything. Here are four areas
of fundamental change that affect our ability to archive the
complete historical record of governments:
1) WHAT. While to some extent we can still define and identify records,
the job of doing so is much less clear. There may be some things that
we cannot get a hold on to define as records. There may be things that
are part of the record which the government does not even possess. Or for
which it lacks licensing or copyright permission to possess or copy.
2) WHO. Even to the extent that we can identify (broadly) what we want
to preserve, it may be hard to identify who is responsible and
difficult to create adequate, implementable, schedules for
3) HOW. Even if we can do all that, digital preservation itself is
difficult and it is very hard to move from a quick-moving,
service-oriented, bureaucratic, day-to-day, digital environment, to an
environment of digital preservation.
4) ACCESS. While preservation without access is not preservation at all,
“access” is a very different process than preservation.
It seems to me that the very processes that make it *easier* for a
current end-user to find and use digital information make it *harder*
for the archivist to preserve that same information and ensure its
usability in the future.
Let’s look at some examples
E-mail certainly provides good examples of the “recalcitrant agency” problem.
But I want to emphasize some other issues that will plague us even if we
solve that one.
An article in Technology Review gave several good examples.
One related a story that Allen Weinstein tells about how he discovered
in his FBI files a newspaper clipping with a note hand-written on it by
J. Edgar Hoover.
If that same communication happened today, it would most likely happen
in an email with, perhaps, an attachment of the article, or worse, a
link to the article.
Even if we had in place all the new laws and regulations that are being proposed
to ensure that we can actually save email, would we have a complete record? Or
would we have a partial record with a key part missing? And would we be able
to find or identify that part? Would we be permitted to archive it?
Talbot, David. “The Fading Memory of the State.” Technology Review, July
Another problem with email is the difficulty in knowing what to preserve.
The simplest algorithm for preservation of email is to preserve everything, but
that means preserving so many trivial, unimportant messages that would not
normally be scheduled for retention in any rational universe.
RECORD OF INFORMATION USED IN DECISION MAKING
Another example from that same Technology Review article:
The mistaken bombing in 1999 of the Chinese embassy in Belgrade.
U.S. officials blamed the error on outdated maps used in targeting.
Today’s planners would use GIS software to zoom and pan, and run
calculations about the topography to make a targeting decision.
Would the software preserve the decision making process?
There are layers of challenges here:
– the data used (spatial data, databases of locations, topography, etc.)
– the software used to analyze and use the spatial data
– the code behind that software that has its own algorithms for implementing
– the actual use by the end-users, the trail of how they used the software
to analyze the data
These are difficult things to archive!
When decisions are based on computer models working on dynamic databases,
will we be able to preserve for future historians the state of the database
and the algorithms built into the models?
When we think about the preservation of the historical record, we have to
include public documents as well as private communications and decision-making
In the past, “public documents” meant “publications” that were widely distributed
to the public and depository libraries.
Today, it means web sites.
As you probably know, the Library of Congress recently announced a big
project with several partners to crawl the .gov domain at the end of the
current presidential administration. This will harvest a lot of digital
content that might otherwise disappear with the change of administrations.
An article about this project discusses the software being used and some
of the issues.
Quint, Barbara. “Consortium–Minus NARA–Archiving Bush Administration
Websites.” Information Today NewsBreaks, August 28, 2008.
See also a discussion about NARA’s role at the ArchivesNext blog:
And some more on NARA here:
Web-harvesting is not a definitive solution, though.
When we compare web harvesting with active deposit by the government of
documents in depository libraries we can get a glimpse at the scope of
the preservation problem we face.
Web harvesting puts the onus on the harvesters.
It releases the government from the obligation of actively depositing information.
It is a step back in time. It means that archivists and librarians have
less control over their own selection and acquisition and that agencies have
WHERE IS GOVINFO? (ON THE WEB…?)
A study by the Center for Democracy and Technology late last year
“Why Important Government Information Cannot Be Found Through
Commercial Search Engines”
The reasons for the failure of search engines to adequately index
government web sites are the same reasons that make it difficult for
web harvesting to be successful. If we can’t find the information,
we cannot harvest it.
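The point that a harvester can only save what it can find can be sketched in a few lines. This is a minimal illustration, not a real crawler: the site map, the page names, and the assumption that one page is reachable only through a search form are all invented for the example.

```python
from collections import deque

# Hypothetical site map, invented for illustration: page -> pages it links to.
LINKS = {
    "/": ["/about", "/reports"],
    "/about": ["/"],
    "/reports": ["/reports/2008"],
    "/reports/2008": [],
    # Assumed to be reachable only by submitting a search form,
    # so no other page links to it.
    "/database/record?id=42": [],
}


def crawl(links, start="/"):
    """Breadth-first crawl that follows hyperlinks only; returns the pages found."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


found = crawl(LINKS)
missed = sorted(set(LINKS) - found)
print("harvested:", sorted(found))
print("never found:", missed)
```

A link-following harvester and a link-following search engine share the same blind spot here: the unlinked record exists on the site but never enters the crawl queue.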
We cannot preserve what we cannot save.
BEYOND .GOV AND .MIL
One problem (and a rapidly growing one) is that not all government information is
on the .GOV and .MIL domains.
Here are some examples:
twitter.com is the very popular “micro blogging” site, where people post
very short entries about what they are doing, where they are having lunch,
and so forth. Did you know that many government agencies twitter?
– the White House Communications Office
– the Department of Health & Human Services: Office on Women’s Health
– and more than 60 others.
– 20 or 30 members of congress
The military is actively using YouTube to post videos
U.S. military offers up its side of the Iraq war on YouTube
Los Angeles Times, by Alexandra Zavis, May 01, 2007, in print edition A-4
One military YouTube channel says that:
“Video clips document action as it appeared to personnel on the ground and
in the air as it was shot.”
– NASA posts videos on YouTube and iTunes
– NOT JUST FEDERAL…
I’m mostly giving examples of the federal government today, but the trends
stretch across all levels of government.
the PRINCE WILLIAM COUNTY SERVICE AUTHORITY IN VIRGINIA is posting
videos on YouTube.
– the STATE OF CALIFORNIA has a YouTube channel
and GOVERNOR SCHWARZENEGGER posts to Twitter.
– According to GovTech magazine, more than 100 House members have
multimedia pages and YouTube links on their Web sites
FLICKR / LC
qik.com allows users to stream video live from their cell phones.
Congressman John Culberson of Texas is a big fan and has his own qik
channel where he streams and posts interviews, meetings and more.
Is this official government information? or political? or both?
While these examples may strike you as “not official” or “non-governmental”
the point here is that the environment for distribution of information is
changing rapidly and we must keep up with the changes. If Culberson’s qik
site is not “official,” for example, we need a way of appraising it as such
and a way of differentiating it from the next channel that appears
that *is* official.
HYBRID SITES WITH MIXED MESSAGES
The military is providing us with lots of examples of archiving problems.
These include issues of provenance, use-rights, copyright, and just plain finding
and getting information.
This is an extension of what I call the “copyright poison-pill” in which
copyrighted material appears in an otherwise non-copyrighted government
publication and creates confusion over the rights of libraries and archives
to save, reproduce, and display any or all of such materials. We see this
today in the way the Google book project has blocked access to most
government publications because they “might” be covered by copyright.
DEPARTMENT OF DEFENSE
The DOD “official website of Multi-National Force – Iraq” is a .com site,
not a .mil.
There we find
– links to other commercial sites with streaming video without download
but which add an additional level of difficulty in identifying and
bookmarking links and downloading pages.
Another DOD site that provides video clips is .mil but a lot of the content
is actually hosted by .coms
– While this is a .mil domain, it is actually operated by the Intel
Corporation and is hosted and maintained by a commercial organization
known as The FeedRoom or Globix Corporation
– While you can download video, you are bound by an END-USER LICENSE
AGREEMENT, in which Intel claims all proprietary rights to the content
and videos on the site.
– Those who try to harvest the content from this site will find
that the site instructs robots that they must not save copies of
videos or even web pages.
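For readers unfamiliar with how a site "instructs robots," here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt rules, the "archive-bot" name, and the example.org URLs are hypothetical, invented for illustration; they are not copied from the site discussed above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /videos/
Disallow: /pages/
"""


def may_harvest(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in (
    "http://example.org/videos/clip1.flv",
    "http://example.org/about.html",
):
    print(url, may_harvest(ROBOTS_TXT, "archive-bot", url))
```

A well-behaved harvester runs a check like this before every request, which is why a couple of Disallow lines are enough to keep a whole section of a site out of an archive.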
Here is another example of DOD video problems.
In January of 2008, the Pentagon broadcast a video of the “Strait of
Hormuz incident” in which an unidentified voice says, apparently to a US
battleship “You will … explode.”
It was much in the news. (I found more than 1000 items on LexisNexis over
about a 4 week period.)
In January, two defense department web sites linked to a video of the
incident and one labeled it as “From Defense Department Video.” One
of those pages still exists.
But by June 2008, that URL linked to a chef doing a promo for his show
called “Grill Sergeant” and searches for “hormuz” turned up zero hits.
Last week, when I checked again, the link I got was to a 15 second ad for
“the pentagon channel” that said:
“Embrace accountability for all that you do — for everything in your
area of responsibility.”
A shorter version of the Hormuz video, complete with “you will explode” quote
is available here:
Background information including why it is hard to know what goes missing here:
Documenting the Government — Strait of Hormuz edition
But, this is more than the tale of a broken link.
– The link was not to a .mil site, but to a commercial site (FEEDROOM again).
– The video was provided only as a streaming video and no download was
So, here we have a critical piece of the historical record, with no
indication of who filmed it or edited it or posted it or took it down.
And we have no easy way to preserve this video and no guarantee that
anyone will or can take the responsibility for doing so.
WHAT MAKES A WEBSITE OFFICIAL?
How do we determine what makes a website official?
One document I found is explicit, but vague. It says that
an “official website” includes any website hosted on the .mil domain,
but also “any website PUBLISHED or SPONSORED by a military command but
hosted on a commercial server.”
Unfortunately, this creates a cascade of problems.
– Will archivists overlook these sites because they are not .mil or
– Upon finding them, can we identify who is actually responsible for
the content? (were they Published or Sponsored by the government?)
– If we find the site and identify it as something that is
government-generated, are we allowed to archive it?
STANDARDS BY ANY OTHER NAME
One of the biggest problems digital archivists face is that of file
formats. When formats are tied to particular software or operating
systems or operating environments, it creates barriers to preservation.
“Standards” that work well for the end-user (and the service provider)
one year may be exactly the wrong standard for the archivist.
We can see an example of user-friendly, archive-unfriendly at the EPA.
The EPA has a nice site that has videos, audios, podcasts, and more.
But they have chosen the “Flash based” video format as a “standard.”
This is indeed a common format for streaming video, but it adds additional
layers of difficulty to anyone wanting to preserve the videos by
Feds set sights on small screen
By Wade-Hahn Chan FCW August 11, 2008
There have not been any substantial, comprehensive studies of what gets
withdrawn from the web by government agencies.
(See “Chronology of Disappearing Government Information” Data collected
through May 8, 2002, Compiled by Barbara Miller for ALA/GODORT
Education Committee With special assistance of Karrie Peterson, for an
example of one attempt.)
We are left with anecdotes about things disappearing or being
withdrawn and random discoveries of something here today and gone
Anyone who works with government agencies for very long will encounter,
as I have over the years, as many “policies” as there are individuals
who administer those policies.
So we sometimes see agencies that are very careful about keeping older
documents online and others that express the opinion that “No one wants
last year’s (or last month’s) report.”
Here is a recent example:
We find links to issue 6 of a “newspaper” but no links or indication of
earlier issues being available.
I have left for last the concept of “E-government” — not because it is
less important, but because it is emerging and something to watch.
E-government is intended to transform the way government communicates with
citizens and business and itself.
To the extent that it creates communications that are faster, more
accurate, and more convenient, it is a Good Thing.
But, for us, it, again, fundamentally transforms the role of government.
In the past, the role of government in information dissemination ended at
the point of dissemination. Governments would collect and create and
assemble and edit and publish information products and distribute them to
the public and to libraries.
But today, the government is taking on a new, continuing role.
With e-government, governments are saying we must go to them to get our
information today, and tomorrow, and forever.
As governments move to e-government, we are going to increasingly see
government information provided as “transactions” as opposed to
Here is a simple example:
I can call 411 and get a phone number: that’s a transaction and is a
big improvement over having to locate and use a bulky telephone book
which may not even be current.
Lots of kinds of government information lend themselves to this kind of
transaction delivery and make for better, more accurate, more timely
But, if I am a journalist and I want to look at a directory of all
employees in a department, or if I’m an historian and want to see who
was in a particular office last year (or 10 or 50 years ago), or if I’m
a demographer and I want to do a surname or given-name analysis of an
agency’s employees, then a current, up-to-date
one-transaction-at-a-time system won’t help me at all. I need an
instantiation of the information from one or more time periods.
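The difference between a transaction and an instantiation can be sketched as two ways of querying the same directory. The office name, the staff names, the snapshot dates, and both function names below are hypothetical, invented for illustration.

```python
from datetime import date

# Hypothetical staff-directory snapshots, invented for illustration.
SNAPSHOTS = {
    date(2007, 1, 1): {"Office of X": ["A. Adams", "B. Baker"]},
    date(2008, 1, 1): {"Office of X": ["B. Baker", "C. Clark"]},
}


def lookup_current(office):
    """Transaction-style access: answers only from the latest data."""
    latest = max(SNAPSHOTS)
    return SNAPSHOTS[latest].get(office, [])


def lookup_as_of(office, when):
    """Instantiation-style access: answers from the newest snapshot on or before `when`."""
    eligible = [d for d in SNAPSHOTS if d <= when]
    if not eligible:
        return []
    return SNAPSHOTS[max(eligible)].get(office, [])


print(lookup_current("Office of X"))
print(lookup_as_of("Office of X", date(2007, 6, 30)))
```

The second function is only possible because dated copies of the whole directory were kept. A service that exposes nothing but lookup_current gives the historian no way to ask who was in the office last year.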
Let me give you a concrete example: The Census
Every 10 years the federal government takes a population and housing census.
Through the government’s American FactFinder web site, the Census bureau
delivers a transaction-based service where you can find census facts and
But, in addition, the Bureau makes the raw, anonymized census data
available for downloading and has deposited the data in the largest social
science data archive in the U.S. (ICPSR at the University of Michigan).
What this means for us is that we can preserve the census. There is an
instantiation of the census in a format that we can preserve over time.
That instantiation is what is behind American FactFinder, but it is a
preservable form of that information.
This means that we can preserve the data
– without crawling a web site
– even if the census bureau budget is cut and it takes data offline
It also means that the raw data are available for uses and re-uses
beyond the transactions that the bureau makes available.
This is a model for making government information available and preservable
and usable and re-usable for the long-term.
It is important to note that this model benefits users today, not just in
the future. Transaction-interfaces offer a limited number of possible uses
of the underlying information. When the raw data are available, users
can analyze, use, and re-use the data in many ways not provided by the
Clifford Lynch has written eloquently about the need that scholars
have to get access to the raw information in the realm of scholarly
literature (Clifford A. Lynch, “Open Computation: Beyond
Human-Reader-Centric Views of Scholarly Literatures,” Open Access: Key
Strategic, Technical and Economic Aspects, Neil Jacobs Ed., Oxford: Chandos
Publishing, 2006, pp. 185-193.).
Governments may not like the idea of doing this, though. They may want to
keep control and may want to do so under the guise of “accuracy.” (E.g.,
“Last year’s phone book isn’t accurate anymore. We don’t want copies of it
out in the world confusing people.”) Indeed, we hear that very argument
from some who still argue that the people should not have free open access
to Congressional Research Service reports. Local governments in particular
may also see information as an “asset” and wish to charge for access or
use of it.
And the private sector may not like the idea of raw information being
freely distributed because they want to control access so they can charge
for it. (Indeed we see something like that with CRS reports!)
It may be a challenge to get governments to understand this concept and,
once they do, to embrace distribution.
WHAT CAN WE DO?
There is no single solution. And we should not expect any single entity or
agency or archive to “solve” the problems.
We need a multifaceted approach to preserving the historical record.
Here are some general approaches that I hope will guide you in your
1) Do you have influence over the creation of information? Then make
sure that the information is created with preservation in mind. Talk
to creators about providing an instantiation of information in addition
to transaction-based access. Advocate free and open access. Insist on
open formats (e.g., ODF http://opendocument.xml.org/) rather than
2) Identify your partners in your organization.
– IT depts. They may have tools that will help you do your job. They may
be able to do things differently that would enable preservation, but they
haven’t thought of them.
– Managers who want information access in the near term. Managers may not
think of long-term access and usability, but they usually do understand the
benefits of having their own information usable in the near-term (1 to 5 years).
If you can *guarantee* something will be usable in 5 years, you can probably
guarantee that you are going to be able to preserve it for longer periods.
3) Identify other partners
– The Internet Archive is doing a lot right now to preserve information
on the web and you can work with them to have them do preservation for you.
– Look for others locally and regionally with whom you can collaborate.
Universities may want to collaborate with governments and vice-versa, for
4) Are you a partner?
Even if you are in an archive that clearly has no responsibility for
preservation of (say) the records of an agency, you may, because of
your own archival mandates (you have personal records of a government
official, soldier, elected official) or because of your constituency
(users at a university who need the complete record for historical
analysis), have the opportunity (and obligation) to collect
information that is relevant to and even part of the complete
The library model of having many copies dispersed over many institutions
has worked well for preserving and authenticating published materials, and
it may work in the archival environment as well when we are no longer tied
to a single copy of record. Software already exists to help with this:
Lots of Copies Keep Stuff Safe
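The idea that many dispersed copies can authenticate one another can be sketched with checksums and a simple majority vote. This is an illustration only and not the actual LOCKSS protocol; the holder names, the document text, and the audit_copies function are hypothetical.

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def audit_copies(copies: dict) -> dict:
    """Compare copies held by several institutions.

    Returns the most common digest, how many holders agree on it,
    and the holders whose copy disagrees with that majority.
    """
    digests = {holder: sha256_hex(data) for holder, data in copies.items()}
    majority_digest, votes = Counter(digests.values()).most_common(1)[0]
    damaged = sorted(h for h, d in digests.items() if d != majority_digest)
    return {"majority": majority_digest, "votes": votes, "damaged": damaged}


# Hypothetical holdings; library_c's copy has a simulated corruption.
copies = {
    "library_a": b"Annual report, 2008",
    "library_b": b"Annual report, 2008",
    "library_c": b"Annual report, 2O08",
}
result = audit_copies(copies)
print("agreeing holders:", result["votes"])
print("copies needing repair:", result["damaged"])
```

With a single copy of record there is nothing to compare against. With several independent copies, a damaged or altered one stands out and can be repaired from the others.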
I want to close with a quote from that same Technology Review article
that I quoted earlier.
In it, computer scientist Robert F. Sproull of Sun Microsystems
Laboratories, who chaired a National Academy of Sciences panel that
advised NARA, said:
“If you become obsessed with getting the technical solution, you will
never build an archive.”
The challenges we face are as much political, sociological, and
economic, as technological.
What would it be like if we had true open access to large quantities of government text? We would be able to do much more than retrieve a page of the Congressional Record and read it. Researchers would be able to analyze the text and create new, innovative ways of discovering, browsing, searching, and reading text-based information.
Clifford Lynch has written eloquently about this in the realm of scholarly literature (Clifford A. Lynch, “Open Computation: Beyond Human-Reader-Centric Views of Scholarly Literatures,” Open Access: Key Strategic, Technical and Economic Aspects, Neil Jacobs Ed., Oxford: Chandos Publishing, 2006, pp. 185-193.).
I was reminded of these issues this morning when looking at Visualization Strategies: Text & Documents on Tim Showers Web Design Blog (August 20th, 2008). Tim lists more than a dozen examples of techniques and tools. One of my favorites is the visualization of the 2008 Democratic primary debates offered by the New York Times. You can hear the debate, search for keywords and see where they appear, browse a transcript, and more.
Shouldn’t we have free, open, access to large bodies of all government texts (not just search-and-retrieve access to bits-and-pieces) so that we can easily create corpora that can be indexed, browsed, and analyzed?
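As a small taste of what bulk access makes possible, here is a minimal sketch of two corpus operations that search-and-retrieve access does not support: counting terms across all documents and listing every occurrence of a keyword in context. The two-document corpus and all function names are hypothetical, invented for illustration.

```python
import re
from collections import Counter

# Hypothetical miniature corpus; the text is invented for illustration.
CORPUS = {
    "doc1": "The committee debated the budget. The budget passed.",
    "doc2": "Members questioned the census budget and the census schedule.",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_counts(corpus):
    """Count every token across the whole corpus."""
    counts = Counter()
    for text in corpus.values():
        counts.update(tokenize(text))
    return counts


def keyword_in_context(corpus, keyword, width=2):
    """List each occurrence of keyword with `width` words on either side."""
    hits = []
    for doc_id, text in corpus.items():
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((doc_id, left, tok, right))
    return hits


print(term_counts(CORPUS).most_common(3))
for hit in keyword_in_context(CORPUS, "census"):
    print(hit)
```

Both functions need the full text of every document at once, which is exactly what page-at-a-time retrieval withholds.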
Thanks and a tip of the hat to Tim Dennis!