Month of October, 2005
An interesting article entitled "UNDOCUMENTED EVIDENCE: The Politics (and Profits) of Information: The 9/11 Commission One Year Later" has just come out from the Washington Spectator. The article, written by Max Holland, touches on various government information issues: privatization, the role of GPO and the FDLP, access, government secrecy, etc.
As a consequence of this high-level commission's decision to give Norton the rights to publish its report instead of going through the GPO, much of the commission's work was not included in the commission report and was available only online (if at all) or through private publishers at substantial cost. This highlights a weakness of the government information system: Title 44 has no teeth to make commissions, agencies, and other bodies comply with the law and go through GPO for their publications. If GPO had had a role in the publication of the commission's work, depository libraries would have received the commission's final report, as well as staff monographs and other supplemental volumes. Instead, only the 567-page final report -- with no index! -- has been deposited. The Washington Spectator has graciously allowed free access to this article. If you have a problem with the link, please download the PDF of the article attached to this post below. We will also add it to the FGI library.
Recently, a speaker at a conference said that Thomas, the Library of Congress site with extensive information about federal bills and legislation, excluded all search engines from indexing the site. Why, the speaker asked, would a government site wish to block search engines? Government agencies might, like most web sites, exclude search engines from parts of a site for benign reasons (such as blocking temporary or out-of-date pages), but it seems counter-intuitive that anyone would want to block search engines entirely, since they tend to drive most of the traffic at most web sites.
This prompted me to examine whether, when, and how government agencies are blocking search engines. I discovered that Thomas does not block all search engines after all -- but it allows only Google to fully index the site. That raises more questions than it answers. What follows is some background, along with findings and conclusions from a selective examination of what some agencies are doing.
Search engines like Google, Yahoo, MSN Search, etc. build indexes of web pages by using software (called "robots" or "spiders") that browses (or "crawls") the web, finding web pages and the links on those pages, following those links to find more pages, and so forth. When the spidering software reaches a site, it is supposed to do a couple of things before proceeding. It is supposed to look for a special file named "robots.txt", read the rules in that file that tell it where it may and may not go, and then obey those rules. The rules (called the Robots Exclusion Standard) allow the site to exclude or allow robots to visit particular parts of the site. This is a web convention, not a mandate, but all "well-behaved" robots are supposed to follow these rules. The rules can exclude robots from a directory, a particular file, or even a particular type of file. They can also specify that one robot can get to one part of the site while another cannot. (This is useful for site administrators who want their own indexing software to index parts of the site that they don't want visible to the world, for example.) There are other ways to ask a search engine to refrain from indexing a page, but I don't cover those here.
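As an illustration, here is a small, hypothetical robots.txt file; the directory names and the robot name are invented for the example:

    # Keep all robots out of two directories.
    User-agent: *
    Disallow: /tmp/
    Disallow: /drafts/

    # Let one named robot (a hypothetical "internal-indexer") into /drafts/
    # by giving it its own record with a narrower exclusion.
    User-agent: internal-indexer
    Disallow: /tmp/

A robot obeys the record that names it specifically, if there is one; the "User-agent: *" record applies only to robots not named elsewhere in the file. So here, internal-indexer may crawl /drafts/ while every other robot may not.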
Explanations of how robots.txt files work are available on the web; see, for example, the Web Robots Pages at robotstxt.org.
I examined the robots.txt files at some dot-gov (.gov) sites. The following is a report on what I have found so far. I used Google to identify the files (searching for files named robots.txt at .gov sites) and examined a selection of the federal sites from that result. I invite others to let me know of other interesting exclusions.
It is not always easy to know the effects of a robots.txt file on a complex site. When the file is quite long and lists many directories that are themselves links to other parts of the site, for instance, it is possible that these are old links and spiders will find the information through another route. The EPA site is an example of this; it has 975 exclusions and, without following every one and navigating the site looking for the information in the exclusions, it is difficult to determine if the information is still findable by spiders.
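One way to test the effect of particular rules without tracing an entire 975-entry file by hand is to ask a robots.txt parser directly. Here is a minimal sketch in Python using the standard library's robots.txt parser; the site, paths, and robot names are only examples (the "mercury-archive" path is taken from the EPA exclusions discussed below):

    from urllib.robotparser import RobotFileParser

    # Fetch and parse a site's robots.txt file.
    rp = RobotFileParser()
    rp.set_url("https://www.epa.gov/robots.txt")  # example site
    rp.read()

    # Ask whether particular robots may fetch particular pages.
    for agent in ("Googlebot", "Slurp", "SomeOtherBot"):
        for url in ("https://www.epa.gov/",
                    "https://www.epa.gov/mercury-archive/"):
            print(agent, url, rp.can_fetch(agent, url))

Each can_fetch() call answers the same question a well-behaved spider would ask before visiting a page, so a short script like this can check many robot/page combinations quickly.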
That said, here are a few of the sites I examined that had some interesting features.
- Customs and Border Protection (visited 10/30/2005)
Excludes all spiders except Google and the "gsa-crawler", which it allows everywhere.
- Department of Justice (visited 10/30/2005)
Excludes indexing of its archive.
- Department of Labor (visited 10/30/2005)
Evidently excludes spiders from older press releases (e.g., /dol/media/press/archives.htm) and from decisions of the Employees' Compensation Appeals Board and the Benefits Review Board.
- Environmental Protection Agency (visited 10/30/2005)
Has a lengthy robots.txt file that excludes all robots from parts of the site such as the state Binary Base Maps of the Exposure Assessment Models, a page on the Columbia Space Shuttle Accident, the section on Environmental Monitoring for Public Access and Community Tracking, the entire "Human Exposure Database System," parts of the site that require a password (e.g., "mercury-archive"), and much more.
- Thomas (visited 10/30/2005)
Allows Google's robot complete access. Allows the "Ultraseek" robot any plain file, but excludes it from following any dynamic links (which covers most of the site). Denies all other robots any access. (A sketch of what such a file might look like appears after this list.)
- USDA Forest Service Southern Research Station (visited 10/30/2005)
Excludes whole sections of publications in PDF format (e.g., "/pubs/rp/*.pdf" [recent publications]).
- White House (visited 10/30/2005)
An extensive list of over 2,000 exclusions. Most of these seem to be aimed simply at excluding robots from the "text" versions of web pages that are probably duplicates of other pages; if so, they have essentially no effect on indexing but probably reduce the number of times spiders hit the site and save bandwidth. Interestingly, though, most of the other exclusions are for directories named "iraq" that do not exist. Excluding non-existent directories is neither necessary nor functional. Perhaps these are in place to save bandwidth by preventing spiders from hunting for information about Iraq rather than following actual links. Examples: /infocus/healthcare/iraq, /infocus/internationaltrade/iraq
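Returning to Thomas: based on the behavior described above, a robots.txt file producing those rules might look roughly like the following. This is a reconstruction for illustration, not a copy of the actual file, and the /cgi-bin/ path is my guess at where the dynamic links live:

    # Googlebot: allowed everywhere (an empty Disallow permits everything).
    User-agent: Googlebot
    Disallow:

    # Ultraseek: allowed plain files, but kept away from dynamic (CGI) links.
    User-agent: Ultraseek
    Disallow: /cgi-bin/

    # Everyone else: denied all access.
    User-agent: *
    Disallow: /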
It appears that most agencies are either not using robots.txt files to exclude search engine robots at all, or are using the exclusion rules sparingly and appropriately. At least two government web sites (whitehouse.gov and epa.gov), and perhaps others, warrant further study because their robots.txt files are so lengthy that the overall effect of the exclusions is hard to determine.
Two other federal government web sites (customs.gov and thomas.loc.gov) have troubling exclusions of most search engines except Google.
Martin E. Halstuk, writing in the Columbia Journalism Review, has an interesting article on the Freedom of Information Act and its need for a tune-up. FOIA became law on July 4, 1966, and over the decades it has helped citizens and journalists expose waste and fraud in the federal government and uncover unsafe consumer products, dangerous drugs, health hazards, and more.
Recently, however, FOIA has been steadily eroded by broad interpretations of its exemptions, stalling tactics, and the high cost of litigation. More ...
Free Government Information would like to extend its thanks to Barbie Selby and the entire Depository Library Council (DLC) for placing notes about the vision paper small group breakouts on their DLC Vision Blog. The DLC's timely dissemination of these comments is greatly appreciated.
For the convenience of FGI readers, here are the direct links to the comments (some scrolling required!):
- Library Roles in the Non-Exclusive Environment
- Managing Collections & Delivering Content
- Adding Value
- Deploying Expertise
These links have also been added to our Fall 2005 DLC grassroots proceedings page, which also contains audio of the Council reports on the group breakouts.
As reported here before, Massachusetts decided last month that all the documents its employees create must be in the OpenDocument format. This, according to Markham, "could be the trigger for a revolution that will increase consumer choice and ensure the survival of documents that could be of historical importance in the future."
Open formats are an important part of computing freedom (although alone they are not sufficient) because they give people full control of their own data. In future years, when this freedom is commonplace, I predict that the Massachusetts Decision will be seen as the turning point.
On October 25, President Bush signed Executive Order 13388 that requires agencies to use their information systems to share "terrorism information" -- including information on individuals -- with counter-terrorism agencies. While the EO explicitly says that agencies must also "protect the freedom, information privacy, and other legal rights of Americans," it, like the "PATRIOT Act," appears to give the government broad and explicit authority to examine information on individuals. It sets up an Information Sharing Council whose mission includes creating an "interoperable terrorism information sharing environment to facilitate automated sharing of terrorism information."
The term "terrorism information" is defined in section 1016 of the Intelligence Reform and Terrorism Prevention Act of 2004. That definition includes information collected on foreign or domestic terrorist groups or individuals and on those "reasonably believed to be assisting or associated with such groups or individuals." The law defines terrorism information to include the "activities" of these groups and individuals.
SECRECY NEWS, Volume 2005, Issue No. 100, October 26, 2005, from the FAS Project on Government Secrecy.
Steven Aftergood says that this "audacious" exemption "would nullify the applicability of the FOIA to an entire agency."
A week ago, the Fall 2005 meeting of the Depository Library Council (DLC) adjourned. Since then, nothing has appeared on the official DLC proceedings page.
A look at the FDSys Presentations page shows that Mike Wash's slides have yet to appear.
So far, we seem to be the only ones offering content from the Fall 2005 DLC Meeting, including some admittedly rough audio and a few web-based presentation handouts that GPO could have already posted.
I'm not writing to crow about our coverage. I'm writing because, for an institution obsessed with shedding its self-image as a buggy-whip manufacturer, GPO is unacceptably slow in posting meeting proceedings, especially in the Internet age.
If the Wyoming Library Association can provide immediate, web-based coverage of their proceedings, why can't the Government Printing Office? Because they don't have the resources? Because it's too technically difficult? If that's the case, are they really the ones we want running an all-encompassing digital information system?
"The Information Poor and the Information Don't Care: Small Libraries and the Digital Divide"
When: Wednesday, October 26, 2005, from 7:00 to 8:30 PM
Where: San Jose, King Library, room 225b
After the broadcast, the talk will be archived on the Internet at http://amazon.sjsu.edu:8080/ramgen/wiab/jwest/archive.smil.
I'm highlighting this talk because I think the digital capabilities of small libraries will be important in a future where the federal government expects virtually all information to be available exclusively on the Internet.
I'll try to catch her talk this week and summarize it for you. If someone beats me to it, please feel free to comment below!
A new report from the Mellon Foundation that urges action on preserving scholarly electronic journals strongly suggests the need for digital deposit and for having locally controlled copies of digital information. This parallels issues that the government information community should address as well.
- Urgent Action Needed to Preserve Scholarly Electronic Journals, report of a meeting at the Mellon Foundation, October 15, 2005, edited by Donald J. Waters, Program Officer, The Andrew W. Mellon Foundation. (There is also a PDF version.)
While the report focuses on electronic journals, I believe that many of the points it makes apply equally to government information. It says, for instance, that the problem originates when libraries "do not take local possession of a copy as they did with print..." and that this drives "control of more and more journals into fewer and fewer hands." The report notes that "owning a copy" provided "long-term maintenance and control" that is absent when we license "access." Here are some other excerpts: