Recently, a speaker at a conference said that Thomas, the Library of Congress site with extensive information about federal bills and legislation, excluded all search engines from indexing the site. Why, the speaker asked, would a government site wish to block search engines? Government agencies might, like most web sites, exclude search engines from parts of a site for benign reasons (such as blocking temporary or out-of-date pages), but it seems counter-intuitive that anyone would block search engines entirely, since they tend to drive most of the traffic at most web sites.
This prompted me to examine whether, when, and how government agencies are blocking search engines. I discovered that Thomas does not block all search engines after all; it does, however, allow only Google to fully index the site. That raises more questions than it answers. What follows is background, findings, and conclusions from a selective examination of what some agencies are doing.
Search engines like Google, Yahoo, and MSN Search build indexes of web pages by using software (called “robots” or “spiders”) that browses (or “crawls”) the web: finding web pages and the links on those pages, following those links to find more pages, and so on. When the spidering software reaches a site, it is supposed to do a couple of things before proceeding: look for a special file named “robots.txt”, read the rules in that file that say where it may and may not go, and then obey those rules. The rules (defined by the Robots Exclusion Standard) allow the site to exclude or allow robots to visit particular parts of the site. This is a web convention, not a mandate, but all “well-behaved” robots are supposed to follow it. The rules can exclude robots from a directory, from a particular file, or even from a particular type of file. They can also specify that one robot may reach a part of the site while another may not. (This is useful, for example, for site administrators who want their own indexing software to index parts of the site that they don’t want visible to the world.) There are other ways to ask a search engine to refrain from indexing a page, but I don’t cover those here.
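To illustrate, here is a minimal, hypothetical robots.txt of the kind just described, checked with Python’s standard-library robots.txt parser. The directory names and the robot name “InternalIndexer” are invented for this sketch; they are not from any real site.

```python
import urllib.robotparser

# A hypothetical robots.txt: all robots are excluded from two
# directories, but a site-internal indexer is allowed everywhere.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/
Disallow: /drafts/

User-agent: InternalIndexer
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A generic robot falls under the "*" rules...
print(rp.can_fetch("SomeBot", "/archive/old-page.html"))  # False
print(rp.can_fetch("SomeBot", "/public/page.html"))       # True
# ...while the named robot may go anywhere (an empty Disallow allows all).
print(rp.can_fetch("InternalIndexer", "/archive/old-page.html"))  # True
```

Note that a well-behaved spider consults these rules voluntarily; nothing on the server actually prevents a badly behaved robot from fetching the excluded pages.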
Several good explanations of how robots.txt files work are available on the web.
I examined the robots.txt files at some dot-gov (.gov) sites. The following is a report on what I have found so far. I used Google to identify the files (searching Google for robots.txt files at .gov sites) and examined a selection of the federal sites from that result. I invite others to let me know of other interesting exclusions.
It is not always easy to know the effects of a robots.txt file on a complex site. When the file is quite long and lists many directories that are themselves links to other parts of the site, for instance, it is possible that these are old links and spiders will find the information through another route. The EPA site is an example of this; it has 975 exclusions and, without following every one and navigating the site looking for the information in the exclusions, it is difficult to determine if the information is still findable by spiders.
That said, here are the few sites that I examined that had some interesting features.
- Customs and Border Protection (visited 10/30/2005)
Excludes all spiders except Google and the “gsa-crawler”, which it allows everywhere.
- Department of Justice (visited 10/30/2005)
Excludes indexing of its archive.
- Department of Labor (visited 10/30/2005)
Evidently excludes spiders from older press releases (e.g., /dol/media/press/archives.htm) and from decisions of the Employees’ Compensation Appeals Board and the Benefits Review Board.
- Environmental Protection Agency (visited 10/30/2005)
Has a lengthy robots.txt file that excludes all robots from parts of the site such as the state Binary Base Maps of the Exposure Assessment Models, a page on the Columbia Space Shuttle Accident, the section on Environmental Monitoring for Public Access and Community Tracking, the entire “Human Exposure Database System,” parts of the site that require a password (e.g., “mercury-archive”), and much more.
- Thomas (visited 10/30/2005)
Allows Google’s robot complete access. Allows the “Ultraseek” robot any plain file, but excludes it from following any dynamic links (which is most of the site). Denies all other robots any access.
- USDA Forest Service Southern Research Station (visited 10/30/2005)
Excludes whole sections of publications in PDF format (e.g., “/pubs/rp/*.pdf” [recent publications]).
- White House (visited 10/30/2005)
An extensive list of over 2,000 exclusions. Most of these seem to be aimed simply at excluding robots from the “text” versions of web pages that are probably duplicates of other pages; if so, they have essentially no effect on indexing but probably reduce the number of times spiders hit the site and save bandwidth. Interestingly, though, most of the other exclusions are for directories named “iraq” that do not exist. Excluding non-existent directories is neither necessary nor functional. Perhaps these are in place to save bandwidth by preventing spiders from hunting for information about Iraq rather than following actual links. Examples: /infocus/healthcare/iraq and /infocus/internationaltrade/iraq.
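For the “Google-only” pattern seen at Thomas and Customs, a hypothetical robots.txt implementing that policy might look like the following sketch (this is an illustration of the policy described above, not the actual contents of either site’s file, and the example path is invented):

```python
import urllib.robotparser

# A hypothetical robots.txt allowing one named robot everywhere
# while denying all others any access.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Googlebot", "/bills/index.html"))  # True: empty Disallow allows all
print(rp.can_fetch("OtherBot", "/bills/index.html"))   # False: "Disallow: /" blocks everything
```

Because robots are matched by name, any spider other than the one singled out falls through to the catch-all `User-agent: *` rule and is shut out of the entire site.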
It appears that most agencies are either not using robots.txt files to exclude search engine robots, or are using the exclusion rules sparingly and appropriately. At least two government web sites (whitehouse.gov and epa.gov), and perhaps others, warrant further study because their robots.txt files are so lengthy, or the exclusions so opaque, that the overall effect is hard to determine.
Two other federal government web sites (customs.gov and thomas.loc.gov) have troubling exclusions that block most search engines other than Google.