A new paper on Link Rot and Content Drift gives new details on the extent of the problem.
Klein and his co-authors examined over a million references in close to 400,000 academic articles published between 1997 and 2012 and found that 1 out of 5 of those articles contained references that were no longer good. A lot of the articles they examined did not cite anything on the web (particularly articles published in the late 1990s when much less information had URLs). When they examined only those articles that contain references to web resources, they found that 7 out of 10 articles contained references that were rotten. The rate of failure of links is extremely high (34 to 80%) for older (1997) publications, but still very high (13 to 22%) for recently published (2012) articles.
Over the period covered, more articles cited more items on the web, and the authors discovered, as you might guess, that the percentage of articles with rotten cites increased over time (from less than 1% in 1997 to as high as 21% in 2012).
They also examine “content drift.” The authors define it this way: “The resource identified by a URI may change over time and hence, the content at the end of the URI may evolve, even to such an extent that it ceases to be representative of the content that was originally referenced.” If a link in a paper leads to a “404 Not Found” error message, at least you know that the link failed. But if the link resolves to something, you cannot always know whether the information you are seeing is the same information that was cited or whether it has been altered or replaced.
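To make the distinction concrete, here is a minimal sketch of how one might tell the two failures apart, assuming you saved a checksum of the page at the time you cited it. The function names and the byte-for-byte comparison are my own illustration, not the method used in the paper. An exact checksum is also stricter than what the authors mean by drift: it flags any change at all, even a new banner or timestamp, so a mismatch only tells you that a human needs to look.

```python
import hashlib
from typing import Optional


def digest(body: bytes) -> str:
    """Return a SHA-256 checksum of a page body (hypothetical helper)."""
    return hashlib.sha256(body).hexdigest()


def classify_reference(status_code: int,
                       current_body: Optional[bytes],
                       archived_digest: str) -> str:
    """Classify a cited link given today's HTTP response.

    archived_digest is assumed to be the checksum recorded when the
    page was originally cited.
    """
    # An error status (such as 404) or no body at all: the link has rotted.
    if status_code >= 400 or current_body is None:
        return "link rot"
    # The link resolves, but the bytes differ from what was cited.
    if digest(current_body) != archived_digest:
        return "possible content drift"
    return "unchanged"
```

Without a copy (or at least a checksum) made at citation time there is nothing to compare against, which is exactly why a link that still resolves can be harder to trust than one that fails outright.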
David Rosenthal, the technical designer of LOCKSS at Stanford, has thoughtful and helpful comments on the article on his blog.
He says that the problems of link rot and content drift are even bigger than the authors of the paper describe. One example David gives is that the doi.org domain (which is used for Digital Object Identifiers) was allowed to expire on January 20th, briefly breaking DOI links all over the Web. (GPO had a similar, but much longer, problem when its PURL server crashed back in 2009.)
All of this is relevant to government information. Although the study focuses on academic publishing, the authors found that the rate of link rot in the scholarly literature is very similar to the link rot patterns found in other studies of the web in general. Klein’s paper does trace citations to .gov domains and records similar link rot for those references. David noticed that one of the links in the Klein paper itself was broken(!) and it was a link to PubMed (at ncbi.nlm.nih.gov).
But one thing that David mentions has, I think, particular importance for government information librarians who worry that the problems of preserving government information are beyond their resources: we should not rely on only one solution or one institution to address the digital preservation of government information. David says that the complexity of the problems that need to be solved (human, technical, economic, copyright, institutional, and so on) means that “there cannot be a single comprehensive technical solution.” That is not pessimism; it is realism. And it is not an excuse to give up, but a reason to act. We have to realize that we all must participate in preservation. As David says, the best we can do is to combine a diversity of partial solutions.
Imagine how much better a job 100, or 500, or 1000 FDLP libraries could do than GPO can do on its own.