
Tag Archives: link rot

Another study of link rot and content drift

A new paper on Link Rot and Content Drift gives new details on the extent of the problem.

Klein and his co-authors examined over a million references in close to 400,000 academic articles published between 1997 and 2012 and found that 1 out of 5 of those articles contained references that no longer worked. Many of the articles they examined did not cite anything on the web at all (particularly articles published in the late 1990s, when much less information had URLs). When they examined only those articles that contained references to web resources, they found that 7 out of 10 contained references that were rotten. The rate of link failure is extremely high (34 to 80%) for older (1997) publications, but still very high (13 to 22%) for recently published (2012) articles.

Over the time period covered, more articles cite more items on the web, and the authors discovered, as you might guess, that the percentage of articles with rotten citations increases over time (from less than 1% in 1997 to as high as 21% in 2012).

They also examine “content drift,” which they define this way: “The resource identified by a URI may change over time and hence, the content at the end of the URI may evolve, even to such an extent that it ceases to be representative of the content that was originally referenced.” If a link in a paper leads to a “404 Not Found” error message, at least you know that the link failed. But if the link resolves to something, you cannot always know whether the information you are seeing is the same information that was cited, or whether it has been altered or replaced.

David Rosenthal, the technical designer of LOCKSS at Stanford, has thoughtful and helpful comments on the article on his blog.

He says that the problems of link rot and content drift are even bigger than the authors of the paper describe. One example David gives is that the doi.org domain (which is used for Digital Object Identifiers) was allowed to expire on January 20th, briefly breaking DOI links all over the web. (GPO had a similar, but much longer, outage when its PURL server crashed back in 2009.)

All of this is relevant to government information. Although the study focuses on academic publishing, the authors found that the rate of link rot in the scholarly literature is very similar to link rot patterns found in other studies of the web in general. Klein’s paper does trace citations to .gov domains and records similar link rot in those references. David noticed that one of the links in the Klein paper itself was broken(!), and it was a link to PubMed (at ncbi.nlm.nih.gov).

But one thing David mentions has, I think, particular importance for government information librarians who worry that the problems of preserving government information are beyond their resources: we should not rely on any one solution or any one institution to address digital preservation of government information. David says that the complexity of the problems that need to be solved (human, technical, economic, copyright, institutional, and more) means that “there cannot be a single comprehensive technical solution.” That is not pessimism; it is realism. And it is not an excuse to give up, but a reason to act. We must all participate in preservation. As David says, the best we can do is to combine a diversity of partial solutions.

Imagine how much better a job 100, or 500, or 1000 FDLP libraries could do than GPO can do on its own.

Temporal Context in Digital Preservation

Temporal context is an important and much overlooked aspect of preservation. Users of preserved information need a way of using that information in its original context. Many born-digital documents are not isolated and complete in themselves, but are part of a network of documents. Providing temporal context means preserving the context of a document at the time it was created.

When we preserve a document (however we might define “document”), the links in that document to other documents should not only work, but should also lead to the same content the author linked to at the time the document was created, not to a later version the author never saw. (We might also want to link to later or current versions, but we should do so deliberately, giving the user enough information to know what she is getting and a choice among versions when choices exist.)

This relates both to versioning (identifying different versions and editions and modifications of documents) and also to link-rot (keeping links working and working properly).
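One partial mechanism for temporal context already exists in public web archives: asking for the snapshot closest to the date a citation was made. As a rough sketch (this uses the Internet Archive’s public Wayback “availability” endpoint; the example URL and date below are illustrations, not anything cited in these posts), a link-repair tool might do something like:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_API = "https://archive.org/wayback/available"

def snapshot_query(url, cited_date):
    """Build a Wayback availability query for the snapshot closest to
    the date (YYYYMMDD) a document cited the URL."""
    return WAYBACK_API + "?" + urlencode({"url": url, "timestamp": cited_date})

def closest_snapshot(url, cited_date):
    """Return the archived snapshot URL closest to cited_date, or None."""
    with urlopen(snapshot_query(url, cited_date)) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None
```

A sketch like this handles only the “link to what the author saw” half of the problem; identifying and labeling versions for the user still requires metadata the archive alone cannot supply.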

Here are two readings and a podcast that address these issues.

Research data lost to the sands of time

Here’s an interesting article, not on link rot (a topic FGI has been tracking for some time), but on *data rot*. In a recent article in Current Biology, researchers examined the availability of data from 516 studies between 2 and 22 years old. They found the following:

  • the odds of a data set being reported as extant fell by 17% per year;
  • broken e-mail addresses and obsolete storage devices were the main obstacles to data sharing;
  • policies mandating data archiving at publication are clearly needed.

Librarians have known of this issue for years (the Inter-university Consortium for Political and Social Research (ICPSR) was set up in 1962 to tackle it), but this study puts the issue in focus. And the federal government, via efforts like the NSF’s data management plan requirement and OSTP’s new directive to improve the management of and access to scientific collections, is finally beginning to get behind the effort to address data rot. Many libraries, not to mention scientists and researchers, are beginning to grapple with data preservation. The issue is obviously too big for government information librarians to handle alone. But this is fertile space in which government information librarians, data librarians, research communities, and federal agencies can come together. The federal policy stating the importance of data preservation is there; it will just take effort by multiple stakeholders to make sure preservation actually happens. The writers of Dragonfly, the blog of the National Network of Libraries of Medicine Pacific Northwest Region, where I came across the article, point out that academic institutions can and should play a leading role in data preservation. I wholeheartedly agree!

Vines, Timothy H., Arianne YK Albert, Rose L. Andrew, Florence Débarre, Dan G. Bock, Michelle T. Franklin, Kimberly J. Gilbert, Jean-Sébastien Moore, Sébastien Renaut, and Diana J. Rennison. “The availability of research data declines rapidly with article age.” Current Biology 24, no. 1 (2014): 94-97.


The researchers found that for every year since a paper’s publication, the odds of finding an e-mail address that led to contact with a study author decreased by 7%, and the odds of turning up the data fell by 17%. The authors report that while some of the data sets were truly lost, others fell more into the category of “unavailable,” since they existed, but solely on inaccessible media (think Jaz disk). These findings will not come as a shock to anyone who has worked in a research lab. This publication does put some tangible numbers behind the underlying message of NYU Health Sciences Library’s excellent dramatic portrayal of an instance of inaccessible data. The authors conclude by suggesting that a solution moving forward can be found in more journals requiring the deposit of data into a public archive upon publication. I would also suggest that academic institutions can take a role by establishing policies supporting research data preservation alongside providing a data repository.
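Note that the 17% figure is a decline in the *odds* of the data being extant, not a flat 17% of data sets lost per year. A small sketch shows what that curve implies; the 4-to-1 starting odds (80% availability for a brand-new paper) are my assumption for illustration, not a number from the study:

```python
def availability(initial_odds, years, annual_decline=0.17):
    """Probability a data set is still reported extant after `years`,
    if the odds of availability shrink by `annual_decline` each year."""
    odds = initial_odds * (1 - annual_decline) ** years
    return odds / (1 + odds)

# With assumed 4:1 starting odds, availability erodes steadily:
# two decades out, well under 10% of data sets remain reachable.
for t in (0, 5, 10, 20):
    print(t, round(availability(4, t), 2))
```

The compounding is the point: modest-sounding annual losses quietly hollow out the older literature.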

Government Link Rot

Over the holidays, we switched FGI to new CMS software and a new theme and, in the process, installed some new back-end tools allowing us to do things like easily check for broken links. FGI went online in November 2004, so we have a little more than 9 years of outgoing links. Of those, 2676 link to .gov web sites and we discovered that 540 of those links are broken. That is about 20%.
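A batch check like the one just described can be sketched in a few lines. This is not the actual FGI tooling (the post does not say which link checker we installed); it is only an illustration of classifying outgoing links and computing a rot rate, and in a real run the URL list would come from the CMS database:

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def check(url, timeout=10):
    """Classify a single link as 'ok', 'broken', or 'unreachable'."""
    try:
        req = Request(url, method="HEAD")  # HEAD avoids downloading the body
        with urlopen(req, timeout=timeout) as resp:
            return "ok" if resp.status < 400 else "broken"
    except HTTPError:
        return "broken"       # 404, 410, 500, ...
    except (URLError, TimeoutError):
        return "unreachable"  # DNS failure, abandoned domain, ...

def rot_rate(results):
    """Fraction of checked links that did not resolve."""
    bad = sum(1 for r in results if r != "ok")
    return bad / len(results) if results else 0.0
```

Even a naive pass like this surfaces the distinction that matters below: a 404 from a live server versus a domain that no longer exists at all.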

That is actually lower than the 51% that the recent Chesapeake report found in its newest link rot study, but still disconcertingly high. For libraries that rely on pointing to URLs in their OPACs as a means of linking users to information, these kinds of numbers lead to one of two conclusions: either (a) you had better check and repair links frequently, or (b) your “collection” is slowly disappearing. Adding to your workload is no fun, and angering your users with bad links probably does not encourage them to increase your funding for better services. As the Chesapeake report concluded: “documents posted on web sites will disappear at an increasing rate over time.”

As I browsed through the broken links on FGI, I found a variety of reasons for link breakage.

  • abandoned domains. There is no “2010.census.gov” or “amlife.america.gov” anymore.
  • cache problems. GPO uses Akamai technology to “cache” frequently requested documents on Akamai servers throughout the world so that requests for those documents can be completed more quickly. In two cases we carelessly copied the “akamaitech” cache URL instead of the actual GPO URL. I checked, and the documents still exist at their GPO addresses. But I do wonder how often users (and even libraries?) make this mistake of copying a very temporary cache URL.
  • redesigned sites change URLs. The House Appropriations Committee Subcommittee on Legislative Branch URL apparently changed from appropriations.house.gov/Subcommittees/sub_leg.shtml to appropriations.house.gov/Subcommittees/Subcommittee/?IssueID=34776, and the similar Senate subcommittee changed from appropriations.senate.gov/legislative.cfm to appropriations.senate.gov/sc-legislative.cfm
  • minor changes. Why would BLS change its Data Finder search page from /query to /find at http://beta.bls.gov/dataQuery/? At least the Data Finder is still there!
  • e-government interest changes. What was once pandemicflu.gov is now flu.gov and blog.pandemicflu.gov is gone. HHS still has information about “Pandemic Awareness” but has evidently changed its focus to flu in general.
  • re-branding. The “govgab” blog, once at blog.usa.gov/roller/govgab/ and later at govgab.gov, is either gone or perhaps just replaced by blog.usa.gov/
  • suspended blogs. A blog for “examining rumors, conspiracy theories and false stories” at blogs.america.gov/rumors/ has been “archived or suspended” — but we don’t know which or where any “archive” might be.
  • temporary sites are … temporary. The site change.gov simply says “the transition has ended” and invites you to go to whitehouse.gov where, apparently, “agendas” have changed to “issues.” http://change.gov/agenda/technology http://www.whitehouse.gov/issues/technology/
  • CMS changes? Why would HRSA change a nice, lean URL like datawarehouse.hrsa.gov/NSSRN.htm for the National Sample Survey of Registered Nurses web site to datawarehouse.hrsa.gov/data/datadownload/nssrndownload.aspx? My guess is that they are using a new content management system that dictates how URLs are constructed.
  • scrubbing? When a report is controversial, is it just easier to remove it than to keep it online? The link to the Wegman Report at the House Energy and Commerce Committee is broken.
  • FDLP “out of date” information? FDLP is not immune to link rot. Where are the questions for a 2009 DLC discussion?
  • GPO moves stuff too. It is not easy to work in a bureaucracy that itself changes and, in doing so, changes how it does things. Remember “The Federal Bulletin Board”? Probably not. Back in the 1990s it had 4,500 individual federal agency files in a variety of formats. GPO operated the bulletin board, which could “be accessed 24 hours a day, 7 days a week, by direct dialing 202/512-1387 from a modem using any communications software.” (Just type “/GO FAC.”!) As time moved on, GPO moved the files to permanent.fdlp.gov/fbb/, but they are not all there, at least not at the same links. This one: http://fedbbs.access.gpo.gov/library/compare/compr5.pdf is now, apparently, here: http://beta.fdlp.gov/file-repository/about-the-fdlp/gpo-projects/legislative-comparison-report/1189-legislative-comparison-report-2008-revised/file

I’ll stop there. I’ve only looked at about one tenth of the broken links, but the above should give us an idea of the kind of problems we face with pointing instead of collecting. Of course, some of the information may still exist somewhere with different URLs, but some may be gone permanently. We should be telling our library managers that “pointing” is not a cheap way to provide good service, it is a laborious task that is not necessarily easier than collecting, and certainly is not as reliable.

By the way: links to FGI pages are not immune from the kinds of link rot described above. We went to a lot of trouble in our switchover to a new theme to minimize broken links, but we know there are some that we were unable to duplicate. We’re still fixing the ones we can and invite you to let us know if you find any. We’re not saying FGI is better than .gov, we are saying that libraries should not rely on pointing. :-)

Link Rot up to 51% for .gov domains

New Link Rot report from Chesapeake.

For the past six years, the Georgetown Law Library and the Chesapeake Digital Preservation Group have been doing studies on “Link Rot and Legal Resources on the Web.” The newest report, for 2013, says that 51% of .gov URLs selected in 2007-2008 are broken. For a larger sample of documents selected in 2007-2013 (including all domains, not just .gov), “link rot has increased to 44.2 percent within six years.” This is a 6.5 percent increase over 2012.

The Chesapeake group gathers information from the web and preserves it for their users and each year they investigate “whether or not the documents in the archive can still be found at the original web addresses from which they were captured.”

The study uses two samples: one sample of 579 original URLs for content captured in 2007‐2008, and a second sample of the full content of the archive at the time the study is conducted. In 2013, the full sample included 842 original URLs for materials captured in 2007‐2013. The study is particularly relevant to government information specialists because more than 90% of the URLs in the original sample and almost 85% of the URLs in the full sample are from state government (state.[state code].us), organization (.org), and government (.gov) top-level domains.

Among the new report’s findings:

This year saw a substantial increase in the number of government URLs (.gov) that no longer worked.

In 2013, the content at .gov domains showed the highest increase in link rot. More than 50 percent of the materials posted to government domains disappeared from the original documented web addresses.

Overall, the results of the six years of systemically checking links have demonstrated that documents posted on web sites will disappear at an increasing rate over time.

For “dot-gov” domains (URLs ending in “.gov”) the studies have shown cumulative link rot of:

2008:    10% 
2009:    13%
2010:    25% 
2011:    31% 
2012:    36%
2013:    51%
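Read as percentage points, those cumulative figures imply ongoing year-over-year losses, with the largest single-year jump coming in 2013; a trivial calculation makes that visible:

```python
# Cumulative .gov link rot from the Chesapeake studies, by year (%).
cumulative = {2008: 10, 2009: 13, 2010: 25, 2011: 31, 2012: 36, 2013: 51}

years = sorted(cumulative)
# Percentage points of additional rot accrued in each year.
increase = {y: cumulative[y] - cumulative[p] for p, y in zip(years, years[1:])}
```

The increments bounce around from year to year, but none of them is negative: links that rot do not come back on their own.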

The Chesapeake Digital Preservation Group is able to create these reports because it has been actively preserving information from the web for its users for several years. The report is a useful by-product of a preservation effort rooted in providing its user community long-term access to information they need. Nor is this an academic exercise: the Group also collects data on the use of its harvested content. The report summarizes that experience this way:

The value of harvesting these materials before they are no longer available at their original URLs is demonstrated by the high use of these materials. During March 2013, when the 2013 sample set was taken, over 84,000 items were retrieved. In 2012, 1.5 million items were viewed. It is likely that the value of this project and similar ones will become even more significant in future years.

For libraries that rely on pointing to URLs rather than preserving information in their own digital libraries, the new report from the Chesapeake Project provides sobering, factual data on the reliability of that strategy.

