Tag Archives: link rot

Temporal Context in Digital Preservation

Temporal context is an important and much overlooked aspect of preservation. Users of preserved information need a way of using that information in its original context. Many born-digital documents are not isolated and complete in themselves, but are part of a network of documents. Providing temporal context means preserving the context of a document at the time it was created.

When we preserve a document (however we might define “document”), the links in that document to other documents should not only work, but should also lead to the same content that the author linked to at the time the document was created, not to a later version that the author never saw and that may have replaced the original. (We might also want to link to later or current versions, but then we need to know when to do so, and we need to tell the user what she is getting and, where there are choices, let her choose.)

This relates both to versioning (identifying different versions, editions, and modifications of documents) and to link rot (keeping links working, and working properly).
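One way to picture a link that carries its temporal context is a link that names both the target and the date the author cited it. The sketch below is only an illustration: it assumes the Internet Archive's Wayback Machine address pattern (`web.archive.org/web/<timestamp>/<url>`), assumes a capture near that date actually exists, and uses a made-up citation date. The function name `dated_snapshot_url` is ours, not part of any library.

```python
from datetime import datetime, timezone

# Assumption: the Wayback Machine address pattern web.archive.org/web/<timestamp>/<url>.
WAYBACK_PREFIX = "https://web.archive.org/web"


def dated_snapshot_url(target_url: str, cited_at: datetime) -> str:
    """Build a Wayback Machine address that asks for target_url as of cited_at."""
    # Wayback timestamps are UTC, written as YYYYMMDDhhmmss.
    stamp = cited_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{target_url}"


# Hypothetical citation date, chosen only for illustration.
cited = datetime(2010, 3, 15, 12, 0, 0, tzinfo=timezone.utc)
print(dated_snapshot_url("http://www.whitehouse.gov/issues/technology/", cited))
```

Keeping the original URL and the citation date side by side also leaves room to offer the user the current version as a separate, clearly labeled choice.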

Here are two readings and a podcast that address these issues.

Research data lost to the sands of time

Here’s an interesting article, not on link rot (a topic FGI has been tracking for some time), but on *data rot*. In a recent article in Current Biology, researchers examined the availability of data from 516 studies between 2 and 22 years old. They found the following:

  • The odds of a data set being reported as extant fell by 17% per year.
  • Broken e-mails and obsolete storage devices were the main obstacles to data sharing.
  • Policies mandating data archiving at publication are clearly needed.

Librarians have known of this issue for years: the Inter-university Consortium for Political and Social Research (ICPSR) was set up in 1962 to tackle it. Still, the article does put the issue in focus. And finally the federal government, via efforts like the NSF’s data management plan and OSTP’s new directive to improve the management of and access to scientific collections, is beginning to get behind the effort to combat data rot. Many libraries, not to mention scientists and researchers, are beginning to struggle with the issue of data preservation. The issue is obviously too big for government information librarians to handle alone. But this is fertile space in which government information librarians, data librarians, research communities, and federal agencies can come together. The federal policy stating the importance of data preservation is there; it will just take effort by multiple stakeholders to make sure it actually happens. It is a positive sign that the writers of Dragonfly, the blog of the National Network of Libraries of Medicine Pacific Northwest Region (where I came across the article), point out that academic institutions can and should play a leading role in data preservation. I wholeheartedly agree!

Vines, Timothy H., Arianne YK Albert, Rose L. Andrew, Florence Débarre, Dan G. Bock, Michelle T. Franklin, Kimberly J. Gilbert, Jean-Sébastien Moore, Sébastien Renaut, and Diana J. Rennison. “The availability of research data declines rapidly with article age.” Current Biology 24, no. 1 (2014): 94-97.

http://dx.doi.org/10.1016/j.cub.2013.11.014

The researchers found that for every year that had passed since the paper’s publication date, the odds of finding an email address that led to contact with a study author decreased by 7% and the odds of turning up the data fell by 17%. The authors report that while some of the data sets were truly lost, others fell more into the category of “unavailable,” since they existed, but solely on inaccessible media (think Jaz disk). These findings will not come as a shock to those who have worked in a research lab. This publication does put some tangible numbers behind the underlying message of NYU Health Sciences Library’s excellent dramatic portrayal of an instance of inaccessible data. The authors conclude by suggesting that a solution to this problem moving forward can be found in more journals requiring the deposit of data into a public archive upon publication. I would also suggest that academic institutions can take a role by establishing policies supporting research data preservation alongside providing a data repository.
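To get a feel for what a 17% or 7% yearly decline in odds adds up to, here is a rough sketch of the arithmetic. It assumes the decline compounds at a constant rate every year, which is our simplification and not necessarily how the paper's model should be read. It also works in odds, not probabilities: turning these ratios into a chance of finding the data would need a baseline figure that the summary above does not give.

```python
# Yearly multipliers implied by the reported declines in odds.
DATA_ODDS_FACTOR = 1 - 0.17   # odds that the data set is still extant
EMAIL_ODDS_FACTOR = 1 - 0.07  # odds of a working author email address


def remaining_odds_ratio(annual_factor: float, years: int) -> float:
    """Odds after `years`, as a fraction of the odds at publication.

    Assumes the same multiplier applies every year (constant compounding).
    """
    return annual_factor ** years


# The study covered papers between 2 and 22 years old.
for years in (2, 10, 22):
    data = remaining_odds_ratio(DATA_ODDS_FACTOR, years)
    email = remaining_odds_ratio(EMAIL_ODDS_FACTOR, years)
    print(f"{years:>2} years: data odds x{data:.3f}, email odds x{email:.3f}")
```

Under that assumption, the odds of the data being extant ten years after publication are roughly 0.16 times what they were at publication.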

Government Link Rot

Over the holidays, we switched FGI to new CMS software and a new theme and, in the process, installed some new back-end tools allowing us to do things like easily check for broken links. FGI went online in November 2004, so we have a little more than 9 years of outgoing links. Of those, 2676 link to .gov web sites and we discovered that 540 of those links are broken. That is about 20%.
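For readers curious what such a link check involves, here is a minimal sketch. It is not the tool we installed; it is a stand-in written with Python's standard library, and the names `http_status`, `broken_links`, and `broken_rate` are ours. It treats any response outside the 200-399 range, or no response at all, as broken.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a HEAD request, or 0 if nothing answers."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error such as 404 or 410.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # Abandoned domain, refused connection, timeout, or malformed URL.
        return 0


def broken_links(urls: Iterable[str],
                 fetch_status: Callable[[str], int] = http_status) -> List[str]:
    """Return the URLs whose status is not in the 200-399 range."""
    return [url for url in urls if not 200 <= fetch_status(url) < 400]


def broken_rate(broken: int, total: int) -> float:
    """Share of links that are broken."""
    return broken / total


# The figures from this post: 540 broken out of 2676 .gov links.
print(f"{broken_rate(540, 2676):.1%}")  # 20.2%
```

A check like this undercounts link rot. A link that redirects to a generic home page, or to a page with different content, still returns a success status even though the cited document is gone.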

That is actually lower than the 51% that the recent Chesapeake report found in its newest link rot study, but it is still disconcertingly high. For those libraries that rely on pointing to URLs in their OPACs as a means of linking users to information, these kinds of numbers can lead to one of two conclusions: either a) you had better do link checking and link repair frequently, or b) your “collection” is slowly disappearing. Adding to your workload is no fun, and angering your users with bad links probably does not encourage them to increase your funding for better services. As the Chesapeake reports concluded: “documents posted on web sites will disappear at an increasing rate over time.”

As I browsed through the broken links on FGI, I found a variety of reasons for link breakage.

  • abandoned domains. There is no “2010.census.gov” or any “amlife.america.gov” any more.
  • cache problems. GPO uses Akamai technology to “cache” frequently requested documents on Akamai servers throughout the world so that requests for those documents can be completed more quickly. In two cases we carelessly copied the “akamaitech” cache URL instead of the actual GPO URL. I checked and the documents still exist at their GPO address. But I do wonder how often users (and even libraries?) make this mistake of copying a very temporary cache URL.
  • redesigned sites change URLs. The House Appropriations Committee Subcommittee on Legislative Branch URL apparently changed from appropriations.house.gov/Subcommittees/sub_leg.shtml to appropriations.house.gov/Subcommittees/Subcommittee/?IssueID=34776 and the similar Senate subcommittee changed from appropriations.senate.gov/legislative.cfm to appropriations.senate.gov/sc-legislative.cfm
  • minor changes. Why would the BLS change its Data Finder search page from /query to /find at http://beta.bls.gov/dataQuery/ ? At least the data finder is still there!
  • e-government interest changes. What was once pandemicflu.gov is now flu.gov and blog.pandemicflu.gov is gone. HHS still has information about “Pandemic Awareness” but has evidently changed its focus to flu in general.
  • re-branding. The “govgab” blog, once at blog.usa.gov/roller/govgab/ and later at govgab.gov, is either gone or maybe just replaced by blog.usa.gov/
  • suspended blogs. A blog for “examining rumors, conspiracy theories and false stories” at blogs.america.gov/rumors/ has been “archived or suspended” — but we don’t know which or where any “archive” might be.
  • temporary sites are … temporary. The site change.gov simply says “the transition has ended” and invites you to go to whitehouse.gov where, apparently, “agendas” have changed to “issues.” http://change.gov/agenda/technology http://www.whitehouse.gov/issues/technology/
  • CMS changes? Why would HRSA change a nice, lean URL like datawarehouse.hrsa.gov/NSSRN.htm for the National Sample Survey of Registered Nurses Web site to datawarehouse.hrsa.gov/data/datadownload/nssrndownload.aspx ? My guess is that they switched to a new content management system, which dictates how URLs are constructed.
  • scrubbing? When a report is controversial, is it just easier to remove it than to keep it online? The link to the Wegman Report at the House Energy and Commerce Committee is broken.
  • FDLP “out of date” information? FDLP is not immune to link rot. Where are the questions for a 2009 DLC discussion?
  • GPO moves stuff too. It is not easy to work in a bureaucracy that itself changes and, in doing so, changes how it does things. Remember “The Federal Bulletin Board”? Probably not. Back in the 1990s it had 4,500 individual Federal agency files, in a variety of formats. GPO operated the bulletin board, which could “be accessed 24 hours a day, 7 days a week, by direct dialing 202/512-1387 from a modem using any communications software.” (Just type “/GO FAC.”!) As time moved on, GPO moved the files to permanent.fdlp.gov/fbb/, but they are not all there, at least not under the same links like this one: http://fedbbs.access.gpo.gov/library/compare/compr5.pdf which is now, apparently, here: http://beta.fdlp.gov/file-repository/about-the-fdlp/gpo-projects/legislative-comparison-report/1189-legislative-comparison-report-2008-revised/file
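Some of these failures can be caught before a link is ever saved. As one small illustration, the cache mistake described in the list above could be flagged by looking at the host name. This is only a sketch: the marker list holds the single “akamaitech” string mentioned above, the function name `looks_like_cache_url` is ours, and the cache URL in the example is invented.

```python
from urllib.parse import urlsplit

# Host-name fragments that suggest a temporary cache rather than the real home
# of a document. Only "akamaitech" comes from our own mistakes; extend as needed.
CACHE_HOST_MARKERS = ("akamaitech",)


def looks_like_cache_url(url: str) -> bool:
    """Return True if the URL's host name contains a known cache marker."""
    host = (urlsplit(url).hostname or "").lower()
    return any(marker in host for marker in CACHE_HOST_MARKERS)


# Hypothetical cache URL, for illustration only.
print(looks_like_cache_url("http://cache.akamaitech.example/report.pdf"))
print(looks_like_cache_url("http://permanent.fdlp.gov/fbb/"))
```

The first call prints True and the second False. A check like this only warns the person saving the link; finding the document's actual address is still manual work.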

I’ll stop there. I’ve only looked at about one tenth of the broken links, but the above should give us an idea of the kind of problems we face with pointing instead of collecting. Of course, some of the information may still exist somewhere with different URLs, but some may be gone permanently. We should be telling our library managers that “pointing” is not a cheap way to provide good service; it is a laborious task that is not necessarily easier than collecting, and it certainly is not as reliable.

By the way: links to FGI pages are not immune from the kinds of link rot described above. We went to a lot of trouble in our switchover to a new theme to minimize broken links, but we know there are some that we were unable to duplicate. We’re still fixing the ones we can and invite you to let us know if you find any. We’re not saying FGI is better than .gov, we are saying that libraries should not rely on pointing. :-)

Link Rot up to 51% for .gov domains

New Link Rot report from Chesapeake.

For the past six years, the Georgetown Law Library and the Chesapeake Digital Preservation Group have been doing studies on “Link Rot and Legal Resources on the Web.” The newest report, for 2013, says that 51% of .gov URLs selected in 2007-2008 are broken. For a larger sample of documents selected 2007-2013 (and including all domains, not just .gov) “link rot has increased to 44.2 percent within six years.” This is a 6.5 percent increase over 2012.

The Chesapeake group gathers information from the web and preserves it for their users and each year they investigate “whether or not the documents in the archive can still be found at the original web addresses from which they were captured.”

The study uses two samples: one sample of 579 original URLs for content captured from 2007‐2008 and a second sample of the full content of the archive at the time the study is conducted. In 2013, the full sample included 842 original URLs for materials captured from 2007‐2013. The study is particularly relevant to government information specialists because more than 90% of the URLs in the original sample and almost 85% of the URLs in the full sample are from the state government (state.[state code].us), organization (.org), and government (.gov) top-level domains.

Among the new report’s findings:

This year saw a substantial increase in the number of government URLs (.gov) that no longer worked.

In 2013, the content at .gov domains showed the highest increase in link rot. More than 50 percent of the materials posted to government domains disappeared from the original documented web addresses.

Overall, the results of the six years of systemically checking links have demonstrated that documents posted on web sites will disappear at an increasing rate over time.

For “dot-gov” domains (URLs ending in “.gov”) the studies have shown cumulative link rot of:

2008:    10% 
2009:    13%
2010:    25% 
2011:    31% 
2012:    36%
2013:    51%
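The year-to-year change is easier to see once the cumulative figures are differenced. The short sketch below does that with the numbers listed above; the variable and function names are ours.

```python
# Cumulative .gov link rot by report year, in percent, as listed above.
CUMULATIVE_GOV_LINK_ROT = {2008: 10, 2009: 13, 2010: 25, 2011: 31, 2012: 36, 2013: 51}


def yearly_increase(cumulative: dict) -> dict:
    """Percentage-point increase over the previous year's cumulative figure."""
    years = sorted(cumulative)
    return {year: cumulative[year] - cumulative[prev]
            for prev, year in zip(years, years[1:])}


for year, points in yearly_increase(CUMULATIVE_GOV_LINK_ROT).items():
    print(f"{year}: +{points} percentage points")
```

The 2013 jump of 15 percentage points is the largest in the series, ahead of the 12-point jump in 2010.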

The Chesapeake Digital Preservation Group is able to create these reports because it has been actively preserving information from the web for its users for several years. The report is a useful by-product of a preservation effort that is rooted in providing long-term access for its user community to information they need. This is not an academic exercise — the Group also collects data on the use of their harvested content. The report summarizes its conclusion of its experience this way:

The value of harvesting these materials before they are no longer available at their original URLs is demonstrated by the high use of these materials. During March 2013, the time the 2013 sample set was taken, over 84,000 items were retrieved. In 2012, 1.5 million items viewed. It is likely that the value of this project and similar ones will become even more significant in future years.

For libraries that rely on pointing to URLs rather than preserving information in their own digital libraries, the new report from the Chesapeake Project provides sobering, factual data on the reliability of that strategy.

SCOTUS and Law Reviews have a bad case of link rot. perma.cc looks to be the prescription

“If permanence of legal thought is important to legal scholarship then it must be preserved consciously.”
–Howard A. Denemark, “The Death of Law Reviews has Been Predicted: What Might be Lost When the Last Law Review Shuts Down?” 27 SETON HALL L. REV. 1, 12 (1996).

According to a new study by Jonathan Zittrain and Kendra Albert at the Harvard Law School (Zittrain also has affiliations with Harvard’s Kennedy School, Harvard School of Engineering and Applied Sciences, and the Berkman Center for Internet & Society), “49 percent of the hyperlinks in Supreme Court decisions no longer work. And more than 70% of the links in such journals as the Harvard Law Review (in that case measured from 1999 to 2012), currently don’t work. As time passes, the number of non-working links increases.” The study builds on other great link rot studies, such as those done annually since 2010 by the Chesapeake Digital Preservation Group and the more recent one by Raizel Liebler and June Liebert in the Yale Journal of Law and Technology.

We’ve been tracking the issue of link rot for government information and the .gov domain for quite some time. There seems at this time to be a critical mass to actually DO something about it. The Harvard Library Innovation Lab, along with 30 law libraries from around the country, the Internet Archive and Instapaper, have come together to create the Perma.cc service, currently in beta, that allows users to create citation links that will never break.

I LOVE this idea. I’ve long been a fan/user of the Zotero Commons, which allows users to save snapshots of their Zotero citations in the Internet Archive (though I can’t tell if it’s still being actively maintained/developed). I can’t wait to see perma.cc in action!

The Harvard Library Innovation Lab has pioneered a project to unite libraries so that link rot can be mitigated. We are joined by about thirty law libraries around the world to start Perma.cc, which will allow those libraries on direction of authors and journal editors to store permanent caches of otherwise ephemeral links. Libraries are the ideal partners for this task: they think on a long timescale; they take user trust and service seriously; and they are non-commercial. You can see more about the system at perma.cc. The amazing Internet Archive has lent its archiving engine to the effort, and Instapaper has generously provided an alternative path to parse Web pages to be saved. CloudFlare has kindly ensured that the system at Perma.cc can scale with use.