Tag Archives: link rot

LLIS.gov going away and moving to HSDL.org and FEMA.gov

Are you pointing to documents at LLIS.gov? Those links appear to be broken. LLIS.gov (which points to llis.dhs.gov) has had a generic “Site under maintenance” page since at least December 2014.

According to the Federal Emergency Management Agency, the Lessons Learned Information Sharing (LLIS) program’s LLIS.gov website will cease independent operations and consolidate its content with the Naval Postgraduate School’s Homeland Security Digital Library (HSDL.org) and FEMA.gov. Documents are being posted at their new sites and will continue to be posted over the next few months, according to a FEMA LLIS.gov official, who added, “The user requirements for membership to HSDL.org are very similar to LLIS, but users will have to register for access to restricted content, as users had to do with LLIS.gov. HSDL does have a large amount of public data as well, which users will be able to access without registering.”

  • LLIS.gov Consolidation Information
    One of the advantages of this move is that LLIS.gov content, such as lessons learned, innovative practices, after-action reports, plans, templates, guides, and other materials, will be consolidated with an already substantial database on HSDL.org. This will allow the homeland security and emergency management communities to find relevant information in one place. FEMA’s LLIS program will continue to produce trend analyses, case studies on the use of FEMA preparedness grants, webinars, and other documents relevant to emergency managers. These products will be available either on this site or HSDL.org.
  • Lessons Learned Information Sharing Consolidation Question and Answer (Q&A) [PDF]. Q&A document for the 2015 consolidation effort.
  • Documents moved to FEMA: https://www.fema.gov/lessons-learned-information-sharing-program
  • Documents moved to Homeland Security Digital Library: For publicly available documents, visit HSDL.org and use the search bar. For restricted content, log in before searching.

Another study of link rot and content drift

A new paper on Link Rot and Content Drift gives new details on the extent of the problem.

Klein and his co-authors examined over a million references in close to 400,000 academic articles published between 1997 and 2012 and found that 1 out of 5 of those articles contained references that were no longer good. A lot of the articles they examined did not cite anything on the web (particularly articles published in the late 1990s when much less information had URLs). When they examined only those articles that contain references to web resources, they found that 7 out of 10 articles contained references that were rotten. The rate of failure of links is extremely high (34 to 80%) for older (1997) publications, but still very high (13 to 22%) for recently published (2012) articles.

Over the time period covered, more articles cite more items on the web, and the authors discovered, as you might guess, that the percentage of articles with rotten cites increases over time (from less than 1% in 1997 to as high as 21% in 2012).

They also examine “content drift.” The authors define content drift this way: “The resource identified by a URI may change over time and hence, the content at the end of the URI may evolve, even to such an extent that it ceases to be representative of the content that was originally referenced.” If a link in a paper leads to a “404 Not Found” error message, at least you know that the link failed. But if the link in a paper resolves to something, you cannot always know whether the information you are seeing is the same information that was cited, or whether it has been altered or replaced.
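One rough way to notice content drift, as opposed to outright link rot, is to keep a fingerprint of the page as it was when cited and compare it with what the link returns later. The sketch below is an illustration only, not the method used in the paper: the function names and the 0.9 similarity threshold are assumptions. A changed fingerprint only says that the text differs, not that the page has stopped being representative of what was cited.

```python
import hashlib
from difflib import SequenceMatcher


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the text after collapsing whitespace."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def similarity(cited: str, current: str) -> float:
    """Return a 0..1 similarity ratio between the cited and current text."""
    return SequenceMatcher(None, cited, current).ratio()


def classify(cited: str, current: str, threshold: float = 0.9) -> str:
    """Label a link target as 'unchanged', 'minor change', or 'possible drift'.

    The threshold is an arbitrary assumption chosen for illustration.
    """
    if fingerprint(cited) == fingerprint(current):
        return "unchanged"
    if similarity(cited, current) >= threshold:
        return "minor change"
    return "possible drift"


if __name__ == "__main__":
    # Invented sample text, not taken from any real page.
    cited = "Pandemic flu planning guidance for state and local agencies."
    current = "Seasonal flu information for the general public."
    print(classify(cited, cited))
    print(classify(cited, current))
```

A person still has to judge whether a "possible drift" result matters; the code only narrows down which links deserve a look.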

David Rosenthal, the technical designer of LOCKSS at Stanford, has thoughtful and helpful comments on the article on his blog.

He says that the problems of link rot and content drift are even bigger than the authors of the paper describe. One example that David gives is that the doi.org domain (which is used for Digital Object Identifiers) was allowed to expire on January 20th, briefly breaking DOI links all over the Web. (GPO had a similar, but much longer, problem when its PURL server crashed back in 2009.)

All of this is relevant to government information. Although the study focuses on academic publishing, the authors found that the rate of link rot in the scholarly literature is very similar to link rot patterns found in other studies of the web in general. Klein’s paper does trace citations to .gov domains and records similar link rot for those references. David noticed that one of the links in the Klein paper itself was broken(!) and it was a link to PubMed (at ncbi.nlm.nih.gov).

But one thing that David mentions has, I think, particular importance for government information librarians who worry that the problems of preserving government information are beyond their resources. The point is that we should not rely on only one solution or one institution to adequately address digital preservation of government information. David says that the complexity of the problems that need to be solved (including human, technical, economic, copyright, institutional, etc.) means that “there cannot be a single comprehensive technical solution.” That is not pessimism; it is realism. And it is not an excuse to give up, but a reason to act. We have to realize that we all must participate in preservation. As David says, the best we can do is to combine a diversity of partial solutions.

Imagine how much better a job 100, or 500, or 1000 FDLP libraries could do than GPO can do on its own.

Temporal Context in Digital Preservation

Temporal context is an important and much overlooked aspect of preservation. Users of preserved information need a way of using that information in its original context. Many born-digital documents are not isolated and complete in themselves, but are part of a network of documents. Providing temporal context means preserving the context of a document at the time it was created.

When we preserve a document (however we might define “document”), the links in that document to other documents should not only work, but should also link to the same content that the author linked to at the time the document was created — not to a later version that the author never saw that may have replaced the document the author linked to. (We might also want to link to the later or current versions, but we need to know when we want to do that and be able to give the user the information she needs to know what she is getting and the choice of what to get when there are choices.)

This relates both to versioning (identifying different versions, editions, and modifications of documents) and to link rot (keeping links working and working properly).
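To make the idea of temporal context concrete, here is a minimal sketch of choosing which archived copy a link should resolve to. It assumes an archive that holds several dated captures of the same URL. The `Snapshot` type, the `pick_snapshot` function, and the sample dates are hypothetical and do not describe any particular archive's API. The rule used here, the latest capture made on or before the citing document's creation date, is one reasonable choice among several.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass(frozen=True)
class Snapshot:
    captured: date   # when the archive captured the page
    location: str    # where the archived copy is stored (hypothetical)


def pick_snapshot(snapshots: Sequence[Snapshot], created: date) -> Optional[Snapshot]:
    """Return the latest snapshot captured on or before `created`.

    Returns None when every capture is later than the citing document,
    so the caller can tell the reader no contemporaneous copy exists.
    """
    ordered = sorted(snapshots, key=lambda s: s.captured)
    dates = [s.captured for s in ordered]
    index = bisect_right(dates, created)
    return ordered[index - 1] if index > 0 else None


if __name__ == "__main__":
    # Invented captures for illustration.
    captures = [
        Snapshot(date(2009, 3, 1), "copy-2009"),
        Snapshot(date(2011, 6, 15), "copy-2011"),
        Snapshot(date(2014, 1, 10), "copy-2014"),
    ]
    chosen = pick_snapshot(captures, created=date(2012, 5, 1))
    print(chosen.location if chosen else "no contemporaneous copy")
```

Returning None instead of silently falling back to a later copy keeps the choice visible, which matches the goal of telling the user what she is getting.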

Here are two readings and a podcast that address these issues.

Research data lost to the sands of time

Here’s an interesting article, not on link rot (a topic FGI has been tracking for some time), but on *data rot*. In a recent article in Current Biology, researchers examined the availability of data from 516 studies between 2 and 22 years old. They found the following:

  • The odds of a data set being reported as extant fell by 17% per year.
  • Broken e-mails and obsolete storage devices were the main obstacles to data sharing.
  • Policies mandating data archiving at publication are clearly needed.

Librarians have known of this issue for years (the Inter-university Consortium for Political and Social Research, or ICPSR, was set up in 1962 to tackle it), but the article does put the issue in focus. And finally the federal government, via efforts like the NSF’s data management plan and OSTP’s new directive to improve the management of and access to scientific collections, is beginning to get behind the effort to address data rot. Many libraries, not to mention scientists and researchers, are beginning to struggle with the issue of data preservation. The issue is obviously too big for government information librarians to handle alone. But this is fertile space in which government information librarians, data librarians, research communities, and federal agencies can come together. The federal policy stating the importance of data preservation is there; it will just take effort by multiple stakeholders to make sure it actually happens. It’s a positive that the writers of Dragonfly, the blog of the National Network of Libraries of Medicine Pacific Northwest Region, where I came across the article, point out that academic institutions can and should play a leading role in data preservation. I wholeheartedly agree!

Vines, Timothy H., Arianne YK Albert, Rose L. Andrew, Florence Débarre, Dan G. Bock, Michelle T. Franklin, Kimberly J. Gilbert, Jean-Sébastien Moore, Sébastien Renaut, and Diana J. Rennison. “The availability of research data declines rapidly with article age.” Current Biology 24, no. 1 (2014): 94-97.
http://dx.doi.org/10.1016/j.cub.2013.11.014

The researchers found that for every year that had passed since the paper’s publication date, the odds of finding an email address that led to contact with a study author decreased by 7% and that the odds of turning up the data reduced by 17% per year.  The authors report that while some of the data sets were truly lost others fell more into the category of “unavailable,” since they existed, but solely on inaccessible media (think Jaz disk).  These findings will not come as a shock to those who have worked in a research lab.  This publication does put some tangible numbers behind the underlying message of NYU Health Sciences Library’s excellent dramatic portrayal of an instance of inaccessible data.  The authors conclude by suggesting that a solution to this problem moving forward can be found in more journals requiring the deposit of data into a public archive upon publication.  I would also suggest that academic institutions can take a role by establishing policies supporting research data preservation alongside providing a data repository.
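Because the study reports changes in odds rather than in probability, the 17% figure is easy to misread. The arithmetic can be sketched as follows: a 17% annual fall means the odds are multiplied by 0.83 each year, and odds convert to a probability as odds / (1 + odds). The starting odds used below are a made-up value for illustration, not a number from the paper, and the function names are illustrative.

```python
def odds_after(initial_odds: float, annual_decline: float, years: int) -> float:
    """Odds after `years` of a constant proportional decline per year."""
    return initial_odds * (1.0 - annual_decline) ** years


def odds_to_probability(odds: float) -> float:
    """Convert odds (p / (1 - p)) back to a probability p."""
    return odds / (1.0 + odds)


if __name__ == "__main__":
    START_ODDS = 3.0   # hypothetical: 3-to-1 odds (75%) that data is extant at year 0
    DECLINE = 0.17     # the per-year fall in odds reported in the article
    for years in (0, 5, 10, 20):
        odds = odds_after(START_ODDS, DECLINE, years)
        prob = odds_to_probability(odds)
        print(f"{years:2d} years: odds {odds:.2f}, probability {prob:.0%}")
```

The same two functions work for the 7% annual fall in the odds of reaching an author by e-mail; only the `annual_decline` argument changes.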

Government Link Rot

Over the holidays, we switched FGI to new CMS software and a new theme and, in the process, installed some new back-end tools allowing us to do things like easily check for broken links. FGI went online in November 2004, so we have a little more than 9 years of outgoing links. Of those, 2676 link to .gov web sites and we discovered that 540 of those links are broken. That is about 20%.
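The count above is simple to reproduce for any list of checked links. The following sketch is not the back-end tool mentioned above, which the post does not name. It assumes each outgoing URL has already been paired with an HTTP status code by some link checker, treats a failed connection (recorded as `None`) or any 4xx or 5xx status as broken, and reports the share of broken .gov links. The function names are assumptions for illustration.

```python
from typing import Iterable, Optional, Tuple
from urllib.parse import urlsplit


def is_gov(url: str) -> bool:
    """True when the URL's host ends in .gov (URLs must include a scheme)."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".gov")


def is_broken(status: Optional[int]) -> bool:
    """Treat no response (None) and any 4xx/5xx status as a broken link."""
    return status is None or status >= 400


def broken_share(results: Iterable[Tuple[str, Optional[int]]]) -> Tuple[int, int, float]:
    """Return (broken, total, percent broken) for the .gov links in `results`."""
    gov = [(url, status) for url, status in results if is_gov(url)]
    broken = sum(1 for _, status in gov if is_broken(status))
    percent = 100.0 * broken / len(gov) if gov else 0.0
    return broken, len(gov), percent


if __name__ == "__main__":
    # The figures from the post: 540 broken out of 2676 .gov links.
    print(f"{100.0 * 540 / 2676:.1f}% broken")
```

A status-code check of this kind finds dead links only. A page that still answers with 200 but now shows different content would pass it.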

That is actually lower than the 51% that the recent Chesapeake report found in its newest link rot study but still disconcertingly high. For those libraries that rely on pointing to URLs in their OPACs as a means of linking users to information, these kinds of numbers can lead to one of two conclusions: Either a) you better do link checking and link-repair frequently, or b) your “collection” is slowly disappearing. Adding to your workload is no fun and angering your users with bad links probably does not encourage them to increase your funding for better services. As the Chesapeake reports concluded: “documents posted on web sites will disappear at an increasing rate over time.”

As I browsed through the broken links on FGI, I found a variety of reasons for link breakage.

  • abandoned domains. There is no “2010.census.gov” or any “amlife.america.gov” any more.
  • cache problems. GPO uses Akamai technology to “cache” frequently requested documents on Akamai servers throughout the world so that requests for those documents can be completed more quickly. In two cases we carelessly copied the “akamaitech” cache URL instead of the actual GPO URL. I checked and the documents still exist at their GPO address. But I do wonder how often users (and even libraries?) make this mistake of copying a very-temporary cache URL.
  • redesigned sites change URLs. The House Appropriations Committee Subcommittee on Legislative Branch URL apparently changed from appropriations.house.gov/Subcommittees/sub_leg.shtml to appropriations.house.gov/Subcommittees/Subcommittee/?IssueID=34776 and the similar Senate subcommittee changed from appropriations.senate.gov/legislative.cfm to appropriations.senate.gov/sc-legislative.cfm
  • minor changes. Why would the BLS change its Data Finder search page from /query to /find at http://beta.bls.gov/dataQuery/ ? At least the data finder is still there!
  • e-government interest changes. What was once pandemicflu.gov is now flu.gov and blog.pandemicflu.gov is gone. HHS still has information about “Pandemic Awareness” but has evidently changed its focus to flu in general.
  • re-branding. The “govgab” blog, once at blog.usa.gov/roller/govgab/ and later at govgab.gov, is either gone or maybe just replaced by blog.usa.gov/
  • suspended blogs. A blog for “examining rumors, conspiracy theories and false stories” at blogs.america.gov/rumors/ has been “archived or suspended” — but we don’t know which or where any “archive” might be.
  • temporary sites are … temporary. The site change.gov simply says “the transition has ended” and invites you to go to whitehouse.gov where, apparently, “agendas” have changed to “issues.” http://change.gov/agenda/technology http://www.whitehouse.gov/issues/technology/
  • CMS changes? Why would HRSA change a nice, lean URL like datawarehouse.hrsa.gov/NSSRN.htm for the National Sample Survey of Registered Nurses Web site to datawarehouse.hrsa.gov/data/datadownload/nssrndownload.aspx ? My guess is that they switched to a new content management system that dictates how URLs are constructed.
  • scrubbing? When a report is controversial, is it just easier to remove it than to keep it online? The link to the Wegman Report at the House Energy and Commerce Committee is broken.
  • FDLP “out of date” information? FDLP is not immune to link rot. Where are the questions for a 2009 DLC discussion?
  • GPO moves stuff too. It is not easy to work in a bureaucracy that itself changes and, in doing so, changes how it does things. Remember “The Federal Bulletin Board”? Probably not. Back in the 1990s it had 4,500 individual Federal agency files, in a variety of formats. GPO operated the bulletin board, which could “be accessed 24 hours a day, 7 days a week, by direct dialing 202/512-1387 from a modem using any communications software.” (Just type “/GO FAC.”!) As time moved on, GPO moved the files to permanent.fdlp.gov/fbb/ (but they are not all there, at least not under the same links). For example, http://fedbbs.access.gpo.gov/library/compare/compr5.pdf is now, apparently, here: http://beta.fdlp.gov/file-repository/about-the-fdlp/gpo-projects/legislative-comparison-report/1189-legislative-comparison-report-2008-revised/file
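Several of the causes listed above leave a pattern that can be checked mechanically. As a rough sketch only, the code below flags copied cache URLs and applies a hand-maintained table of known moves. The table entries are taken from the examples in this post, while the function names and the idea of keeping such a table are assumptions for illustration. It cannot help with content that has been removed rather than moved.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit

# Old location -> new location, copied from the examples in the list above.
KNOWN_MOVES: Dict[str, str] = {
    "appropriations.senate.gov/legislative.cfm":
        "appropriations.senate.gov/sc-legislative.cfm",
    "datawarehouse.hrsa.gov/NSSRN.htm":
        "datawarehouse.hrsa.gov/data/datadownload/nssrndownload.aspx",
    "change.gov/agenda/technology":
        "www.whitehouse.gov/issues/technology/",
}


def _split(url: str):
    """Parse a URL, tolerating links written without 'http://'."""
    return urlsplit(url if "//" in url else "//" + url)


def strip_scheme(url: str) -> str:
    """Return host plus path so keys match however the link was written.

    Query strings are dropped, so moves that differ only by query
    would need a richer key than this sketch uses.
    """
    parts = _split(url)
    return (parts.hostname or "") + parts.path


def looks_like_cache(url: str) -> bool:
    """Flag URLs whose host mentions 'akamaitech', the cache case described above."""
    return "akamaitech" in (_split(url).hostname or "")


def suggest_repair(url: str) -> Optional[str]:
    """Return the known new location for a moved URL, or None if unknown."""
    return KNOWN_MOVES.get(strip_scheme(url))
```

Maintaining such a table is itself labor, which is the point of the list above: every entry had to be tracked down by hand.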

I’ll stop there. I’ve only looked at about one tenth of the broken links, but the above should give us an idea of the kind of problems we face with pointing instead of collecting. Of course, some of the information may still exist somewhere with different URLs, but some may be gone permanently. We should be telling our library managers that “pointing” is not a cheap way to provide good service; it is a laborious task that is not necessarily easier than collecting, and it certainly is not as reliable.

By the way: links to FGI pages are not immune from the kinds of link rot described above. We went to a lot of trouble in our switchover to a new theme to minimize broken links, but we know there are some that we were unable to duplicate. We’re still fixing the ones we can and invite you to let us know if you find any. We’re not saying FGI is better than .gov; we are saying that libraries should not rely on pointing. :-)
