
Tag Archives: link rot

Our mission

Free Government Information (FGI) is a place for initiating dialogue and building consensus among the various players (libraries, government agencies, non-profit organizations, researchers, journalists, etc.) who have a stake in the preservation of and perpetual free access to government information. FGI promotes free government information through collaboration, education, advocacy and research.

Dodging the memory hole

Abbey Potter’s comments about preserving digital news are also very relevant to the preservation of government information.

Potter is the Program Officer with the National Digital Information Infrastructure and Preservation Program (NDIIPP). In her post on The Signal blog, she elaborates on her closing keynote address at the Dodging the Memory Hole II: An Action Assembly meeting in Charlotte, NC, last month.


She quotes a presentation by Andy Jackson of the UK Web Archive in which he addresses the questions: “How much of the content of the UK Web Archive collection is still on the live web?” and “How bad is reference rot in the UK domain?”

By sampling URLs collected in the UK Web Archive, Jackson examined URLs that have moved, changed, or gone missing. He analyzed both link rot (a file gone missing) and content drift (a file that has changed since being archived). He shows that 50 percent of content had gone, moved, or changed so as to be unrecognizable in only one year. After three years the figure rose to 65 percent.
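A minimal sketch of how such a sampling check might classify each URL, assuming a SHA-256 digest was stored when the page was first archived (the categories are modeled on the study's definitions, not its actual code):

```python
import hashlib

def classify_reference(status_code, live_body, archived_digest):
    """Classify a sampled URL roughly the way a link-rot study might.

    status_code     -- HTTP status from fetching the live URL (None if
                       the fetch failed entirely, e.g. a DNS error)
    live_body       -- bytes of the live response body (None if missing)
    archived_digest -- SHA-256 hex digest recorded at archive time

    Returns "link_rot", "content_drift", or "intact".
    """
    if status_code is None or status_code in (404, 410) or live_body is None:
        return "link_rot"       # the resource is gone or unreachable
    if hashlib.sha256(live_body).hexdigest() != archived_digest:
        return "content_drift"  # still resolves, but the content changed
    return "intact"
```

A real crawl would also follow redirects and detect "soft 404" pages, but the logic above captures the rot-versus-drift distinction the study draws.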

Potter says that it is safe to assume that the results would be similar for newspaper content on the web. It would probably also be similar for U.S. government web sites.

What can we learn from this and what can we do? For newspapers, Potter says, libraries have acquisition and preservation methods that are too closely linked to physical objects and that too often exclude digital objects. This results in libraries having gaps in their collections – “especially the born-digital content.” She summarizes the problem:

Libraries haven’t broadly adopted collecting practices so that they are relevant to the current publishing environment which today is dominated by the web.

This sounds exactly like what is happening with government information.

First, because GPO has explicitly limited actual deposit of government information to so-called “tangible” products (Superintendent Of Documents Policy Statement 301 [SOD 301]). This policy does exactly what Potter says is wrong: it establishes collecting practices that are not relevant to the current publishing environment. (See more on the effects of SOD 301 here.)

Second, because most of the conversation within the FDLP in the last few years has been about our historic paper collections rather than about the real digital preservation issue we should be facing: born-digital government information. (See Born-Digital U.S. Federal Government Information: Preservation and Access.)

As Potter says, “We have clear data that if content is not captured from the web soon after its creation, it is at risk.” And, “The absence of an acquisition stream for this [born-digital] content puts it at risk of being lost to future library and archives users.”

Potter outlines a plan of action for digital newspaper information that is surprisingly relevant for government information. She suggests that libraries should establish relationships (and eventually agreements) with the organizations that create, distribute, and own news content. That sounds exactly like what FDLP libraries have done with paper for 200+ years, and what they could, and should, be doing with digital government information today. There is no legal or regulatory barrier to GPO depositing FDLP digital files with FDLP libraries; indeed, GPO is already doing this de facto by explicitly allowing “USDocs” private LOCKSS network partners to download FDsys content.

Potter also recommends web archiving as another promising strategy. Since many agencies are reluctant to deposit digital content with FDsys, and since they are allowed by law to refrain from doing so, web archiving is a practical, if imperfect, alternative. Indeed, GPO runs its own web harvesting program. Although some libraries also do web harvesting that includes U.S. Federal government web sites, more needs to be done in this area. (See: Webinar on fugitive documents: notes and links.)

I find it ironic that libraries are not at least experimenting with preserving born-digital government information. It is difficult to find an article about digital library projects that does not cite scarce funds or high copyright barriers as obstacles. So why not use born-digital government information as a test bed for preserving digital content? The FDLP agreements and commitments are already in place, most of the content is in the public domain, and communities of interest for the content already exist. FDLP libraries could start today by building digital collections and test-bed technology for government information, then expand to other, more difficult collections on a base of experience and success. That this would help our designated communities, preserve essential information, and further the goals of the FDLP would be welcome side effects.

LLIS.gov going away and moving to HSDL.org and FEMA.gov

Are you pointing to documents at LLIS.gov? Those links appear to be broken. LLIS.gov (which points to llis.dhs.gov) has had a generic “Site under maintenance” page since at least December 2014.

According to the Federal Emergency Management Agency, the Lessons Learned Information Sharing (LLIS) program’s LLIS.gov website will cease independent operations and consolidate its content with the Naval Postgraduate School’s Homeland Security Digital Library (HSDL.org) and FEMA.gov. Documents are being posted at their new sites and will continue to be posted over the next few months, according to a FEMA LLIS.gov official, who added, “The user requirements for membership to HSDL.org are very similar to LLIS, but users will have to register for access to restricted content, as users had to do with LLIS.gov. HSDL does have a large amount of public data as well, which users will be able to access without registering.”

  • LLIS.gov Consolidation Information
    One of the advantages of this move is that LLIS.gov content, such as lessons learned, innovative practices, after-action reports, plans, templates, guides, and other materials, will be consolidated with an already substantial database on HSDL.org. This will allow the homeland security and emergency management communities to find relevant information in one place. FEMA’s LLIS program will continue to produce trend analyses, case studies on the use of FEMA preparedness grants, webinars, and other documents relevant to emergency managers. These products will be available either on this site or HSDL.org.
  • Lessons Learned Information Sharing Consolidation Question and Answer (Q&A) [PDF]. Q&A document for the 2015 consolidation effort.
  • Documents moved to FEMA: https://www.fema.gov/lessons-learned-information-sharing-program
  • Documents moved to Homeland Security Digital Library: For publicly available documents, visit HSDL.org and use the search bar. For restricted content, log in before searching.

Another study of link rot and content drift

A new paper on Link Rot and Content Drift gives new details on the extent of the problem.

Klein and his co-authors examined over a million references in close to 400,000 academic articles published between 1997 and 2012 and found that 1 out of 5 of those articles contained references that were no longer good. Many of the articles they examined did not cite anything on the web (particularly articles published in the late 1990s, when much less information had URLs). When they examined only those articles that contain references to web resources, they found that 7 out of 10 contained rotten references. The rate of link failure is extremely high (34 to 80%) for older (1997) publications, and still very high (13 to 22%) for recently published (2012) articles.

Over the time period covered, more articles cite more items on the web, and the authors found, as you might guess, that the percentage of articles with rotten cites increases over time (from less than 1% in 1997 to as high as 21% in 2012).

They also examine “content drift,” which the authors define this way: “The resource identified by a URI may change over time and hence, the content at the end of the URI may evolve, even to such an extent that it ceases to be representative of the content that was originally referenced.” If a link in a paper leads to a “404 Not Found” error message, at least you know that the link failed. But if the link resolves to something, you cannot always know whether the information you are seeing is the same information that was cited or whether it has been altered or replaced.
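One partial remedy is to record a content digest at the moment of citation; a later reader can then at least detect drift, even if the original content cannot be recovered. A simplified illustration (not a description of the paper's method):

```python
import hashlib

def citation_fingerprint(content: bytes) -> str:
    """Digest to store alongside a web citation at the time it is made."""
    return hashlib.sha256(content).hexdigest()

def has_drifted(current_content: bytes, recorded_digest: str) -> bool:
    """True if the page no longer matches what the author originally cited."""
    return citation_fingerprint(current_content) != recorded_digest
```

A digest detects any byte-level change, so in practice one would normalize away volatile page elements (ads, timestamps) before hashing to avoid flagging trivial drift.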

David Rosenthal, the technical designer of LOCKSS at Stanford, has thoughtful and helpful comments on the article on his blog.

He says that the problems of link rot and content drift are even bigger than the authors of the paper describe. One example David gives is that the doi.org domain (which is used for Digital Object Identifiers) was allowed to expire on January 20th, briefly breaking DOI links all over the web. (GPO had a similar, though much longer, outage when its PURL server crashed back in 2009.)

All of this is relevant to government information. Although the study focuses on academic publishing, the authors found that the rate of link rot in the scholarly literature is very similar to link rot patterns found in other studies of the web in general. Klein’s paper does trace citations to .gov domains and records similar link rot in those references. David noticed that one of the links in the Klein paper itself was broken(!), and it was a link to PubMed (at ncbi.nlm.nih.gov).

But one thing David mentions has, I think, particular importance for government information librarians who worry that the problems of preserving government information are beyond their resources: we should not rely on only one solution or one institution to address digital preservation of government information. David says that the complexity of the problems to be solved (human, technical, economic, copyright, institutional, etc.) means that “there cannot be a single comprehensive technical solution.” That is not pessimism; it is realism. And it is not an excuse to give up but a reason to act. We all must participate in preservation. As David says, the best we can do is to combine a diversity of partial solutions.

Imagine how much better a job 100, or 500, or 1000 FDLP libraries could do than GPO can do on its own.

Temporal Context in Digital Preservation

Temporal context is an important and much overlooked aspect of preservation. Users of preserved information need a way of using that information in its original context. Many born-digital documents are not isolated and complete in themselves, but are part of a network of documents. Providing temporal context means preserving the context of a document at the time it was created.

When we preserve a document (however we might define “document”), the links in that document to other documents should not only work, but should also link to the same content that the author linked to at the time the document was created — not to a later version that the author never saw that may have replaced the document the author linked to. (We might also want to link to the later or current versions, but we need to know when we want to do that and be able to give the user the information she needs to know what she is getting and the choice of what to get when there are choices.)

This relates both to versioning (identifying different versions and editions and modifications of documents) and also to link-rot (keeping links working and working properly).
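One existing mechanism that encodes temporal context directly is the Internet Archive's Wayback Machine, whose URLs embed a timestamp: the service redirects such a URL to the capture closest to that moment. A small sketch of building one (this assumes a capture near the cited date actually exists):

```python
from datetime import datetime

def wayback_url(original_url: str, cited_at: datetime) -> str:
    """Build a Wayback Machine URL requesting the snapshot nearest a date.

    The timestamp format is YYYYMMDDhhmmss; the Wayback Machine resolves
    it to the closest capture it holds of the original URL.
    """
    stamp = cited_at.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{stamp}/{original_url}"
```

Citing a timestamped archive URL rather than the live URL gives future readers the version the author actually saw, which is exactly the temporal context described above.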

Here are two readings and a podcast that address these issues.

Research data lost to the sands of time

Here’s an interesting article, not on link rot (a topic FGI has been tracking for some time), but on *data rot*. In a recent article in Current Biology, researchers examined the availability of data from 516 studies between 2 and 22 years old. They found the following:

  • The odds of a data set being reported as extant fell by 17% per year;
  • broken e-mail addresses and obsolete storage devices were the main obstacles to data sharing;
  • policies mandating data archiving at publication are clearly needed.
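Note that a 17% annual decline in the odds compounds quickly. A small sketch of the arithmetic, converting odds back to a probability (the initial odds below are a hypothetical input, not a figure from the paper):

```python
def prob_extant(initial_odds: float, years: float, annual_decline: float = 0.17) -> float:
    """Probability a data set is still reported extant after `years`,
    given that the *odds* fall by a fixed fraction each year (the study
    estimates roughly 17%). Odds o convert to probability o / (1 + o).
    """
    odds = initial_odds * (1.0 - annual_decline) ** years
    return odds / (1.0 + odds)
```

For example, data that starts at 4-to-1 odds of being available (an 80% chance) drops well below an even chance within a decade or so under this model.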

Librarians have known of this issue for years — the Inter-university Consortium for Political and Social Research (ICPSR) was set up in 1962 to tackle it — but the article puts the issue in sharp focus. And the federal government — via efforts like the NSF’s data management plan requirement and OSTP’s new directive to improve the management of and access to scientific collections — is finally beginning to address data rot. Many libraries — not to mention scientists and researchers — are beginning to grapple with data preservation as well. The issue is obviously too big for government information librarians to handle alone. But this is fertile space in which government information librarians, data librarians, research communities, and federal agencies can come together. The federal policy stating the importance of data preservation is there; it will just take effort by multiple stakeholders to make sure preservation actually happens. It is encouraging that the writers of Dragonfly, the blog of the National Network of Libraries of Medicine Pacific Northwest Region (where I came across the article), point out that academic institutions can and should play a leading role in data preservation. I wholeheartedly agree!

Vines, Timothy H., Arianne YK Albert, Rose L. Andrew, Florence Débarre, Dan G. Bock, Michelle T. Franklin, Kimberly J. Gilbert, Jean-Sébastien Moore, Sébastien Renaut, and Diana J. Rennison. “The availability of research data declines rapidly with article age.” Current Biology 24, no. 1 (2014): 94-97.

The researchers found that for every year that had passed since a paper’s publication date, the odds of finding an email address that led to contact with a study author decreased by 7%, and the odds of turning up the data decreased by 17%. The authors report that while some of the data sets were truly lost, others fell more into the category of “unavailable,” since they existed, but solely on inaccessible media (think Jaz disk). These findings will not come as a shock to those who have worked in a research lab. This publication puts some tangible numbers behind the underlying message of NYU Health Sciences Library’s excellent dramatic portrayal of an instance of inaccessible data. The authors conclude by suggesting that a solution moving forward can be found in more journals requiring the deposit of data into a public archive upon publication. I would also suggest that academic institutions can take a role by establishing policies supporting research data preservation alongside providing a data repository.