Tag Archives: content drift
Legal Link Rot
The good folks at perma.cc recently conducted a “quick review” of the links in court filings made in the last five years by three of the largest law firms in the U.S.
- Link Rot for Lawyers: A Prodigious Problem, perma.cc (February 1, 2018).
They found that over 80% of the briefs contained at least one broken link and that, on average, each brief contained around six broken links.
When they examined links that were not broken (that is, the link resolved to some page, not to an error message), they found that nearly 30% of those no longer displayed the material referenced in the brief. (This is known as ‘reference rot’ or ‘content drift.’)
Supreme Court Website Addresses Link Rot and Content Drift
The Supreme Court has announced two important changes to its website: the Court will now highlight changes to slip opinions, and it will attempt to preserve web-based content cited in its opinions. These enhancements address two digital preservation problems: content changing over time, known as "content drift," and content being deleted or moved, known as "link rot."
Here is the text of the two announcements, which appeared under the “What’s New” section of the Court’s homepage:
Beginning with the October Term 2015, postrelease edits to slip opinions on the Court's website will be highlighted and the date they occur will be noted. The date of any revision will be listed in a new "Revised" column on the charts of Opinions, In-Chambers Opinions, and Opinions Related to Orders under the "Opinions" tab on the website. The location of a revision will be highlighted in the opinion. When a cursor is placed over a highlighted section, a dialog box will open to show both old and new text. See "Sample Opinions" for an example of how postrelease edits will appear on the website.
The Court’s Office of Information Technology is collaborating with the Library, the Reporter of Decisions’ Office, and the Clerk’s Office to preserve web-based content cited in Court opinions. To address the problem of “link rot,” where internet material cited in Court opinions may change or cease to exist, web-based content included in Court opinions from the 2005 Term forward is being made available on the Court’s website. Hard copies will continue to be retained in the case files by the Clerk’s Office. See “Internet Sources Cited in Opinions.”
An article in the New York Times puts these changes into context:
- Supreme Court Plans to Highlight Revisions in Its Opinions, by Adam Liptak, New York Times (Oct. 5, 2015).
The Supreme Court announced on Monday that it would disclose after-the-fact changes to its opinions, a common practice that had garnered little attention until a law professor at Harvard wrote about it last year.
The court also took steps to address “link rot” in its decisions. A study last year found that nearly half of hyperlinks in Supreme Court opinions no longer work.
Some good news re: “link rot”
Charlotte Stichter says that new reports from the Law Library of Congress's Global Legal Research Directorate will soon have references that include a link to an archived version of the referenced page using perma.cc. The announcement appears on the blog of the Law Librarians of the Library of Congress, but please see also Herbert Van de Sompel's comment on the project. Van de Sompel says that providing an archived copy is a good first step, but more is needed: specifically, robust references that include the original URI, the datetime of linking, and the URL of the archival copy (sketched after the quotation below).
- Cooking Up a Solution to Link Rot by Charlotte Stichter, In Custodia Legis (August 14, 2015).
A plan for implementing perma.cc in the Law Library’s Global Legal Research Directorate is now being cooked up, with a target implementation date of October 1 this year…. This means that hyperlinked footnote references in new reports by the Directorate will also contain a link to an archived version of the referenced web page, allowing readers permanent access to key legal materials.
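Those three elements fit naturally into a single record. Here is a minimal sketch in Python; the field names and the example values (the example.gov URL and the perma.cc shortcode) are illustrative assumptions, not drawn from any particular specification:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class RobustReference:
    """A citation carrying all three elements Van de Sompel calls for."""
    original_uri: str    # the URI as the author cited it
    linked_at: datetime  # the datetime at which the link was made
    archive_uri: str     # the URL of the archival copy (e.g., at perma.cc)

# Hypothetical example; both URIs are placeholders.
ref = RobustReference(
    original_uri="https://example.gov/annual-report-2015",
    linked_at=datetime(2015, 8, 14, tzinfo=timezone.utc),
    archive_uri="https://perma.cc/XXXX-XXXX",
)
```

With all three elements present, a reader whose original link fails can fall back to the archive, and a reader who suspects drift can compare the live page against what existed at the recorded datetime.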
Dodging the memory hole
Abbey Potter’s comments about preserving digital news are also very relevant to the preservation of government information.
Potter is the Program Officer with the National Digital Information Infrastructure and Preservation Program (NDIIPP). In her post on The Signal blog, she elaborates on her closing keynote address at the Dodging the Memory Hole II: An Action Assembly meeting in Charlotte, NC, last month.
- Dodge that Memory Hole: Saving Digital News by Abbey Potter, The Signal: Library of Congress Digital Preservation Blog (June 2, 2015).
- Take Action: What Comes Next? Closing keynote, Dodging the Memory Hole II: An Action Assembly, Charlotte, NC (May 12, 2015).
She quotes a presentation by Andy Jackson of the UK Web Archive in which he addresses the questions: “How much of the content of the UK Web Archive collection is still on the live web?” and “How bad is reference rot in the UK domain?”
- Ten Years of the UK Web Archive: What Have We Saved? [ppt] by Andy Jackson, International Internet Preservation Consortium, General Assembly 2015 (April 27, 2015).
By sampling URLs collected in the UK Web Archive, Jackson examined URLs that had moved, changed, or gone missing. He analyzed both link rot (a file gone missing) and content drift (a file that has changed since being archived). He shows that within just one year, 50 percent of the content had gone, moved, or changed so much as to be unrecognizable; after three years the figure rose to 65 percent.
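Jackson's rot-versus-drift distinction can be made concrete with a simple check on any sampled URL. A minimal sketch, assuming Python with the requests library and an exact-match notion of drift (actual studies use fuzzier similarity measures so that trivial page changes do not count):

```python
import hashlib
import requests

def classify(live_url: str, archived_body: bytes) -> str:
    """Classify a sampled URL as intact, drifted, or rotten."""
    try:
        resp = requests.get(live_url, timeout=30)
    except requests.RequestException:
        return "link rot"       # unreachable: the file has gone missing
    if resp.status_code >= 400:
        return "link rot"       # the link no longer resolves to a page
    if hashlib.sha256(resp.content).hexdigest() != hashlib.sha256(archived_body).hexdigest():
        return "content drift"  # the page answers, but not with what was archived
    return "intact"
```

Exact hashing over-counts drift, since a changed banner or timestamp flips the result; measured drift rates depend heavily on how forgiving the comparison is.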
Potter says that it is safe to assume that the results would be similar for newspaper content on the web. It would probably also be similar for U.S. government web sites.
What can we learn from this and what can we do? For newspapers, Potter says, libraries have acquisition and preservation methods that are too closely linked to physical objects and that too often exclude digital objects. This results in libraries having gaps in their collections – “especially the born-digital content.” She summarizes the problem:
Libraries haven’t broadly adopted collecting practices so that they are relevant to the current publishing environment which today is dominated by the web.
This sounds exactly like what is happening with government information.
First, because GPO has explicitly limited actual deposit of government information to so-called "tangible" products (Superintendent of Documents Policy Statement 301 [SOD 301]). This policy does exactly what Potter says is wrong: it establishes collecting practices that are not relevant to the current publishing environment. (See more on the effects of SOD 301 here.)
Second, because most of the conversation within the FDLP in the last few years has been about our historic paper collections rather than about the real digital preservation issue we should be facing: born-digital government information. (See Born-Digital U.S. Federal Government Information: Preservation and Access.)
As Potter says, “We have clear data that if content is not captured from the web soon after its creation, it is at risk.” And, “The absence of an acquisition stream for this [born-digital] content puts it at risk of being lost to future library and archives users.”
Potter outlines a plan of action for digital newspaper information that is surprisingly relevant for government information. She suggests that libraries should establish relationships (and eventually agreements) with the organizations that create, distribute, and own news content. That is exactly what FDLP libraries have done with paper for 200+ years, and what they should, and could, be doing with digital government information today. There is no legal or regulatory barrier to GPO depositing FDLP digital files with FDLP libraries; indeed, GPO already does this de facto by explicitly allowing "USDocs" private LOCKSS network partners to download FDsys content.
Potter also recommends web archiving as another promising strategy. Since many agencies are reluctant to deposit digital content with FDsys, and since they are allowed by law to refrain from doing so, web archiving is a practical, if imperfect, alternative. Indeed, GPO runs its own web harvesting program. Although some libraries also do web harvesting that includes U.S. Federal government web sites, more needs to be done in this area. (See: Webinar on fugitive documents: notes and links.)
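Small-scale experiments with web archiving are cheap to try. A minimal sketch, assuming Python with requests and the Internet Archive's public "Save Page Now" service (the web.archive.org/save URL pattern is widely used, but its rate limits and authentication rules change over time, so treat this as illustrative; production harvesting would use a dedicated crawler):

```python
import requests

def archive_url(url: str) -> str:
    """Ask the Internet Archive's Save Page Now service to capture a page."""
    resp = requests.get(f"https://web.archive.org/save/{url}", timeout=120)
    resp.raise_for_status()
    # The service redirects to the freshly captured snapshot.
    return resp.url

print(archive_url("https://www.gpo.gov/"))
```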
I find it ironic that libraries are not at least experimenting with preserving born-digital government information. It is difficult to find an article about digital library projects that does not cite scarce funds or high copyright barriers. So, why not use born-digital government information as a test bed for preserving digital content? The FDLP agreements and commitments are already in place, most of the content is public domain, and communities of interest for the content already exist. FDLP libraries could start today by building digital collections and test-bed technology for government information, then expand to other, more difficult collections on a base of experience and success. Helping our designated communities, preserving essential information, and furthering the goals of the FDLP would be welcome side effects.
Another study of link rot and content drift
A new paper on link rot and content drift provides fresh detail on the extent of the problem.
- Klein, Martin, Herbert Van de Sompel, Robert Sanderson, Harihar Shankar, Lyudmila Balakireva, Ke Zhou, and Richard Tobin. “Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot.” PLoS ONE 9, no. 12 (December 26, 2014): e115253. doi:10.1371/journal.pone.0115253.
Klein and his co-authors examined over one million references in close to 400,000 academic articles published between 1997 and 2012 and found that one in five of those articles contained references that no longer worked. Many of the articles they examined did not cite anything on the web at all (particularly articles published in the late 1990s, when much less information had URLs). Among only those articles that do reference web resources, seven out of ten contained rotten references. The rate of link failure is extremely high (34 to 80%) for older (1997) publications, but still very high (13 to 22%) for recently published (2012) articles.
Over the period covered, more articles cite more items on the web, and the authors found, as you might guess, that the percentage of articles with rotten cites increases over time (from less than 1% in 1997 to as high as 21% in 2012).
They also examine "content drift," which they define this way: "The resource identified by a URI may change over time and hence, the content at the end of the URI may evolve, even to such an extent that it ceases to be representative of the content that was originally referenced." If a link in a paper leads to a "404 Not Found" error message, at least you know that the link failed. But if the link resolves to something, you cannot always know whether the information you are seeing is the same information that was cited, or whether it has been altered or replaced.
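Detecting drift therefore means comparing the live page against a copy captured when the citation was made. A minimal sketch in Python, treating drift as a low text-similarity score; the 0.5 threshold and the sample strings are arbitrary illustrations, not values from the Klein paper:

```python
from difflib import SequenceMatcher

def drift_score(cited_text: str, live_text: str) -> float:
    """Dissimilarity between the cited snapshot and the live page (0 = identical, 1 = nothing in common)."""
    return 1.0 - SequenceMatcher(None, cited_text, live_text).ratio()

# A link that still resolves can nonetheless have drifted badly.
cited = "Annual budget report for fiscal year 2012 ..."
live = "This page has moved. Please see our new site for current reports."
if drift_score(cited, live) > 0.5:  # arbitrary threshold for illustration
    print("Likely content drift: the link resolves, but not to the cited material.")
```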
David Rosenthal, the technical designer of LOCKSS at Stanford, has thoughtful and helpful comments on the article on his blog.
- Rosenthal, David. "The Evanescent Web." DSHR's Blog (February 10, 2015).
He says that the problems of link rot and content drift are even bigger than the paper's authors describe. One example David gives: the doi.org domain (which is used for Digital Object Identifiers) was allowed to expire on January 20th, briefly breaking DOI links all over the web. (GPO had a similar, though much longer, outage when its PURL server crashed back in 2009.)
All of this is relevant to government information. Although the study focuses on academic publishing, the authors found that the rate of link rot in the scholarly literature closely matches the patterns found in other studies of the web in general. Klein's paper does trace citations to .gov domains and records similar rates of link rot for those references. David noticed that one of the links in the Klein paper itself was broken(!), and it was a link to PubMed (at ncbi.nlm.nih.gov).
But one thing David mentions has, I think, particular importance for government information librarians who worry that preserving government information is beyond their resources: we should not rely on a single solution or a single institution to preserve government information digitally. David says that the complexity of the problems to be solved (human, technical, economic, copyright, institutional, and so on) means that "there cannot be a single comprehensive technical solution." That is not pessimism; it is realism. And it is not an excuse to give up, but a reason to act. We all must participate in preservation. As David says, the best we can do is to combine a diversity of partial solutions.
Imagine how much better a job 100, or 500, or 1000 FDLP libraries could do than GPO can do on its own.