I was honored last week to be part of a panel hosted by OpenTheGovernment and the Bauman Foundation to talk about the End of Term project. Other presenters included Jess Kutch at Coworker.org and Micah Altman, Director of Research at MIT Libraries. I talked about what EOT is doing, as well as some of the other great projects, including Climate Mirror, Data Refuge and the Azimuth backup project, working in concert/parallel to preserve federal climate and environmental data.
I thought the Q&A segment was especially interesting because it raised and answered some of the common questions and concerns that EOT receives on a regular basis. I also learned about a cool project called Violation Tracker, a search engine on corporate misconduct. I was also able to talk a bit about needs going forward, including the idea of “Information Management Plans” for agencies, similar to the idea of “Data Management Plans” for all federally funded research. I was heartened to know that there is interest in that as a wider policy advocacy effort!
The full recorded meeting can be viewed here from Bauman’s Adobe Connect account.
Here’s more information on the EOT crawl and how you can help.
Coalitions of government, university, and public interest organizations have been working to ensure as much information as possible is preserved and accessible, amid growing concern that important and sensitive government data on climate, labor, and other issues may disappear from the web once the Trump Administration takes office.
Last Thursday, OTG and the Bauman Foundation hosted a meeting of advocates interested in preserving access to government data, and individuals involved in web harvesting efforts. James Jacobs, a government information librarian at Stanford University Library who is working on the End of Term (EOT) web harvest – a joint project between the Internet Archive, the Library of Congress, the Government Publishing Office, and several universities – spoke about the EOT crawl, and explained the various targets of the harvest, including all .gov and .mil web sites, government social media accounts, and more.
Jess Kutch discussed efforts by Coworker.org with Cornell University to preserve information related to workers’ rights and labor protections, and other meeting attendees presented some of their own projects as well. Philip Mattera explained how Good Jobs First is using its Violation Tracker database to scrape and preserve government source material related to corporate misconduct.
Micah Altman, Director of Research at MIT Libraries, presented on the need for libraries and archives to build better infrastructure for the EOT harvest and other projects – including data portals, cloud infrastructure, and technologies that enhance discoverability – so that data and other government information can be made more easily accessible to the public.
We here at FGI have been making the argument against the destruction of physical collections in connection with digitization efforts for a long time (see, e.g., Wait! Don’t Digitize and Discard! A White Paper on ALA COL Discussion Issue #1a and What You Need to Know about the New Discard Policy). So it’s nice to hear the same argument from Jeff MacKie-Mason, recently hired University Librarian and Chief Digital Scholarship Officer at UC Berkeley, on his blog madLibbing: Muddling Along in the Information Age. MacKie-Mason clearly and succinctly points out the reasons that libraries still need physical collections: many digitized works are still in copyright and their digital surrogates are therefore not shareable online, print copies are easier to read with higher comprehension rates, there is “little or no confidence that we can guarantee long-term digital preservation” (emphasis his!), and current digital surrogates from large digitization projects are less than complete (we’ve pointed this out repeatedly, e.g., in “‘An alarmingly casual indifference to accuracy and authenticity.’ What we know about digital surrogates.”). So we hope the next time your library weeds a government document under the assumption that it’s online, you’ll check the digital surrogate for completeness and at least start the discussion with your administrators about the need for a local digital archive to assure the preservation of the digital surrogate of the document that you’re about to weed. It could mean the difference between access and frustration for your user community.
One huge misconception we face is that digitizing our collections means we don’t need the print anymore. For example, we are participants in the Google Books / HathiTrust project, and most of our 11 million regular volumes have been digitized. Why not burn our print copies?
- For starters, about half of the collection is still in copyright. The HathiTrust collection can be searched, full-text, to find the existence of books, but we are not allowed to let people use the digital copy (with limited exceptions, e.g., for the blind, who can listen to a text-to-voice conversion). Decades before this need for our print copies goes away.
- Second, we are here not to build collections for their own sake, but to serve our faculty and students. And many of them vastly prefer doing their work from print copies. Those who read long monographs find it easier and their comprehension higher. Those who need to study large images or maps, in high resolution, or who want to see side-by-side page comparisons, need the print. And for many rare and historical documents, the materiality of the original document itself is of enormous importance for scholarship, from the marginal annotations to the construction of the volume.
- Next, we can have little or no confidence that we can guarantee long-term digital preservation. Digital storage has been around a relatively short time. In that time, formats change frequently. Hardware and software to render digital formats changes. Bits on storage media rot. Keeping bits and being able to find and access them in the future requires large annual expenditures, and those expenditures are getting larger as the amount of content we want to preserve grows enormously fast. Further, much of scholarly content currently is held on servers of for-profit companies, and we have no guarantee those companies will survive, or that they will take care to ensure that their archives of scholarly publications survive.
- The Google project has been very good, but it is not complete. It does not scan fold-out pages, for example, which are in many scholarly books (maps, charts, tables). We have discovered that sometimes they miss pages, or the quality is not readable.
So, for now, there is pretty much consensus among research scholars and librarians that we must keep print copies for preservation in all cases, and for continuing use in many cases.
Time once again for a selection of news and new resources that we hope will be of interest to the FGI community. The posts are from INFOdocket.com (@infodocket), where we compile and post new items daily from a variety of sources.
The Associated Press (AP) has a story out today covering the Cyber Cemetery project at the University of North Texas Libraries. I came across a version at the Federal News Radio website, but I imagine it has been picked up elsewhere:
Government Web sites kept alive at Cyber Cemetery, 14 September 2009.