Preview of Capturing Electronic Documents on the Last Frontier
Since I’ve been mentioning my talks and those of other FGI volunteers at the Nevada Library Association Annual Conference, I thought I’d share the handout from my “Capturing e-docs on the last frontier” talk. I find this outline especially worth mentioning here because it touches on the notion of digital deposit for a state depository program.
Disclaimer – This is not a proposed solution for the Federal Depository Library Program. I don’t believe that what I’m doing or planning to do in Alaska would scale to the vastly larger volume of federal documents. But I hope it will be food for thought.
And now, the handout…
Capturing E-Docs on the Last Frontier
Presented by Daniel Cornwall, Alaska State Library to the
Nevada Library Association, October 2005
Alaska has been collecting “Internet Only” state publications since 1998. We moved from manual harvesting in 1998 to automated harvesting with Teleport Pro in May 2002, and then, in August 2004, to automated harvesting with Capturing Electronic Publications (CEP), software commissioned by the Illinois State Library.
Processing of electronic documents
From 1998 through June 2005, the Alaska State Library would print one copy of an electronic publication for our non-circulating Historical Collections. Publications would be cataloged in paper format with an 856 field indicating the URL. In any given month, between 40 and 60 percent of publications added to our collection are “Internet Only.” Publication receipts vary between 80 and 300 per month.
Beginning with documents cataloged in July 2005, the Library has begun to store cataloged state electronic publications on its web server. We still print out a preservation copy, but the 856 field now points to the harvested library copy. These copies can be accessed through our catalog, or through our shipping lists page at http://www.library.state.ak.us/asp/shippinglists/shippinglists.html. We feel that placing harvested documents on a server the library controls will eliminate link rot and allow us to demonstrate the usage of electronic state documents. It is not in and of itself a preservation measure.
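As a rough illustration of the 856 change described above, the sketch below builds an 856 field in a common mnemonic display form and repoints it from an agency URL to a library-hosted copy. The class name, both URLs, and the choice of indicators (4 for HTTP access, 1 for "version of resource") are assumptions made for the example; the handout does not say how the Alaska State Library codes its records.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Field856:
    """A simplified MARC 856 (Electronic Location and Access) field."""
    url: str                # subfield $u, the URI
    indicator1: str = "4"   # 4 = access method is HTTP
    indicator2: str = "1"   # 1 = version of resource (assumed for a paper record)

    def render(self) -> str:
        # Mnemonic display form: tag, indicators, then subfield $u.
        return f"856 {self.indicator1}{self.indicator2} $u {self.url}"


def repoint(field: Field856, library_url: str) -> Field856:
    """Return a copy of the field whose $u points at the harvested copy."""
    return replace(field, url=library_url)


# Hypothetical URLs, for illustration only.
agency_field = Field856("http://agency.example/reports/annual2005.pdf")
library_field = repoint(
    agency_field,
    "http://www.library.state.ak.us/asp/edocs/2005/07/annual2005.pdf",
)

print(agency_field.render())
print(library_field.render())
```

Because only the `$u` value changes, the rest of the catalog record can stay as it was when the publication was cataloged in paper format.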
Safeguards for agencies concerned about users getting dated content
While we have not surveyed agencies recently, some indexing projects in the past made us aware of the legitimate concern that agencies had about people accessing a superseded version of a document expecting the current version. So we have used the robots.txt protocol to prevent search engines from indexing the harvested content contained in “library.state.ak.us/asp/edocs/.” The shipping lists should be visible to major search engines in the near future. When a user picks a document off our shipping list, the date of the document is obvious. The same is true of an electronic document that is accessed through our catalog.
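The robots.txt safeguard can be sketched as follows. The rules shown are an assumption about what such a file might contain, not a copy of the library’s actual file, and the document path under /asp/edocs/ is hypothetical. Python’s standard urllib.robotparser module is used to check which URLs a compliant crawler would skip.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules: keep crawlers out of the harvested copies while leaving
# the rest of the site, including the shipping lists, open to indexing.
ROBOTS_TXT = """\
User-agent: *
Disallow: /asp/edocs/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
BASE = "http://www.library.state.ak.us"

harvested = BASE + "/asp/edocs/2005/07/example.pdf"  # hypothetical path
shipping = BASE + "/asp/shippinglists/shippinglists.html"

print("harvested copy crawlable:", parser.can_fetch("*", harvested))
print("shipping lists crawlable:", parser.can_fetch("*", shipping))
```

Note that robots.txt is a request that well-behaved crawlers honor, not an access control; the documents remain reachable by anyone who follows a link from the catalog or a shipping list.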
Current and projected space requirements
A. Actual publications stored on Alaska State Library web server
Since we began storing state agency publications on our servers with the July 2005 shipping lists, we have stored 176 publications taking up 468 MB. Based on examination of electronic documents back to May 2005, we estimate adding 2 GB worth of state electronic publications to our server each year. Naturally this is a guess, but with inexpensive 250 GB hard drives readily available, we feel confident in our storage space for years to come.
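The arithmetic behind that confidence can be worked through in a few lines. The figures are the ones given above; the assumptions are that growth stays constant at the estimated rate, that one 250 GB drive is devoted entirely to publications, and that 1 GB is counted as 1024 MB.

```python
# Figures taken from the handout.
PUBLICATIONS_STORED = 176
SPACE_USED_MB = 468
GROWTH_GB_PER_YEAR = 2
DRIVE_CAPACITY_GB = 250

MB_PER_GB = 1024  # assumption: binary gigabytes


def average_size_mb(total_mb: float, count: int) -> float:
    """Mean size of one stored publication, in MB."""
    return total_mb / count


def years_until_full(capacity_gb: float, growth_gb_per_year: float,
                     used_gb: float = 0.0) -> float:
    """Years before the drive fills, assuming constant yearly growth."""
    return (capacity_gb - used_gb) / growth_gb_per_year


avg = average_size_mb(SPACE_USED_MB, PUBLICATIONS_STORED)
years = years_until_full(DRIVE_CAPACITY_GB, GROWTH_GB_PER_YEAR,
                         used_gb=SPACE_USED_MB / MB_PER_GB)

print(f"Average publication size: {avg:.2f} MB")
print(f"Years of headroom on one drive: {years:.1f}")
```

Under these assumptions the average publication is a little under 3 MB, and a single drive would take more than a century to fill, so the estimate can be off by a wide margin before storage becomes a constraint.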
B. State Agency websites collected and stored in compressed format by the Capturing Electronic Publications Software.
As of October 2005, the total material collected by the 74 CEP spiders took up 90.8 GB. Much of this material is not publication related (e.g., forms, pictures, flyers, and posters). Each month the CEP spiders collect nearly 40 GB of material. Only new material is retained by the CEP archive.
Now that we have pulled together our cataloged electronic state documents onto a single server, we have the potential to create a distributed system of digital deposit. Our eight Alaska State Document depositories are being surveyed for their interest in holding and serving electronic state documents. Several technologies, including distribution via CD/DVD, offline-browser tools such as Teleport Pro, and FTP, are being investigated as ways to provide electronic state documents to our depository members. It is our belief that, however implemented, multiple electronic copies will keep publications safer than a single agency copy.
Past and current e-docs gathering tools
Original Spidering/Gathering Tool – Teleport Pro from Tenmax.
Purchase/Download from: http://www.tenmax.com/teleport/pro/home.htm
Pros: Very easy to use. Can be used to retrieve just the file types you want (i.e., there is no need to download an entire agency web site). Runs on Windows, so isn’t considered “weird” by many IT departments. All retrieved material remains in readable format. Can override the robot exclusion protocol.
Cons: No automatic notifications of new files. Newer files with same name overwrite older ones. Files in directories from many different dates. No tracking of file metadata. No reporting capabilities on web site sizes or file types.
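Retrieving only certain file types comes down to filtering candidate URLs by extension before downloading them. The sketch below shows the general idea; it is not how Teleport Pro works internally, and the extension list and URLs are hypothetical.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical set of "document-like" extensions to keep.
DOCUMENT_EXTENSIONS = {".pdf", ".doc", ".xls", ".ppt"}


def is_wanted(url: str, extensions=DOCUMENT_EXTENSIONS) -> bool:
    """True if the URL's path ends in one of the wanted extensions."""
    path = urlparse(url).path           # drops any query string or fragment
    return splitext(path)[1].lower() in extensions


# Hypothetical URLs found while crawling an agency site.
urls = [
    "http://agency.example/reports/annual2005.PDF",
    "http://agency.example/images/logo.gif",
    "http://agency.example/forms/permit.doc?version=2",
    "http://agency.example/index.html",
]

wanted = [u for u in urls if is_wanted(u)]
for u in wanted:
    print(u)
```

A filter like this keeps downloads small, but it misses publications issued as plain HTML pages, which is one reason a harvester may end up retrieving an entire site instead.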
Current Spidering/Gathering Tool – Capturing Electronic Publications
Download from: http://www.isrl.uiuc.edu/pep/#CEP
Pros: Once setup is complete, relatively easy to use. E-mail alerting service for new, changed, and deleted files from a target web site. Web-based administration for adding/modifying/deleting agency web sites. Good version control. Compressed storage makes it easy to store months’ worth of sites. Great reporting capabilities for number, type, and storage space of captured files.
Cons: Difficult to search compressed file archive. Non-text files are unreadable until the entire web site is “checked out.” Reports files as “deleted” if they’ve been moved. To retrieve all possible “document-like” objects, one must retrieve the entire web site, exclusive of graphics files. This increases storage requirements.
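The new/changed/deleted alerts, and the way a moved file shows up as “deleted,” can be illustrated by comparing two harvests of the same site. This is a minimal sketch of the general technique, not CEP’s actual implementation: each snapshot is assumed to be a mapping from file path to a hash of the file’s content, and all paths and contents are made up.

```python
import hashlib


def digest(content: bytes) -> str:
    """Content fingerprint used to tell whether a file has changed."""
    return hashlib.sha256(content).hexdigest()


def compare_snapshots(old: dict, new: dict) -> dict:
    """Classify paths as new, changed, deleted, or (probably) moved.

    `old` and `new` map file path -> content digest for two harvests.
    """
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])

    # Index newly seen paths by digest so identical content can be matched.
    added_by_digest = {}
    for path in added:
        added_by_digest.setdefault(new[path], []).append(path)

    # A removed path whose content reappears elsewhere was probably moved.
    moved = []
    for path in removed:
        candidates = added_by_digest.get(old[path])
        if candidates:
            moved.append((path, candidates.pop(0)))

    moved_from = {src for src, _ in moved}
    moved_to = {dst for _, dst in moved}
    return {
        "new": [p for p in added if p not in moved_to],
        "changed": changed,
        "deleted": [p for p in removed if p not in moved_from],
        "moved": moved,
    }


# Hypothetical harvests of one agency site, a month apart.
last_month = {
    "/reports/annual.pdf": digest(b"annual report"),
    "/reports/budget.pdf": digest(b"budget, first version"),
    "/forms/old.pdf": digest(b"old form"),
}
this_month = {
    "/publications/annual.pdf": digest(b"annual report"),    # same content, new path
    "/reports/budget.pdf": digest(b"budget, second version"),
    "/reports/newsletter.pdf": digest(b"newsletter"),
}

result = compare_snapshots(last_month, this_month)
for kind, paths in result.items():
    print(kind, paths)
```

Without the digest matching step, the relocated annual report would be reported once as deleted and once as new, which is the behavior listed among the cons above.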
Alaska State Publications Program Contact Information:
Government Publications Librarian
Alaska State Library
PO Box 110571
Juneau AK 99811-0571
E-mail: [email protected]