James Staub's blog
I'm going to have way too much fun browsing through the project proposals. Casting my votes, though, is already proving difficult... Using user downtime? Massive mashup calendars? Warnings for hidden corporate abuses every time I make an online purchase?
And of course, many of these envisioned projects would not even be imaginable were it not for widely available and mostly reliable government information data sources.
The GPO has posted a recap of the 2007 Fall Federal Depository Library Conference and Fall Depository Library Council Meeting that includes audio files, photos, and unedited transcripts. Of particular note: Ric Davis, who has been serving as Interim Superintendent of Documents, has accepted the job of Interim Superintendent of Documents; the FDSys has been renamed FDSys; and the Depository Library Council is now directly asking "What does FDSys mean for libraries?" Also of note: there were excellent discussions of official and authentic online legal materials and shared models for regional FDLP libraries.
Attendees of the 2006 Fall Federal Depository Library Conference and Council Meeting heard about a number of developments in government information and the Federal Depository Library Program, including:
- NTIS will investigate providing access to its digital content for federal depository libraries
- celebrating Judith Russell's tenure as Superintendent of Documents
- All Depository Library Council sessions from this meeting will be released as podcasts!
To learn about these developments and to hear low-quality audio before the official podcasts come out, take a stroll through the FGI Fall 2006 FDLP/DLC roundup page. Virginia Rigby has graciously offered her detailed notes from the sessions she attended - I'll be posting these and my own notes over the next couple of days. When listening to the audio files, wear headphones!
And, as always, we'd love for you to share your own notes and experiences from the conference!
Welcome to another exciting episode of FGI's Not Just Blogs! When last we left our daring document do-gooders, they were examining clues pointing to the problem of Authenticity. Collective "Jinkies" let loose when they came to the realization that their online copies of Distinguishing Bolts from Screws were only as authentic as the trust they could invest in the Web sites hosting them, and no amount of pure technology could replace the security that comes from trust. Luckily, libraries - with a primary mission to deliver authentic information and a long history of success doing it - are well-positioned to continue the work we trust them to do in delivering government information.
Today we find our gutsy govdockers cracking into the Vault of Preservation.
To join along in the adventure, click on the "Issues" link at the top of FGI's pages.
Then click on "Preservation" in the Issues box.
Digital decay -- Yikes!
Obsolete file formats -- Zoinks!
Intentional, malicious removal of information from centralized repositories -- Oh, no!
Thank goodness libraries are working with projects like LOCKSS to secure permanent public access to electronic government information. Whew!
(Oh, and, by the way, if you happen to have the answer to these troubling preservation questions, could you please drop us a line in the comments? Thanks!)
Join us next week when Shinjoung leads us through government information issues of privacy, and you'll hear Daniel say:
And besides all that, what we need is a decentralized, distributed system of depositing electronic files to local libraries willing to host them.
Earlier this year, Daniel estimated the average size of an online federal document as between 5MB and 10MB. Libraries investigating digital deposit and provision of permanent public access to these resources need to estimate the cost of storage for these documents.
For the past week, I've played around in an entirely nonrandom sample of online docs to try to get an accurate estimate. Although I'm not close to a reliable estimate, I'd still like to share what I've done...
- grabbed all (1,234) MARC records with 856 fields from DDM2 for the GPO Timestamp range 2006 06 01 - 2006 06 30
- used wget to retrieve all URLs listed in those 856 fields
- slapped the wget logs into a vaguely useful Excel spreadsheet (thanks to liberal regexp-ing in jEdit)
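The log-scraping step could also be done in a few lines of Python rather than editor regexps. This is only a sketch, not the process actually used here: the sample log text and the function name are made up, and it assumes the log contains `Length:` lines in roughly the form wget prints for HTTP responses (the exact layout differs between wget versions).

```python
import re

# Matches lines such as "Length: 1,234 (1.2K) [text/html]" or
# "Length: 52344 [application/pdf]". The layout is an assumption and
# varies between wget versions.
LENGTH_RE = re.compile(r"^Length:\s*([\d,]+)\b.*?\[([^\]]+)\]", re.MULTILINE)

def sizes_from_wget_log(log_text):
    """Return (size_in_bytes, mime_type) pairs for each Length line in a wget log."""
    results = []
    for match in LENGTH_RE.finditer(log_text):
        size = int(match.group(1).replace(",", ""))
        results.append((size, match.group(2)))
    return results

# Hypothetical log excerpt, for illustration only.
SAMPLE_LOG = """\
--12:00:01--  http://example.gov/report.pdf
Length: 2,523,136 (2.4M) [application/pdf]
--12:00:09--  http://example.gov/toc.html
Length: 14,210 (14K) [text/html]
--12:00:10--  http://example.gov/unknown
Length: unspecified [text/html]
"""

parsed = sizes_from_wget_log(SAMPLE_LOG)
```

Responses with no reported size ("unspecified") are skipped here, so anything served without a length would be missing from the totals.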
The basic results:
| TOTAL SIZE MB | 2004.7 |
| AVG SIZE KB | 1530 |
'Course, these numbers don't mean much against a little scrutiny. The 856 field often points to table of contents pages (when it points to the document at all...), and that single page is all that gets counted in this simple investigation.
PDF files might offer a better estimate than HTML files. Although publishers can split up documents into multiple PDF files and have a "Table of Contents" PDF file point to these multiple resources composing a single bibliographic unit, this doesn't appear to be too common. When 856 fields point to PDF files, they tend to be self-sufficient, whole bibliographic units. So here are the numbers for PDF files retrieved using the 856 fields:
| TOTAL SIZE MB | 1961 |
| AVG SIZE KB | 2464 |
| STD DEV SIZE KB | 7605 |
| MAX SIZE KB | 148902 |
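Summary figures like these can be computed from a list of file sizes in bytes. The sketch below is an assumption-laden stand-in for the spreadsheet: it takes 1 KB as 1024 bytes and uses the sample standard deviation, neither of which is stated in this post, and the function name and example sizes are hypothetical. It also reports the median, since a standard deviation about three times the average suggests a few very large files are pulling the average up.

```python
import statistics

def summarize_sizes(sizes_bytes):
    """Summarize file sizes (in bytes) as total MB plus KB statistics."""
    sizes_kb = [b / 1024 for b in sizes_bytes]  # assumes 1 KB = 1024 bytes
    return {
        "total_mb": sum(sizes_bytes) / (1024 * 1024),
        "avg_kb": statistics.mean(sizes_kb),
        "median_kb": statistics.median(sizes_kb),
        "std_dev_kb": statistics.stdev(sizes_kb),  # sample std dev (assumption)
        "max_kb": max(sizes_kb),
    }

# Made-up sizes, for illustration only.
example = summarize_sizes([1024, 2048, 3072])
```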
In a true demonstration of futility, I looked at 124 of the HTML files (of the 525 in the June 2006 DDM2 sample) that are stopping points for the 856 pointers. Most of these totally-non-random-sample HTML pages do not constitute the entire document described in the MARC record. I developed various wget capture strategies for 84 of these online documents, and the average size of the "cluster" of files captured per 856 pointer was 8.17 MB (median: 3.19 MB, std dev: 13.09 MB).
In a vaguely related exercise, I grabbed the various files composing Foreign Relations of the United States, vols. E-1, E-5, and E-7. Sure, they're outliers w/r/t size, but I thought I'd mention them anyway...
I don't have a reliable estimate yet. At the end of the week, though, 5-10 MB seems like a pretty good estimate to me.
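To see what a 5-10 MB estimate might mean for storage, here is a back-of-the-envelope sketch. It assumes, without any evidence from this sample, that June 2006 (1,234 records with 856 fields) is a typical month and that each record corresponds to one document; the function name and the 1 GB = 1024 MB convention are also my own choices.

```python
RECORDS_PER_MONTH = 1234   # June 2006 DDM2 sample from this post
LOW_MB, HIGH_MB = 5, 10    # per-document size estimate range

def projected_storage_gb(records, mb_per_doc, months=1):
    """Storage in GB (1 GB = 1024 MB) for `records` documents per month."""
    return records * mb_per_doc * months / 1024

monthly = (projected_storage_gb(RECORDS_PER_MONTH, LOW_MB),
           projected_storage_gb(RECORDS_PER_MONTH, HIGH_MB))
yearly = (projected_storage_gb(RECORDS_PER_MONTH, LOW_MB, months=12),
          projected_storage_gb(RECORDS_PER_MONTH, HIGH_MB, months=12))

print(f"per month: {monthly[0]:.1f}-{monthly[1]:.1f} GB")
print(f"per year:  {yearly[0]:.1f}-{yearly[1]:.1f} GB")
```

Under those assumptions, a library would need roughly 6 to 12 GB per month, or about 72 to 145 GB per year, before counting any backup copies.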
What exactly are the preservation concerns for open source standards expressed by Massachusetts state supervisor of public records, Alan Cote?
Are any of them legitimate concerns?
"The rigid policy, such as the initiative before you that excludes any vendor or any process and relies on questionable, untested and unreliable practices or tools, does not suit the commonwealth well," Cote said in prepared remarks. "It may very well result in many electronic records being lost or destroyed."
Cote added that the state's records management system renders what format a document is saved in as moot.
[Anybody want to help me with timestamps on this "transcript"? - JS]
[I will. - Daniel]
The Basic Timestamps
[Audio recording begins after Ric Davis' reading of Judy Russell's letter]
0:00:00 Mike Wash describes progress on GPO's Future Digital System (FDSys).
0:34:00 Questions from Depository Library Council.
0:44:00 Questions from the audience.
The Detailed Version
JR has emergency family business and will not be joining the conference.
RD is sitting in as JR and delivers her prepared remarks. [And somehow James S. managed to not record this…]
RD offers a nod to the "unofficial" bloggers in the audience :)
The mechanics of Council business for this meeting: The Depository Library Council will develop a vision document to deliver to the Public Printer Wednesday morning.
200510 - Depository Library Council - Personal name abbreviations James S. is using in his DLC posts
BJ: Bruce James: Public Printer of the United States: US Government Printing Office
JR: Judith (Judy) Russell: Managing Director: Information Dissemination, US Government Printing Office
MW: Mike Wash: Chief Technical Officer and co-Director of the Office of Innovation and New Technology.
RD: Richard (Ric) Davis: Acting Director: Library Services and Content Management, US Government Printing Office
TE: Thomas (TC) Evans: Assistant Chief of Staff for Strategic Initiatives, US Government Printing Office
DEPOSITORY LIBRARY COUNCIL
AM: Ann Miller: Head, Public Documents and Maps Department: Duke University
BS: Barbara S. Selby: Government Information Librarian: University of Virginia
CE: Charles D. Eckman: Principal Government Documents Librarian: Stanford University
CM: Cheryl Knott Malone: Associate Professor: University of Arizona
DA: Duncan M. Aldrich: DataWorks Coordinator: University of Nevada, Reno
0:00:15 MP: Law libraries want paper because the issuing agencies recognize paper as the only official version.
0:00:50 DA: Horse and buggy, paper and electronic.
0:02:25 MS: Publishers anguish over the loss of control of content as libraries seek to customize it. We want platform-neutral content.
0:05:10 BJ: responds. Carnegie-Mellon is developing automated language translation. We need to work with big agencies to help them prepare documents in a fashion that will produce accurate automated translations.
0:07:00 CM: asks for an update on the mass digitization of the legacy collection.
70% of the FDLP collection will be online by the end of 2007.
We need to determine what other documents need to go into the collection. E.g., the Federalist Papers are not FDLP, but are undeniably important.