Earlier this year, Daniel estimated the average size of an online federal document as between 5MB and 10MB. Libraries investigating digital deposit and provision of permanent public access to these resources need to estimate the cost of storage for these documents.
For the past week, I’ve played around in an entirely nonrandom sample of online docs to try to get an accurate estimate. Although I’m not close to a reliable estimate, I’d still like to share what I’ve done…
- grabbed all (1,234) MARC records with 856 fields from DDM2 for the GPO Timestamp range 2006 06 01 – 2006 06 30
- used wget to retrieve all URLs listed in those 856 fields
- slapped the wget logs into a vaguely useful Excel spreadsheet (thanks to liberal regexp-ing in jEdit)
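For anyone who would rather skip the jEdit-and-spreadsheet step, here is a minimal sketch of pulling URL, size, and MIME type out of a wget log with Python. The log shape is an assumption (a `--timestamp--  URL` line followed later by a `Length: N (human) [mime/type]` line, which is roughly what wget prints); the function name `parse_wget_log` and the example.gov URLs are made up for illustration, and this is not the exact regexp-ing I did.

```python
import re

# Assumed log shape (not taken from my actual logs):
#   --2006-06-30 12:00:00--  http://example.gov/report.pdf
#   Length: 1,234,567 (1.2M) [application/pdf]
URL_RE = re.compile(r"^--.*?--\s+(\S+)")
LENGTH_RE = re.compile(
    r"^Length:\s+([\d,]+)(?:\s+\([^)]*\))?(?:\s+\[([^\]]+)\])?"
)


def parse_wget_log(text):
    """Return a list of (url, size_in_bytes, mime_type) tuples.

    Responses without a numeric length (e.g. "Length: unspecified")
    are skipped rather than guessed at.
    """
    rows = []
    current_url = None
    for line in text.splitlines():
        line = line.strip()
        url_match = URL_RE.match(line)
        if url_match:
            # Remember the most recent request; redirects overwrite it.
            current_url = url_match.group(1)
            continue
        length_match = LENGTH_RE.match(line)
        if length_match and current_url is not None:
            size = int(length_match.group(1).replace(",", ""))
            mime = length_match.group(2) or "unknown"
            rows.append((current_url, size, mime))
            current_url = None
    return rows
```

The resulting rows can be written out with the `csv` module and opened in a spreadsheet, or filtered by MIME type to separate PDF from HTML.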
The basic results:
| Measure | Value |
|---|---|
| Total size (MB) | 2004.7 |
| Average size (KB) | 1530 |
‘Course, these numbers don’t mean much against a little scrutiny. The 856 field often points to table of contents pages (when it points to the document at all…), and that single page is all that gets counted in this simple investigation.
PDF files might offer a better estimate than HTML files. Although publishers can split up documents into multiple PDF files and have a “Table of Contents” PDF file point to these multiple resources composing a single bibliographic unit, this doesn’t appear to be too common. When 856 fields point to PDF files, they tend to be self-sufficient, whole bibliographic units. So here are the numbers for PDF files retrieved using the 856 fields:
| Measure | Value |
|---|---|
| Total size (MB) | 1961 |
| Average size (KB) | 2464 |
| Std dev of size (KB) | 7605 |
| Max size (KB) | 148902 |
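Figures like these are easy to recompute from a list of per-file byte counts. The sketch below is an illustration, not the spreadsheet formulas I actually used: it assumes 1 KB = 1024 bytes and a sample (n - 1) standard deviation, neither of which I've pinned down above, and `summarize_sizes` is a made-up name.

```python
import statistics


def summarize_sizes(sizes_bytes):
    """Summarize file sizes given in bytes.

    Assumptions (not confirmed by the original spreadsheet):
    1 KB = 1024 bytes, 1 MB = 1024 KB, sample standard deviation.
    """
    sizes_kb = [size / 1024 for size in sizes_bytes]
    return {
        "total_mb": sum(sizes_bytes) / (1024 * 1024),
        "avg_kb": statistics.mean(sizes_kb),
        # stdev() needs at least two data points.
        "std_dev_kb": statistics.stdev(sizes_kb) if len(sizes_kb) > 1 else 0.0,
        "max_kb": max(sizes_kb),
    }
```

A standard deviation roughly three times the mean, as in the PDF numbers, is a hint that a handful of very large files dominate the total, so it is worth looking at the median too.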
In a true demonstration of futility, I looked at 124 of the HTML files (of the 525 in the June 2006 DDM2 sample) that are stopping points for the 856 pointers. Most of these totally-non-random-sample HTML pages do not constitute the entire document described in the MARC record. I developed various wget capture strategies for 84 of these online documents, and the average size of the “cluster” of files captured per 856 pointer was 8.17 MB (median: 3.19 MB, std dev: 13.09 MB).
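To make the “cluster” measurement concrete, here is a rough sketch of how the per-pointer totals could be tallied. It assumes each capture was saved into its own directory (for example with wget's `-P` option), which is a simplification of the varied strategies I actually used; the function names are hypothetical.

```python
import os
import statistics


def cluster_size_bytes(root):
    """Add up the sizes of all files under one capture directory."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def summarize_clusters(cluster_dirs):
    """Mean, median, and sample std dev of cluster sizes, in MB.

    Assumes 1 MB = 1024 * 1024 bytes and one directory per 856 pointer.
    """
    sizes_mb = [cluster_size_bytes(d) / (1024 * 1024) for d in cluster_dirs]
    return {
        "mean_mb": statistics.mean(sizes_mb),
        "median_mb": statistics.median(sizes_mb),
        "std_dev_mb": statistics.stdev(sizes_mb) if len(sizes_mb) > 1 else 0.0,
    }
```

Reporting the median alongside the mean matters here: a mean of 8.17 MB against a median of 3.19 MB says a few big clusters are pulling the average up.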
In a vaguely related exercise, I grabbed the various files composing Foreign Relations of the United States, vols. E-1, E-5, and E-7. Sure, they’re outliers w/r/t size, but I thought I’d mention them anyway…
I don’t have a conclusion yet. At the end of the week, though, 5-10 MB seems like a pretty good estimate to me.