

Toward estimating file sizes of online federal documents

Earlier this year, Daniel estimated the average size of an online federal document as between 5 MB and 10 MB. Libraries investigating digital deposit and provision of permanent public access to these resources need to estimate the cost of storing these documents.

For the past week, I’ve played around with an entirely nonrandom sample of online docs to try to get an accurate estimate. Although I’m not close to a reliable estimate, I’d still like to share what I’ve done…

The process:

  1. grabbed all 1,234 MARC records with 856 fields from DDM2 for the GPO Timestamp range 2006-06-01 – 2006-06-30
  2. used wget to retrieve all URLs listed in those 856 fields (a sketch of the per-URL wget driver appears below)
  3. slapped the wget logs into a vaguely useful Excel spreadsheet (thanks to liberal regexp-ing in jEdit)
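
For the wget step, I used a little Perl wrapper to fire off one wget instance per URL so that each download got its own log (more on that in the comments below). Here’s a minimal sketch of that kind of driver — the file and directory names (urls.txt, logs/, files/) and the wget options are placeholders, not the exact script I ran:

    #!/usr/bin/perl
    # fetch_856.pl -- sketch of a per-URL wget driver; urls.txt, logs/ and
    # files/ are placeholders, not the actual files used.
    use strict;
    use warnings;

    open my $urls, '<', 'urls.txt' or die "can't open urls.txt: $!";
    mkdir 'logs'  unless -d 'logs';
    mkdir 'files' unless -d 'files';

    my $n = 0;
    while (my $url = <$urls>) {
        chomp $url;
        next unless $url =~ m{^https?://};
        $n++;
        # one wget instance and one log per URL, so each 856 pointer can be
        # matched back to its own download record later
        system('wget', '--tries=2', '--timeout=60',
               '-o', "logs/$n.log",   # per-URL log file
               '-P', 'files',         # save retrieved files under files/
               $url);
    }
    close $urls;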

The basic results:

TOTAL URLS: 1,342
TOTAL SIZE: 2,004.7 MB
AVG SIZE: 1,530 KB
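
(These numbers are just sums and averages over the file sizes recorded in the wget logs. In place of the Excel/jEdit regexp step, a log-scraping sketch like the following would do much the same thing — it assumes wget’s usual “saved [bytes/bytes]” closing line, which varies a bit across wget versions and server headers.)

    #!/usr/bin/perl
    # sum_logs.pl -- sketch of scraping file sizes out of the per-URL wget
    # logs; assumes lines like "... 'doc.pdf' saved [148902/148902]", which
    # may differ slightly by wget version.
    use strict;
    use warnings;

    my ($count, $total_bytes) = (0, 0);
    for my $log (glob 'logs/*.log') {
        open my $fh, '<', $log or next;
        while (<$fh>) {
            if (/saved \[(\d+)(?:\/\d+)?\]/) {
                $count++;
                $total_bytes += $1;
            }
        }
        close $fh;
    }

    printf "files retrieved: %d\n", $count;
    printf "total size:      %.1f MB\n", $total_bytes / (1024 * 1024);
    printf "average size:    %.0f KB\n", $count ? $total_bytes / $count / 1024 : 0;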

‘Course, these numbers don’t hold up to much scrutiny. The 856 field often points to a table of contents page (when it points to the document at all…), and that single page is all that gets counted in this simple investigation.

PDF files might offer a better estimate than HTML files. Although publishers can split a document into multiple PDF files and have a “Table of Contents” PDF point to the multiple resources composing a single bibliographic unit, this doesn’t appear to be too common. When 856 fields point to PDF files, they tend to be self-sufficient, whole bibliographic units. So here are the numbers for the PDF files retrieved using the 856 fields:

FILETYPE: PDF
TOTAL URLS: 815
TOTAL SIZE: 1,961 MB
AVG SIZE: 2,464 KB
STD DEV: 7,605 KB
MAX SIZE: 148,902 KB

In a true demonstration of futility, I looked at 124 of the HTML files (of the 525 in the June 2006 DDM2 sample) that are stopping points for the 856 pointers. Most of these totally-non-random-sample HTML pages do not constitute the entire document described in the MARC record. I developed various wget capture strategies for 84 of these online documents (one such strategy is sketched below), and the average size of the “cluster” of files captured per 856 pointer was 8.17 MB (median: 3.19 MB, std dev: 13.09 MB).
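
The capture strategies varied document by document, but a typical one looked something like this: recursive wget starting from the table-of-contents page, with the recursion depth and other options adjusted per document. Take the specific flags below as illustrative, not as the exact commands I ran.

    #!/usr/bin/perl
    # capture_cluster.pl -- sketch of one possible "cluster" capture for an
    # 856 URL that stops on an HTML table-of-contents page; the recursion
    # depth and other options are illustrative only.
    use strict;
    use warnings;

    my ($url, $id) = @ARGV;
    die "usage: $0 URL ID\n" unless $url && $id;
    mkdir 'clusters' unless -d 'clusters';

    system('wget',
           '--recursive', '--level=2',   # follow links from the TOC page, two hops deep
           '--no-parent',                # stay below the TOC page's directory
           '--page-requisites',          # grab images/CSS the pages need
           '--convert-links',            # rewrite links for local viewing (adds overhead)
           '--wait=1',                   # be polite to the server
           '-P', "clusters/$id",         # one directory per 856 pointer
           '-o', "clusters/$id.log",     # one log per 856 pointer
           $url);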

In a vaguely related exercise, I grabbed the various files composing Foreign Relations of the United States, vols. E-1, E-5, and E-7. Sure, they’re outliers w/r/t size, but I thought I’d mention them anyway…

VOL   SIZE (MB)   FILES
E-1   318         880
E-5   143         687
E-7   618         892

CONCLUSION:

I don’t have one yet. At the end of the week, though, 5-10 MB seems like a pretty good estimate to me.

CC BY-NC-SA 4.0 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


2 Comments

  1. Wget ain’t fast. The initial wget action – retrieving the “stopping point” file of each of the 1,441 856-field URLs in the 1,234 MARC records – took 3 hours running on a 3.2 GHz Pentium 4 Windows XP box with 512 MB RAM. 1,342 URLs were successfully wgot = 6,478 files = just over 2,000 MB. The process was slowed down in part because I used a perl script to initiate a new wget instance for each URL to control the log output.

    The second wget action – retrieving all files within the online documents associated with the 84 856-field URLs that stopped on HTML pages – wgot 7,972 files = 686 MB, and took about 1.5 hours.

    The FRUS E volumes were wget-ted without the intermediary perl scripts, but they included wget options like --convert-links that cause some extra overhead:
    E-1: 1.3 hours
    E-5: 11 minutes
    E-7: 1.3 hours

  2. James,

    Thanks for looking at actual Federal documents! My “estimates” were solely based on my experience with Alaska docs. I hope this spurs more research from other librarians and/or some information from the Government Printing Office.

    Aside from choosing one or two other months to play with, I think your method seems sound. But then I’m not a statistician.

    Could you comment on how long it took you to gather the files with wget?

    This is really good empirical work that I think will help the community define the size problem that lies before us!

    ————————————
    “And besides all that, what we need is a decentralized, distributed system of depositing electronic files to local libraries willing to host them.” — Daniel Cornwall, tipping his hat to Cato the Elder for the original quote.
