book scan wizard + internet archive = DIY public domain digital book repository

At the last couple of Depository Library Council meetings, I’ve heard comments from documents librarians, especially those at smaller institutions, that they’d love to participate in digitizing historic government documents but, for various reasons (lack of $$, staffing, time, technical infrastructure, etc.), could not undertake large-scale digitization projects.

Now there’s a way for lots of libraries to chip in on the greater goal of increased access to historic government documents with very little $$ or infrastructure. We’ve written before about BookLiberator and DIYbookscanner, two projects working on low-cost hardware for digitizing books using off-the-shelf digital cameras and free open source software called Book Scan Wizard.

But two pieces were still missing to make the whole workflow run smoothly for libraries and government documents collections of all sizes. The first of these, the third piece of the puzzle overall, became a reality with yesterday’s announcement that Book Scan Wizard has teamed up with the Internet Archive to provide automatic uploads of scans to the Internet Archive (directions and more information here). Hardware: check. Software: check. Digital infrastructure: check.

With the new version of Book Scan Wizard, or even by uploading directly to the Internet Archive, any PDF composed of page images, or any organized zip file of page images, will be processed automatically. The Internet Archive’s servers perform optical character recognition (OCR) on the book and make PDF, EPUB, Kindle (MOBI), DAISY, DjVu, and plain-text copies of the entire book available for download by anyone, anywhere. You can see a sample book from this process to get a better idea. All of this happens within a few hours of the upload. This is free OCR for anyone in the world.
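For anyone uploading directly rather than through Book Scan Wizard, here is a minimal sketch of building an "organized zip file" of page images with Python's standard library. The function names, the `book` prefix, and the zero-padded naming pattern are illustrative assumptions, not Internet Archive requirements; check the Internet Archive's own upload directions for the naming it expects.

```python
import re
import zipfile
from pathlib import Path

# Image types treated as page scans (an assumption; adjust as needed).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".jp2"}


def natural_key(name):
    """Sort key that puts 'page2' before 'page10'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def build_page_zip(image_dir, zip_path, prefix="book"):
    """Zip the page images in image_dir in page order.

    Each page is stored under a zero-padded sequential name such as
    book_0001.jpg, so the page order is unambiguous inside the archive.
    Returns the list of names written to the zip.
    """
    pages = sorted(
        (p for p in Path(image_dir).iterdir()
         if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES),
        key=lambda p: natural_key(p.name),
    )
    if not pages:
        raise ValueError(f"no page images found in {image_dir}")

    names = []
    # ZIP_STORED: scanned images are already compressed, so skip deflating.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as archive:
        for number, page in enumerate(pages, start=1):
            arcname = f"{prefix}_{number:04d}{page.suffix.lower()}"
            archive.write(page, arcname)
            names.append(arcname)
    return names
```

Calling `build_page_zip("scans", "mybook.zip")` on a hypothetical `scans` folder would produce a single zip ready to upload by hand. Renaming pages inside the archive, rather than on disk, leaves the original camera files untouched.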

Now there’s one last piece needed: scan on demand. This idea has already been put into practice by the Internet Archive’s Open Library in partnership with the Boston Public Library (BPL). The BPL’s scan-on-demand project (now retired, it seems) allowed users to request a scan of a public domain book directly from the Open Library catalog. What we need is to open up the Catalog of Government Publications (CGP), which will soon include over 1 million records from GPO’s historic shelflist spanning the 1870s to 1992, in a similar way.

GPO could manage this scan-on-demand process (or allow libraries to pick and choose documents from the CGP), connect the bibliographic metadata from the historic shelflist, and upload the scans to both the Internet Archive and FDsys. The circle would then be complete. Am I missing anything? Would love to hear readers’ thoughts.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
