Our mission

Free Government Information (FGI) is a place for initiating dialogue and building consensus among the various players (libraries, government agencies, non-profit organizations, researchers, journalists, etc.) who have a stake in the preservation of and perpetual free access to government information. FGI promotes free government information through collaboration, education, advocacy and research.

Digital library technologies

Here at Free Government Information, we’re extremely interested in the collection, dissemination and preservation of digital government information (for more background we point you to our manifesto of sorts) and feel that libraries have a vital role to play in this area. Cornell University has been at the forefront of digital preservation and has created a Digital Preservation Tutorial that gives a broad overview of the issues and challenges involved in digital preservation in general.

With that in mind, we are creating a list of technologies that will be of interest and importance as we move toward a digital FDLP. We’re focusing on digital technologies that get at the basic needs for digital preservation: collection or capturing of digital information, description (metadata creation), dissemination (as opposed to simple availability!), and long-term preservation.

This list is a work in progress. If you know of other technologies, software, hardware, clients, etc. that you have used and would recommend to the community, please contact admin at freegovinfo dot info or leave a comment on this page so we can add to the growing list of useful resources.

  • Archive-IT: Web archiving service from the Internet Archive. The service allows institutions to build, manage and search their own web archive through a user friendly web application, without requiring any technical expertise or hosting facilities. Check out a list of their collections. (added 2/7/07)
  • Capturing Electronic Publications (CEP): A web site archiving system developed with open source software for Unix/Linux. CEP makes it possible for organizations to periodically download and retain archival copies of their evolving web site(s). CEP uses a web spider, wget, to traverse and download a target website’s pages, and CVS to archive the pages and their subsequent changes. CEP also uses a variety of software packages to create and maintain historical data and to provide summary statistics about the website’s content. The packages used to create CEP include Fedora, Apache, CVS, Perl, GD graphic tools, TreeTagger and Wget.
  • CONTENTdm OCLC Digital Collection Management Software. “CONTENTdm® makes everything in your digital collections available to everyone, everywhere. No matter the format — local history archives, newspapers, books, maps, slide libraries or audio/video — CONTENTdm can handle the storage, management and delivery of your collections to users across the Web.”
  • cURL. A command line tool for getting or sending files using URL syntax. Curl is targeted at single-shot file transfers. Curl is not a web site mirroring program. Curl is not a wget clone.
  • Del.icio.us: del.icio.us is a social bookmarking site that allows users to bookmark and share Web sites. It also allows for collaborative collection projects like FGI’s IAdeposit project where digital govt documents that are tagged “IAdeposit” in delicious are then uploaded to and preserved in the Internet Archive’s US govt documents collection. So even if a library can’t afford to build its own digital architecture, it can still participate in a digital collection project.
  • DSpace. Open source software that enables open sharing of content that spans organizations, continents and time. “DSpace is the software of choice for academic, non-profit, and commercial organizations building open digital repositories. It is free and easy to install “out of the box” and completely customizable to fit the needs of any organization. DSpace preserves and enables easy and open access to all types of digital content including text, images, moving images, mpegs and data sets.” See also: duraspace.org.
  • EPrints. “EPrints is the most flexible platform for building high quality, high value repositories, recognised as the easiest and fastest way to set up repositories of research literature, scientific data, student theses, project reports, multimedia artefacts, teaching materials, scholarly collections, digitised records, exhibitions and performances.”
  • Fedora Commons Repository software “The Fedora Repository software has been installed by institutions, worldwide, to support a variety of digital content needs. The Fedora Repository is extremely flexible and can be used to support any type of digital content. There are numerous examples of Fedora being used for digital collections, e-research, digital libraries, archives, digital preservation, institutional repositories, open access publishing, document management, digital asset management, and more.” See also: duraspace.org.
  • Greenstone: Greenstone is a suite of software for building and distributing digital library collections. It provides a new way of organizing information and publishing it on the Internet or on CD-ROM in the form of a fully-searchable, metadata-driven digital library… The aim of the Greenstone software is to empower users, particularly in universities, libraries, and other public service institutions, to build their own digital libraries.
  • HTTrack. HTTrack allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site’s relative link-structure. Simply open a page of the “mirrored” website in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site, and resume interrupted downloads.
  • Institutional repository (IR): Many libraries (especially academic libraries) are building digital institutional repositories to collect and manage the content created by their members/professors/researchers. These software infrastructures offer soup-to-nuts tools for collecting, describing, accessing and preserving digital content. Examples include DSpace, Fedora, and EPrints.
  • Lots of Copies Keep Stuff Safe (LOCKSS): Open source software to collect, store, preserve, and provide access to local copies of electronic documents. Using peer-to-peer (P2P) architecture, libraries can compare, share and repair digital content. A low-cost digital library tool! Contact James Jacobs (jrjacobs AT stanford DOT edu) if you’re interested in joining the USdocs private LOCKSS network.
  • Lucene. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. Lucene is the guts of a search engine – the hard stuff. You write the easy stuff, the UI and the process of selecting and parsing your data files to pump them into the search engine, yourself.
  • Open Archives Initiative The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. OAI has its roots in the open access and institutional repository movements. Projects include the Protocol for Metadata Harvesting (OAI-PMH) and the Object Reuse and Exchange (OAI-ORE) standard.
  • Solr: open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features.
  • SWISH-E: Simple Web Indexing System for Humans.
  • Teleport Pro: Web-based spidering tool. This one’s NOT open source and NOT free ($40), but has been recommended by an FGI volunteer as easy to use.
  • Wget: An open source UNIX command line spidering tool used to retrieve files automatically off of web servers.
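
As a rough illustration of how one of the protocols listed above works, here is a minimal Python sketch of an OAI-PMH harvesting exchange. It builds a ListRecords request URL and pulls identifiers and Dublin Core titles out of a response. The function names, the repository address (repository.example.org) and the sample response are all invented for illustration; a real repository would need to be checked for the metadata formats and sets it actually supports, and a real harvester would also fetch the URL over HTTP and handle protocol errors.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and by Dublin Core.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           set_spec=None, resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # A resumptionToken is sent on its own, alongside only the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return ([(identifier, title), ...], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI_NS + "record"):
        identifier = record.findtext(f"{OAI_NS}header/{OAI_NS}identifier")
        # Take the first dc:title, if the record has one.
        title = next((el.text for el in record.iter(DC_NS + "title")), None)
        records.append((identifier, title))
    token = root.findtext(f".//{OAI_NS}resumptionToken")
    return records, (token or None)


# Hand-written sample response, not output from a real repository.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:doc-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample government report</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(build_list_records_url("https://repository.example.org/oai"))
    records, token = parse_list_records(SAMPLE_RESPONSE)
    for identifier, title in records:
        print(identifier, "|", title)
    print("next page token:", token)
```

A harvester would typically repeat the request with each returned resumption token until none comes back, which is how large collections get paged through.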

While not a specific technology package, people working in government-funded institutions should be aware of the Digital Preservation Network. According to their web site, the Network is “dedicated to forging a community of practitioners who are focused on the issues of preserving the digital records and publications of government. This online forum will be a repository for the exchange and discussion of ideas, research, strategy and documents that can be used by other practitioners in their organization. Membership into this community will be open to any practitioner employed in a government funded institution that is currently researching or participating in appraisal, acquisition, preservation or access of government records or publications.” Most features require registration, but the front page features a good training calendar.

Another initiative of note is the Digital Library Federation (DLF). DLF seeks to:

  • define, clarify, and develop prototypes for digital library systems and system components;
  • scan the larger technical environment for and encourage the development of potentially important trends and practices;
  • encourage technology transfer and information sharing between and among DLF members, and between DLF and appropriate commercial sectors; and
  • communicate technical directions and accomplishments of the DLF to a wider audience.

CC BY-NC-SA 4.0 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


5 Comments

  1. This page looks good. For our readers, I’m the one who praised Teleport Pro’s ease of use. I’m happy to answer questions about it.

    I think you should add Capturing Electronic Publications (CEP), which was originally developed for the Illinois State Library under an IMLS grant, and is now used by seven states, including Alaska. It’s a free product, though some linux savvy is recommended. It combines the wget program with other open source components to provide a good web site gathering and archiving tool. I especially appreciate its ability to generate e-mail reports of new files added to a site since the last spidering.

    ————————————
    “And besides all that, what we need is a decentralized, distributed system of depositing electronic files to local libraries willing to host them.” — Daniel Cornwall, tipping his hat to Cato the Elder for the original quote.

  2. I think that tools/Web apps are getting to the point that you’re thinking about. The thing is to collate lots of web2.0 tools into a site/blog in order to collect/search/display/share content of interest (much like a library!).

    Here at FGI, we’re using Drupal, an open-source content management system, that allows us to blog, upload documents/images, create static webpages, make all content full-text searchable (including the docs/pdfs etc that we upload into the database) and has LOTS of other modules that could be useful for integrating systems/apps/data. We’re also integrating Del.icio.us into the site (see the left column) which allows us to collect interesting Web sites and resources and, with a little piece of javascript, dynamically display the tags on the site. We could do the same thing with our flickr account if we had lots of images of interest, or with the many google mapping remixes that are out there.

    Here are a couple of others off the top of my head that may help:

    • Zotero (must use Firefox 2.0): a free, easy-to-use Firefox extension to help you collect, manage, and cite your research sources. You can even store PDFs, files, images, links, and whole web pages!
    • Del.icio.us: Social tagging service where you can collect/tag Web sites; share your tags with others and see what others have tagged etc. For a library, it’s like a de facto Web portal!
    • Pasta-licious (Firefox extension): text pasting service that automatically generates a page that can be tagged to your del.icio.us account. VERY handy!
    • Sourceforge: Search through sourceforge because there are LOTS of apps to install on a linux box.
  3. I have lots of interests. I gather a lot of information from the web and print media, and am on many mailing lists, plus my friends send links, scanned pages etc.

    I want to store the information on my PC in an orderly fashion i.e.

    (a) Categorisation by subject, date, origin, location ….
    (b) Human created key words for each item.
    (c) Searching
    (d) Running a tool to find patterns.
    (e) letting people access specific areas / pages only.

    Are there any Open Source tools available to do so? Possibly web based so that I can ask friends to do part of the work themselves.

    I’ll be grateful.
    Oh, and I use Linux for everything 🙂

    Regards

  4. Established in 2001 and renewed in 2004; formerly named Technical Issues for Digital Data and combined with Intelligent and Knowledge Based Systems. To address the technical issues of library digital data which arise as libraries continue to both create and collect information in digital form in addition to other materials already in their collections. These issues include: standards; formats; archiving; infrastructure; and technology refresh. The objective of the IG will be to make information available, and to provide a forum for discussion for library professionals who are grappling with these issues. The IG will accomplish this by sponsoring conference programs, institutes, and facilitated discussions and encouraging publications on these issues.
