Digital library technologies
Here at Free Government Information, we’re extremely interested in the collection, dissemination and preservation of digital government information (for more background we point you to our manifesto of sorts) and feel that libraries have a vital role to play in this area. Cornell University has been at the forefront of digital preservation and has created a Digital Preservation Tutorial that gives a broad overview of the issues and challenges involved in digital preservation in general.
With that in mind, we are creating a list of technologies that will be of interest and importance as we move toward a digital FDLP. We’re focusing on digital technologies that get at the basic needs for digital preservation: collection or capturing of digital information, description (metadata creation), dissemination (as opposed to simple availability!), and long-term preservation.
This list is a work in progress. If you know of other technologies, software, hardware, clients, etc. that you have used and would recommend to the community, please contact admin at freegovinfo dot info or leave a comment on this page so we can add to the growing list of useful resources.
- Archive-It: Web archiving service from the Internet Archive. The service allows institutions to build, manage and search their own web archive through a user-friendly web application, without requiring any technical expertise or hosting facilities. Check out a list of their collections. (added 2/7/07)
- Capturing Electronic Publications (CEP): A web site archiving system developed with Open Source software for Unix/Linux. CEP makes it possible for organizations to periodically download and retain archival copies of their evolving web site(s). CEP uses a web spider, wget, to traverse and download a target website’s pages, and CVS to archive the pages and their subsequent changes. CEP uses a variety of software packages to create and maintain historical data and to provide summary statistics about the website’s content. The packages used to create CEP include: Fedora, Apache, CVS, Perl, GD graphic tools, TreeTagger and Wget.
- CONTENTdm OCLC Digital Collection Management Software. “CONTENTdm® makes everything in your digital collections available to everyone, everywhere. No matter the format — local history archives, newspapers, books, maps, slide libraries or audio/video — CONTENTdm can handle the storage, management and delivery of your collections to users across the Web.”
- cURL. A command line tool for getting or sending files using URL syntax. Curl is targeted at single-shot file transfers. Curl is not a web site mirroring program. Curl is not a wget clone.
- Del.icio.us: del.icio.us is a social bookmarking site that allows users to bookmark and share Web sites. It also allows for collaborative collection projects like FGI’s IAdeposit project where digital govt documents that are tagged “IAdeposit” in delicious are then uploaded to and preserved in the Internet Archive’s US govt documents collection. So even if a library can’t afford to build its own digital architecture, it can still participate in a digital collection project.
- DSpace. Open source software that enables open sharing of content that spans organizations, continents and time. “DSpace is the software of choice for academic, non-profit, and commercial organizations building open digital repositories. It is free and easy to install “out of the box” and completely customizable to fit the needs of any organization. DSpace preserves and enables easy and open access to all types of digital content including text, images, moving images, mpegs and data sets.” See also: duraspace.org.
- EPrints. “EPrints is the most flexible platform for building high quality, high value repositories, recognised as the easiest and fastest way to set up repositories of research literature, scientific data, student theses, project reports, multimedia artefacts, teaching materials, scholarly collections, digitised records, exhibitions and performances.”
- Fedora Commons Repository software: “The Fedora Repository software has been installed by institutions, worldwide, to support a variety of digital content needs. The Fedora Repository is extremely flexible and can be used to support any type of digital content. There are numerous examples of Fedora being used for digital collections, e-research, digital libraries, archives, digital preservation, institutional repositories, open access publishing, document management, digital asset management, and more.” See also: duraspace.org.
- Greenstone: Greenstone is a suite of software for building and distributing digital library collections. It provides a new way of organizing information and publishing it on the Internet or on CD-ROM in the form of a fully-searchable, metadata-driven digital library… The aim of the Greenstone software is to empower users, particularly in universities, libraries, and other public service institutions, to build their own digital libraries.
- HTTrack. HTTrack allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site’s relative link-structure. Simply open a page of the “mirrored” website in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site, and resume interrupted downloads.
- Institutional repository (IR): Many libraries (especially academic libraries) are building digital institutional repositories to collect and manage the content created by their members/professors/researchers. These software infrastructures offer soup-to-nuts tools for collecting, describing, accessing and preserving digital content. Examples include DSpace, Fedora, and EPrints.
- Lots of Copies Keep Stuff Safe (LOCKSS): Open source software to collect, store, preserve, and provide access to local copies of electronic documents. Using a peer-to-peer (P2P) architecture, libraries can compare, share and repair digital content. A low-cost digital library tool! Contact James Jacobs (jrjacobs AT stanford DOT edu) if you’re interested in joining the USdocs private LOCKSS network.
- Lucene. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. Lucene is the guts of a search engine – the hard stuff. You write the easy stuff, the UI and the process of selecting and parsing your data files to pump them into the search engine, yourself.
- Open Archives Initiative: The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. OAI has its roots in the open access and institutional repository movements. Projects include the Protocol for Metadata Harvesting (OAI-PMH) and the Object Reuse and Exchange (OAI-ORE) standard.
- Solr: open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features.
- SWISH-E: Simple Web Indexing System for Humans - Enhanced.
- Teleport Pro: Web-based spidering tool. This one’s NOT open source and NOT free ($40), but has been recommended by an FGI volunteer as easy to use.
- Wget: An open source UNIX command line spidering tool used to retrieve files automatically from web servers.
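To give a feel for the OAI-PMH harvesting mentioned in the list above, here is a minimal Python sketch, not an official client. It builds a ListRecords request URL and pulls the record identifiers and resumption token out of a response. The base URL https://example.org/oai, the sample response and the function names are our own assumptions for illustration; the ListRecords verb, the metadataPrefix and resumptionToken arguments, and the XML namespace come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper name).

    When a resumption token is supplied it is sent on its own, because
    the protocol treats resumptionToken as an exclusive argument.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty or missing resumptionToken element means the list is complete.
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Made-up response, trimmed to the elements this sketch reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_list_records(SAMPLE_RESPONSE)
print(build_list_records_url("https://example.org/oai"))
print(ids, token)
print(build_list_records_url("https://example.org/oai", resumption_token=token))
```

A real harvester would fetch each URL over HTTP and repeat the request with the returned token until the repository stops sending one.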
While not a specific technology package, people working in government-funded institutions should be aware of the Digital Preservation Network. According to their web site, the Network is “dedicated to forging a community of practitioners who are focused on the issues of preserving the digital records and publications of government. This online forum will be a repository for the exchange and discussion of ideas, research, strategy and documents that can be used by other practitioners in their organization. Membership into this community will be open to any practitioner employed in a government funded institution that is currently researching or participating in appraisal, acquisition, preservation or access of government records or publications.” Most features require registration, but the front page features a good training calendar.
Another initiative of note is the Digital Library Federation (DLF). DLF seeks to:
- define, clarify, and develop prototypes for digital library systems and system components;
- scan the larger technical environment for and encourage the development of potentially important trends and practices;
- encourage technology transfer and information sharing between and among DLF members, and between DLF and appropriate commercial sectors; and
- communicate technical directions and accomplishments of the DLF to a wider audience.