Tag Archives: born digital
Since 2007, on behalf of EMC Corporation, IDC has been sizing what it calls the Digital Universe, or the amount of digital information created and replicated in a year. The newest report is now available:
- The Digital Universe Decade – Are You Ready? By John Gantz and David Reinsel, IDC, sponsored by EMC Corporation (May 2010) [PDF, 16 pp., excerpted from the IDC multimedia presentation, “The Digital Universe Decade – Are You Ready?” (May 2010)]
- The Digital Universe Decade – Are You Ready? (The multimedia content)
These reports estimate the size of everything digital. IDC looks at the installed base of devices or applications that could capture or create digital information and estimates (based on their research and “other sources”) how much information was created in a year. They also estimate the number of times a unit of information is replicated. They include devices such as mobile phones, bar code readers, and video games as well as cameras, scanners, email, office applications, databases, GPS, medical imaging, and lots more. A lot of this is estimation, and I found it hard to tell how much was gathered evidence and how much was speculation (see their methodology in the first IDC Digital Universe paper, published in 2007). To me this means that the figures they come up with may not be very accurate. Predictions of the future based on these estimates are, I think, very speculative.
Nevertheless, I’ve been following these ever since I noticed that Fran Berman quoted an earlier report and referred to 2007 as the “cross-over” year: the year in which more digital data was created than there was data storage to host it. (Berman, Francine, Got data?: a guide to data preservation in the information age, Commun. ACM, 51 (2008), 50-56.)
Even if you don’t believe that the IDC numbers are 100% accurate, the general ideas that they promote are probably not that far off the mark. Some of those ideas:
- Last year, despite the global recession, the Digital Universe set a record growing by 62% to nearly 800,000 petabytes.
- The average file size is getting smaller. The number of things to be managed is growing twice as fast as the total number of gigabytes.
- The growth of the Digital Universe is like a perpetual tsunami. How will we find the information we need when we need it?
- How will we know what information we need to keep, and how will we keep it?
That last item is my favorite. Regardless of exactly how much digital information is created each year, regardless of how much storage space we have, regardless of the fact that a lot of the “digital universe” that IDC describes is throw-away information that no one would think is worth keeping, we are still faced with Lots of Stuff and we need to figure out What to Preserve. That, I believe, is the next big challenge for digital preservation.
One way to face that challenge is to rely on producers to decide what to save. If a government agency produces something digital, allow that agency (or GPO, or LoC, or NARA, or OMB or OPM, or your favorite TLA) to decide for you if that information is worth saving.
Another way to face the challenge is to rely on a few big organizations. That is: pool our resources and outsource preservation to a few big organizations that will do this for us. Some of the same players pop up here: LoC and NARA, for example, but there are also organizations like Portico, and the Internet Archive, and ICPSR.
Both of the above solutions hope that someone else will take into account the needs of all possible users and make the right decisions. That model can work for some classes of information with appropriate governance and decision-making structures in place.
But, I believe, the lesson from the IDC report is that the “digital universe” is so large that we should not assume that any single solution will be enough. There is just too much information and there are too many decisions to make about what is worth saving. While information producers and a few big preservation organizations can do a lot, they cannot do everything. And, their size alone will constrain their decisions. It will be harder for big organizations to respond to the needs of smaller communities of interest.
What is the alternative? I think that we need (what shall we call them…?) Libraries. Public Libraries, Special Libraries, College and University Libraries, and School Libraries. These can work together or independently. They can address the needs of their particular communities of interest. This will accomplish three things:
- It will aid preservation by making the preservation community bigger. This will not only increase redundancy, but will also help ensure that there is less chance that a single system or financial or governance failure will mean a loss of all information.
- It will help deal with the scale of the preservation problem (as identified by the IDC report). With more players and more stakeholders, there will be more voices and more variety in the decision-making process when we collectively decide what to save. This will mean, for example, that a group of School Libraries working together on digital preservation could ensure that an item of essential use to K-12 will be saved even if no university saves it. And vice versa.
- It will help users find and use the information they need. Today, it seems that everyone understands what librarians have always known: that there is a lot of information in the world. It used to take a library degree to get an appreciation of all the sources of information in the world. Today, everyone that uses the Web has that same appreciation. It seems like every day there is another newspaper article or blog posting about how great it is to have access to “everything.” But the “everything” people see on the Web is really only a subset of everything and it only appears to be “everything” because there is so much in this very large subset of everything. And, when your only option is to search “everything” you quickly discover that that is not always the best way to find just what you want. (Even Google has segmented information into categories like movies, blogs, books, and scholarly information.) Having community-of-interest collections will enable libraries to build user-interfaces that work best for those communities and that provide access to the information those communities most want.
Libraries won’t replace “everything” collections. They will complement each other and unfocused “everything” collections. They will enrich us all and help ensure that we will preserve what needs to be preserved as the “digital universe” expands more rapidly than we could otherwise deal with.
I’m really impressed with the work that OSTI is doing to build digital collections of scientific and technical information as well as to push the boundaries of access by building databases, federated search tools, being an OAI node, distributing bibliographic records and generally finding unique and innovative ways to make scientific and technical information available on the Web (I just love the idea of an adopt-a-doc program!!).
In particular, a blog post entitled Beyond Collecting: Connecting from a few weeks back (yes my feedreader is bursting at the seams 🙂 ) caught my eye. They’ve basically gone out and built a digital infrastructure along the lines of what we at FGI have been advocating for lo these many years. That is, they’ve realized that they can’t possibly collect it all. Instead of building one big central repository, they’re relying on many agencies and actors to host content and standards-based metadata of interest to them. OSTI can then use increasingly robust digital tools to aggregate and provide search mechanisms for vast amounts of information — to “connect users with the highest quality science information without collecting or hosting it.”
THAT’S what I envision for the Federal Depository Library Program: a collaborative network of libraries (a technical and social P2P network!) hosting content of interest to their local communities, creating and maintaining standardized metadata, and connecting up with each other to create powerful search tools across the network. This is the many-hands-make-light-work digital model that we in the documents community should be espousing.
–that is all.
OSTI has embraced a new paradigm for sharing scientific and technical information (STI). Historically, OSTI has fulfilled its mission of providing STI to scientists, researchers, and the public by hosting, or collecting, documents and/or metadata. OSTI’s new paradigm is to make content searchable that is often hosted by others; today, OSTI connects those seeking the content with the organizations that host it.
Beginning in the late 1940’s, with OSTI’s production of the Nuclear Science Abstracts, which was to go on for nearly 30 years, OSTI entered into the business of collecting information. Beginning in the 1990’s, OSTI began creating web applications to make the collected content openly accessible and conveniently searchable. ETDE Web, DOE Information Bridge, the Energy Citations Database, and DOE R&D Accomplishments are some of the successful applications.
In the last several years, OSTI’s approach to disseminating STI has evolved. Recent applications such as the Eprint Network, Science.gov, DOE Science Accelerator, and WorldWideScience.org connect users with the highest quality science information without collecting or hosting it.
How does OSTI move beyond collecting to connecting and what does connecting mean? OSTI’s new applications search content that is housed in document repositories owned by a number of government agencies and government-sanctioned organizations. OSTI applications search a number of these repositories on the fly and they aggregate the content from the sources they search and present the most relevant of the search results to the user. This simultaneous and real-time search of multiple repositories is called federated search. OSTI’s federated search applications serve as portals to specific subjects. In being subject-specific, they connect users to the highest quality STI in their fields of interest.
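To make the idea of federated search concrete, here is a minimal sketch in Python. It is not OSTI's implementation: the repository names, the sample titles, and the term-overlap relevance score are all hypothetical stand-ins, and real repositories would be queried over the network rather than from an in-memory dictionary. It only illustrates the pattern described above: query several sources at once, merge what comes back, and show the most relevant hits.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    source: str  # name of the repository that hosts the document
    title: str


# Hypothetical stand-ins for remotely hosted repositories.
REPOSITORIES = {
    "agency_a": ["Solar cell efficiency report", "Wind turbine noise study"],
    "agency_b": ["Solar storage in salt caverns", "Reactor coolant chemistry"],
    "agency_c": ["Solar cell degradation and efficiency data"],
}


def search_repository(name, query):
    """Return records from one repository whose titles share a term with the query."""
    terms = set(query.lower().split())
    return [
        Record(name, title)
        for title in REPOSITORIES[name]
        if terms & set(title.lower().split())
    ]


def relevance(record, query):
    """Toy relevance score (an assumption): count of query terms found in the title."""
    terms = set(query.lower().split())
    return len(terms & set(record.title.lower().split()))


def federated_search(query, limit=3):
    """Query every repository concurrently, merge the hits, and return the best ones."""
    with ThreadPoolExecutor() as pool:
        per_source = list(
            pool.map(lambda name: search_repository(name, query), REPOSITORIES)
        )
    merged = [record for hits in per_source for record in hits]
    # Highest score first; ties are broken by source and title for a stable order.
    merged.sort(key=lambda r: (-relevance(r, query), r.source, r.title))
    return merged[:limit]
```

Calling `federated_search("solar cell efficiency")` searches all three sample repositories and returns one merged, ranked list in which each record still names the repository that hosts it. Nothing is copied into a central collection, which is the point of connecting rather than collecting.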
Why is OSTI embracing the connection model? Quite simply, OSTI can far better achieve its mission by making great quantities of content openly accessible and conveniently searchable, but it is impossible to collect and keep current such quantities of content from multiple content sources. “Connecting” to content is doable, while “collecting” is not. (My emphasis added!)
We believe that by connecting users to content, we provide a more comprehensive and authoritative search. In doing so, we accelerate the advancement of science.
I had to explain to a student patron and their Professor today what is meant by “born digital” and how digital government documents are wonderful resources for a paper if we do not have the print version or when the print version doesn’t exist (or is horribly out of date). Have any of you had to explain this a lot?
It all started when the student patron told me she could only have three web sources for her Nursing research paper after I had shown her the wonderful world of digital documents online. She had found an eleven-year-old version of a government print source in our catalog but I cringed…born digital documents online via NIH or the U.S. Dept. of Health had more up-to-date medical information on her topic! I told her to use both the print and online sources. She would be able to see if there were any noticeable differences between the 1997 print version and the 2007/2008 online information on her topic.
I contacted the Professor and explained this too. All is well and she will allow for the use of online government information. She was just hoping to avoid the use of too many general (i.e. crappy) websites. I understand that but I wanted to make sure that the student would not be punished for using several good government online documents and websites for her paper.
I didn’t get into the nitty gritty digital authentication of government documents, but with some Professors who require legislative research, I tell them about the digitally authenticated documents that currently exist from GPO.
I have a feeling we government document librarians are going to have to explain this concept of “born digital” gov docs and digital authentication more often…especially now that more and more gov docs are being born digitally.