There was a provocative post on govdoc-l this week asking whether the cataloging of government information is “obsolete.” We thought the question needed to be unpacked and contextualized, and we wanted to share our response more widely.
Here is the post that posed the question:
First of all, I’m asking this question to be provocative. It is my sincere hope that individuals who are passionate about this topic step forward and provide leadership for GODORT to redefine the scope and relevancy for this topic as members of the Cataloging Committee. Send me an e-mail if you are interested.
For myself, there were two instructors in library school who influenced my own thoughts on this subject.
Esther Bierbaum my cataloging instructor in library school in 1995 would be considered by most as old school. We were instructed in AACR2 the new emerging standard and meticulously picked out LC and Dewey subject headings and assigned corresponding call numbers. She was the first one to introduce to me the idea that a library might want to describe and catalog something that one didn’t read.
Padmini Srinivasan was the first of a new school of library instructors that emerged that had extensive backgrounds in computer science and interest in the development of algorithms for information retrieval. In sum, her classes helped me make sense of emerging information technology and, more importantly, how the Internet worked.
My early days as a government information specialist I was confronted with the task of creating that infamous profile of retrospective MARC records being offered at cut rate prices from MARCIVE. I was grateful for Dr. Bierbaum’s attention to detail, but also cognizant of my own limits and patience with such an enormous task. What I really wanted was some kind of quick fix algorithm that would allow me to do all this work without touching every piece.
Fast forwarding to that period of time when Google beat out upstarts like Altavista and for all practical purposes that subject controlled interface Yahoo [meta-data approach]. Google as we all know got this brilliant idea to digitize everything known to man/woman. To make sure they got everything they erroneously thought they could use OCLC cataloging records as a comprehensive registry. Furthermore, information providers tried to create one stop shops that competed with Google and my two worlds of Bierbaum and Srinivasan collided.
So where do we go from here? Do we continue to call it “cataloging” or is there some other hybrid term that combines these two worlds of “meta data” vs. algorithms? Do we continue to call it Technical Services or is there a set of services that better describe this integration? What does all this mean for organizing, preserving, discovering, and using government information?
Let your voice be heard.
And here is our response, seeking to clarify and give context for the provocative question.
Stephen asks if describing government information is obsolete. We would suggest that asking a slightly different question might yield a more accurate and useful answer. Why not ask instead what our designated communities need and how we can best provide it?
We would also suggest that lumping “cataloging” and “metadata” and “tagging” and “algorithms” and “searching” together confuses some very different things that need to be understood separately as well as collectively.
With that said, here are our thoughts brought about by Stephen’s provocation:
Our communities need search results that are accurate, with high precision and recall and few false positives. They need easy browsing by agency, subject, date, author, series, Congress, and so forth. They need to be able to easily search for known items (and cited items) and get accurate results. They want to be able to accurately identify any given digital object as to its authenticity, agency, author, date, version, title, series, and so forth.
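For readers less familiar with the terms, precision and recall can be made concrete with a few lines of Python. This is only a minimal sketch: the function name and the document identifiers are made up for illustration and do not come from any real catalog or search system.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    retrieved: identifiers the search returned
    relevant:  identifiers a user would actually want
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant items that were actually found
    # Precision: what share of the results were relevant?
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what share of the relevant items were found?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: four results, two of them relevant,
# and one relevant document ("doc5") that the search missed.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc1", "doc3", "doc5"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this made-up case, "doc2" and "doc4" are the false positives, and the missed "doc5" is what lowers recall.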
Stephen implies that the existence of Google and full-text searching might make cataloging and the creation of metadata for govinfo obsolete or unnecessary. Put another way, we might ask: Does the availability of full text search make metadata unnecessary?
No, it does not. We actually have very strong evidence that metadata-based searching is better and more accurate than full-text-based searching. That evidence comes from Google itself! Google “won” the battle against AltaVista et al. not because it provided full-text searching and its competitors didn’t, but because Google figured out how to use human-created metadata (links) to increase relevance in its search results.
The obvious conclusion is that searching based on human-created metadata is better and more accurate than searching based only on full-text.
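To illustrate how links can act as human-created metadata, here is a minimal sketch of a simplified PageRank-style calculation. It is a textbook simplification, not Google's actual ranking system, and the page names, damping value, and iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of scores.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores
    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score, split evenly, to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical mini-web: three pages link to the same agency report.
links = {
    "agency_report": [],
    "blog_a": ["agency_report"],
    "blog_b": ["agency_report"],
    "news_site": ["agency_report", "blog_a"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

Nothing in this calculation looks at the text of the pages. The score comes entirely from decisions people made about what to link to, which is the sense in which links are human-created metadata.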
Stephen also implies that there is just too much information for us to handle and we need automated techniques to cope with that volume. We would agree that we should be investigating new techniques and strategies (including automated techniques) to make our work more efficient and effective. But, rather than use the volume of information as an excuse for cutting back on our work, we would suggest that the volume of bad, confusing, unofficial, and badly maintained information on the web is a reason for us to spend more time collecting, preserving, authenticating, describing, labeling and “cataloging” govinfo. The famous digital curation lifecycle model depends on libraries doing all of these things well. This is a space where libraries can define themselves as trusted, privacy-protecting sources of information and differentiate themselves from commercial entities that commodify personal information.
Stephen suggests that maybe we could automate the creation of metadata in some way. There is some good research on this idea (sometimes called topic-modeling) of using computers to apply classifications to documents based on analysis of the digital text. (Google already uses something like this [“Normalized Google Distance”] to determine “similarity” of news articles so that they can cluster news articles about the same topic together.) This is a very interesting area of research that may, someday, result in tools for automated subject classification of government documents. But, before suggesting that human subject cataloging and classification is “obsolete,” government information professionals might want to read one study that actually looked at computer-generated topic-modeling classifications and human-generated classifications of govdocs (Classification of the End-of-Term Archive: Extending Collection Development Practices to Web Archives. Final Report). This report explicitly noted the difficulty of accurately attributing authorship (SuDoc numbers) to web-harvested government documents and examined the reasons for this difficulty. Research in topic modeling has been more promising in less complex bodies of literature — such as collections of newspaper articles.
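For the curious, the Normalized Google Distance mentioned above has a short published formula (Cilibrasi and Vitányi) based on how often two terms appear separately and together. The sketch below only computes that formula; it says nothing about how Google applies it in practice, and the counts in the example are invented for illustration.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Normalized Google Distance between two terms.

    fx, fy: number of documents containing each term
    fxy:    number of documents containing both terms
    n:      total number of documents indexed
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf  # terms that never co-occur are treated as infinitely far apart
    if n <= min(fx, fy):
        raise ValueError("n must be larger than the individual term counts")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Hypothetical counts in an index of one million documents.
always_together = normalized_google_distance(1000, 1000, 1000, 1_000_000)
rarely_together = normalized_google_distance(1000, 1000, 10, 1_000_000)
print(always_together, rarely_together)
```

Values near zero mean the two terms almost always appear together, and larger values mean they rarely do. Note that the measure relies purely on co-occurrence counts, so it can suggest that two documents are about similar things but cannot by itself say which agency issued either of them.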
We would suggest that it is important to differentiate between descriptive cataloging and subject cataloging. Accurately associating authors, agencies, dates, series, editions, and so forth is an extremely important part of providing accurate search, discovery, and identification of government information. This is very difficult to automate using full text. You can see this for yourself by trying to find all issues of a government serial or series in Google Books or HathiTrust. (These sites do have the particular problem of using inaccurate digital text generated during digitization of print documents. See also: An alarmingly casual indifference to accuracy and authenticity. What we know about digital surrogates).
So, of course metadata still matters! And, at least at this stage, human-created metadata for govinfo is more accurate and of better quality than any automated metadata (descriptive or subject). Internet users inherently understand this, too, as can be seen by the fact that they tag their photos, videos, blog posts, tweets, Facebook posts, bookmarks, Instagrams and Vines.
If anything, we should be aware that the Web provides us the best evidence for the need for professional govinfo action. Specifically, existing algorithms for classifying and describing govinfo on the web are rather feeble, and existing search facilities for authentic, official government information are not nearly as good as they are for other kinds of information on the web.
Libraries in general are playing a lot of catch-up with our government documents metadata because too many libraries neglected creating adequate metadata for too long and failed to add govdocs to their cataloging workflows. But that neglect should not be used as an excuse to propose further neglect. It should, rather, be evidence of the need for more work. It does not really matter what we call ourselves (catalogers, metadata managers, etc.). It is time to push our library administrations for the resources necessary to describe documents collections in order to make our deep and rich collections more findable and usable by our communities.
We are not alone in facing these issues. A couple of very recent articles ask some similar questions about the broader information landscape. You might find these of interest:
- Will Deep Links Ever Truly Be Deep? Scott Rosenberg, Medium (Apr 7, 2015)
- If Algorithms Know All, How Much Should Humans Help? Steve Lohr, New York Times (Apr 6, 2015)
- OCLC Works Toward Linked Data Environment. ALA Midwinter 2015. Matt Enis, Library Journal (Feb 17, 2015)
Jim A. Jacobs and James R. Jacobs
P.S. Our response to Stephen should not be seen as our wanting to volunteer for the GODORT Cataloging Committee 🙂
(editor’s note: Our response here is slightly different from the original one sent to govdoc-l. That was my mistake in sending an earlier draft to the list. jrj)