Tag Archives: metadata
Reject GPO’s proposal to drop metadata from CGP
March 3, 2019
The Government Publishing Office has a brief proposal to omit some metadata in the Catalog of Government Publications (CGP). As written in the announcement, the proposed change in policy seems simple and obvious. GPO says that, for publications in govinfo, “historic URLs” (the original source URL) and PURLs (Persistent Uniform Resource Locators) are “identical.” GPO reasons that, since the two fields are “redundant,” the historic URLs are unnecessary:
“LSCM proposes to cease the inclusion of historic URLs only in catalog records for resources in govinfo.”
The announcement asks for feedback on how the policy change would affect “cataloging/metadata and other operations and processes.”
The proposal should be rejected for three reasons.
1. The premise is wrong
GPO’s premise is flat-out wrong. The information recorded in PURLs and “historic URLs” is neither identical nor redundant. What a PURL points to changes over time, while the “historic URL” stays the same. The two fields record different information even when their values happen to match: PURLs record the current location of a resource and “historic URLs” record the original location of the resource. PURLs exist because URLs change.
(If GPO could guarantee that the public URL of govinfo items will never change, then it could just as easily advocate not including PURLs in CGP records. But that would, obviously, be crazy. PURLs are needed to account for the problem of “link rot.” Link rot is a well-documented problem, including for government information. GPO’s own digital repository has already had three base addresses over the years, requiring PURL redirects: access.gpo.gov, gpo.gov/fdsys, and govinfo.gov.)
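To make the distinction concrete, here is a minimal sketch (Python, with hypothetical placeholder URLs rather than real CGP values, and not GPO’s actual workflow) of why the two fields diverge: following a PURL through its HTTP redirects tells you where the resource lives now, while the recorded historic URL preserves where it lived when it was cataloged and may no longer resolve at all.

```python
# Minimal link-rot illustration. Both URLs are hypothetical placeholders,
# not real CGP or govinfo values.
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

HISTORIC_URL = "http://www.gpo.gov/fdsys/pkg/EXAMPLE-DOC/content-detail.html"  # original location (hypothetical)
PURL = "https://purl.fdlp.gov/GPO/gpoEXAMPLE"                                  # persistent identifier (hypothetical)

def resolve(url):
    """Follow redirects and return the final URL, or None if the link has rotted."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.geturl()
    except (HTTPError, URLError):
        return None

current_location = resolve(PURL)          # where the PURL points *today*
original_location = resolve(HISTORIC_URL)

if current_location is None:
    print("The PURL itself did not resolve.")
elif original_location is None:
    print(f"The historic URL has rotted; the PURL now points to {current_location}")
elif current_location != HISTORIC_URL:
    print("The PURL has been redirected away from the original location:")
    print(f"  cataloged (historic) URL: {HISTORIC_URL}")
    print(f"  current location:         {current_location}")
else:
    print("The historic URL and the PURL target currently coincide.")
```

The point of the sketch is that even on the day the two strings are identical, they answer different questions, and only the historic URL keeps answering “where did this come from?” after the location changes.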
Because PURLs are necessary, the issue GPO should be addressing is whether or not the original URL is a valuable piece of metadata — even after it “rots” (changes). And the question GPO should be asking is not how omitting that information will affect “cataloging/metadata and other operations and processes,” but how it will affect users and long-term access to digital resources.
2. The historic URL is valuable to users
The original URL of govinfo resources is a valuable piece of metadata for users. They can use it to locate a copy of a document in other archives (e.g., the Internet Archive, the End of Term Archive). They can use it to compare copies when more than one institution has archived the same content (perhaps at different times, reflecting changes over time).
The historic URL is essential for retrieving copies in other archives. That would be vital if technical, administrative, legal, or budgetary changes interfered with GPO keeping its PURLs up to date, keeping its PURL server online, or keeping its online copy available. It would also be vital if government information were intentionally withdrawn or privatized. There is no good reason to deny users such essential information.
Some resources in govinfo (such as the Federal Register) are available in other versions from other government sources. The CGP records should record the historic URL so that users can understand which versions are described and to which version the PURL points.
The govinfo website provides users with the URLs of resources, but does not provide PURLs. Most users of resources in the TDR will have those public govinfo URLs and not PURLs. Users who cite those resources and provide links to them will likely use those URLs, not PURLs. Having those URLs in CGP would help users find the PURL when those URLs change (as they inevitably will).
3. The proposal ignores the evolution of govinfo
The proposed policy change is notable in what it leaves out: the future. At its inception, the policy would, apparently, affect only content in GPO’s Trusted Digital Repository (TDR).[1] The proposal should explicitly define the scope of the policy.
Currently, as far as we can tell[2], the TDR contains two kinds of resources with CGP records: born-digital items that are first published by GPO in either fdsys or govinfo (example CGP record: America’s water infrastructure needs and challenges); and digitized paper items that GPO has ingested into fdsys or govinfo (example CGP record: Internal revenue cumulative bulletin).
But CGP includes historic URLs for other kinds of resources. For example:
- Items harvested by GPO into permanent.access.gpo.gov and into wayback.archive-it.org. (Example CGP record: HealthCare.gov.)
- Digitized paper items not ingested into gpo.gov/fdsys or govinfo. (Example CGP record: Conservation and full utilization of water.)
This prompts several questions that the policy proposal does not address: Will GPO’s TDR ever contain other resources? Will it evolve and expand over time? Will the policy affect other kinds of content if they become “resources in govinfo”?
This is important because the historic URLs in CGP records for resources that apparently are not affected by the new policy today could be affected if those resources were ingested into the TDR (or if the policy is broader than we think it is, or if GPO interprets the policy differently in the future, etc.).
The “historic URLs” for that other content are, in some cases, even more important than they are for the content currently in the TDR.
The source URL of a harvested item establishes the provenance (origin) of the item archived. The standard for Trusted Digital Repositories requires TDRs to preserve information on the provenance of resources they archive (TDR 4.2.6.3). If GPO ever ingests harvested content into its TDR, it will need the historic URL. Having it in CGP records could provide a useful link between CGP and the GPO’s repository.
Digitized paper resources present a special problem for users. More than one digitization project may digitize the same paper document or different editions or versions of the same title. Differences can occur because of the use of different source copies and because of different standards and levels of accuracy of different digitization processes. This is further complicated in CGP by the fact that some CGP records point to non-government websites (example: Joan of Arc saved France), some point to the TDR (example: An account of the receipts and expenditures of the United States), and some point to harvested content (example: Pots and pans for your kitchen). There are probably other permutations and combinations that we have not discovered. The “historic” URL (i.e., its original location) would help the user identify the source of the digitization and would enable the user to obtain a copy from that source or from an archive of that source.
Conclusion
The “historic URLs” in CGP provide information to users that PURLs do not. That information is useful to users because it will help them identify, understand, and locate copies of resources. “Historic URLs” may seem unnecessary to GPO today, but they will increase in value to users over time. Making a decision for “resources in govinfo” today fails to take into account what resources may be in GPO’s TDR in the future (including harvested content and digitizations). The proposal to drop historic URLs is short-sighted. Dropping historic URLs today would be a mistake that users would resent in the future.
GPO should clarify the scope of the policy and how it would be applied in the future and evaluate its effects on users and long-term access.
Authors:
James A. Jacobs, University of California San Diego
James R. Jacobs, Stanford University
Endnotes
- The announcement mentions getting TDR certification and refers to “resources in govinfo.” GPO does not make it clear, however, whether “govinfo” refers to its digital repository or to the website www.govinfo.gov. Without clarification, the scope of the policy proposal is ambiguous. The website runs on the Drupal content management system. It appears that everything in the TDR is available through the website, but not everything listed on the website is available in the TDR (e.g., see the browse page that points to GPO partner resources). For the most part, though, the website appears to be the public interface to the TDR. As far as we know, GPO has not said whether that will always be the case. Presumably, the website could be used to point at content in more than one repository (permanent.gpo.gov, for example), or GPO might want to ingest into the TDR the content it has harvested at wayback.archive-it, or GPO might replace Drupal with a different CMS. Any of these would result in a different scope of “resources in govinfo.”
- Since GPO has not yet shared its TDR certification report publicly, our evaluation is based on what we can infer from using the govinfo website.
Is cataloging government information obsolete? FGI says NO!
April 9, 2015
There was a provocative post on govdoc-l this week asking whether the cataloging of government information is “obsolete.” We thought the question needed to be unpacked and contextualized, and we wanted to share our response more widely.
>>>>>>>>>>>>>>>>>>>>>>>
Here is the post that posed the question:
>>>>>>>>>>>>>>>>>>>>>>>
Folks,
First of all, I’m asking this question to be provocative. It is my sincere hope that individuals who are passionate about this topic step forward and provide leadership for GODORT to redefine the scope and relevancy for this topic as members of the Cataloging Committee. Send me an e-mail if you are interested.
For myself, there were two instructors in library school who influenced my own thoughts on this subject.
Esther Bierbaum, my cataloging instructor in library school in 1995, would be considered by most as old school. We were instructed in AACR2, the new emerging standard, and meticulously picked out LC and Dewey subject headings and assigned corresponding call numbers. She was the first one to introduce to me the idea that a library might want to describe and catalog something that one didn’t read.
Padmini Srinivasan was the first of a new school of library instructors, one with an extensive background in computer science and an interest in the development of algorithms for information retrieval. In sum, her classes helped me make sense of emerging information technology and, more importantly, how the Internet worked.
In my early days as a government information specialist, I was confronted with the task of creating that infamous profile of retrospective MARC records being offered at cut-rate prices from MARCIVE. I was grateful for Dr. Bierbaum’s attention to detail, but also cognizant of my own limits and patience with such an enormous task. What I really wanted was some kind of quick-fix algorithm that would allow me to do all this work without touching every piece.
Fast forward to the period when Google beat out upstarts like AltaVista and, for all practical purposes, that subject-controlled interface Yahoo [a metadata approach]. Google, as we all know, got the brilliant idea to digitize everything known to man/woman. To make sure they got everything, they erroneously thought they could use OCLC cataloging records as a comprehensive registry. Furthermore, information providers tried to create one-stop shops that competed with Google, and my two worlds of Bierbaum and Srinivasan collided.
So where do we go from here? Do we continue to call it “cataloging” or is there some other hybrid term that combines these two worlds of “metadata” vs. algorithms? Do we continue to call it Technical Services or is there a set of services that better describes this integration? What does all this mean for organizing, preserving, discovering, and using government information?
Let your voice be heard.
>>>>>>>>>>>>>>>>>>>>>>>
And here is our response, seeking to clarify and give context for the provocative question.
>>>>>>>>>>>>>>>>>>>>>>>
Hi all,
Stephen asks if describing government information is obsolete. We would suggest that asking a slightly different question might yield a more accurate and useful answer. Why not ask instead what our designated communities need and how can we best provide that?
We would also suggest that lumping “cataloging” and “metadata” and “tagging” and “algorithms” and “searching” together confuses some very different things that need to be understood separately as well as collectively.
With that said, here are our thoughts brought about by Stephen’s provocation:
Our communities need search results that are accurate — with high precision, high recall, and few false positives. They need easy browsing by agency, subject, date, author, series, Congress, and so forth. They need to be able to easily search for known items (and cited items) and get accurate results. They want to be able to accurately identify any given digital object as to its authenticity, agency, author, date, version, title, series, and so forth.
Stephen implies that the existence of Google and full-text searching might make cataloging and the creation of metadata for govinfo obsolete or unnecessary. Put another way, we might ask: Does the availability of full text search make metadata unnecessary?
No, it does not. We actually have very strong evidence that metadata-based searching is better and more accurate than full-text-based searching. That evidence comes from Google itself! Google “won” the battle against AltaVista et al., not because it provided full-text searching and its competitors didn’t, but because Google figured out how to use human-created metadata (links) to increase relevance in its search results.
The obvious conclusion is that searching based on human-created metadata is better and more accurate than searching based only on full-text.
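For readers who want to see the idea in miniature, here is a toy PageRank-style calculation (a rough sketch of the principle, not Google’s production algorithm) in which the only input is the link graph itself, i.e., metadata that humans created by deciding what to link to. The page names and links are invented.

```python
# Toy PageRank-style scoring over an invented link graph.
links = {
    "agency_report": ["data_portal", "press_release"],
    "press_release": ["agency_report"],
    "data_portal":   ["agency_report"],
    "random_blog":   ["press_release"],
}

pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}
damping = 0.85

for _ in range(50):  # power iteration until the ranks settle
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outgoing in links.items():
        share = rank[page] / len(outgoing)   # each page splits its rank among its links
        for target in outgoing:
            new_rank[target] += damping * share
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page:15s} {score:.3f}")
```

Nothing in the sketch looks at the text of the pages; the ranking comes entirely from the human-created link structure, which is the sense in which Google’s relevance gains were built on metadata.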
Stephen also implies that there is just too much information for us to handle and we need automated techniques to cope with that volume. We would agree that we should be investigating new techniques and strategies (including automated techniques) to make our work more efficient and effective. But, rather than use the volume of information as an excuse for cutting back on our work, we would suggest that the volume of bad, confusing, unofficial, badly maintained (etc.) information on the web is a reason for us to spend more time collecting, preserving, authenticating, describing, labeling, and “cataloging” govinfo — the famous digital curation lifecycle model depends on libraries doing all of these things well. This is a space where libraries can define themselves as trusted, privacy-protecting sources of information and differentiate themselves from commercial entities that commodify personal information.
Stephen suggests that maybe we could automate the creation of metadata in some way. There is some good research on this idea (sometimes called topic-modeling) of using computers to apply classifications to documents based on analysis of the digital text. (Google already uses something like this [“Normalized Google Distance”] to determine “similarity” of news articles so that they can cluster news articles about the same topic together.) This is a very interesting area of research that may, someday, result in tools for automated subject classification of government documents. But, before suggesting that human subject cataloging and classification is “obsolete,” government information professionals might want to read one study that actually looked at computer-generated topic-modeling classifications and human-generated classifications of govdocs (Classification of the End-of-Term Archive: Extending Collection Development Practices to Web Archives. Final Report). This report explicitly noted the difficulty of accurately attributing authorship (SuDoc numbers) to web-harvested government documents and examined the reasons for this difficulty. Research in topic modeling has been more promising in less complex bodies of literature — such as collections of newspaper articles.
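As a rough illustration of what topic modeling does, here is a minimal sketch using scikit-learn’s LDA implementation on a few invented text snippets (they are not real govdocs). Note that it only groups documents by word co-occurrence; it does not produce the agency or SuDoc attribution that the End-of-Term report found so difficult to automate.

```python
# Minimal topic-modeling sketch: LDA over a handful of invented snippets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "federal budget appropriations spending committee hearing",
    "water infrastructure grants drinking water safety",
    "budget deficit spending revenue committee report",
    "clean water act infrastructure funding environment",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)                 # document-term matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)                  # per-document topic mixtures

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {', '.join(top_terms)}")

for i, mixture in enumerate(doc_topics):
    print(f"doc {i} -> topic {mixture.argmax()} ({mixture.max():.2f})")
```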
We would suggest that it is important to differentiate between descriptive cataloging and subject cataloging. Accurately associating authors, agencies, dates, series, editions, and so forth is an extremely important part of providing accurate search, discovery, and identification of government information. This is very difficult to automate using full text. You can see this for yourself by trying to find all issues of a government serial or series in Google Books or HathiTrust. (These sites do have the particular problem of using inaccurate digital text generated during digitization of print documents. See also: An alarmingly casual indifference to accuracy and authenticity. What we know about digital surrogates).
So, of course metadata still matters! And, at least at this stage, human-created metadata for govinfo is more accurate and of better quality than any automated metadata (descriptive or subject). Internet users inherently understand this, too, as can be seen by the fact that they tag their photos, videos, blog posts, tweets, Facebook posts, bookmarks, instagrams and vines.
If anything, we should be aware that the Web provides us with the best evidence of the need for professional govinfo action. Specifically, existing algorithms for classifying and describing govinfo on the web are rather feeble, and existing search facilities for authentic, official government information are not nearly as good as those for other kinds of information on the web.
Libraries in general are playing a lot of catch-up with our govt documents metadata because too many libraries neglected creating adequate metadata for too long and failed to add govdocs to their cataloging workflows. But that neglect should not be used as an excuse to propose further neglect. It should, rather, be evidence of the need for more work. It does not really matter what we call what we do (catalogers, metadata managers, etc.). It is time to push our library administrations for the resources necessary to describe our documents collections in order to make our deep and rich collections more findable and usable by our communities.
We are not alone in facing these issues. A couple of very recent articles ask some similar questions about the broader information landscape. You might find these of interest:
- Will Deep Links Ever Truly Be Deep? Scott Rosenberg, Medium (Apr 7, 2015)
- If Algorithms Know All, How Much Should Humans Help? Steve Lohr, New York Times (APRIL 6, 2015)
- OCLC Works Toward Linked Data Environment. By Matt Enis, ALA Midwinter 2015, Library Journal (February 17, 2015)
Jim A. Jacobs and James R. Jacobs
PS: Our responding to Stephen should not be seen as our wanting to volunteer for the GODORT cataloging committee 🙂
(editor’s note: Our response here is slightly different from the original one sent to govdoc-l. That was my mistake in sending an earlier draft to the list. jrj)
Looking for data? Two Important Sites.
June 18, 2012
Finding raw data or the statistics generated from those data can be a daunting task. There is no “Books in Print” for data. Two recent developments should help.
- OpenMetadata.org Community Site Launched, by Christine Connors, Information Today (June 18, 2012).
A new web portal was announced at the recent IASSIST conference in Washington D.C. Designed to make working with metadata easier, OpenMetadata.org (OM) is the product of Metadata Technology North America and Integrated Data Management Services, which created the site to “facilitat[e] access to standards based innovative technologies for the management of socio-economic, scientific, and other statistical data.” Though the site is still in its initial deployment, the goal is for it to become a go-to resource for discovery, access, and tools for using statistical metadata.
The site currently focuses on two metadata standards: the Data Documentation Initiative (DDI) and the Statistical Data and Metadata Exchange (SDMX) standard.
- DataCatalogs.org
DataCatalogs.org aims to be the most comprehensive list of open data catalogs in the world. It is curated by a group of leading open data experts from around the world – including representatives from local, regional and national governments, international organisations such as the World Bank, and numerous NGOs.
See particularly: the OpenMetadata Survey Catalog, a portal aggregating information on surveys from data producers and archives around the globe. The catalog enables you to perform complex searches across studies and variables and browse through comprehensive metadata.
New from the Library of Congress: A Bibliographic Framework for the Digital Age
October 31, 2011
Via INFOdocket
The Working Group on the Future of Bibliographic Control, as it examined technology for the future, wrote that the Library community’s data carrier, MARC, is “based on forty-year-old techniques for data management and is out of step with programming styles of today.” The Working Group called for a format that will “accommodate and distinguish expert-, automated-, and self-generated metadata, including annotations (reviews, comments, and usage data).” The Working Group agreed that MARC has served the library community well in the pre-Web environment, but something new is now needed to implement the recommendations made in the Working Group’s seminal report. In its recommendations, the Working Group called upon the Library of Congress to take action. In recommendation 3.1.1, the members wrote:
“Recognizing that Z39.2/MARC are no longer fit for the purpose, work with the library and other interested communities to specify and implement a carrier for bibliographic information that is capable of representing the full range of data of interest to libraries, and of facilitating the exchange of such data both within the library community and with related communities.”
This same theme emerged from the recent test of the Resource Description and Access (RDA) conducted by the National Agricultural Library, the National Library of Medicine, and the Library of Congress. Our 26 test partners also noted that, were the limitations of the MARC standard lifted, the full capabilities of RDA would be more useful to the library community. Many of the libraries taking part in the test indicated that they had little confidence RDA changes would yield significant benefits without a change to the underlying MARC carrier. Several of the test organizations were especially concerned that the MARC structure would hinder the separation of elements and ability to use URLs in a linked data environment.
With these strong statements from two expert groups, the Library of Congress is committed to developing, in collaboration with librarians, standards experts, and technologists, a new bibliographic framework that will serve the associated communities well into the future. Within the Library, staff from the Network Development and Standards Office (within the Technology Policy directorate) and the Policy and Standards Division (within the Acquisitions and Bibliographic Access directorate) have been meeting with Beacher Wiggins (Director, ABA), Ruth Scovill (Director, Technology Policy), and me to craft a plan for proceeding with the development of a bibliographic framework for the future.
[Clip]
We at the Library are committed to finding the necessary funding for supporting this initiative, and we expect to work with diverse and wide-ranging partners in completing the task. Even at the earliest stages of the project, we believe two types of groups are needed: an advisory committee that will articulate and frame the principles and ideals of the bibliographic framework and a technical committee that has the in-depth knowledge to establish the framework, itself.
Searching is easy, finding is hard
June 8, 2011
The Association of Educational Publishers and Creative Commons to Co-Lead Learning Resources Framework Initiative (June 7, 2011)
The Association of Educational Publishers (AEP) and Creative Commons (CC) today announced a partnership to improve search results on the World Wide Web through the creation of a metadata framework specifically for learning resources.
Improving Web Searches for Students, by Steve Kolowich, Inside Higher Ed (June 8, 2011)
…a coalition of education-oriented companies and organizations aims to make it easier to find useful educational content amid the detritus of the Web…. [T]hey are forming a working group to come up with more detailed criteria that could eventually be incorporated into the search interfaces for Google, Bing, and Yahoo!
“Searching is easy, finding is hard, and finding relevant is very hard…. The purpose of this effort is to provide a series of tags and tools that allows the search engines to more discretely and accurately expose the educational resources to the people who need it,” said [Michael] Johnson [an AEP board member]. The project is aimed at benefiting the publishers of educational content as much as students, he said. By giving publishers better flares and students better binoculars, Johnson and his colleagues hope to up their chances of finding one another in the wilderness of the Web.
Schema.org provides a collection of schemas, i.e., html tags, that webmasters can use to markup their pages in ways recognized by major search providers. Search engines including Bing, Google and Yahoo! rely on this markup to improve the display of search results, making it easier for people to find the right web pages.
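To make “markup recognized by search providers” concrete, here is a small sketch of what schema.org-style structured metadata for a learning resource could look like, built as a Python dict and serialized to JSON-LD. The property names follow schema.org’s general vocabulary; the title, author, and other values are invented examples, and this is not the specific framework AEP and Creative Commons were proposing.

```python
# Sketch of schema.org-style metadata for a hypothetical learning resource,
# serialized as JSON-LD. All values are invented examples.
import json

resource_metadata = {
    "@context": "https://schema.org",
    "@type": "CreativeWork",
    "name": "Introduction to the Federal Budget Process",  # invented title
    "author": "Example Agency",                            # invented author
    "datePublished": "2011-06-08",
    "about": "federal budget",
    "audience": {"@type": "EducationalAudience", "educationalRole": "student"},
}

# Embedding this JSON in a page (e.g., in a <script type="application/ld+json"> tag)
# is one common way search engines pick up such structured metadata.
print(json.dumps(resource_metadata, indent=2))
```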
A Study on Metadata Elements for Web-based Reference Resources System Developed through Usability Testing, by Younghee Noh (Konkuk University), Library Hi Tech, Vol. 29, Iss. 2.
The study aimed to improve metadata elements of web-based reference resources. To propose correct metadata elements, it was deemed necessary to close the gap between the perception of metadata creators and data creators through a user behavior analysis.