Our mission

Free Government Information (FGI) is a place for initiating dialogue and building consensus among the various players (libraries, government agencies, non-profit organizations, researchers, journalists, etc.) who have a stake in the preservation of and perpetual free access to government information. FGI promotes free government information through collaboration, education, advocacy and research.

EPA Tagging Results – Ready and Promising

Our report on our experiment in using del.icio.us to tag EPA documents originally harvested by GPO is now completed and available for your review and comment at http://freegovinfo.info/node/1825.

For more information about this project, including a list of tags assigned to documents by project participants, please see http://freegovinfo.info/epatagging.

Our thanks to the project participants!

EPA Tagging Results and Future Directions

Back in January we asked people to use del.icio.us to tag a sample of 32 documents taken from the 100 EPA documents posted by the Government Printing Office (GPO) to http://www.gpoaccess.gov/harvesting/index.html.
We asked people to tag documents from 1/18/2008 through 4/18/2008. A spreadsheet of the results is available at http://spreadsheets.google.com/pub?key=pybymZBlZ80PVat2ggty2GA.
This brief article informally discusses some of our results, offers some lessons learned, and offers suggestions for future projects. Finally, a short list of articles on other research relating to tagging is presented.

1) Findings

  • Number of tagged documents – 31
  • Average number of people tagging a given document – 2.5
  • Highest number of taggers for a document – 8, for the document “Environmental Results Under EPA Assistance Agreements”
  • Average number of deduplicated tags per document – 11.25
  • Number of documents with descriptions – 31, with a majority of documents having more than one human-generated description.
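
For readers who want to recompute figures like these from the spreadsheet, here is a minimal Python sketch. The record layout (document title, user name, list of tags), the sample rows, and the case-insensitive deduplication are all assumptions made for illustration; they are not the project's actual data or method.

```python
from collections import defaultdict

# Hypothetical bookmark records: (document title, del.icio.us user, tags).
# These rows are illustrative only, not the project's real data.
bookmarks = [
    ("Air Sealing", "user_a", ["energy", "air-sealing", "homes"]),
    ("Air Sealing", "user_b", ["Energy", "hvac"]),
    ("Boxed In?", "user_a", ["shipping", "smartway"]),
]


def summarize(records):
    """Return per-document sets of taggers and of deduplicated tags."""
    taggers = defaultdict(set)
    tags = defaultdict(set)
    for title, user, user_tags in records:
        taggers[title].add(user)
        # Deduplicate case-insensitively (an assumption; the report does
        # not say how duplicate tags were identified).
        tags[title].update(t.lower() for t in user_tags)
    return taggers, tags


taggers, tags = summarize(bookmarks)
n_docs = len(taggers)
avg_taggers = sum(len(users) for users in taggers.values()) / n_docs
avg_tags = sum(len(doc_tags) for doc_tags in tags.values()) / n_docs
most_tagged = max(taggers, key=lambda title: len(taggers[title]))

print(n_docs, avg_taggers, avg_tags, most_tagged)
```

The same pass over the records also shows which documents had only a single tagger, which matters for the limitations discussed later.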

2) Some Promising Results

While we would have liked to have seen more participation (see below under “study limitations”), these initial results are somewhat positive. There is some interest in tagging. Tagged documents tended to receive meaningful descriptions beyond what a brief bibliographic record would provide. For example, for the document “Air Sealing: Building Envelope Improvements”, we have the following descriptions from five users:

* Mount Desert Spring Water was able to win a bid to provide bottled water and water coolers to the University of Maine. Mount Desert Spring Water was successful because the water coolers it provided were energy efficient and the lowest cost to the Universi – samchap

* Describes the benefits of proper air sealing for homes. EPA awards the EnergyStar when legal minimum standards are exceeded. – mkvs

* Conserving energy in your house by having it sealed correctly – bookswoman

* “Air sealing the building envelope is one of the most critical features of an energy efficient home.” “25-40% of energy” “ENERGY STAR qualified homes, constructed to exceed [building] codes with air sealing, can offer a better quality product.” – keyvowel

* This Energy Star news release describes ways homeowners can reduce home heating and cooling costs by implementing air sealing techniques. – tadamich

Without question, the first description is problematic, but the other four descriptions are in agreement about what this document is about AND provide more relevant information than a brief bibliographic record.

For the most part, the tags we got were also meaningful and descriptive. Staying with the document “Air Sealing”, we have the following tags:

Air, air-sealing, airsealing, building-insulation, efficient, energy,
energy-efficiency, Energy-Star-Branding, energyconservation, energystar, epa, EPA-advertising, globalwarming, greenhousegases, home-building, home-building-techniques, home-construction, home-improvement, homes, hvac, indoor, leakage, money-saving, quality, sealing, ventilation

Contrast that with a brief bibliographic record that simply has title, agency, and URL. How would people know that this document is part of the EnergyStar initiative, or that it was related to home building or energy efficiency? In this instance, and for a number of other project documents, tagging clearly added value.

3) Limitations of current study

Our promising results were limited by three factors, the most important of which was the lack of participation. We estimate that about ten people participated in our tagging project. The available research on tagging is fairly firm that good social tagging requires many users. Some say 100 or so is good, others suggest higher numbers. Our numbers are clearly too low. There were also too many instances (12) in which a document was tagged by a single user. This could greatly bias how a document gets tagged. Consider if the only description of “Air Sealing” had been the mistaken one about water coolers. That would have been worse than useless. But even in this instance, a user pulling up this document while searching for water coolers could have provided a more accurate description.

The low number of taggers also made it difficult to see how much tag agreement existed among the various taggers.

Another problem was self-inflicted. We forgot to instruct people on tag construction. These were our original instructions:

1) Visit http://www.archive.org/search.php?query=epapilotproject and go to a document on the list. Open the pdf file in a separate browser window.
2) In del.icio.us, tag the page for the Internet Archive record (i.e. not the PDF file) after examining the PDF file.
3) In the del.icio.us “notes” field, write a one or two sentence description of what the document is about.
4) In the tags field, please use epapilotproject, for:freegovinfo and then any tags that you feel describe this document.

del.icio.us uses a space-separated tag system. In other words, a space begins a new tag. So tagging something as “air quality” results in the two tags “air” and “quality” and not the more helpful tag “air quality.” This made some of the tagging meaningless. If we had asked people to put dots or dashes in multiple-word tags, we would have gotten more meaningful tags. We still got some useful tags because some of our taggers were used to the del.icio.us system, but we shouldn’t have assumed that everyone tagging would know how to construct multiword tags in del.icio.us. On the other hand, this problem might have been less noticeable if we had more taggers per document.
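
The effect is easy to see in a small Python sketch. It models only the space-separated behavior described above, not del.icio.us's actual implementation, and the function name `parse_tags` is our own invention.

```python
def parse_tags(field):
    """Split a tags field on whitespace, as a space-separated system would."""
    return field.split()


# A two-word phrase typed with a space becomes two unrelated tags.
print(parse_tags("air quality epa"))   # ['air', 'quality', 'epa']

# Joining the words with a dash keeps the phrase together as one tag.
print(parse_tags("air-quality epa"))   # ['air-quality', 'epa']
```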

Our final problem is one we think could be avoided in future projects: people tagging different files with the same document title. We asked people to bookmark the Internet Archive page for a given document, which has a link to the PDF file. We specifically asked people NOT to tag the PDF file because del.icio.us doesn’t populate the title field of bookmarked PDFs. But one person in our project consistently bookmarked a document’s PDF file instead of the Internet Archive page. This separated that person’s tagging from everyone else’s and made it more difficult to compile tagging info for every document.

4) What next? Some suggestions

Our findings indicate that tagging does have potential to add value to web harvested documents that do not receive full cataloging, but for this benefit to be fully realized, there must be more taggers. When we realized we didn’t have the number of taggers we wanted, we headed for the literature and found some articles, listed below under “References Consulted.” They offer some interesting guidance for other document tagging efforts.

While all of the papers below talked about user motivation, I think Tim Spalding said it best in a post titled “When tags work and when they don’t: Amazon and LibraryThing”:

“Something is going on here—something with broad implications for tagging, classification and “Web 2.0” commerce. There are a couple of lessons, but the most important is this: Tagging works well when people tag “their” stuff, but it fails when they’re asked to do it to “someone else’s” stuff. You can’t get your customers to organize your products, unless you give them a very good incentive. We all make our beds, but nobody volunteers to fluff pillows at the local Sheraton.”

The EPA documents are sort of like fluffing pillows at the local Sheraton, to me at least. My primary interest isn’t environmental documents, and EPA documents are not a major component of my library’s depository collection. In addition, our particular sample was unintentionally heavy on flyers, applications, and brochures. It could be that another agency’s documents, say NASA or DoD, might get more attention.

There’s another angle too. In my anecdotal experience, librarians don’t see web stuff as theirs, so they don’t spend much processing time on it. Or if they are concerned about web documents, perhaps their administration is not. So how could we make them owners and think of web harvested materials as “their stuff” so they’ll make their “documents beds”? A few suggestions follow:

1) For the EPA documents, GPO could partner with libraries that do have a strong environmental collection. Perhaps candidate libraries could be determined through item selection analysis.

2) GPO might wish to consider doing a depository survey to see which agencies’ documents depositories would most like to see web-harvested. The survey could include a question asking libraries if they would tag if the desired content was harvested.

There wouldn’t have to be a commitment to tag every document, but to tag some of the documents.

While GPO should continue with web harvesting no matter what, we wouldn’t blame them for not moving forward with a documents tagging initiative if the depository community failed to register interest in such a project.

3) If GPO re-harvests EPA or moves on to another agency, it should consider setting up RSS feeds for newly harvested documents. Subject specialists from inside and outside the library community could take part in tagging. Again, GPO would need to start with some broadly popular agencies to have a chance of recruiting a significant number of taggers.

4) If GPO or another organization does a large scale tagging project, significant thought should go into tagging conventions. This does not mean the vocabulary itself: research seems to show that once an item reaches 100 tags or so, the relative proportions of its tags stay constant. That is to say that agreed upon terms appear to predominate over idiosyncratic or spam tags (See Golder and Huberman below for details). What needs to be spelled out is how multi-word tags should be constructed. Is it air-quality, air.quality, or air_quality? They all mean the same thing, but del.icio.us and other tagging services interpret them differently. A consistent word separator, or a tagging site that supports spaces inside tags, would make any tagging project go more smoothly.
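
As an illustration of what a separator convention buys, here is a minimal Python sketch that collapses the three variants above into one form after the fact. The function name `normalize_tag` and the choice of a dash as the canonical separator are assumptions for this example, not a convention the project adopted.

```python
import re


def normalize_tag(tag, sep="-"):
    """Map variants such as air.quality and air_quality to air-quality."""
    # Split on runs of dashes, dots, or underscores and drop empty pieces.
    parts = [p for p in re.split(r"[-._]+", tag.strip().lower()) if p]
    return sep.join(parts)


for raw in ["air-quality", "air.quality", "Air_Quality"]:
    print(raw, "->", normalize_tag(raw))
```

A cleanup step like this cannot recover run-together tags such as energystar, or tags already split by a space, which is why agreeing on the convention before tagging begins is the better fix.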

These are our thoughts. What are yours? Look at our spreadsheet. Check out the item pages on del.icio.us and read the articles below. Then let us know what you think about the future of social tagging for government documents.

References Consulted

– “HT06, Tagging Paper, Taxonomy, Flickr, Academic Article, ToRead” by Cameron Marlow, Mor Naaman, danah boyd, Marc Davis http://www.danah.org/papers/Hypertext2006.pdf

– “The Structure of Collaborative Tagging Systems” by Scott A. Golder and Bernardo A. Huberman

– “Can Social Bookmarking Improve Web Search?” by Paul Heymann, Georgia Koutrika, and Hector Garcia-Molina

– “When tags work and when they don’t: Amazon and LibraryThing” by Tim Spalding, Thingology Blog, posted Tuesday, February 20, 2007

Library of Congress on Flickr

Library of Congress tests Web 2.0 photo archive, By Wade-Hahn Chan, FCW.com, March 24, 2008. “The Library of Congress has turned to the popular online picture-sharing community of Flickr for help with tagging the library’s voluminous photo archives.”

Library of Congress Photos on Flickr, Library of Congress Prints & Photographs Reading Room. “We invite you to tag and comment on the photos, and we also welcome identifying information—many of these old photos came to us with scanty descriptions!” See also the FAQ.

The Library of Congress’ photos, Flickr. “Yes. We really are THE Library of Congress.”

See also: EPA Pilot Project Tagging Project.

Help Us Explore Findability Through Tagging!

Free Government Information is investigating the usefulness of tagging government documents that do not receive traditional cataloging and needs your help! We’ve taken 32 documents that the Government Printing Office (GPO) harvested from the EPA web site and posted them to the Internet Archive. Over the next three months, we’d like to see as many people as possible tag and describe these documents using the del.icio.us bookmarking service. For a full project description and instructions on how to participate, please visit http://freegovinfo.info/epatagging. We’d like to thank GPO for posting a sample of their harvested EPA documents that made this project possible.

This project got its inspiration from Galaxy Zoo (http://www.galaxyzoo.org), an astronomy project which has a database of 1 million galaxies that researchers asked regular folks to classify as elliptical, clockwise spiral, or anticlockwise spiral. They aimed for and got at least 20 classifications per galaxy. If a particular galaxy was classified a certain way by 80% of users who assigned a classification to that galaxy, that classification was accepted. This "person on the street" data was compared with a small subset (50,000) of galaxies that professional astronomers had managed to classify on their own. The researchers found that there was pretty much total agreement between the professional and amateur assessments. Documents are more complex than galaxies 🙂, but if 9 out of 10 people tag an EPA document as air quality, then it’s probably about air quality.
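
The acceptance rule described above can be sketched in a few lines of Python. This is our own illustration, not Galaxy Zoo's code: the function name `consensus` is hypothetical, and treating 20 classifications as a hard minimum is an assumption based on the "at least 20" figure.

```python
from collections import Counter


def consensus(labels, min_votes=20, threshold=0.8):
    """Return the accepted label, or None if no label qualifies."""
    if len(labels) < min_votes:
        return None  # too few classifications to decide
    label, count = Counter(labels).most_common(1)[0]
    # Accept only if the most common label reaches the agreement threshold.
    return label if count / len(labels) >= threshold else None


votes = ["elliptical"] * 17 + ["clockwise spiral"] * 3
print(consensus(votes))  # 17 of 20 is 85%, so "elliptical" is accepted
```

The same rule could be applied to a tag on a document: count how many taggers used it and accept it once enough of them agree.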

So please visit http://freegovinfo.info/epatagging and get started. And tell your friends, coworkers and especially any environmental professionals that you know to get involved. Also, if you have a network in del.icio.us, we’d appreciate you putting on a "for:[friend name]" tag for every member of your del.icio.us network.

UPDATE 1/25/2008 Forgive my overzealousness with the above suggestion to tag every person in your del.icio.us network. I should never advocate spam. BUT, if there are people in your network interested in the environment or government documents, please consider sharing our project page with them.

The more people involved with this project, the better the descriptions and the more robust the subject access provided by the tagging will be. At least that’s our hope.

We are going to run this project for three months, then the FGI volunteers will compile data on the following:

A) How many people participated in the project.
B) How many documents were tagged.
C) How many documents were described.
D) The average number of tags per document.

We will also examine how much agreement on tags exists for a given document. We will make our compilations publicly available along with any analysis we have.
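
One possible way to put a number on tag agreement is the average Jaccard overlap between every pair of taggers for a document. The Python sketch below is an assumption about how this could be done, not the method the project used, and the function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared tags divided by all distinct tags used by either tagger."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_agreement(tag_sets):
    """Average Jaccard overlap across all pairs of taggers for one document."""
    pairs = list(combinations(tag_sets, 2))
    if not pairs:
        return None  # a single tagger gives no agreement signal
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Illustrative tag sets from two hypothetical taggers of one document.
print(mean_pairwise_agreement([{"air", "energy"}, {"air", "homes"}]))
```

A score near 1 means taggers chose nearly the same tags, and a score near 0 means they shared almost none.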

Hope to see you on del.icio.us soon making environmental documents easier to find and easier to digest!

EPA Pilot Project Tagging Project

Note: The project period was January 18, 2008 through April 18, 2008.
The participation period for this project has closed.

Please see below for unique tags assigned to documents. To view the tagging directly on del.icio.us, please see http://del.icio.us/tag/epapilotproject

Update 4/23/2008 – Data has been compiled into a spreadsheet available at http://spreadsheets.google.com/pub?key=pybymZBlZ80PVat2ggty2GA
Interpretation to follow.

Update 5/7/2008 – The results report is finished and may be read and commented on at http://freegovinfo.info/node/1825.

Below is the original project announcement:

Free Government Information needs your help to investigate whether social tagging of government documents is a viable idea.

We have stashed 32 documents from the Government Printing Office’s EPA Web Harvesting Pilot Project in the Internet Archive. We would like as many people as possible to bookmark, tag and provide brief descriptions of all 32 of these test documents using the del.icio.us bookmarking service.

If you would like to join this effort and have a del.icio.us account, please follow this procedure:

1) Visit http://www.archive.org/search.php?query=epapilotproject and go to a document on the list. Open the pdf file in a separate browser window.
2) In del.icio.us, tag the page for the Internet Archive record (i.e. not the PDF file) after examining the PDF file.
3) In the del.icio.us "notes" field, write a one or two sentence description of what the document is about.
4) In the tags field, please use epapilotproject, for:freegovinfo and then any tags that you feel describe this document.

Please do steps 2-4 for as many documents as you can, ideally all 32.

We are going to run this project for three months, then the FGI volunteers will compile data on the following:

A) How many people participated in the project.

B) How many documents were tagged.

C) How many documents were described.

D) The average number of tags per document.

We will also examine how much agreement on tags exists for a given document.

We have a belief, based on projects like NASA Clickworkers, GalaxyZoo and the Library of Congress’ Flickr project, that the community of government documents users can improve the findability of government information and provide a valuable adjunct to traditional cataloging. We also believe that a successful tagging environment will provide better access than GPO’s newly declared brief bibliographic records process. Time will tell. Help us find out!


List of harvested EPA test titles for this project:

Aerosol Test Facility at Research Triangle Park – Aerosol-propellants, Environmental-health-Research, Research-Triangle-North-Carolina, Terrorism-Prevention
Air Quality Data Analysis Technical Support Document for the Proposed Interstate Air Quality Rule – air, pollution, quality, data, ambient, monitoring
Air Sealing: Building Envelope Improvements – Air, air-sealing, airsealing, building-insulation, efficient, energy, energy-efficiency, Energy-Star-Branding, energyconservation, energystar, epa, EPA-advertising, globalwarming, greenhousegases, home-building, home-building-techniques, home-construction, home-improvement, homes, hvac, indoor, leakage, money-saving, quality, sealing, ventilation
Analysis of Atmospheric Deposition of Mercury to the Savannah River Watershed – water-quality, mercury-levels, water-pollutants, Clean-Water-Act, water-testing-results
Approval of Urban Bus Retrofit/ Rebuild Equipment – Air, Air-quality, Air-toxins, buses, emissions, engine, engines, engines-retrofit-and-rebuild-equipment, matter, particulate, pollution, retrofit, transit-buses-emissions, Urban-transportation
Approval of Urban Bus Retrofit/ Rebuild Equipment (Oct 1997) – Air, buses, Clean-Air-Act, emissions, engine, engine-retrofit-rebuild-kits, engines, matter, Particulate, particulate-matter, pollution, retrofit, transportation, Urban-transit
Are You One of the Top 20? – 2005, benefits, Best-Workplaces-For-Commuters, business, commuters, commuting, companies, emission, employers, epa, flexible-scheduling, fortune500, govdocs, incentives, misspellings, private-transportation, public-transportation, reduction, telecommuting
Are You Ready to Take Advantage of the New Commercial Tax Incentives – Energy-Star, Energy-saving-tax-decuction, Commercial-buildings, Commercial-building-improvements
Arsenic Rule Benefits Analysis: an SAB Review – Arsenic, Arsenic-levels, Cancer-causing-agents, costbenefitanalysis, drinking, environment, Environmental-health-Research, exposure, exposurelevels, Freegovinfo, health, public, standards, water, Water-quality, Water-treatment-costs
Best Workplaces for Commuters Application Form – Applicationforms, applications, audience:hr, benefits, bestworkplaces, carpool, carpools, communting, commuter, commuter.benefits, commuters, commutersepa, commuting, employerincentives, employers, environment, environmental, environmentalimpactofcommuting, epa, etc., forms, impact, incentives, program, publictransportation, telecommuting, telework, transportation, vanpool, vanpools
Best Workplaces for Commuters Graphic Standards and Usage Guide – EPA-branding, Best-Workplaces-For-Commuters, Government-agencies-public-relations
Boxed In? – 2004, airpollution, cleanair, emissions, environmentally-friendly-shipping, epa, fleet, gases, global, globalwarming, govdocs, greenhouse, greenhousegases, management, money-saving, posters, shipping, smartway, transportation, vehicle, warming
Business Case for Information Services: EPA’s Regional Libraries and Centers – Environmental-libraries-United-States, Environmental-libraries-United-States-Costs-and-benefits
Calculation and Use of First-Order Rate Constants for Monitored Natural Attenuation Studies – attenuation, attenuation-rates, biodegration-rates, contaminants, contamination, epa, govdocs, ground, groundwater, mna, monitored, natural, plume-concentrations, remediation, research, water
Carpool Incentive Programs: Implementing Commuter Benefits as One of the Nation’s Best Workplaces – air, carpools, commuters, commuting, employers, Employers-and-employees, EPA-advertising, EPA-branding, incentives, pollution, transportation, Workplace-conditions
Chloroneb – Chloroneb, pesticides, pesticides-safety, Cotton-crop-management-and-control, ornamental-plants-and-grasses-pesticide-treatment
Community Involvement Plan for the Copper Basin Mining District Site, Polk County, Tennessee – Freegovinfo, Copper, Basin, mining, community, involvement, cleanup, environmental-cleanup, community-involvement-in-environmental-programs-details
Conformity SIP guidance – transportation-regulations-states, Conformity-state-improvement-plans-SIPs, transportation-federal-regulation, transportation, conformity, SIP, air, quality, standards
Development of a Performance-based Industrial Energy Efficiency Indicator for Automobile Assembly Plants – vehicle-assembly-plants-energy-use, assembly-plants-energy-efficiency, automobile-assembly-energy-use-studies, manufacturing-process-energy-used
Diclofop-Methyl – 2000, bioaccumulation, cancer, carcinogens, Commercial-use-of-pesticides, golf-courses, diclofop-methyl, epa, Freegovinfo, golf, govdocs, herbicides, pesticides, reregistration, toxicity, toxicology
Diesel Retrofits: Quantifying and Using Their Benefits in SIPs and Conformity – emissions, Diesel-engines, engines, engine-rebuild-retrofit-kits, environmental-state-implementation-plans-SIPs, environmental-regulation-states, environmental-regulation-federal
Difenzoquat – Difenzoquat, pesticides, wild-oats-control, barley-crop-yields, wheat-crop-yields, agriculture-crops-and-yields, difenzoquat, herbicides, pesticides
Energy Star Wins the Bid – 2006, bottled_water, efficiency, electricity, energy, energy_star, energy-efficiency, energy-efficient-water-coolers, energy.conservation, energy.efficiency, energy.star, energyconservation, energystar, environmental-benefits, epa, EPA-branding, Energy-Star, freegovinfo, govdocs, money-saving, pressreleases, purchasing, umaine, water
ENERGYSTAR Building Upgrade Manual – Energy-Star, Buildings-energy-saving-improvements, Energy-savings-plans, Energy-costs
Environmental Economics Research Strategy – Environmental-economics-influences, Behavioral-science-economic-impacts, Behavioral-science-effect-on-policy-development, Business-and-human-behavior
Environmental Results Under EPA Assistance Agreements – Tagged with (gmp), 2005, assistance.agreements, assistance, agreements, compliance, environment, environmental, environmental.protection, epa, epa.policy, epa.strategic.plan, evidence-based, funding, goals, objectives, governmentagreements, grant, grantee, grants, management, outcome-based, plan, policies, programs, regulations, research, results, results-oriented, strategic
EPA’s Diesel Retrofit SIP and Conformity Guidance – emissions, Diesel-engines, engines, engine-rebuild-retrofit-kits, environmental-state-implementation-plans-SIPs, environmental-regulation-states, environmental-regulation-federal
Final Emission Standards for 2004 and Later Model Year Highway Heavy-Duty Vehicles and Engines
Guidance for Quality Assurance Project Plans – Quality, assurance, environmental, data, EPA-quality-assurance-project-plans, EPA-QA-project-plans, Organizational-quality, product-quality
Guide to Technology Commercialization Assistance for EPA Small Business Innovation Research – Small-Business-Innovation-Research, Small-business-finances, small, business, innovation, technology, commercialization
Heavy-Duty Engine Emission Standards for Highway Trucks and Buses – trucks, transportation, Air-quality-history, trucking-industry, emissions, NOx-standard, Nitrogen-oxides, Global-warming, greenhouse-effect
Preliminary Risk-Based Screening Approach for Air Toxics Monitoring Data Sets – Air, Air-quality, air-toxics, Air-toxins, assessments, Clean-Air-Act, data, Data-analysis, data-screening, dqo, freegovinfo, methodology, monitoring, pollution, r4-slt, risk-based, Screening, sets, toxics, Biomarkers