Month of July, 2012

FOIA request about the cost of American Factfinder with pointers to MuckRock and census.ire.org

I ran into this odd post recently about the US Census Bureau's census tool called American Factfinder -- odd because it was a mix of interesting, fact-based reporting with a healthy dose of tongue-in-cheek facetiousness. Nursing a "long-standing grudge against another piece of contractor-built government software," William Hartnett (who may or may not be a journalist) decided to submit a FOIA request to find out how much it cost to build and then wrote a post about it entitled "The U.S. Census Bureau’s American FactFinder, which everyone in the universe hates, cost taxpayers $33.3 million. So that’s great."

Hartnett's FOIA request garnered an amazingly quick response from the US Census Bureau:

The name of the company that developed the current version of the American FactFinder web application is IBM U.S. Federal and the total $33,340,681.00.

While I'm the first to admit that FactFinder is a difficult and confusing tool to use (not to mention that the Census Bureau decided not to host the 1990 census data on AFF2 but instead to only make it available for download on their FTP server!), I would put it in neither the "useless boondoggle" nor even the "steamy pile of sh*t" category. But at least now we know how much FactFinder cost to build.

Besides that little informational tidbit, Hartnett also provided pointers to 2 Web sites of interest:

MuckRock: This site, for a small fee (not clear if they'll manage your FOIA fees exemption), helps researchers, journalists and the public submit and manage their FOIA requests, then scans the responses and makes them available to the public. Check out the FOIA requests currently in their queue. You can follow @MuckRockNews on Twitter.

Investigative Reporters and Editors (IRE) has a Census project "designed to provide journalists with a simpler way to access 2010 Census data so they can spend less time importing and managing the data and more time exploring and reporting the data." This is a great example of a useful tool built from bulk data supplied by the US Census Bureau! Check out the tool and let us know what you think.

Election 2012 Web Archive

This is the final post in our series on the collaborative End-of-term and Election web archives. This post focuses on the Election 2012 Web archive, particularly on the challenges we are facing in building an engaged community of web site nominators.

The idea for the 2012 Election Web Archive grew out of conversations with the End of Term project partners (Internet Archive, California Digital Library, Library of Congress, the University of North Texas Libraries and the U.S. Government Printing Office) about Harvard’s potential participation in the project. Librarians at the Harvard Kennedy School (HKS - Harvard’s graduate school of government), suggested that a logical focus for them would be the upcoming presidential election. In the past, presidential elections generated a lot of enthusiasm among faculty and students, who frequently requested both printed and online election-related resources from the library. They hoped to harness some of this enthusiasm.

Since the Library of Congress has been collecting in this area for many years, the group decided to collaboratively collect web sites about the 2012 election as a sister collection to the 2012-2013 End-of-term collection. The partners decided to distribute selection responsibilities so that they could each focus on areas of particular interest at their institution. Curators from the Library of Congress would focus on official campaign sites produced by presidential, congressional and gubernatorial candidates, using their in-house tools for tracking nominations.

The Harvard curators would focus on web sites produced by non-profit organizations, academic institutions, fact-checking organizations and some individuals, including blogs, tweets and YouTube videos, using the nomination tool created by UNT for the 2008-2009 project. Examples of these web sites include http://factcheck.org, http://dailykos.com and http://campaignmoney.com/. Initial selections were made by library staff and plans were made to engage faculty, students and staff in relevant academic areas such as Democracy, Politics and Institutions and from HKS Research Centers such as the Institute of Politics; the Shorenstein Center on the Press, Politics and Public Policy; the Center for Public Leadership and the Ash Center for Democratic Governance and Innovation.

Nomination of web sites began in December 2011 and will continue on an ongoing basis until the election. The Internet Archive began crawling these sites in January 2012, and will continue crawling these sites on a weekly basis until sometime after the 2012 election. After each crawl, detailed reports are distributed to the entire group, highlighting any problematic web sites, for example sites that couldn’t be collected because of robot exclusion files. The campaign sites nominated by Library of Congress curators will be crawled separately as a part of the Library of Congress Web Archives with the ultimate goal of providing a shared interface for researchers to access all of the campaign and other election-related sites.
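
For readers curious how a robot exclusion file keeps a site out of a crawl, here is a minimal sketch using Python's standard urllib.robotparser module. It is not the Internet Archive's crawler or its reporting code; the user agent name "example-archiver", the sample robots.txt text and the example.org URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocked_seeds(robots_txt, seeds, user_agent="example-archiver"):
    """Return the seed URLs that the given robots.txt text disallows.

    user_agent is a hypothetical crawler name, not a real archive's agent.
    """
    parser = RobotFileParser()
    # parse() accepts the lines of an already-downloaded robots.txt file
    parser.parse(robots_txt.splitlines())
    return [url for url in seeds if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt that closes off one directory to every crawler
sample_robots = """\
User-agent: *
Disallow: /private/
"""

seeds = [
    "http://example.org/",
    "http://example.org/private/donors.html",
]

print(blocked_seeds(sample_robots, seeds))
```

A report like the ones described above could flag whatever this returns as "not collected because of robot exclusion."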

As soon as crawls were underway, HKS librarians focused on soliciting help in nominating web sites to include in the collection. Previous efforts to engage the community of faculty and students included direct emails, in-person conversations, an advertisement in the student newspaper, and posts to the HKS LinkedIn page, but produced few nominations. The campaign also included contacting staff and librarians at other public policy schools. So far this hasn’t resulted in additional nominations. At the start of the semester in September, HKS librarians will publicize the project on the Journalist’s Resource web site, and possibly on The Monkey Cage web site. In addition, the group made the decision to publish the URL to the nomination tool (which does not require an account to use) directly in public articles about the project. As was learned in the 2008-2009 project, it was more important overall to make the tool as accessible as possible than to lock it down because of the risk of misuse. In the event that any inappropriate URLs are submitted, they can be removed from the list of sites to crawl.

Although to date it has been challenging to engage a broad group of nominators for this project, we remain optimistic that as we move closer to the election it will be easier to spark interest and participation in the project.

How can you help? If you would like to nominate a web site for the Election 2012 Web Archive, visit the nomination tool and start entering URLs. If you have suggestions for us or any questions, please contact us at eotproject@loc.gov, here on this blog, or on Twitter @eotarchive.

Andrea Goethals and Wendy Marcus Gogel
Harvard Library

Keely Wilczek
Harvard Kennedy School Library

FRASER adds new Marriner S. Eccles Document collection on economic history and the Fed

The St Louis Fed's FRASER (Federal Reserve Archive) has just announced the addition of the new Marriner S. Eccles Document collection. It looks to be especially relevant to economic historians and those interested in economics and the Great Depression.

FRASER, a digital library dedicated to preserving the nation’s economic history, recently added the Marriner S. Eccles Document Collection. The new collection provides access to nearly 10,000 documents from the archival collection housed by the University of Utah. Eccles served as Chairman (1934-48) and member (1948-51) of the Board of Governors of the Federal Reserve System. The collection provides research material about the Federal Reserve System, particularly during the 1930s and 1940s, as well as Eccles’s role in the monetary and fiscal systems of the United States during those years.

The documents can be browsed and searched by box, date, author, or keyword (the keyword field searches title, author, and description). Full-text searching is also available through a site-level advanced search, which can be narrowed to only items in the Eccles collection.

Other archival collections that have been made digitally available on FRASER include Papers from the Committee on the History of the Federal Reserve System (held by the Brookings Institution) and the William McChesney Martin Jr. Document Collection (held by the Missouri Historical Society).

FRASER has more than 640 publication titles, dated from 1789 to the present, that can be browsed by title, author, date, or topic. Full-text searching is also available.

Ezra Klein explains daily treasury real yield curve rates

On yesterday's Rachel Maddow show, I was excited to see the Washington Post's Ezra Klein (hosting for Ms Maddow while she's on vacation) use a government document to make a very cogent point.

In discussing a story by the NY Times "Weather Extremes Leave Parts of U.S. Grid Buckling", Klein opined that now is the perfect time for the Federal government to spend on infrastructure projects. He explained that the US Treasury's Daily Treasury Real Yield Curve Rates are currently negative. That is, investors are buying Treasury securities knowing full well that, after inflation, the return will be less than what they paid. Therefore, the US Congress refusing to spend on badly needed infrastructure rejuvenation (see NY Times article above) is "financial mismanagement on an epic scale!"

State Agency Databases Activity Report 7/27/2012

Looks like our last activity report for the State Agency Databases Across the Fifty States project was 7/1/2012. What's happened since then?

Well, our project was named one of the Best Free Reference Web Sites 2012 by ALA's Reference and User Services Association's Emerging Technologies in Reference Section (MARS). Our group of four dozen volunteers is honored to be mentioned on the same page as the FBI Vault, Google's Art Project and the Encyclopedia of Life.

While honors were being handed out, our volunteer documents specialists kept busy. Many URLs were fixed and a number of databases, particularly for California, were added. Here is a sample of what was added in the last month:

CALIFORNIA (Joel Rane)

Online Archive of American Folk Medicine - A searchable database of over a million folk remedies, collected from individuals and books since the 1940s. The remedies are searchable by condition, belief, method of treatment, gender, age and ethnicity of the informant, place, date and region collected, place or region of origin, and the citation name.

OHIO (Audrey Hall)

Ohio Judicial Conference Directory - Directory with information for all Ohio judges. Find contact information for judges by name, county and court.

WISCONSIN (Barbara Bren)

Credential Search Menu - allows consumers to search for individuals and businesses that are credentialed. You can search by name, credential type, credential id or by zip code. These lists certify who has been granted a permit from the state. They may or may not signify a particular level of training or success in passing an examination.

Kudos for GPO: Video streaming from Interagency Depository Seminar

The Interagency Depository Seminar is a week-long event regarded by most of the librarians we know as one of the best events a government information specialist can attend. As the name implies, instructors from other agencies come and provide in-depth training and insight into the information products from their agencies.

Traditionally this has been an in-person event. For the first time this year, GPO will be webcasting parts of the event, according to this e-mail sent to the FDLP-L list:

From: Announcements from the Federal Depository Library Program [mailto:GPO-FDLP-L@LISTSERV.ACCESS.GPO.GOV] On Behalf Of FDLP Listserv
Sent: Thursday, July 26, 2012 11:23 AM
To: GPO-FDLP-L@LISTSERV.ACCESS.GPO.GOV
Subject: Join the Interagency Depository Seminar from Afar!

Next week for the first time, the U.S. Government Printing Office (GPO) will provide remote access to portions of the Interagency Depository Seminar. The virtual sessions are free, but registration is required. Register now and join us live for four segments of the Seminar:

07/30/2012 -- Monday morning: 8:30 AM - Noon (Eastern Time)
* Welcome & Introductions
* History of GPO
* LTIS Update
* Education & Outreach
Register: http://login.icohere.com/registration/register.cfm?reg=270&evt=IADS-Mond...

07/30/2012 -- Monday afternoon: 1:30 PM - 4:00 PM (Eastern Time)
* Federal Reserve Bank's FRASER, FRED, and ALFRED
* DSIMS: Depository Selection Information Management System
Register: http://login.icohere.com/registration/register.cfm?reg=271&evt=IADS-Mond...

08/02/2012 -- Thursday afternoon: 1:30 PM - 5:00 PM (Eastern Time)
* U.S. Geological Survey
* U.S. Department of State
* National Agricultural Library
* National Library of Medicine
Register: http://login.icohere.com/registration/register.cfm?reg=272&evt=IADS-Thur...

08/03/2012 -- Friday morning: 8:30 AM - 12:30 PM (Eastern Time)
* Introduction to GPO's Federal Digital System (FDsys)
* Advanced Navigation in GPO's Federal Digital System (FDsys)
* Closing Remarks and Adjourn
Register: http://login.icohere.com/registration/register.cfm?reg=273&evt=IADS-Frid...

Once your registration is completed, you will receive a confirmation email with instructions for logging into the virtual Interagency Depository Seminar sessions. Virtual attendees will receive certificates of participation. Contact Cindy Etkin (cetkin@gpo.gov) if you have questions.

We at FGI have long advocated for more distance training opportunities for government documents librarians and other government information stakeholders. We haven't been alone. And for the past few years, the Government Printing Office has been listening. This latest advance comes on the heels of a number of online classes for FDsys and other Federal resources.

Thanks, GPO, for continuing to expand opportunities for librarians and other staff who are not able to travel. We especially appreciate you webcasting from your signature training event!

US Executive Branch Closure Crawl

The State of the Federal Web Report issued in late 2011 noted that Federal agencies planned to eliminate or merge several hundred domains, as part of the President's Campaign to Cut Waste. The goal was to reduce outdated, redundant, and inactive domains. As part of this work, the .gov Task Force overseeing the process asked members of the National Digital Stewardship Alliance (NDSA) to archive and preserve all .gov Executive branch domains slated to be decommissioned or merged. NDSA members immediately agreed that an important step in this process was to preserve the content of these sites as part of our national digital heritage - instead of simply eliminating them.

Rather than start a separate, standalone project, we chose to launch a collaborative crawl under the auspices of the End of Term Web Archive project (EOT). Although the EOT project has primarily focused on transitions occurring at the end of administrative terms, part of the goal of the project is to document changes in all online presences of the US Federal government during key periods of transition, regardless of when or under what circumstances they occur. So, a comprehensive harvest, using a targeted list of domains supplied by the .gov Task Force and a general list of all Executive branch domains downloaded from data.gov, began on Saturday, October 8, 2011. The crawl concluded on November 5, 2011 and encompassed 46,278,384 captures and ~13 TB of compressed data.

Here's a general outline of the sequence of events of the Fall 2011 crawl:

  • Agencies identified recommended actions for domains in their Interim Progress Reports and Web Inventory
  • The .gov Task Force collected a list of outgoing .gov domains and shared those with the NDSA
  • Internet Archive crawled outgoing sites and the full suite of Executive branch domains (note: for some resources it took several weeks to crawl sites in their entirety)
  • GSA eliminated domains after they were archived

The End of Term Web Archive project, including the archival capture of Executive Branch domains last Fall, is not meant in any way to satisfy agency records management obligations. The domains are archived solely for the purpose of preservation and posterity. Agencies separately discuss records management obligations and handle those processes independently. However, we do make every effort to replicate resources in their entirety – at least what can be supported by available tools, techniques and best practices. Some portion of every web site is housed server-side and that subset of content and/or user experience cannot be archived and replicated using traditional web crawler/capture software that is dependent on files being downloaded to the client.

The biggest challenge of this project, however, was not Web 2.0/Web 3.0 server side rendering or content serving. The biggest limiting factor was time. When we archive resources, there is a big difference between visiting and sampling a web resource using a set of scoping rules and guidelines versus going out and attempting to “drain” a site, i.e. replicate it soup to nuts as fast as the server can respond to your requests. Some of these resources house thousands to tens of thousands of PDF files, videos and/or other network-intensive resources. And, most servers are programmed to meter how fast they respond to requests from the same IP address or an IP address range, so we have to wait appropriate intervals between requests in order to avoid being ignored or blacklisted by an automated process. There are ways to parallelize capture, but without dedicated funding, few institutions are able to marshal those kinds of resources on a volunteer basis.
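
To give a rough feel for why time becomes the limiting factor, here is a back-of-the-envelope sketch in Python. It is only an illustration, not how the crawl was actually scheduled: the host names, URL counts and the two-second delay are invented, and the model assumes each host is fetched one request at a time with a fixed wait between requests.

```python
def crawl_days(urls_per_host, delay_seconds, parallel_hosts=1):
    """Estimate wall-clock days to fetch every URL with a fixed per-host delay.

    urls_per_host: dict mapping a host name to its number of URLs.
    delay_seconds: assumed wait between requests to the same host.
    parallel_hosts: how many hosts are crawled at the same time.
    """
    # Each host is fetched serially, so its time is simply count * delay
    host_seconds = sorted(
        (count * delay_seconds for count in urls_per_host.values()),
        reverse=True,
    )
    # Hand each host (longest first) to whichever worker is least loaded
    workers = [0.0] * parallel_hosts
    for seconds in host_seconds:
        least_loaded = workers.index(min(workers))
        workers[least_loaded] += seconds
    return max(workers) / 86400  # seconds in a day


# Hypothetical hosts and URL counts, for illustration only
hosts = {"a.example.gov": 50000, "b.example.gov": 20000, "c.example.gov": 10000}

print(round(crawl_days(hosts, 2.0), 2))                    # one host at a time: 1.85
print(round(crawl_days(hosts, 2.0, parallel_hosts=3), 2))  # three at once: 1.16
```

Under these assumptions, crawling hosts in parallel helps, but the largest host still sets the floor: its requests have to be spaced out no matter how many other hosts are being crawled alongside it.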

The End of Term project is built on the collaborative best efforts of a network of partners who share a passion for preservation of online government.

For more information about the streamlining of agency website management, please visit www.usa.gov/WebReform.shtml. This effort is now part of the larger Digital Government Strategy.

For more information on the End of Term Web Archive project, please visit http://eotarchive.cdlib.org, and follow us @eotarchive.

Kris Carpenter Negulescu
Director Web Group
Internet Archive

Sunlight provides databases of government information to university libraries

[Editor's note: Adeeb Sahar, Stanford undergraduate student and Sunlight Foundation intern, asked me to post the following PSA about Sunlight's many projects of interest to students, researchers, and the public. FGI has no official connection to Sunlight Foundation. We just love what they're doing!]

The Sunlight Foundation has launched a campaign to partner with university libraries to catalog its vast online databases of political and government information as electronic resources, giving students and researchers easy access to them.

Sunlight Foundation is a nonpartisan, nonprofit organization working to enhance government transparency through free online resources that track political contributions, follow federal regulations and bills and monitor Congressional activity.

Many universities have already let in the sunlight; Sunlight's projects are cataloged in university library databases including those at Stanford University, New York University and the University of Pennsylvania. In its ongoing effort to supply government information to students, the Sunlight Foundation is looking to partner with even more university libraries.

The following are the databases most commonly cataloged by university libraries. They are geared toward university-level researchers and students interested in political science, public policy, and politics and government:

  • Scout is the first free searchable database of regulations and bills from all fifty states and the federal government. This service searches through a variety of sources including the Congressional Record, THOMAS, and the Federal Register to produce curated legislative news alerts.
  • Influence Explorer contains the most recent information on political contributions, lobbying information, contracts and other government data, allowing users to track and analyze influence by lawmaker, company or prominent individual.
  • Clearspending is a scorecard that analyzes how well U.S. government agencies are reporting their spending data on USAspending.gov and provides insight into any discrepancies.
  • Open Congress brings together official legislative data with news and blog coverage, social networking, public participation tools, and more to give users a comprehensive assessment of Congressional activity.
  • Capitol Words makes searchable all Congressional records from 1996 to today by state, date or politician to uncover the most popular words and phrases used by legislators in the U.S. Congress.

If you are a subject specialist interested in including Sunlight Foundation's electronic databases on your university library website, contact Adeeb Sahar at asahar@sunlightfoundation.com or Amy Ngai at angai@sunlightfoundation.com. See the Sunlight Foundation site for more information about our projects.

Challenges of Site Identification for the 2012 End of Term Web Archive

This post in our series is about the difficulty of selecting web sites and building a list of "seed URLs." Seed URLs are the starting points that crawlers use to capture the web content you want to capture.

Part of the difficulty of building a seed list for the End of Term capture is that the federal web space is large. How large? In June 2011, the Office of Management and Budget made federal websites a target for improving transparency in providing government information, particularly reducing "duplicative" websites that create confusion. OMB’s Jeffrey Zients wrote that, "There are nearly 2,000 top-level Federal .gov domains; within these top-level domains, there are thousands of websites, sub-sites, and micro sites, resulting in an estimated 24,000 websites of varying purpose, design, navigation, usability, and accessibility." A "State of the Web" survey published in December 2011 reported that, "The .gov Web Inventory self-reported 1,489 domains and an estimated 11,013 websites from 56 agencies." This report goes on to describe the terminology used: Domains are registered .gov (or .mil, or even .com as the case may be) names on the Internet (in the form www.agencyname.gov). Most agencies (and some much more than others) use sub-domains that vary from the domain by containing a different root domain (for example, project.agencyname.gov).

While domains are registered through the General Services Administration and easily tracked, sub-domains are not. The term "website" is even more nebulous, described as "hosted content … which has a unique homepage and global navigation." As a result, the .gov website numbers are considered a "general estimate."

It isn’t just the "bigness" of the federal web space that makes the End of Term effort a challenge – there are also variants in how the different branches of the federal government are managed and tracked. The Library of Congress archives the legislative branch websites through a leg branch crawl run on a monthly basis, so for that effort a list of seed URLs (which may be anything from a domain to a sub-domain to a particular website or part of website) for the leg branch is assiduously maintained – in other words, the situation for that branch of the federal government is in good shape. (It doesn’t hurt that it is a relatively small branch of government.) There is no such regular effort organized for the judicial branch sites and they aren’t under GSA or OPM, so a reliable seed list for the judicial branch is not so easy to come by, which is why judicial branch seed URL nominations are a priority for the EOT project.

The Executive branch runs into problems because the OMB lists do not include most .mil, .org, .com, or other top level domain types sometimes used by federal agencies. The executive branch .gov domains are closely tracked and available at data.gov in a list. However, those sub-domains with different roots added to domain names are not tracked here. A crawler scoped to "www.govagency.gov" may not recognize "xyz.govagency.gov" as part of the same site and so won’t capture it; xyz.govagency.gov should therefore have its own seed. It can be particularly important on large sites, such as NASA.gov, to identify these sub-domains as separate seeds.
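
As a rough illustration of the sub-domain problem, the sketch below uses Python's standard urllib.parse module to pull every distinct host under one registered domain out of a list of links, so that each host can become its own seed. This is not the EOT project's nomination tool or any crawler's actual scoping logic; the function name, the sample links and the choice of "http://host/" as the seed form are assumptions made for the example.

```python
from urllib.parse import urlsplit


def seeds_for_hosts(urls, registered_domain):
    """Return one seed URL per distinct host found under registered_domain."""
    hosts = set()
    for url in urls:
        # hostname is already lowercased by urlsplit; guard against None
        host = urlsplit(url).hostname or ""
        if host == registered_domain or host.endswith("." + registered_domain):
            hosts.add(host)
    return sorted("http://" + host + "/" for host in hosts)


# Hypothetical links, e.g. harvested from an agency home page
links = [
    "http://www.govagency.gov/about.html",
    "http://xyz.govagency.gov/reports/2012.pdf",
    "http://XYZ.govagency.gov/data/",
    "http://www.facebook.com/govagency",
]

print(seeds_for_hosts(links, "govagency.gov"))
# ['http://www.govagency.gov/', 'http://xyz.govagency.gov/']
```

Note that the Facebook link is dropped here because it lives on a different domain; a social media presence like that would still need to be nominated and scoped separately.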

Much more common now are social media or quasi social media .com sites where the federal agency represents itself – the State Department, for example, has a "presence" in Facebook, Flickr, Google+, Tumblr, Twitter, and YouTube. All of these can and should be scoped separately.

Complicating things further, federal agencies of all sizes, but particularly smaller bodies, can use third party hosting solutions of varying types. Some House committees use a commercial company to provide their streaming and downloadable video. An example of this is the House Ways & Means Committee's use of Granicus (which is linked to from the House Ways and Means Committee website, along with links to their Facebook, Twitter, and YouTube pages).

When I first began doing some research for this blog post, my impression was that the situation is getting easier as the GSA leads an effort for federal "web reform." However, as one sees the extent of social media and as third party hosting increases, this optimism is likely misplaced.

For now, the End of Term project can use your assistance!

Michael Neubert
Supervisory Digital Projects Specialist
Library of Congress

Irony = Consolidated Federal Funds Report discontinued, Senate to hold hearing on transparency of federal funding

We just posted about the impending doom of the Consolidated Federal Funds Report (CFFR). Well guess what I found in my latest weekly email update from the Project on Government Oversight (POGO)? I found an announcement for a hearing of the Committee on Homeland Security and Governmental Affairs on July 18, 2012 (Location: SD-342) entitled -- get this! -- "Show Me the Money: Improving the Transparency of Federal Spending." It seems to me that the quickest and easiest way to improve the transparency in federal funding is to re-fund the Federal Financial Statistics program and the Consolidated Federal Funds Report (CFFR).

I hope all of our readers -- and especially those from states with Senators sitting on that committee (CT, ME, MI, HI, DE, AR, LA, MO, MT, AK, OK, MA, AZ, WI, OH, KY, KS) -- will contact Senator Joe Lieberman (Committee Chairman) and Senator Susan Collins (Ranking member) and ALL of the other Senators and request that the CFFR be reinstated.

Here's sample email text to copy/paste:


Dear Senator ______________,

I see that the Committee on Homeland Security and Governmental Affairs will be holding a hearing on July 18th entitled "Show Me the Money: Improving the Transparency of Federal Spending." You may be aware that the Census Bureau's Federal Financial Statistics program will be shut down on July 31, 2012 due to budget cuts. This includes the critical publication "Consolidated Federal Funds Report (CFFR)" http://www.census.gov/govs/cffr/. According to the Census Website, the CFFR contains "virtually all Federal expenditures, including grants, loans, direct payments, insurance, procurement, salaries and wages and other awards (such as price supports and research awards). Data represent actual expenditures (or outlays)."

As a government information librarian at _________________________, I can attest that this publication is highly sought after by researchers, faculty, students, and the public looking into federal spending. Reinstating the Federal Financial Statistics Program and continuing publication of the CFFR would be a very large step in the right direction toward greater transparency in federal funding -- which I believe is the goal of this upcoming hearing.

Thank you for your attention to the important issue of government transparency and responsible spending.

Sincerely,

NAME
CONTACT