I love my Data Is Plural newsletter, Jeremy Singer-Vine’s weekly newsletter of useful/curious datasets! You can check out his archive from 2015(!) to present and also explore the archive as a google spreadsheet or as Markdown files (a dataset of interesting datasets :-)).
Today’s edition was especially good on the govinfo front: 2 really awesome datasets on House Committee witnesses (1971 – 2016) and a high-resolution transcription and CSV file of the Census Bureau’s 1900 report on immigrant populations. Check them out, and don’t forget to subscribe to the Data is plural newsletter!
House committee witnesses. Political scientists Lauren C. Bell and J.D. Rackey have compiled a spreadsheet of 435,000+ people testifying before the US House of Representatives from 1971 to 2016. They began with a text file scraped from a ProQuest database, provided by the authors of a dataset that focused on social scientists’ testimony (DIP 2020.12.23). Then, they determined each witness’s first and last name; type of organization; the committee, date, title, and summary of the relevant hearing; and more.
Immigrant populations in 1900. The 1900 US Census’s public report includes a table counting the foreign-born residents of each state and territory — overall and disaggregated into a few dozen origins, which range from subdivisions of countries (Poland is split into “Austrian,” “German,” “Russian,” and “unknown” columns) to entire continents (“Africa”). It’s officially available as a low-resolution PDF. Reporters at Stacker, however, recently transcribed it into a CSV file for easier use.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
I forgot to mention, but the Census report is an awesome example of the interest in and need for more than simple PDF documents. MANY government reports include amazing tabular statistics that are virtually impossible to find and access unless you happen to hit on them as you’re reading. I’d love to create a database of PDF documents where the tables are extracted as .CSV and indexed/findable. The relatively new subscription database called PolicyCommons has started to extract tabular data and make it searchable. Another example technology is from ZanRan