THOMAS bulk download!

Eric Mill announced today on the openhouseproject mailing list that he, Josh Tauberer (of GovTrack.us), and Derek Willis have reached a milestone in their project to produce a public domain scraper and dataset from THOMAS.gov. Here is the text of his message, with links:

Hi all,

I’ve been working for the last month or two with Josh Tauberer (of GovTrack.us, http://govtrack.us/) and Derek Willis on a project to produce a public domain scraper and dataset from THOMAS.gov (http://thomas.gov/), the official source for legislative information for the US Congress.

It’s a reasonably well-documented set of Python scripts, which you can find here: https://github.com/unitedstates/congress

We just hit a great milestone – it gets everything important that THOMAS has on bills, back to the year THOMAS starts (1973). We’ve published and documented all of this data in bulk (https://github.com/unitedstates/congress/wiki), and I’ve worked it into Sunlight’s pipeline, so that searches for bills in Scout (https://scout.sunlightfoundation.com/search/federal_bills/freedom%20of%20information) use data collected directly from this effort.

The data and code are all hosted on GitHub under a “unitedstates” organization (https://github.com/unitedstates/), which is right now co-owned by me, Josh, and Derek – the intent is to have this all exist in a common space. To the extent that the code needs a license at all, I’m using a public domain “unlicense” (https://github.com/unitedstates/congress/blob/master/LICENSE) that should at least be sufficient for the US (other suggestions welcome).

There’s other great stuff in this organization, too – Josh made an amazing donation of his legislator dataset (https://github.com/unitedstates/congress-legislators), and converted it to YAML for easy reuse. I’ve worked that dataset into Sunlight’s products already as well. I’ve also moved my legal citation extractor (https://github.com/unitedstates/citation) into this organization, and my colleague Thom Neale has an in-progress parser for the US Code (https://github.com/unitedstates/uscode), to convert it from binary typesetting codes into JSON.

GitHub’s organization structure actually makes possible a very neat commons. I’m hoping this model proves useful, both for us and for the public.

— Eric

— Developer | sunlightfoundation.com
