Tag Archives: thomas
Treaties, Text, and Timely Updates – Congress.gov Spring Cleaning, by Andrew Weber, In Custodia Legis (Library of Congress Law Librarians blog) (March 25, 2015).
Since introducing Congress.gov in September 2012, we have continued to add the databases from THOMAS to the new system. We launched with legislation, followed soon thereafter by the Congressional Record, Committee Reports, and nominations. Today, we are releasing treaty documents. You can select “All Sources” and search across all of these data sets at once, something that was not possible on THOMAS. With this, all of the data sets in the left hand navigation of THOMAS are included in Congress.gov. We have one more data set that was on the legacy system to add, Senate Executive Communications.
Congress.gov will be the sole source for texts of pending and passed legislation, committee reports, congressional floor speeches and cost estimates from the Congressional Budget Office beginning Nov. 19, the Library of Congress announced on Friday.
Three open government access advocates (Sunlight Foundation developer Eric Mill, GovTrack.us founder Josh Tauberer and New York Times developer Derek Willis) have put the United States Code on Github.
- The United States (Code) is on Github, by Alex Howard, O’Reilly Radar (December 6, 2012).
This fall, a trio of open government developers took it upon themselves to do what custodians of the U.S. Code and laws in the Library of Congress could have done years ago: published data and scrapers for legislation in Congress from THOMAS.gov in the public domain. The data at github.com/unitedstates is published using an “unlicense” and updated nightly.
…“It would be fantastic if the relevant bodies published this data themselves and made these datasets and scrapers unnecessary,” said Mill, in an email interview. “It would increase the information’s accuracy and timeliness, and probably its breadth. It would certainly save us a lot of work!”
Perhaps even more importantly, the project has released its computer code so that others will be able to scrape THOMAS to build their own datasets of legislative data. The computer code also includes a U.S. Code parser, which is significant because none of the various formats in which the government produces the U.S. Code is suitable for easy reuse.
I also think it is fantastic that these developers understand the difference between putting information on the web in various hard-to-use, hard-to-preserve, and often hard-to-parse formats and actually publishing the data so that it can be easily obtained, used, and re-used. As Mill notes, publishing information makes scraping the web unnecessary, and publishing in open formats makes it much simpler to preserve information.
Eric Mill announced today on the openhouseproject mailing list that he and Josh Tauberer (of GovTrack.us) and Derek Willis have completed a milestone in their project to produce a public domain scraper and dataset from THOMAS.gov. Here is the text of his message with links:
I’ve been working for the last month or two with Josh Tauberer (of GovTrack.us, http://govtrack.us/) and Derek Willis on a project to produce a public domain scraper and dataset from THOMAS.gov (http://thomas.gov/), the official source for legislative information for the US Congress.
It’s a reasonably well documented set of Python scripts, which you can find here: https://github.com/unitedstates/congress
We just hit a great milestone – it gets everything important that THOMAS has on bills, back to the year THOMAS starts (1973). We’ve published and documented all of this data in bulk (https://github.com/unitedstates/congress/wiki), and I’ve worked it into Sunlight’s pipeline, so that searches for bills in Scout (https://scout.sunlightfoundation.com/search/federal_bills/freedom%20of%20information) use data collected directly from this effort.
The data and code are all hosted on Github under a “unitedstates” organization (https://github.com/unitedstates/), which is right now co-owned by me, Josh, and Derek – the intent is to have this all exist in a common space. To the extent that the code needs a license at all, I’m using a public domain “unlicense” (https://github.com/unitedstates/congress/blob/master/LICENSE) that should at least be sufficient for the US (other suggestions welcome).
There’s other great stuff in this organization, too – Josh made an amazing donation of his legislator dataset (https://github.com/unitedstates/congress-legislators), and converted it to YAML for easy reuse. I’ve worked that dataset into Sunlight’s products already as well. I’ve also moved my legal citation extractor (https://github.com/unitedstates/citation) into this organization – and my colleague Thom Neale has an in-progress parser for the US Code (https://github.com/unitedstates/uscode), to convert it from binary typesetting codes into JSON.
Github’s organization structure actually makes possible a very neat commons. I’m hoping this model proves useful, both for us and for the public.
— Developer | sunlightfoundation.com
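To give a rough sense of what a legal citation extractor like the one mentioned in the message does, here is a minimal Python sketch. It does not reproduce that project's code or interface: the pattern, the function name `find_usc_citations`, and the output fields are all assumptions made up for illustration, and the pattern only handles simple U.S. Code citations such as "5 U.S.C. 552(b)(3)".

```python
import re

# Hypothetical pattern (an assumption for illustration, not the project's own):
# a title number, "U.S.C." or "USC", an optional section sign, a section
# number, and any trailing parenthesized subsections.
USC_PATTERN = re.compile(
    r"(?P<title>\d+)\s+U\.?S\.?C\.?\s+(?:§+\s*)?"
    r"(?P<section>\d+[a-z]?)"
    r"(?P<subsections>(?:\([0-9A-Za-z]+\))*)"
)


def find_usc_citations(text):
    """Return one dict per U.S. Code citation found in the text."""
    results = []
    for m in USC_PATTERN.finditer(text):
        results.append({
            "title": m.group("title"),
            "section": m.group("section"),
            # Split "(b)(3)" into ["b", "3"]; empty list if none were cited.
            "subsections": re.findall(r"\(([0-9A-Za-z]+)\)", m.group("subsections")),
            "match": m.group(0),
        })
    return results


if __name__ == "__main__":
    sample = "See 5 U.S.C. 552(b)(3) and 44 USC § 3501."
    for citation in find_usc_citations(sample):
        print(citation)
```

A real extractor would need to cope with many more citation styles (ranges, "et seq.", public laws, the Code of Federal Regulations), which is presumably why a shared, maintained tool is more useful than a one-off pattern like this.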