Tag Archives: data mining
Hot off the presses from the National Academies is this prepublication version of the report Frontiers in Massive Data Analysis. This is a really nice survey of much of the state of the art and current issues involved in “big data.” Govt information librarians owe it to themselves to become well-versed, as more and more researchers across many disciplines will become interested in govt information as a corpus for larger-scale analysis (I’m already getting questions about corpus research!).
From Facebook to Google searches to bookmarking a webpage in our browsers, today’s society generates an enormous amount of data. Some internet-based companies such as Yahoo! are even storing exabytes (10^18 bytes) of data. Like these companies and the rest of the world, scientific communities are also generating large amounts of data, mostly terabytes and in some cases approaching petabytes, from experiments, observations, and numerical simulation. Indeed, the scientific community, along with the defense enterprise, has been a leader in generating and using large data sets for many years. The issue that arises with this new scale of data is how to handle it: sharing the data, enabling data security, working with different data formats and structures, dealing with highly distributed data sources, and more.
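To put those prefixes in perspective, here is a quick back-of-the-envelope check of the byte scales mentioned above (decimal SI prefixes; the figures are illustrative, not drawn from the report):

```python
# Decimal (SI) byte scales mentioned above.
terabyte = 10**12
petabyte = 10**15
exabyte = 10**18  # "10 to the 18 bytes"

# An exabyte is a thousand petabytes, or a million terabytes --
# the gap between a typical lab's output and Yahoo!-scale storage.
print(exabyte // petabyte)  # 1000
print(exabyte // terabyte)  # 1000000
```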
Frontiers in Massive Data Analysis presents the Committee on the Analysis of Massive Data’s work to make sense of the current state of data analysis for mining massive data sets, to identify gaps in current practice, and to develop methods to fill those gaps. The committee thus examines the frontiers of research enabling the analysis of massive data, including data representation and methods for keeping humans in the data-analysis loop. The report includes the committee’s recommendations, details concerning the types of data that build into massive data, and a discussion of the seven computational giants of massive data analysis.
DataFerrett (Federated Electronic Research, Review, Extraction, and Tabulation Tool) is a free data mining and extraction tool developed by the U.S. Census Bureau that allows users to search, browse, combine, tabulate, recode, and analyze statistical data from a network of online data libraries. The DataFerrett software can be downloaded from the website or run in the browser via a Java applet.
Available data sets included:
- American Community Survey (ACS)
- American Housing Survey (AHS)
- Behavioral Risk Factor Surveillance System (BRFSS)
- Consumer Expenditure Survey (CES)
- County Business Patterns (CBP)
- Current Population Survey (CPS)
- Decennial Census of Population and Housing
- Harvard-MIT Data Center Collection
- Home Mortgage Disclosure Act (HMDA)
- Local Employment Dynamics (LED)
- National Ambulatory Medical Care Survey (NAMCS)
- National Center for Health Statistics Mortality (MORT)
- National Health and Nutrition Examination Survey (NHANES)
- National Health Interview Survey (NHIS)
- National Hospital Ambulatory Medical Care Survey (NHAMCS)
- National Survey of Fishing, Hunting, and Wildlife (FHWAR)
- Small Area Income and Poverty Estimates (SAIPE)
- Social Security Administration (SSA)
- Survey of Income and Program Participation (SIPP)
- Survey of Program Dynamics (SPD)
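DataFerrett itself is a point-and-click tool, but the recode-and-tabulate workflow it supports can be sketched in a few lines of Python. The survey records below are invented for illustration; they stand in for an extract from one of the data sets listed above:

```python
from collections import Counter

# Hypothetical survey microdata standing in for a DataFerrett extract
# (states and ages are made up for illustration).
people = [
    {"state": "CA", "age": 23},
    {"state": "CA", "age": 67},
    {"state": "NY", "age": 45},
    {"state": "NY", "age": 12},
    {"state": "TX", "age": 34},
    {"state": "TX", "age": 71},
]

def age_group(age):
    """'Recode' a continuous variable into categories."""
    if age < 18:
        return "under 18"
    if age < 65:
        return "18-64"
    return "65+"

# 'Tabulate': count records by (state, age group).
table = Counter((p["state"], age_group(p["age"])) for p in people)
for (state, group), n in sorted(table.items()):
    print(state, group, n)
```

The recode step (binning a continuous variable) and the cross-tabulation are the core of what DataFerrett does interactively, at the scale of full survey files rather than six records.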
DataFerrett is a wonderful tool for exploring and analyzing data. Enjoy!
(found via Open Access News)
A few weeks back, we posted a story about an Atlantic article from November 1967 called “The National Data Center and Personal Privacy,” which discussed the idea of a National Data Center, the precursor to Total Information Awareness. It was such a hot topic of the day that Congress held a hearing on computers and the invasion of privacy of US citizens (The computer and invasion of privacy. Hearings, Eighty-ninth Congress, second session. July 26, 27, and 28, 1966. United States. Congress. House. Committee on Government Operations. Special Subcommittee on Invasion of Privacy.)
I started reading the hearing, and found that Yale Economics Professor Richard Ruggles (NYT obituary from 2001) had also testified before that hearing. So I started poking around about Ruggles, looking in WorldCat and Google Scholar. I found quite a few citations to a document entitled, Report of the Committee on the Preservation and Use of Economic Data submitted to the Social Science Research Council in 1965.
But for such a well-cited document that spawned a Congressional hearing and much worry in the mainstream press about computers and privacy, there were only 3 libraries in the whole country that held the report. Imagine that!
Well, I decided to liberate the report, so — after much finagling! — got a copy, scanned it, and uploaded it to the Internet Archive. Score one for the digital public domain!!
I hope to see more libraries listed as having a copy in WorldCat in the near future. And if you’ve got any fugitive documents lying around your hard drive, send them to us here at admin AT freegovinfo DOT info. We’ll make sure they get up on the open Web safe and secure in the Internet Archive!!
Thanks to Docuticker for pointing out this new Congressional Research Report on Federal data mining efforts:
Data Mining and Homeland Security: An Overview (PDF; 231 KB)
Source: Congressional Research Service (via Federation of American Scientists)
Aside from cataloging currently known data mining efforts by the federal government, the report identifies four areas of concern:
As with other aspects of data mining, while technological capabilities are important, there are other implementation and oversight issues that can influence the success of a project’s outcome. One issue is data quality, which refers to the accuracy and completeness of the data being analyzed. A second issue is the interoperability of the data mining software and databases being used by different agencies. A third issue is mission creep, or the use of data for purposes other than those for which the data were originally collected. A fourth issue is privacy. Questions that may be considered include the degree to which government agencies should use and mix commercial data with government data, whether data sources are being used for purposes other than those for which they were originally designed, and the possible application of the Privacy Act to these initiatives.
I’ve heard people say that data mining by the government is no big deal since advertisers and other corporate interests do it all the time in efforts to focus marketing and improve profits. If we don’t have privacy from corporate types, why should the government worry us? Because ratty data used by a marketer might result in a bald man getting shampoo ads, but when the government relies on ratty data for law enforcement, innocent people can get jailed or harassed.
Hopefully Congress will be more vigilant on this issue.
I have a bunch of tabs open of boingboing posts that I want to share, but it’s been such a hectic day (I invited Rick Falkvinge of the Swedish Pirate Party to give a talk today at my library!) that I think I’ll just list them and let you all sort them out.
- Peer to Patent: keeping the Patent Office honest with community review
- Amazon will distribute the US National Archive on DVD
- NY Public Library giving away free public domain books-on-demand
- Pirate Party founder at Stanford (I’ll post the video soon. W00t!)
- Bruce Schneier interviews TSA head Kip Hawley
- Data mining prompted fight over NSA domestic spying program (here’s a login-free link to the NYT article)
Now if THIS doesn’t convince you that a) blogs are incredibly useful tools for disseminating information and b) boingboing should be read several times a day as a matter of course, then I don’t know what will. Happy reading 🙂