Data Scraping Comes of Age With ScraperWiki.com
A scrappy company that helps journalists dig into Big Data has come into its own in the past year, complete with the requisite all-night codeathon this week at the Investigative Reporters and Editors Computer-Assisted Reporting Conference in St. Louis. The company is called ScraperWiki.com and was started by Julian Todd and Aidan McGuire, two U.K.-based analysts who have long been involved in opening up government data to the public.
Take a look at the data mined from UN peacekeeping troop levels as one example of what you can do. It is really like the Wild West of data visualization. Todd says in one blog post about his own data scraping efforts, "Look, you have just got all this way starting from nothing, from finding something out in the world, to recognizing its potential, all the way to pulling in and transforming the original raw data and struggling for a way to analyze it."
If you are interested in writing your own data scraping routines, you can watch several how-to screencasts on the ScraperWiki site. You can program in PHP, Python, or Ruby. Most of the time you are going to need some SQL to work your way around these data sets. At the St. Louis conference, work began on scraping various public data sets, such as those on US federal prisoners and FDA drug and food recalls.
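To give a feel for what such a routine involves, here is a minimal sketch in Python using only the standard library. It is not ScraperWiki's own API: the sample page, the `recalls` table, and the function names are all made up for illustration, and the rows are placeholder values rather than real recall data. It pulls the cells out of an HTML table, stores them in an in-memory SQLite database, and then uses a short SQL query to total the units by category.

```python
import sqlite3
from html.parser import HTMLParser

# Made-up sample page standing in for a public data table (not real recall data).
SAMPLE_HTML = """
<table>
  <tr><th>Product</th><th>Category</th><th>Units</th></tr>
  <tr><td>Example Cereal</td><td>food</td><td>1200</td></tr>
  <tr><td>Example Tablets</td><td>drug</td><td>300</td></tr>
  <tr><td>Example Soup</td><td>food</td><td>450</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished data rows
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_rows(html):
    """Return (product, category, units) tuples parsed from the HTML table."""
    parser = TableParser()
    parser.feed(html)
    return [(product, category, int(units)) for product, category, units in parser.rows]


def load_and_summarize(rows):
    """Load rows into a hypothetical 'recalls' table and total units per category."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE recalls (product TEXT, category TEXT, units INTEGER)")
    conn.executemany("INSERT INTO recalls VALUES (?, ?, ?)", rows)
    totals = conn.execute(
        "SELECT category, SUM(units) FROM recalls GROUP BY category ORDER BY category"
    ).fetchall()
    conn.close()
    return totals


if __name__ == "__main__":
    print(load_and_summarize(scrape_rows(SAMPLE_HTML)))
```

A real scraper would fetch the page over the network and cope with messier markup, but the shape is usually the same: parse, store, then query with SQL.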
IRE.org also has a collection of databases, such as ones on environmental data and campaign spending, but these are available only to member journalists.
There are even bounties to be had (not much, a couple hundred bucks) if you write your own data scraping tool and make it available as part of the Open Corporates effort.
Clearly, as more data becomes available online, scraping apps abound. But part of the problem is that journalists don't necessarily know SQL, let alone Ruby, or where to find these treasure troves. That is where the conference and this week's codeathon come in handy: dozens of folks learned how to take a stab at these visualizations as part of their reporting jobs. We're glad to see this happening!