Is Infochimps' Aggregated Data a Boon to Researchers or a Privacy Nightmare?

A pair of slices from a massive scrape of Twitter’s API could be of great use to programmers and researchers alike, as long as users don’t mind. The company behind the mining effort, Infochimps, is trying to demonstrate and promote its data aggregation service while offering up some useful information to interested parties.

At the end of last year, Infochimps posted a heftier version of its Twitter scrape, which was taken down at the behest of the micro-messaging site over user privacy concerns. By releasing curated, anonymized chunks of data this time, the company may avoid most of the objections that arose last time around. Then again, it may not.

One of the sets, a “token count,” tallies how many times particular tokens (individual hashtags, smileys and URLs) have been tweeted since March 2006. The other maps user IDs between Twitter’s Search API and the standard Twitter API. The two APIs issue different ID numbers to the same user, which makes it annoying, if not impossible, for developers to tie data from both services to a single account.
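For readers curious what a “token count” involves, here is a minimal Python sketch of the idea. It is not Infochimps’ method: the regular expressions, the `count_tokens` function and the (month, text) input format are all assumptions made for illustration, and real tweets would need far more careful tokenization.

```python
import re
from collections import Counter

# Hypothetical token patterns; Infochimps' actual rules are not public in this article.
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")
SMILEY = re.compile(r"[:;=]-?[)(DPp]")


def count_tokens(tweets):
    """Tally URLs, hashtags and smileys per month.

    `tweets` is an iterable of (month, text) pairs, e.g. ("2009-11", "hi :)").
    Returns a Counter keyed by (month, token_type, token).
    """
    counts = Counter()
    for month, text in tweets:
        # Count URLs first, then strip them so URL fragments such as
        # "#anchor" are not mistaken for hashtags.
        for url in URL.findall(text):
            counts[(month, "url", url)] += 1
        remainder = URL.sub(" ", text)

        # Hashtags are lowercased so "#Python" and "#python" are one token.
        for tag in HASHTAG.findall(remainder):
            counts[(month, "hashtag", tag.lower())] += 1

        for smiley in SMILEY.findall(remainder):
            counts[(month, "smiley", smiley)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        ("2009-11", "Loving #Python :) http://example.com/a"),
        ("2009-11", "more #python :-) http://example.com/a"),
        ("2009-12", "#data ;)"),
    ]
    for key, n in sorted(count_tokens(sample).items()):
        print(key, n)
```

Because the tallies are keyed by month and token rather than by user, output like this carries no account names, which is roughly the kind of aggregation the company is counting on to sidestep privacy complaints.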

Infochimps says it hopes “to send a signal that this data is valuable and useful to real-time search engines, Twitter apps, and social media researchers.” It also hopes to “start a conversation about where value really lies in this type of data, [and] the various ownership and privacy issues that arise.” Given Twitter’s complaints the first time data was posted, Infochimps is smart to state these aims up front and to thoroughly anonymize the data. The company very much wants to avoid any ill will or backlash from the Twitterati over the release of the data sets. There is precedent for concern: back in 2006, AOL Research released 20 million search keywords attached to user IDs for researchers to use. Several individuals were identified from the “anonymized” data, raising questions about what sorts of data are kosher to release.

Ownership and privacy aside, Infochimps is offering the “tokens” data set broken out by month for free and charging $9,500 for a version broken out by hour. The “ID/API mapping” data set is being offered for $6,000.