Large web and social media datasets like Twitter’s — which the company is now opening up to a select group of researchers — are the basis of a tension that has been building for years between social scientists who want such data and the companies that keep it under lock and key. That Twitter is willing to open hundreds of billions of tweets up for academic research is great news, but its decision to limit the number of grant recipients highlights the problem.
The promise of unadulterated access to a dataset like Twitter’s is already evident. Back in 2012, we reported on some university researchers who scoured publicly available tweets to identify bullies, their victims and their methods. On Thursday, MIT Technology Review reported on a study that identified clusters of individuals who share content about the Syrian civil war on Twitter and YouTube. These are two of probably thousands of studies that have been conducted on social media data since Facebook first took hold several years ago, none of which, presumably, had access to data beyond what its authors were able to scrape or otherwise collect online.
For social scientists and other researchers, access to the entire body of Twitter’s historical data might be even more valuable than the streaming data that so many companies covet to do everything from identifying breaking news to monitoring their brand presence. Researchers would benefit greatly from being able to analyze the ways people communicate online, the language they use, where they live, the times of day they’re active and everything else that metadata can tell them. Viewed in the light of existing research into particular topics, social media data could let them uncover new insights, bolster existing theories or perhaps disprove them.
Just to get a small sense of how much hate is flowing over Twitter, I used (Structure Data award winner) ScraperWiki before bed on Wednesday night to search for tweets using the word “kill.” When I woke up, there were more than 30,000 from just a two-hour period, and I was pretty easily able to discern some trends about who’s using the word and how lightly we throw around death threats on social media (and how much people love/hate Flappy Bird).
Just imagine what an actual researcher could do with hundreds of billions of tweets (and all the metadata that comes with them) and some serious data science techniques. Former Yahoo Chief Data Officer Usama Fayyad has told me that the data was how he used to recruit talent to Yahoo Labs: people were excited to get to work with so much of it, and of such good quality.
For the companies that gather this data, though, granting broad access outside the company probably isn’t too desirable. There’s the value of the data to their businesses, of course, in being the only ones with access to these troves of insight into human behavior, language and relationships. In the case of Twitter, there are also deals with partners such as DataSift and Gnip around providing commercial access to the data. Apple seems to have wanted Twitter data so badly that it bought Topsy and its firehose access for $200 million.
There’s also the privacy aspect, although that concern might be more legitimate for a company like Facebook, which keeps so much behind a login, than for Twitter, where most tweets are public anyhow. Anonymization hasn’t proven very effective when companies have tried to release potentially valuable data sets, and there’s always the question (with Twitter, for example, as well as with mugshot sites) of whether it’s fair to gather such data in aggregate just because it’s possible.
Perhaps FTC Commissioner Julie Brill can shed some light on this during her talk with Gigaom legal expert Jeff John Roberts at our Structure Data conference in March.
It does seem like something has to give at some point, though. There are data grants like the one Twitter is offering, and Google and Facebook have both worked with researchers and taken their own steps to combat specific problems over the years, but there’s so much more to be learned. We talk a lot about the value of open data, and yet some of the most valuable stuff is kept largely secret.
Companies don’t have much commercial incentive to analyze their data in ways that won’t make them money, and we shouldn’t expect them to. But it would be great if we could find a way to open up more of that data to people who will. There’s a lot left to learn about Twitter use besides what people think about the latest awards show or which boroughs tweet the most.