I decided to play around a little more with ScraperWiki this week to see what people were talking about when they talked about big data on Twitter. The idea was to kill two birds with one stone: (1) demonstrate once again what’s possible in the realm of data visualization and mining even for novices using free online tools, and (2) give a little taste of what got people excited in the past seven days.
There are scientific studies and then there are collections of numbers, words and charts that purport to say something. This is decidedly the latter, but I really just wanted to see what types of stuff I could do with the data. If it’s at all interesting or useful, let me know. Maybe it can become a weekly thing.
Without further ado, here are some highlights of what I found, based on a sample of just over 33,000 tweets mentioning “big data.”
Here are the general stats from ScraperWiki, showing the number of tweets collected, the most-mentioned users, screen names (read “tweeters”) and hashtags, and other info. (Click on any of the images to see a larger version.)
Only a handful of the tweets were geotagged, but you can see how popular a topic big data is around the globe.
Next, I uploaded the data to IBM ManyEyes and eliminated obvious or irrelevant terms like “big,” “data” and “RT.” Apparently, people care a lot about analytics.
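If you'd rather do that filtering step yourself, here is a minimal sketch of the same idea in Python. To be clear, this is not what ManyEyes does under the hood; the sample tweets, the `STOPWORDS` set and the `term_counts` function are all made up for illustration, and the tokenizing is deliberately crude.

```python
import re
from collections import Counter

# Hypothetical sample tweets; the real data came from a ScraperWiki export.
tweets = [
    "RT @example: Big data analytics is changing retail",
    "Big data and analytics go hand in hand",
    "Hadoop is the backbone of big data",
]

# Terms to drop, mirroring the "obvious or irrelevant" words named above.
STOPWORDS = {"big", "data", "rt"}


def term_counts(texts, stopwords=STOPWORDS):
    """Count words across tweets, skipping stopwords, @mentions and #hashtags."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep alphabetic tokens that do not follow '@' or '#'.
        words = re.findall(r"(?<![@#\w])[a-z']+", text.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts


counts = term_counts(tweets)
print(counts.most_common(5))
```

In practice you would probably extend the stopword set with common English words ("is," "the," "and") before drawing any conclusions from the counts.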
This phrase net from ManyEyes is pretty cool, too. It’s showing the most common three-word phrases where “is” is the middle word. As you can see, people think data is doing a lot of things.
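For the curious, the core of a phrase net can be approximated with a regular expression that captures the word on either side of a connector. This is only a rough sketch under my own assumptions (the `phrase_net` function and the sample sentences are hypothetical, and ManyEyes surely does more work on tokenizing and layout), but it shows the idea.

```python
import re
from collections import Counter


def phrase_net(texts, connector="is"):
    """Count (left, right) word pairs joined by the connector word."""
    pattern = re.compile(
        r"\b([a-z']+)\s+" + re.escape(connector) + r"\s+([a-z']+)\b"
    )
    pairs = Counter()
    for text in texts:
        # findall returns one (left, right) tuple per non-overlapping match.
        pairs.update(pattern.findall(text.lower()))
    return pairs


# Hypothetical sample text, not taken from the scraped tweets.
sample = [
    "Data is changing everything",
    "data is changing business",
    "Hadoop is hard",
]
pairs = phrase_net(sample)
print(pairs.most_common(3))
```

Passing a different connector, such as `connector="and"`, gives the same kind of pairing for other linking words. Note that matches don't overlap, so in a chain like "a is b is c" only the first pair is counted.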
Bill Gates had the most retweets in the dataset: 382 retweets of a single tweet announcing his talk at the Microsoft Research Faculty Summit, which we covered here. Twitter says it has been retweeted a total of 682 times. I went to Tableau Public for this because it can handle a lot of rows and can show records for everyone mentioned in the data, not just the top few, as the ScraperWiki summary does.
Here’s the tweet.
— Bill Gates (@BillGates) July 15, 2013
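A retweet tally like the one above can also be sketched in a few lines of Python instead of Tableau. This version assumes retweets show up in the scraped text in the old "RT @user: ..." style, which may not hold for every export; the `retweet_counts` function and the sample rows are hypothetical.

```python
import re
from collections import Counter

# Matches manual-style retweets such as "RT @user: original text".
RT_PATTERN = re.compile(r"^RT @(\w+):\s*(.*)$", re.DOTALL)


def retweet_counts(texts):
    """Count how often each (user, original tweet) pair was retweeted."""
    counts = Counter()
    for text in texts:
        match = RT_PATTERN.match(text)
        if match:
            user, original = match.groups()
            counts[(user, original.strip())] += 1
    return counts


# Hypothetical rows standing in for the scraped tweets.
rows = [
    "RT @alice: hello world",
    "RT @alice: hello world",
    "RT @bob: hi",
    "plain tweet",
]
counts = retweet_counts(rows)
print(counts.most_common(1))
```

Because the sample only covers tweets caught by the scrape, a count produced this way will usually come in below the total Twitter itself reports.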
Mashable also had a rather popular post highlighting “5 big data projects that could impact your life.” Just filtering by Twitter accounts will miss mentions that don’t include the source, but a tool like ManyEyes lets you easily search by title (including variations on it).
Then, I decided to expand the scope to include “Hadoop,” “machine learning” and “data science OR data scientist.” Each of these topics had a much smaller sample size (these scrapes ran for a much shorter time, and I assume there’s just a smaller number of people talking about them), but here are some highlights from each.
@BigDataBorat is the undisputed king of Hadoop retweets, and his tweets apparently have no shelf life. Here was his most popular one this week.
If want learn Spring Framework for Hadoop I suggest start with AbstractSingletonHadoopClusterProxyFactoryBuilderBean.
— Big Data Borat (@BigDataBorat) July 18, 2013
I thought this phrase cloud was pretty cool, too. It highlights Hadoop’s place in the big data stack, with the connector being “and”.
A handful of news items really dominated the machine learning discussion, namely: Ayasdi’s funding, Cloudera buying Myrrix, the new BloomReach mobile service and a story about recording the sounds of endangered species using iPods.
People do seem to think highly of machine learning, though. Here’s a phrase net again, with “is” as the connector.
People seemed to like this news from Joe Hellerstein, too.
Oh baby MADlib v1.0 is out! Serious parallel machine learning 4 SQL. Open Source industry+academia deep goodness. http://t.co/lFG5B7Vyyf
— Joe Hellerstein (@joe_hellerstein) July 12, 2013
After eliminating all the obvious terms, this word cloud from ManyEyes suggests that the University of California, Berkeley’s decision to offer an online graduate degree in data science was a big deal.
As was Unix. Why? Because Tim O’Reilly still loves the Unix command line.
— Tim O’Reilly (@timoreilly) July 16, 2013
And because GrubHub “data nerd” (his term) Greg Reda wrote a post titled “Useful Unix commands for data science.”
In terms of sheer retweet volume, data mining website KDnuggets generated the most, spread across a handful of tweets. This tweet (unattached to news, a blog post or Tim O’Reilly) just struck a nerve.
Ten years ago: I’m grepping this file and aggregating stuff with awk and sort, 20 mins and we can get lunch. Today: “I’m doing data science”
— Salvatore Sanfilippo (@antirez) July 15, 2013
And, finally, this one points to a rather insightful post about data science from Jetpac CTO Pete Warden.
Why you should never trust a data scientist – http://t.co/6eXCgVlc2y
— Pete Warden (@petewarden) July 18, 2013