
Summary:

The big data skeptics have been getting louder over the past few months, but the message doesn’t resonate too loudly. No one said big data was perfect data.

There has been a backlash lately against big data. From O’Reilly Media to the New Yorker, from Nassim Taleb to Kate Crawford, everyone is treating big data like a piñata. Gartner has dropped it into the “trough of disillusionment.” I call B.S. on all of it.

It might be provocative to call into question one of the hottest tech movements in generations, but it’s not really fair. That’s because how companies and people benefit from big data, data science or whatever else they choose to call the movement toward a data-centric world is directly related to what they expect going in. Arguing that big data isn’t all it’s cracked up to be is a strawman, pure and simple — because no one should think it’s magic to begin with.

Correlation versus causation versus “what’s good enough for the job”

One of the biggest complaints (or, in some cases, proposed facts) about big data is that it relies more on correlation than causation in order to find its vaunted insights. To the extent that’s true, it’s a fair criticism. Only I’m not certain how often it’s true for things that really matter.

Honestly, for song or product recommendations, who really cares?

But in areas like medicine, finance and even marketing, people are becoming much more concerned with finding out “why” once they’ve found out “what.” If you’re a police department trying to figure out a strategy for stopping people on the street, for example, even a strong correlation between race and certain crimes probably won’t be enough to justify harassing minorities. Oncologists might benefit from seeing the similarities among cells in a biopsy, but targeting certain markers doesn’t guarantee you can cure someone’s cancer.

Or if you’re a retail store, knowing that Mac users who visit your site tend to buy more-expensive products might make you want to show them more-expensive products. Some deeper digging — perhaps even via direct questions — would show they’re really concerned with craftsmanship. The more you learn beyond what a clustering algorithm can tell you, the better you can connect with customers.

This is why some people call the process of asking interesting questions of data “exploratory analytics.” Data analysts can send out a virtual Christopher Columbus to see what’s doing inside their data. If they find something potentially valuable, they dig in further. Correlations are just a notice that there might be something worth looking at here.


Clusters show where oncologists should start investigating. Source: Columbia University

And even in the realm of machine learning — where algorithms are tearing through datasets trying to discover complex patterns humans could never spot — very few people are seriously suggesting we take the machines at their word. In case after case after case, the story is the same: machines do the heavy lifting but humans still play critical roles in training the models by correcting mistakes or adding judgment into an otherwise entirely logical process.

Web data is only part of big data

There’s another idea floating around, too, which is that web-derived data, be it from social media, search queries or some other place, is somehow synonymous with big data. Critics are quick to point out that there are biases in this type of data and that we shouldn’t abandon traditional methods of qualitative, non-digital research in favor of methods utilizing this fast, easy web data. Of course these critics are right.

But who is really suggesting we do away with traditional forms of research? Social media data shouldn’t usurp traditional customer service or market research data that’s still useful, nor should the Centers for Disease Control start relying on Google Flu Trends at the expense of traditional flu-tracking methodologies. Web and social data are just one more source of data to factor into decisions, albeit a potentially voluminous and high-velocity one.

Even if they’re biased or perhaps even slightly misleading, though, these new data types are still valuable, even for social science research. They are a source of new, large, and arguably unfiltered insights into attitudes and behaviors that were previously difficult to track in the wild. I’m thinking of the researchers who identified new insights into bullying by studying Twitter activity, and of those who have mapped racist tweets across the United States.


Floating Sheep’s Hate Map

The drawbacks should be pretty easy to overcome. Demographic or other biases might be relatively easy to spot when information is also tagged with geodata and perhaps profile information, for example. And assuming the data is mostly indicative of macro trends, there’s definitely value in being able to track it by the day, hour or minute and see trends shaping up in something far closer to real time than traditional research methods would allow.

It’s not all about insights

Which brings me to another point, this one about the idea that big data is all about finding out new things through exploration. Sure, that can be the case if you’re starting to analyze entirely new data sources (like social media data) or using entirely new techniques, and it’s a very compelling reason to get started down the big data path. But sometimes big data is just about automation.

Technologies like Hadoop, for example, aren’t designed to write you better models; they’re designed to process a lot more data a lot faster. If your models still work, Hadoop should help you run them better against a much larger dataset. That might lead to more accurate models and faster answers, but it won’t necessarily lead to some “aha” moment, like the realization that you’ve been doing business all wrong for all these years.

If you’re a law firm, analyzing e-discovery files faster and more accurately might be reward enough in itself. Or maybe you’re just trying to get a better view of customers or products by putting all the data you’ve collected on them over the years into one place. The point is these are valuable objectives even if they don’t involve finding a needle in the haystack.

I think MailChimp is a great example of this. It used big data techniques to discover some interesting things about the characteristics of spam, but the bigger goal was automating the spam-detection process. Those insights don’t directly affect the bottom line, but they did free up resources to help apply data science in other areas that could.

Lower your expectations. Or at least know them

Like anything in IT, big data is almost destined to be a money pit if you go into it without a plan. I’ve heard stories of large-enterprise CIOs deploying Hadoop clusters — sometimes numerous flavors of Hadoop clusters — just because they felt obligated to. I assume there are companies trying desperately to hire data scientists with no real idea what types of problems they’ll be trying to solve. That’s crazy.

In some ways, this type of thinking ties back to the idea that new digital data sources somehow overtake a company’s legacy data in terms of importance. Without any actual plan of attack, proposing “We’ll use social media” as a solution to finding out more about consumers is about as useful as proposing “We’ll use Hadoop” as a solution to a question about a big data strategy. Both might very well be parts of any given plan, but they need to be used for what they’re good for.

One major takeaway from my recent interview with MetLife, for example, was how fast the company was able to move on a new data-centric project because it approached it with a plan in place about the types of data and technology it needed. I don’t think it’s surprising, either, to hear the team at Infochimps say that while customers often approach the company thinking they need Hadoop, it turns out they usually need to begin with something a little less industrial-strength.

So, no, new data types, technologies for processing them and techniques for analyzing them aren’t going to change the world through their mere existence. At the worst, they’re just bigger, shinier and arguably better versions of what we already had. At the best, however — and used appropriately — they really could make a big difference.

Big data will never equal perfect data, but it can definitely point us in the right direction. I suggest not throwing the baby out with the bathwater.

Feature image courtesy of Shutterstock user alri.

  1. Michael Warjas Tuesday, May 28, 2013

    Derrick, thank you for weighing in on this.

    1 Corinthians 8:2 And if any man think that he knoweth any thing, he knoweth nothing yet as he ought to know.

    Although big data is in the vaunted ‘trough of disillusionment’, look way out there on the right – the plateau of productivity – and you’ll find ‘predictive analytics’. We all know that more data trumps a clever algorithm.

    Gartner is constrained by their chosen technology adoption paradigm, so they’ve fast-forwarded big data as quickly as possible. It has clearly crossed the Chasm (for which Gartner blatantly has ripped off Geoffrey Moore) and is well on its way. It’s still early majority (thank you Clayton Christensen) and generalizing to SMB will be a challenge. I fully agree with your stated title, “If you’re disappointed with big data, you’re not paying attention.”

  2. Vinod Shintre Tuesday, May 28, 2013

    Let’s face it. Looking at or exploring existing data sets to investigate a potential business enhancer would not hurt anyone. At the end of the day, the business has to clearly identify what it wants to achieve after parsing huge data sets and assess ways to pass those metrics on to a system that can further implement them on a consumer system. This handshake is very necessary for all stakeholders to gain from big data.

  3. HadoopSphere Editor Tuesday, May 28, 2013

    Derrick, Agree with most of your comments here.

    After Gartner posted the trough indicator in January, Mark Beyer followed up with a blog post in March to explain what the ‘trough of disillusionment’ and ‘Hype cycle’ movement mean. (Mark Beyer was listed among the top Big Data influencers of 2012 by hadoopsphere.)

    This is what Mark said to set the tone right:
    “Gartner is not saying big data is dead, or gone. To the contrary — we say it becomes the new normal and does so somewhere between 2015 and 2017…
    for something to move into the Trough is a maturation process. Implementers and organizations will begin to choose the winning solution architectures and technologies that support them. The chaff will be reduced away from the wheat…
    Then it will rise along the Slope of Enlightenment while others drop by the way side.”

  4. Great article Derrick. One of the biggest problems is how broadly the term Big Data is used – it has certainly become the last bastion of support for overly commoditised Storage hardware vendors and resellers to rely on, rather than invest as much as they should in Data Analytics and BI so that all of that Data being generated is actually of tangible value to the organisation. In my experience, Big Data is a bit of a misnomer – in fact for many it should be the pursuit of small data – i.e. only extract that which is of value, in near real time. Amusingly, I’m even seeing the emergence of the phrase Dirty Data – for pretty much everything that is left! I’m focusing our new site on Intent Data, trying to show how extracting value from Big Data using targeted Data Analytics can produce genuine business value! http://www.intentdata.com

  5. Nice Job Derrick. If one steps back and takes a look at the big picture of big data, of all that it influences and encompasses, and how it intertwines with mobile and social, it becomes clear that there should be no “disappointment” in the phenomenon. Of course, there are individual projects that go belly-up because, for example, IT people buy in naively to the “Hadoop DIY” promise without considering what might work in the long-term. But as a global, macro-trend in IT, business, government and human development, big data is a monster that is not going away any time soon.

  6. Patrick Pitre Wednesday, May 29, 2013

    Excellent article on Big Data! These are exactly the types of insights into Big Data that decision makers need to see.

  7. I have been saying for months that the best analogy for Big Data is Chaos Theory…finding simple structure out of a massively large unstructured dataset. And yes, not perfect, but way beyond traditional data analytics. The “hype,” “trough of disillusionment” and “maturity” are amusing analogies to Geoff Moore’s “Crossing the Chasm,” and perhaps Adizes’ corporate life cycle infographic.

  8. Very interesting write-up, Derrick, that echoes much of what I was thinking but hadn’t figured out how to say. I wrote up something similar this morning based on criticism of Gartner and its Big Data Hype Cycle by someone I think missed the point of Gartner.

    http://successfulworkplace.com/2013/05/29/is-big-data-in-the-trough-of-disillusionment/

    Great minds think alike, even if I took a different tack (I referenced Nassim Taleb as well).

  9. John Thompson Wednesday, May 29, 2013

    Data is not good or bad. It is just data. The value is in what you make of it.

  10. That article lacks perspective, both technologically and historically. “Big data” is little more than a buzzword for a phenomenon that has existed as long as computing: having more data than can be feasibly processed by conventional computational methods. 20 years ago, we solved “big data” problems using PVM on AIX clusters with data sets that can now fit on a thumb-drive. 20 years from now, we’ll be processing today’s “big data” on our eyeglasses. The only people who aren’t disillusioned with “big data” are those who haven’t realized that yet, those who realized it from the beginning, or those who profit from custom-tailoring the Emperor’s new clothes.

