If you’re disappointed with big data, you’re not paying attention

There has been a backlash lately against big data. From O’Reilly Media to the New Yorker, from Nassim Taleb to Kate Crawford, everyone is treating big data like a piñata. Gartner has dropped it into the “trough of disillusionment.” I call B.S. on all of it.

It might be provocative to call into question one of the hottest tech movements in generations, but it’s not really fair. That’s because how companies and people benefit from big data, data science or whatever else they choose to call the movement toward a data-centric world is directly related to what they expect going in. Arguing that big data isn’t all it’s cracked up to be is a strawman, pure and simple — because no one should think it’s magic to begin with.

Correlation versus causation versus “what’s good enough for the job”

One of the biggest complaints (or, in some cases, proposed facts) about big data is that it relies more on correlation than causation in order to find its vaunted insights. To the extent that’s true, it’s a fair criticism. Only I’m not certain how often it’s true for things that really matter.

Honestly, for song or product recommendations, who really cares?

But in areas like medicine, finance and even marketing, people are becoming much more concerned with finding out “why” once they’ve found out “what.” If you’re a police department trying to figure out a strategy for stopping people on the street, for example, even a strong correlation between race and certain crimes probably won’t be enough to justify harassing minorities. Oncologists might benefit from seeing the similarities among cells in a biopsy, but targeting certain markers doesn’t guarantee you can cure someone’s cancer.

Or if you’re a retail store, knowing that Mac users who visit your site tend to buy more-expensive products might make you want to show them more-expensive products. Some deeper digging — perhaps even via direct questions — would show they’re really concerned with craftsmanship. The more you learn beyond what a clustering algorithm can tell you, the better you can connect with customers.

This is why some people call the process of asking interesting questions of data “exploratory analytics.” Data analysts can send out a virtual Christopher Columbus to see what’s doing inside their data. If they find something potentially valuable, they dig in further. Correlations are just a notice that there might be something worth looking at here.

Clusters show where oncologists should start investigating. Source: Columbia University

And even in the realm of machine learning — where algorithms are tearing through datasets trying to discover complex patterns humans could never spot — very few people are seriously suggesting we take the machines at their word. In case after case after case, the story is the same: machines do the heavy lifting but humans still play critical roles in training the models by correcting mistakes or adding judgment into an otherwise entirely logical process.

Web data is only part of big data

There’s another idea floating around, too, which is that web-derived data (be it from social media, search queries or some other place) is somehow synonymous with big data. Critics are quick to point out that there are biases in this type of data and that we shouldn’t abolish traditional methods of qualitative, non-digital research in favor of methods utilizing this fast, easy web data. Of course these critics are right.

But who is really suggesting we do away with traditional forms of research? Social media data shouldn’t usurp traditional customer service or market research data that’s still useful, nor should the Centers for Disease Control start relying on Google Flu Trends at the expense of traditional flu-tracking methodologies. Web and social data are just one more source of data to factor into decisions, albeit a potentially voluminous and high-velocity one.

Even if they’re biased or perhaps even slightly misleading, though, these new data types are still valuable, even for social science research. They are a source of new, large and arguably unfiltered insights into attitudes and behaviors that were previously difficult to track in the wild. I’m thinking of the researchers who identified new insights into bullying by studying Twitter activity, and of those who have mapped racist tweets across the United States.

Floating Sheep’s Hate Map

The drawbacks should be pretty easy to overcome. Demographic or other biases might be relatively easy to spot when information is also tagged with geodata and perhaps profile information, for example. And assuming the data is mostly indicative of macro trends, there’s definitely value in being able to track it by the day, hour or minute and see trends shaping up in something far closer to real time than traditional research methods would allow.

It’s not all about insights

Which brings me to another point, this one about the idea that big data is all about finding out new things through exploration. Sure, that can be the case if you’re starting to analyze entirely new data sources (like social media data) or using entirely new techniques, and it’s a very compelling reason to get started down the big data path. But sometimes big data is just about automation.

Technologies like Hadoop, for example, aren’t designed to write you better models — they’re designed to process a lot more data a lot faster. If your models still work, Hadoop should help you run them better against a much larger dataset. That might lead to more accurate models and faster answers, but it won’t necessarily lead to some “a ha” moment — like that you’ve been doing business all wrong for all these years.

If you’re a law firm, analyzing e-discovery files faster and more accurately might be reward enough in itself. Or maybe you’re just trying to get a better view of customers or products by putting all the data you’ve collected on them over the years into one place. The point is these are valuable objectives even if they don’t involve finding a needle in the haystack.

I think MailChimp is a great example of this. It used big data techniques to discover some interesting things about the characteristics of spam, but the bigger goal was automating the spam-detection process. Those insights don’t directly affect the bottom line, but they did free up resources to help apply data science in other areas that could.

Lower your expectations. Or at least know them

Like anything in IT, big data is almost destined to be a money pit if you go into it without a plan. I’ve heard stories of large-enterprise CIOs deploying Hadoop clusters — sometimes numerous flavors of Hadoop clusters — just because they felt obligated to. I assume there are companies trying desperately to hire data scientists with no real idea what types of problems they’ll be trying to solve. That’s crazy.

In some ways, this type of thinking ties back to the idea that new digital data sources somehow overtake a company’s legacy data in terms of importance. Without any actual plan of attack, proposing “We’ll use social media” as a solution to finding out more about consumers is about as useful as proposing “We’ll use Hadoop” as a solution to a question about a big data strategy. Both might very well be parts of any given plan, but they need to be used for what they’re good for.

One major takeaway from my recent interview with MetLife, for example, was how fast the company was able to move on a new data-centric project because it approached it with a plan in place about the types of data and technology it needed. I don’t think it’s surprising, either, to hear the team at Infochimps say that while customers often approach thinking they need Hadoop, it turns out they usually need to begin with something a little less industrial-strength.

So, no, new data types, technologies for processing them and techniques for analyzing them aren’t going to change the world through their mere existence. At the worst, they’re just bigger, shinier and arguably better versions of what we already had. At the best, however — and used appropriately — they really could make a big difference.

Big data will never equal perfect data, but it can definitely point us in the right direction. I suggest not throwing the baby out with the bathwater.

Feature image courtesy of Shutterstock user alri.

20 Comments

Sandra Hendren

Finally!! A well balanced, well written, altogether smart assessment of big data – its warts and its beauty marks. I am so exceedingly tired of big data this and big data that, largely if not exclusively driven by marketing hoopla. Instead this insightful analysis is right on. Thank you, Derrick……

howgreenisyourgarden

‘I’ve heard stories of large-enterprise CIOs deploying Hadoop clusters — sometimes numerous flavors of Hadoop clusters — just because they felt obligated to…. That’s crazy.’

Yeah, I’ve recently heard a story about an intelligence agency doing much the same.

That’s the trouble with big data: you often don’t know what you are looking for until you find it, and because of that, you need ALL the source data up front.

Mona

Very interesting article, Derrick. I completely concur that companies need to understand and plan the strategy around Big Data before jumping to invest in it. If you look closely, it has a different meaning, outcome and value for different companies. They need to think through what they want to achieve and shop for the right tool for Big Data. It can be very effective if used intelligently with the right strategy and plan. We do consulting and strategy planning for Big Data and sometimes come across clients who actually don’t know what their business driver for Big Data is. Great article.

Nicholas Goubert

The Pareto rule applies to big data analysis as well, and if it helps you get 80% right then it is already a big step forward, and I am pretty sure some of the missing 20% will get covered soon via better data cleaning, mining and representing methods and tools. Meanwhile, I am with you, Derrick, and I will urge people not to go ‘Hurler avec les loups’ as we say in French.
Nicholas

Jason Mondanaro

Derrick,

I think your article is quite accurate. Everyone who is down on Big Data seems to have no one to blame but themselves for thinking that Big Data was somehow more than Big and Data. As you indicate, it isn’t the data that matters, it is what business insights and goals you want to achieve. Without having specific goals and insights you want to find and understand you will have no idea what type of analysis you should be performing or if you even have the right data to support such an analysis. Having a lot of data doesn’t guarantee you actually have useful data! On top of that, is the data even in a useful system? For example it is common in my experience to find operational systems logging mountains of interesting data, but without having it available in the billing system so you can actually do business and pricing modeling or trial new offers to market segments, it is always going to be Big Data and never make the transition to Business Insight. Perhaps the real problem is that Big Data just sounded a lot easier and “cool” compared to Decision Support Systems and Business Intelligence, which is another overloaded term!

-Jason

Leonel More, PMP

The problem is not with the word or the data itself, the technology, etc. To leverage the data and the new enablers you need a model to dig into it, and an objective for doing that. I agree with you 100%, it isn’t a magical new approach helping companies to get a new business from scratch.

Check out what the finance sector did with Alpha computing algorithms using the huge historical data they handle. And that is nothing new. Those quants have been around for several years. AND, indeed they know the models they use are not perfect, but even so they are a competitive advantage those companies have over the firms that don’t have them.

You can argue about my comment with the last bubble in the housing sector, but that’s another story. Nice post!

Shanmuga Sundara Ram

Like any new technology, Big Data also has to live through this period of Hype. The only difference is that millennials and others in the fast-changing world expect results immediately without putting in enough effort.

Big Data is not a magic wand. It provides provisions for a Research Org / Bank / Retailer to hold vast amounts of information (in petabytes) and retrieve it when necessary for analysis. Analysis that used to be done on sample data is now done on the actual, entire data, increasing the accuracy.

Big Data is not just for Analytics / Insights, as this article rightly points out. It can be used in varied ways, for example as an archival solution – where you can replace Tapes with Big Data clusters and retrieve the data in quick time.

Satyendra Rana

Big Data or the Big Data attitude is impacting all in various ways. It is about time that Gartner revisits their technology hype cycle model to something more relevant in today’s fast paced world.

Mark Juric

That article lacks perspective, both technologically and historically. “Big data” is little more than a buzzword for a phenomenon that has existed as long as computing: having more data than can be feasibly processed by conventional computational methods. 20 years ago, we solved “big data” problems using PVM on AIX clusters with data sets that can now fit on a thumb-drive. 20 years from now, we’ll be processing today’s “big data” on our eyeglasses. The only people who aren’t disillusioned with “big data” are those who haven’t realized that yet, those who realized it from the beginning, or those who profit from custom-tailoring the Emperor’s new clothes.

John Thompson

Data is not good or bad. It is just data. The value is in what you make of it.

David Mayes

I have been saying for months that the best analogy for Big Data is Chaos Theory…finding simple structure out of a massively large unstructured dataset. And yes, not perfect, but way beyond traditional data analytics. The “hype” “trough of disillusionment” and “maturity” are amusing analogies to Geoff Moore’s “Crossing the Chasm,” and perhaps Adizes corporate life cycle infographic.

Patrick Pitre

Excellent article on Big Data! These are exactly the types of insights into Big Data that decision makers need to see.

EQ

Nice Job Derrick. If one steps back and takes a look at the big picture of big data, of all that it influences and encompasses, and how it intertwines with mobile and social, it becomes clear that there should be no “disappointment” in the phenomenon. Of course, there are individual projects that go belly-up because, for example, IT people buy in naively to the “Hadoop DIY” promise without considering what might work in the long-term. But as a global, macro-trend in IT, business, government and human development, big data is a monster that is not going away any time soon.

Intent Data

Great article Derrick. One of the biggest problems is how broadly the word Big Data is used – it has certainly become the last bastion of support for overly commoditised Storage hardware vendors and resellers to rely on, rather than invest as much as they should on Data Analytics and BI so that all of that Data being generated is actually of tangible value to the organisation. In my experience, Big Data is a bit of a misnomer – in fact for many it should be the pursuit of small data – i.e. only extract that which is of value, in near real time. Amusingly, I’m even seeing the emergence of the phrase Dirty Data – for pretty much everything that is left! I’m focusing our new site on Intent Data, trying to show how extracting value from Big Data using targeted Data Analytics can produce genuine business value! http://www.intentdata.com

HadoopSphere Editor

Derrick, Agree with most of your comments here.

After Gartner posted the trough indicator in January, Mark Beyer followed up with a blog post in March to explain what ‘trough of disillusionment’ and ‘Hype cycle’ movement mean. (Mark Beyer was listed among the top Big Data influencers of 2012 by hadoopsphere.)

This is what Mark said to set the tone right :
“Gartner is not saying big data is dead, or gone. To the contrary — we say it becomes the new normal and does so somewhere between 2015 and 2017…
for something to move into the Trough is a maturation process. Implementers and organizations will begin to choose the winning solution architectures and technologies that support them. The chaff will be reduced away from the wheat…
Then it will rise along the Slope of Enlightenment while others drop by the way side.”

Vinod Shintre

Let’s face it. Looking or exploring existing data sets to investigate a potential business enhancer would not hurt anyone. End of the day business has to clearly identify what they want to achieve after parsing huge data sets & assess ways to pass on those metric to a system which can further implement them on a consumer system. This handshake is very necessary for all stake holders to gain out of big data.

Michael Warjas

Derrick, thank you for weighing in on this.

1 Corinthians 8:2: And if any man think that he knoweth any thing, he knoweth nothing yet as he ought to know.

Although big data is in the vaunted ‘trough of disillusionment’, look way out there on the right – the plateau of productivity – and you’ll find ‘predictive analytics’. We all know that more data trumps a clever algorithm.

Gartner is constrained by their chosen technology adoption paradigm, so they’ve fast-forwarded big data as quickly as possible. It has clearly crossed the Chasm (for which Gartner blatantly has ripped off Geoffrey Moore) and is well on its way. It’s still early majority (thank you Clayton Christensen) and generalizing to SMB will be a challenge. I fully agree with your stated title, “If you’re disappointed with big data, you’re not paying attention.”
