Summary:

The IT hype machine has everyone jumping on the big data bandwagon. But before we start saving every scrap of data in the enterprise for fear that we will miss a nugget of insight, shouldn’t we focus on what we already have?

The IT hype machine has everyone jumping on the big data bandwagon. Consultants are making millions helping companies search for nuggets of insight in loosely related data and the hardware industry is enjoying a mini-boom as more data than ever is being saved and more storage and computing power is being sold to process it.

Why bigger doesn’t always mean better

At the center of much of the discussion is Hadoop. This confusing collection of distributed computing technology may be open source, but it’s neither cheap nor friendly despite the cute elephant logo. In fact, Hadoop and big data seem like the dream ticket for the vendors of big storage and big iron, many of whom have made expensive acquisitions to get into this lucrative market.

But before we start saving every scrap of data in the enterprise for fear that we will miss a nugget of insight, shouldn’t we focus on what we already have? Surely the real goal is to enable more people in the company to do more with their existing data before adding new data of undetermined relevance and quality. Perhaps it makes more sense to get off the big data bandwagon and focus on empowering business users to use the data they already have more effectively, rather than feeding the elephant and its ecosystem of hangers-on.

Often, the big data discussion is framed by the implied premise that bigger is better and that adding more data will naturally produce insights. Should you buy into the hype? Big data projects come with big investments in complex computing systems and the specialized skills to make them go. Worse, they are burdened by notoriously long deployment schedules and poor performance.

You don’t need more dead data

Maybe some huge enterprises and government departments do need big data, but what about the rest of us? Can collecting more data help? Perhaps, but you must first answer: Am I getting useful, timely answers from the data I already have? Do I have disciplines in place to operationalize insights and measure their impact on the business? Sadly, if the answer is no, you are not alone. According to a recent study by Freeform Dynamics, only 15 percent of enterprises feel they fully exploit traditional database information for decision making.

It seems that most of the data already stored for analysis is underutilized. Indeed, Bill Inmon, father of the data warehouse, claims that 95 percent of data in a warehouse is “dormant.” Will adding terabytes or petabytes of unstructured data to your already underutilized data warehouse change this? Probably not. In fact, there’s a good chance that it will turn dormant data into dead data.

What companies need is not dormant or dead data. They need data that helps them gain operational insights to make their existing business run better. They need data that empowers their business users to be more creative and productive. They need live, “quick” data, not dormant or dead data. If this makes sense to you, how do you get there?

End big, but start small

First, take stock of what you already have: not just data but also knowledge and skills. Select a project where you can demonstrate incremental improvement with existing resources. If you need to hire, think business analyst, not technologist, because a dollar spent answering a business question is an investment; one spent on specialized IT skills to support the process is a sunk cost.

Second, consider more agile off-the-shelf tools that will allow you to think big, but start small and scale fast. Think friendlier tools, accessible to your existing staff. This approach will deliver more business insight today, and many such tools scale well unless tested against the most extreme big data problems. The solution should allow intuitive use by the business manager, with extensibility to support more complex mining by an experienced analyst. Knowledge of the underlying data structure or processing platform should not be necessary.

The analytics engine should run on standard servers, with no proprietary hardware and no specialized configurations, database schemas or tuning needed to achieve the required performance. And because loading data into the analytics database can become your most time-consuming effort, connections to your data sources should be based on industry standards and designed to greatly simplify data load from multiple formats.

Finally, adopt an agile, iterative approach: don’t go big bang on big data. Successful analytics initiatives are based on an ongoing dialog with the data, in which the answers to one set of questions set up the next round of discovery. With each cycle, more is understood about what data is present, what is relevant, what needs to be added and how much history is worthwhile to add. Rapid time-to-answer is the most critical factor in harvesting the value from your data, big or small.

Maybe big data analytics will someday become a must-have for every business, but don’t be persuaded that this is the case now just because management consultants and major vendors are throwing millions of dollars at “What are you missing by not using big data analytics?” messaging. In all likelihood, you’re not missing anything and your time and money are better spent putting your existing data in the hands of more business users and giving them the tools to do deeper and faster analytics now.

One thing that evolution shows us is that small, agile species tend to do better than large, specialized species. Maybe we should apply the same thinking to our data?

Fred Gallagher is general manager of Vectorwise at Actian Corporation.

Image courtesy of Shutterstock user lafoto.

  1. SolveOut.com is FOR SALE NOW at NetBinge.com

  2. As always Gigaom hits the nail straight on the head. This is extremely well written and very useful to anyone curious about the relevance and value of big data.

  3. Keith Bolam Monday, May 14, 2012

    Excellent thoughts that everyone should take heed of.

  4. nicely written and self explanatory

  5. Great article. Large quantities of internal data are one thing; however, when analyzing vast amounts of internal and external data in real time, simple analytics won’t do by design. Like all other technologies, the Big Data stack will evolve over time and will become standard with more off-the-shelf tools.

  6. @dataElGrande Monday, May 14, 2012

    Although this is a thoughtful article, one must ponder over the question “How can I be sure the dormant data I have won’t bring me useful insights?” without actually going through the process of trying big data analytics. It’s the possibility of improvement that is driving companies towards powerful reporting and analytics, not so much their “fear of underutilizing data.” If interested, check out the helpful tools Pentaho has to offer dealing with Big Data Reporting

    http://www.pentaho.com/big-data/

  7. Nice article, Fred! I think it is worth mentioning that the HPCC Systems open source offering provides a single platform that is easy to install, manage and code to. Its built-in analytics libraries for Machine Learning and integration tools with Pentaho for great BI capabilities make it easy for data analysts to work with Big Data. I believe HPCC is better than Hadoop and commercial offerings: it has a real-time data analytics and delivery engine (Roxie) and runs on the Amazon cloud like a charm through the One Click portal. For more info visit: hpccsystems.com

  8. Great article. At TrendSpottr (http://trendspottr.com), we very much believe that one of the most disruptive Big Data opportunities is real-time analysis of small data. More specifically, our focus is on identifying trends and predictive insights from small data sets that reside within Big Data streams; i.e., a predictive early warning system based on real-time pattern recognition and statistical analysis. The reality, as you note, is that most of the data within Big Data is either dormant or dead. By focusing on only those data elements that provide the potential for insight and action, we can leverage Big Data by starting small.

    1. The article is based on a couple of premises that aren’t true: that users have all their data available in a relational warehouse, and that analytics is the only use of Hadoop. Hadoop allows interrogation of disparate data sources in a way not possible with traditional BI tools. Agreed, you should be using those tools where possible, but just because traditional BI on a warehouse is easier doesn’t mean that it’s answering the questions that are most valuable. A great example of this is social media data, which the terms of the social network usually prohibit storing locally. You have to use a tool like Hadoop to do large-scale queries involving internal and social data.

      In addition, there are lots of other uses of Hadoop. Lots of people are moving to different architectures where transactions are stored in their rawest form in a simple redundant database and Hadoop is used to present relational views of historical data, and queries are run over realtime and the historical sources. That massively reduces writes, allowing greater scaling and simplicity. For a better description of this architecture, check out the Pragmatic Programmers Big Data book currently in Beta.

      There may be a bit of a gold rush mentality around this technology, and some of it is overblown. But it is a flexible tool that has more uses than you are describing, and it solves real problems that many firms are having.

  9. Sarunas Chomentauskas Wednesday, July 25, 2012

    One more false premise in this article is that ‘more is not better’ when it comes to data. In fact, the opposite is true. Relatively simple math is unreasonably effective given sufficiently large data sets: a big data fact that is true, and on which the entire Google edifice is founded.

