Blog Post

Big data is useless unless it’s also fast, diverse

In the past year, big data has gone from a hot topic in the enterprise to one of the most buzzed-about, and potentially overhyped, phrases of the year. Big data has huge disruptive potential, and the flood of attention should be no surprise. A recent IDC report stated that the business analytics software market grew by 14.1 percent in 2011 and will continue to grow to reach $50.7 billion in 2016, all driven by the focus on big data.

As one of the managing directors at Lightspeed Venture Partners, I spend a lot of time talking to companies about how they are using technology, such as big data, and meeting with entrepreneurs who are developing the next big disruptions in technology. I believe that to harness the power of this data revolution and gain a competitive edge, companies need to be able to do more than create and query their big data stores. They need to focus on making this big data fast, intuitive and easy to manipulate in new and interesting ways.

So what does that mean?

Companies need to do more than just store the data. Recent innovations in scale-out storage technology make it relatively inexpensive for companies to capture massive amounts of data. In many ways, that is how the big data conversation started. Companies such as Cloudera, Mapr, Vertica (acquired by HP) and Datastax are doing a great job of delivering the infrastructure required to hold and manage big data in a typical enterprise. (Full disclosure: Mapr and Datastax are both Lightspeed Venture Partners portfolio companies.)

Holding the data is step one of the process. The next challenge is how to use that data to help your business make better decisions. Right now, most companies are relying on data scientists to mine these raw stores of information. That’s a start, and we’ve seen some leading-edge companies make significant early revenue gains and cost savings with the help of data scientists. But data scientists are expensive and extremely scarce, they cannot work in real time, and they don’t scale.

A new generation of startups is looking to democratize data science by building on top of the basic big data platforms to turbo-charge the speed, intuitiveness and collaborative methods by which businesses can extract value from the new flood of information.

The first challenge is to make this data fast. Today, it can take minutes or hours to get a response or glean a new insight buried in the typical enterprise’s mountain of data. As a result, all questions must be carefully scripted and planned in advance. This limits the flexibility and agility of the business questions that can be posed.

But in a world where big data can perform instantaneously or “at the speed of thought,” the results are dramatically different. When a user can maintain an unbroken train of thought, a fluid interplay starts to occur between asking an initial question, getting a response, refining and asking additional questions, and ultimately getting to a new, unanticipated “Eureka!” moment. Think Google Instant for the enterprise. A number of startups are attacking this problem, including Qubole, Boundary, DataDog and several other stealth companies. (Full disclosure: Qubole and Boundary are both Lightspeed Venture Partners portfolio companies.)

The second layer of disruptive innovation relates to delivering dramatically improved experiences for navigating and manipulating data that has become so large that traditional spreadsheets, reports and charts would need millions of rows and pages to represent it (in simple terms, making data intuitive and easy to analyze).

New companies are focusing on a combination of AI (artificial intelligence), visualization, faceted search and social collaboration tools to empower hundreds or even thousands of ordinary business users to collectively mine, share and evaluate big data sets and gain insight without the need for a data scientist in the middle.

The emergence of self-service BI is allowing ordinary business users to drive the data warehouse for the first time and thereby eliminate expensive IT and data-scientist intermediaries. Historically, these intermediaries have been necessary to process requests and program reports, which ultimately constrained the business analysis process. Some of the most exciting companies innovating in this space include Tableau, QlikTech and Edgespring. (Full disclosure: Edgespring is a Lightspeed Venture Partners portfolio company.)

The big data revolution is in its early innings, and it is about to get even more exciting. So while the term may be overhyped, the massive potential for companies to take advantage of these new innovations makes it worth all of the extra attention.

Ravi Mhatre is a managing director of Lightspeed Venture Partners (@lightspeedvp), where he focuses on investments in enterprise IT, mobility, and Internet and cloud-based services and applications. You can follow him on Twitter at @RMTacct.

We will be discussing the challenges of big data and scalable analytics at GigaOM’s Structure: Europe conference in Amsterdam, October 16 and 17.

Image courtesy of Flickr user altemark.

25 Responses to “Big data is useless unless it’s also fast, diverse”

  1. Great article – It is true that Big Data storage providers must take the extra step forward to improve speed in order to be more effective. We have found that by using a decentralized and symmetric architecture, all storage nodes share the same functionality and responsibilities. The symmetric architecture also enables easy scalability simply by adding more storage nodes. The more storage nodes in the cluster, the less time it takes to complete any given task. I’d be interested to see what you think:

  2. Ravi, nice article. Have you checked out MarkLogic? (yes, I work there) We excel at each of your success criteria:
    1) Storing the data – MarkLogic does this, with a much smaller footprint than other tools (as evaluated by a 3rd party systems integrator) to save costs for our customers, and has also partnered with SGI to provide a Big Data appliance (news release came out this week).
    2) Making the data fast – we get customers’ query response time on Big Data down from hours to sub-seconds.
    3) Supporting self-service BI – MarkLogic integrates (via ODBC driver) with common BI tools such as Tableau and Cognos, also Excel.
    Check out our website at

  3. Charlene Son Rigby

    Ravi, Great article. I agree with your point on the need for speed of thought analytics. At Metamarkets, we often use the term “human-time”. We have built a self-service SaaS analytics offering with highly interactive visualization. Metamarkets enables business users to get the insights they need in human time so they can use Big Data for on-going decision making. Please check it out and we’d love your feedback:

  4. Hi Ravi, thank you for an interesting post. As co-founder of SNTMNT we’re in the middle of the big data hype, although less from an enterprise perspective and more on social. Still, I think that the rise of companies like Dataminr, Bottlenose, PeerIndex and Peerreach shows that the big data revolution is gaining momentum in the social space as well.

  5. I’ve seen a lot of articles recently about big data analytics but very few explain where these tools have provided tangible business benefits. How about some examples or case studies?

  6. All stated requirements, that using big data needs to be fast, intuitive and easy to manipulate in new and interesting ways, can be most naturally met by the tried and true spreadsheet – if a spreadsheet could handle unlimited amounts of data. While the familiar, conventional spreadsheets certainly cannot, there is at least one that can: The Trillion-Row Spreadsheet(SM) offered by 1010data. The TRS has all the features one would expect in a user-oriented tool, including a visual display of the data, an interactive method of use, and the ability to do “off road” analysis, but it can handle any amount of data (even hundreds of billions or trillions of rows) and has a full set of advanced analytical functions built in. This makes the data accessible to people who aren’t technical and reduces the skill set required of a data scientist.

  7. It’s quite logical to deduce that processing of big data will be slow if performed using conventional hardware and software systems, as the existing systems are not designed to cope with exploding volumes of data.
    Companies need to develop & deploy special infrastructure which is capable and efficient enough to process and respond to the large volume of data in very little time. In order to do so, specialized IT infrastructure services need to be deployed through the collaboration of different vendors, so that each can contribute its own specialization to improve the processing time of Big data.

  8. Good article Ravi! With the explosion of big data, companies are faced with data challenges in three different areas. First, you know the type of results you want from your data but it’s computationally difficult to obtain. Second, you know the questions to ask but struggle with the answers and need to do data mining to help find those answers. And third is in the area of data exploration where you need to reveal the unknowns and look through the data for patterns and hidden relationships. The open source HPCC Systems big data processing platform can help companies with these challenges by making it quick and simple to derive insights from massive data sets. Designed by data scientists, it is a complete integrated solution from data ingestion and data processing to data delivery. More info at

  9. Article puts the finger on the issue but fails to capture how it needs to be addressed.

    For data to pay off, two things need to happen: a) generate insights/actions from the data that add to the bottom line/top line; b) push the action from the insight to the customer action points (channels, sales, servicing, internet, dealer network, marketing etc.) at a customer level so that the recommended action can be taken when the customer interaction happens. Honestly, I don’t see anybody doing it. Big data will turn out to be a big fad.

    That’s not to say it does not matter. Banking and Insurance companies have been doing this for over 15 years now through the use of risk scores, propensity models applied to lead lists, price elasticity etc although not in real time.

    Big data will pay off when it combines the volume/speed of data with pushing intelligent actions to the channels.

  10. The one thing I don’t see in your writeup is the use of Big Data without storing it down in a data warehouse (data at rest). To make the use of data fluid, put it into cache memory using one of the many in-memory products available today. Batch processing of large data sets is interesting, but real-time analytics is more interesting.

    Talking to healthcare and other industries, there is an increasing amount of data that is very volatile…it only has value for a short time. It has to be used in real-time or it doesn’t ‘work’.

    I wrote up this concept here:

    • I think this is a great question! Both types of information have inherent business value, and while it is early innings, it’s becoming increasingly clear that substantial additional value can be derived from intersecting the two sources in a typical enterprise. The ability to dynamically “mash up” a selection of traditional “structured sources” (i.e. an operational data warehouse) with newly generated information from an “unstructured system” (i.e. Twitter API, Facebook API, log-generated event data) can create valuable insights when compared to either data source stand-alone. I plan to write more on this subject in the future.

  11. Hi Ravi. Great points. I am wearing my PR hat here and wanted you and your readers to check out Chiliad and their recent update to their product Discovery/Alert 7.0. They nail it when it comes to saving hours/manpower/smarts to use a big data search solution. Their tagline explains it well — Find meaning in Big Data. Every bit of information at once. Usable by everyone. With economics you’ll love. Real-world Big Data is rarely a tidy package. It lives in diverse formats in many locations. Chiliad Discovery/Alert 7.0 tames it, showing what matters with tools to help you decide what it means. You’re no data scientist? Discovery/Alert 7.0 speaks your language. Tell it what you need and enjoy the results. Doing more with less? Discovery/Alert is frugal. Save millions ordinarily spent on data consolidation or legions of data scientists and software engineers. With a simple subscription and cost-effective deployment, meaning is no longer beyond your reach.

  12. A man smarter than me once said don’t skate where the puck is or was, skate where the puck will be. Now if we look at a professional player, does he calculate the probability of all players on the ice moving with a probable velocity in a probable direction, his team and his opponents, taking into account energy already consumed by players, time in the game, and a hundred other variables, and process it in a split second with a system running double digits energy consumption?[1]
    After floating a paper to get feedback, the company with large R&D that responded immediately was IBM. I wouldn’t count them out, Watson, Data finds Data … They know the problem inside out and understand it requires something other than faster.


  13. Derek Pappas

    Actually, step one is figuring out how to capture the data and how to represent the captured data. Finding the “signal” in the raw data using custom algorithms is the real value add that startups bring to the table. Scaling NoSQL data stores and manipulating the data for reporting purposes will be commodities in a few years. The front-end data acquisition through custom algorithms requires innovative solutions. Taking a raw Tweet stream and classifying Tweets at a fine granularity requires new and innovative algorithms that are not in textbooks. Reverse engineering online databases stored in web sites and normalizing the data from different sites is another significant problem.
    Step two is storing the data.
    Step three is processing the data from different sources to find such things as relationships and clusters.
    Step four is visualization and reporting.
    Companies which supply tools do not have high exit multiples. Companies which supply must-have solutions do.

    • Derek,
      These are good points. I agree that there will be a host of unanticipated problems (?opportunities for start-ups :) related to ingestion of “big data”. It is inherently unwieldy both because of its size and the heterogeneity of origin sources.

      With respect to mining Big Data repositories for business insight, we believe algorithmic techniques (i.e. artificial intelligence, machine learning, statistical clustering) will be valuable sources of signal. However, we also believe in the criticality of next-gen BI technologies (i.e. visualization, faceted search, iterative exploration and tagging, rapid on-the-fly aggregations, structured collaboration) with humans in the loop. These capabilities inform which algorithms to use and provide an explanation of the “cause-and-effect drivers” behind machine-generated insights, which are critical to determining the appropriate business actions to be taken.

    • kalpataru

      Great points Derek. Solving the first step is critical for an optimized big-data platform. Can a CDN solution help to identify customer tags/business identifiers/hints and make multi-dimensional data (instead of raw logs) available for customers to consume?