The big data world is operating at 1 percent


Many would be shocked to know that researchers analyze and gather insights from only 1 percent of the world’s data. That 1 percent of analyzed data has been the only driver of innovation and insights in what we now know as “big data.” The other 99 percent of the 1 quintillion bytes of data that is collected every day (according to a recent study from IDC) remains untouched.

We all know that big data has so much promise. For a very large number of problems today, the effective use of data is a bottleneck. The drug discovery problem is more about data than chemistry. The discovery of new energy sources is more about data than geology. It’s the same for tracking terrorists, detecting fraud, and more.

Today, we recognize that these, and many other critical global issues, are all data problems. This fact alone has given rise to a huge investment in big data, created the hottest job title around (data scientist) and propelled the valuations of private data analytics providers into the billions. However, imagine the endless possibilities when the world is operating on the insights gathered from 100 percent of its data.


Where do you start when you have a data set as large as the human genome, for example, or the one that President Obama’s recent call to map the human brain would produce? To achieve the breakthroughs we need to address the world’s most perplexing problems, we need to fundamentally change the way we gain knowledge from data. Here’s what we need to start thinking about:

  • Starting with queries is a dead end: Queries are not inherently bad. In fact, they are essential once you know what question to ask. That’s the key: the flaw is starting with queries in the hope that they will uncover a needle in the massive digital haystack. (Spoiler alert: they won’t.)
  • Data has a cost: Storing data is no longer expensive, in most cases. Even querying large amounts of data is becoming more cost effective with tools like Hadoop and Amazon’s Redshift. This is just the hard cost side of the equation, though.
  • Insights are value: The only reason we bear the cost is that we believe data holds insights that unlock value. Ultimately, the undiscovered insights that organizations miss have a much higher cost in terms of being able to solve big problems quickly, accelerate innovation and drive growth. The cost of data collection can be high, but the cost of ineffectual analysis is even higher. The tools for getting at insights don’t exist today. Today, we rely on very smart human beings to come up with hypotheses and use our tools to validate, or invalidate, those hypotheses. This is a flawed strategy since it relies on (arguably smart) guesswork.
  • You have the right data today: There’s often a belief that, “If we only had more data, we could get the answer we’re looking for.” Far too much time and money is wasted collecting new data when more can be done with the data already at hand. For example, Ayasdi recently published a study in Nature Scientific Reports that shows important new insights from a 12-year-old breast cancer study that had been thoroughly analyzed for over a decade.

Big data is the beginning, not the end

I’m very concerned that the growing hype around the term big data has set us all up for disappointment. Query-based analysis is fine for a certain class of problems, but it will never be able to deliver on the expectations the market has for big data.

We are on the cusp of critical breakthroughs in cancer research, energy exploration, drug discovery, financial fraud detection and more. It would be a crime if the passion, interest and dollars invested to solve critical global problems like these were sidetracked by a “big data bubble.”

We can and should expect more from data analysis, and we need to recognize the capabilities that the next generation of solutions must be able to deliver:

  • Empower domain experts: The world cannot produce data scientists fast enough to scale to the size of the problem set. Let’s stop developing tools just for them. Instead, we need to develop tools for the business users: biologists, geologists, security analysts and the like. They understand the context of the business problem better than anyone, but might not be up to date with the latest in technology or mathematics.
  • Accelerate discovery: We need to get to critical insights faster. The promise of big data is to “operate at the speed of thought.” It turns out that the speed of thought is not that fast. If we depend on this approach, then we will never get to the critical insights quickly enough because we’ll never be able to ask all of the questions of all of the data.
  • Marriage of man and machine: To get to those insights faster, we need to invest in machine intelligence. We need machines to do more of the heavy lifting when it comes to finding the clusters, connections and relationships between data points that give business users a much better starting point for discovering insights. In fact, algorithmic discovery approaches can solve these problems by looking for rare but statistically significant signals in large datasets that humans would never be able to find. For example, in a recent study, previously unreported drug side effects were found by algorithmically searching through web search engine logs.
  • Analyze data in all its forms: It’s understood that researchers need to analyze both structured and unstructured data. We need to recognize the diversity and depth of unstructured data: text in all languages, voice, video and facial recognition.

When it comes to the evolution of big data, we’ve only begun to scratch the surface. It stands to reason that if we continue to analyze 1 percent of data, then we’ll only tap into 1 percent of its potential. If we’re able to analyze the other 99 percent, then think about all of the ways that we can change the world. We can accelerate economic growth, cure cancer and other diseases, reduce the risk of terrorist attacks, and tackle many other big-ticket challenges that we face.

That’s something that we can all rally around.

Gurjeet Singh is the co-founder and CEO of Ayasdi, an insight discovery platform built on topological data analysis technology. He will be speaking at Structure: Data, March 20-21 in New York.


Feature image courtesy of Shutterstock user Sergey Lavrentev.


Jack Rivkin

Can’t wait until we are using more than 1% of the data available and accumulating. There is a “law” that applies here: “The number of conclusions reached expands proportionately to the data available and inversely to the time required to reach a conclusion.” What one measures or analyzes affects behavior. It does start with strategy and what one is trying to accomplish. Big Data offers some exciting opportunities, but our history of data analysis says “proceed with caution.” See “What is the Big Deal About Big Data?” for some more thoughts on this:


Thanks to everyone for posting such thoughtful comments. I find it interesting that several of you pointed out the value of queries. I agree that queries are important, but only after the entire data set is algorithmically mapped into a topology that a data scientist, or business user, can begin with. In other words, once the relationships are mapped across the entire dataset, then the user can “zoom into” specific areas of the visualization to explore the meaning of shape and color. It’s about starting with a machine-generated view of the data that is unbiased and holistic (using 100% of the data). Thomas is correct when he says that “starting with queries is a dead end” is an ethos that “gets increasingly meaningful as we approach huge datasets and very complex correlated sciences” where no one has yet dreamt up a possible hypothesis to start with. We are in the first inning of the Big Data game and I look forward to what we will all discover when we explore new approaches.

Thomas Chacko

Agree with this mostly, Gurjeet. We have enough historical data available in some form within specific domain datastores. This can be mined by tools from Ayasdi and others to usher in critical breakthroughs in cancer research, energy exploration, drug discovery, financial fraud detection and more. However, most entities, unless mandated by regulation or already involved in traditional data mining or piloting a big data project, attach a very high cost to data retention. As a result, most of this valuable data keeps spilling outside the retention window and is lost forever for any future application.

“Starting with queries is a dead end”: this ethos gets increasingly meaningful as we approach huge datasets and very complex correlated sciences where no one has yet dreamt up a possible hypothesis to start with. The caveat is that we have to identify and steer clear of ‘false positive’ traps and be able to verify the outcome confidently using multiple approaches.
Nevertheless, a very interesting era we are all staring at.


The most important aspect to big data is the company strategy and how the company is willing to align to the insights from big data. First they must have a strategy and know what they want from the data, and “more sales” is not valid. Then, when they have the results they must be able to act on those results, make decisions and be more flexible in implementing those decisions.


I am not a data scientist by any stretch of the imagination, so forgive me if I abuse the terminology. Queries are not valuable solely for finding positives. Their real value lies in finding negatives, and eliminating those from the data being considered.


Hear, hear! Insights are harder to derive when the ingrained assumptions of causality lead us to look for affirming evidence, rather than understanding more fully the interactions, or impacts of different event chains.
The problems Gurjeet raises reflect the persistence of linear thinking and heuristic bias. The opportunities that genetic algorithms or self-learning models present parallel the processing that occurs in the human brain, but they require additional automation to make the dynamic feedback or output manageable. I’m all in favor of adding new tools that allow us to reliably reconstruct scenarios and iteratively discover what doesn’t fit the expected patterns. What makes us well may be different from what makes us sick. The timing of data capture, which accounts for why we may be only at 1%, doesn’t have to be a limitation. On the contrary, it’s proven useful to sample.
The difference between data capture and analysis to date and what the future holds hinges on correctly applying extrapolation methods. Only when life and death are at stake does eliminating the confidence interval in your sample-based prediction warrant inclusion of more data.
In a dynamic world, certainty may be over-rated and leaving some room for doubts will help us continue to keep looking, probing and discovering.

Michael Brill

While I understand the sentiment behind “starting with queries is a dead end,” I think that’s exactly where you need to start. More precisely, getting to user intent is critical. Otherwise, we’ll continue down this path where our fancy analytics just generate a steady stream of false-positives and otherwise non-actionable output.

Maybe the trick is creating software that can elicit queries from humans in a structure that is more usable than simple database queries against a fixed schema. This gives machines and data scientists something market-driven to aim for rather than guessing what the market might want… kind of lean principles applied to analytics.
