The big data world is operating at 1 percent

Many would be shocked to know that researchers analyze and gather insights from only 1 percent of the world’s data. That 1 percent of analyzed data has been the only driver of innovation and insights in what we now know as “big data.” The other 99 percent of the 1 quintillion bytes of data that is collected every day (according to a recent study from IDC) remains untouched.

We all know that big data has so much promise. For a very large number of problems today, the effective use of data is a bottleneck. The drug discovery problem is more about data than chemistry. The discovery of new energy sources is more about data than geology. It’s the same for tracking terrorists, detecting fraud, and more.

Today, we recognize that these, and many other critical global issues, are all data problems. This fact alone has given rise to a huge investment in big data, created the hottest job title around (data scientist) and propelled the valuations of private data analytics providers into the billions. However, imagine the endless possibilities when the world is operating on the insights gathered from 100 percent of its data.


Where do you start when you have a data set as large as the human genome, for example, or President Obama’s recent call to map the human brain? To achieve the breakthroughs we need to address the world’s most perplexing problems, we need to fundamentally change the way we gain knowledge from data. Here’s what we need to start thinking about:

  • Starting with queries is a dead end: Queries are not inherently bad. In fact, they are essential once you know what question to ask. That’s the key: the flaw is starting with queries in the hope that they will uncover a needle in the massive digital haystack. (Spoiler alert: they won’t.)
  • Data has a cost: Storing data is no longer expensive, in most cases. Even querying large amounts of data is becoming more cost effective with tools like Hadoop and Amazon’s Redshift. This is just the hard cost side of the equation, though.
  • Insights are value: The only reason we bear the cost is that we believe data holds insights that unlock value. Ultimately, the undiscovered insights that organizations miss carry a much higher cost in terms of being able to solve big problems quickly, accelerate innovation and drive growth. The cost of data collection can be high, but the cost of ineffectual analysis is even higher. The tools for getting at insights don’t exist today. Today, we rely on very smart human beings to come up with hypotheses and use our tools to validate or invalidate those hypotheses. This is a flawed strategy since it relies on (arguably smart) guesswork.
  • You have the right data today: There’s often a belief that, “If we only had more data, we could get the answer we’re looking for.” Far too much time and money is wasted collecting new data when more can be done with the data already at hand. For example, Ayasdi recently published a study in Nature Scientific Reports that shows important new insights from a 12-year-old breast cancer study that had been thoroughly analyzed for over a decade.

Big data is the beginning, not the end

I’m very concerned that the growing hype around the term big data has set us all up for disappointment. Query-based analysis is fine for a certain class of problems, but it will never be able to deliver on the expectations the market has for big data.

We are on the cusp of critical breakthroughs in cancer research, energy exploration, drug discovery, financial fraud detection and more. It would be a crime if the passion, interest and dollars invested to solve critical global problems like these were sidetracked by a “big data bubble.”

We can and should expect more from data analysis, and we need to recognize the capabilities that the next generation of solutions must be able to deliver:

  • Empower domain experts: The world cannot produce data scientists fast enough to scale to the size of the problem set. Let’s stop developing tools just for them. Instead, we need to develop tools for the business users: biologists, geologists, security analysts and the like. They understand the context of the business problem better than anyone, but might not be up to date with the latest in technology or mathematics.
  • Accelerate discovery: We need to get to critical insights faster. The promise of big data is to “operate at the speed of thought.” It turns out that the speed of thought is not that fast. If we depend on this approach, then we will never get to the critical insights quickly enough because we’ll never be able to ask all of the questions of all of the data.
  • Marriage of man and machine: To get to those insights faster, we need to invest in machine intelligence. We need machines to do more of the heavy lifting when it comes to finding the clusters, connections and relationships between data points that give business users a much better starting point to begin discovering insights. In fact, algorithmic discovery approaches can solve these problems by looking for rare but statistically significant signals in large datasets that humans would never be able to find. For example, in a recent study, previously unreported drug side effects were found by algorithmically searching through web search engine logs.
  • Analyze data in all its forms: It’s understood that researchers need to analyze both structured and unstructured data. We need to recognize the diversity and depth of unstructured data: text in all languages, voice, video and facial recognition.
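
To make the "rare but statistically significant" idea from the man-and-machine point a little more concrete, here is a minimal Python sketch. It is not the method used in the study mentioned above, nor any vendor's algorithm. The function names, the toy search sessions and the 0.001 threshold are all assumptions invented for illustration. The sketch counts how often pairs of terms appear in the same session and flags pairs that co-occur far more often than chance would predict, using a one-sided hypergeometric tail test.

```python
from collections import Counter
from itertools import combinations
from math import comb


def hypergeom_tail(k, n_a, n_b, n_total):
    """P(overlap >= k) if n_a and n_b sessions were chosen independently at random."""
    upper = min(n_a, n_b)
    favorable = sum(
        comb(n_a, i) * comb(n_total - n_a, n_b - i) for i in range(k, upper + 1)
    )
    return favorable / comb(n_total, n_b)


def flag_pairs(sessions, alpha=0.001):
    """Return (term_a, term_b, co_occurrences, p_value) for unexpectedly frequent pairs."""
    term_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        terms = sorted(set(session))
        term_counts.update(terms)
        pair_counts.update(combinations(terms, 2))

    flagged = []
    for (a, b), k in pair_counts.items():
        p = hypergeom_tail(k, term_counts[a], term_counts[b], len(sessions))
        if p < alpha:
            flagged.append((a, b, k, p))
    return sorted(flagged, key=lambda row: row[3])


# Hypothetical toy data: 1,000 search sessions, each reduced to a set of terms.
sessions = (
    [{"drug_x", "symptom"}] * 8      # the rare signal we hope to surface
    + [{"drug_x"}] * 12
    + [{"drug_z", "symptom"}] * 1    # a single chance co-occurrence
    + [{"drug_z"}] * 39
    + [{"symptom"}] * 16
    + [{"weather"}] * 924
)

for a, b, k, p in flag_pairs(sessions):
    print(f"{a} + {b}: seen together {k} times, p = {p:.2e}")
```

In this toy data only the drug_x and symptom pair is flagged, even though it appears in fewer than 1 percent of sessions. A real analysis would also need to correct for testing many pairs at once and to rule out confounding, which this sketch ignores.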

When it comes to the evolution of big data, we’ve only begun to scratch the surface. It stands to reason that if we continue to analyze 1 percent of data, then we’ll only tap into 1 percent of its potential. If we’re able to analyze the other 99 percent, then think about all of the ways that we can change the world. We can accelerate economic growth, cure cancer and other diseases, reduce the risk of terrorist attacks, and tackle many of the other big-ticket challenges that we face.

That’s something that we can all rally around.

Gurjeet Singh is the co-founder and CEO of Ayasdi, an insight discovery platform built on topological data analysis technology. He will be speaking at Structure: Data, March 20-21 in New York.

Feature image courtesy of Shutterstock user Sergey Lavrentev.
