
5 technologies that will help big data cross the chasm


We’re on the cusp of a real turning point for big data. Its applications are becoming clearer, its tools are getting easier and its architectures are maturing in a hurry. It’s no longer just about log files, clickstreams and tweets. It’s not just about Hadoop and what’s possible (or not) with MapReduce.

With each passing day, big data is becoming more about creativity — if someone can think of an application, they can probably build it. That makes the concept of big data a lot more tangible and a lot more useful to a lot more companies, and it makes the market for big data a lot more lucrative.

Here are five technologies helping spur a shift in thinking from “Why would I want to use some technology that Yahoo built? And how?” to “We have a problem that needs solving. Let’s find the right tool to solve it.”

Apache Spark

When it comes to open source big data projects, they don’t get much hotter than Apache Spark. The data-processing framework is garnering a lot of users and a lot of supporters — including from Hadoop vendors MapR and Cloudera — because it promises to be almost everything for Hadoop deployments (arguably the foundation of most enterprise big data environments) that MapReduce wasn’t. It’s fast, it’s easy to program and it’s flexible.

Right now, Spark is getting a lot of attention as an engine for machine-learning workloads — for example, Cloudera Oryx and even Apache Mahout are porting their code bases to Spark — as well as for interactive queries and data analysis. As the project’s community grows, the list of target workloads should expand, as well.

Source: Databricks

Spark’s popularity is aided by the YARN resource manager for Hadoop and the Apache Mesos cluster-management software, both of which make it possible to run Spark, MapReduce and other processing engines on the same cluster using the same Hadoop storage layer. I wrote in 2012 about the move away from MapReduce as one of five big trends helping us rethink big data, and Spark has stepped up as the biggest part of that migration.

Cloud computing

This might seem obvious (we’ve been talking about the convergence of cloud computing and big data for years), but cloud computing offerings have advanced significantly in just the past year. There are bigger, faster and ever-cheaper raw compute options, many offering high memory capacity, solid-state drives or even GPUs. All of this makes it much easier, and much more economically feasible, to run myriad types of data-processing workloads in the cloud.

The markets for managed Hadoop and database services, and for analytics services, continue to grow. These services are quickly adding new capabilities and, as the technologies underpinning them advance, they’re becoming faster and more scalable.

Amazon CTO Werner Vogels announcing Kinesis in November.

Cloud providers are also targeting emerging use cases, such as stream processing, the internet of things and artificial intelligence. Amazon Web Services offers a service called Kinesis for processing data as it crosses the wire. Microsoft is previewing a service designed specifically to capture and store data streaming off of sensors. A handful of vendors, including IBM, Expect Labs and AlchemyAPI, are providing various flavors of artificial intelligence via API, meaning developers can build intelligent applications without first mastering machine learning.

We’ll talk a lot more about the future of cloud computing at our Structure conference June 18 and 19 in San Francisco. Speakers include Amazon CTO Werner Vogels, Google SVP and Technical Fellow Urs Hölzle, and Microsoft EVP Scott Guthrie. Also, Airbnb VP Mike Curtis will discuss how that company runs big data workloads in the cloud, and New York Times Chief Data Scientist Chris Wiggins will talk about the newspaper’s work in machine learning.


Sensors

A lot of talk about sensors focuses on the volume and speed at which they generate data, but what’s often ignored are the strategic decisions that go into choosing the right sensors to gather the right data. If there are real-world measurements that need to be taken, or events that need to be logged, there’s probably a fairly inexpensive sensor available to do the job. Sensors are integral to smarter cars, of course, but also to everything from agriculture to hospital sanitation.

And if there’s not a usable sensor commercially available, it’s not inconceivable to build one from scratch. A team of university researchers, for example, built an inexpensive sensor to measure the wing speed of insects using a cheap laser pointer and a digital recorder. It helped them capture more, better data than previous researchers, resulting in a significantly more accurate model for classifying bugs.

The setup used to measure the insects' data.

That type of creativity highlights what’s possible thanks to the convergence of sensors, consumer electronics, big data, and, presumably, the maker movement and 3-D printing. If more, different and better data will lead to better analysis, it’s easier than ever to collect it yourself rather than wait for someone else to do it.

Artificial intelligence

Thanks to the proliferation of data in the form of photos, videos, speech and text, there’s now an incredible amount of effort going into building algorithms and systems that can help computers understand those inputs. From a big data perspective, the interesting thing about these approaches (whether they’re called deep learning, cognitive computing or some other flavor of artificial intelligence) is that they’re not yet really about analytics in the same way so many other big data projects are.

AI researchers aren’t so concerned — yet — with uncovering trends or finding the needle in the haystack as they are with automating tasks that humans can already do. The big difference, of course, is that, done right, the systems can perform tasks such as object or facial recognition, or text analysis, much faster and at a much greater scale than humans can. As they get more accurate and require less training, these systems could power everything from intelligent ad platforms to much smarter self-driving cars.

Remarkably, the techniques for doing all this stuff are being democratized at a rapid clip and will soon be accessible to a lot more people via software, open source libraries and even APIs. Google and Facebook are spending hundreds of millions of dollars advancing the state of the art in AI, but anyone brave enough to give it a whirl can get their hands on similar capabilities for very little money, if not free.

Quantum computing

Commercial quantum computing is still a way off, but we can already see what might be possible when it arrives. According to D-Wave Systems, the company that has sold prototype versions of its quantum computer to Google, NASA and Lockheed Martin, the machine is particularly good at advanced machine learning tasks and difficult optimization problems. Google is testing out computer vision algorithms that could eventually run on smartphones; Lockheed is trying to improve software validation for flight systems.

It’s powerful stuff that could help companies of all stripes solve some difficult computing and analytic tasks that today’s most-advanced systems and techniques can’t. Or, at least, quantum computing should be able to solve those problems faster and more efficiently.

Before that can happen, though, mainstream businesses will need access to quantum resources and some knowledge of how to use them. D-Wave is vowing to make the resources available via the cloud, and is working on compilers to simplify the programming aspect. There’s a lot of ground to cover before that happens, but the technology is moving fast, and quantum computer instances delivered via the Amazon Web Services or Google clouds aren’t out of the realm of possibility.

7 Responses to “5 technologies that will help big data cross the chasm”

  1. Sean Suchter

    Great list, and Spark definitely belongs here. I think Spark will become a pretty useful hammer in everyone’s big data toolbox, and the adoption of YARN will help everyone get there. I think that it’ll get way beyond the current ML platforms into interactive queries and even user-facing workloads. Once it’s on the same cluster, sharing the CPU, ram, disk, and network as MapReduce/HBase/others, businesses will have to make a comprehensive performance management plan for all of these components.

    Sean Suchter

  2. Oneasasum

    I thought that what was developed last year in machine learning and AI was neat, but this year and the next should see even more amazing possibilities through the use of Big Data.

    Things like what the Allen Institute has supported: a program that can accurately match diagrams to supporting text in geometry problems; a program that learns EVERYTHING about images (and videos), all unsupervised (needs lots of data!), including concepts, parts, actions, you name it; and various programs that dramatically improve question-answering technology, and can even handle informal and malformed questions. What they’re funding for this coming year looks even more exciting, and includes deep machine reading and commonsense reasoning.

    Then there are the upcoming talks at ICML, one of whose titles is “Distributed Representations of Sentences and Documents,” by Quoc Le and Tomas Mikolov. Get that? “sentences and documents” not “words and phrases”.

    There are the neural nets being developed by people at Oxford that can already beat Socher’s sentiment classifier. There are several papers about this.

    There’s the work of Tom Mitchell and others on fusing brain-scan data with data generated from large corpora to produce better word vectors. It appears that adding just a dash of brain data dramatically improves the quality of word vector representations, and allows the embedding dimension to be much larger without degradation in performance.

    There’s the recent work of Zweig on how to set up word vectors to handle antonyms (by using thesaurus data).

    And the list goes on and on…

  3. Geoffrey Moore

    I love the technology explosion and the fountain of ideas it is unleashing, but given the amount of disruption and make-it-yourself going on, we are a long, long way from even reaching the chasm, much less crossing it. This is still the Early Market, a time for visionary sponsors to underwrite game-changing projects to garner dramatic competitive advantage.

    • Derrick Harris

      Well, far be it from me to disagree with the guy who wrote the book on crossing chasms ;-) I don’t know the exact time frame (commercial quantum computing, for example, is a few years off, at least), but I do think the applications for big data are becoming much more clear, and the technologies are easier to consume and able to do more of what mainstream companies will expect. As the tech gets commercialized, companies will be able to move from idea to pilot pretty easily.

  4. timdockins

    Reblogged this on Timothy Dockins and commented:
    I am positioning myself, it appears, on the forefront of the application of AI to predictive analytics. I have a decent academic background in machine learning where I started focusing on AI in my undergrad and continued it in graduate school. I’m currently researching at the forefront of transfer learning where we try to leverage past learning to improve learning performance on new tasks; all in an effort to learn more and faster. Part of that effort delves into automatically discovering new features about data that aren’t readily apparent. I’ve been looking at Deep Belief Networks, Convolutional Neural Nets, Sparse Coding, and now Restricted Boltzmann Machines. These tools, combined with something like Apache Spark, could really dig deep into some Big Data!