
Summary:

Big data has been a buzzword for years, but it’s a lot more than just buzz. There are now so many tools and technologies for creating, collecting and analyzing data that almost anything is possible if you know where to look.

We’re on the cusp of a real turning point for big data. Its applications are becoming clearer, its tools are getting easier and its architectures are maturing in a hurry. It’s no longer just about log files, clickstreams and tweets. It’s not just about Hadoop and what’s possible (or not) with MapReduce.

With each passing day, big data is becoming more about creativity — if someone can think of an application, they can probably build it. That makes the concept of big data a lot more tangible and a lot more useful to a lot more companies, and it makes the market for big data a lot more lucrative.

Here are five technologies helping spur a shift in thinking from “Why would I want to use some technology that Yahoo built? And how?” to “We have a problem that needs solving. Let’s find the right tool to solve it.”

Apache Spark

When it comes to open source big data projects, they don’t get much hotter than Apache Spark. The data-processing framework is garnering a lot of users and a lot of supporters — including from Hadoop vendors MapR and Cloudera — because it promises to be almost everything for Hadoop deployments (arguably the foundation of most enterprise big data environments) that MapReduce wasn’t. It’s fast, it’s easy to program and it’s flexible.

Right now, Spark is getting a lot of attention as an engine for machine-learning workloads — for example, Cloudera Oryx and even Apache Mahout are porting their code bases to Spark — as well as for interactive queries and data analysis. As the project’s community grows, the list of target workloads should expand, as well.

Source: Databricks

Spark’s popularity is aided by the YARN resource manager for Hadoop and the Apache Mesos cluster-management software, both of which make it possible to run Spark, MapReduce and other processing engines on the same cluster using the same Hadoop storage layer. I wrote in 2012 about the move away from MapReduce as one of five big trends helping us rethink big data, and Spark has stepped up as the biggest part of that migration.
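
To make the programming-model point concrete, here is the classic word count written as explicit map and reduce steps in plain Python. Spark's RDD API chains the same operations (flatMap, map, reduceByKey are real method names) but runs them distributed across a cluster, keeping intermediate results in memory rather than writing them to disk between stages the way MapReduce does. This is only a single-machine sketch of the pattern, not Spark code.

```python
from functools import reduce

# A toy corpus standing in for a distributed dataset.
lines = [
    "spark is fast",
    "spark is easy to program",
    "mapreduce writes to disk",
]

# Map phase: split each line into (word, 1) pairs.
pairs = [(word, 1) for line in lines for word in line.split()]

# Reduce phase: sum the counts for each word.
def merge(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

counts = reduce(merge, pairs, {})
print(counts["spark"])  # 2
```

In Spark the same pipeline would be a one-liner over an RDD, with the framework handling partitioning, shuffling and fault tolerance.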

Cloud computing

This might seem obvious — we’ve been talking about the convergence of cloud computing and big data for years — but cloud computing offerings have advanced significantly in just the past year. There are bigger, faster and ever-cheaper raw compute options, many offering high memory capacity, solid-state drives or even GPUs. All of this makes it much easier, and much more economically feasible, to run myriad types of data-processing workloads in the cloud.

The market for managed Hadoop and database services continues to grow, as does the market for analytics services. These offerings are quickly adding new capabilities and, as the technologies underpinning them advance, they’re becoming faster and more scalable.

Amazon CTO Werner Vogels announcing Kinesis in November.

Cloud providers are also targeting emerging use cases, such as stream processing, the internet of things and artificial intelligence. Amazon Web Services offers a service called Kinesis for processing data as it crosses the wire. Microsoft is previewing a service designed specifically to capture and store data streaming off of sensors. A handful of vendors, including IBM, Expect Labs and AlchemyAPI are providing various flavors of artificial intelligence via API, meaning developers can build intelligent applications without first mastering machine learning.
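
To give a flavor of what “processing data as it crosses the wire” means, here is a toy rolling aggregate in plain Python. A real Kinesis consumer would poll the service for new records; the list standing in for the stream here is purely illustrative.

```python
from collections import deque

WINDOW = 3  # keep only the last 3 readings in memory

def rolling_average(stream, window=WINDOW):
    # Consume records one at a time and emit an updated aggregate
    # after each arrival -- no batch load, no full dataset in memory.
    buf = deque(maxlen=window)
    for reading in stream:
        buf.append(reading)
        yield sum(buf) / len(buf)

# A stand-in for records arriving off a sensor stream.
sensor_stream = [10, 12, 14, 40, 12]
averages = list(rolling_average(sensor_stream))
print(averages[-1])  # average of the last 3 readings: (14+40+12)/3 = 22.0
```

The point is the shape of the computation: state is bounded and results are available continuously, which is exactly what stream-processing services sell.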

We’ll talk a lot more about the future of cloud computing at our Structure conference June 18 and 19 in San Francisco. Speakers include Amazon CTO Werner Vogels, Google SVP and Technical Fellow Urs Hölzle, and Microsoft EVP Scott Guthrie. Also, Airbnb VP Mike Curtis will discuss how that company runs big data workloads in the cloud, and New York Times Chief Data Scientist Chris Wiggins will talk about the newspaper’s work in machine learning.

Sensors

A lot of talk about sensors focuses on the volume and speed at which they generate data, but what’s often ignored are the strategic decisions that go into choosing the right sensors to gather the right data. If there are real-world measurements that need to be taken, or events that need to be logged, there’s probably a fairly inexpensive sensor available to do the job. Sensors are integral to smarter cars, of course, but also to everything from agriculture to hospital sanitation.

And if there’s no usable sensor commercially available, it’s not inconceivable to build one from scratch. A team of university researchers, for example, built an inexpensive sensor to measure the wing speed of insects using a cheap laser pointer and a digital recorder. It helped them capture more and better data than previous researchers could, resulting in a significantly more accurate model for classifying bugs.
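
For a sense of how such sensor data turns into a classifier, here is a toy nearest-centroid sketch in Python. The species and wingbeat frequencies are made-up placeholders rather than the researchers’ data, and the actual study used far more sophisticated modeling.

```python
# Typical wingbeat frequencies in Hz -- illustrative values only.
CENTROIDS = {
    "mosquito": 600.0,
    "housefly": 200.0,
    "bumblebee": 130.0,
}

def classify(frequency_hz):
    # Guess the species whose typical frequency is closest to the reading.
    return min(CENTROIDS, key=lambda s: abs(CENTROIDS[s] - frequency_hz))

print(classify(580.0))  # mosquito
```

The cheap sensor’s job is simply to produce that one number reliably; once the measurement exists, even simple models can do useful work with it.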

The setup used to measure the insects’ data.

That type of creativity highlights what’s possible thanks to the convergence of sensors, consumer electronics, big data and, presumably, the maker movement and 3-D printing. If more, different and better data leads to better analysis, it’s now easier than ever to collect that data yourself rather than wait for someone else to do it.

Artificial intelligence

Thanks to the proliferation of data in the form of photos, videos, speech and text, there’s now an incredible amount of effort going into building algorithms and systems that can help computers understand those inputs. From a big data perspective, the interesting thing about these approaches — whether they’re called deep learning, cognitive computing or some other flavor of artificial intelligence — is that they’re not yet really about analytics in the same way so many other big data projects are.

AI researchers aren’t so concerned — yet — with uncovering trends or finding the needle in the haystack as they are with automating tasks that humans can already do. The big difference, of course, is that, done right, the systems can perform tasks such as object or facial recognition, or text analysis, much faster and at a much greater scale than humans can. As they get more accurate and require less training, these systems could power everything from intelligent ad platforms to much smarter self-driving cars.

Remarkably, the techniques for doing all this are being democratized at a rapid clip, and will soon be accessible to a lot more people via software, open source libraries and even APIs. Google and Facebook are spending hundreds of millions of dollars advancing the state of the art in AI, but anyone brave enough to give it a whirl can get their hands on similar capabilities for very little money, if not for free.
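
As a taste of what sits underneath those libraries and APIs, here is a minimal perceptron, an ancestor of today’s deep networks, trained on a toy linearly separable task. Real systems stack many layers and train on millions of examples, but the core loop of predict, compare and nudge the weights is the same.

```python
def train(samples, labels, epochs=20, lr=0.1):
    # Learn weights w and bias b so that w.x + b > 0 exactly for the
    # positive examples, by nudging the weights after each mistake.
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x, y), target in zip(samples, labels):
            pred = 1 if w[0] * x + w[1] * y + b > 0 else 0
            err = target - pred
            w[0] += lr * err * x
            w[1] += lr * err * y
            b += lr * err
    return w, b

# Toy task: label is 1 roughly when x + y is large.
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.9, 0.9), (0.1, 0.2)]
labels  = [0, 0, 0, 1, 1, 0]
w, b = train(samples, labels)

def predict(x, y):
    return 1 if w[0] * x + w[1] * y + b > 0 else 0

print(predict(0.8, 0.9), predict(0.1, 0.1))  # prints: 1 0
```

The democratization story is that nobody has to write even this much anymore: the equivalent capability is a library import or an API call away.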

Quantum computing

Commercial quantum computing is still a way off, but we can already see what might be possible when it arrives. According to D-Wave Systems, the company that has sold prototype versions of its quantum computer to Google, NASA and Lockheed Martin, it’s particularly good at advanced machine learning tasks and difficult optimization problems. Google is testing out computer vision algorithms that could eventually run on smartphones; Lockheed is trying to improve software validation for flight systems.

It’s powerful stuff that could help companies of all stripes solve some difficult computing and analytic tasks that today’s most-advanced systems and techniques can’t. Or, at least, quantum computing should be able to solve those problems faster and more efficiently.

Before that can happen, though, mainstream businesses will need access to quantum resources and some knowledge of how to use them. D-Wave is vowing to make the resources available via the cloud, and is working on compilers to simplify the programming aspect. There’s a lot of ground to cover before that happens, but the technology is moving fast, and quantum computer instances delivered via the Amazon Web Services or Google clouds aren’t out of the realm of possibility.
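
For a concrete picture of the kind of optimization problem a D-Wave machine attacks, here is a tiny QUBO (quadratic unconstrained binary optimization) instance solved by classical brute force in Python. The Q matrix is a made-up example; the promise of quantum annealing is minimizing the same form of objective at sizes where enumerating all 2^n bit vectors is hopeless.

```python
from itertools import product

# A small, illustrative QUBO: minimize sum over i,j of Q[i][j] * x[i] * x[j]
# where each x[i] is 0 or 1.
Q = [
    [-1,  2,  0],
    [ 0, -1,  2],
    [ 0,  0, -1],
]

def qubo_energy(x, Q):
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def brute_force_minimum(Q):
    # Try every bit vector -- feasible only for a handful of variables.
    n = len(Q)
    return min(product((0, 1), repeat=n), key=lambda x: qubo_energy(x, Q))

best = brute_force_minimum(Q)
print(best, qubo_energy(best, Q))  # prints: (1, 0, 1) -2
```

Many hard scheduling, routing and machine-learning problems can be encoded in this form, which is why the hardware draws interest well beyond physics labs.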

  1. Reblogged this on Timothy Dockins and commented:
    I am positioning myself, it appears, on the forefront of the application of AI to predictive analytics. I have a decent academic background in machine learning where I started focusing on AI in my undergrad and continued it in graduate school. I’m currently researching at the forefront of transfer learning where we try to leverage past learning to improve learning performance on new tasks; all in an effort to learn more and faster. Part of that effort delves into automatically discovering new features about data that isn’t readily apparent. I’ve been looking at Deep Belief Networks, Convolutional Neural Nets, Sparse Coding, and now Restricted Boltzmann Machines. These tools, combined with something like Apache Spark, could really dig deep into some Big Data!

  2. Geoffrey Moore Tuesday, May 6, 2014

    I love the technology explosion and the fountain of ideas it is unleashing, but given the amount of disruption and make-it-yourself going on, we are a long, long way from even reaching the chasm, much less crossing it. This is still the Early Market, a time for visionary sponsors to underwrite game-changing projects to garner dramatic competitive advantage.

    1. Derrick Harris Tuesday, May 6, 2014

      Well, far be it from me to disagree with the guy who wrote the book on crossing chasms ;-) I don’t know the exact time frame (commercial quantum computing, for example, is a few years off, at least), but I do think the applications for big data are becoming much more clear, and the technologies are easier to consume and able to do more of what mainstream companies will expect. As the tech gets commercialized, companies will be able to move from idea to pilot pretty easily.

  3. I thought that what was developed last year in machine learning and AI was neat, but this year and the next should see even more amazing possibilities through the use of Big Data.

    Things like what the Allen Institute has supported: a program that can accurately match diagrams to supporting text in geometry problems; a program that learns EVERYTHING about images (and videos), all unsupervised (needs lots of data!) — concepts, parts, actions, you name it; and various programs that dramatically improve question-answering technology, and can even handle informal and malformed questions. What they’re funding for this coming year looks even more exciting, and includes deep machine reading and commonsense reasoning.

    Then there are the upcoming talks at ICML, one of whose titles is “Distributed Representations of Sentences and Documents,” by Quoc Le and Tomas Mikolov. Get that? “sentences and documents” not “words and phrases”.

    There are the neural nets being developed by people at Oxford that can already beat Socher’s sentiment classifier. There are several papers about this.

    There’s the work of Tom Mitchell and others on fusing brain-scan data with data generated from large corpora to produce better word vectors. It appears that adding just a dash of brain data dramatically improves the quality of word vector representations, and allows the embedding dimension to be much larger without degradation in performance.

    There’s the recent work of Zweig on how to set up word vectors to handle antonyms (by using thesaurus data).

    And the list goes on and on…

    1. Derrick Harris Wednesday, May 7, 2014

      You are consistently a great source of info on this space. Now if you’d just link to some of the research ;-)

      1. Ok, here’s a list of references, with links. It would be great to hear more (hint, hint) about some of these:

        1. These are a few recent papers funded by the Allen Institute, or published on their page:

        a. The one on geometric diagram understanding (really amazing! — hard to believe they could make it that accurate):

        http://www.allenai.org/Content/Publications/diagram_understanding_in_geometry_questions.pdf

        b. This one is on improving q-a (see the level of improvement):

        http://www.allenai.org/Content/Publications/acl2014.pdf

        c. Another one on improving q-a (even more amazing! — read the level of improvement, and the fact that it handles malformed questions and informal language):

        http://cs.jhu.edu/~xuchen/paper/yao-jacana-freebase-acl2014.pdf

        d. Here’s an article on a computer vision program that can learn “everything about anything”. It’s unsupervised, and isn’t up to the level of the best supervised algorithms; on the other hand, it’s very general:

        http://www.allenai.org/Content/Publications/objectNgrams_cvpr14.pdf

        2. You’ll have to wait until Le and Mikolov publish their paper on the web. It should appear fairly soon, as ICML is drawing near.

        3. Some of the papers by the people at Oxford:

        a. This one is neat, but there are no numbers yet on how well it performs. The input is sentences/questions (encoded in a standard way), and the output is database queries in whatever form you wish:

        http://arxiv.org/abs/1404.7296

        Perhaps they will even find a way to handle the answer generation using a neural net, thereby creating an entire q-a system, all in one, single neural net (pre-trained in pieces, and then combined).

        b. This paper is about a convolutional neural net that, among other things, can be used to determine sentiment — see the part about how it forms representations of sentences:

        http://arxiv.org/abs/1404.2188

        There are other papers…

        4. The paper of Mitchell and others on word embeddings that leverage brain data:

        http://www.cs.cmu.edu/~afyshe/papers/acl2014/jnnse_acl2014.pdf

        One could imagine many uses for this, given enough brain data; for example, maybe it would improve parsers (as in a paper by Baroni), sentiment classifiers (by initializing models with brain+text word vectors), language translation, etc.

        5. Recent work by Zweig on word vectors and antonyms:

        http://research.microsoft.com/pubs/213051/antoLM.pdf

        6. Here’s a paper on automating problem generation and solution for various problems in STEM fields:

        http://research.microsoft.com/en-us/um/people/sumitg/pubs/cacm14.pdf

        (Imagine a soft-AI tutor for math, chemistry, physics, etc.)

        7. There was the recent work on automatically solving algebra word problems:

        http://people.csail.mit.edu/regina/my_papers/wp.pdf

        8. And what have the people on the Spaun project been up to?

        http://compneuro.uwaterloo.ca/files/publications/bekolay.2014.pdf

        They have a new Nengo 2.0 platform, on which neural nets the size of Spaun run 50 times faster than they did on Nengo 1.4. They also have some robot experiments!

        9. There’s work by Josh Tenenbaum at MIT on physical commonsense / approximate physics, which appeared in PNAS:

        http://www.pnas.org/content/110/45/18327.full

        This could be extremely important to the future of AI.

        10. New results on video recognition from some people at Stanford and Google, using deep neural nets:

        http://cs.stanford.edu/people/karpathy/deepvideo/

        11. It will be interesting to see the outcome of the CoNLL “shared task” on unrestricted grammar correction. The task is: you feed in an essay, and the computer returns a corrected version, taking into account any type of error. Think of what that could mean for foreign students… or even people who don’t write well.

  4. Sean Suchter Wednesday, May 7, 2014

    Great list, and Spark definitely belongs here. I think Spark will become a pretty useful hammer in everyone’s big data toolbox, and the adoption of YARN will help everyone get there. I think it’ll get way beyond the current ML platforms into interactive queries and even user-facing workloads. Once it’s on the same cluster, sharing the CPU, RAM, disk and network with MapReduce/HBase/others, businesses will have to make a comprehensive performance management plan for all of these components.

    Best,
    Sean Suchter
    Pepperdata
