Summary:

Forget about how much data a disk can store or whether companies will use Hadoop. The questions for big data going forward are how they’ll use Hadoop, how intelligent our systems can actually become and how we’ll keep them in check.

Big data has evolved a lot over the past few years: from a happy buzzword to a hated buzzword, and from a focus on volume to a focus on variety and velocity. The term “big data” and the technologies that encompass it have been pored over, picked over and bastardized sometimes beyond recognition. Yet we’re at a point now where it’s finally becoming clear what all of this talk has been leading up to.

It’s a world of automation and intelligence, where it’s easier than ever to mine data, but also to build intelligence into everything from mobile apps to transportation systems. Big was never really the end goal, but the models driving this change generally feed on data to get smarter. Variety was never really a goal, either; it’s just that the more we can quantify, the more we can learn about the world around us.

It’s a world we’ll delve into in great detail at our Structure Data conference, which kicks off just a week from today (March 19) in New York. We have speakers from nearly every tech company that matters, as well as from some of the biggest companies in the world and some of the smartest startups around. They’ll be talking about everything from fighting human trafficking to the future of Hadoop and the cutting edge in artificial intelligence.

Here are five of the big trends I have been watching that helped shape who’ll be speaking and what they’ll be talking about. Hopefully, they give you something to think about and, if you’re planning to attend, something to look forward to.

1. Hadoop’s march toward true platform status

Apache Hadoop might still be a distributed file system and the MapReduce processing framework, but Hadoop is so much more. Thanks to general advances such as YARN, Hadoop clusters can now run any number of different processing frameworks for any number of different workloads, all taking advantage of the same underlying storage infrastructure. What was once a MapReduce cluster for ETL jobs, for example, can now also operate simultaneously as a Spark cluster for machine learning, a Storm cluster for stream processing and a Tez cluster for interactive SQL.

Source: Hortonworks

Essentially, Hadoop is transforming from a tool useful for certain tasks into a bona fide platform capable of supporting all sorts of applications. Early adopters such as Airbnb and Twitter are already taking advantage of this new reality, and efforts by Hadoop vendors such as Cloudera, Hortonworks and MapR to build new capabilities into their products and support new frameworks suggest mainstream Hadoop users also will at some point. Startups such as Continuuity, Mortar Data and WibiData should speed this evolution as they make it easier to build big data applications, all the while open sourcing some of their technological underpinnings and thus giving tools to even more developers.

Of course, it won’t just be developers feeling the effects of Hadoop as a platform, but incumbent software vendors, as well. Traditional data warehouse, database, and even statistics software companies will have to find ways to cope with the fact that Hadoop can now store a lot more data than they can for a lot less money, and also analyze it in a variety of ways.

2. The rise of artificial intelligence, finally

We have the computers, we have the data, and we have the algorithms: so we now have the artificial intelligence. No, it’s not yet the fear-mongering stuff of science-fiction or the human-replacing stuff of Her, but AI is finally for real. Thanks to advances in machine learning, we have smartphones that can recognize voice commands, media services that predict what movies we’ll like, software that can identify the relationships among billions of data points, and apps that can predict when we’re most fertile.

We have IBM’s Watson system putting together ingredient lists for chefs.

AI now comes in a food truck. Source: IBM

Looking forward, the work being done in areas like deep learning will make our AI systems more useful and more powerful. Set loose on complex datasets, these models are able to extract and identify the features of what they’re analyzing at a level that can’t be programmed. Without human supervision, deep learning projects have figured out what certain objects look like, mapped word usage across languages and even learned the rules of video games. All of a sudden, it looks possible to automate certain tasks such as tagging content to make it searchable, or predicting with high accuracy what someone’s words mean or what they’ll type next.

Applied to new types of content in new areas, these methods could prove even more valuable. What are the features that comprise a certain type of cancer cell? Can we help nurses know as much as doctors? What combination of previously unmeasurable variables might signal a suicide risk among teenagers? How do we make self-driving cars and drone delivery services commercial realities? We can’t call it a savior yet, but AI does seem to hold a lot of promise.

3. Analytic power to the people

It might not seem like a big deal compared with the really hard infrastructure and algorithm work being done elsewhere, but efforts to make data analysis a standard and easy-to-achieve skill could prove transformative to our society. Just giving everyday people the power to visualize the data around them in new ways can open up entirely new ways of thinking about our lives.

Yesterday, for example, I used free software to build a network graph of my iTunes library and compare the words Edward Snowden used in a recent interview to those used by NSA boss Gen. Keith Alexander. I wasn’t doing data science or deep learning, but I was able to perform simple analysis on, and then visualize, data that I found interesting. Previously, I’ve mapped my Twitter followers, analyzed Gigaom writers’ headlines and even visualized my food intake and exercise. Who knows, getting young people interested in analyzing their own data with interesting visualizations might actually help spur that data-savvy workforce everyone seems to think we need.

And as the tools available to laypeople get more advanced, and as we accumulate even more data about ourselves via fitness trackers, connected cars and the internet of things, in general, being able to get a sense of our quantified selves will become much more important. We are, for many purposes, becoming numbers fed into and spit out of algorithms. Our personal data will influence everything from the ads we see to the job offers we get, and it will behoove individuals to see at least a modicum of what companies, institutions and the government are seeing.

4. The cloud

I said three years ago that cloud computing and big data are on a collision course, and it’s finally happening, only in a much broader way than I predicted. In fact, the biggest impacts of this convergence might have little to do with being able to consume Hadoop, business intelligence suites or any other sort of analytic software as a service. Those things are happening, and they’ll make life easier for startups and established companies that want to move new workloads into the cloud, but the beauty of the cloud to me is now its ability to democratize hard computer science.

Already, some of the technologies and techniques I’ve highlighted are being delivered as services, often via API, and the list will only grow. If you’re a developer and you want to learn Hadoop and use Elastic MapReduce, that’s available. But if you just want to connect to a service like IBM’s Watson cloud or the MindMeld API and have someone else’s algorithms provide a layer of artificial intelligence to your data, that’s an option, too. The work being done at places from Google to Pinterest to Netflix assures that many of these techniques will just be embedded into the services we consume.

A hack using AlchemyAPI to extract concepts from typed notes. Source: AlchemyAPI

Assuming these approaches really work and let developers deliver real intelligence (as opposed to, say, generic recommendation features that become more of a plague than a benefit), they will raise the bar for what consumers expect, even for mundane tasks. Many of us will expect to know not just that our shopping list contains lettuce, but also what recipes it’s good for, what are good alternatives if the store is out and where we can get the lowest prices. Paired with the available processing power and data capacity of our smartphones and other computers, well-designed apps can actually make this a reality whatever signal we’re picking up from AT&T’s towers.

5. The law

And, finally, the legal system is the potential rain — or, depending on how you look at it, parental chaperone — on this big data party. Already, judges, legislators, regulators and even the president are trying to get their heads around what all this data collection means and then carve out some semblance of order. It is not easy terrain to navigate, especially with all the competing interests at play.

With its breach and its targeting of a pregnant teen, Target is the ultimate example of what can go wrong with data. Source: Flickr user j.reed.

One of the trickiest areas to govern will be the consumer privacy area, where there’s great potential to elevate the consumer experience but also great risk of invading individuals’ privacy. Oh, and lots of lobbyist money is now coming into play. We want to get the best deals on food or new clothes, and we want to be able to have our DNA sequenced for $99. We also need to make sure the potentially sensitive information we supply isn’t used in unexpected ways or doesn’t pop up in places we didn’t expect it to, such as in banner ads on a shared computer.

It will be a big challenge for lawmakers and others in the legal field to craft a framework of laws, regulations, and case law that lets consumers have their cake and eat it too when it comes to privacy. Frankly, I’m not sure they can without understanding the technology and where it’s headed, and I’m not sure we’ll ever really be happy with the results.

Sure, we don’t want Facebook, Google and ultimately even someone like Geico analyzing the heck out of all our data, but we also don’t want to go back to a world of weirdly designed websites, waiting for taxis, and generally inefficient, not-personalized lives.

  1. Regarding AI, it’s interesting to see some of the comments that Yoshua Bengio recently made in his Reddit AMA. Here are a few:

    1. I predict that deep learning will have a big impact in natural language processing. It has already had an impact, in part due to an old idea of mine (from NIPS’2000 and a 2003 paper in JMLR): represent words by a learned vector of attributes, learned so as to model the probability distribution of sequences of words in natural language text. The current challenge is to learn distributed representations for sequences of words, phrases and sentences. Look at the work of Richard Socher, which is pretty impressive. Look at the work of Tomas Mikolov, who beat the state of the art in language models using recurrent networks and who found that these distributed representations magically capture some form of analogical relationships between words. For example, if you take the representation for Italy minus the representation for Rome, plus the representation for Paris, you get something close to the representation for France: Italy – Rome + Paris = France. Similarly, you get that King – Man + Woman = Queen, and so on. Since the model was not trained explicitly to do these things, this is really amazing.

    2. I believe that the really interesting challenge in NLP, which will be the key to actual “natural language understanding”, is the design of learning algorithms that will be able to learn to represent meaning. For example, I am working on ways to model sequences of words (language modeling) or to translate a sentence in one language into a corresponding one in another language. In both of these cases we are trying to learn a representation of the meaning of a phrase or sentence (not just of a single word). In the case of translation, you can think of it like an auto-encoder: the encoder (that is specialized to French) can map a French sentence into its meaning representation (represented in a universal way), while a decoder (that is specialized to English) can map this to a probability distribution over English sentences that have the same meaning (i.e. you can sample a plausible translation). With the same kind of tool you can obviously paraphrase, and with a bit of extra work, you can do question answering and other standard NLP tasks. We are not there yet, and the main challenges I see have to do with numerical optimization (it is difficult not to underfit neural networks, when they are trained on huge quantities of data). There are also more computational challenges: we need to be able to train much larger models (say 10000x bigger), and we can’t afford to wait 10000x more time for training. And parallelizing is not simple but should help. All this will of course not be enough to get really good natural language understanding. To do this well would basically allow to pass some Turing test, and it would require the computer to understand a lot of things about how our world works. For this we will need to train such models with more than just text. The meaning representation for sequences of words can be combined with the meaning representation for images or video (or other modalities, but image and text seem the most important for humans).
Again, you can think of the problem as translating from one modality to another, or of asking whether two representations are compatible (one expresses a subset of what the other expresses). In a simpler form, this is already how Google image search works. And traditional information retrieval also fits the same structure (replace “image” by “document”).

    3. I just started a page that lists some of the papers on neural nets for machine translation: https://docs.google….5Zjv6VtEVgcFr6k

    Briefly, since neural nets already beat n-grams on language modeling, you can first use them to replace the language-modeling part of MT. Then you can use them to replace the translation table (after all it’s just another table of conditional probabilities). Other fun stuff is going on. The most exciting and ambitious approaches would completely scrap the current MT pipeline and learn to do end-to-end MT purely with a deep model. The interesting aspect of this is that the output is structured (it is a joint distribution over sequences of words), not a simple point-wise prediction (because there are many translations that are appropriate for a given source sentence).

  2. Personally, I would like to see it all erased. Every last bit of it. Then we can think rationally if we ever want Big Data again. I doubt it, if we really think about the potential for abuse and the amount of abuse that we have already endured.

  3. Venture Hire Sunday, March 23, 2014

    Currently, big data is growing all over the world in a vast way. People have more data, and to handle that data they are moving toward big data analytics and its applications, working with thousands of computationally independent computers processing petabytes of data.
    Hadoop was derived from Google’s MapReduce and Google File System.

    After going through this blog, we can observe that big data will continue even after 5 years.

    Thanks!
    Looking forward to more new blog posts.
