For now, Spark looks like the future of big data

Titles can be misleading. For example, the O’Reilly Strata + Hadoop World conference took place in San Jose, California, this week but Hadoop wasn’t the star of the show. Based on the news I saw coming out of the event, it’s another Apache project — Spark — that has people excited.

There was, of course, some big Hadoop news this week. Pivotal announced it’s open sourcing its big data technology and essentially building its Hadoop business on top of the Hortonworks platform. Cloudera announced it earned $100 million in 2014. Lost in the grandstanding was MapR, which announced something potentially compelling in the form of cross-data-center replication for its MapR-DB technology.

But pretty much everywhere else you looked, it was technology companies lining up to support Spark: Databricks (naturally), Intel, Altiscale, MemSQL, Qubole and ZoomData among them.

Spark isn’t inherently competitive with Hadoop (in fact, it was designed to work with Hadoop’s file system and is a major focus of every Hadoop vendor at this point), but in practice it kind of is. Spark is known primarily as an in-memory data-processing framework that’s faster and easier to use than MapReduce, but it’s actually a lot more. Among the other projects included under the Spark banner are file system, machine learning, stream processing, NoSQL and interactive SQL technologies.

The Spark platform, minus the Tachyon file system and some younger related projects.

In the near term, Hadoop will probably pull Spark into the mainstream because Hadoop is, at the very least, still a cheap, trusted big data storage platform. And with Spark still being relatively immature, it’s hard to see too many companies ditching Hadoop MapReduce, Hive or Impala for their big data workloads quite yet. Wait a few years, though, and we might start seeing some more tension between the two platforms, or at least an evolution in how they relate to each other.

This will be especially true if there’s a big breakthrough in RAM technology or if RAM prices drop to a level that’s more comparable to disk. The same goes if Databricks can convince companies they want to run their workloads in its nascent all-Spark cloud environment.

Attendees at our Structure Data conference next month in New York can ask Spark co-creator and Databricks CEO Ion Stoica all about it: what Spark is, why Spark is and where it’s headed. Coincidentally, Spark Summit East is taking place on the exact same days in New York, where folks can dive into the nitty-gritty of working with the platform.

There were also a few other interesting announcements this week that had nothing to do with Spark, but are worth noting here:

  • Microsoft added Linux support for its HDInsight Hadoop cloud service, and Python and R programming language support for its Azure ML cloud service. The latter also now lets users deploy deep neural networks with a few clicks. For more on that, check out the podcast interview with Microsoft Corporate Vice President of Machine Learning (and Structure Data speaker) Joseph Sirosh embedded below.
  • HP likes R, too. It announced a product called HP Haven Predictive Analytics that’s powered by a distributed version of R developed by HP Labs. I’ve rarely heard HP and data science in the same sentence before, but at least it’s trying.
  • Oracle announced a new analytic tool for Hadoop called Big Data Discovery. It looks like a cross between Platfora and Tableau, and I imagine it will be used primarily by companies that already purchase Hadoop in appliance form from Oracle. The rest will probably keep using Platfora and Tableau.
  • Salesforce.com furthered its newfound business intelligence platform with a handful of features designed to make the product easier to use on mobile devices. I’m generally skeptical of Salesforce’s prospects in terms of stealing any non-Salesforce-related analytics from Tableau, Microsoft, Qlik or anyone else, but the mobile angle is compelling. The company claims more than half of user engagement with the platform is via mobile device, which its Director of Product Marketing Anna Rosenman explained to me as “a really positive testament that we have been able to replicate a consumer interaction model.”

If I missed anything else that happened this week, or if I’m way off base in my take on Hadoop and Spark, please share in the comments.

7 Comments

Rahul

Very great content and beautifully explained. Thanks for sharing your post. Spark can create distributed datasets from any file stored in the Hadoop distributed filesystem (HDFS) or other storage systems supported by the Hadoop APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, etc.). It’s important to remember that Spark does not require Hadoop; it simply has support for storage systems implementing the Hadoop APIs. Spark supports text files, SequenceFiles, Avro, Parquet, and any other Hadoop InputFormat. More at http://www.youtube.com/watch?v=1jMR4cHBwZE

Hari Sekhon

Companies are rising and falling at the speed of code these days… Spark’s ascent is a sign that spending $1B on an incumbent tech company is not smart when those companies are making only a tiny fraction of that circa $100M which is all lost in operating costs. Cloudera’s Mike Olson says “code trumps cash”… which should keep all Hadoop vendors sharp and contributing to the Hadoop ecosystem of Apache projects, benefitting users or falling from the grace of the developers and architects that recommend which platform to buy or migrate to/from. Good times to be a user technologist either way! :)

Viplav

A nit – Cloudera sales are $100 million. It did not “earn” $100 million.
On Spark, I think you are generally right. I can see some tension already between Hadoop (YARN/Impala/Tez) and Spark and hope that all of them find their niches in the broader Hadoop ecosystem.

Carnot Antonio Romero

I think Spark will mature and supplant MapReduce faster than you imagine. Other than that, I think you’re spot on.

Carnot Antonio Romero

I had missed this story, thanks for sharing it. (Like the rest of the world, my eyes were glued to Spark.) I think this is a radical shift in the positioning of ClearStory, which I had seen as a (very cool) silo that needed a way to bridge back out to the rest of the world. If it becomes the face of your data lake, it’s much more interesting.

marc

Liked your review of Spark but your comment about HP is off-base. HP has the only platform in Big Data that can store, manage and analyze 100% of the world’s data between Vertica and IDOL, the two key components of Haven. HP also has a SQL on Hadoop and FlexZone offering for mining dark data… According to Gartner, Vertica and IDOL are both in the Leaders Quadrant in their respective categories. This is a little more than trying but succeeding.
