Hadoop is like a puppy: free, but with hidden costs

David Menninger, Head of Business Development and Strategy, Pivotal

Session Name: The Dominant Data-Management Platform?


Mike Ferguson

David Menninger

Announcer 00:00

Welcome to the first session of the day, The Dominant Data-Management Platform? It’s going to be moderated by Mike Ferguson, and he’s going to be talking with David Menninger, Head of Business Development and Strategy at Pivotal, an EMC subsidiary. Please welcome our first panel to the stage.


Mike Ferguson 00:25

Morning. This session is The Dominant Data-Management Platform. I’m here today with David Menninger of Pivotal. First off, David, can you let me know what Pivotal is about?

David Menninger 00:40

Sure. Pivotal is formed from EMC and VMware, with about equal numbers of employees from each side and about half a dozen products that are focused on three things: delivering applications, dealing with big data, and Cloud-like infrastructures. We have three layers to the fabric: a Cloud fabric, a data fabric, and an application fabric.

Mike Ferguson 01:04

We’re going to talk a lot this morning about Hadoop. Before I do, let me just take a quick poll of the audience here. How many of you have Hadoop, either a commercial or open source distribution, already installed in your organization today? Probably about 5% or 6% of you, tops. Well, let’s talk about that as a platform that is very central to the big data environment. But in regard to data-management, David, where would you see the role of Hadoop here?

David Menninger 01:45

We see Hadoop playing a central role in data-management. Hadoop brings to the table a couple of characteristics that are new, or different, or easier to embrace. First of all, Hadoop allows us to deal with very large volumes of data. Secondly, Hadoop allows us to process less structured information, so information that’s either not neatly formatted or is, for instance, unstructured information like audio, video, or even text information. I would suggest that the real fundamental difference with Hadoop is that it brings a new economic structure to processing information. The economics of Hadoop are dramatically different than the economics of commercial relational database systems. When I ran product management for an MPP SQL company just three years ago, our list price was $100,000 per terabyte and $20,000 a year annual maintenance. A street price for a node of Hadoop, which can easily process a terabyte of information today, is maybe $2,000. That’s two orders of magnitude difference in terms of the economics of managing information.
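As a rough check on the figures quoted above, the arithmetic can be sketched in a few lines of Python. The prices are the ones mentioned in the session; the three-year horizon is an assumption added purely for illustration, and the sketch ignores the staffing and operational costs of running Hadoop.

```python
import math

# Figures quoted in the session, per terabyte of data.
MPP_LIST_PRICE = 100_000           # USD, one-time list price
MPP_MAINTENANCE_PER_YEAR = 20_000  # USD per year
HADOOP_NODE_PRICE = 2_000          # USD, street price of a node handling about 1 TB


def cost_ratio(years: int) -> float:
    """Ratio of MPP cost to Hadoop node cost, including `years` of MPP maintenance."""
    mpp_total = MPP_LIST_PRICE + MPP_MAINTENANCE_PER_YEAR * years
    return mpp_total / HADOOP_NODE_PRICE


if __name__ == "__main__":
    for years in (0, 3):  # the 3-year horizon is an illustrative assumption
        ratio = cost_ratio(years)
        magnitude = math.log10(ratio)
        print(f"{years} years of maintenance: {ratio:.0f}x ({magnitude:.1f} orders of magnitude)")
```

On list price alone the ratio is 50x; with three years of maintenance included it is 80x, which approaches the two orders of magnitude described.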

Mike Ferguson 03:03

Even if there are benefits from an economic perspective on the sheer cost of hardware, what about the fact that we’re now dealing with much lower-level APIs? That requires higher skill sets, which could slow down development and drive your people costs up. Would you not balance it out that way?

David Menninger 03:24

It’s absolutely true. I think Hadoop is free like a puppy: you get to take it home, and feed it, and take it to the vet. So there’s plenty of cost associated with working with Hadoop, and that’s one of the challenges. There aren’t as many skilled Hadoop resources in the market as there are skilled SQL resources. We actually think that those two worlds will not remain independent, that those two worlds will merge over time, but that’s one of the reasons that Hadoop does require that you work at a much lower level.

Mike Ferguson 03:51

Do you see Hadoop being used as a data staging area, refinery, exploratory analytics? What do you think the role of it is here? Do you think it’s going to be used to offload the whole data warehousing set up onto that platform?

David Menninger 04:10

I’m not sure we would yet see offloading of data warehousing. We see Hadoop more often being injected prior to the data warehouse. It’s a data landing zone where information collected from around the organization can be put into Hadoop and can then feed those data warehouse systems. I’m not a big fan of rip and replace. Certainly in greenfield opportunities, you might see Hadoop emerging first, but what we see often is this consolidation of information into a data landing zone. One of the things that Hadoop provides is the ability to do late binding of your schema, or schema at query. You don’t have to structure your information first. Collect it, and then when you want to analyze it, perhaps more rapidly than putting it into a data warehouse, you can do that right off of the landing zone. We see it as a consolidation point.
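The idea of late binding, or schema at query, can be illustrated with a minimal Python sketch. This is plain Python rather than any Hadoop API, and the sample records, field names, and the `read_with_schema` helper are all hypothetical: raw lines are stored exactly as they arrive, and a schema is applied only when the data is read.

```python
import json

# Raw events as they might arrive in a landing zone: nothing is validated on write.
RAW_LINES = [
    '{"user": "a", "amount": "12.50", "country": "DE"}',
    '{"user": "b", "amount": "3", "channel": "web"}',
    'not even json',
]


def read_with_schema(lines, schema):
    """Apply a schema at query time: pick fields, cast types, skip unreadable records."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed input is skipped at read time, not rejected at load time
        if not isinstance(record, dict):
            continue
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        yield row


# Two different queries bind two different schemas to the same stored data.
spend = list(read_with_schema(RAW_LINES, {"user": str, "amount": float}))
geo = list(read_with_schema(RAW_LINES, {"user": str, "country": str}))
```

The same stored lines serve both queries; the trade-off is that type errors and missing fields surface when the data is read rather than when it is loaded.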

Mike Ferguson 05:02

So you don’t have to design a data model before you load the data, it’s just loaded in there, any old schema will do?

David Menninger 05:09


Mike Ferguson 05:09

Then navigate it at the point you want to access it. If I have to prepare data and integrate it, which I have to do for analytics, whether it’s big data analytics or any other kind of analytics, then I’m into writing Java MapReduce applications, or am I? Clearly, is that not a black box? In data warehousing, when I got involved in that space 20 years ago, we started out writing our own code. Then we had the emergence of data integration software to come and help us get far more productive. Is that the same cycle you think we’re going to see here, in the sense that if we’re just writing Java MapReduce code, nobody has a clue what’s going on in there, and there’s no metadata lineage or any of that? How do you see that?

David Menninger 06:01

That’s why we think these worlds of SQL and Hadoop are going to merge. Yes, lots of Hadoop work is done in Java MapReduce and there’s lots of valuable things you can do there. I’m sure many of you have struggled with how to write certain types of analytics in SQL. It’s not a procedural approach, and so you’re limited by set theory. You have to figure out how to make it work. There are things you can do in Java MapReduce, but the billions of dollars that have been invested in the SQL world are not going to go away. We have that body of skill, and knowledge, and investment. We see those two worlds coming together, so we’ve actually taken one approach where we do SQL processing directly on top of HDFS. It uses HDFS, the Hadoop Distributed File System, as the underlying storage infrastructure, and so you now have two access points. You can access the information via SQL, or via MapReduce. It lets you do those specialized analyses and then perform SQL operations once you get that information.
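To make the contrast between the two access styles concrete, here is a small sketch that computes the same aggregate twice: once as explicit map, shuffle, and reduce steps, and once as a SQL query. It uses plain Python and the standard-library `sqlite3` module as stand-ins; it is not Hadoop’s Java MapReduce API or Pivotal’s SQL engine, and the sample data and function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (country, sale amount).
SALES = [("uk", 10), ("de", 5), ("uk", 7), ("fr", 2), ("de", 1)]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for country, amount in records:
        yield country, amount


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def totals_mapreduce(records):
    """Procedural style: the developer spells out each processing step."""
    return reduce_phase(shuffle(map_phase(records)))


def totals_sql(records):
    """Declarative style: the same aggregation stated as a SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (country TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT country, SUM(amount) FROM sales GROUP BY country"
    ).fetchall()
    conn.close()
    return dict(rows)
```

Both functions return the same totals per country. The SQL version leaves the execution strategy to the engine, while the MapReduce-style version leaves room for arbitrary procedural logic in each phase.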

Mike Ferguson 07:05

There’s no optimizer here, right? Let’s get something clear: it’s not as if a magical relational database just appeared in HDFS; it’s a distributed file system. From a development perspective, is there not a long way to go around SQL on Hadoop right now?

David Menninger 07:23

No, we’ve actually taken the ten years of investment we’ve made in SQL MPP processing and placed that on top of HDFS. The whole query plan execution and the optimizer are all ported over and run on top of HDFS.

Mike Ferguson 07:38

So that’s the case in Pivotal, not necessarily the case in everybody else, right?

David Menninger 07:42


Mike Ferguson 07:43

You said Hadoop’s going to become a landing zone. Do you see people taking their existing ETL investment and moving their entire ETL processing over there?

David Menninger 07:59

We do see, already, the ETL vendors and others bringing their processing to Hadoop. Whether they’re open source vendors or proprietary vendors, they’ve all adopted Hadoop as a data source and a data target. Those landing zones I talked about earlier, we see those same commercial tools being used to feed from the landing zone out to the warehouses and data marts that exist. So yes, I think we will see those products and approaches.

Mike Ferguson 08:26

That would mean, therefore, that there’s a little bit of help here for a data scientist. He doesn’t have to parallelize the data processing himself; he could potentially leverage the existing data integration software to exploit the power of Hadoop to process data that’s on there.

David Menninger 08:40

Absolutely. Obviously one of the challenges is, how do you make this information available to the various audiences that need to work with it?

Mike Ferguson 08:48

Right. At the same time, things like master data-management, you probably wouldn’t envisage it happening on a Hadoop platform?

David Menninger 08:58

Well, it depends on your approach. Master data-management, like any other tool, would be adapted to operate on top of the Hadoop infrastructure. The Hadoop infrastructure will become invisible over time. It will be part of the infrastructure, but you don’t look underneath the covers when you’re running a database to see what the underlying storage is. We’ll see those things get there; they’re not there yet today.

Mike Ferguson 09:21

Right. Going back to the title of this session, do you think that Hadoop will become the dominant data-management platform, or is it just another piece in a bigger ecosystem?

David Menninger 09:30

We think it becomes dominant. It becomes dominant because of the expanded capabilities that we have, both in terms of data volume, and in terms of types of data that can be processed. And we see various other pieces being adapted to work with HDFS. We do see the HDFS file system storage infrastructure becoming the common dominant platform.

Mike Ferguson 09:53

In that case, what we’re seeing is the vast majority of data going in there, multi-structured and in some cases structured, with data integration software potentially expanding to manage that environment as well as existing data warehouse environments. And the role of SQL is potentially growing to bridge over on top of the Hadoop world as well, is that right?

David Menninger 10:28


Mike Ferguson 10:29

Let me ask a crazy question then. Is relational going to swallow Hadoop in that case?

David Menninger 10:39

It depends on your definition. I believe that we’re going to have to leverage the body of investment we’ve made in the SQL world. We have millions of people trained and knowledgeable in the SQL world, and probably thousands who have Hadoop skills, so those worlds are going to come together. I would say it would encompass Hadoop. It’ll continue to be able to exist independently, but we’ll also be able to leverage it through that set of investments that we’ve already made.

Mike Ferguson 11:14

What about the form factor here? Clearly there’s a lot of Cloud vendors, I noticed on the way in this morning, exhibiting here. How do you see Hadoop, as a data-management platform, being deployed? Is it going to be on appliances, a lot of which are being announced at the moment? Is it going to be on the Cloud? Where do you see that?

David Menninger 11:35

We would see choice of deployment as really resting with the organizations, but we do see Cloud becoming a more common choice of deployment. Appliances are still very popular. We see our customers choosing appliances both for Hadoop implementations and for more traditional implementations. The Cloud and appliances offer the same benefit. It’s that ease of start-up, less configuration, ability to get up and running much more quickly. So yes, I think we’ll see Cloud and appliances continue, with a trend toward more and more Cloud deployments.

Mike Ferguson 12:13

I guess that would fit with short-lived data like sentiment, for example, and Twitter streams; those things are constantly changing. Therefore, if you’re going to analyze it, you’re going to maybe take a snapshot of that data, load it up there, produce some insights from it, discard it, and bring in the next one sometime further on. So Cloud’s useful for short-lived analytical projects on big data, would you say?

David Menninger 12:41

Sure. Short-lived projects and expansion to deal with bursts of activity are both common scenarios for Cloud deployments.

Mike Ferguson 12:50

Does that mean that data integration software also has to reach into the Cloud in that case?

David Menninger 12:55

Absolutely. As a source, as a target, and processing in the Cloud. All those are becoming requirements.

Mike Ferguson 13:04

What about something that’s probably a little closer to the heart of Europeans than it might be stateside, which is data privacy? Certainly I work all over Europe, and when I go into places like Germany it’s seriously high on the agenda. How do you see that panning out if we’re just throwing data around on all these platforms?

David Menninger 13:26

Well, there are a couple of issues. Certainly the privacy issues are probably exacerbated in Hadoop. The infrastructure to protect data is not yet as robust as it is in relational database systems, and the notion of having distributed data is also an issue. I think Hadoop does potentially provide an ability to federate data more easily and bring that data together from multiple locations, allowing you to keep it in country when necessary.

Mike Ferguson 13:54

Again, in that regard, data management software has further to go – if you’d like – to cover all bases in the Hadoop world, not just in the structured database.

David Menninger 14:07

Yeah, I would say the journey’s just beginning, not ending.

Mike Ferguson 14:10

What about this overall trend around SQL? Clearly at the moment, SQL is probably the only API in town with regards to any dominant standard that’s out there. We all know that parallelizing queries has taken 30 years in some cases for some of the MPP database vendors. Is it just a case of switching it all on and it works on Hadoop, when it’s a file system?

David Menninger 14:52

Well, yes and no. There are some fundamental limitations to Hadoop. Hadoop is append-only, so you can’t update data; you can only add to it. Right now, the SQL processing is Insert and Select. There’s no Update or Delete.

Mike Ferguson 15:07

It’s batch processing.

David Menninger 15:09

And all those Insert and Select statements are exactly the same in both environments. There’s work to do in terms of updating and deleting.
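The append-only constraint described above can be sketched with a toy Python class that supports only insert and select. The `select_latest` method shows one common way to emulate an update at read time, by letting the most recently appended row for a key win; that workaround is an illustrative assumption and not something described in the session, and the class and method names are hypothetical.

```python
class AppendOnlyTable:
    """Toy table that supports only insert and select, like an append-only file."""

    def __init__(self):
        self._rows = []  # rows are only ever appended, never modified in place

    def insert(self, key, value):
        """Append a new row tagged with its position in the write sequence."""
        self._rows.append((len(self._rows), key, value))

    def select_all(self):
        """Return every row ever written, including superseded versions."""
        return list(self._rows)

    def select_latest(self):
        """Emulate UPDATE at read time: the last row written for a key wins."""
        latest = {}
        for _, key, value in self._rows:
            latest[key] = value
        return latest


table = AppendOnlyTable()
table.insert("a", 1)
table.insert("b", 2)
table.insert("a", 3)  # a newer version of "a"; the old row is still stored
```

Every version stays on disk, so storage grows with each change, and reads must do extra work to find the current value. That is part of why in-place Update and Delete remain open work.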

Mike Ferguson 15:19

I think as well, from my experience, there is work to be done on certain kinds of queries, like self-joins and correlated sub-queries, which are not easy to parallelize on structured database management systems and are pretty damn hard to parallelize on Hadoop as well. Let me just ask the audience, how many of you are already starting to do data integration on a Hadoop platform? For those of you who are, put your hands up. Anybody? One, a sole brave man. I think that says very much that the experience in the audience so far is pretty early days with regards to all of this. It’s interesting. London’s a great place for big data. The big data meetup going on here in London is the biggest in Europe, and in fact it’s happening tonight. Certainly, for me, it’s a very exciting area going forward. I think big corporates are now looking to extend their use of big data technologies to deepen their insight into different aspects of their business, whether it’s customer insight in order to understand what’s happening online, or understanding what’s happening perhaps in operations with machine learning data that’s generated at high volumes. Are you at Pivotal seeing any specific verticals adopting Hadoop as a data management platform sooner than others?

David Menninger 16:59

Well, there are certainly industries where it’s relevant. We could spend another session on that one, but I think we’re probably out of time to talk about that today. There’s plenty of use cases we can share, but why don’t we plan to do that some other time?

Mike Ferguson 17:12

Okay, sounds good. Well, if you have any questions for either David or myself, we’ll be here during the breaks for most of the morning. Thanks again.

