Big data is still hard, but it gets better


What’s standing between your staff and big data analysis? That was the existential question posed to DJ Patil and Jeff Hammerbacher at the GigaOM Structure:Data event today in New York. The two had different takes on how easy it is to give people the power to use data, with Hammerbacher, the co-founder of Cloudera, saying that it’s pretty simple today.

He did say that many aspects of the input and ingress of data will end up being automated, much as the systems administrators responsible for running the data center have seen many of their tasks automated.

Patil, who is now a data scientist in residence at Greylock Partners, was a bit more focused on end users. He described a visit to a nonprofit earlier today, and said that people there had plenty of curiosity and a desire to play with data and ask questions, but they didn’t always know what to ask to get the insights they seemed to want. “We need another layer to help those people figure out what they want to ask,” he said.

From Patil’s perspective, we need tools that help us tell stories with data and let people play with it in ways that lead them to new conclusions or reveal new relationships. “This is less of a machine learning problem than a ‘Can I try a bunch of things with the data?’ kind of problem,” said Patil.

And for those who are still intimidated by playing around with big data, Patil has this to say: “Most people doing sophisticated analysis, they don’t really know what they are doing.”

Check out the rest of our Structure:Data 2013 coverage.

A transcription of the session video follows.

Session Name: Big Data Bottlenecks: What we need to be aware of.


S1 Derrick Harris S2 Jeff Hammerbacher S3 DJ Patil



Alright, so if you guys were here this morning and you saw Sean Gourley from Quid talk, you might recognize these two from his presentation. These are the two guys who coined the term data scientist, right? And they both have data scientist in their title, I think, or something like that. So before we talk about Big Data Bottlenecks, I thought I would just reference the distinction between data science and data intelligence that Sean brought up before, and maybe you want to defend the realm of data science versus data intelligence. And I know, Jeff, obviously you’re also famous for the–the best minds of my generation are trying to make people click on ads, right? So, maybe you have a thought on that.

JEFF H 00:49

I guess I’m missing the context. What is data intelligence referring to?


It was like solving bigger problems, right, with messier data and harder things, and not so much web and ads and…

JEFF H 01:03

Yeah. Sure, I think the tools and techniques that have been developed in the web space certainly have applicability outside of the web domain, so… And I think that’s well within the scope of data science as I’d personally define it. [laughter]

DJ PATIL 01:17

For me, I think this kind of notion of labeling is a little bit arbitrary. I mean, I think you are the one that actually told it to me first. It was like this whole notion that part of this is, how do you start taking back data from the high priesthood of the data warehouse, where these guys have kept stuff locked up? And as you start to get to the point of what people are supposed to do with data, I think there’s a level of increasing sophistication. And what really matters at the end of the day are going to be the tools that you have available to go to war against the data problem, and the people who are going to be versatile in using those to actually make stuff happen. And if it’s intelligence versus science, I don’t know; if it was called intelligence first, then you’d have said, well, now it’s science. So, I would take whatever term is not taken.


Well, that actually, I think, is kind of a good setup for the topic of this discussion, right? Which is Big Data Bottlenecks and the idea of freeing data from, let’s say, the high priesthood, or making it more available for more people to do stuff with and to put their ideas into and take action on their ideas. I want to talk about–so, I think–we had a little call last week and chatted about some of this, and one of the big things I think is important to realize is, we hear about the data scientist shortage, right? And big data’s stressing the HR departments of companies. But it sounds like, I mean, maybe that’s not the case so much, or it sounds like maybe it’s not the knowing how to do data science that’s a problem anymore, it’s maybe–what are the challenges? If the knowledge and the tools aren’t so siloed away anymore, I mean, what are the challenges now?

JEFF H 03:14

Yeah. I think on the call we talked about motivation as being one of the big challenges. In other words, it’s one thing to equip someone with the skills to be able to ask questions, and the software to facilitate that process, but it’s another thing to get them interested in the problem domain in the first place, and to get the people who will be the beneficiaries of the work they do, to ask and answer questions, to realize that they need someone like this, so… I can’t actually remember the solution I proposed for increasing the motivation. [chuckles] I don’t know if you have an answer, but yeah, that’s one area that I felt was important. Then we also talked about ways of building software tools that can enhance the capabilities of these individuals. DJ likes to use the term giving people a super power, and so that’s one thing. You think about the human augmentation, the J.C. [inaudible] perspective. We take a look at, in the [inaudible], you hear guys often speak in this language. You take a look at what a data analyst is doing on a day-to-day basis, and then, how can I build a software tool to allow them to do that faster, or better? So, I think we’re in very early days in the software tool set that a data scientist has to accomplish their tasks, versus the tool set that a software engineer, for example, has to accomplish their tasks.


I mean how do you–so you work for a company that sells Hadoop. I mean, where does that fall into the–as you’re looking at software tools? Because that’s still kind of a very hard and difficult thing to–

JEFF H 04:47

I don’t know, man, have you tried Cloudera Manager, Derrick? [laughter] I find it quite easy now to install and configure clusters. So, I do think we’ve had five years of software development at this point at Cloudera. So, we’ve managed to sand down most of the sharp edges, and I do think there’s a perpetuation of this myth that it’s hard. But I’ve installed Oracle RAC by hand, and I’ve used Cloudera Manager to install Hadoop, and I’ll let you be the judge of which one is easier. I find it pretty trivial to get up and running with Hadoop. But I think there are additional layers of expertise required to extract knowledge from Hadoop. So, okay, great, it’s really easy to install and configure a Hadoop cluster today, but the most recent product that we released, Cloudera Navigator, is moving one layer up the stack, so… Cloudera Manager is for the systems administrators to deploy, configure and meet the SLAs for the infrastructure. And Cloudera Navigator is for the data steward who’s working on data governance challenges. So, we’re doing things like capturing all the history, capturing lineage of the queries, so that you can trace how a data set gets turned from its raw data into its cooked, sort of reportable form. And then there is a whole set of expertise above the data steward. So, people who are building charts and reports, people who are doing modeling, people who are managing the deployment and refresh of breaks and models deployed in production. So, I think at Cloudera we’re very much at the base of a very large pyramid of tools. And we rely on ETL vendors like Informatica or Talend or Pentaho, and BI vendors like Tableau or QlikTech, and analytics, like I said.


So, you said data steward. Just tell me what is a data steward? I think that’s a new term for a lot of people. What’s the–

JEFF H 06:25

Yeah. A data steward, I think, is someone who is really an expert in what data we have accessible within this organization. What are the processes that are collecting into our data warehouse? What are all of the various look-up tables? And what are sort of the historical quirks, like, “Oh, on September 29th the ETL failed, so there are going to be zeroes in all of those columns.” Who are the powerful end users of that table, so if I have a question about this area of the data warehouse, who do I ask? The data steward is often sort of a subject matter expert or data governance expert within the organization. So, if you look inside of Bing, for example, the Microsoft team, they have these data stewards, where they have a giant Cosmos cluster collecting hundreds of petabytes of data for analysis. And then for each line of business unit they have these data stewards who really serve as the interface between the Cosmos infrastructure and analysts downstream. And they have that expertise in what data is being collected and how it’s being used.


Alright, sounds almost hierarchical. Like an organizational approach to solving this problem [inaudible] and more.

JEFF H 07:26

Yeah. But in the same way that we encode the best practices of system administrators in our software tool, Cloudera Manager, our goal is to encode the best practices of the data steward into Cloudera Navigator. So, there are all different kinds of things that system administrators are doing on a day-to-day basis that are being automated by software. And you can do the same thing with the data steward.


Alright, DJ, this idea of building tools, let’s say, that make it easier for people to do stuff with data. I mean, when you look at–so you’re at a VC firm right now, and I mean you’ve done this stuff at LinkedIn before. I mean, what kind of tools do you really need to empower, let’s say, a non-data analyst, even, to really get used–to really be able to deliver value with data?

DJ PATIL 08:09

Yeah. I think it almost goes–maybe one way to say it, or put it in context, is–I spent this morning at an organization, and it’s the largest not-for-profit for teens. And it embodies this thing: they have these people with incredible curiosity inside the organization who are asking phenomenal questions of data and the interactions they have with teens. And then they’re stuck, they just can’t get going anymore. And then you sort of look at their stack of what things are on, and it’s kind of a lot of these traditional technologies from five, ten years ago. And we’re looking at how they’re trying to architect the things, and with a bunch of the questions you just realize, hey, we don’t actually know what question we’re going to ask. So, how do we actually even have an underlying layer, like a Cloudera-style system, to actually give us a place to start asking things?



DJ PATIL 09:03

And then once we have that layer, then comes the question of, okay, how do we make this easier for the broader organization to look at? And whether that’s just straight business intelligence style dashboards, to being able to ask very simple queries, to being able to slice and cut data in unique ways, I think one of the big challenges that we found is, how do we enable the art of storytelling with data? How do we make everything from the graph, the picture, the analysis? The ability for some–how many times have you actually seen a graph that you get and you’re just thinking, hey, can we just refresh that? Or could we add another line or another data element to that? And it just turns out to be incredibly hard to do. That’s where I think the tooling needs to get to, to enable us to create a faster cadence in working with the data. Things like Impala, these layers are really critical first steps. And we need to continue to build on top of those things to make it go even further and faster.


Alright, but how much of it is technological, and then technological in a software engineering sense, versus technological in a, I don’t know how to do–I don’t speak SQL, I don’t know how to use machine learning libraries, I don’t know how to do that. I mean, how much of it is technological? How much of it is, let’s say, a knowledge or a math problem?

DJ PATIL 10:26

For me, it’s actually–and I’m probably more of a dissenting or minority voice on this–I think it’s actually less of a machine learning problem and more just a straight, hey, can I try a bunch of things with the data? And if you don’t actually know what the algorithm does, but something interesting starts to come out of it, it gives you incentive to learn more about the algorithm. Like, most people who do something with numerical analysis, with [inaudible] values or something very sophisticated, they don’t really know what they’re doing, and you need a true numerical analyst to get into the real nitty gritty of the details to understand when the problem gets hard. But at least it’s motivation for you to start learning and trying more things. Then if we tried more things, at least it would give people motivation for why they should use more sophisticated things over time.


I don’t know if that marries up with your experience?

JEFF H 11:14

Yeah. Sorry, I was thinking about what I was going to say, so… [laughter] I didn’t hear what you said, [inaudible]. So, now let me move into that of what he was saying.


ETL issue.

JEFF H 11:26

Yeah. I’m not sure that that’s how they do the segues on TV, but… But in any case, so, yeah. Let me say something that I think about. [laughter] So, I think there’s this big missing piece–sorry, I had this idea and then I just started deterring about it.


Go forth.

JEFF H 11:42

Yeah. Let me serialize into a vocalization. So, there’s this big missing piece in, like, a statistics education. I don’t even know if it’s statistics; a scientist’s education, of just the high-level overview of what is the scientific method, and what tools do we have at our disposal today, both from a theoretical perspective as well as a software perspective, to attempt to extract knowledge from–it’s almost starting from epistemology and philosophy of science. And then looking at what questions can we pose of the universe to which we–and then what kinds of answers can we expect to get using the tools of modern statistics. So, you know, you kind of jump right into: this is what a random variable is, this is what a hypothesis test is. You don’t have that higher-level view. So, what I think the big missing piece for a lot of people is, is not the level of sophistication of their mathematics background, but it’s more about being able to step back and understand, what is science? What are we actually able to say about the universe with the tools of the modern scientific method?

DJ PATIL 12:44

Josh Alman and I gave a talk where we talked about what’s called the data scientific method. And so you form a hypothesis, you create a bunch of tests, you look at the data, you make an assessment. And everyone was saying, “Wow, this is amazing, it’s a great insight.” We’re like, it’s just the scientific method.


That sounds–

DJ PATIL 12:59

We put data in front of it, and somehow that makes it appealing. And I think the point–the takeaway I have with that is that if, like Jeff was saying, we’re able to really empower people to actually apply the scientific method and increase the cadence at which you can do that, on a problem or a set of problems, that’s the type of tooling, that’s the super power that I’d like to say we need to give the business analysts to turn them into data scientists, or whatever we want to call that hierarchy of people.


Do you have to start with a business analyst, or how soon–I mean, you’re talking about education, scientific education. I mean, is that something where you can take an elementary school student and somehow enable that curiosity to ask a question, get an answer, right? And then maybe that blossoms into, now I want to learn about what actually is underneath this, right?

JEFF H 13:49

Well that’s what a science fair is, right? I mean, kids are competing in that from elementary school on. So, there’s sort of this progressive refinement that happens where the tools that you have to create knowledge are pretty blunt when you’re in elementary school.



JEFF H 14:05

But you’re still being taught sort of the harness within which you can apply those tools. And then over time you sort of progressively refine the power and the specificity of the tools. So, I think there’s–yeah, I mean, you can sort of start in elementary school.


I think maybe a science fair project isn’t how a potato conducts electricity. But, it’s like how–

JEFF H 14:26

It’s like–


–how Google–how a web site is targeting ads, or whatever, so they’re relatable–

JEFF H 14:30

Well maybe–


–real sort–

JEFF H 14:31

So, we’re the example–



JEFF H 14:32

–I remember doing a science fair in first or second grade in which I put a piece of bread in the cupboard and then I put a piece of bread on my counter. And I let them sit for two weeks. And then I saw which one grew mold faster. And the point is, well, the cupboard is dark, and so that’s an environment and a context that facilitates the growth of mold. But the same question can be asked at a much more granular level, in which you start talking about chemical properties of the bread, and you’re doing much more fine-grained measurements of the bread in terms… So, the same exact question could be asked by a working scientist today. I’m sure there is someone in this world who studies how mold grows on bread. So, even in first or second grade I could ask a question that probably still has significance to scientists at the cutting edge of research today, so… significant.

DJ PATIL 15:19

Hey, you just wrote about this, Derrick, which is this idea of, you’ve got your [inaudible], you’ve got your op or whatever data you have. We’re sitting there, we wear all of this stuff, and do we really actually understand what’s happening? How are we applying it to our own selves? How do we get ourselves to A/B test or try different things? The same way, I think, as we get more sensors in the house, or the home–I know at least my kids would love to play–if we had all of the data coming off different things, we could construct all sorts of different tests and experiments. We just go around trying to figure out these plant sensors for the pH, and trying to see what’s growing. So, I think there’s this idea that’s becoming more natural in the educational process, of being able to actually look at something, measure something, and having something a little more concrete than just sort of putting something in the bread, like when we did it–


DJ PATIL 16:05

–as kids. To say what’s actually happening. That actually accelerates the love of learning.


Well, we’re going to stick on, I forget the exact acronym, massively open online courses, or is it the Courseras, the advances that are accessing the world? I mean, what kind of–what role does that play? Because we’ve seen these people, the young college students even, start taking these classes, then start winning Kaggle competitions. And that’s this whole–it seems like that’s a real good answer to addressing some of these problems, right? I mean, some of these knowledge problems that we see on the education front.

DJ PATIL 16:41

I mean, I’m a huge fan of it. I guess one of the things I was very fortunate to do when I spent some time in the government was work with a number of really great people in Iraq, and we built something which is called the Iraqi Virtual Science Library, which is now the largest online education platform in Iraq. And it’s proven the wonder of being able to actually move learning across difficult regions, time zones and all of that stuff. And I think it’s just the very beginning of what we’re seeing around the world of how people do stuff and share things. I think it’s–I’m a huge fan.

JEFF H 17:20

Yeah. I think there’s a lot of value in massively online courses enabling anybody to shove knowledge into their brains with lower friction. But there are still a lot of issues around the creation of curriculum. There’s the design of the exercises that one undergoes in the online course, there’s motivating students to actually both sign up for and complete the course. There’s creating kind of the larger path within which the course exists in the mind of the student. There’s the mentorship that occurs. So, they’re one part of a very large question. There’s actually a really interesting article that I read today at lunch about sort of massively online courses that are engaged in regulatory capture right now, so… There are a lot of downsides, I think, to what they’re doing. But overall, I mean, the course that I taught at Berkeley, I paid out of my own pocket to capture the videos and put them online for free. So, I think there’s tremendous value in distributing the content and create.


I mean, if you’re looking to hire someone, let’s say, and that’s a resume point–this is probably the last question, actually–is that the point where you say, is that enough to say, “Oh, this guy did really well in this Coursera course,” right? So therefore maybe I hire him? Or do you really still need a degree or some real, you know, hard-core experience in this stuff?

JEFF H 18:34

It’s a signal of motivation more than any–I don’t look at it as a signal of competence. I look at it as a signal of motivation. I often see resumes of people who are engineers and would like to do data science. And they’ll say, I worked my way through the Andrew Ng online course. So, it lets me know that that’s what they’re thinking about. But I’m certainly not–I’m much more interested in digging into their level of competence. I don’t take it as any sort of signifier of competence.

DJ PATIL 18:57

Yeah. The only thing I would add to that is, all of that, and what I’m predominantly looking for is a level of curiosity and just an intensity and tenacity of attacking the problem, and then using that as motivation to learn more things to attack the space.


Okay, cool. Listen, you guys, it’s been enlightening. I could go on forever, but our time is up, so… Thanks a lot.

JEFF H 19:17

Thank you.

[applause] [music]


Thank you, Derrick. Alright, we’re going to move on from that into a talk on the Missing Manual for Data Science: remix, reuse, and reproduce it…
