Pursuing big data utopia: What realtime interactive analytics could mean to you

Ashok Srivastava (Trident Capital, Verizon), Silvius Rus (Quantcast), Todd Papaioannou (Continuuity), Bhaskar Ghosh (LinkedIn), Michael Driscoll (Metamarkets). Structure Data 2013

Session Name: The Guru Panel: What’s In Your Stack?

Speakers: Announcer, Michael Driscoll, Bhaskar Ghosh, Todd Papaioannou, Silvius Rus, Ashok Srivastava


All right, we’ve got the Guru Panel up next and they’re going to be talking about What’s In Your Stack? It’s going to be moderated by Michael Driscoll, the CEO of Metamarkets. He’s going to be talking with Bhaskar Ghosh, the Senior Director of Engineering and Data Infrastructure at LinkedIn, Todd Papaioannou, founder and CEO of Continuuity, Silvius Rus, Director of Big Data Platforms at Quantcast, and Ashok Srivastava, the Chief Data Scientist at Verizon and Venture Advisor for Trident Capital. Please welcome our next panel.


Thanks everyone for joining us. I’m very excited to have our distinguished panel here; as mentioned in the program, this is one of the most valuable panels we sit down with. So, thank you Om, or whoever wrote that about this great group. You’ve heard everyone introduced, so rather than go through and re-introduce everyone, I thought we would just dive into it. I’ll set the stage a little bit. As we were talking in the green room, I think we were trying to put ourselves in the shoes of our audience and understand why are you guys here at GigaOM Structure Data? What are some of the decisions that you guys are wrestling with? I think what we have here are folks who have had the opportunity to build big data stacks at places like Yahoo and LinkedIn, and will be building big data stacks at places like Verizon, and we wanted to share with you some of the key criteria for decisions we’ve made about the stacks that we’ve created. I thought I would kick it off and say one of the big things that I think we struggle with in this new era of big data is a ‘buy versus build’ decision. We’ve heard from a lot of vendors here at GigaOM Structure, we know that there are a lot of offerings out there, and there’s some maturity, but there’s also some immaturity. I thought I would start and go down the panel here and talk to Todd, Silvius, Ashok, as well as Bhaskar and ask you, from your perspectives, what do developers want? What do internal ops people want? What do data scientists want? And finally, what do engineering managers want? So, maybe we’ll start, Ashok, with you, as the incoming Chief Data Scientist with Verizon. When you think about the big data stack, when you think about the constituents at Verizon, you are the head data scientist, so you’re the king of that group, what is it that data scientists want from a big data infrastructure? What is it that you want your team to have? What are some of the criteria you look for in that stack?


Well, first of all, thanks for having me here, it’s a pleasure. We’re looking at necessarily a massive amount of data; between 500 terabytes and 1,000 terabytes of data generated from our systems per day. What that’s going to require is the ability to do real time analytics. So, taking that data straight off those data systems and allowing it to come into a repository that enables the use of machine learning techniques, as well as advanced analytics techniques. We’re going to be addressing a number of different constituencies, including customers within Verizon, but also external customers. So, it’s going to be monetizing that data and generating new analytical capabilities that are really pushing beyond what you would normally find in traditional reporting infrastructure, even a traditional machine learning structure, because it’s going to be all of those things combined in order to address a critical problem.


How many data scientists are currently employed today and will be employed at Verizon?


Well, right now, I think there is one. I think we have another person in our team, so there’s two. Then we’re going to be scaling it out to about 20 in total. Not just data scientists though, because as you know, we need to have data science people, operations people, infrastructure people, as well as a strong product management team because those are the people who are going to be articulating what we need to build and how we’re going to build it for our consumer base.


So, maybe if we picture this as a stack where you’ve got the end business users who are talking to the data scientists and the data scientists are talking to the developers and the developers are talking to some of the core infrastructure folks and on down. So, maybe to take it one layer down on the stack, Todd, you’ve seen stacks built at Yahoo. You were at Teradata before that, now you’re at Continuuity, and you’re really focused at Continuuity on building tools for developers, you want to make this easier for all of us to get value out of all of this data we’re collecting. Tell us a little bit about what you think developers want from all of these tools that are out there. What are they looking for? What are the criteria for the big data stack?


I think developers are looking for simple tools, simple APIs to become productive and actually build applications quickly. Right now the state of the art in the system is dealing with low-level infrastructure components, and most developers are not going to sit down and suddenly become PhDs in distributed systems, are not going to become PhDs in massive-scale networking, but they’re very, very good at building business logic and middle tier applications. What they want is to be able to use the power of the infrastructure without actually having to really worry about the internals of the infrastructure.


Could you maybe give an example of today, without Continuuity or existing tools, what is today too hard to do as a developer with your big data? What, in your experience, have you seen that just takes too long?


Well just think about what are the simple things that you want to do with your data? You want to be able to get your data, you want to be able to process it, you want to be able to share it, you want to be able to query it, right? Right now you’re going to end up with maybe three or four systems. Just for data ingestion on any kind of scale, you have one system running, probably some real time streaming system; could be Storm, could be [unclear], could be some data pipeline thing that you’ve built yourself. You’re going to try to land that off into a processor on top of that. It’s a completely different skill set now for building MapReduce jobs, which by the way is low-level arcanery. To serve it from your website, you’re going to have to ship it out of that and you’re probably going to stick it into some serving tier. It could be an in-memory database, probably a traditional RDBMS, or some key-value store. This entire set of different components in your application stack, they require different skills from the developers and actually different hardware footprints in the large part to make them work. That’s BS. People don’t want to do that. They want to just get value out of their data.


Interesting. So, I guess I’ll go one layer down to Silvius at Quantcast. You were telling us in the green room a little bit about how you made the decision at Quantcast to rewrite HDFS. You guys have something called QFS, which I said stands for Quality Distributed File System, and you said, no, it’s Quantcast, Q for Quantcast. So, you were talking about the need for efficiency, how there are times that you guys obviously looked for open source offerings and you made some decisions that you need to rewrite some of that internally. I’ll ask in a different way, which is, if you were starting over today at Quantcast and you came in with a budget, what things would you build and what things would you buy? What are some of the criteria and things you’re looking for, and would you do everything over as you did?


So, we probably would for the file system. We wrote QFS, Quantcast File System, because HDFS uses too much space. So, with QFS, you get an HDFS-like file system that uses just half the space and it gives you better fault tolerance. It sounds too good to be true, but it is.
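The “half the space, better fault tolerance” claim can be checked with a little arithmetic. The sketch below is not from the panel: it assumes the publicly documented defaults of 3-way replication for HDFS and Reed-Solomon 6+3 erasure coding for QFS, and the function names and the 100 TB figure are purely illustrative.

```python
def replication_cost(copies: int) -> tuple[float, int]:
    """Return (storage multiplier, number of lost copies survivable)."""
    return float(copies), copies - 1


def erasure_coding_cost(data_stripes: int, parity_stripes: int) -> tuple[float, int]:
    """Return (storage multiplier, number of lost stripes survivable)."""
    multiplier = (data_stripes + parity_stripes) / data_stripes
    return multiplier, parity_stripes


raw_tb = 100  # illustrative amount of user data, in terabytes

# Assumption: HDFS default of 3 full copies of every block.
hdfs_multiplier, hdfs_losses = replication_cost(3)
# Assumption: QFS default of 6 data stripes plus 3 parity stripes.
qfs_multiplier, qfs_losses = erasure_coding_cost(6, 3)

hdfs_tb = raw_tb * hdfs_multiplier  # disk needed under replication
qfs_tb = raw_tb * qfs_multiplier    # disk needed under erasure coding
space_ratio = qfs_tb / hdfs_tb      # fraction of the replicated footprint
```

Under those assumptions the erasure-coded layout needs 1.5x the raw data instead of 3x, and it survives the loss of any three stripes where triple replication survives the loss of two copies.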


You’ve open-sourced QFS, correct?


Thank you for reminding me, yeah.


So, it’s available to anyone else out there who is interested.


Just go to GitHub and download it and try it out. You can try it out on your laptop, you can try it out on a thousand different devices, it’s going to work just fine. Actually, a company recently picked it up and it saved them an otherwise costly expansion, and they’re storing 3.6 petabytes of data on it. As for whether I would do it again for MapReduce, we would do it again as well. We did it because we’ve been early users and we enjoy it and it is a great community. At some point we went ahead and went off the target data size, so we’re about 10 to 15 times larger than the average use case. We are not the target of the companies who sell it, so we have to do it on our own, simply because our size was larger than the average target.


Great. So, I’m going to go all the way down and maybe at the bottom of the stack here is not just ops, but actually engineering managers. So, Bhaskar, you manage a team of folks at LinkedIn. There are some very strong infrastructure engineers there. I wonder, from your perspective, what is it that engineering managers are looking for in this stack? Maybe you could expand a little bit on how we were talking earlier about how some of us are here from Silicon Valley companies and the types of engineers that we may hire may not be the kinds of engineers that Verizon may choose to hire. Talk a little bit about what you were saying earlier around the kinds of engineers that you may hire to build infrastructure versus those you might have if you’re buying things.


So, the thing that you brought up was build versus buy and we all agreed that it depends on two things: the phase of the company and whether what you build is part of your competitive story. So, given the chance, if we had to rebuild things at LinkedIn, we would still invest in a lot of the same things that we have built out, because they are giving us a competitive edge. On the online side, we’ve built out a lot of the infrastructure, right, and we have built Kafka, which is used massively outside. Coming back to the talent aspect, when you’re building deep systems, it is easier to attract top talent in that area. By bringing them in to build deep systems, which are ultimately helping us compete, folks won’t just build those systems; they will go through the ecosystems and build more stuff, which will add to the value. So, building depends on the size of the company, it depends on the phase of the company and how you want to grow. I would say, to go out on a limb, some amount of building is a good thing for a company. Otherwise you become a company where systems integration becomes a team and you don’t want to go there.


So, one thing I heard when we were talking, and Todd, I heard this thing from you before as well, for someone who came from a place like Teradata, where clearly you guys were building a relational data store for sale, going to a place like Yahoo, where obviously, we talk about these deep infrastructure engineers, Yahoo has no shortage of them. Tell us a little bit about that distinction between the things that an ecosystem can do and some of the old, boring things that these relational database systems do. Maybe paint a picture for us about where you see Hadoop and how it fits in, certainly at Yahoo, and then going forward to some of these other SQL systems.


First thing I just want to tackle is that I don’t think a relational database system is boring. Maybe some of them are old, but you’re going to have one. Everybody’s going to have a relational database system, they’ve been around for 30 years, there’s a ton of investment there and I still have a lot of friends at Teradata and Greenplum, so I’m not going to poke them in the eye too much. That being said, the transition from Teradata, where we were working on probably the world’s most sophisticated parallel database engine, to Yahoo and working on a brand new distributed system was a bit of an eye opener because the factors of what you’re focusing on are very different. At Yahoo, we were massively focused on scale, we had a ton of data. When I was there it was 200 petabytes worth of data, so we were always pushing the envelope in the software to scale to that point, whether it was the NameNode or the file system or whatever it turned out to be. The very big difference that I saw though was we were not focused on things like efficiency, where a commercial company is looking at things like is it efficient from a storage standpoint? Is it efficient from a job management standpoint? Because they’re having to compete with that. At Yahoo, our approach was basically to just throw more machines out there. It’s cheaper to just throw more machines out than to actually focus on the software. So, you see in the Hadoop ecosystem things like QFS coming out now which are looking at some of the bits that are lacking in Hadoop and HDFS and saying, “Look, this was nice, but it’s not going to be the thing that’s going to get you over to the next 10 to 15 years.” We need to fix that.


So, you think that it’s fair to say that every organization will always have a relational database somewhere in their big data infrastructure?


Don’t see that going away for the next 15 years, no.


If I can add something that came up a lot. In the offline analytic space, what we are finding now is that Hadoop is progressing and evolving rapidly, but with the entire stack, Hadoop is not ready yet to do everything we need it to do. So, we at LinkedIn are really looking at a very tightly integrated story between Hadoop and a traditional relational database to drive insights on one side and then to drive feedback loops on the other side.


Great. So, one thing we were also talking about earlier was the fact that we’re not just talking about hardware and software choices. When we think about the big data stack, it’s a lot about people, it’s about organization, it’s about how IT manages budget and the people within that. Ashok, to put you on the spot again since you’re entering into a new organization, you obviously are being brought in to think a little bit about how the people are architected. Maybe you can comment a little on some of these tensions: should this be centralized? Should the big data expertise of a company be a centralized unit that services all its constituents? Should it be decentralized? Should there be a data scientist in every department? How are you thinking about the people that map to this big data stack at Verizon as you come in to take that role?


It’s an interesting question because we are starting off from a relatively small team and we are going to be growing it. One of the key things is that it has a lot to do with what the stack actually is and the decisions that are made there. So, what I’m seeing is that there’s going to be the need for building an organization that can deal with very, very large real time streaming systems, and that is going to drive a lot of the decisions that we make as we make investments there. Three years ago, it would have been a very different story. I would be telling you that we will need data scientists, but we really need to build some operational capabilities immediately and build a lot of the things that have been developed already. Well, three years have gone by now and now we’re in a situation where we can come back in and because we can leverage some of the work that’s gone on at LinkedIn and companies like Continuuity and other places, we might be able to take that and put more emphasis on upper levels of the stack, including analytics, including reporting, including product management. One thing that I observed in my capacity in the Venture Capital industry is that a lot of emphasis has gone on in the infrastructure and I think it’s obviously needed, but really where the value of big data is going to come is in solving critical business problems. So, you can think about big data being embedded in your organization and as it gets embedded, solving real critical problems that could not be solved without the advent of big data. So, bringing people in who can think like that, who can articulate that vision and then who can execute it, that’s going to be a big emphasis for us in the coming years.


So, maybe we’ll use that to segue into something that I think is a key part of this talk that we haven’t discussed yet, which is applications. The applications that sit on top of all this infrastructure, and I think there are two ways that often one can think of these applications. Often, visualizations, dashboards, this is an active area of development on the one hand, and then certainly less sexy, but just as critical, are these outbound APIs that may feed into other systems. Silvius, at Quantcast, you guys have an enormous amount of data. What are some of the tools and applications that you and your business users use to get some of this insight out of all the data that you’re storing? We know there’s Tableau and there’s GoodData and there’s QlikView and Spotfire and MicroStrategy; in that set of choices, what are you guys looking at as your go-to tools?


That’s a good question. That’s certainly not our strong point. We tend to do things fast and large, but it is our need certainly. So, we look at, in terms of big data, we have data pipelines. We have analytics like reporting. We have reports that are generated daily, such as for billing, for instance. We have reports that are run on demand. Those I call Interactive Analytics. So, Interactive Analytics is a complex problem because you have at one end, the user, who may be a business person in business intelligence, and says, “What if we did things a little different? How could we get value out of this?” So, they want to validate. They have 10 ideas and they know that 9 are wrong. They know that one in there is good, so how do we get that back to them in seconds, not tomorrow? Tomorrow it will be too late, they need to build on top of this. The big challenge there is, they don’t know Java, and we don’t want them to know Java. We don’t want to expose the system that way to them. It could not be run by us. So, what we need to do is we need to give them the system in a way that they can manage and understand. So, data discovery has to be made easy, and what’s really challenging in my mind about this is, unfortunately there is a part of the system that cannot be hidden from the user: data distribution. If you distribute your data across a thousand nodes and ask a question that requires a large amount of computation, if your question does not align with your data distribution, it’s not going to be interactive. So, how do we solve that problem? That’s what we’re focusing on.
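To make the data distribution point concrete, here is a minimal sketch of why a question that lines up with the partitioning can be answered by one node while a question that does not has to touch all of them. Nothing here reflects Quantcast’s actual system: the cluster size, the field names, and the `distribute` and `query` helpers are all hypothetical.

```python
import zlib
from collections import defaultdict

NUM_NODES = 8             # hypothetical cluster size
PARTITION_FIELD = "site"  # hypothetical field the data is distributed by


def node_for(value: str) -> int:
    """Map a partition-key value to a node with a stable hash."""
    return zlib.crc32(value.encode("utf-8")) % NUM_NODES


def distribute(records):
    """Place each record on the node chosen by its partition field."""
    nodes = defaultdict(list)
    for record in records:
        nodes[node_for(record[PARTITION_FIELD])].append(record)
    return nodes


def query(nodes, field, value):
    """Return matching records and the number of nodes that had to be scanned."""
    if field == PARTITION_FIELD:
        targets = [node_for(value)]       # aligned: one node holds every match
    else:
        targets = list(range(NUM_NODES))  # misaligned: every node must be scanned
    hits = [r for n in targets for r in nodes.get(n, []) if r[field] == value]
    return hits, len(targets)


# Synthetic records: 20 sites and 3 countries spread over 600 rows.
records = [
    {"site": f"site{i % 20}.example", "country": ("US", "DE", "IN")[i % 3]}
    for i in range(600)
]
cluster = distribute(records)
site_hits, site_nodes = query(cluster, "site", "site7.example")
country_hits, country_nodes = query(cluster, "country", "DE")
```

In this toy setup the per-site question reads from a single node, while the per-country question scans all eight; at a thousand nodes that gap is roughly the difference between an interactive answer and a batch job.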


How do you guys solve that problem? Because certainly anything that’s running on top of Hadoop is not going to be interactive. So, what are some of the approaches that Quantcast takes to feed these interactive analytic dashboards that you give to users? Or is it still a challenge?


It’s certainly still a challenge, but we wrote something. We have this habit of re-inventing the wheel, but sometimes it’s not the wheel that we need. So, we have the luxury of having large data sets in-house. For people who develop big data stacks, many of them do not. We have a large infrastructure and we have the data and we have the business resources to mine that data. So, we come up with tools that work for us and maybe even work for others as well. So, one such analytics tool that we put together was essentially to run on the file system, but it still goes over. It allows people to select sets of 20 terabytes out of 20 petabytes, and it gives an answer in anywhere between 10 and 60 seconds. It’s not where we want it. We want it in 5 seconds.


So, there are a number of companies in the last 6 to 12 months, a number of initiatives that have tried to address this issue that Hadoop is great for batch processing in your stack, but it may not be great for interactive queries at the speed of thought. We’ve got Cloudera’s Impala project, we’ve got the Apache Drill initiative, certainly inside of Google there’s the Dremel engine that we’ve heard about and there’s another initiative, PowerDrill. Maybe we’ll segue into this talk about real time. There are really two things that people conflate when they talk about real time. The first is what we were just speaking of: interactive analytics. I ask a question, I get something back instantly, although the data that I operate on may be old. So, QlikView and Tableau may allow me to ask fast questions of old data. The second piece of real time, and maybe this is what I would love to hear about the LinkedIn architecture, is basically the data is live as it comes in. Suddenly LinkedIn has a hundred thousand new sign-ups. How have you architected for that second style of real time? How does it sit next to the existing big data stack?


You’re absolutely right. There are two feedback loops which go back to the site. There are lightweight feedback loops, which are streaming. So, we have our Espresso real time database serving the site, and we have tracking data and database data coming out in streams through Kafka and Databus, going through very fast systems. You compute simple aggregates there, simple counters, you do simple stuff and you serve it back to the site very quickly, so it’s near real time. Then the other part is what you already mentioned, that you get stuff into your offline system and then you do an interactive query. So, our stack there is from the site, from Espresso, which is a distributed database, we take stuff out, we take data out and put it into Kafka and we feed it back.
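The “simple aggregates, simple counters” loop described here can be sketched in a few lines. This is only an illustration of the idea, not LinkedIn’s implementation: it reads from a plain Python iterable rather than a real stream client, and the event names, the 60-second window, and the `windowed_counts` helper are assumptions made for the example.

```python
from collections import Counter, defaultdict


def windowed_counts(events, window_seconds=60):
    """Count events per type within fixed (tumbling) time windows.

    `events` is any iterable of (timestamp_in_seconds, event_type) pairs,
    standing in for messages consumed from a stream.
    """
    counts = defaultdict(Counter)
    for timestamp, event_type in events:
        # Round the timestamp down to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[window_start][event_type] += 1
    return counts


# Hypothetical tracking events: (seconds since start, event type).
stream = [
    (0, "signup"),
    (30, "signup"),
    (45, "profile_view"),
    (61, "signup"),
    (62, "profile_view"),
    (119, "profile_view"),
]
counts = windowed_counts(stream)
signups_first_minute = counts[0]["signup"]
views_second_minute = counts[60]["profile_view"]
```

Because each event only increments one counter, the work per message is constant, which is what lets this style of aggregate be served back to the site in near real time while heavier questions go to the offline system.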


So, this sounds like we have the first generation of real time systems, which are fairly simple counters, incrementers sitting on top of streams, maybe distributed pipelines. Todd, as the guy who’s building the next generation of developer tools, can we go further? What’s next? I think real time is the Holy Grail for all of us. We’ve solved the problem of scale, we’re trying to solve this problem of speed. How do we get there? Can Continuuity’s tools get us there?


Of course we can.


Leading question.


Yes, thank you for the leading question. I won’t answer it from a vendor’s standpoint. I think the interesting thing is that right now what we have in the industry is a set of tools over here for doing real time pipeline stream ingestion, and then a tool over here for doing batch, and part of your lead into this session was we think we solved the batch problem and that’s interesting. What we have is a file system with a batch layer on top of it and then a real time layer over here, and a bunch of people have a search thing over here, maybe some real time query. What’s more interesting, I think, from a stack standpoint and an architecture standpoint is, let’s say, we assume we have a distributed file system, HDFS, QFS, whatever it turns out to be, there are a set of workloads that need to run on top of that data, and more interesting to me, and one of the things that we’re really focused on and investing heavily in at Continuuity, is how do we actually expose different engines that run across the same data set. So, you don’t have a real time stack over here and a real time cluster over here and have a batch cluster over here and a search cluster over here. You bring all of that stuff together, so the applications you can build can actually span those workloads. So, on our platform, for example, you can do real time and batch in the same application; it’s sitting on the same stack. I think that’s actually the future of where application developers want to be. They want to be able to use these capabilities within a unified platform. That’s not going to solve the edge case problems, maybe Quantcast is an example of a company that has hit the barrier and has hit the edge of the envelope, but for the majority of the world out there, the majority of the customers and the enterprises out there, it will solve their problem.


So, maybe I’ll drill in a little bit on this. It seems that maybe one of the hardest challenges for all of us who have ever worked in building big data stacks is actually the ingestion phase and the transformation phase. Often, ETL, Extract Transform Load, before we stage it into an environment, what is the language that developers are going to be using to describe this ingestion and these transformations. Is it SQL? Is it Pig? Is it Hive? We have a number of options, but I think one of the challenges is that all these different systems require slightly different languages, and where do you see simplification going there? What do you think is the language of transformation for developers going forward?


I think actually whether it’s SQL or it’s Pig, there’s a minute play in some ways. What’s more interesting when you think about the large scale is actually how do you describe the overall pipelines? It’s not a simple thing where you go from one machine to another and there’s one pipeline. It’s a massively distributed interconnected thing that turns out to be a very complex distributed system. Being able to describe that and overlay that and understand the way the data flows through it and the data life cycle, whether you have data quality issues at any point, that is a far more interesting problem than the way you decide to slice up the stage on the way in or whether to do it in Java or Python or JavaScript. All of those tools are useful at different points in that data flow.
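One common way to describe a pipeline independently of the language each stage is written in is as a dependency graph. The sketch below is an assumption-laden illustration, not Continuuity’s product: the stage names are hypothetical, and it uses the standard library’s `graphlib.TopologicalSorter` (Python 3.9+) to get a valid run order plus a small helper to find which stages a data quality problem would reach.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each stage maps to the set of stages it depends on.
PIPELINE = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "archive": {"clean"},
    "serve": {"aggregate"},
}


def run_order(pipeline):
    """Return one valid execution order in which dependencies come first."""
    return list(TopologicalSorter(pipeline).static_order())


def downstream_of(pipeline, stage):
    """Return every stage that directly or indirectly consumes `stage`."""
    affected = set()
    frontier = [stage]
    while frontier:
        current = frontier.pop()
        for node, dependencies in pipeline.items():
            if current in dependencies and node not in affected:
                affected.add(node)
                frontier.append(node)
    return affected


order = run_order(PIPELINE)
# If "clean" emits bad data, these stages would need to be re-examined.
tainted = downstream_of(PIPELINE, "clean")
```

With the flow written down as data, questions like “what runs before what” and “what is affected if this stage has a quality issue” become graph queries, regardless of whether an individual stage is SQL, Pig, Java, or Python.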


Ashok, now we’re on the topic of real time. I can’t imagine that any industry could be more interested in real time than Telcos. One of the things that you have continuously emphasized is that we’re doing all these things with data, we’re writing all this infrastructure, we’re writing our own file systems, but obviously there’s a reason, there’s a business use case, there’s an application. Let’s think forward to a utopian world where these real time systems have been built, where we actually do have the ability to both ingest, transform, analyze data in real time as it streams from the millions of Verizon Wireless phones out there. As you look forward as someone who has a business view, what are some of the things that can be done in that brave new world of real time?


I’ll tell you, it’s an extremely exciting time because I think within the next few months, if not the next year, what you’re talking about is going to come closer to reality, and we’re planning for that. Some of the things that we’re thinking about obviously are in the marketing space, as with Quantcast and other companies; I think it’s going to be very critical that we develop technologies that can make real time decisions in an automated fashion. Having said that, I think there’s an enormous opportunity in other areas, for example, in the area of cyber security. We heard a talk earlier today from someone talking about the need for understanding motion and movement of people, identities, and concepts through society. I think these kinds of technologies from the perspective of cyber security would be extremely interesting. From the standpoint of science, I was at NASA before I joined Verizon, some of the things that we are talking about there are truly astonishing from the standpoint of using real time data, for example, citizen science. Imagine taking your cell phone and taking some pictures, but now multiply that across millions of people and turning that into a data product that can actually be used by scientists to do real science. These are some of the things that we’re thinking about. Medical applications, for instance, are another area which can take what you called, very nicely, the old data, maybe from yesterday sitting in HDFS, but then also the immediate data of this moment, using all of that together in order to make some real time predictions about a person’s state of health or about machine to machine health. These are some of the areas that we’re thinking about.


Or maybe just helping us get through Manhattan traffic a little faster would be great. We’ve got a few minutes left and I think what I wanted to focus on in our last few minutes would be, we’ve talked a lot about our own big data stacks and I think there’s always the appeal of the new, so I first want to ask each of you to go down and tell us about a new technology that you’ve come across in the last year that you’re particularly excited about. Then, secondarily, I think it’s always important in the era of the new, of NoSQL, of streaming data, there are some old dependable favorites that we shouldn’t lose sight of. Todd mentioned relational databases as being one of them. So, maybe Bhaskar, if we could start with you, what are some things that you’ve seen at LinkedIn that are exciting and new that you guys are particularly looking forward to and what are some of the old dependable things that you want to make sure that we don’t forget?


A little but extremely useful thing that we open sourced recently was a generic distributed cluster manager called Helix, which is an Apache Incubator project, and we’re getting a lot of interest about that and we’re happy to help people use it. Two specific areas where we are actually building stuff, which is exciting: one is the streaming engine area. So, to do more stuff sitting close to the serving system. So, that’s very important to us and I think that will expose a lot of business value. The other is something that got discussed today in the Hadoop Panel, which is about do we keep on improving the HDFS and MapReduce layer or do we build intelligence in serving systems and move forward? We are seriously looking at all our serving systems sitting on top of HDFS. I would say those two are extremely interesting areas that we’re actually building for.


Right, and then what’s the old reliable?


Old reliable is that serving out of SQL works, and we made it work with Espresso, and using a parallel relational database with Hadoop may not be an attractive thing to say, but it seems to work very well for us.




By the way, Helix is a really cool project. We were actually taking a look at that at Continuuity. One of the things that we’ve been excited about over the last year at Continuuity is we’ve been working a lot with YARN. YARN is the new resource manager for Hadoop. Turns out that it was built basically to really just think about MapReduce jobs. At Continuuity, we’ve built a new real time stream engine called BigFlow, and we’re actually using YARN to do all of the resource deployment and management of that. I think the thing that you guys should be looking out for is Weave. It’s a higher level management framework that sits on top of YARN. It allows you to build a much wider class of applications on top of YARN. So, we actually think that YARN is a fundamental primitive within the distributed data kernel which is Hadoop, and is something that we will be going forward with for at least the next half a decade. Weave allows you to actually build more wide scale applications on top of that and distribute out the different containers and stuff. On the old school, get a database. You’ve got a database already, you’re not going to get rid of your database; parallel database, MPP database, I can give you 4 or 5 out there that work.


Right. Vertica, Greenplum, Teradata and Aster perhaps. Silvius?


I’ll start with the dependable and older thing. We use Pulse and it has been great. The cool new thing that’s not really new but it was critical and we enjoyed it was the language developed by Google and it was great for a class of users that otherwise didn’t have access to data. It’s so lightweight that it uses hundreds of kilobytes at one time. When did you hear kilobytes last?


And Ashok, I’m actually going to give you just the shiny new thing to state in the last few seconds we’ve got here. What are you most excited about looking forward?


I’ve touched on it already. I think that we’re looking into a world now where business problems can be solved using big data that could never be solved before and I think the underlying technology, whether it’s a relational database or Hadoop or some combination thereof is less relevant than actually solving problems that could never be solved before. From the VC perspective, I think those are the investments that we’d like to make.


Great. Well, thank you all four for sharing your thoughts as the esteemed Guru Panel. Let’s get a round of applause from our audience for these four. Thanks guys.


Thank you.
