When data lakes become landfills: how to avoid drowning in surplus information

Session Name: How To Bite Off More Than You Can Chew And Succeed: The Big Data Infrastructure Gurus Speak.

Chris Albrecht
Tim Moreton
Ron Bodkin
Adam Fuchs
Bhaskar Ghosh
Raymie Stata

Chris Albrecht 00:03

Thank you, Brian. Moving right along, our next panel is How To Bite Off More Than You Can Chew and Succeed: The Big Data Infrastructure Gurus Speak. It’s going to be moderated by Tim Moreton, the CTO of Acunu. He’s going to be talking with Ron Bodkin, CEO of Think Big Analytics; the CTO of Sqrrl Data, because I forgot to announce your name; Bhaskar Ghosh, Senior Director of Engineering at LinkedIn; and Raymie Stata at Altiscale. Please welcome our next panel.

Tim Moreton 00:34

Thank you, everyone. We’re here to talk about big data infrastructure. This is a bunch of industry practitioners who have witnessed first-hand the challenges of managing very large data sets and building infrastructures to cope with them. I’m going to ask the panelists to introduce themselves, and without further ado after that, we’ll kick off the discussion.

Ron Bodkin 00:56

Thanks, Tim. I’m Ron Bodkin, founder and CEO of Think Big Analytics. Prior to starting Think Big, I was VP of Engineering at Quantcast, the pioneer in big data that started using Hadoop in production in 2006 and ramped up to 50 petabytes of data. At Think Big, we’re the first of a new breed of services firms focused on big data, helping the enterprise build high-value solutions with Hadoop, NoSQL, and distributed data science. I’m excited to share our experience from real-world deployments.

Adam Fuchs 01:22

I’m Adam Fuchs, I’m CTO of Sqrrl. Sqrrl is a company that does search and discovery of big data with an emphasis on complex security models, like those you might find in healthcare and in the government space. In fact, my founding team came out of the US intelligence community, so all of you that have been talking about the NSA lately, I know about you.

Bhaskar Ghosh 01:51

Hi, I’m Bhaskar Ghosh. For the last three-plus years I led all of LinkedIn’s data infrastructure engineering efforts. You might have heard of our open source projects like Voldemort, Databus, Helix, and Kafka, and our new open source database, Espresso. Going forward, I’ll be jumping into the brave new world of data verticals and analytics, and figuring out our business analytics platform strategy for sales and marketing for our internal initiatives.

Raymie Stata 02:24

I’m Raymie Stata, the CEO of Altiscale. Prior to Altiscale, I was at Yahoo! for over seven years, where I spent a lot of time on the Hadoop project itself. At Altiscale, our goal is to take the type of highly scaled, professionally operated Hadoop infrastructure that we had at Yahoo!, and that companies like Facebook have, and make it available as a cloud service for any company to have access to.

Tim Moreton 02:47

Great, thank you, guys. Just to give you a bit of context about me, I’m CTO at Acunu. We do real-time analytics on streaming data, essentially self-service BI on top of NoSQL. We work with customers from manufacturing, from financial services, and a number of considerably smaller data start-ups as well, helping people get instant value out of streaming data. I wanted to start with a bit of a “so what?” question. Clearly, the larger your data set, the more challenges you have in deploying an architecture that allows you to get value out of it. Why bother? What have you seen organizations or the clients that you work with achieve with data?

Ron Bodkin 03:35

Tim, we see that customers in a variety of industries are excited about big data because of the ability to create value out of data sets they couldn’t handle before. Sometimes it’s organizations with such massive amounts of transactions that running algorithms on them is too much, but often it’s things like web and mobile data, clickstreams, device data, the internet of things where you have devices out in the field, like in a project we did earlier on for Network Appliance. Similarly, we see financial data; we spoke at the Structure Data conference about work we’ve done with NASDAQ on integrating trading and order data with new data sets. So we see all these new sets of data, and organizations realizing they’re not getting value out of the data because the traditional architectures and ways of processing the data are not suited for these new challenges.

Tim Moreton 04:25

Absolutely, and sometimes the data sets are new, they’re being newly created by data sources that simply didn’t exist. So telemetry data, or exsource data from social media sites, but sometimes it’s that the data sets were there previously, but couldn’t be well exploited.

Ron Bodkin 04:42

Right, people traditionally had a view of data, it was more of a bunker mentality. How quickly can I condense it down, summarize it away, and limit what I have to work with because I have such limited resources? And today we see people using data science to create analytic applications where they say, “What new data could I use to get smarter? How can I get results into the hands of my customers to give insights? How can I automate decisions to move faster?”

Tim Moreton 05:05

Great. Adam, how do your customers get value out of big data?

Adam Fuchs 05:10

I think you make a very good point that in many cases, the big data has already been there, and there are new ways of dealing with it. Businesses have always been trying to evolve, and one of the impediments to that is how much information you can get on which direction you need to evolve in. From my perspective, I think a lot of the big data movement is about more rapid application development and the infrastructure to support that more rapid application development.

Tim Moreton 05:40

Great, and Bhaskar, your organization has always had a lot of data, but has increasingly focused on data being one of its core values.

Bhaskar Ghosh 05:51

At LinkedIn, from a company point of view, our focus on the consumer side is our members, and as our members have scaled and come back and engaged with us more and more, the core data that we have generated is part of our core business value. Managing that data, both for our site-facing side and for our enterprise partner side, which we don’t talk about much but which is an extremely rich area, and building platforms around it end to end: we have opted for a [?] approach. Managing that data and deriving data from it to feed back to the site, and deriving value from it to run our hiring, sales, and marketing businesses, is part of the DNA of the company.

Ron Bodkin 06:31

I think LinkedIn is a great example of how companies mature in using big data, where they’re at the fifth stage of adoption of big data, where they’re creating new businesses, new capabilities out of data, it’s core to their business, whereas we see most of the enterprises out there are still at earlier stages, trying to do agile analytics and be better able to use the data they have in new ways, and not yet automating with machine learning and data science, let alone creating new business, but those stages will come.

Tim Moreton 07:00

Absolutely, I think there is definitely an interesting distinction between the creation of greenfield projects, where the whole proposition is based around being able to do things with data that you couldn’t do before, perhaps new data sets, versus brownfield sites where you’ve got a data warehouse, you’ve got a database or data source that you know has value in your business, but your current tool set simply cannot take advantage of that. Do you have any thoughts, Raymie?

Raymie Stata 07:30

In terms of my experience at Yahoo!, I could regale you with examples of how, for relevance science, we used all kinds of weird signals to help improve search relevance, and for advertising, same thing: all kinds of interesting signals to drive conversions. But since leaving Yahoo!, what I’ve noticed is the lack of the use of data, because of the bunker mentality that Ron mentioned, where people will proudly show me how they take all their web data and crunch it down into this tiny little bit and then hand that off.

Ron Bodkin 08:06

You said something wrong. The NSA systems are pondering.

Raymie Stata 08:15

Nobody but the Ops team sees, and this is for consumer web application companies, nobody but the Ops team sees the raw log data. They see this very aggregated form of it, and it’s just a shame that the product teams don’t have access to rich signals to really drive engagement, for example.

Tim Moreton 08:36

I think a large part of the drive towards making better use of data requires big organizational shifts. It requires you to be able to, as an organization, accept that you should be making decisions around data and understanding the relevancy and completeness of that data, which is not always straightforward, is it?

Ron Bodkin 08:59

We see that many enterprises starting the big data journey look back at recent patterns and try to apply them, and they don’t recognize that there’s a big cultural shift: there’s a new role in the data scientist, and IT is being called on to innovate in partnership with the business to actually create new value, rather than simply commoditize and cut cost. So it’s a big shift, and it’s not something that’s familiar. We see a lot of organizations stumble by trying to handle big data like it’s a data warehouse implementation, or an SAP implementation, instead of saying, “Let’s experiment, let’s work closely with the business. Let’s have a more agile approach. How quickly could we get something out there and start to get value?” We think that the days of the traditional approach to analytics, using data warehousing, are numbered. Instead of having this big system in the glass house that no one can touch, where all data needs to be curated and perfectly organized before anyone can have access to it downstream in the lowly data mart, you have the data lake: a Hadoop-based environment where you can store this data in a raw form, give data scientists access to it in that raw form, and then feed marts directly for fast analytics for everybody else. It’s agile, so the organization can start to take in data immediately, and promote things that are high value.

Tim Moreton 10:17

I think those organizational changes and cultural shifts, like shifts in patterns of working are necessary to prevent this data lake from becoming this data landfill, as Bhaskar described it to me, earlier. Adam, what do you think?

Adam Fuchs 10:31

I think, as an infrastructure company, we have a challenge, which is basically deciding who our customer is. Are we selling to developers that are embedded in various organizations or are we selling to the business decision makers, or analysts working at a higher level? I think a lot of the big data ecosystem and the software that’s out there, especially when you’re looking at Hadoop and those types of systems, they’re organized towards developers. A lot of companies that want to take advantage of big data don’t have that development community internally. I think that’s part of the shift that we’re seeing. Having the right tools in place to support the actual end-users, that’s a real challenge.

Tim Moreton 11:16

Absolutely. We focus very much more on the operational analytics end of the spectrum, so rather than analyzing data at rest to find needles in a haystack or unknown unknowns, we look a lot more at helping people understand and get insight into data sets where they can anticipate the analytics they’re doing. We find very much the same problems there. There’s definitely a gap between the self-service BI tools, which have developed to become rather sophisticated on top of traditional data warehouses, and what happens these days with big data setups. We’ve had customers tell us anecdotally that they still have to have teams in house to translate between the requirements of the traditional business user and MapReduce jobs, to actually be able to ask the systems the questions that are necessary. That’s a challenge that you mentioned, Ron, I think.

Ron Bodkin 12:16

You often see a real gap in that you’ll have technical teams that have set up Hadoop with a “build it and they will come” mentality of loading some data in and hoping that value occurs, maybe cutting some costs on ETL, while the business has these real problems and has given up on being able to solve them because they’ve been told, “No, it’s not possible to use data that way.” Getting them together in what we call a brainstorming session helps think through what you can do in this new world that solves real business problems, going beyond creating that lake of data to actually start using it to move the needle on the business.

Tim Moreton 12:52

Absolutely, how does LinkedIn deal with this?

Bhaskar Ghosh 12:54

Let me throw a couple of anecdotes out. In a very data-driven ecosystem like LinkedIn, a bunch of people sitting here may have heard of our successful open source projects, like Kafka, and our cluster manager called Helix is getting amazing adoption. What is not known outside, which I am personally going to dive into more and more, is that within the LinkedIn ecosystem, within the company, we have done some really innovative things for our sales and marketing folks. We have come up with social-graph-based ideas like SPI, which is social [?], to figure out how our sales people sell. Someday we’ll come and talk about it. We’ve built a tool called Merlin, which our sales folks use, which is a personalized view of who you’re trying to sell to, using the LinkedIn graph. In a data-driven ecosystem in which you have a centralized, clear lake of data, what I’m finding out is that if you look at the architecture of people and data, around the central hub there will always be spokes and marts, all time data-marts. And within these data marts, businesses will have to be very agile, innovative, and fast, and there is scope for platforms there as well, all the way down to the regular platforms we have talked about, storage and pipes.

Tim Moreton 14:15

That’s very interesting. Raymie, what are your thoughts on this?

Raymie Stata 14:19

When you look at data lakes versus landfills, I think one of the still unsolved problems in the world of Hadoop is that there are no good data management tools. Stuff goes in, and six months later it’s like, “What is this?” There’s no schema management, there’s no metadata, there’s no provenance, you don’t know where it came from. People have a habit of taking the original source and making some subset of it, and then the next guy makes the same subset and the next guy makes the same subset, and before you know it, you have all of this storage and nobody can remember who did it. You can set up a lot of manual tools and policies to try and catch that, but there are no good data management tools, and I think to keep the data landfill from happening, it’s important that the community somehow make that happen.

Bhaskar Ghosh 15:12

Raymie, when you say data management, do you mean tools for scheduling, handling DAGs, metadata, core processing, or all of the above?

Raymie Stata 15:23

It’s more about just metadata on data. Where did this data come from? Where did this file come from? What’s in it? If it’s derived, what was it derived from? Who derived it? And if there’s a retention policy, when can we get rid of it? Data dependency really kills you in terms of who consumes this data? If I were to get rid of it, who’s going to be impacted? And that is one of the killers in terms of feeling like you can’t ever delete anything, because you just don’t know who might be using it.
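The questions Raymie lists here (where the data came from, what it was derived from, who derived it, when it can be deleted, and who is impacted by deleting it) can be pictured as a small catalog record. What follows is a minimal Python sketch under stated assumptions, not a tool that any panelist describes: the names `DatasetRecord`, `Catalog`, and `impacted_by_delete` are hypothetical, and consumers are modeled only as other data sets derived from the one in question.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Optional, Set


@dataclass
class DatasetRecord:
    """Hypothetical metadata record for one data set in the lake."""
    name: str
    source: str                       # where the data came from
    owner: str                        # who produced or derived it
    created: date
    derived_from: List[str] = field(default_factory=list)  # parent data sets
    retention_days: Optional[int] = None                    # None means no policy

    def expired(self, today: date) -> bool:
        """True when a retention policy exists and has lapsed."""
        if self.retention_days is None:
            return False
        return today > self.created + timedelta(days=self.retention_days)


class Catalog:
    """Hypothetical in-memory catalog keyed by data set name."""

    def __init__(self) -> None:
        self.records: Dict[str, DatasetRecord] = {}

    def register(self, record: DatasetRecord) -> None:
        self.records[record.name] = record

    def impacted_by_delete(self, name: str) -> Set[str]:
        """All data sets derived, directly or transitively, from `name`."""
        impacted: Set[str] = set()
        frontier = [name]
        while frontier:
            current = frontier.pop()
            for rec in self.records.values():
                if current in rec.derived_from and rec.name not in impacted:
                    impacted.add(rec.name)
                    frontier.append(rec.name)
        return impacted


# Illustrative data only: three made-up data sets forming a lineage chain.
catalog = Catalog()
catalog.register(DatasetRecord("raw_logs", "web servers", "ops",
                               date(2013, 1, 1), retention_days=90))
catalog.register(DatasetRecord("clicks", "raw_logs", "alice",
                               date(2013, 1, 2), derived_from=["raw_logs"]))
catalog.register(DatasetRecord("sessions", "clicks", "bob",
                               date(2013, 1, 3), derived_from=["clicks"]))

print(sorted(catalog.impacted_by_delete("raw_logs")))
```

With records like these, the question “if I were to get rid of it, who’s going to be impacted?” becomes a lookup rather than a guess, which is the gap Raymie is pointing at.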

Tim Moreton 15:52

I think that as traditional concerns about data quality, dealt with at the ETL stage, become less relevant or are handled in different ways, questions of provenance come up, and sometimes a compulsion to delete and remove things. Not just, “Where did this come from?” but “Should we even still have it?”

Raymie Stata 16:12

Are we allowed to have it?

Tim Moreton 16:14

Are we allowed to have it?

Ron Bodkin 16:18

These are great, quality problems to have. What’s happening is that organizations that start using big data really start to get such value that you get a proliferation of people analyzing data and putting it in forms optimized for experimentation. These are good problems to have, when the organization has lots of data scientists generating copies of data and you’re worried about, “How do I manage that success?” That means that the business is really getting some amazing results from your data. We see that the most foundational problem is getting that initial spark: getting the system up and running, and starting to get insights into, “Why are customers buying? How do we figure it out before our machine breaks, so we can service it and avoid expensive downtime?” Problems like that are often where people get started in their big data journey.

Tim Moreton 17:03

Absolutely, and that comes back to, how do you get those quick wins from the set of tools that we have available today that are widely deployed? As they are mainly low-level infrastructure tools, how do we move to a world, or is it even desirable to move to a world where there are off the shelf, vertical solutions for particular challenges that mainstream organizations are going to encounter? Raymie, maybe?

Raymie Stata 17:30

In the early stages of any kind of new technology trend, vertical integration is important to get things bootstrapped. You have applications dominating early on. Take web analytics as an example of a vertically integrated stack that helps you get business value out of your web logs, quickly and easily. I think what happens over time, and again, analytics is a good example, is that at some point, after consuming that, you outgrow the product and you want to start to reach underneath and get at the raw data and do stuff on your own. One of the interesting challenges in the big data universe is what happens when those transitions occur. If all of my data is over at Omniture, and now all of a sudden I want to get at it: “Well, wait a minute, that’s my data. Why do you have it? Can I get it back? What do I put it in?” While it’s important, and I think people will have applications that harness the value of big data in a packaged way, enterprises will be frustrated at some point when they’re ready to touch the data themselves and they don’t have it.

Ron Bodkin 18:39

Raymie, that’s a great point. A lot of customers we work with will say, “Hey, we’ve got data at Omniture, they’re taking feeds and they’re building customer reports to get more out of it.” A lot of times we see people blending Omniture data, maybe DoubleClick, maybe some social or other CRM data to form a more complete view of their customer and start to get analysis and insight. So that’s a great example of how you might start with a point application, outgrow it, and get more strategic value for your company from blending data sets that are unique to you. Our passionate belief at Think Big is that the highest-value big data applications are going to be unique to the character of your company and how you engage with your customers, your product, your market. It can be assembled. Over time you’re going to have more components that come in that can be leveraged, and you get more value out of that, but it’s not one size fits all, any more than companies today have packaged websites that they deploy; they build a website assembled from components.

Tim Moreton 19:37

Absolutely, and it does introduce that question about ownership of the data. The more you verticalize the solution and outsource the analytics and storage of your data, the more that ownership is brought into question. I guess it’s not a problem that you guys have at LinkedIn.

Bhaskar Ghosh 19:50

Which leads us to something we’ve all talked about today, which is about people and talent. What does big data mean for the human ecosystem of the company? Is that something we should talk about?

Tim Moreton 20:02

Absolutely, yeah. One of the interesting things is that big data and big data tools require a significant shift in the typical backgrounds and resumes of the people that you need to hire into an organization in order to achieve success. I guess there are two aspects to it I’d really like your guys’ input on. First, if you’re within an organization and you can see big data opportunities, how do you get started? Who are the sponsors within your organization that you need to get buy-in from? And then once you’ve got that buy-in, who do you hire? How do you hire them? Ron?

Ron Bodkin 20:42

We see that successful big data initiatives are partnerships between a business executive who’s trying to achieve a result out of their data and an IT executive. The ones that only have IT sponsorship tend to produce pretty limited results and then need to go out and recruit business sponsorship. First point: think through what the possibilities are and how you get that quick win. There’s nothing like some quick success and starting to show people, “Look, we were able to improve sales” or “We were able to predict a problem, or identify malicious botnets for security problems.” Those kinds of quick wins, which can be done in weeks and not months or years, change the conversation. The way to build competency is some combination of hiring experts, like us, and working with good products and services that give you a platform to build on instead of trying to do it all yourself, combined with a strategy of training people that have the aptitude. For data science, you’ll typically find that there’s some fraction of statisticians that are customers that have the knack to be creative and learn new skills and new tools, and want to do that and are able to rotate over, and others are more resistant and want to keep doing what they’re doing. So it’s a combination of training people on the new tools, mentoring them, having best practices brought in by an organization that specializes in it, as well as going out and hiring. But you’re not going to hire for skills. The mistake I think a lot of people make is that they say, “I’m going to go out and [?] people with five years of Hadoop.” There are just not enough of them, and they’re probably not going to move everywhere else in the world. Instead, hire people that have the right capability and teach them, let them learn.

Tim Moreton 22:16

What do you recommend to your customers, Adam, and who do you hire for your engineering team?

Adam Fuchs 22:19

From the start up perspective, Sqrrl is a little over 20 people now, so if we had started Sqrrl five to 10 years ago, we would have had to hire heavily into the infrastructure side, would have had to build a lot of internal services just to support the company, but we’ve actually been able to outsource a huge amount of the day to day business aspects.

Bhaskar Ghosh 22:44

Because of the external cloud sharing?

Adam Fuchs 22:46

Because of external clouds, but just because of people moving to more service-oriented systems. I think that in big data, there’s huge opportunity showing. A lot of this conference has been about cloud services, that’s going to continue to evolve. Do you need to hire the same people that you used to need to hire? No, it’s a much smaller group. It still does include the data scientist, you still need somebody that understands security models and privacy models. Our customers tend to be in healthcare where they’ve got legal restrictions on how they can use their data, or various cyber security solutions where they’ve got a lot of policy restrictions on the use of that data. Those are the types of critical hires that you need, people that understand those policies so they can make the data available to the right people.

Tim Moreton 23:41

Now you’ve been through two generations of hiring Hadoop experts? Are you going to be able to start standing on the shoulders of giants, Bhaskar?

Bhaskar Ghosh 23:49

To build an [?] ecosystem, depending on what your core business value is, you can hire infrastructure engineers, business analysts, data scientists, ops, apps engineers, it depends on the place of your company, if you’re small. For us, yes we will continue to build platforms. We will continue to hire infrastructure engineers. We will continue to hire data scientists, because that is where our core competence is.

Tim Moreton 24:13

Great, and I’m afraid that’s all we have time for today. Let’s thank our panelists and wrap up. Thank you!
