Sean Knapp is the founder and CEO of Ascend.io. Prior to Ascend.io, Sean was a co-founder, CTO, and Chief Product Officer at Ooyala. At Ooyala Sean played key roles in raising $120M, scaling the company to 500 employees, Ooyala’s $410M acquisition, as well as Ooyala’s subsequent acquisitions of Videoplaza and Nativ. He oversaw all Product, Engineering and Solutions, and defined Ooyala’s product vision for their award-winning analytics and video platform solutions. Before founding Ooyala, Sean worked at Google where he was the technical lead for Google’s legendary Web Search Frontend team, helping that team increase Google revenues by over $1B. Sean has both B.S. and M.S. degrees in Computer Science from Stanford University.
Andrew Brust: All right, welcome to this GigaOm podcast. This is Andrew Brust from GigaOm. I’m a lead analyst focusing on data analytics, BI, AI, and all good things data. With us, we have Sean Knapp, who is the founder and CEO at Ascend. Sean, welcome.
Sean Knapp: Thank you.
I’m hoping we can talk a little bit about data pipelines but gosh, if that’s all we’re talking about, that can be a bit dry. I think what, hopefully, we can talk about is how to get beyond what has been and get to maybe a better place where data pipelines feel less like a necessary evil and more like something that can really help and perhaps are less burdensome. I may have loaded all the questions there, but hopefully that’s okay.
Can you maybe just introduce yourself a bit and the premise and the rubric of what Ascend is up to? Then maybe we can drill down on a few questions from there.
Yeah, happy to. As I introduce myself, Sean Knapp, founder and CEO of Ascend.io, I’m a software engineer by training. I’ve been building data pipelines for actually 15 years now. I wrote my first MapReduce job in Sawzall back at Google in 2004 and since then have gone on to build companies and teams really heavily focused around data, analytics, ETL, and more. About four years ago, I founded Ascend.io to really help automate and optimize a lot more of how we build data pipelines and make it a lot easier for us to, frankly, do more of them faster and more efficiently.
You’ve got your work cut out for you. I forget what date you said you wrote your first MapReduce job, but were you really coding MapReduce in Java? Was that the substance of what you were up to at that time?
Yeah, so back when I started at Google in 2004, Google had their MapReduce framework. You usually wrote your jobs in a language called Sawzall. I was the tech lead for the front-end team on web search. We were always writing a lot of large MapReduce jobs to analyze how our users were engaging with content on web search, what links they were clicking on, the efficacy of our various experimentation systems. We wrote a lot of analysis on usage even 15 years ago.
I know from a little bit of dabbling I’ve done with Hadoop and that flavor of MapReduce, that can get pretty in the weeds pretty quickly. Gosh, I guess even before there was MapReduce, there were all kinds of ETL packages. Pipelines are not a new concept; they’ve been with us for quite a long time. I go back 20 years in doing BI work, and we were writing SQL scripts then to do a lot of this stuff.
What have pipelines been good for, and where have they hindered us, if I can ask you those two oppositional questions?
I think pipelines have been of tremendous value and also pain for most companies for decades now. They are certainly not a new concept. I’d say the technologies that they employ and that they leverage have changed tremendously, but the core concepts of really doing data movement and transformation, the classic ETL approach, really hasn’t changed tremendously.
I’d say what pipelines have been incredibly valuable for is pulling data out of one or many systems, doing complex and interesting transformations on it, and then of course, loading it back into others, standard ETL. Where they have been increasingly valuable for companies and organizations is as those transformations get more and more complex, the pipelines get more and more valuable.
I would say where the pain really starts to emerge is frankly, as you have more data sets, higher data volumes, and more people tapping into those pipelines, this is really where the pain emerges as we get this exponential increase in complexities tied to the interconnectedness of these systems. That’s really where most of the pain points are felt is simply trying to maintain and sustain these systems as they become increasingly complex, interconnected highways, if you will, of data flowing through the enterprise.
I guess one thing I’ve seen over the years is that we’ve written scripts really with the very tactical goal of just getting data from A to B and from Form X to Form Y. Once we got it to work, we were pretty happy. That’s very ad hoc, and then to go from there to something that is operationalized is another thing. It seems like some of these scripts just became operationalized almost accidentally and they weren’t necessarily up to the task…
Yeah, I totally agree.
Go for it, yeah. I’m listening.
Yeah, I totally agree. I would say the vast majority of pipelines that have been built were built in isolation, oftentimes scripted to meet a very specific and certain need. Then lo and behold, all of a sudden, you found ten other teams were building on top of that derived data set, and what was probably somebody’s weekend or hackathon project is now a critical and core piece of the business.
Maybe not a passion project but maybe a project under duress to meet an immediate requirement may be a little bit more focused on getting something done and not making something that’s really well-engineered and built to be repeatable and last. Here’s the thing: I’ve already shown my age. I’m coming from the old data warehouse days and the BI days, and along comes all this big data stuff and notions of ‘We’re not going to do ETL anymore; we’re going to do ELT. We’re going to do Schema on Read. We’re not going to worry so much about transforming the data until it’s time to query it.’
Also, [with] this notion of data virtualization and bringing the compute to the data rather than the other way around: the implication has been that eliminates the need for pipelines. What do you think of that prospect? Is there a kernel of truth to that? Is there no truth to that? Is it just that things are more complex than we sometimes make them out to be in taglines?
Yeah, I’d say there’s certainly some truth to it. I would say the challenge with the ELT approach is frankly, you’re repeating yourself. If you’re performing the same operation on data over, and over, and over again, it becomes inefficient and it becomes expensive. The same core principles behind why ETL is really valuable is if you know you’re going to be doing something to a piece of data, the same kind of transformation, to get it into a shared, aligned system where you can do more complex things with that piece of data – for example, pipelines are a really good fit.
What we saw and the reason that we’ve seen a surge of the ELT approach is frankly, we have a lot bigger hammers than we used to. When you get bigger hammers, a lot more things start to look like nails. There’s use cases where we can take an ELT approach where previously we just simply couldn’t. It makes life easier for a little while.
What we find, however, is eventually those approaches do start to break down a little bit as either your data volumes or your data complexities increase. Then we see what we’ve seen with some of the most progressive data engineering teams today: they still have a very strong need to pull data from disparate systems back into unified systems, to do complex transformations on those data sets, and to actually store and persist those derived data sets so they’re optimized, curated, and available for those downstream teams to really iterate on. That’s where we figure there’s a fit, but I do believe that over time, I would say the industry will evolve not to just be ETL or even just ELT but a cascading approach to ETL, TL.
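The cost argument Sean makes for ELT (repeating the same transformation at every query) versus ETL (transforming once and persisting the derived data set) can be illustrated with a minimal Python sketch. This is only an illustration under simple assumptions, not how Ascend or any particular engine works; the names `transform`, `elt_query`, `etl_materialize`, `etl_query`, and the `RAW` rows are all hypothetical.

```python
# Counts how many times the (hypothetical) transformation runs.
calls = {"transform": 0}

# Hypothetical raw rows as they might land from a source system.
RAW = [{"user": "Ann", "clicks": "3"}, {"user": "BOB", "clicks": "5"}]


def transform(row):
    """Normalize one raw row; stands in for an expensive transformation."""
    calls["transform"] += 1
    return {"user": row["user"].lower(), "clicks": int(row["clicks"])}


def elt_query(raw):
    """ELT style: the raw data is stored as-is and transformed at query time."""
    return sum(transform(r)["clicks"] for r in raw)


def etl_materialize(raw):
    """ETL style: transform once and persist the derived data set."""
    return [transform(r) for r in raw]


def etl_query(table):
    """Query the persisted, already-transformed table."""
    return sum(r["clicks"] for r in table)
```

Running `elt_query(RAW)` three times invokes `transform` six times, while calling `etl_materialize(RAW)` once and then `etl_query` three times invokes it only twice. The gap grows with the number of queries, which is the repetition cost described above.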
I think maybe we need to acknowledge a nuance, which is that in the world of analytics, some things are done repeatedly and operationally, and some stuff is more exploratory. Maybe for more exploratory stuff, leaving things less structured in storage makes sense. Then doing the transformations at query time makes sense.
Once we really get into productionalized questions and analyses and necessary insights, that’s where we need a greater deal of engineering and structure, and that really was the rubric of data warehouses all along. Maybe that requirement hasn’t really gone away with that in mind. That’s more of a statement than a question, I guess. Do you agree?
It is a statement, and I do agree. I do think there’s overhead to construction of pipelines, so if you have an exploratory ad hoc use case, it really is important to be able to just explore data ad hoc. We do think that that is why there’s this really powerful combination of that desire to do ELT and how pipelines work together. We think they both fit complementary needs.
That makes sense. Now we’re starting to get to some nuance instead of extremes. I think that’s good. We know that there’s lots of folks out there – arguably, I used to be one of them – that felt a lot more in control and comfortable just writing their own code for something every single time. That may not scale so well, so we’ve had systems for some time and we have a lot of them in the market now that let us construct pipelines visually through some kind of schematic approach where things are declared, even where we’re not coding. Does that address the issue? Does that make things less brittle? Why isn’t that good enough?
That’s a great question. We certainly have seen these systems pop up that make it easier to architect and design pipelines. In fact, even Ascend has a visual interface that we know our users really enjoy using. I think the thing that makes this harder is it’s not just about the construction of the pipeline, but also the operation of the pipeline. What we’ve seen is: most of the tools in the industry to date help you describe declaratively in far less code the operations you want to make to your data.
When it comes to the actual operation of that pipeline itself, it’s still hard. You still have to answer questions like ‘what sets of data should I persist? How do I intelligently back-fill data? How do I retract previously calculated statements? Should I partition my data?’ All these things you had to worry about when you wrote a ton of code to go to pipeline, you still actually have to worry about even in these higher-level systems.
I’d say they’re a step in the right direction, but we believe that a fully declarative system, one that is less task-centric and far more data-centric and that understands the data much as a database engine or a data warehouse engine does, is really where things are going.
We’re big fans and believers in this notion of declarative systems similar to what we’ve seen in infrastructure technologies like Kubernetes and similar to what we’ve seen in these warehouses with their database engines. If you can build a system that really deeply understands how the pipelines work and the nature of the data and the dependency of transformations, you can offload a lot more of that complexity to the underlying engine itself.
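As a rough illustration of the declarative, data-centric idea, the sketch below declares data sets only in terms of their inputs and a transformation, and leaves the execution order to a small engine. It is a minimal sketch under stated assumptions, not Ascend’s implementation; `SPEC`, `build`, and the data set names are hypothetical, and the ordering uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical declarative spec: each data set names its inputs and a pure
# transformation. Nothing here says in what order to run things.
SPEC = {
    "top_key": {
        "inputs": ["clicks_by_key"],
        "fn": lambda totals: max(totals, key=totals.get),
    },
    "clicks_by_key": {
        "inputs": ["raw_clicks"],
        "fn": lambda rows: {
            k: sum(v for kk, v in rows if kk == k) for k, _ in rows
        },
    },
    "raw_clicks": {
        "inputs": [],
        "fn": lambda: [("a", 1), ("b", 2), ("a", 3)],
    },
}


def build(spec):
    """Materialize every data set in dependency order and return the results."""
    # Map each data set to the set of data sets it depends on.
    graph = {name: set(node["inputs"]) for name, node in spec.items()}
    results = {}
    # static_order() yields dependencies before the data sets that need them.
    for name in TopologicalSorter(graph).static_order():
        node = spec[name]
        results[name] = node["fn"](*(results[i] for i in node["inputs"]))
    return results


RESULTS = build(SPEC)
```

The spec lists `top_key` first even though it depends on everything else; the engine works out the order from the declared dependencies, which is the kind of complexity a declarative system takes off the author.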
You segued into my next question, perhaps unwittingly, but for a while after I worked on a number of these things, I started to see patterns. As I went from one project to the next, I was able to apply some rules of thumb or heuristics to get these things done a little faster. That was just me doing it, in effect, manually.
As an industry, it feels like we’ve been building these pipelines for a long time. Are there learnings that can be applied in an automated fashion such that almost like we have rules or an expert system or something that really understands the generalized prospect of creating pipelines such that maybe a lot of this stuff doesn’t have to be reinvented from database to database and project to project? First, let me just ask that about the initial authoring of a pipeline, and then I have a follow-up.
I think we, as an industry, are right on the cusp of doing so. It’s interesting because when we look at adjacent industries, very few folks are building a new database engine or a new data warehouse engine. It’s generally accepted that there’s some pretty standard technologies and winners in that space.
What’s interesting that we observed is – and even my team and previous companies have done the same – you always end up with a team grabbing some of the existing open-source technologies like Spark, for example, but then building, as we all have, these abstraction layers on top to try and better structure and automate a lot of these repeatable patterns; we’ve seen this time and time again. Even at Ascend as we engage with dozens and dozens of different companies, we see every company trying to abstract away these complexities to make it easier for the rest of their company to self-serve and create data pipelines.
What’s really interesting is: everybody is following these same patterns but all implementing them slightly different[ly]. They all tend to be very bespoke. To your question, I do believe the industry is right on that cusp of tipping into this exciting new era of ‘there are standard, well-informed ways of having a standardized intelligence system that knows how to essentially be the automated engine for pipelines, that does similar to what a database engine and a query planner and a query optimizer does for databases’ that we will soon start to see an emergence of these really intelligent layers for how pipelines themselves work that understand far more of the data as opposed to the tasks and as a result, offload huge amounts of the developer burden required to build and maintain pipelines.
That’s interesting, and I have a follow-up, but I also wanted to dial back to something you said earlier. My follow-up is: can this approach, in addition to being applied to create a pipeline, also deal with – that’s the stuff that always caused loss of sleep on projects I was on was sometimes some flat file that was supposed to get put somewhere didn’t get delivered, or gosh, schemas can change, either the schema in the source data or the schema perhaps in my warehouse or my OLAP cube or something like that. Again, I’m dating myself, but there it is.
Can we anticipate some of that and automate the adjustment of the pipeline? That’s my follow-up; then I have one more follow-up on the follow-up, but we’ll go one at a time.
The short answer to your question is ‘yes.’ I think the core key notion behind a lot of this is: ‘Do you have that intelligence layer that understands where your data’s coming from and where it’s going to, as well as how it’s being used?’ At Ascend, we call this our control plane and various other teams and companies may implement it and call it something slightly different.
The key concept behind this is: ‘Do you have a system that isn’t looking at data as it moves through just in context of one transformation, or one query, or one stage, or even a whole pipeline but instead to look at the entire ecosystem and do two things?’ One, can it detect changes and even potential breaks way far upstream before they even trickle down through systems? Then two, to your specific question, could [it] also catch a change that somebody may be wanting to make to that OLAP cube or to a data set and say if you do this, it could have this impact downstream because that system, that control plane, is monitoring the entire data ecosystem?
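The second capability, warning that a change to one data set could affect others downstream, amounts to a traversal of the lineage graph. The following is a minimal sketch of that idea only; the lineage table `DOWNSTREAM`, the data set names, and the function `impacted_by` are hypothetical and do not describe any real control plane’s API.

```python
from collections import deque

# Hypothetical lineage: data set -> data sets that read from it directly.
DOWNSTREAM = {
    "orders_raw": ["orders_clean"],
    "orders_clean": ["daily_revenue", "olap_cube"],
    "daily_revenue": ["exec_dashboard"],
    "olap_cube": [],
    "exec_dashboard": [],
}


def impacted_by(dataset, downstream=DOWNSTREAM):
    """Return every data set reachable downstream of `dataset`, breadth-first."""
    seen = {dataset}
    order = []
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:  # guard against visiting a data set twice
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```

Before someone changes the schema of `orders_clean`, a system holding this lineage could call `impacted_by("orders_clean")` and report that `daily_revenue`, `olap_cube`, and `exec_dashboard` may be affected.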
This is a whole new level of intelligence that we’re certainly capable of building. I think it tends to just be hard technology, which takes a lot of time. So did database engines, honestly, to create and then refine as well, but the benefits are astronomically high as a result.
You’re getting me excited because it’s almost like – gosh, now I have this idea in my head that we could almost have a trigger in the pipeline as opposed to a trigger in the database so that if something changes in the schema or something changes in a requirement, a source file, or source system that isn’t responding to a query or where a file’s not present, that we have a way of dealing with that. I don’t know if I’m extrapolating too far or if that’s somewhere in the right neighborhood, but that’s intriguing.
You mentioned before the whole notion of whether we have to partition the data. That’s been important for a long time, too, but it also just seems like in the world of data lakes and when we’re working with these columnar file formats like ORC and Parquet, there’s umpteen opportunities to partition the data just in arbitrary number of chunks or across certain dimensions or certain levels within the dimension, partitioning across year, partitioning across month, and then just saying “oh, partition it down into 200 pieces.”
Of course, the hope there is when you run a somewhat bounded query, you can skip over a lot of the data that’s not important. The little bit of that that I’ve done, I feel like I’m guessing a little bit. What can you do to really automate not just the action items for implementing partitions but for developing a partitioning strategy and helping me with that so that I don’t have to guess at it?
It’s a really good question. The general thesis and position here is nobody should ever really have to worry about partitioning. That should ultimately be the job of the pipeline engine, if you will. Just as when we have a database engine, it worries about how it’s laying out the data on disk and how indices are used to optimize how you pull it off of disk or out of memory. The very same thing should be happening with that pipeline engine.
When we think about where we can and should be evolving to as an industry, we have these really powerful technologies like our cloud BLOB stores and our processing engines like Spark that are really powerful, capable, and scalable, but they don’t maintain that same context around the data as it persists across the life cycle. You just don’t know the context of usage.
It really comes down to that pipeline engine when it has a broader context. For example, how Ascend’s control plane maintains context. It knows everything that you’re doing downstream on a data set and is smart enough to say you’re doing things like daily, hourly, or monthly roll-ups of data and analytic-style use case downstream from this data set. It can be really smart and partition that data dynamically for you so you don’t have to worry about it.
This is really, I think, just one of many categories where an intelligent control plane can help over time. As a user, I get to focus more on those individual use cases and that control plane can intelligently move that data around on the data lake and tune the pipelines for me based off of how I’m accessing and using the data downstream.
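One way to picture usage-driven partitioning is a heuristic that looks at the roll-ups observed downstream and partitions at the finest granularity any of them needs. This is an assumed heuristic for illustration, not a description of how Ascend’s control plane decides; `GRAINS`, `ORDER`, `choose_grain`, and `partition_key` are hypothetical names.

```python
from datetime import datetime

# Granularities ordered from coarsest to finest.
ORDER = ["monthly", "daily", "hourly"]

# strftime pattern used to derive the partition key for each granularity.
GRAINS = {"monthly": "%Y-%m", "daily": "%Y-%m-%d", "hourly": "%Y-%m-%dT%H"}


def choose_grain(observed_rollups):
    """Pick the finest granularity any observed downstream roll-up uses.

    Assumed heuristic: coarser roll-ups can still read several finer
    partitions, so the finest observed need decides the layout.
    """
    return max(observed_rollups, key=ORDER.index)


def partition_key(ts, grain):
    """Return the partition key a record with timestamp `ts` falls into."""
    return ts.strftime(GRAINS[grain])
```

For example, if the observed downstream roll-ups are `["monthly", "daily", "daily"]`, `choose_grain` returns `"daily"`, and a record stamped `datetime(2019, 7, 4, 15, 30)` lands in partition `"2019-07-04"`. A real engine would also weigh data volume and file sizes, which this sketch ignores.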
I was going to say maybe there’s even a component that can observe the kinds of queries that are getting run so that we can understand the most advantageous partitioning approach. Sounds like you answered that question before I even asked it, so that works well.
Also, the notion you brought up before of how ELT, while it has its advantages, it also can put us in a place where we’re just repeating a lot of the same transformations because we didn’t do it upfront. Actually, at a higher level, at a meta level, it seems like – it just occurred to me while you were talking – that’s kind of what we’re doing with partitioning.
It’s not that we’re partitioning the same data over and over again but as we go from data domain to data domain, we’re thinking through those same questions and applying the same stuff somewhat manually over and over again. Why not generalize that? Why not factor that out into a platform that’s dedicated to that?
We whole-heartedly agree and we do believe that as these intelligent control planes continue to evolve, we’ll even start to see this really exciting hybrid world of warehouses and pipelines where we should be able to get a far more seamless fabric where even if I am trying to do my ad hoc exploration, if there’s a control plane that is observing and monitoring this, that control plane can and should even be able to take what was more of an ELT approach and dynamically construct pipelines, move them to an ETL-optimized system and simply help optimize downstream. For me as a user, I shouldn’t have to worry about that. That intelligent control plane really can help hybridize this model and give me as a user a higher level of benefit from it.
To use your term, which was bespoke, there’s a place for bespoke effort, but at some point there’s a threshold where the bespoke effort is getting replicated frequently enough that it really becomes more of something that should be engineered and not hand-crafted. Maybe just mapping out where that threshold is has been a little bit difficult.
It sounds like what you’re saying is if we can identify it, then we can say “We’ll take the baton from here and handle the drudge work of making this really bulletproof and managed and with the exception management in there and just making it work in production.”
What we’ve seen over time – and this is really the history of technology – has been that the things that a few years ago really differentiated your company from a technological standpoint become table stakes and commonplace as the industry evolves. Five, ten years ago, it may’ve been a strategic advantage for your company to be able to run a bigger Hadoop cluster than everybody else. As that’s become commoditized, it’s evolved. We saw it move from how big your Hadoop or HDFS cluster can be, to how tuned you can make your Spark pipelines, to what we’re seeing now: how well you construct, tune, and optimize your pipelines. The code that you write to marshal those underlying resources is also now being commoditized and should be heavily automated, as it’s not core to your business.
What’s really core to your business is how you apply your data to your business logic and your business understanding. It’s the outcome and the output of your pipelines is what differentiates. We’ll continue to see this from technologies that really help people focus on the differentiated part of their business.
Just so we can end on a high note, I think some people who are really sophisticated engineers and developers in this sphere hear about automating a bunch of the work that maybe was bespoke beforehand. As we automate, does that mean we’re sun-setting the need for the engineers, or does it mean that maybe what’s become ‘old hat’ for them can be standardized and then they can move on to even higher level stuff where their talents can really be leveraged?
Oh, yeah, I whole-heartedly agree with the latter. I don’t think there’s many engineers that like getting paged at 3 in the morning because some JVM process ran out of memory and you had to go tune the cluster or repartition the database. I’d say what we’ve seen, especially with data engineering, is that we’ve been solving a lot of really painful problems and just grinding through the muck for years.
I think this lets us really start to free up from the data movement part of the problem and open up far more interesting and exciting parts: how do we apply these large data volumes and incredible insights to far more automated and intelligent layers that really fuel great and exciting new products for the business? Hopefully we see a surge of innovation over the next few years as frankly we’re taking some of the best and brightest and freeing them up to go tackle bigger and more interesting challenges.
We never get to the point where engineers’ skills are not in demand. Just a question of how high up the value chain we want them to apply their skills. With that we’ve actually come to the end of our half-hour conversation. I know we’re also going to have a webinar where we can focus even more. I will look forward to that, and Sean, I will thank you very much for your time today. This has been a great discussion.
Fantastic. Thanks so much, Andrew.
For GigaOm, this is Andrew thanking Sean and thanking all of you who listened in. We bid you good day, good evening, good night, depending on your time zone and so forth. Thanks very much.