Voices in DevOps – Episode 6: A Conversation with Andrew Phillips of XebiaLabs


In this episode of Voices in DevOps, host Jon Collins speaks with Andrew Phillips of XebiaLabs about DevOps processes and pitfalls throughout the tech industry.


Andrew Phillips is a frequent contributor to the Continuous Delivery and DevOps space. He's been a software engineer, team lead, infrastructure builder (a.k.a. head of duct tape) and community evangelist and now works on product management and strategy. Andrew contributes to a number of open-source projects, is a regular speaker and author and co-organizer of ContainerDays Boston and NYC.


Jon Collins: Hello and welcome to this episode of Voices in DevOps, where I'm going to be speaking to Andrew Phillips. We've had a bit of a discussion already about how to describe you Andrew, because sure you work for Google but at the same time, you've got lots of background all around DevOps and all around open source and all around all these topics for various people. And that's really what I want to pick your brain about. But how would you describe yourself and how did you get to the point where you would describe yourself like that?

Andrew Phillips: Thanks very much for that, Jon. Hi everyone! So my name is Andrew Phillips and I now work for Google Cloud. I guess the jokey statement that I usually make is that I kind of suffered around all sides of the software development and DevOps space. I've been a developer for most of my career. I was then a sales engineer frontline, ended up being the head of PowerPoint, a.k.a. VP of strategy for a company that commercially sold application release automation software into the enterprise. So seeing a lot of that, and obviously that discussion ends up being not just a product discussion but very much a transformational like ‘how do we actually make this stuff work?’ discussion; then went and did a stint as the head of duct tape, a.k.a. head of infrastructure for an AI company which was all very lovely hands on cloud...

I love these descriptions.

They're trying to accelerate teams, and yeah now at Google with a very interesting dual perspective of... obviously what customers trying to adopt the cloud are doing and how we can help them with that, but also of course some of the things that Google does internally, and what one can take from that and learn and distill and maybe take it out into the market in terms of some of the practices Google has. I know lots of people in the space. For what it's worth, I organized for my sins a couple of events called ‘ContainerDays’, about which all I can say is that if you ever tried to organize a community event, it's much more work than you think it is, and it is gratifying if you're into that kind of stuff, but it's a huge amount of work.

And I mean I guess as a final wrap up maybe is you know I've seen and then I have opinions one way or the other on the product versus process versus culture triangle which is obviously not a new triangle. It applies to DevOps as much as it applies to Agile, as much as it applies to any one of the many initiatives that we try to go through. I think if you like, 'the nugget' that we'll hopefully explore a little bit in this conversation is that it's easier to take a kind of dogmatic stance and the real challenge is: how do we find a good balance between what we would ideally like to achieve in any company, which is like magically transform you to be super efficient and great and whatever, given the constraints of the industry that we live in, which is that there are only so many experts to go around and you know there are only so many companies that can hire hundreds of new people and pay them large salaries to just sit and fix problems; and then there are just very many different ways in the industry of tackling IT and how it's addressed; and that you have to take that into account as you try to figure out what things like DevOps mean. I don't mean to cut to the chase too much. I hope that we can explore those in more detail.

At the risk of starting a massive digression, I'm always fascinated by the skills thing because on the one hand, we're told OTTs are more and more automated and we're all going to be out of jobs, and you know [whether] blue collar/white collar, it doesn't matter. Automation is the name of the game.

On the other hand, we've got this sort of global skills shortage of programmers, of data scientists, of anyone that can get their heads around all of this stuff that apparently is automating everything. I don't think it's a paradox, but certainly it's two levers playing against each other.

Yeah, and again yes, at the risk of a digression, but I think a fascinating one. I mean I think there's a... I don't know, there's the small version and the large version of this and that sounds very cryptic. What I mean is that if we take a very DevOps-specific example and if you look at like DORA's State of DevOps research, and it's fun I get to work with Jez [Humble] and Nicole [Forsgren] quite closely now because obviously DORA now being part of Google. But I've known both of them for a while.

I think there's this elite category and we know elite has lots of good outcomes and lots of good options and so on. But what of course is the realistic question is: does everybody need to get to elite? In an ideal world, yes, but you know how well can you do, how much bang for your buck, if you like, if you only get halfway and what is the realistic tradeoff there?

I mean the sort of continuous delivery or the DevOps version of this is that in an ideal world, every commit would ship to production flawlessly in five seconds. But I think that's the wrong goal to have. I mean my phrase is something like ‘you shouldn't ship as fast as you can, but as fast as you need.’ And that raises a much more interesting question, which is: how often do we actually need to? And because that really depends on your industry, on your competitors and all these kind of things and I think the same goes for this DevOps or digital transformation as Gartner would call it. Sorry to name drop one of your co-...

...some of my best friends...

There you go. And Gigaom, I'm sure has their own phrase for this particular transformation. I think at one level that is this idea that every company must become an IT company, and as you said, taken to the extreme that would mean that we would presumably need 10 times the amount of skilled IT professionals that we have today in the world. And it's clear that the pipeline isn't growing by 10x to produce those. So at some point, there's either a choice between that for any company, for any CEO: do you want to be in that rat race? Do you want to pay ever increasing salaries, do you want to do whatever it takes to get those people? Or do you make some kind of determination that you say “Well maybe I'm not going to get there. I'm not going to have the fancy innovation lab with a cappuccino machine or scooters and the Segways or whatever (obviously I'm stereotyping a little bit), I will thus lose some potential as a result.” But is that an acceptable tradeoff?

I think there will just have to be some kind of spectrum of you know IT whiz kiddery/excellence whatever we want to call it. And I think only exploring the extreme end of the spectrum and saying “well that's where you could go…” It's not a really helpful guidance for the market because you're basically saying everybody has to fight over what is, at the end of the day, an incredibly small morsel.

And I think if you look at the path or if you look at it more as a sort of spectrum there are definitely points on the spectrum that are not quite in the elite wizardry space where simple adoption of a bunch of relatively well-known processes and tools that you can somewhat dogmatically apply can get you a lot of the way. It won't get you to the top end but I mean I think it is giving people more choices about making smart tradeoffs between what they can achieve and what they need to get out of this transformation and allowing them to say: “well maybe we only need to get to level 3,” you know, whatever that may mean.

The phrase just popped into my head 'chief tradeoff officer' which is horrible.

Something like that, yes. And I think that is a big missing step, but we have a reasonably good understanding and description of (I wouldn't call it Nirvana necessarily, but) really sophisticated practices. We have some good data now, relatively solid, to indicate that that's really worthwhile if you can get there. But I think we don't have a good ‘trade off matrix’ or a cost/benefit function of various points along the way.

So this does make me think actually because before we hit the record button on this podcast, I said I was talking about those cloud native people who've got it all sussed. Enterprises can struggle though they've got some pockets of excellence, then as you were speaking I'm thinking, ‘Am I just being completely prejudiced? Am I being too kind of categorizing in terms of those cloud native... are these trade offs that we're talking about here, as much of a challenge for people to get their heads around in a world where people are doing DevOps as a standard thing?’ Or are the challenges the same wherever you go? Or is this more of an enterprise ‘end of the scale’ kind of thing where they need to understand trade offs?

Well I mean I think part of the challenge with some of these questions is that ‘doing DevOps as a standard thing’, of course, opens up the enormous can of worms of what DevOps actually means. But I think to the extent that there is a difference between enterprises and maybe smaller companies, it's largely a factor of the volume and existence and complexity of the systems that you already have.

I mean I think one very obvious unsurprising data point is that a lot of these practices are much easier to adopt in a greenfield scenario because there's still lots of challenges. I mean I guess the biggest and most standard one (and this is again not saying anything new to the DevOps community), there's been tons and tons of books written and seminars held and the DevOps enterprise forum does a good job of covering this topic around like security and audit, you know how at the end of the day and in a lot of enterprises there is some level of... Five years ago we went through the audit process and we convinced someone that this particular checklist of stuff was acceptable and that our current setup met this checklist, and the last thing we want to do is to move our architecture into the cloud where we would have to go through the same exercise again.

And that's just one of those sort of classic... almost an externality... the things that sort of stop making those barriers in the way of adopting some of the new stuff. I think yes, in a greenfield situation like if you've got a new app that you're building and a bunch of relatively sufficiently skilled engineers, you can totally make that work. And yes you still have some security stuff to go through and so on and so forth, but you can adopt all these nice practices, you can adopt a bunch of modern tooling and get lots of nice benefits out of that.

The reality of course is that new or greenfield work in any large enterprise represents usually but a tiny fraction of the system's landscape. They might represent where you're innovating if you like, the frontier of your application portfolio. But you know most companies don't know... or what I don't hear people saying is like, and other analysts have talked about multimodal, bi-modal kinds of things, but the reality is saying “OK we want to scope our ambitions down to a particular subset of our work and try to calculate an ROI based on that,” as opposed to imagining this world in which you've kind of moved everything that you have into the cloud except for the mainframe pretty much. Those are two totally different things. And the idea of saying, “Well we also want to do DevOps for an ERP system that is some huge monolith of manuals and instructions and we've got bunches of consultancies into that data” and so on, that's like a totally different kettle of fish from: “Well we know that we have a portfolio of new work that we need to do, coming up in the next whatever six months or a year or whatever, and we want to optimize that bit.”

I don't want to interrupt you but, at the same time, I have a question in there, which is linking back to what you were saying before about trade offs, because one of the things that I hear quite often (and I'm sure you hear quite often) is the notion of culture change. If you're going to want to do any of this, sooner or later the conversation turns to: “and of course you've got to do the culture change” and it kind of implies that we could insert anything into this conversation... like eating better.

Let's say you know you're gonna have your 5-day [diet]. You just have fun with it. But sooner or later you're going to have to have a culture change. And what you're saying, what you're suggesting is that when you were talking about tradeoffs, that sure, there is a need for organizations to get their heads around what's going on here in terms of agility from the top level, etc.

We can have that conversation, but equally it needs the tools, the approaches, the methodologies. The DevOps-y stuff also needs to flex to fit the needs of the organization. It isn't a kind of monolithic DevOps that the organization has to change in order to accommodate. I think it needs to be flexible.

Yes that's true, and I mean I think this goes back before we started the recording, we were talking about our respective backgrounds and things like ITIL and Agile. I think those are two interesting examples because you know ITIL [is] very prescriptive, like there's a book and you read it and you follow it and you implement it. And then you typically buy some very expensive tool kit from some vendor that has all the way too complicated, several million forms and so on.

And then you think you've done ITIL because you installed the tool kit and forced everyone to use it and then there's Agile or DevOps for that matter, which in essence are basically a set of guidelines for making tradeoffs. That's not the common phrasing but that's really what they are: they are intentionally much less prescriptive. You can become an Agile certified scrum master, but you know you can't or... what ended up happening where people said Agile is downloading an issue tracker and standing up at 9:00 in the morning and having a meeting with post it notes and then making the post it note providers rich, that's not actually Agile. That's Agile in Name-Only or whatever you want to call it. It's very easy. It's very fun. It's kind of a bit childish to be skeptical about it and go “oh well they don't get it…” and so on and so forth.

The reality is that embedding so deeply in a philosophy and then having both the skill but also the freedom of action like the authority to both adapt the philosophy to your organization, and where necessary, to adapt your organization to the philosophy, that's a really complicated and hard thing; and that requires somebody at the top to basically give you the flexibility to do that.

Most of the time what can be done pretty easily in organizations is approving the purchase of particular tools and instituting relatively detailed, but not hugely transformational processes and eking as much benefit out of that as you can possibly get. And yes, if the only thing you do out of Agile is you try to think in more short term iterations and you don't do much else with the stuff like this or that wrangle between the business and engineering and making trade offs every now and then, it's like you still get some benefit.

And I think that the same kind of goes with DevOps like this. There's a number of checkbox items that you can do for DevOps like: think about better automation for your infrastructure, basically recognize that the big shift in the industry has been that you can now do all the platform work as code: it's no longer rack and stack, it's actually software development. Just software at a different level, and all the practices that we've learned about software engineering come with that. And because everything is now software, it makes more sense as well to assume that the people who write the software that are a little bit higher up in the stack, have a little bit more understanding and insight of the software that's a little bit lower down in the stack et cetera et cetera et cetera.

So there's lots of things that you can do there that are relatively checkbox-y that don't require you to like radically change for instance the way your reporting structure works, whether developers and operations people ultimately end up reporting to the same person, whether they're measured by the same metric, all the things that are probably necessary to get to a more advanced stage, because at the end of the day, fundamentally what DevOps is about is recognizing that it's not about safety and speed, choose two, there is an essential tension between those two, which is understandable. And your job as a company is to try to figure out where the balance should lie. And at the end of the day how does that work?

Well typically it does not work if you take two different groups of people, make each one responsible for one or the other, and go even further and incentivize them with a performance system or a bonus system or whatever, for one team to look after one and for the other to look after the other, but to get to a point where you are able to understand that there is this tradeoff between how to be quick and scrappy and how to be safe and reliable. That is an organizational decision that you need to make and somebody needs to say “OK well at some point I'm gonna join these two organizations up and I'm going to make them report to the same person and that person's job is to make the tradeoff.”

Oftentimes you'll go into organizations and the join point is the CEO, who can't make that call because it's way too technical at the end of the day. And so there are some practices... you know Google does error budgets, which I think is a very nice low level technical way of doing this which is basically about saying “We can agree upfront that we want to hit this kind of metric for reliability, availability, latency whatever we care about.” And if you exceed that as the development team, you can do whatever you like, but if you exceed this limit, then you stop working on new features and you start fixing stuff so that it's safe.
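The error budget idea described here can be sketched in a few lines of Python. This is a minimal illustration, not Google's implementation: the class name `ErrorBudget`, its fields, and the 99.9% target are all assumptions made for the example. The point is only that the reliability target is agreed upfront, and the decision to keep shipping features or to stop and fix things falls out of a simple comparison.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for one service over one measurement window."""

    slo_target: float     # agreed upfront, e.g. 0.999 = 99.9% of requests succeed
    total_requests: int   # requests served in the window
    failed_requests: int  # requests that failed in the same window

    @property
    def allowed_failures(self) -> float:
        # The budget: the share of requests the agreed target permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        # Positive while the team is inside the budget, negative once it is spent.
        return self.allowed_failures - self.failed_requests

    def feature_work_allowed(self) -> bool:
        # Inside the budget: ship what you like. Budget spent: fix reliability first.
        return self.remaining >= 0


# Illustrative numbers only (assumed, not from the conversation).
healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
overspent = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)

print(healthy.feature_work_allowed())
print(overspent.feature_work_allowed())
```

The design choice worth noting is that the tradeoff between speed and safety is made once, when the target is agreed, rather than argued case by case between two teams with opposing incentives.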

But I do recognize that we've probably got five minutes left. And one of the things I wanted to cover with you (and we can do another one of these) was the role of open source within all this. So from the point of view of freeing up resources, so everything we talked about making a balance [between] safety, speed et cetera et cetera and the things in name-only versus actually adopting the principle.

One of the things that has given people a real lead is the ability to just pick up and run with open source based platforms, and so they haven't felt they've got the legacy because they've just been using stuff ‘off the web’ if you like, but maybe I'm overstating [it]. How important do you think open source has been as a kind of catalyst for more flexible practices? Just because people haven't had to worry about waiting for procurements to deliver tools that were already out of date? Or am I overstating this?

Well, so I think open source has been a huge contributor but also a correlated factor. Like it's been a huge contributor in the sense that just as with Agile this has largely been, well it's taken a while for this to become a top down decision, and arguably you could say that it would never become a top down decision if there hadn't been a decent amount of companies that could demonstrate how successful it was what they were doing.

And if you look at very, very many of the original success cases, this was not some enlightened CEO or CTO turning around saying “We're going to do it this way.” This was individual teams playing around with stuff, solving problems for their particular environment with their particular niche, discovering that worked really well, and then that spreading within an organization. That kind of stuff typically doesn't work or it's much much easier to do if you can try things out, if you can learn from others by downloading and using the same tools; if you don't have to go through complicated procurement processes or build complicated ROI use cases or those kind of things.

So I think open source was instrumental both in terms of allowing teams to bootstrap and the flip side, allowing them to contribute their learnings back into the community in a sort of snowball procedure, because what happens of course is like you try this stuff with the first set of tools, and then you discover that they don't work really well and if you're really lucky, you can sneak off some time and you can build another tool that either glues things together better or that's just a better version of the tool that already exists. And then that goes back into the community to feed more improvement et cetera et cetera.

I think now we've obviously got to the point where there's a recognition that we will just do this and now in some sense open source is a bit of a blessing and a curse at the same time because there's so many tools out there that companies are now faced with a sort of the paradox of choice: it’s difficult to figure out which the right ones are.

I think the other thing about DevOps and maybe Agile also is that they were not vendor driven initiatives. Right? Typically one of the things that the commercial software does... (let's add one element here: there are still some pockets even in a very DevOpsy shop where commercial software is very common), I would say APM is one of those examples. Alerting, paging, chat messaging, Slack: those kinds of tools, like there's still lots of space for services but they're typically SaaS services, elastic ‘pay as you go’ cloudy type stuff. So I wouldn't necessarily say that DevOps works with OSS only. At the end of the day DevOps works with all kinds of tools; DevOps works well with point tools, like HashiCorp is a great example of that.

And what we're seeing now of course with the enterprise now it's becoming more of a ‘oh give me some DevOps’ the sort of top down mandate style adoption of DevOps, there is definitely a space opening in the market for vendors who say “Well I'll give you DevOps in a box.” No matter that that's actually a contradiction in terms in some sense, it's good enough for a lot of people because it gives you some basic good practices that you can adopt pretty easily without having to think too much about the philosophy. As we said earlier, that's not going to get you to an elite level (no way). But it’s going to get you somewhere from where you are now.

That ‘good enough’-ness. I'm feeding this back in because from the top down, as you know, the business consulting world talks about ‘fail fast’ as kind of almost the imposition of Agile on people that don't feel comfortable with it, so they've had to come up with ways of talking about it that make it sound sexy.

But what you're talking about with open source, and the philosophy of adopting open source is that kind of try it out, learn, see if it's the right thing. And it's not that people are doing it because they're told to, it's because that's what you do when you've got a smorgasbord of options out there. You just have a go with a few things and then you find one that works. And so you end up in this state of mind where you’re testing stuff out and striking those balances as you say, and understanding the tradeoffs. You're doing it from the ‘get go’ because well you're just in the middle of a big sandpit full of toys. So that's how it is, how you're going to end up.

Yeah exactly. And I mean there's a failure mode here too, right? So now if we inject a little bit of Google lessons. There is a... I like to call it “the local optimum.” So OK. I mean this unfortunately is a topic that probably deserves another entire discussion, but I want to maybe just inject this because I think it's very important.

Having too much diversity in your tooling landscape (this also applies to your process, but to your tooling landscape as well) is effectively a local optimum in which it is easy to get stuck. It's a very, very common pattern: we see lots of enterprises trying to get out of it, and it's very unclear how to get out. You start with a situation in which you're stuck, whether it's because it takes eight months to provision a VM or whether it's because you don't have the right people or whatever, and you can kind of unblock this by either getting in some consultants or by giving development teams access to the cloud or whatever; and you see an immediate benefit because you know they've got access to a bunch of stuff they didn't have before. They're starting to use tools that are just faster in terms of turnaround time, virtualization, they use so many practices, but they're not so heavyweight etc. etc.

And those development teams will typically optimize for whatever their particular problem is: they'll choose a stack that works for their app or they'll choose a process that works for the audit that they're in or whatever, and that starts out very successful and that sort of spreads like a little bit of a wildfire, a ‘a thousand flowers bloom’ kind of thing, and then you end up in a situation where lots and lots of teams have created local optima for themselves and you're now in a situation where things are pretty good, but you know you still have 800 different ways of deploying your applications, which means that they're all failing in slightly different ways and you have 150 or 200 instances of Jenkins that are all configured differently etc. etc.

What happens at the companies that go beyond there, the vast majority of them then try to standardize and generalize and they try to look at what they've created. They try to come up with a sort of golden path that should work for 80% or 90% of their use cases. And so they tried to create this sort of, I wouldn't say centralized, but standardized approach that tries to optimize for what they've learned in their company. But of course it's very hard to persuade people to get onto it because I've got something that works for me now and while I might theoretically accept that what you're giving me is better, it requires work from me to migrate, which, why would I do it, because I've got something already? And more than likely there's one or two things that I've kind of grown to love, that don't work in the standardized spec because after all, it's a generalization.

And so you're stuck. Nobody wants to come down off their little local optimum hump and invest the time and effort to migrate. And the problem then is that if you look at things like what Google or Goldman or Twitter or a bunch of these other elite companies do, they can afford to invest serious engineering resources in, for instance, improving test results, I mean the time it takes to get return test results to a developer, nice metrics for your efficiency or your speed of feedback. If you can improve that by 5%, for any given development team 5% probably isn't worth the effort. If you can improve 5% across the board and you've got thousands of developers, that's a huge number. But you can only improve that 5% across the board if people run all their tests in relatively the same way.
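The arithmetic behind this point can be made concrete with a small sketch. Every number below is a hypothetical assumption chosen for illustration (team size, organization size, and hours spent waiting on test results are not from the conversation), as is the function name `weekly_hours_saved`; only the 5% figure comes from the discussion.

```python
def weekly_hours_saved(developers: int, hours_waiting_per_dev: float,
                       improvement: float) -> float:
    """Hours per week recovered if test feedback gets faster by `improvement`.

    `hours_waiting_per_dev` is the assumed weekly time each developer spends
    waiting on test results; `improvement` is a fraction such as 0.05 for 5%.
    """
    return developers * hours_waiting_per_dev * improvement


# Hypothetical inputs: a single 10-person team versus 5,000 developers who all
# run their tests in the same standardized way.
single_team = weekly_hours_saved(developers=10, hours_waiting_per_dev=4.0, improvement=0.05)
whole_org = weekly_hours_saved(developers=5_000, hours_waiting_per_dev=4.0, improvement=0.05)

print(f"One team: about {single_team:.0f} hours per week")
print(f"Whole organization: about {whole_org:.0f} hours per week")
```

Under these assumed inputs the same 5% is a rounding error for one team and a substantial saving across the organization, which is why the investment only pays off once the way tests are run is standardized enough for one improvement to reach everyone.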

And so coming up with a standardized way of setting up your developer environment, a standardized set of languages to work, a standardized set of conventions about how you write code, how you operate it, where your logs go, what your infrastructure looks like, what your runtime blah blah blah blah, all the things, having a small subset of those gives you huge opportunity to really ramp up on the automation and you can go as far...

Something Google does for instance, like, you can hire data scientists to build ML models to automatically suggest code improvements. That works if your code is sufficiently standardized: the money you’re investing in a team to build the improvement model actually pays for itself. And so I think one thing we have to think about as an industry, is how to get off that local optimum hump, and going back to what we were saying earlier, that ‘good enough’ might be fine, maybe the local optimum hump is indeed good enough.

So I think not everybody has to try to get off it, but I think where we can get to as a kind of maturity step, as I say, is that we can recognize that this is the standard path to eliteness if you like, and that it's a legitimate decision for everyone to make. We can go with the local optimum stage (pretty clear how to do that; it's not easy, but there's a well trodden path to get there), and then we can ask ourselves, “Where do we want to go to the next step?” and we can have a little bit of a better picture of what the investment is that's required to get there, and whether we want to be, and can afford to be, the kind of company that's willing to do that.

It's interesting, I mean I'm hesitant to use the term ‘Pareto principle’ but I'm going to anyway, but what you're talking about as well is this situation where people are in a good place, but it's becoming more and more fragmented as it grows from that good place. And it just feels harder and harder to stay in that good place. So it's like they were already starting with the 20%, but it’s the law of diminishing returns, rather than the Pareto principle in that case.

I feel as an industry, the kind of standardization across the board, it's like tantalizingly ‘over there,’ it's something we've got to agree, I say “we,” not me specifically, but we as an industry have to agree to in broader terms, so that we can start to lock down some of the bits that are seriously standard across everyone in the finance industry doing fraud, or everyone trying to add AI to the testing process, or whatever specific best practice or area we're looking at. It's over there. It's not quite here yet.

No, I think that is maybe a stretch goal. I mean I think our first worthwhile stepping stone is to say, within the context of my organization, maybe even my department, or whatever the social control is, standardization pays off within it. Obviously, if we did actually have it where every company has the same standard, that would be even more ideal because then it starts to become worthwhile for people like vendors to come in and try to solve some of these problems for you, because there's enough scope for them to sell into multiple companies.

But even if every company ends up choosing a different, relatively standard programming language or subset of programming languages, or subset of testing tools or whatever it happens to be, even within the scope of the organization that makes a lot of sense. And yes, no I mean at the end of the day, we will never come up with the one programming language that is perfect for every use case, and thus we are unlikely to come up with the software development life cycle that is perfect for every company. But each company for themselves, can at least ask themselves, “What are we going to do here?” And then it’s not as though people don't get this...

Every single large enterprise I've spoken to has some team somewhere called “Developer Excellence” or “IT Architecture” or “DevOps Center of Competence” or whatever they're called, who are trying to figure out what this standard stack would look like, but what is often challenging and missing, is the business plan for how to get teams onto that stack, because it's all very well to have the stack, but then you have to do this complicated carrot and stick game, where there's usually a lot of carrots and very little stick because you're afraid to use the stick, because developers might leave, and the last thing you want is for this very precious resource to leave. And how big is the carrot? Well it kind of depends. Of course for new teams, the carrot is really big because otherwise they'd have to do it themselves, but for existing teams that have stuff, the carrot doesn't look all that interesting, and so it's like coming up with a more standardized template for this particular path, like find a local optimum, build a standard, and then here's how you actually get that standard adopted within your company and these are the types of carrots you can use; and these are the types of sticks you can use; and you have to mentally sign up for that kind of...

I feel every company goes through that sort of somewhat agonizing, soul searching process, over and over again, and they end up kind of rediscovering... and we could make that meta pattern a bit more of a standard procedure.

And I think that's a really good point to finish on if I may: this notion of a business case for standardization. There's so much in there that we could unpack. Probably a better way to unpack it is literally to play this whole podcast in reverse, but if you start from there, I think what you're saying is: what's the business value of standardizing? Then a lot of things kind of come out of that question, which really deserve an answer.

Yes I think that is absolutely true, and I will resist the temptation to dive in a little bit and talk about the metrics you might need. That's for another time.

Okay, thank you very much, but that's certainly for another time. I will certainly pick up on that point. Thank you so much Andrew for speaking to me, I look forward to that.

Absolutely, thanks a lot Jon.
