Input sound file:
1005.Day 2 Batch 4
Session Name: Dark Architecture And How To Forklift Upgrade Your Infrastructure With Zero Downtime
All right, we’re still wrapping up, thanks for hanging in there. We’re almost done, but we still have lots of cool stuff on the program. We now have Dave Connors from Dyn, who’s going to talk about global traffic optimization, based on DNS, and it promises to be very exciting, so let’s welcome Dave to the stage.
Dave Connors 00:31
Great, thanks. My name’s Dave Connors, I’m the VP of Tech Ops at Dyn. We do managed DNS, advanced traffic management and transactional email for our customers. We’re based in Manchester, New Hampshire, and I’m here substituting for our CTO, Cory von Wallenstein – but at Dyn we like to say we always have a plan B, and that’s me today.
Dave Connors 01:03
I’m here to talk about dark architecture, and this is an approach to help manage risk in upgrading your infrastructure, and the risk from upgrading the infrastructure is around a few areas. One is the difficulty in breaking it down into small pieces of work. The second thing is the risk in deploying all that change at once, and dark architecture, the intent is to deliver business value more readily.
Dave Connors 01:34
The classic situation starts with a legacy system. The system’s been around for several years, the people who developed it may have moved on, it’s poorly instrumented. We can safely consider it a black box. We do know its inputs and outputs however, and we do know we need to improve both the input and output – often for reasons of scale or capacity: we’ve exceeded our capacity, so we’ve got to refactor into a new system that hopefully provides an order of magnitude more scale capability. It could be performance – this system could be a subsystem in a larger system that is now the bottleneck, it’s holding back the overall system and the end performance, so we need to upgrade it. Or it could be that this system is just too tightly coupled. It’s a legacy system, hardwired, hard to predict, which makes it difficult to add new functions, features, applications. You make a small change here, there’s an unintended consequence there. There’s a lot of business inertia there.
Dave Connors 02:43
Dark architecture really takes aim at two problems. The first problem is breaking down the work into small pieces. If we know agile software development, continuous delivery, best practices, minimum viable product, the best way to do pieces of work of any risk is to break it down into small pieces and iterate through those small pieces to get to the end. In a legacy, black box system that’s inherently difficult. The second is the deployment risk. There’s been some evolution in deployments: there was the blue-green deployment, or A-B deployment, where I have my legacy system A, I develop a new system B, put it out in production, throw a giant switch over – if it works, great, if not I can immediately roll back to system A, the legacy system. There’s canary deployments. Canary deployments are where I have system A, but I have a new system B – I put it out in a very small piece of the infrastructure, say a single server. I put it on the new server, I only allow internal people to see it – mitigate risk there. They say, “Okay, good,” I open it to the public. If that works I put it on a handful of servers, if it works there I spread it across all my servers. And that does mitigate the risk of deployment, but what it doesn’t do is give you that validation of accepting and processing real, live production traffic, and we’ll talk a little bit about how we can do that.
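A minimal sketch of the canary progression described above, assuming a hypothetical health check supplied by the caller. The stage names paraphrase the talk; CANARY_STAGES and run_canary are illustrative names, not part of any real deployment tool.

```python
from typing import Callable, List

# Stages paraphrased from the talk; the exact wording is illustrative.
CANARY_STAGES: List[str] = [
    "one server, internal users only",
    "one server, public",
    "a handful of servers",
    "all servers",
]


def run_canary(healthy: Callable[[str], bool]) -> List[str]:
    """Walk the stages in order and stop at the first unhealthy one.

    Returns the list of stages that passed. `healthy` is a hypothetical
    callback standing in for whatever checks a team actually runs.
    """
    passed: List[str] = []
    for stage in CANARY_STAGES:
        if not healthy(stage):
            break  # stop widening the rollout; system A keeps serving the rest
        passed.append(stage)
    return passed


full = run_canary(lambda stage: True)
partial = run_canary(lambda stage: stage != "a handful of servers")
```

Here full covers all four stages, while partial stops after the single public server, which is the point at which a real rollout would be halted or rolled back.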
Dave Connors 04:16
At a high level, the legacy approach is, I build the giant system, I can’t break it into small pieces, it’s a three, six, nine, 12 month project, and all those changes add up, and at the end I have a giant bundle of change I have to deploy. In all that time, I’m not delivering anything of business value to the customer. At the end of that though, I’m in a flag day, throw the switch, all hands on deck firefight – very high business risk situation.
Dave Connors 04:51
Dark architecture tackles this by looking at the system in terms of flows. The black box sits in the middle, I have input and output – that’s traffic flow, but it’s also functional streams of flow. I like to use an example here, let’s take a specific one. We have an email system, an email tracking database, call it a MySQL server, and we want to upgrade that to a Cassandra NoSQL server. If I look at it, the system components are pretty large single entities – the MySQL and Cassandra databases. But if I look at the flow, if I look at the functional segments and streams, I can break that down, I can say, “Well, there’s emails sent, there’s opens, there’s click-through processing, there’s bounces logged, there’s bounces processed.” It’s a series of pieces of functionality, and that is the means by which I can break this down into small pieces of work. The second thing is the deployment risk. I can manage flows and think of the black box – now I have two black boxes, I can have the legacy black box out in production, put the new system out and manage flow across the platform: two inputs, two outputs, I throw one away. By managing the flow, I can minimize the risk, we’ll get to a little bit more detail on that in a second. Again, the legacy approach is, I’ve got a system, I need to upgrade it, I build a new system, I put it in place and then cut over on flag day, and hopefully it works. Not a great experience, huge business risk, very stressful.
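The "two inputs, two outputs, I throw one away" idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Dyn's implementation: FlowMirror, legacy_store and new_store are hypothetical names, and the two stores are stand-in functions rather than real MySQL or Cassandra clients.

```python
from typing import Callable, List, Set, Tuple

Handler = Callable[[dict], str]


class FlowMirror:
    """Send every event to the legacy system; copy chosen flows to the new one."""

    def __init__(self, legacy: Handler, new: Handler, dark_flows: Set[str]) -> None:
        self.legacy = legacy
        self.new = new
        self.dark_flows = dark_flows
        # (flow, legacy output, new-system output) kept only for inspection.
        self.shadow_log: List[Tuple[str, str, str]] = []

    def handle(self, event: dict) -> str:
        live_result = self.legacy(event)
        if event["flow"] in self.dark_flows:
            try:
                shadow_result = self.new(event)
            except Exception as exc:  # a dark failure must never reach the customer
                shadow_result = f"error: {exc!r}"
            self.shadow_log.append((event["flow"], live_result, shadow_result))
        # The customer only ever sees the legacy output.
        return live_result


def legacy_store(event: dict) -> str:
    # Stand-in for the legacy MySQL-backed path (assumption).
    return f"mysql:{event['flow']}:{event['id']}"


def new_store(event: dict) -> str:
    # Stand-in for the new Cassandra-backed path (assumption).
    return f"cassandra:{event['flow']}:{event['id']}"


mirror = FlowMirror(legacy_store, new_store, dark_flows={"bounce_logged"})
answers = [
    mirror.handle({"flow": flow, "id": i})
    for i, flow in enumerate(["sent", "bounce_logged", "open"])
]
```

Every answer comes from the legacy path, and only the bounce_logged event leaves an entry in shadow_log, so the new system sees production traffic without its output going anywhere but the inspection log.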
Dave Connors 06:40
Dark architecture, how do we tackle it? We’re going to start with flows. The first piece of flow is we’re going to deconstruct the input flow and we’re going to look at the highest priority business value. In the case that I mentioned, email tracking servers, the main customer pain point is actually bounces, and bounces were made up of two parts – bounce logging, bounce processing. We want to start with something of high business value that addresses customer pain points, so we’re going to pick bounce logging, a very small piece. It’s about 2% of the overall system functionality, but it’s pretty low-risk, and by managing flow, I’m going to flow production traffic onto that system and then throw away the output. The customer doesn’t see it, it’s not live in production, but it’s there for my inspection, and I use that output to validate the system. I look for things like row-count matches, errors in logs, performance time, processing time – things like that that help me validate the full functionality while it’s in production, in the live stream of traffic, but not processing functionally in terms of what the customer sees. It’s a very low risk to do it. And by the way, since I only had to create 2% of the functionality, I can deliver that a lot more quickly than trying to do the whole ball of wax over six months. That’s the value, and then when I get confident, I go live with that. Here we are, I’ve got 2% of my functionality live in the new system, I haven’t tried to do the whole thing and the legacy system is still processing everything except for bounce logging.
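The checks mentioned above (row-count matches, errors in logs, processing time) could be expressed as something like the sketch below. It is illustrative only: RunStats, validate_shadow and the 1.5x slowdown threshold are assumptions made for the example, not values from the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunStats:
    """Summary of one system's handling of the same batch of traffic."""
    row_count: int     # rows written for the batch
    error_lines: int   # error lines found in that system's log
    seconds: float     # wall-clock processing time


def validate_shadow(legacy: RunStats, shadow: RunStats,
                    max_slowdown: float = 1.5) -> List[str]:
    """Return a list of problems; an empty list means the dark run looks good.

    max_slowdown is a hypothetical tolerance: the new system may take up to
    that multiple of the legacy processing time before it is flagged.
    """
    problems: List[str] = []
    if shadow.row_count != legacy.row_count:
        problems.append(
            f"row count mismatch: legacy={legacy.row_count} shadow={shadow.row_count}"
        )
    if shadow.error_lines > 0:
        problems.append(f"{shadow.error_lines} error lines in shadow log")
    if shadow.seconds > legacy.seconds * max_slowdown:
        problems.append("shadow processing slower than allowed")
    return problems


legacy_run = RunStats(row_count=1000, error_lines=0, seconds=2.0)
good_run = RunStats(row_count=1000, error_lines=0, seconds=1.2)
bad_run = RunStats(row_count=998, error_lines=3, seconds=4.0)

good_report = validate_shadow(legacy_run, good_run)
bad_report = validate_shadow(legacy_run, bad_run)
```

An empty report is the signal for "when I get confident, I go live"; a non-empty one keeps the flow dark while the discrepancies are investigated.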
Dave Connors 08:22
As I said, the next step of customer pain and business priority for us was bounce processing. Let’s call it 18%, and we put that through the same cycle. We develop that, put it out, accept traffic, throw away the output, but validate the output. Once we’re confident that it’s processing appropriately, we turn that live. Now we’re in a state where we have the legacy system processing 80% of the functionality, and the new system is processing 20% of the functionality. We talk about delivering the maximum value: the business is in a position now to look at it and say, “Okay, ideally I’d finish that project, but there may be higher priority items that I want to address, and I may take those higher priority items and make those the top priority, take the completion of this infrastructure upgrade project, put it into a lower priority workstream, like a swim lane for tech debt reduction, which I’ve allocated 15-20% of my resources for, and that project – completing the 80% functionality transfer to the new system – may take six to nine months, as opposed to the three months that it would have if I’d kept it as a high priority.” That said, if I look at business value, I’m back delivering business value to the customer. While it’s not ideal operationally to have both systems hand in hand, it’s invisible to the customer, and I am working off that backlog in the right business priority.
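The flow-by-flow cutover can be pictured as a small state table. The sketch below uses the percentages from the talk (2% bounce logging, 18% bounce processing) and lumps the remaining 80% into a single hypothetical "everything_else" entry, since the talk does not break it down further. Stage, FLOW_SHARE and the helper functions are illustrative names.

```python
from enum import Enum
from typing import Dict, List


class Stage(Enum):
    LEGACY = "legacy"  # only the legacy system handles the flow
    DARK = "dark"      # new system also receives traffic, output discarded
    LIVE = "live"      # new system's output is what the customer sees


# Share of overall functionality per flow, in percent.
FLOW_SHARE: Dict[str, int] = {
    "bounce_logging": 2,
    "bounce_processing": 18,
    "everything_else": 80,  # not broken down in the talk
}


def serving_system(stage: Stage) -> str:
    """Which system's output reaches the customer for a flow in this stage."""
    return "new" if stage is Stage.LIVE else "legacy"


def new_system_share(stages: Dict[str, Stage]) -> int:
    """Percent of functionality currently served live by the new system."""
    return sum(FLOW_SHARE[flow] for flow, stage in stages.items()
               if stage is Stage.LIVE)


stages: Dict[str, Stage] = {flow: Stage.LEGACY for flow in FLOW_SHARE}
history: List[int] = []
for flow in ["bounce_logging", "bounce_processing"]:
    stages[flow] = Stage.DARK   # accept traffic, throw away output, validate
    history.append(new_system_share(stages))
    stages[flow] = Stage.LIVE   # confident: turn it live
    history.append(new_system_share(stages))
```

After both cycles the new system serves 20% and the legacy system the remaining 80%, and the project can pause in that state without the customer noticing.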
Dave Connors 09:58
Dark architecture as an approach addresses the two major challenges of infrastructure upgrades. One is breaking it into small units of work. Secondly, it takes away the deployment risk by putting it in the live production stream, and it does it by deconstructing the input flow into functional pieces, managing flow through live systems and throwing away outputs where needed. That’s pretty much it, I have a little time for questions if anyone has any questions. I thank you very much.