Today's leading CIO minds talk with host Steve Ginsberg
Colin is the proprietor of 7 Hills Consulting which helps companies with global network and datacenter architecture and design. With a networking and datacenter background spanning over 25 years, Colin has built and led the entire network and datacenter infrastructure for early stage startups including PayPal, YouTube and Dropbox. Additionally he has had various networking and datacenter roles for growth companies such as eBay, Google, Twitch, and Netflix among others.
Colin has rolled out infrastructure within 5 continents for various sized deployments. He has also designed and built numerous large networks, both in terms of egress traffic, and locations deployed. Additional specialties include: making repeatable cookie-cutter scalable deployments, site selection, lease/contract negotiation, backbone/transit/peering discussions, hardware selection, vendor negotiation, etc.
This is the first part of a two part interview, find the second half here.
Steve Ginsberg: Hi, welcome to the podcast. I'm Steve Ginsberg, your host and my guest today is Colin Corbett. Colin has been a leader in internet engineering, with principal roles at PayPal, eBay, YouTube, Google, Dropbox, Netflix, Twitch and others. Colin, thanks for being on the show today. You've helped build key infrastructure for these powerful web scale companies. Any stories you'd like to share?
Colin Corbett: Thank you very much for having me, Steve. Yeah, I've got quite a few issues over the I guess... 25 years of doing this, and so yeah, I've got a few good anecdotes here and hopefully they'll help some of the folks listening.
Well thanks. Thanks Colin. You know I've always enjoyed speaking with you and I know a lot of your work began at PayPal. Maybe you want to start there?
Definitely. So yeah, many years ago at PayPal, we used to spend a lot of time bringing up infrastructure on site, spent a lot of long days at the ’co-lo.’ But when we were doing some of our first things, we also tried using an integrator for a portion of our server build. And so after weeks of doing some of our on site infrastructure that we were doing, the integrator showed up and then they brought their racks in. They were fully assembled, fully tested and they were up and running within an afternoon, whereas I'd kind of been sitting there, tracing and running cables and everything else on the floor.
So one of the things I really became a quick believer in was things like pre-assembled racks and pre-integrated infrastructure. And what that turned into was it would allow for the turn up of a whole POP eventually in less than a week, as opposed to a month of time, or you know sending a large group of people out to go and do a deployment. You can now do a decent size deployment with a much smaller crew and have them back at the office in a much shorter amount of time.
And when you say that, what would you have the integrator integrate? What gets pre-built first? What do you do on site?
So usually it's everything at the rack level. So it'd be things like pre cabling, so it'd be slotting the servers, burning in the servers, blending switches, testing the switches. And if you do it really well, you can even say “here's all the cables that will go from this rack to the next rack over, or back to the core” and they can either ship that as a separate harness or they can have it spooled and pre-plugged into that one rack and then you just basically unfurl it and run it off to the core. Or if you have a distribution frame in the bottom of the rack (or I should say either above the rack or below the rack) you can have that all there ready to just go and deploy.
So then you're really rolling in kind of fully built racks at that point?
Very much so, yes, and done well especially now with things like ZTP and that you can be up and running in a couple of hours if it's going into an existing infrastructure, or if you have to deploy actual networking gear that's also fully deployed with racks of servers, you can be up and running in a couple of days.
Can you explain ZTP for the audience?
Sure. ZTP is the idea of Zero Touch Provisioning. So you connect your networking devices and they go out and they start ZTPing and they'll pull down traffic from a centralized configuration server and start—basically be online without any interaction so they'll actually come up and be on the network and upgrade themselves and be ready to go.
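To make the flow Colin describes concrete, here is a minimal simulation of a zero-touch pass: a factory-fresh switch looks itself up on a central server, upgrades its image if needed, and applies its configuration with no operator input. Everything in it is a hypothetical stand-in (the `PROVISIONING_DB` dictionary, the serial numbers, the version strings), not any vendor's ZTP implementation. In real deployments the device typically learns where to fetch its configuration from DHCP; that step is left out of this sketch.

```python
from dataclasses import dataclass, field

# Hypothetical central provisioning data: serial number -> intended state.
PROVISIONING_DB = {
    "SN-0001": {
        "hostname": "rack12-tor1",
        "os_version": "2.4",
        "config": "hostname rack12-tor1\n",
    },
}


@dataclass
class Switch:
    serial: str
    os_version: str = "1.0"  # assumed factory image
    hostname: str = ""
    config: str = ""
    log: list = field(default_factory=list)


def zero_touch_provision(switch, db=PROVISIONING_DB):
    """Run one simulated ZTP pass; return True if the switch came up configured."""
    entry = db.get(switch.serial)
    if entry is None:
        switch.log.append("no provisioning record; staying unconfigured")
        return False
    # Upgrade first, so the configuration is applied on the intended image.
    if switch.os_version != entry["os_version"]:
        switch.log.append(f"upgrading {switch.os_version} -> {entry['os_version']}")
        switch.os_version = entry["os_version"]
    switch.hostname = entry["hostname"]
    switch.config = entry["config"]
    switch.log.append("config applied; device online")
    return True


new_switch = Switch(serial="SN-0001")
zero_touch_provision(new_switch)
print(new_switch.hostname, new_switch.os_version)
```

A switch whose serial number is not in the provisioning data simply stays unconfigured, which is a reasonable default for an unknown device in this toy model.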
So was that a principle that you then carried to the other roles that you had?
Oh very much so. When I moved on to YouTube, we ended up with everything being pre integrated and just having all of our cabinets show up fully assembled and ready to go, including at that time all the server racks would show up and so we didn't do full integration for the network racks. We would actually go ahead and pre-install those, but we would leave them pre-wired for just adding more video and storage and web racks later.
But it got to the point that we would just—and this was the part which was really nice—was that we could just add video racks. We'd have the cables sitting above them and we could just have the racks placed. The cables were clearly marked and the racks were clearly marked as to which position they'd go into, and the racks would just roll in there; you'd connect the cabling in and they would just come up and online which was great. It wasn't true ZTP at the time because ZTP wasn't around, but what was great was we'd be able to get to serial. We'd be able to configure everything and then yeah, get everything running.
Let me ask... so with YouTube, our audience may or may not be aware that YouTube was brought up before it was part of Google, and then Google bought YouTube, acquired it at a certain point and were you in on day one at YouTube? I know you were in very early in the game when it was scaling and then the real question is...tell me a little bit about that scaling.
Sure so YouTube originally started in managed hosting, using a couple of managed hosting providers, and then I joined right as they decided to go and bring up data center and basically move out of the managed hosting stuff. So that was kind of wild and crazy. We ended up with a lot more traffic than anyone had really ever seen. And so we ended up basically having to scale the network out at rates people had never really seen before.
We went through a lot of really interesting things, so we picked a lot of data centers because yes, they had space, yes, because they had power, but also because they had really good network connectivity. But sometimes the places that have really great connectivity are not necessarily recently built data centers; they've evolved from being old telco hotels, etc. So in one case we actually had a rack show up and, because of the way you had to load it in, you ended up loading it in on the street. So we actually had a rack fall off the back of a truck, which was kind of a hamper to our scalability there, which was kind of a bit of a mess.
But as one of the cool results of that... we ended up working with the integrator to get not just a replacement rack, but also having them pre stage and pre hold additional racks of all types, so that we can actually have capacity ready whenever we needed [it]. We had extra racks of web and DB and storage ready on a moment's notice to go. And that was fun, but also we ended up having a very steady capacity planning cadence where we'd say “I need x racks of video, y racks of web and z racks of DB,” and they'd always show up at a steady state. And we would adjust it from time to time, but we would always have capacity that we could always grow into, which was a lot of fun because there were definitely times in the beginning where we were really short on capacity.
Actually I remember one night we had a rack arrive on a Friday night with the understanding that if the rack wasn't online by midday Saturday, the site would probably go down and not be able to keep up. So we were definitely... there were a lot of... in the beginning there was a lot of crazy kind of firefighting, but we very quickly got out of that with good capacity planning.
We worked very similarly at Pandora even during parallel times. Once we get to like on a 2007 timeframe where I think YouTube—you were always a little bit ahead of us, both certainly in how much traffic that was being delivered and then also I think because of that, you had gotten into it a little bit ahead. But similarly we would look at the graphs on Thursday and realize sometimes we would actually have to buy servers, get them built, get them sourced, get them in, populated, up online in monitoring within a week or less time sometimes, to keep the service running.
Exactly. Yes it was a wild time and it was a fun time at the same time.
So you built YouTube, and again at that time, it was the site on the Internet pushing the most traffic. So kind of the way people might think of both Google, YouTube and Netflix today or Twitch. Some of these really... Bill Norton refers to them as 'hyper giants' and you start locating in network cores, that kind of thing. When YouTube was acquired by Google, did that fundamentally change the business or was it basically maintained in the same way?
So when Google acquired us, what was interesting is YouTube had more traffic than Google, so in most acquisitions, usually it’s the parent company comes in and says “OK we want you to move on to our network.” Well, we had more capacity, we had more egress, so it wasn't a quick migration. YouTube ended up still expanding post Google acquisition, for about another year and a half before everything kind of moved underneath the Google umbrella and Google ended up rolling out what is now their GGC platform to kind of help offload and move that infrastructure under.
But, yeah even in the beginning of YouTube for instance, a lot of their data center expansions were because there was no more capacity from the network vendors in those markets. So we basically took out all the free capacity for about... I think eventually in like four or five markets. We still kept expanding across the US, and then we were acquired by Google and that kind of helped us get a lot more peering relationships and other things too, which helped defer some of that too.
Sure. Do you want to talk a little bit about peering for the enterprise audience?
Sure, so peering is the idea of basically interconnecting network to network and exchanging traffic, ideally in what is called a ‘settlement free’ interconnect, which is: ‘I will send you my traffic for free and you'll send me your traffic for free and we'll try to do it at beneficial locations across the US.’ And by doing that, we both will not pay the transit provider. So usually when you pay a transit provider, I will pay for whichever way is more expensive: either my traffic in or my traffic out. A content network was much more outbound heavy and an eyeball network being inbound would pay for a lot of the inbound, and the cost of transit year over year has continued to go down.
Right about the time when YouTube was there, it was between about $10 and $16 per Mbps per month. But I do remember at other places paying $300 per Mbps. But even then when you're talking about sending, you know, a gigabit, that's 1,000 megabits times $10, that's $10,000 a month. So the idea of interconnecting with either a 1 Gbps or a 10 Gbps link made a lot of sense because you would reduce cost on both sides: whatever you didn't end up sending to the transit provider came off your bill, which was very helpful.
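The arithmetic behind the peering argument can be sketched in a few lines. This assumes a flat price per Mbps per month and billing on the larger of inbound or outbound traffic, as described in the interview; the function names and the 400 Mbps peering figure are made up for illustration, and real transit contracts (commits, tiers, percentile billing) are more involved.

```python
def billable_mbps(inbound_mbps, outbound_mbps):
    """Transit is billed on whichever direction carries more traffic."""
    return max(inbound_mbps, outbound_mbps)


def monthly_transit_cost(mbps, price_per_mbps):
    """Monthly bill in dollars at a flat per-Mbps price."""
    return mbps * price_per_mbps


# A content network: light inbound, 1 Gbps (1,000 Mbps) outbound.
rate = billable_mbps(inbound_mbps=50, outbound_mbps=1000)

# The prices mentioned in the interview, in dollars per Mbps per month.
for price in (10, 16, 300):
    print(price, monthly_transit_cost(rate, price))

# Hypothetical: 400 Mbps of that outbound traffic moves to settlement-free
# peering, so only the remainder is billed as transit.
before = monthly_transit_cost(rate, 10)
after = monthly_transit_cost(rate - 400, 10)
print("monthly saving:", before - after)
```

The saving scales linearly with the traffic moved off transit, which is why a single 1 Gbps or 10 Gbps peering link could pay for itself quickly at these prices.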
Right. Across an entire program, hundreds of thousands or even millions of dollars could be saved on transit?
Where did you go after that?
So after that I ended up leaving YouTube and did a bit of work at Google, where I helped define a few standards, make things cookie cutter, but then I kind of wanted to go back into startups, and I ended up going to Dropbox and building infrastructure for them there. And that was pretty interesting.
And what was that like? Dropbox (I would think most people will be familiar with it) is certainly one of the first of the online storage plays that really grew up and has continued to succeed.
Very much so. So that one was a bit different. Like YouTube, they were very much heavily in managed, but not managed hosting at that time. That was actually the cloud and AWS. But the idea was to help them build infrastructure to come out of the cloud and in this case, instead of doing a lot of small remote POPs and just crazy non-stop building like it was at YouTube, this was a few edge POPs but also large wholesale facilities.
And in this case it was for me... that my role changed from being a lot of network and data center, to being a lot of—still doing a good amount of network, but having to understand a lot of the wholesale side of what a wholesale data center looks like. And in our case we ended up making a couple of bad decisions and having to educate our wholesale data center provider about how to operate a wholesale data center.
And so in your view, what things would you expect to be different at a wholesale data center than what we would refer to as the retail data centers?
So at a wholesale data center, the main difference is size. Usually a retail data center is good for somewhere between zero to about 250 kW, or up to about 25 racks, but a wholesale data center generally starts at about a quarter of a megawatt (250 kW), and I've seen deployments all the way up to about six megawatts. But instead of the retail where you pay a fixed price for an actual power circuit, in wholesale you're paying the actual power draw and either a direct pass through for the PUE or a fixed uplift for PUE.
You also have a lot more control about how your room's laid out, about how your power is distributed, so whether you want something like a Starline busway or certain types of power whips or things like that. You also end up in a different kind of lease, so whereas in retail, you're basically paying for like a one, two or three year lease, with wholesale you're usually paying for something on the order of a five to seven year lease with two to three potential re-ups kind of thing, so you can be in there for a long time if you wanted to.
Yeah, and just to add: a PUE is really a shared cost of the data center. It's kind of a ‘pass through’ of what the facility costs in general are... how those are being built up and then being distributed based on how much of the data center you're renting out.
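As a rough illustration of the two wholesale billing models mentioned above, the sketch below charges for metered IT energy multiplied by a PUE factor. PUE (power usage effectiveness) is commonly defined as total facility energy divided by IT equipment energy, so multiplying the IT draw by it approximates the tenant's share of cooling and other overhead. The utility rate, the load, and both PUE values are hypothetical, and actual contracts may compute the charge differently.

```python
def wholesale_energy_charge(it_kwh, rate_per_kwh, pue):
    """Monthly energy charge: metered IT energy x PUE x utility rate."""
    return it_kwh * pue * rate_per_kwh


# Hypothetical tenant: 500 kW average IT draw over a 730-hour month.
it_kwh = 500 * 730
rate = 0.10  # assumed utility rate, dollars per kWh

# Pass-through: the facility's measured PUE for the month is applied.
pass_through = wholesale_energy_charge(it_kwh, rate, pue=1.5)

# Fixed uplift: a PUE figure agreed in the contract, regardless of the
# facility's actual efficiency that month.
fixed_uplift = wholesale_energy_charge(it_kwh, rate, pue=1.4)

print(round(pass_through, 2), round(fixed_uplift, 2))
```

With a pass-through the tenant carries the risk of an inefficient facility; with a fixed uplift the provider does, which is one reason the choice matters during negotiation.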
So you said there were a few lessons to be learned there in terms of making sure that you get the service that you're paying for. And later I want to talk a little bit more about vendor management, but any kind of quick takeaway specifically from that about how to deal with wholesale data centers?
So for me one of the things I really learned was: it's really a lot about the contract. So one of the most important people I suggest you find is a realtor, a lawyer who understands real estate with data centers as part of it. So things like... I guess I'll start with the first thing I would always do is I start now by [writing] a very detailed RFP. And then you'll get the responses from the vendors and you'll ask very detailed questions in there. But you always make sure that your lawyer says, “That's very nice. All the responses you've put in this RFP are actually going to be binding in the contract.”
So I'll give you a couple of examples. One of the questions I ask in my RFP is, I say “Hey I'd like to bring racks into my data center. I plan for them to make it all the way from the loading dock without being tipped or tilted. And I expect that I can get them from the loading dock to my cage in less than an hour.” And you know usually in the RFP they'll say “yep we guarantee that” and it then becomes “OK but can you guarantee that for the life of this contract?” If this is a seven year contract, I expect that to be true the whole time whether you block off a corridor, or whether your freight elevator fails or something, we still need to be able to actually get my equipment in and my equipment out should that ever happen.
So I've definitely had cases where I've had buildings where they've shut down the freight elevator for five months, and you couldn't get equipment in or out or you could only go through the side elevator, but then it wasn't ready for the same amount of weight or it wasn't ready for the same amount of clearance, and so you kind of have to refer back to the contract for something like that. But also one of the interesting things about the contract is that the contract is everything that the data center vendor is really obligated to do [for] you. And if you look at it as... in a negative view as, this is really all that they were obligated to do, so if it does not say in the contract something like ‘there will be 24/7 technicians,’ even if they say that in the RFP, the vendor will try to go and pull that out unless it actually says this binds back to the RFP. As a good nitty gritty example, I've seen data centers where they say “Yes we have coffee” but then you show up there [and] they don't provide coffee cups or straws or stirrers, so there's just kind of like the ‘well we're obligated to just give you coffee’ kind of thing.
You know this is a great point in general too I think, for those who don't have deep data center experience. If you have an intensely building or growing site, you know as in the case for the examples of YouTube and Pandora and Dropbox, for these web scale companies where you're moving very quickly. Managing the data center really is about keeping operations moving and about anticipating what can go wrong and expecting that just about anything could go wrong, and kind of being prepared for it.
For those of us who were building our data center experience over the last decade or so, sometimes we'd be surprised by the level of things that can happen. Certainly it's a small example, but having coffee and not cups seems like a crystallizing example, albeit not one that's as mission critical maybe as some of the other things. But in order to keep these things moving, the data centers are very dynamic environments. And if your website demands that you move quickly and consistently, you can really stop, if you miss any of these.
Yeah. On a much more serious note, I've had data centers that say “yes we meet...” So there's a data center standard called TIA-942. It’s basically what all the tier standards derive from, but I've had companies say “yes we meet all the tier three,” which is the idea of any one thing can fail, but the data center keeps running, and it also defines things like fiber diversity etc.
But I've had locations where they've actually said “Yes, we meet all the diversity requirements” but the fiber ends up not being diverse, so you end up with one or in this case two potential points where the fiber actually crossed itself, before making out to the street. And you know one bad fiber cut, instead of taking down just one path and your site keeps running, you end up with the exposure of one bad fiber cut and the entire site could end up offline.
My worst example of that was I had a vendor at one point try to tell me on a call that a fiber was diverse because there was transmit and receive. That did not end up being the end result.
Yeah. I think the standard is that it should be 20 meters, you know physically separate... is what's supposed to be expected. But yeah I've had one data center tell me it was diverse, and you walk the path and yes it was diverse. It was about four feet diverse.
Right. Right. So the likelihood of something causing trouble to one side and not the other becomes pretty low.
Yeah, so after Dropbox I believe you ended up at Netflix as well. Do you want to talk about there? I think your role is a little bit different there, is that right?
I did. So after dealing a lot with data centers, I ended up moving to Netflix because I wanted more of a corporate role and I ended up going back and working on the corporate network there. We were doing a lot of build outs and a lot of expansions and that was a pretty fun time. But one of the cool things we had there was trying to erase a lot of the (I guess) technical debt that had built up over time, and to kind of figure out what were some of the burning issues and just to kind of solve that.
So the biggest hurdle we had there was: because the buildings there had grown organically over like I guess like seven or eight years, as everyone would move, as cubes would move and things like that, everyone would kind of go in and just kind of get their desk plugged or you'd get one ethernet port lit up and things like that. And it ended up with a lot of you know nobody really knew what was plugged in where, or how everything went.
So we ended up doing a lot of clean up there and doing a lot of, like, great IDF rework where everybody gets their desk patches. We ended up adding new switches, new power infrastructure, new ‘out of band’ to everything so that should one building go down, you could still get to it by dedicated fiber and separate serial and things like that. So that was really cool and we also ended up rolling that out across all the other locations across the US too, as well as new wireless infrastructure and things like that.
Cool. And did you feel like many of the concepts from the data center translate well to corporate networking or did you find them pretty different? I mean obviously there is a difference of having end users directly online and locations are different, but...
There's a lot of things that come from the data center, a lot of things I took for granted going there, that I ended up incorporating, things like you always have a spare of all your common parts. That really helped because in the corporate world, we ended up with a lot of third party vendors [who] would bring in things that you wouldn't expect to see in... like in the data center, since you can control all the parts that you see. But especially in the corporate world, you end up with things like “oh this one vendor’s bringing in a special cable with a different type of fiber connector,” and that's great, but if it's really important to us, we'll go and start sparing that, so we ended up having a lot of spares of everything that could fail, which was really helpful.
One of the things you run into both in the corporate world and data centers is anything that's really important, you should have a spare of because it's usually worth more to have the spare on hand than wait one to two days (ideally) if it's in stock, or up to six to eight weeks if it's not in stock, to have it. So, just stocking all the parts is great. The best part though is nobody really complains if you want to do a late night maintenance because there is a good time to take maintenance whereas in the data center, well there's never really a good time to take an affecting maintenance.
Right, you hopefully have enough HA that you can fail over to something and fail back, but that's not always the case if that's something that's more circuit infrastructure or something.
Exactly. So yeah in the corporate world you can actually put up a sign going “This building will be closed on the weekend,” or on a three day weekend and people will accept that and they'll work from the other building, and you can get away with that, but yeah. Not so much in the data center world.
Yeah. One thing at least in data centers I tended to appreciate over enterprise office locations was generally people knew the path to build circuits into the building. Not always in data centers, but generally that was more clear; whereas in offices it felt like most of the time when you were lighting things up, there was always a question of where the demarcs really were.
Yeah, when we took over a space in Dropbox, the building was a relatively well-connected building, but there were five different fiber points where a vendor could land a circuit; and so trying to figure out where this vendor would land it eventually just resulted in us building to all five different possible locations just so that as we ordered different capacity, we would be ready for them.
We've discussed contracts in the past Colin, and we touched upon it a little bit earlier, but maybe you want to bring together the things that you think are really important for data center contracts?
Sure. So as I mentioned earlier I think the most important person you really need is a good lawyer. The data center contract can be a very large amount of money. So if you look at like a wholesale contract, let's say it's for about a megawatt. Let's say you're paying about $140 per kW, that puts you at roughly about $1.7 million a year for about five years, not including all your escalators, renewals, your power, your PUE and also all the costs of the hardware you're putting in there. So if something goes wrong, the contract is usually your only way out.
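The contract figure above can be checked with a short calculation. It assumes the $140 per kW is a monthly rate (the interview does not say so explicitly, but that is what yields roughly $1.7 million a year for one megawatt), and the 3% annual escalator in the second example is a made-up value for illustration. Power, PUE and hardware costs are excluded, as in the interview.

```python
def annual_base_rent(kw, price_per_kw_month):
    """Base rent per year, assuming the per-kW price is charged monthly."""
    return kw * price_per_kw_month * 12


def total_base_rent(kw, price_per_kw_month, years, escalator=0.0):
    """Base rent over the lease term, raising the rent by `escalator` each year."""
    total = 0.0
    rent = float(annual_base_rent(kw, price_per_kw_month))
    for _ in range(years):
        total += rent
        rent *= 1 + escalator
    return total


# One megawatt (1,000 kW) at $140 per kW per month.
print(annual_base_rent(1000, 140))

# Five-year term with no escalator, then with a hypothetical 3% escalator.
print(total_base_rent(1000, 140, years=5))
print(round(total_base_rent(1000, 140, years=5, escalator=0.03)))
```

Even before escalators the base rent alone comes to several million dollars over the term, which is the scale that makes the legal review Colin recommends worth its cost.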
So for me, we had a very bad experience and so now I want to do a few things. So first is I always send out a few... Whenever I'm looking for space I ask a lot of what I think are basic questions. So I do things like I confirm that there's 24/7 technicians, 24/7 security, that there's shipping and receiving staff and what their hours are, that there's annual maintenance and inspections of all the gear. I also ask for confirmation of what the temperature and humidity cooling ranges are as well as rate of change. And also if I do things like getting meet-me room presence, and if my staff are allowed to run cross connects and if not, who is allowed to run them? What's the escalate time around that? And also if there's costs around running the cross connects. Traditionally in your wholesale data center, cross connects are usually like an NRC, usually with no monthly recurring charge, although there's talk of that changing over the past few years.
Also some of the other things: just again making sure that your rack can always be brought in and brought out and that there's enough clearance so it never needs to be tipped or tilted. As a quick side note, even though most data centers claim to be 2N or N+1, you'll usually find out that there's only ever one freight elevator, and if that freight elevator does go down, you can be stuck getting equipment in or out for however long it takes to fix it.
So back to the contracts. Some of the questions, some of the other things I ask are: if there ever is something like a power outage, even if it's just one leg of redundant power on like a two power supply server, does that count as an outage? Are there credits to it and also, do they view it as an outage... I've definitely had data centers where they say “if you don't lose complete power, that's not really an outage.” There's also the corollary to that: if they don't count it as an outage, then there's no incentive to fix it. And there's no SLA for them to go work on it, so I believe it is an outage. It's something that's not expected and I expect them to go fix that.
To tie it all together, when I do send the RFP I make it very clear that all the answers in the RFP will be held to be true for the lifetime of the contract. So if they say that there is 24/7 staff and technicians, then I'm expecting that to look like that all the way through the entire contract. And the most important thing we ended up adding [is] what we called a... well the non technical term for it was a ‘clown clause,’ which basically said if the data center turns out to be not well run, do we get to walk if they show themselves to be just completely negligent?
So as an example, if a data center vendor went and walked into my cage and turned off the A-feed and then turned off the B-feed to my equipment and my power went down with, let's say no ticket, no maintenance, do I get to leave, or do I just get like a one day credit or a half a month credit or something that doesn't say “hey this is completely negligent and you should be allowed to leave?” So we've started to ask for that and we were getting that in a lot of our contracts.
Yeah. That's a great point. Again I think some hearing this list, if they're not familiar, might think ‘wow that's a lot of kind of detail about things.’ and I think with data centers, some of the things individually when you look at them are gonna be unlikely, but collectively we know especially if we not only take our own experience but the experience of our peers that we talk to, that these things do go wrong in data centers, odd things happen over time.
I know you were mentioning heating and cooling. I know at one point you had a cooling system in the data center that caused some trouble that would have been unexpected. So these things really do come up over time.
Yeah very much so. I had one data center where it wasn't sealed from the outside as much as you would expect, so whenever the clouds would roll in, the humidity in the data center would basically swing massively completely outside of the rate of change even though it was... and if we actually had ties where we'd held it to the actual rate of change that would have been you know that would have been very helpful for us.
We also had a humidity system at one data center where to add humidity you're supposed to use filtered water, but the data center was using unfiltered water and we ended up with a white haze over our data center that was basically highly conductive calcium that ended up shorting out a good amount of our gear. So yeah that was not fun, and the integrator again had to help us with that. We ended up having to basically clean it all with compressed air and we still ended up with some parts that were just completely shorted out and non recoverable.
Yeah, this kind of business interruption is certainly the last thing that you need when you're moving at you know what we used to call Internet time, trying to keep operations moving.