How Amazon is building substations, laying fiber and generally doing everything to keep cloud costs down

If anyone is still wondering how large cloud providers can keep rolling out new features and lowering their prices even when no one is complaining about them, Amazon Web Services Vice President and Distinguished Engineer James Hamilton spelled out the answer in one word during a presentation Thursday at the company’s re:Invent conference: Scale.

Scale is the enabler of everything at AWS. To express the type of scale he’s talking about, Hamilton noted an oft-cited statistic: AWS adds enough capacity every day to power the entirety of Amazon.com when it was a $7 billion business. “In fact, it’s way bigger than that,” he added. “It’s way bigger than that every day.”

Seven days a week, the global cycle of building, testing, shipping, racking and deploying AWS’s computing gear “just keeps cranking,” Hamilton said. AWS now has servers deployed in nine regions across the world, and some of those regions include multiple data centers. The more you build, the better you get and the less risk-averse you get, he explained, and “the best thing you can do for innovation is drive the risk of failure down and make the cycle quicker.”

Hamilton working the very full room.

The cost of delivering a service at scale is all in the infrastructure. The software engineering costs “round to zero,” Hamilton said.

That’s why he thinks he’s seen more innovation in the world of computing in the past 5 years than in the previous 20 years — because companies like Amazon, Facebook, Google and Microsoft have gotten so good at scaling their infrastructure. He was on a team (presumably at IBM) that set a world record for online transactional database performance at 69 transactions per second and “the party was long,” he joked. Today, a single region of Amazon’s DynamoDB service is handling more than 2 trillion requests per month. The Amazon S3 storage system peaks at 1.5 million requests per second.
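For a rough sense of that gap, the monthly DynamoDB figure can be converted into an average per-second rate and set against the old 69-per-second record. The sketch below assumes a 30-day month and treats "more than 2 trillion" as exactly 2 trillion, so the result is a floor. A DynamoDB request is also not the same thing as a benchmark transaction, so the ratio is only illustrative.

```python
# Assumption: a 30-day month; the talk did not say how the month was counted.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def average_per_second(requests_per_month: float) -> float:
    """Average request rate implied by a monthly total."""
    return requests_per_month / SECONDS_PER_MONTH


# Figures quoted in the article.
DYNAMODB_REQUESTS_PER_MONTH = 2e12  # "more than 2 trillion", used as a lower bound
RECORD_TPS = 69                     # the old world-record transaction rate

dynamodb_avg = average_per_second(DYNAMODB_REQUESTS_PER_MONTH)
ratio = dynamodb_avg / RECORD_TPS

print(f"DynamoDB average: {dynamodb_avg:,.0f} requests per second")
print(f"Roughly {ratio:,.0f} times the old record")
```

Under those assumptions the average comes to roughly 770,000 requests per second for a single region, and that is an average, not a peak.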

Here’s a little taste of what AWS is doing to ensure it keeps its costs down and its innovation level high.

Servers

Like Google and Facebook, Amazon is designing its own servers, and they’re all specialized for the particular service they’re running. Back in the day, Hamilton used to lobby for just having one or two SKUs from a server vendor in order to minimize complexity, but times have changed. Once you master the process, going straight to server manufacturers with custom designs can lop 30 percent off the price right away, not to mention the improved performance and faster turnaround time.

Today, “You’d be stealing from your customers not to optimize your hardware,” he said.

Storage

Hamilton didn’t talk a lot about AWS’s custom-built storage, but he did share one tidbit. The densest storage servers you can buy commercially today come from Quanta, and a rack full of them would weigh in at about three-quarters of a ton. “We have a far denser design — it is more than a ton,” Hamilton said.

Networking

Networking is a huge problem today as prices keep rising and force many companies to oversubscribe their data center bandwidth, Hamilton said. In many typical scenarios, only 1 out of every 60 servers could transmit at full bandwidth at one time, and that works fine because they’re usually not doing much. Of course, that doesn’t really work for AWS, which can’t control the workloads its users are running. If they’re running something like MapReduce, he explained, every server in the cluster is probably transmitting at 100 percent bandwidth capacity.
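The oversubscription arithmetic can be sketched in a few lines. This is a deliberately simplified model, not a description of AWS's network: it assumes every server has the same network interface speed (the 1 Gbps figure is invented), that the shared fabric can carry one-sixtieth of the servers' combined bandwidth, and that the servers transmitting at any moment split that capacity evenly.

```python
def per_server_gbps(nic_gbps: float, oversubscription: float, active_fraction: float) -> float:
    """Bandwidth each active server gets when the shared fabric is split evenly.

    nic_gbps: line rate of one server's network interface (assumed figure)
    oversubscription: total server bandwidth divided by fabric capacity (60 for 60:1)
    active_fraction: share of servers transmitting at the same time, 0 < f <= 1
    """
    if not 0 < active_fraction <= 1:
        raise ValueError("active_fraction must be in (0, 1]")
    fair_share = nic_gbps / (oversubscription * active_fraction)
    # A server can never exceed its own interface speed.
    return min(nic_gbps, fair_share)


NIC_GBPS = 1.0  # hypothetical interface speed
RATIO = 60      # the 1-in-60 scenario from the talk

quiet = per_server_gbps(NIC_GBPS, RATIO, 1 / 60)  # typical enterprise load
busy = per_server_gbps(NIC_GBPS, RATIO, 1.0)      # MapReduce-style: everyone transmits

print(f"1 in 60 servers active: {quiet:.3f} Gbps each")
print(f"all servers active:     {busy:.3f} Gbps each")
```

In this toy model a cluster where every server transmits at once leaves each one with a sixtieth of its line rate, which is the situation Hamilton said AWS cannot accept.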

So, like Google and, soon, Facebook, AWS is building its own networking gear and its own protocol stack. “We’ve taken over the network,” Hamilton said. “… Suddenly we can do what we normally do.” (Although, a skeptic might argue, you wouldn’t have to ask too many AWS users before you found one who has experienced inconsistent network performance.)

Outside the data center, AWS is also investing serious resources to guarantee it gets the bandwidth it needs. “Absolutely … that’s happening,” Hamilton told an audience member who asked whether the company is building its own long-haul fiber infrastructure.

Power generation

AWS also builds its own electric substations, which is not a minor undertaking considering that each one requires between 50 and 100 megawatts to really be efficient, Hamilton explained. “Fifty megawatts, that’s a lot of servers,” he added. “… [M]any tens of thousands.”
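The "many tens of thousands" figure is easy to sanity-check with back-of-the-envelope arithmetic. Hamilton gave no per-server numbers, so the wattage and the facility overhead factor below are assumptions chosen purely for illustration, not AWS figures.

```python
import math


def servers_supported(substation_watts: float, watts_per_server: float, overhead_factor: float) -> int:
    """Whole servers a power budget can feed, allowing for cooling and distribution overhead."""
    return math.floor(substation_watts / (watts_per_server * overhead_factor))


SUBSTATION_WATTS = 50e6   # the 50-megawatt low end quoted in the talk
WATTS_PER_SERVER = 500    # assumption: average draw of one server
OVERHEAD_FACTOR = 1.5     # assumption: total facility power divided by server power

estimate = servers_supported(SUBSTATION_WATTS, WATTS_PER_SERVER, OVERHEAD_FACTOR)
print(f"About {estimate:,} servers under these assumptions")
```

Different assumptions move the answer a good deal, but any plausible pair of values lands in the tens of thousands or above.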

The equipment can be pretty expensive (although not exceedingly so when the cost is spread across so many services), and the company even has firmware engineers whose job it is to rewrite the archaic code that normally runs on the switchgear designed to control the flow of power to electricity infrastructure. The latter part might seem like overkill, but Hamilton pointed to the Super Bowl XLVII power outage as proof of what can happen when an electrical emergency occurs and the switchgear does what it’s normally programmed to do: drop offline fast to avoid potential damage to the expensive generator.

Rather than protecting a generator, Hamilton said, “Our goal is to keep the servers running.”

Resource utilization

Companies of all types have been struggling for years to use their resources efficiently, because they buy enough servers to handle peak workloads and then keep them idle the rest of the time. And while turning off servers when they’re idle saves a little money on power, it doesn’t change the fact that they were purchased in the first place. In fact, resource utilization is by far the biggest lever that AWS has when it comes to reducing costs, Hamilton said.

When you’re running at web scale, he added, “Anything that can change this number, even microscopically, is worth a lot of money.”

Luckily, being a cloud provider lets you get well above the usual 20 percent utilization number just by nature. For starters, because AWS is constantly running “a combination of non-correlated workloads,” Hamilton explained, resource utilization just naturally levels itself out. (Think, at a high level, of a chart showing peak workload times for various industries through the year, where retail would spike around the holidays, accounting firms would spike around tax day and other users would be spiking at other times of year.)
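The leveling effect can be shown with a toy calculation. The three tenants and their monthly demand numbers below are entirely invented, in arbitrary "server units"; the point is only that capacity sized for the peak of the combined load is smaller than the sum of capacities sized for each tenant's own peak.

```python
# Invented monthly demand (January to December) for three hypothetical tenants.
DEMAND = {
    "retail":     [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 10],   # holiday spike
    "accounting": [2, 2, 4, 10, 2, 2, 2, 2, 2, 2, 2, 2],   # tax-day spike
    "media":      [2, 2, 2, 2, 2, 4, 10, 4, 2, 2, 2, 2],   # made-up summer spike
}
MONTHS = 12


def utilization(total_demand: float, capacity: float, periods: int) -> float:
    """Fraction of purchased capacity actually used over the whole year."""
    return total_demand / (capacity * periods)


total_demand = sum(sum(profile) for profile in DEMAND.values())

# Each tenant buys for its own peak.
dedicated_capacity = sum(max(profile) for profile in DEMAND.values())

# One shared pool buys for the peak of the combined load.
combined = [sum(month) for month in zip(*DEMAND.values())]
pooled_capacity = max(combined)

dedicated_util = utilization(total_demand, dedicated_capacity, MONTHS)
pooled_util = utilization(total_demand, pooled_capacity, MONTHS)

print(f"dedicated: capacity {dedicated_capacity}, utilization {dedicated_util:.1%}")
print(f"pooled:    capacity {pooled_capacity}, utilization {pooled_util:.1%}")
```

With these made-up profiles the pooled operator needs less than half the hardware and runs it at roughly twice the utilization, because the spikes never coincide.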

For when demand starts adding up, AWS tracks and automates its supply chain.

And then AWS threw in Spot Instance pricing to make sure that whatever resources weren’t naturally being used would be discounted and hopefully sold at a smaller profit. Any amount customers pay that’s above the cost of powering the servers is worth it in terms of recouping the capital expense, Hamilton said. It’s especially worth it for AWS, which has cut prices 38 times in 7 years and follows the Amazon.com model of making money.
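Hamilton's point about Spot pricing is a marginal-cost argument: the server is already paid for, so an idle hour earns nothing, and any price above the power bill chips away at the capital expense. A minimal sketch, using invented prices in cents per server-hour (none of these are real AWS prices):

```python
def capex_contribution_cents(spot_price_cents: int, power_cost_cents: int) -> int:
    """Cents per hour left over to offset the already-spent purchase price."""
    return spot_price_cents - power_cost_cents


def worth_selling(spot_price_cents: int, power_cost_cents: int) -> bool:
    """True when running the server for this price beats leaving it idle."""
    return capex_contribution_cents(spot_price_cents, power_cost_cents) > 0


POWER_COST_CENTS = 2  # hypothetical cost of powering one server for an hour

for bid in (1, 3, 8):  # hypothetical customer bids
    verdict = "sell" if worth_selling(bid, POWER_COST_CENTS) else "leave idle"
    print(f"bid {bid} cents: {verdict}, contribution {capex_contribution_cents(bid, POWER_COST_CENTS)} cents")
```

The model ignores everything except power, which is how the argument was framed in the talk; a fuller accounting would add other per-hour operating costs.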

If some analysts still can’t recommend buying Amazon stock, he joked, they probably wouldn’t be keen on AWS either: “We think the cloud computing market looks the same way [as e-commerce] … very high volume with very low margins.”
