
Summary:

Amazon Web Services VP and Distinguished Engineer James Hamilton explained during a session at the AWS re:Invent conference how the cloud provider keeps costs as low as possible and innovation as high as possible. It’s all about being the master of your infrastructure.

Amazon server design.

If there’s anyone still left wondering how it is that large cloud providers can keep on rolling out new features and lowering their prices even when no one is complaining about them, Amazon Web Services Vice President and Distinguished Engineer James Hamilton spelled out the answer in one word during a presentation Thursday at the company’s re:Invent conference: Scale.

Scale is the enabler of everything at AWS. To express the type of scale he’s talking about, Hamilton noted an oft-cited statistic — that AWS adds enough capacity every day to power the entirety of Amazon.com when it was a $7 billion business. “In fact, it’s way bigger than that,” he added. “It’s way bigger than that every day.”

Seven days a week, the global cycle of building, testing, shipping, racking and deploying AWS’s computing gear “just keeps cranking,” Hamilton said. AWS now has servers deployed in nine regions across the world, and some of those regions include multiple data centers. The more you build, the better you get and the less risk-averse you get, he explained, and “the best thing you can do for innovation is drive the risk of failure down and make the cycle quicker.”

Hamilton working the very full room.

The cost of delivering a service at scale is all in the infrastructure. The software engineering costs “round to zero,” Hamilton said.

That’s why he thinks he’s seen more innovation in the world of computing in the past 5 years than in the previous 20 years — because companies like Amazon, Facebook, Google and Microsoft have gotten so good at scaling their infrastructure. He was on a team (presumably at IBM) that set a world record for online transactional database performance at 69 transactions per second and “the party was long,” he joked. Today, a single region of Amazon’s DynamoDB service is handling more than 2 trillion requests per month. The Amazon S3 storage system peaks at 1.5 million requests per second.
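
For a rough sense of that jump, here is a quick back-of-the-envelope check (it assumes a 30-day month for the averaging):

```python
# Rough comparison of the scale Hamilton describes: 2 trillion DynamoDB
# requests per month in a single region vs. the 69 TPS world record.
# Assumes a 30-day month for the averaging.

REQUESTS_PER_MONTH = 2_000_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 3600          # 2,592,000

avg_rps = REQUESTS_PER_MONTH / SECONDS_PER_MONTH
print(f"Average DynamoDB load: ~{avg_rps:,.0f} requests/second")   # ~770,000
print(f"Roughly {avg_rps / 69:,.0f}x the old 69 TPS record")       # ~11,000x
```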

Here’s a little taste of what AWS is doing to ensure it keeps its costs down and its innovation level high.

Servers

Like Google and Facebook, Amazon is designing its own servers, and they’re all specialized for the particular service they’re running. Back in the day, Hamilton used to lobby for just having one or two SKUs from a server vendor in order to minimize complexity, but times have changed. Once you master the process, going straight to server manufacturers with custom designs can lop 30 percent off the price right away, not to mention the improved performance and faster turnaround time.

Today, “You’d be stealing from your customers not to optimize your hardware,” he said.


Storage

Hamilton didn’t talk a lot about AWS’s custom-built storage, but he did share one tidbit. The densest storage servers you can buy commercially today come from Quanta, and a rack full of them would weigh in at about three-quarters of a ton. “We have a far denser design — it is more than a ton,” Hamilton said.

Networking

Networking is a huge problem today as prices keep rising and force many companies to oversubscribe their data center bandwidth, Hamilton said. In many typical scenarios, only 1 out of every 60 servers could transmit at full bandwidth at one time, and that works fine because they’re usually not doing much. Of course, that doesn’t really work for AWS, which can’t control the workloads its users are running. If they’re running something like MapReduce, he explained, every server in the cluster is probably transmitting at 100 percent bandwidth capacity.
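
To see why that breaks down for MapReduce-style jobs, here is a rough sketch of the math; the 10 Gb/s links and 40-server racks below are illustrative assumptions, and only the 60:1 ratio comes from Hamilton's example.

```python
# Rough sketch of the oversubscription math. The 10 Gb/s links and 40-server
# racks are illustrative assumptions; only the 60:1 ratio comes from the
# example Hamilton gave.

NIC_GBPS = 10           # assumed per-server link speed
SERVERS_PER_RACK = 40   # assumed rack density
OVERSUBSCRIPTION = 60   # only 1 in 60 servers can transmit at line rate

worst_case_demand = NIC_GBPS * SERVERS_PER_RACK            # 400 Gb/s if every server bursts
provisioned_uplink = worst_case_demand / OVERSUBSCRIPTION  # ~6.7 Gb/s actually provisioned

print(f"Worst-case demand per rack: {worst_case_demand} Gb/s")
print(f"Uplink provisioned at 60:1: {provisioned_uplink:.1f} Gb/s")
# A MapReduce-style job that drives every NIC to 100 percent needs the full
# 400 Gb/s, which is why oversubscription that works for a typical enterprise
# doesn't work for AWS.
```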

So, like Google and, soon, Facebook, AWS is building its own networking gear and its own protocol stack. “We’ve taken over the network,” Hamilton said. “… Suddenly we can do what we normally do.” (Although, a skeptic might argue, you wouldn’t have to ask too many AWS users before you found one who has experienced inconsistent network performance.)

Outside the data center, AWS is also investing serious resources to guarantee it gets the bandwidth it needs. “Absolutely … that’s happening,” Hamilton told an audience member who asked whether the company is building its own long-haul fiber infrastructure.


Power generation

AWS also builds its own electric substations, which is not a minor undertaking considering that each one requires between 50 and 100 megawatts to really be efficient, Hamilton explained. “Fifty megawatts — that’s a lot of servers,” he added. “… [M]any tens of thousands.”
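
A quick sanity check on “many tens of thousands”; the per-server power draw and overhead factor below are illustrative assumptions, not AWS figures.

```python
# How many servers can a 50 MW substation feed? The per-server draw and the
# PUE (overhead for cooling, power distribution, etc.) are assumptions.

SUBSTATION_MW = 50
WATTS_PER_SERVER = 500   # assumed average draw per server
PUE = 1.2                # assumed power usage effectiveness

usable_watts = SUBSTATION_MW * 1_000_000 / PUE
servers = usable_watts / WATTS_PER_SERVER
print(f"~{servers:,.0f} servers")   # ~83,000: squarely in "many tens of thousands"
```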

The equipment can be pretty expensive (although the cost isn’t exceedingly high when spread across so many servers), and the company even has firmware engineers whose job it is to rewrite the archaic code that normally runs on the switchgear controlling the flow of power to the electricity infrastructure. The latter part might seem like overkill, but Hamilton pointed to the Super Bowl XLV power outage as proof of what can happen when an electrical emergency strikes and the switchgear does what it’s normally programmed to do — drop offline fast to avoid potential damage to the expensive generator.

Rather than protecting a generator, Hamilton said, “Our goal is to keep the servers running.”

Resource utilization

Companies of all types have been struggling for years with the issue of using their resources efficiently, because they buy enough servers to ensure they can handle peak workloads and then keep them idle the rest of the time. And while turning off servers when they’re idle saves a little money on power, it doesn’t change the fact that they were purchased in the first place. In fact, resource utilization is by far the biggest lever that AWS has when it comes to reducing costs, Hamilton said.

When you’re running at web scale, he added, “Anything that can change this number, even microscopically, is worth a lot of money.”

Luckily, being a cloud provider lets you get well above the usual 20 percent utilization number just by nature. For starters, because AWS is constantly running “a combination of non-correlated workloads,” Hamilton explained, resource utilization naturally levels itself out. (Think, at a high level, of a chart showing peak workload times for various industries throughout the year, where retail would spike around the holidays, accounting firms would spike around tax day and other users would spike at other times of year.)
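
A toy simulation makes the smoothing effect easy to see; every number in it is invented purely for illustration.

```python
# Toy illustration of why pooling non-correlated workloads raises utilization.
# Every number here is invented; each "customer" just peaks in a different month.
import random

random.seed(1)
MONTHS = 12
CUSTOMERS = 1000

def demand(customer, month):
    """Baseline load of 1 unit, plus a seasonal spike in a customer-specific month."""
    peak_month = customer % MONTHS          # retail peaks in December, tax firms in April, ...
    spike = 4 if month == peak_month else 0
    return 1 + spike + random.random()      # plus a little jitter

# Utilization if each customer provisions hardware for its own peak:
solo_util = []
for c in range(CUSTOMERS):
    loads = [demand(c, m) for m in range(MONTHS)]
    solo_util.append(sum(loads) / (MONTHS * max(loads)))

# Utilization if one provider provisions for the pooled peak instead:
pooled = [sum(demand(c, m) for c in range(CUSTOMERS)) for m in range(MONTHS)]
pooled_util = sum(pooled) / (MONTHS * max(pooled))

print(f"Average utilization, self-provisioned: {sum(solo_util) / CUSTOMERS:.0%}")
print(f"Utilization of the shared pool:        {pooled_util:.0%}")
# The spikes don't line up, so the pooled peak is small relative to the pooled
# average, and the shared fleet runs far closer to its capacity.
```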

For when demand starts adding up, AWS tracks and automates its supply chain.

And then AWS threw in Spot Instance pricing to make sure that whatever resources weren’t naturally being used would be discounted and hopefully sold at a smaller profit. Any amount customers pay that’s above the cost of powering the servers is worth it in terms of recouping the capital expense, Hamilton said. It’s especially worth it for AWS, which has cut prices 38 times in 7 years and follows the Amazon.com model of making money.
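
The spot-capacity logic reduces to a simple rule: once a server is bought and racked, any bid above its marginal power cost helps recoup the capital expense. The dollar figures in this sketch are invented for illustration.

```python
# Sketch of the spot-pricing economics Hamilton describes: once a server is
# bought and racked, any spot revenue above its marginal cost of power is a win.
# All dollar figures are invented for illustration.

HOURLY_POWER_COST = 0.04   # assumed marginal cost to power one server for an hour
CAPEX_PER_HOUR = 0.14      # assumed purchase cost amortized over 3 years of hours

def worth_selling(spot_price_per_hour: float) -> bool:
    """Sell idle capacity whenever the spot bid beats the marginal power cost."""
    return spot_price_per_hour > HOURLY_POWER_COST

for bid in (0.02, 0.06, 0.20):
    verdict = "sell" if worth_selling(bid) else "leave idle"
    recovered = max(bid - HOURLY_POWER_COST, 0)
    print(f"bid ${bid:.2f}/hr -> {verdict}; ${recovered:.2f}/hr goes toward "
          f"the ${CAPEX_PER_HOUR:.2f}/hr of sunk capex")
```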

If some analysts still can’t recommend buying Amazon stock, he joked, they probably wouldn’t be keen on AWS either: “We think the cloud computing market looks the same way [as e-commerce] … very high volume with very low margins.”

  1. This is a great example of the kind of thing you can only do at massive scale, and it’s one reason to use infrastructure-as-a-service providers for software and actual infrastructure. But once you hit a certain level, it’s still more expensive to run on public clouds like AWS than to run your own hardware for most use cases. It only really makes sense for truly elastic workloads that require bursting or heavy processing for a short period.

    Otherwise, long-running, high-performance instances (e.g., for databases) are significantly more expensive to run than if you bought your own equipment and colocated it in a proper DC.

    1. David –

      I have heard that line of reasoning many times, but I wonder if there is a lack of appreciation for the ability to scale down services on the public cloud. In my experience there are really very few truly constant workloads. Even in so-called steady workloads there are periods of “lull,” perhaps on weekends and nights. With elastic computing you can use a lot more levers to save money, including switching off instances when not being used and using Spot instances. I am curious if your comments are based on hard numbers?

      Thanks for sharing your thoughts.

      1. Hi GP,

        David was mentioning (relational) databases as an example of a cloud service/resource. And these are the kind of resources that you cannot scale down. You have to provision for spikes, which is why they are much more expensive on the Cloud (and often represent a significant amount of Cloud spending).

        We will eventually see relational databases able to scale down (the RDS read replicas are a good example). But right now, the only databases that you can scale down are the NoSQL ones. And not all of them. DynamoDB is a good example of a DB you can scale down.

        So even if you want to, you cannot always scale down services (but there is a lot of automation you can put in place, which saves a lot of human hours otherwise spent monitoring the systems; you still need some monitoring/admin, just much less).

      2. If you can scale up and down then that’s an elastic workload, which is the ideal use case for the cloud, but there’s always going to be a baseline level of traffic and whatever that is, it’s cheaper to buy your own equipment to handle it.

        Some workloads can’t be scaled up and down easily, as Nicolas mentioned. Databases are a good example because they have state and have a process to fail over and shut down – you can’t just pull them out of a load balancer rotation as you could with a web server, for example.

        With Server Density, my company, our traffic is consistent 24/7 and only increases because we take data posted back to us from customer servers. It’s a consistent workload that would be very expensive to run in the cloud.

    2. Not really. Well, if you compare the price of a standalone server or DB and ignore all the data center hardware, facilities, power, cooling and personnel costs, then yes, the prices could be higher (in some cases) in AWS. But when you add up all these costs (and you should, for an apples-to-apples comparison), and compare the costs to run in similar high-quality data centers with high levels of durability, reliability, etc., then AWS always comes out way ahead – it is by far the cheapest. Part of the reason is that it has the lowest cost base, due to the engineering and OEM work that James alluded to in his presentation. Another big reason is Amazon’s model of running on low margins and passing on the savings to customers. It isn’t entirely altruistic – it is more to deter new entrants to the market. But nevertheless it means the best prices for customers. No one else can come close on an apples-to-apples basis. Certainly not old-school vendors like HP or IBM, who do not have the low cost basis and are used to charging prices as high as they can get away with. How many times have you seen on-prem vendors lower prices in the absence of competitive pressure?

      1. I wrote an article with actual cost figures showing how much cheaper dedicated servers are when you have long term instances. http://gigaom.com/2013/11/29/hey-startups-dedicated-hardware-or-gasp-colocation-may-be-better-than-cloud/

        And the followup next week will look at how colo is even cheaper.

  2. Derrick, methinks AWS doth protest too much on this high volume, very low margins story… It seems like high volume, high margins to me, riding Moore’s Law beautifully. Hamilton said we should expect to see spot pricing coming to S3 and other services in the foreseeable future, which would be cool.

  3. no wonder the CIA likes them … and they keep mum on NSA issues…

    trust? hmm .. eroding like everyone else’s?

  4. Low Margin? Can’t anyone do math?

    If you want to explore gross margins of AWS go to the Amazon calculator at: http://calculator.s3.amazonaws.com/calc5.html

    1. On the Compute: Amazon EC2 Instances: Click the green + icon to add a new row. Call it test. Leave 100% utilized in the drop down.
    2. Then click on the Type Gear. Here you can see the hourly charges that Amazon gets for each Virtual machine for IAAS.
    3. Select the OS as Windows or leave Linux (your choice).
    4. Select the Cluster Compute Eight Extra Large. This is 32 Virtual Cores, basically equivalent to a current generation 2 socket server with 64GB of RAM.
    5. Now select Save and Close.

    The price Amazon gets for a fully utilized SINGLE Windows Server image is $2174.04 / month.

    If a fully configured late-model server costs Amazon $3600 (to keep the math simple; they cost much less at Amazon’s scale, but let’s be generous) and a server lasts 36 months (3 years), then the monthly cost for the server is $100.00. Also, according to Amazon, the server represents about 60% of the total infrastructure costs (documented in many Hamilton studies), so the total costs (DC, network, power, etc.) bring the monthly cost of hosting a single cloud server to about $160.00 / month.

    $2174 in monthly income. $160 in monthly costs -> ($2174 – $160) / $2174 = 93% gross margin.

    Clearly this is at 100% utilization. Run the server at 50% utilization and the income drops to about $1087 / month. Still 85% gross margin. Oh yeah, this does not include the prices they charge for networking and storage, which can be added in the AWS calculator to move the margin even further north.

    Amazon is making a killing, and because they are a retail business, most people speciously assume AWS margins are low too…. At reasonable utilization the margins are extraordinary!
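
The same arithmetic as a quick script, using only the assumptions stated above:

```python
# The gross-margin arithmetic above, as a quick script. Assumptions are the
# ones stated: $3,600 server amortized over 36 months, server = ~60% of total
# infrastructure cost, $2,174.04/month list price for the instance.

INSTANCE_REVENUE = 2174.04          # monthly price at 100% utilization
SERVER_COST = 3600 / 36             # $100/month amortized server cost
TOTAL_COST = SERVER_COST / 0.60     # server is ~60% of total infra cost (~$167/month)

full = (INSTANCE_REVENUE - TOTAL_COST) / INSTANCE_REVENUE
half = (INSTANCE_REVENUE / 2 - TOTAL_COST) / (INSTANCE_REVENUE / 2)
print(f"Gross margin at 100% utilization: {full:.0%}")   # ~92%
print(f"Gross margin at 50% utilization:  {half:.0%}")   # ~85%
```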

  5. Matthew Geesling Thursday, November 21, 2013

    Too bad the Obama administration didn’t look at this model when building Obamacare.

    1. Michael Slavitch Monday, December 2, 2013

      He did. It’s called Single Payer.

  6. AWS’s vision of delivering compute per vm instance is simply a very costly model for large enterprises due to their use of very expensive application software (e.g., Oracle, IBM, MS, etc.). These costs usually dwarf server costs.

    If AWS sold physical servers as a private cloud for base workloads and vm’s for marginal loads, that would go a long way to solving two big problems for enterprises: security (no shared compute) and ELA costs.

    With a private cloud, you’re in control of the physical servers. With sound strategic capacity management and supply-chain management practices, private clouds enable operators to capture the value implied by Moore’s Law — and reduce software costs. AWS doesn’t solve the most pressing issues for big enterprises — cost, control (i.e., line-of-accountability), and security.

  7. AWS is usually a rip off compared to the other VPS services, unless you need a large amount of storage.


Comments have been disabled for this post