
Summary:

At the AWS re:Invent conference, engineers from Pinterest, Flipboard and Yelp detailed some of the strategies their companies use to keep costs low as computing demand increases. The keys are keeping an eagle eye on usage and using the right types of resources.

Yelp chart (photo: Derrick Harris)

Amazon Web Services can be a great platform for startups when they're small, but costs can outpace revenue growth pretty quickly, especially if you're offering a free consumer service. At AWS's re:Invent user conference last week, engineers from Pinterest, Flipboard and Yelp shared their impressive and sometimes ingenious techniques for keeping costs under control and their bottom lines healthy.

Pinterest Operations Engineer Ryan Park had the stage to himself for a session on Wednesday, while Flipboard Chief Architect Greg Scallan and Yelp Engineering Manager Jim Blomo teamed up with Kleiner Perkins Caufield & Byers Partner Ray Bradford to form a trifecta of wisdom on Thursday.

Know — and measure — your costs

Flipboard's Scallan had a paradoxical lesson for the audience when it comes to managing cloud-based infrastructure: Embrace the cloud, but be afraid of the cloud. Yes, it's flexible and affordable if done right, but all it takes is poor planning or a handful of servers left running ad infinitum, and the costs can begin to grow out of control. That's why Flipboard gives members of its engineering team the title of "chief miser," which means they are responsible for ensuring that applications use the right resources and use them wisely.

Thanks to a variety of practices, including its miserly ways, Scallan said Flipboard is now running about 900 instances at any given time. That’s down from a peak of about 1,500.

Some stats on Flipboard’s AWS usage

One way to help ensure this sort of lean operation is to understand your business inputs and outputs, Kleiner Perkins's Bradford explained. He suggests companies ask, for example, what it costs them to serve a free user on their platform, and how that cost changes with scale or affects the experience they can offer premium users. Pick metrics that really matter, he said (e.g., infrastructure cost per user per month), and then consider how long your current architecture can sustain that cost before it's time to retool.
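Bradford's metric is simple arithmetic, and it pairs naturally with a rough "runway" estimate. The sketch below is a minimal illustration, assuming the bill grows at a fixed monthly rate and that "time to retool" means the bill crossing a fixed budget; the function names and all figures are hypothetical, not anything the speakers presented.

```python
from typing import Optional

# Illustrative sketch: function names and figures are hypothetical, not from the talks.


def cost_per_user_per_month(monthly_infra_cost: float, monthly_active_users: int) -> float:
    """Infrastructure spend for a month divided by the users served that month."""
    if monthly_active_users <= 0:
        raise ValueError("monthly_active_users must be positive")
    return monthly_infra_cost / monthly_active_users


def months_until_retool(monthly_infra_cost: float, monthly_growth_rate: float,
                        monthly_budget: float, max_months: int = 120) -> Optional[int]:
    """Months until a bill growing at a fixed compound rate exceeds a budget.

    The fixed growth rate and hard budget ceiling are simplifying assumptions.
    Returns None if the budget still holds after max_months.
    """
    cost = monthly_infra_cost
    for month in range(max_months + 1):
        if cost > monthly_budget:
            return month
        cost *= 1 + monthly_growth_rate
    return None


print(cost_per_user_per_month(50_000, 2_000_000))  # 0.025 dollars per user per month
print(months_until_retool(50_000, 0.10, 100_000))  # 8
```

With these made-up numbers, a $50,000 monthly bill growing 10 percent a month passes a $100,000 ceiling in month 8, which is the kind of horizon Bradford suggests weighing against the cost of re-architecting.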

The secret weapon: Source your instances wisely

Pinterest, Yelp and Flipboard all swear by AWS's prepaid Reserved Instances to save money over the long haul. In fact, Flipboard's Scallan said, the e-reading startup sees cost savings of about 80 percent over three years by using Heavy Utilization Reserved Instances instead of on-demand instances for its base workloads, and the break-even point can come after only eight or nine months. Pinterest's Park cited savings of about 70 percent over three years using them.
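The break-even and savings figures come from comparing a one-time upfront fee plus a lower hourly rate against the full on-demand rate. Here is a minimal sketch of that comparison, assuming an instance that runs around the clock and using hypothetical prices (not AWS's actual rates or the companies' numbers):

```python
# Illustrative sketch with hypothetical prices; real Reserved Instance pricing
# depends on instance type, region, term length and utilization tier.
HOURS_PER_YEAR = 8760
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730 hours


def break_even_hours(upfront: float, reserved_hourly: float, on_demand_hourly: float) -> float:
    """Hours of running time after which the reservation beats on-demand pricing."""
    hourly_saving = on_demand_hourly - reserved_hourly
    if hourly_saving <= 0:
        raise ValueError("reserved hourly rate must be below the on-demand rate")
    return upfront / hourly_saving


def savings_over_term(upfront: float, reserved_hourly: float,
                      on_demand_hourly: float, years: int = 3) -> float:
    """Fractional saving vs. on-demand for an instance running 24/7 for the whole term."""
    hours = years * HOURS_PER_YEAR
    reserved_total = upfront + reserved_hourly * hours
    on_demand_total = on_demand_hourly * hours
    return 1 - reserved_total / on_demand_total


hours = break_even_hours(upfront=2000, reserved_hourly=0.10, on_demand_hourly=0.50)
print(round(hours / HOURS_PER_MONTH, 1))              # 6.8 months
print(round(savings_over_term(2000, 0.10, 0.50), 2))  # 0.65
```

The 24/7 assumption matters: the savings only materialize for steady base workloads, which is exactly where the speakers said they apply reservations.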


The trick is queuing another job to take up the waste.

Yelp's Blomo said his company is a heavy Elastic MapReduce (EMR) user, peaking at more than 350 EMR instances when many developers run their Hadoop jobs simultaneously or when it's doing nightly analysis of its log files. To keep costs in check, Yelp uses Reserved Instances whenever possible to lower its hourly bills and has implemented a job-flow pooling system that keeps Hadoop jobs running continuously as resources become available. This helps avoid situations where, for example, a job completes in 61 minutes and triggers the charge for a full second hour even though it used only a minute's worth of that hour.
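To see why pooling helps under hour-granular billing, compare launching a fresh cluster per job with running the same jobs back to back on one long-lived cluster. This is a minimal sketch, assuming whole-hour billing rounded up and equally sized clusters; the functions and prices are hypothetical and are not Yelp's implementation.

```python
import math

# Illustrative sketch: assumes each instance is billed in whole hours, rounded up.
# Function names and prices are hypothetical, not Yelp's pooling system.


def billed_hours(runtime_minutes: float) -> int:
    """Whole hours charged for a cluster that runs for runtime_minutes."""
    return max(1, math.ceil(runtime_minutes / 60))


def cost_separate_clusters(job_minutes, instances_per_cluster, hourly_rate):
    """Each job launches its own cluster and shuts it down when it finishes."""
    hours = sum(billed_hours(m) for m in job_minutes)
    return hours * instances_per_cluster * hourly_rate


def cost_pooled_cluster(job_minutes, instances_per_cluster, hourly_rate):
    """Jobs run back to back on one pooled cluster, so partial hours get reused."""
    return billed_hours(sum(job_minutes)) * instances_per_cluster * hourly_rate


jobs = [61, 61, 58]  # minutes
print(cost_separate_clusters(jobs, 10, 0.50))  # 25.0
print(cost_pooled_cluster(jobs, 10, 0.50))     # 15.0
```

The trade-off is latency: pooled jobs may wait for a free cluster instead of starting immediately.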

To gauge which type of instance to use when, Yelp created a tool called EMRio that analyzes past usage to determine which resources are the most efficient choice for any given job.


The results of EMRio
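The article doesn't describe EMRio's internals, but the core question (how many instances to reserve given historical usage) can be sketched as a brute-force search. This sketch assumes each reservation has an effective hourly cost (upfront fee amortized plus hourly fee) paid every hour, and that any demand above the reserved level runs on demand. The cost model, names and rates are hypothetical.

```python
from typing import Sequence

# Illustrative sketch in the spirit of EMRio; the cost model and rates are
# hypothetical assumptions, not EMRio's actual algorithm.


def period_cost(hourly_usage: Sequence[int], reserved: int,
                reserved_effective_hourly: float, on_demand_hourly: float) -> float:
    """Cost of covering observed hourly instance counts with `reserved` reservations."""
    reserved_cost = reserved * reserved_effective_hourly * len(hourly_usage)
    overflow_hours = sum(max(0, used - reserved) for used in hourly_usage)
    return reserved_cost + overflow_hours * on_demand_hourly


def best_reserved_count(hourly_usage: Sequence[int], reserved_effective_hourly: float,
                        on_demand_hourly: float) -> int:
    """Try every reservation level from 0 to peak usage and return the cheapest."""
    return min(range(max(hourly_usage) + 1),
               key=lambda k: period_cost(hourly_usage, k,
                                         reserved_effective_hourly, on_demand_hourly))


# One hypothetical day: 10 instances for 16 hours, then a 50-instance batch window.
usage = [10] * 16 + [50] * 8
print(best_reserved_count(usage, reserved_effective_hourly=0.20, on_demand_hourly=0.50))  # 10
```

Reserving an extra instance pays off only if demand reaches that level in a larger fraction of hours than the reserved-to-on-demand price ratio (0.4 here). The nightly peak covers only a third of the day, so it stays on demand.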

When it comes to optimizing costs on AWS, though, Pinterest appears to have it all figured out — even how to make use of the somewhat tricky Spot Instances that are priced based on demand and can be terminated without notice if the market price outgrows a user’s bid. Park explained how Pinterest uses the heck out of Reserved Instances and created its own auto-scaling “watchdog” service that decides whether to use Spot Instances or on-demand instances when more resources are required.
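Park didn't publish the watchdog's logic in this session, so the following is only a hypothetical decision rule showing the kind of choice such a service makes. It assumes that only interruption-tolerant workloads (say, load-balanced front-end replicas) may go to Spot, and only while the spot price sits below both the bid and the on-demand price.

```python
from dataclasses import dataclass

# Hypothetical decision rule; not Pinterest's actual watchdog implementation.


@dataclass
class Workload:
    name: str
    interruption_tolerant: bool  # can it lose instances without user-visible damage?


def choose_market(workload: Workload, spot_price: float, bid: float,
                  on_demand_price: float) -> str:
    """Return "spot" or "on-demand" for the next batch of instances."""
    if not workload.interruption_tolerant:
        return "on-demand"
    # Spot only makes sense if the request can be filled at our bid
    # and the spot price is still below the on-demand price.
    if spot_price < bid and spot_price < on_demand_price:
        return "spot"
    return "on-demand"


web = Workload("frontend", interruption_tolerant=True)
print(choose_market(web, spot_price=0.03, bid=0.10, on_demand_price=0.32))  # spot
print(choose_market(web, spot_price=0.50, bid=0.10, on_demand_price=0.32))  # on-demand
```

Falling back to on-demand during a price spike keeps capacity available at a higher price, which is the trade Spot users accept for the usual discount.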

Ryan Park dropping knowledge — and graphs

Although Spot Instance prices occasionally spike through the roof, Park’s experience is that they typically remain stable and can result in “massive” savings if you know how to use them effectively. Using Spot Instances to power Pinterest’s approximately 80 front-end servers costs only about $20 per hour, he said. All told, Pinterest has reduced its daily computing bill to about $440 from about $1,200.
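Park's figures are approximate, but they are easy to put in per-server and percentage terms:

```python
# Approximate figures quoted by Park in the session.
frontend_servers = 80
spot_cost_per_hour = 20.0   # dollars per hour for the whole front-end tier
daily_bill_before = 1200.0  # dollars per day
daily_bill_after = 440.0    # dollars per day

per_server_hour = spot_cost_per_hour / frontend_servers     # 0.25 dollars per server-hour
daily_saving = daily_bill_before - daily_bill_after         # 760 dollars per day
percent_reduction = 100 * daily_saving / daily_bill_before  # roughly 63 percent

print(per_server_hour, daily_saving, round(percent_reduction, 1))  # 0.25 760.0 63.3
```

In other words, each front-end server costs about a quarter an hour on Spot, and the daily bill shrank by roughly two-thirds.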

All this being said, Park, Blomo and Scallan all acknowledged that the flexibility of mixing on-demand, reserved and spot servers might not be all it's cracked up to be if you don't understand how they all work. Reserved Instances are locked to an instance size and region once you reserve them, and Spot Instances must be reserved for jobs or applications that can handle their easy-come, easy-go nature. And now there's even more to consider, because Reserved Instances can be resold via AWS's Reserved Instance Marketplace.

“It gets a little tricky,” Blomo said.

Pick your challenges

Although decisions such as database type and structure are largely architectural, there might be elements of cost efficiency at play as well. Kleiner Perkins's Bradford may have put it best while leading off the session with Scallan and Blomo. He presented a slide containing a simple quote from Instagram co-founder Mike Krieger: "Your users around the world don't care that you wrote your own database." Sometimes, Bradford added, it might be best to use what works, maybe even a managed service, rather than whatever's trending highest on Hacker News.

Pinterest's Park expressed a similar sentiment during his session, citing a lesson his team learned about trying out too many new databases. The site once ran MongoDB, Cassandra, Redis and other databases simultaneously, but learning all the new technologies and managing them became burdensome. Now, he said, Pinterest uses good, old-fashioned MySQL (granted, split into some 4,000 shards) and memcached, as well as Redis, because they have strong communities and new engineers are more likely to know how to work with them.

After explaining EMRio and some other custom-built Hadoop tools to the crowd, Yelp’s Blomo noted that companies should carefully consider whether the time and money it takes to build stuff will actually result in commensurate savings once those tools or systems are in production. That can require some tough balancing of criteria such as cost, performance, flexibility and user experience.
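Blomo's build-versus-buy test boils down to a payback period: engineering cost divided by the net monthly savings the tool produces. The sketch below assumes a flat cost per engineer-week and steady monthly savings and maintenance; every number in it is hypothetical, since the article gives no figures for Yelp's tooling.

```python
from typing import Optional

# Illustrative sketch; all inputs are hypothetical assumptions.


def payback_months(engineer_weeks: float, cost_per_engineer_week: float,
                   monthly_savings: float, monthly_maintenance: float = 0.0) -> Optional[float]:
    """Months for a custom tool to repay its build cost, or None if it never does."""
    build_cost = engineer_weeks * cost_per_engineer_week
    net_monthly = monthly_savings - monthly_maintenance
    if net_monthly <= 0:
        return None  # upkeep eats all the savings
    return build_cost / net_monthly


print(payback_months(6, 4000, 3000, 500))  # 9.6 months
print(payback_months(6, 4000, 500, 500))   # None
```

Counting maintenance matters: a tool that saves money but needs constant attention can quietly fail this test, which is Bradford's "no free lunch" point about developer time.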

But it’s important to use human resources wisely. As Bradford said during his presentation, “There’s no free lunch when it comes to developer time.”


  1. Vinod Shintre Monday, December 3, 2012

    Not surprised by the proposed strategy, but I'm not sure all companies have the muscle the big ones have to do it in-house; it has to be engraved in the engineering culture.

    http://www.attribo.com

    1. Would be great to see more details about that “custom autoscaling” solution mentioned in the article. Even the built-in AWS autoscaler could hook-in to your preference toward spot instances — THAT would be something.

      1. I believe Ryan from Pinterest mentioned something about AWS auto-scaling getting better all the time, perhaps even on this front. Either way, I assume more fine-grained controls will come in time.

  2. It would be great to have the ability to autoscale into spot instances built into the existing AWS API.

    If AWS doesn’t provide the service, it could open a niche market for a would-be provider to offer it for the rest of us.

    1. In order to exploit the price differential between spot and on-demand instances, there are platforms coming up which provide a configurable solution. For example, I did this blog post for using Spot Instances with Hadoop ecosystem http://www.qubole.com/blog/index.php/hadoop-auto-scale-ec2-spot-instances/

  3. Great point about not picking the “sexy” thing. I’ve used Redis and other memory key-value stores in the past, and it’s very easy to ignore it and let Redis do its thing. But before you know it, you’re using 100s of gigs of memory AND even more in storage b/c of persistence.

    I’ve recently switched from using Redis to just using good ole PostgreSQL as my queue and key-value store. Makes development easier, and gets the job done.

  4. Great tips. Using spot instances on a daily basis would be a little nerve-racking for some. Knowing your servers could be gone at any time!

    htp://www.stackify.com


  6. Ali Khajeh-Hosseini Friday, December 7, 2012

    I blogged about Pinterest’s use of AWS S3 a while ago, their growth rate is crazy: http://highscalability.com/blog/2012/11/1/cost-analysis-tripadvisor-and-pinterest-costs-on-the-aws-clo.html
