Netflix lost 218 database servers during AWS reboot and stayed online



Netflix has mastered the art of keeping its Amazon Web Services infrastructure online over the years, but even its engineers were nervous when they learned AWS would be rebooting a significant number of its physical servers to fix a bug in the Xen hypervisor that runs on them.

Mainly, as explained in a blog post Thursday night, Netflix engineers were concerned about their massive Cassandra database cluster. That database is one of the most critical pieces of infrastructure for Netflix’s video streaming service, and it was one of the last to be set up for automatic failover. But the work to make Cassandra resilient paid off during the AWS reboot:

Out of our 2700+ production Cassandra nodes, 218 were rebooted. 22 Cassandra nodes were on hardware that did not reboot successfully. This led to those Cassandra nodes not coming back online. Our automation detected the failed nodes and replaced them all, with minimal human intervention. Netflix experienced 0 downtime that weekend.
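
Netflix’s post doesn’t detail how that automation works (the company’s Cassandra tooling, including Priam, is open source), but the general shape of such a health check is easy to sketch. The short Python snippet below is a hypothetical illustration, not Netflix’s implementation: it parses the output of Cassandra’s nodetool status command and flags any node reporting as down (“DN”) so that a separate provisioning step could replace it.

import subprocess

# Illustrative sketch only -- not Netflix's actual automation. It assumes a
# host with Cassandra's nodetool on the PATH and simply reports nodes that
# show up as down ("DN") so a separate provisioning step can replace them.
def find_down_nodes():
    """Parse `nodetool status` and return the addresses of down ('DN') nodes."""
    output = subprocess.run(
        ["nodetool", "status"], capture_output=True, text=True, check=True
    ).stdout
    down = []
    for line in output.splitlines():
        parts = line.split()
        # Data rows start with a two-letter state code such as UN (up/normal)
        # or DN (down/normal), followed by the node's address.
        if len(parts) >= 2 and parts[0] == "DN":
            down.append(parts[1])
    return down

if __name__ == "__main__":
    for address in find_down_nodes():
        # A real pipeline would terminate the dead instance and launch a
        # replacement here rather than just logging the address.
        print(f"Node {address} is down; queueing it for replacement")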

Downtime, scheduled or not, is one of the unfortunate realities of cloud computing and probably one of the areas where cloud providers will seek to distinguish themselves in the coming years. Thus far, the various open source tools Netflix has released offer some of the best methods for failure-proofing AWS instances, but I suspect AWS will have to automate some of this for its users in an attempt to keep up with what Google is offering. Third-party software such as the increasingly popular Apache Mesos could help mitigate downtime issues as well, by rebalancing the workloads of failed nodes across the rest of a cluster.

14 Comments

Adam Mesenbrink

Your title is misleading. I would not have read the article if I had known 218 servers were rebooted during maintenance and 22 stayed down. That’s a little on the high side for failures on reboot, but not unbelievable or even impressive. Your article includes no technical details about what automation was used, or how long those 22 servers were down.

PRETHOUGHT

Again the ignorance factor creeps in. When you are in a cloud, it’s a shared resource; the point was that Netflix had no time to prepare for the maintenance window. They use Cassandra; why don’t you go to their website and pull yourself out of ignorance? Hello, they use AWS, which is known to utilize Puppet and Chef, and possibly Rundeck.

mynitor

I’m curious: what was used to automate the replacement of the Cassandra nodes? Do they use Priam or some other automation/OSS tool?

Derrick Harris

They use their own “simian army” of homemade OSS tools.

ElasticJohn

“Downtime, scheduled or not, is one of the unfortunate realities of cloud computing”
Oh yeah, because in non-cloud computing downtime is a non-reality?

PRETHOUGHT

If you are a customer in a large-scale cloud environment, you have no control over downtime, scheduled or not. Whereas if you have everything in house, you alone can control when to take things down, because you have 100% ownership of the infrastructure. That is the point he was trying to make.

ElasticJohn

Really? So in your own data center you get to choose when disaster strikes? Can you also choose when a disruptive security patch needs to be applied? Isn’t designing for resilience needed either way if you have anything business-critical?

Derrick Harris

For what it’s worth, I’ve never advocated not using cloud for fear of downtime. But it is a little different in that certain things, such as the timing and duration of outages, are almost entirely out of your control. The point of highlighting what Netflix does is to show that good planning can help mitigate any negative effects.

Gerardo Dada

The title is incorrect. According to the post 218 servers rebooted. Only 22 were lost.

It would be interesting to know how the Netflix team used replication and backups to achieve consistency in spite of losing 22 DB nodes (out of 2,700+).

Derrick Harris

Thanks for the comment. The 218 were lost as production nodes while rebooting; the 22 that didn’t come back online had to be programmatically rebooted.

What Netflix didn’t note is how long those 22 were down in comparison to the average reboot time for the ones that rebooted successfully in the first place.

Christos Kalantzis

The nodes that rebooted successfully came back within 10 minutes each. The 22 that were lost were not rebooted, but replaced completely. Those took anywhere from 30 minutes to 2 hours each, depending on data size, to rejoin the cluster.
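
For anyone curious why the rejoin time tracks data size: in a stock Cassandra deployment (this describes the standard mechanism, not necessarily Netflix’s exact process), a replacement node is started with the cassandra.replace_address option, which makes it claim the dead node’s token ranges and stream that data from the surviving replicas before it serves traffic. Below is a rough, hypothetical sketch; the address, config path and service name are placeholders that vary by installation.

import subprocess

# Hypothetical illustration of the stock Cassandra node-replacement flow, not
# Netflix's tooling. The IP address, config path and service name below are
# placeholders that depend on the particular installation.
DEAD_NODE_IP = "10.0.0.12"

# The replace_address flag tells the fresh node to take over the dead node's
# token ranges and stream their data from replicas before serving traffic;
# that streaming step is why rejoin time grows with the amount of data stored.
with open("/etc/cassandra/cassandra-env.sh", "a") as env_file:
    env_file.write(
        f'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address={DEAD_NODE_IP}"\n'
    )

subprocess.run(["service", "cassandra", "start"], check=True)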

pradipshankar

Presumably the 30 minutes or 2 hours did not matter, since the data on those servers was replicated?

Anonymous

“entirelye”

Before the hounds tear you apart for “failing to proof read”
