GitHub’s outages early this week stemmed from what was supposed to be a “rather innocuous” database migration that turned out to be anything but. The company was migrating older MySQL databases to a new three-node MySQL cluster, according to a Friday afternoon post to the GitHub blog, when things started to go sideways.
The goal of the work was to streamline failover. In the old setup, failing over from one database to another required a cold start of MySQL; the new architecture does not. What happened instead was that GitHub’s site went down for just under 20 minutes on Monday and again on Tuesday.
The blog post, written by GitHubber Jesse Newland, provides a detailed post-mortem of the events leading up to the snafu, but here’s the gist:
… three primary events contributed to the downtime of the past few days. First, several failovers of the ‘active’ database role happened when they shouldn’t have. Second, a cluster partition occurred that resulted in incorrect actions being performed by our cluster management software. Finally, the failovers triggered by these first two events impacted performance and availability more than they should have.
A complicating factor was that GitHub’s status site, which runs independently on Heroku, experienced availability issues on Tuesday when traffic spiked. GitHub worked with Heroku to add a production database to handle the load and then a database slave to safeguard against similar occurrences.
GitHub’s distributed nature means that an outage at the mothership doesn’t mean all work grinds to a halt. “You can’t pull or push, but you can still make commits and branch to your local repository, then push when it comes back online,” said one GigaOM commenter. “That’s the whole key behind a [distributed version control system]. And even if you need to share files … you can create .patch files and send them through other means.”
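The commenter’s workflow is easy to sketch in plain git. Below is a minimal illustration, with hypothetical repository and patch paths, of continuing to commit and branch locally while the central server is down, then exporting the work as patch files a teammate could apply with `git am`:

```shell
# Hypothetical local repo; everything here works with no network access.
mkdir -p /tmp/offline-demo && cd /tmp/offline-demo
git init -q .
git config user.email "dev@example.com"   # placeholder identity for the demo
git config user.name "Dev"

# Commits land in the local repository, not on GitHub.
echo "hello" > file.txt
git add file.txt
git commit -q -m "Work continues locally during the outage"

# Branching is also purely local.
git checkout -q -b hotfix
echo "fix" >> file.txt
git commit -q -am "Local hotfix commit"

# Share changes without the central server: export the last commit
# as a patch file that can be mailed or copied around.
git format-patch -1 -o /tmp/patches

# A teammate would apply it with:  git am /tmp/patches/*.patch
# Once GitHub is back online:      git push origin hotfix
```

The only steps that actually need the remote are `push` and `pull`; everything else, including the patch export, runs against the local object store.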