Updated: Hopes were high leading into Saturday’s Comic-Con International ticket sale launch that TicketLeap and its cloud-based ticketing platform would be an availability superhero after two failed rounds of ticket sales in November. Those hopes were dashed nearly immediately, however, as would-be buyers were greeted with over-capacity error messages. Despite speculation that the issue was caused by TicketLeap running too few web servers in its Amazon Web Services infrastructure, I have confirmed with TicketLeap that a known issue with the MySQL database is to blame. The news doesn’t exactly wash away the stain on TicketLeap’s reputation — especially among thrice-scorned Comic-Con fans — but it actually goes a long way toward confirming the wisdom of TicketLeap’s decision to utilize cloud computing.
TicketLeap Vice President of Engineering Keith Fitzgerald explains the issue in great detail in a blog post that will go live at 8 p.m. EST, but the gist is that under heavy Comic-Con load, nearly all of TicketLeap’s database connections got tied up doing DNS resolution. Update: The post-mortem post is live here. As Fitzgerald explains in his post, DNS lookup is a “blocking task” that can slow performance during heavy traffic periods, but TicketLeap uses security features of AWS’s Relational Database Service that negate the need to perform the lookups. Unfortunately for TicketLeap, RDS does not support the standard workaround, called the “skip-name-resolve” flag, used to avoid DNS resolution when it isn’t necessary. Fitzgerald believes the issue might be resolved in MySQL version 5.5, for which AWS just announced support. TicketLeap was using MySQL version 5.1.
The real kicker of Saturday’s failure is that scalability wasn’t the issue at all; in fact, scaling out to more servers just exacerbated the problem. Fitzgerald explains:
As it turns out, the issue was exacerbated by the number of servers. We decided at 9:13 AM PST to drop the number of web servers to 4 and orders began to flow at that time. This worked because the number of DNS lookups MySQL had to perform were reduced and we were able to process ~200 tickets a minute under extremely heavy load. This is certainly not our ideal level of throughput, but we were thrilled to start selling tickets to Comic-Con.
As I reported on Friday, TicketLeap scaled its AWS infrastructure up to 64 web servers in preparation for Saturday’s sale, and a test run in December led to the successful sale of 1,000 tickets in a minute against a traffic load of 50,000 buyers. Demand was so high on Saturday, however, that a decision to add more servers would have meant more DNS lookups and an even slower experience for customers. The ability to automatically scale down actually saved the day, and tickets sold out despite the performance issues.
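The arithmetic behind that trade-off can be sketched in a few lines of Python. This is a toy model, not TicketLeap’s code: only the 64 and 4 web-server counts come from the reporting above, while the pool size, lookup latency and handshake time are made-up illustrative values. It also assumes that every new database connection triggers one blocking lookup and that the database handles them one at a time.

```python
def connections_opened(web_servers, pool_size):
    """Assumption: each web server opens `pool_size` database connections."""
    return web_servers * pool_size


def total_connect_time_ms(new_connections, lookup_ms, handshake_ms, skip_lookup=False):
    """Toy model: time to accept connections one after another.

    Each connection costs the handshake time plus, unless lookups are
    skipped, one blocking DNS lookup.
    """
    per_connection = handshake_ms + (0 if skip_lookup else lookup_ms)
    return new_connections * per_connection


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    POOL_SIZE = 10
    LOOKUP_MS = 50
    HANDSHAKE_MS = 2

    for servers in (64, 4):
        conns = connections_opened(servers, POOL_SIZE)
        with_dns = total_connect_time_ms(conns, LOOKUP_MS, HANDSHAKE_MS)
        without_dns = total_connect_time_ms(conns, LOOKUP_MS, HANDSHAKE_MS, skip_lookup=True)
        print(f"{servers} servers: {conns} connections, "
              f"{with_dns} ms with lookups, {without_dns} ms without")
```

Under these assumptions the time spent on lookups grows in step with the number of web servers, which is why cutting the fleet helped and why skipping the lookups altogether would have helped more.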
Assuming TicketLeap is able to upgrade successfully to MySQL 5.5 on Amazon RDS and put this issue to rest, the question then will be whether its reputation can recover. Foursquare and Digg didn’t suffer much lasting damage after their decisions to use NoSQL databases MongoDB and Cassandra, respectively, led to lengthy outages last year. But the big difference in this case is that events rely on TicketLeap for serious business. Of course, it’s also arguable that sticking with the tried-and-true MySQL database on the proven AWS platform was hardly an imprudent decision. In fact, that decision looks a lot better given that cloud computing saved the day by letting TicketLeap scale down its infrastructure as an ad hoc fix while remaining operational.
For its sake, I hope TicketLeap gets another chance to prove that it can handle a Comic-Con-scale launch, and that it does its homework in advance to make sure nothing goes wrong.
Image courtesy of Flickr user permanently scatterbrained.