If you run applications on Amazon Web Services, wouldn’t you love a tool that could predict when you will need to add or subtract resources, before you run into trouble?
Netflix, which has built a cottage industry out of filling gaps in AWS, is at it again, touting a new predictive auto-scaling engine called Scryer that it says improves upon what Amazon Auto Scaling (AAS) does. AAS is very useful, but it watches metrics in real time and scales instances accordingly. It is less able to see a spike coming down the pike, so to speak.
Netflix says it can be more proactive with Scryer, which takes into account historical workload patterns and builds on that knowledge to scale fast. According to a Netflix blog post (quoted below in list form), Scryer is better than AAS in the following situations.
- Rapid spike in demand: Instance startup times range from 10 to 45 minutes. During that time our existing servers are vulnerable, especially if the workload continues to increase.
- Outages: A sudden drop in incoming traffic from an outage is sometimes followed by a retry storm (after the underlying issue has been resolved). A reactive system is vulnerable in such conditions because a drop in workload usually triggers a down scale event, leaving the system under provisioned to handle the ensuing retry storm.
- Variable traffic patterns: Different times of the day have different workload characteristics and fleet sizes. Some periods show a rapid increase in workload with a relatively small fleet size (20% of maximum), while other periods show a modest increase with a fleet size 80% of the maximum, making it difficult to handle such variations in optimal ways.
So Scryer, backed up with historical data and analytics, can sort of see into the future when it comes to predictable web peaks and valleys.
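To make the idea of history-based scaling concrete, here is a minimal Python sketch. It is not Netflix’s code, and the blog post quoted above does not spell out Scryer’s algorithm. The function names, the same-hour averaging, the per-instance capacity figure, the headroom factor and the one-hour lead time are all assumptions made for illustration.

```python
import math
from statistics import mean


def forecast_hourly_load(history):
    """Average the load seen at each hour of the day across past days.

    history: list of days, each a list of 24 hourly request rates.
    (Same-hour averaging is an assumed, deliberately naive predictor.)
    """
    return [mean(day[hour] for day in history) for hour in range(24)]


def plan_capacity(forecast, rps_per_instance, headroom=1.2, lead_hours=1):
    """Return the fleet size to request at each hour of the day.

    Because instances take time to start, the size requested at hour h
    also covers what the forecast says is needed at hour h + lead_hours.
    rps_per_instance, headroom and lead_hours are hypothetical knobs.
    """
    needed = [
        max(1, math.ceil(load * headroom / rps_per_instance))
        for load in forecast
    ]
    return [
        max(needed[hour], needed[(hour + lead_hours) % 24])
        for hour in range(24)
    ]


if __name__ == "__main__":
    # Two made-up days of traffic with an evening peak at 20:00.
    day_one = [100] * 24
    day_two = [100] * 24
    day_one[20] = 1000
    day_two[20] = 1400

    forecast = forecast_hourly_load([day_one, day_two])
    plan = plan_capacity(forecast, rps_per_instance=100)
    for hour in (18, 19, 20, 21):
        print(f"{hour:02d}:00 -> request {plan[hour]} instances")
```

The point of the sketch is the lead time: the plan asks for the evening fleet an hour before the peak arrives, which a system that only watches live metrics cannot do.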
Some AWS watchers agreed that Scryer could come in handy in some, but not all, circumstances. Peter Eddy, a software engineer at Boston-based Gazelle, said Scryer could be very useful for the many companies that have predictable load variations over time. For example, Gazelle has high traffic during the day but very little between 9pm and 6am. “We could save money by lowering capacity on some of our customer-facing servers,” he said.
David Mytton, CEO of Server Density, said Scryer demonstrates the benefits of using predictive learning to deal with scaling, as opposed to auto scaling, which deals in absolute numbers and/or monitoring data.
“The problem is it takes time to build up history and analyze the patterns so you’d need to have a mature infrastructure with regular traffic patterns to use it. This is why they back it up with the Amazon Auto Scaling. Netflix is big enough and has been around long enough to have enough history to work with. Startups and applications with less traffic might not find it so useful.”
Indeed, Netflix itself uses both, with Scryer handling the predictable load and AAS serving as a backstop, according to the blog post.
“If we are able to predict the workload of a cluster in advance, then we can proactively scale the cluster ahead of time to accurately meet workload needs. But there will certainly be cases where Scryer cannot predict our needs, such as an unexpected surge in workload. In these cases, AAS serves as an excellent safety net for us, adding instances based on those unanticipated, unpredicted needs.”
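One simple way to picture that safety-net arrangement is to compute a predicted fleet size and a reactive one, then run whichever is larger. The Python sketch below assumes exactly that; it is an illustration, not a description of how Netflix wires Scryer and AAS together. The function names and the 60% CPU target are hypothetical.

```python
def reactive_target(current_instances, cpu_percent, target_percent=60):
    """Fleet size that would bring average CPU back to the target.

    Uses integer ceiling division so the result is exact.
    The utilization-based rule and its 60% target are assumptions.
    """
    scaled = current_instances * cpu_percent
    return max(1, (scaled + target_percent - 1) // target_percent)


def desired_instances(predicted, current_instances, cpu_percent,
                      target_percent=60):
    """Combine both signals: never run fewer than either one asks for."""
    reactive = reactive_target(current_instances, cpu_percent, target_percent)
    return max(predicted, reactive)


if __name__ == "__main__":
    # Unexpected surge: live metrics ask for more than the forecast did.
    print(desired_instances(predicted=12, current_instances=10, cpu_percent=90))
    # Sudden drop (for example during an outage): the forecast holds the floor.
    print(desired_instances(predicted=12, current_instances=10, cpu_percent=30))
```

Under this assumed rule, the reactive side covers surges the forecast missed, while the forecast keeps a sudden dip in traffic from shrinking the fleet just before a retry storm.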
Netflix promised more info to come on Scryer, and it’s probably a safe bet that Scryer, like Asgard and Chaos Monkey before it, will end up getting open-sourced as part of the Netflix Open Source Software (OSS) effort.