Austin, Texas-based startup Spanning has embraced the concept of cloud computing so much that its product is a backup service for Google Apps, completely hosted and run from Amazon Web Services. The idea of backing up one cloud service via another was intriguing enough that I asked Mike Pav, the VP of engineering at Spanning, how he does it.
Spanning charges people or businesses $30 a year to back up Google Apps, including email and documents, in case the user accidentally deletes or loses them. Google will support users if it loses their data, but it won’t go searching for your files if you mess up.
Spanning CEO Charlie Wood is confident enough that Google won’t get into extended services like backup that he’s sticking with this, although he’s also looking for new lines of business as the company continues to grow. To that end, the company is seeking its next round of funding after having raised a $2 million Series A round last April.
Building a backup cloud in the cloud
Building a cloud-based backup for a cloud service requires a devotion to reliability and planning for worst-case scenarios. Creating a backup service in Amazon Web Services is never done, said Pav, as he explained some of the techniques he’s used to support Spanning while also trying to keep costs in line. “For example, a single point of failure for us was our database, but we just finished up a big project to partition our database,” Pav said. “We have to focus on the path and not the destination, because as far as scalability is concerned, we’ll never be done. That’s our real barrier to entry.”
Spanning adds terabytes of storage each month, and it uses Amazon because it makes automatic scaling seamless. “It would be terrible if we had to rack our own drives into an array to deal with that,” Pav says. Spanning stores all the content on S3 because it guarantees high reliability, but getting the data to S3 can be slow. To compensate, Spanning writes to S3 in parallel, which makes up for the slow individual transfers and also provides an added benefit in terms of scalability and reliability.
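As a rough illustration of the parallel-access idea, the sketch below fans a batch of uploads out across a thread pool and reports which keys failed so they can be retried. It is not Spanning's code: `upload_all`, `fake_put` and the in-memory `store` are hypothetical stand-ins, and a real version would presumably call an S3 client where `fake_put` is used here.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_all(items, put_object, max_workers=8):
    """Upload (key, data) pairs concurrently and return the keys that failed.

    `put_object` is any callable taking (key, data). In production it would
    wrap an object-store client; that wiring is an assumption, not shown here.
    """
    def attempt(item):
        key, data = item
        try:
            put_object(key, data)
            return None          # success
        except Exception:
            return key           # remember the failure so it can be retried

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(attempt, items))
    return [key for key in results if key is not None]


# Hypothetical in-memory stand-in for the object store.
store = {}


def fake_put(key, data):
    if key == "bad":
        raise IOError("simulated upload failure")
    store[key] = data


failed = upload_all([("a", b"1"), ("b", b"2"), ("bad", b"3")], fake_put)
```

Because each upload is independent, one slow or failed transfer does not hold up the rest, which is where the reliability benefit comes from.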
Designing messaging so dying VMs won’t take out your data
Spanning uses Amazon SQS to queue work to a pool of virtual resources that grows and shrinks based on load. Pav’s team has set up Spanning’s application to track the incoming flow of data to EC2. Each time the system is about to back up new content, it checks whether the EC2 instance is about to shut down. If it is, the in-progress backup requeues its unfinished work so another server in the pool can pick it up. That way, the backup doesn’t have to start all over again.
This is important when dealing with potentially large sets of data. Pav says Amazon offers several different models for queue management, but simplicity and scalability are the driving features for Spanning. “When you’re dealing with large data sets for a large number of users, you can’t afford to do anything twice.”
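The requeue-on-shutdown pattern Pav describes might look something like the sketch below. It uses Python's standard `queue.Queue` as a stand-in for SQS, and every name in it (`run_worker`, `backup_item`, the job layout, the shutdown check) is an assumption made for illustration, not Spanning's actual design.

```python
import queue


def run_worker(work_queue, backup_item, shutting_down, done):
    """Process jobs until the queue is empty or a shutdown is signalled.

    Before each item the worker asks `shutting_down()`. If the answer is yes,
    it puts the *remaining* items back on the queue and exits, so the next
    worker resumes where this one stopped instead of starting over.
    """
    while True:
        try:
            job = work_queue.get_nowait()
        except queue.Empty:
            return
        remaining = list(job["items"])
        while remaining:
            if shutting_down():
                # Hand the unfinished part of the job back to the queue.
                work_queue.put({"user": job["user"], "items": remaining})
                return
            backup_item(job["user"], remaining.pop(0))
        done.append(job["user"])


# Hypothetical demo: the first worker "dies" after two items.
backed_up = []


def backup_item(user, item):
    backed_up.append((user, item))


q = queue.Queue()
q.put({"user": "alice", "items": ["doc1", "doc2", "doc3", "doc4"]})
done = []
run_worker(q, backup_item, lambda: len(backed_up) >= 2, done)  # interrupted
run_worker(q, backup_item, lambda: False, done)                # finishes job
```

In the demo each document is copied exactly once even though the first worker stops partway through, which is the "don't do anything twice" property Pav is after.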
Don’t do anything Amazon will do for you
Spanning uses Amazon Relational Database Service (RDS) for its persistent database storage, although RDS limits how much data Spanning can store and how much throughput it can support on any single database instance. Pav admits this limits his partitioning strategies, but he’s willing to work within those limits, because it cuts his need to support and build his own data store.
“We want to get out of the business of spending time managing these things. We can solve this problem at the application’s architectural level to make sure it scales,” he said. “RDS may not be the highest-performance option, but we are able to reduce investment into something that’s not core to our business and by making good application level architectural decisions we can render the RDS performance issue moot.”
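The article does not say how Spanning partitions its data, but one common way to work around per-instance limits at the application level is to route each user to one of several databases with a stable hash. The sketch below assumes that approach; `SHARDS`, `shard_for` and `database_for` are hypothetical names.

```python
import hashlib

# Hypothetical shard names; real entries would be database endpoints.
SHARDS = ["db-0", "db-1", "db-2", "db-3"]


def shard_for(user_id, num_shards):
    """Map a user ID to a shard index with a hash that is stable across runs."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def database_for(user_id):
    """Return the (hypothetical) database that holds this user's records."""
    return SHARDS[shard_for(user_id, len(SHARDS))]


# The same user always lands on the same database.
first = database_for("alice@example.com")
second = database_for("alice@example.com")
```

Simple modulo hashing keeps the routing logic tiny, but changing the number of shards later means moving data between them, so this is a trade-off rather than a free win.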
Amazon has not just changed the economics of building an IT service; it also helps Pav make his product better and faster at less cost to him and his team. Pav notes that because of the reliability of Spanning on Amazon and his confidence that user data won’t be lost, he deploys new code when features are ready, and often in the middle of the day when his team is fresh. This is a big shift from the older days of waiting until late at night, when theoretically fewer users are online to feel any disruptions.
Of course, Pav points out that with a large customer base all over the world and a growing one in North America, there really is no more middle of the night in today’s distributed world.