Cycle Computing and Schrödinger have broken their own record for running a massive HPC cluster on Amazon Web Services. Last year, the two software companies spun up a 50,000-node Amazon Web Services cluster for a complex computational drug design application. Now, Cycle Computing, which divvies up workloads to run across AWS regions and zones, has been able to run and manage Schrödinger’s quantum chemistry software on a whopping 156,000 cores across 8 AWS regions.
The result? Researchers were able to sort through thousands of compounds in less than a day and for about $33,000 in total infrastructure cost. The goal was to help Dr. Mark Thompson, a professor of chemistry at the University of Southern California, find the right compound to build a new generation of inexpensive and highly efficient solar panels.
The problem of pinpointing the right material for the right job is immense. “If the 20th century was the era of silicon, the 21st century will be all about organic compounds but there are so many of them, to find the right compound, it could take the whole century,” said Jason Stowe, CEO of New York-based Cycle Computing, in an interview. (To hear more from Stowe about this project, check out the Structure Show podcast.)
The combo of Cycle and Schrödinger’s software running atop AWS infrastructure cuts that turnaround time and the considerable expense of conventional compute power. This project harnessed 156,314 cores and 16,788 instances to sort through 205,000 compounds in less than 18 hours, according to the companies.
Cycle had to get a bit creative in meeting the new workload requirements. “In the past we used open-source schedulers like Condor, but we hit scale limits so in this case we wrote our own task distribution system called Jupiter. It slates millions of tasks to hundreds of thousands of cores and our goal is to extend that,” Stowe said.
Jupiter was built to handle failures at these scales, a crucial factor. If you have tens of thousands of instances and more than 100,000 cores, you can be sure they won’t all configure correctly. There will be outages and failures somewhere, so Jupiter takes all that into account.
“We intentionally killed some regions to make sure it would still work — it’s sort of like Chaos Monkey,” said Stowe, referring to the popular Netflix tool to test the resiliency of AWS services under stress.
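To make the idea concrete, here is a minimal Python sketch of a dispatcher that requeues a task when the region it was sent to turns out to be down. It is an illustration only, not a description of how Jupiter works; the function name, the region labels and the simple failure model are all hypothetical.

```python
import random
from collections import deque


def run_with_requeue(tasks, regions, dead_regions, seed=0):
    """Place every task on a healthy region, requeueing on failure.

    Hypothetical model: the dispatcher does not know in advance which
    regions are down. It learns a region is bad only when a task sent
    there fails, then stops sending work to it.
    """
    rng = random.Random(seed)
    known_bad = set()      # regions the dispatcher has seen fail
    queue = deque(tasks)   # tasks still waiting to be placed
    placed = {}            # task -> region that ran it
    retries = 0

    while queue:
        candidates = [r for r in regions if r not in known_bad]
        if not candidates:
            raise RuntimeError("no healthy regions left")

        task = queue.popleft()
        region = rng.choice(candidates)

        if region in dead_regions:
            # The task failed: remember the bad region and try again later.
            known_bad.add(region)
            queue.append(task)
            retries += 1
            continue

        placed[task] = region

    return placed, retries


if __name__ == "__main__":
    regions = [f"region-{i}" for i in range(1, 9)]  # eight made-up regions
    placed, retries = run_with_requeue(
        range(1000), regions, dead_regions={"region-3", "region-6"}
    )
    print(len(placed), "tasks placed,", retries, "retries")
```

Because a failed region is dropped from the candidate list after its first failure, the number of retries can never exceed the number of dead regions, and no work is lost as long as at least one region stays up.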
This search for the right compound represented 264 compute-years of work. Had the researchers bought the resources needed for this project, it would have cost $68 million, compared with about $33,000 on AWS. Or, if the same numbers were crunched on an existing internal 300-core cluster, it would have cost $132,000 and taken more than ten months, according to Cycle.
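Those figures hang together on a quick back-of-envelope check. The Python sketch below uses only numbers quoted in this article; the 8,760-hour year and the assumption that work spreads evenly across cores are simplifications, and the per-core-hour figure is an implied average rather than anything Cycle or AWS published.

```python
HOURS_PER_YEAR = 365 * 24        # 8,760 hours, ignoring leap years

# Figures quoted in the article
compute_years = 264              # total work in the run
internal_cores = 300             # size of the existing in-house cluster
cloud_cost = 33_000              # approximate AWS infrastructure cost, USD
purchase_cost = 68_000_000       # cost to buy equivalent resources, USD
internal_cost = 132_000          # cost to run on the in-house cluster, USD

# Total work expressed in core-hours
core_hours = compute_years * HOURS_PER_YEAR

# Wall-clock time on 300 cores, assuming perfectly even distribution
internal_months = compute_years / internal_cores * 12

# Implied average cloud cost per core-hour (an estimate, not a quoted price)
cost_per_core_hour = cloud_cost / core_hours

print(f"Total work: {core_hours:,} core-hours")
print(f"Time on {internal_cores} cores: {internal_months:.2f} months")
print(f"Implied cloud cost: ${cost_per_core_hour:.4f} per core-hour")
print(f"Buying vs. cloud: {purchase_cost / cloud_cost:,.0f}x")
print(f"In-house vs. cloud: {internal_cost / cloud_cost:.0f}x")
```

The 300-core estimate works out to roughly 10.6 months, in line with Cycle's "more than ten months."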
Of course, this news is music to Amazon’s ears, coming, probably not coincidentally, just as the company kicks off its second annual AWS re:Invent conference in Las Vegas. The show is meant to showcase AWS capabilities in all manner of serious computing use cases, and HPC is most definitely one of those scenarios.