If you’re tired of hearing about cloud computing and big data, you might want to wear earplugs for the next year or so. These two trends are only going to get hotter, in large part because they’re also becoming ideal bedfellows. This is especially true in the world of science, where the cloud provides a natural platform for crowdsourcing scientific problems to researchers around the world, giving them access to data sets and the computing resources to analyze them.
Generally speaking, we’ve already seen how crowdsourcing can be an effective method for solving big data problems. The Netflix Prize challenge, which concluded in 2009, attracted more than 50,000 participants trying to improve Netflix’s Cinematch algorithm, and today we have Kaggle, an entire company dedicated to hosting competitions for companies trying to crowdsource their own analytical challenges. And it’s the cloud, with its centralized nature and virtually unlimited, on-demand resources, that makes it possible to have so many people access and work with the same data sets at the same time.
It’s true, of course, that big data doesn’t necessarily connote scientific workloads, but scientific workloads do increasingly rely on big data techniques. Some refer to data as the fourth paradigm of science because the sheer amount of data available and the new technologies and techniques for working with it are fundamentally changing how scientists go about their research. This has been going on for quite a while, actually, hence the massive research networks connecting supercomputers and research centers across the world. Researchers needed a way to transfer massive data sets to their peers to run on their systems, so they built networks such as the National LambdaRail, XSEDE and CERN’s Large Hadron Collider network.
However, while this arrangement might work fine for researchers working on projects for national labs or universities, who also happen to have time reserved on supercomputing systems, it’s not entirely democratic. Enter cloud computing. Now, anyone can have access to supercomputer-like processing power and, equally important, centralized data sets that don’t require a 40 Gbps connection to download. Companies such as DNAnexus rely on the cloud to host massive genomic data sets on which scientists can collaborate, and also to power those scientists’ computations on the data.
And although companies such as DNAnexus focus more on collaboration than on crowdsourcing, the tools for crowdsourcing are in place. Today, for example, I read about a company, Life Technologies, which makes semiconductor chips that carry out a variety of genome-sequencing workloads. Life is hosting a competition within its online community to improve the speed, scalability and accuracy of those chips. Contestants will have access to the raw data as well as cloud-based resources for running computations.
Critics can call cloud computing overblown until they’re blue in the face — they might even be right when it comes to certain business applications — but there’s no denying the effects it could have in the scientific world. By giving virtually anybody access to relevant scientific data sets and the resources necessary to analyze them in a timely manner, cloud computing could result in real answers to some previously perplexing questions.