The science world was rocked Wednesday by the discovery that the 80 percent of the human genome that scientists thought was “junk” actually contains genetic regulators that can lead to diseases and certain genetic traits. It’s the scientific equivalent of discovering that ugly old dresser is actually a Louis XIV original, except that in this case, that dresser would also be filled with priceless books that might provide even more discoveries.
From my perspective, what was amazing about the outcome of the ENCODE project wasn’t just the results, but the infrastructure that supported it. According to the press release discussing the findings, ENCODE generated more than 15 terabytes of raw data, and the data analysis consumed the equivalent of more than 300 years of compute time. For those living on the edge of the hyperscale world, these numbers may not be all that impressive; after all, Facebook says it takes in 500 terabytes of data a day. But the ENCODE data is shared and accessed by scientists around the world.
And that’s what’s worth thinking about as we try to build economies and organizations around big data. The ENCODE project didn’t just come up with some new truths about our genetic material: It was a global collaboration that required 32 labs to perform more than 1,600 experiments on more than 147 tissue samples, generating data that would then be used to make further discoveries.
Jim Kent, director of the UCSC Genome Browser project and head of the ENCODE Data Coordination Center, laid out some of the challenges associated with making sure experiments were independent, worthwhile and still generating accurate data. From a release announcing the recent findings from ENCODE:
“For Kent and his data coordination team at UCSC’s Center for Biomolecular Science and Engineering, the scale of the project presented many challenges. To start with, they had to coordinate a small army of researchers who were producing data in labs around the world. ‘We had five data wranglers who traveled around to the labs, probably four conference calls a week at the height of it, plus large group meetings twice a year, and countless emails and Skype calls,’ Kent said.”
“The management of data and the processes / QA around it are the biggest problems out there and are tough to follow. Most people struggle massively just with managing all the data, let alone keeping it up to date, etc.,” said Sultan M. Meghji, a VP with Appistry, a company that manages genetic data.
The project doesn’t just offer ways to think about and organize the hunt for and use of gigantic data sets. Researchers also developed software tools to analyze their results. These include new databases for tracking specifics associated with genetic analysis, such as HaploReg and RegulomeDB. There’s also a preconfigured virtual machine designed to host and analyze data generated by the ENCODE project. And of course, the data is open to researchers, and the project’s participants are actively encouraging interested parties to learn how to use the data and contribute via a portal.
So while this is a big data story, a cloud computing story and a science story, it’s also a peek at the future of collaboration and management challenges driven by our connected world. And if that’s not enough to get you excited about the project, let’s return to the science and discoveries about the secrets locked inside the genome.
“The quality and scientific depth of the data is the bigger import. As we move primarily into clinical operations, this kind of data, if it is of high quality and repeatable from a process perspective, is massively useful,” Meghji said.
This article on the genetic link that could help doctors predict which antidepressants are most likely to help cure a person with depression, doing away with the trial-and-error approach in place today, shows what could become the future of medicine. And that’s awesome.