Google, Stanford say big data is key to deep learning for drug discovery

A team of researchers from Stanford University and Google has released a paper highlighting a deep learning approach they say shows promise in the field of drug discovery. What they found, essentially, is that more data covering more biological processes seems like a good recipe for uncovering new drugs.

Importantly, the paper doesn’t claim a major breakthrough that will revolutionize the pharmaceutical industry today. It simply shows that analyzing a whole lot of data across a whole lot of different target processes (in this case, 37.8 million data points across 259 tasks) seems to work measurably better for discovering possible drugs than does analyzing smaller datasets and/or building models that each target a single task. (Read the Google blog post for a higher-level, but still very condensed, explanation.)
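To make "one model, many tasks" a little more concrete, here is a minimal sketch of one common way to set up a multitask network: a hidden layer shared by every task, plus a small output "head" per task. This is an illustration under assumptions, not the researchers' architecture. The class name MultitaskNet, the helper functions, the layer sizes and the random untrained weights are all hypothetical.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Hypothetical helper: a dense layer stored as n_out rows of n_in weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def dense(weights, inputs):
    """Multiply a weight matrix by an input vector."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


class MultitaskNet:
    """One shared hidden layer feeding a separate output head per task."""

    def __init__(self, n_features, n_hidden, n_tasks, seed=0):
        rng = random.Random(seed)
        # Shared weights: every task's training data would update these.
        self.shared = make_layer(n_features, n_hidden, rng)
        # Task-specific weights: one single-output head per task.
        self.heads = [make_layer(n_hidden, 1, rng) for _ in range(n_tasks)]

    def predict(self, features):
        """Return one activity score in (0, 1) per task for a single compound."""
        hidden = relu(dense(self.shared, features))
        return [sigmoid(dense(head, hidden)[0]) for head in self.heads]


# Assumed toy input: a compound encoded as an 8-bit feature vector.
net = MultitaskNet(n_features=8, n_hidden=4, n_tasks=3)
compound = [1, 0, 0, 1, 1, 0, 1, 0]
print(net.predict(compound))  # three scores, one per task (untrained, so arbitrary)
```

The point of the design is the shared layer: data from any one task shapes the representation used by all the others, which is one plausible reason more tasks and more data would help. A single-task setup would instead train a separate network for each task.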

But when talking about a drug discovery process that can take years and cost drug companies billions of dollars, costs that ultimately make their way into the prices of prescription drugs, any small improvement helps.

This graph shows a measure of prediction accuracy (ROC AUC is the area under the receiver operating characteristic curve) for virtual screening on a fixed set of 10 biological processes as more datasets are added.
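For readers unfamiliar with the metric, ROC AUC can be read as the probability that a model scores a randomly chosen active compound higher than a randomly chosen inactive one. A value of 0.5 is no better than guessing and 1.0 is a perfect ranking. The sketch below computes it directly from that definition. The function name roc_auc and the toy labels and scores are made up for illustration and are not data from the paper.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly.

    labels: 1 for an active compound, 0 for an inactive one.
    scores: the model's predicted score for each compound.
    Tied scores count as half a correct pair.
    """
    actives = [s for y, s in zip(labels, scores) if y == 1]
    inactives = [s for y, s in zip(labels, scores) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive compound")

    correct = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                correct += 1.0
            elif a == i:
                correct += 0.5
    return correct / (len(actives) * len(inactives))


# Toy example: three actives and three inactives with made-up scores.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.7, 0.8, 0.1, 0.4]
print(roc_auc(labels, scores))  # 7 of the 9 pairs are ranked correctly
```

The pairwise loop is quadratic in the number of compounds, so it only suits small examples. Libraries such as scikit-learn compute the same quantity more efficiently.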

Here’s how the researchers explain the reality, and the promise, of their work in the paper:

The efficacy of multitask learning is directly related to the availability of relevant data. Hence, obtaining greater amounts of data is of critical importance for improving the state of the art. Major pharmaceutical companies possess vast private stores of experimental measurements; our work provides a strong argument that increased data sharing could result in benefits for all.

More data will maximize the benefits achievable using current architectures, but in order for algorithmic progress to occur, it must be possible to judge the performance of proposed models against previous work. It is disappointing to note that all published applications of deep learning to virtual screening (that we are aware of) use distinct datasets that are not directly comparable. It remains to future research to establish standard datasets and performance metrics for this field.

. . .

Although deep learning offers interesting possibilities for virtual screening, the full drug discovery process remains immensely complicated. Can deep learning—coupled with large amounts of experimental data—trigger a revolution in this field? Considering the transformational effect that these methods have had on other fields, we are optimistic about the future.

If they’re right, we might look back on this research as part of a handful of efforts that helped spur an artificial intelligence revolution in the health care space. Aside from other research in the field, there are multiple startups, including Butterfly Network and Enlitic (which will be presenting at our Structure Data conference later this month in New York), trying to improve doctors’ ability to diagnose diseases using deep learning. Related efforts include the work IBM is doing with its Watson technology to analyze everything from cancer to PTSD, as well as work from startups like Ayasdi and Lumiata.

There’s no reason that researchers have to stop here, either. Deep learning has proven remarkably good at tackling machine perception tasks such as computer vision and speech recognition, but the approach can technically excel at more general problems involving pattern recognition and feature selection. Given the right datasets, we could soon see deep learning networks identifying environmental factors and other root causes of disease that would help public health officials address certain issues so doctors don’t have to.
