Summary:

There’s no shortage of great minds using big data techniques to improve the quality of our medical treatments, but sometimes they can’t get access to the data they need most. Improving access to genetic data, for example, might just help cure cancer.

Photo: Double Helix (The Cancer Genome Atlas)

The health care industry might have embraced the big data movement with open arms, but embracing it with open data probably would be more effective. Hospital organizations, researchers and the tech companies serving them have lots of great ideas — and have achieved some great results, too — but, ultimately, efforts to use big data to transform the industry will only be as good as the data these stakeholders have to work with. Right now, that isn’t always everything they need.

Have access, will innovate

Wired published an interesting profile Tuesday morning that exemplifies what’s possible when smart people have access to good data. The piece showcases the work of a man named Fred Trotter who has accessed reams of buried Medicare data via a Freedom of Information Act request and is uncovering some potentially valuable information. Already, the article explains, he has built a “Doctor Social Graph” by analyzing some “60 million relationships between doctors, and how often they refer patients to one another.” His next mission is to build a doctor rating system based on data he’s uncovered about credentials, nursing home inspections and other relevant info.
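The referral relationships Trotter mined are, at heart, a weighted directed graph. A minimal sketch of how such a graph might be assembled and queried from (referrer, referee, count) records (the doctor names and counts below are made up for illustration, and this is not Trotter's actual pipeline):

```python
from collections import defaultdict

# Hypothetical referral records: (referring doctor, receiving doctor, patient count)
referrals = [
    ("dr_smith", "dr_jones", 42),   # dr_smith referred 42 patients to dr_jones
    ("dr_smith", "dr_patel", 7),
    ("dr_lee",   "dr_jones", 19),
]

# Adjacency map: who refers to whom, and how often.
graph = defaultdict(dict)
inbound = defaultdict(int)
for src, dst, count in referrals:
    graph[src][dst] = count
    inbound[dst] += count

# A crude centrality signal: doctors who receive the most referrals.
ranked = sorted(inbound.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('dr_jones', 61), ('dr_patel', 7)]
```

At the scale the article cites, some 60 million relationships, the same idea would be run through a graph database or library rather than in-memory dictionaries, but the underlying structure is the same.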

Elsewhere, companies such as Palo Alto, Calif., startup Apixio are trying to make hospitals more efficient by using semantic analysis to connect the dots between patient charts, electronic medical records, billing data and whatever other sources of information hospitals generate. (We covered Apixio in early 2011, although the company has significantly expanded its services since then.) In health care, everyone seems to have their own way of doing things, as Apixio natural-language-processing scientist Vishnu Vyas told me recently, so “the variety of the data becomes as important as the volume of the data.”

Linda Drumright, GM of the Clinical Trial Optimization Solutions group at IMS Health, agreed. She explained that her company is able to do its job because it has access to mountains of data from pharmacies, insurance claims, medical records, partners and other sources. All told, it houses 17 petabytes of data spread across 5,000 databases. Her division’s clients, which generally include pharmaceutical and biotech companies running patient trials, need all this data in order to ensure their trials will actually be successful.

One recent customer wasn’t able to recruit test subjects fast enough, she noted, and IMS helped it comb through its inclusion and exclusion criteria for the trial, only to find “that the patient population they were looking for didn’t exist.” As IMS went back, eliminated criteria and iterated on the design, it became clear the trial never should have begun in the first place.

There are a million ways to think about how to use this data, Drumright said, and as more customers begin to fully understand what they can do with it, her goal is to “make this information accessible in a way where it’s easy at the point where it’s needed, and consumable where it’s needed.”

The key to curing cancer might be more data

But whatever Trotter, Apixio, IMS and others accomplish will have been made possible because they have access to some valuable datasets, albeit not always with great ease. Many individuals who’d like to improve the health care system — if not our health, generally — aren’t so lucky. Take, for example, the world’s genetic researchers. It’s very possible that the key to the medical Holy Grail, a cure for cancer, is locked away in gene-sequence data that only a very few people will ever see.


According to University of California, Santa Cruz researcher David Haussler, the limited access that many geneticists and computer scientists like himself have to valuable genetic data is “a crime.”

“We are on the brink of a real new understanding of cancer by being able to sequence cancer genomes,” he told me during a recent interview, but big data will be the key to unlocking it.

There are 1.6 million cases of cancer in the United States every year, Haussler explained, and most of the information from those tumors is being ignored. This is partially because of privacy restrictions about who can access personal medical data and for what purposes, and partially because there isn’t yet a concerted effort to collect the necessary genetic samples. As genome sequencing gets faster and cheaper, he says researchers need access to healthy and cancerous samples from the same person — and as many of these samples as possible — in order to analyze the “astounding” number of molecular changes that occur in every type and variation of cancer.

“We can’t completely understand what we’ll find, but we know the only way we’ll pull out signal from the noise is to [analyze all these genes],” Haussler said.

Haussler understands the need for privacy regulations, but thinks there’s an opportunity to at least ease some current restrictions on how researchers access data. Even when relatively large (if not ideal) datasets are available, such as the Cancer Genome Atlas project, researchers must apply to the National Institutes of Health for access, and the data must always remain behind an organizational firewall. Every cancer patient in the country could agree to make their data available to researchers, he said, but as long as that data isn’t accessible over the internet it’s of only limited utility.

He — along with others in the field — thinks cloud computing could be the solution because it gives genetic researchers a central location where they can access and perform computations on the data. Haussler and his team, who house the Cancer Genome Atlas and a couple of other projects, currently have more than 400 terabytes of data and expect to have around 5 petabytes eventually. Downloading that much is infeasible without access to high-speed research networks, so “we need a place where people can experiment with these big data problems,” Haussler said.

In the meantime, Haussler and his peers will keep on collecting and accessing genome data however they can. And they’ll keep building software packages and algorithms that analyze that data better and faster than ever before. However, he lamented, “If we had the big data out there in an unrestricted setting, then all the best minds in the world would already be crunching on it.”
