AncestryDNA’s service might seem like magic from the outside: pay $99, spit in a vial, stick it in the mail, and then wait for a report telling you where your ancestors hail from and that you have a fifth cousin in St. Louis. It’s very much a scientific endeavor, though, and one that requires generating a lot of data. At this point, every advance the genetic-analysis arm of genealogy service Ancestry.com makes toward pinpoint-accurate analysis is also a step toward outgrowing the cutting edge of commercial big data technologies.
Like the field of genetics overall, the state of the art for AncestryDNA is moving fast. In fact, AncestryDNA Senior Vice President and General Manager Ken Chahine explained in a recent interview, figuring out your ethnicity hardly even required a DNA test three years ago. All the company could have told customers is whether they were of European, African or Asian descent.
“You don’t need a genetic test to figure that out,” he joked. “You need a mirror.”
He compared that level of granularity to being able to thumb through a book written in many languages and see when it changes from English to Russian to Chinese, for example. What’s more difficult is figuring out when it changes within the same family of languages, say from Italian to Spanish to Portuguese. From a genetic point of view, the latter is “orders of magnitude harder,” Chahine said.
From looking at languages to reading them
Fast-forward a year and a half, and the company had solved that problem. Now, it can tell customers in which of 26 regions around the world their ancestors were born. Soon, Chahine predicts, AncestryDNA will be able to tell customers what countries they’re from originally and even which regions within those countries.
Part of the solution is its team of scientists constantly working to improve the company’s algorithms, but a big part of it is data. AncestryDNA has tested more than 200,000 people as part of its consumer services and it has a pristine, pedigreed set of 100,000 DNA samples from around the world, collected separately over 10 years, for which it can verify the ethnicity of the people who gave them.
Chahine calls it “truly an invaluable data set.” Comparing customer DNA against those profiles helps AncestryDNA figure out where customers’ genealogies trace back, and each new customer the company analyzes also helps it detect patterns among people from more-specific geographic regions. Someone with Mediterranean roots is now someone with southern Italian roots.
Going back to the well of literacy analogies, Chahine compares this next step to learning to read and write in a language rather than just knowing the alphabet. “We’re collecting enough data where we can actually start making words,” he said.
After it figures out someone’s background, AncestryDNA moves on to determining who’s related to whom. It’s easy enough to detect first cousins (they should have long strands of identical DNA, Chahine explained), but the company claims it can determine with 95 percent accuracy if two people shared an ancestor five generations ago.
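To make the "long strands of identical DNA" idea concrete, here is a toy sketch in Python. It is not AncestryDNA's method, which the article does not describe. It assumes each person is reduced to a string of marker readings at the same positions, and the function name and sample sequences are made up for illustration. Real relative detection also has to deal with paired chromosomes, genetic distances and read errors, all of which this ignores.

```python
def longest_shared_run(a, b):
    """Return the length of the longest stretch of consecutive
    positions at which two equal-length marker strings agree."""
    if len(a) != len(b):
        raise ValueError("marker strings must cover the same positions")
    best = current = 0
    for x, y in zip(a, b):
        if x == y:
            current += 1
            best = max(best, current)
        else:
            current = 0  # a mismatch ends the current shared stretch
    return best


# Hypothetical marker readings, invented for this example.
person = "AAGTCCGTTA"
close_relative = "AAGTCCGTAA"
stranger = "TAGACCTTTC"

print(longest_shared_run(person, close_relative))
print(longest_shared_run(person, stranger))
```

In this toy data the close relative shares a much longer unbroken stretch with the first person than the stranger does, which is the kind of signal Chahine describes for first cousins. More distant relatives share shorter stretches, which is why the five-generation case is the harder claim.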
What’s amazing is how related everyone in the world is: Even within its relatively small database of users — 200,000 people out of billions — “the probability of [discovering] a fourth cousin is pretty amazing,” Chahine said. Someone of Northern European descent, as the majority of AncestryDNA users are, will probably have between 10 and 40 fourth cousins already in the company’s database.
If you’re curious about the privacy aspect of all this, check out the video below, which features Chahine at our Structure Data conference last year (it’s coming up again this year in March) talking about the tradeoffs people are willing to make between privacy and value. (Or just ask in the comments and I’ll share my notes on it.)
Outgrowing its infrastructure and algorithms
But this type of analysis also drives up the requirements for AncestryDNA to be able to scale its data-processing environment. It’s a problem with “really no end in sight” because everyone must be compared against everyone else in an always-growing database, Chahine said. And although AncestryDNA is looking at different parts of the human genome in different ways than a researcher or service trying to discover disease markers would, it’s still looking at gigabytes of raw data per genome and even more after they’re analyzed.
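A quick back-of-the-envelope calculation shows why comparing everyone against everyone else has no end in sight. The sketch below assumes a naive approach in which every pair of customers is compared exactly once. The 200,000 figure comes from the article, while the larger database sizes are hypothetical.

```python
def pairwise_comparisons(n):
    """Number of distinct pairs among n people: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# 200,000 is the database size cited in the article; the other two
# sizes are hypothetical growth scenarios.
for people in (200_000, 400_000, 2_000_000):
    print(f"{people:>9,} people -> {pairwise_comparisons(people):>17,} pairs")
```

Doubling the database roughly quadruples the number of pairs, so the work grows much faster than the customer count does. Any practical system would presumably need to avoid the full all-pairs comparison, but the article does not say how AncestryDNA approaches that.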
Already, Chahine said, AncestryDNA has gotten to a point where the best-known algorithms don’t work at its scale and for its purposes. It’s possible the company’s Hadoop infrastructure might not be able to scale out enough to handle all the genetic data it’s generating and the company will have to develop its own technology for storing and processing information. This despite the fact that Hadoop can scale into the petabyte — or millions of gigabytes — range, and Ancestry.com itself was holding 4 petabytes of data in mid-2012.
“I suspect it’s gonna be a combination of using really clever science and algorithms with other tools that will allow us to scale this thing,” Chahine said.
Pattern matching versus marker-spotting
If there’s a silver lining to the scalability challenges AncestryDNA is facing, it might be that at least it’s working on a problem where accuracy is a relative term. This isn’t like interpreting DNA for medical purposes, where misreading one specific marker in one specific gene can be life-altering, the difference between spotting a disease or not. That’s what 23andMe was doing and that’s why it recently got taken to task by the FDA.
Yes, AncestryDNA is analyzing 700,000 markers per customer in order to determine someone’s ethnicity but, Chahine explained, misreading one of them won’t skew the results enough to, say, tell a Chinese person he’s Irish. “We’re really looking at your data in the context of the population,” Chahine said. “… If we’re wrong a little bit, no big deal. … I’m never relying on one marker.”
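Chahine's point about never relying on one marker can be illustrated with a small simulation. This is only a sketch under loud assumptions: two made-up populations, a single made-up allele frequency for each, 1,000 markers instead of 700,000, and a simple likelihood comparison standing in for whatever AncestryDNA actually does.

```python
import math
import random

NUM_MARKERS = 1_000  # scaled down from the 700,000 markers cited in the article

# Hypothetical frequency of allele "1" at every marker in two invented populations.
FREQ = {"population_a": 0.7, "population_b": 0.3}


def log_likelihood(genotype, freq):
    """Log-probability of seeing this genotype if each marker shows
    allele 1 with probability freq, independently of the others."""
    return sum(math.log(freq if allele == 1 else 1.0 - freq) for allele in genotype)


def best_match(genotype):
    """Name of the population under which the genotype is most likely."""
    return max(FREQ, key=lambda name: log_likelihood(genotype, FREQ[name]))


# Simulate a customer drawn from population_a (seeded so the run is repeatable).
rng = random.Random(0)
customer = [1 if rng.random() < FREQ["population_a"] else 0 for _ in range(NUM_MARKERS)]

# Misread exactly one marker and classify again.
misread = list(customer)
misread[0] = 1 - misread[0]

print(best_match(customer), best_match(misread))
```

Because the answer comes from the combined weight of all the markers, flipping a single one barely moves the total, and the best-matching population stays the same. A medical test that hinges on one specific marker has no such cushion.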