Summary:

Online genealogy service Ancestry.com is trying to become like the Amazon or Netflix of family trees. Much like those companies use customer data to recommend products or movies customers might like, Ancestry.com is using machine learning to make learning about ancestors a lot less work.

Online genealogy service Ancestry.com is trying to become like the Amazon or Netflix of family trees. Much like those companies use customer data to recommend products or movies customers might like, Ancestry.com wants to feed its users relevant historical records and other information on ancestors without making them search through its database. And it’s taking in everything from newspaper clippings to your DNA to make this happen.

If you’ve used Ancestry.com recently, you’re probably thankful for its efforts. According to Head of Engineering Scott Sorenson, Ancestry.com has more than 10 billion records that are part of a 4-petabyte (or 4-million-gigabyte) data store. If you’re searching for “John Smith,” he explained, it probably has about 60 million records for “Smith” and about 4 million for “John Smith,” but you’re only interested in the relative handful that are relevant to your John Smith.

Making models smarter

That’s why Ancestry.com is using machine learning to make sorting through those records a lot less like finding a needle in a haystack and a lot more like having that needle — and any others made from the same batch of steel — delivered right to your door. Here’s how the process works, in a nutshell:

  1. Crawl digital records (e.g., newspapers, birth records, death records, census data, ship manifests) online and extract relevant data
  2. (Or 1(a)) Scan, upload and index physical records (via a partner in China)
  3. Stitch together new records with user data to add more context
  4. And this is key, constantly analyze user behavior in order to make its algorithms smarter

As users make judgments about the records they’re presented, Sorenson said, Ancestry.com’s algorithms get better at performing their particular tasks. So, a system for extracting data from newspaper pages might be able to better recognize the various sections of the page (so as to ignore the ads, for example) and then be able to adjust for mistakes in the section it is analyzing. And as with Google’s search algorithms, the more that users interact with records, the better Ancestry.com’s sorting algorithms are able to determine those records’ relevance to any given user.

Spit in a tube, pay $99, learn your past

Oh, but Ancestry.com has decided that merely storing and analyzing historical records is just the beginning with regard to providing accurate genealogy information. It also will sequence your DNA, focusing on 700,000 markers important to determining one’s race, lineage and other factors. That service, which simply requires users to swab their cheek or spit in a tube and send it to the lab, costs only $99 (a full genome sequence would cost at least 10 times that, by the way), but could revolutionize the accuracy of Ancestry.com’s models.

Right now, Sorenson said, the DNA service can tell users their race and what country they’re from, and also connect them with other relatives who share a DNA profile. (If your privacy red flag has gone up reading this, Sorenson did note the following: all communications with relatives are optional and initially anonymous; all DNA information is disassociated from personal information; and users get their sequence results via an encrypted key “that we treat with a higher level of security than we’d store your credit card information.”)

Connecting with distant relatives can be valuable, though. A third cousin, for example, might have ancestral information that you don’t, which will help make your family tree that much more accurate. But Sorenson said where it really gets interesting is when Ancestry.com can combine DNA data with record data in family trees. Someone’s DNA might indicate he’s from France, Sorenson explained, but cross-checking that against that person’s family data will let the service discover he’s actually from the Normandy region.
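
As a rough illustration of that cross-check, the Python sketch below narrows a country-level DNA estimate to a region by looking at where ancestors in a family tree were born. The function name, the data shape (a list of country/region pairs) and the majority-vote rule are assumptions invented for this example; the article does not describe how Ancestry.com actually combines the two sources.

```python
from collections import Counter


def refine_origin(dna_country, tree_birthplaces):
    """Narrow a DNA country estimate using family-tree birthplaces.

    tree_birthplaces: list of (country, region) tuples taken from a tree.
    Returns "Region, Country" when the tree has regional data for the
    DNA-indicated country, otherwise just the country.
    """
    # Count regions only for ancestors born in the country the DNA points to.
    regions = Counter(
        region
        for country, region in tree_birthplaces
        if country == dna_country and region
    )
    if not regions:
        return dna_country
    region, _count = regions.most_common(1)[0]
    return f"{region}, {dna_country}"


if __name__ == "__main__":
    birthplaces = [  # made-up tree data
        ("France", "Normandy"),
        ("France", "Normandy"),
        ("France", "Brittany"),
        ("England", "Kent"),
    ]
    print(refine_origin("France", birthplaces))  # Normandy, France
```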

Going forward, Sorenson said Ancestry.com expects its DNA service to take off like a rocket. The company is investing between $10 million and $15 million into that service over the next couple years, and has bioinformatic scientists on staff trying to scale algorithms designed to handle hundreds of samples to work with hundreds of thousands or even millions of samples. In that regard, though, Ancestry.com isn’t alone — the steady drop in the price of genome sequencing has everyone in the sector anticipating skyrocketing data volumes.

What’s next: Telling stories and making genealogy real-time

OK, so it has billions of records and our DNA. What more can Ancestry.com possibly want or need to provide us information on our ancestors? Nothing, actually.

It just needs to make better use of what it does have and the new technologies available for working with that information. Genealogy has traditionally been “dusty,” Sorenson explained, but Ancestry.com is trying to tell the stories behind those dusty records. If you’ve seen the NBC program “Who Do You Think You Are?”, on which Ancestry.com traces celebrities’ ancestral roots, you have an idea of what Sorenson is talking about.

For example, by improving its image-processing capabilities, Ancestry.com could extract more information than just name, date and location from old records that it already knows how to process. It could tell someone that his grandfather was the only person on the block to own a radio, or whether he owned his home. Combined with socioeconomic and other external data, Sorenson said, Ancestry.com could “create a really vivid picture” of what it was like to live during a specific time.

By using location data from cell phones, Sorenson said, Ancestry.com could deliver a mobile experience that’s far more than a translation of the web on a smaller screen by making genealogy a geospatial pursuit. For example, Sorenson explained, if a user takes a picture of a gravestone, Ancestry.com would like to provide him with relevant historical data related to that place, and maybe even some nearby points of interest.

Some might think Ancestry.com’s practices and plans toe the privacy line, but if someone has to toe that line, this might be the company to do it. In a fast-paced world it’s easy to get tied up in the moment and in our own little worlds, especially with big data being used elsewhere on the web to keep our attention firmly on one site or another. Using personal data to let users dig decades into their family histories ends up looking very refreshing.

Feature image courtesy of Shutterstock user tovovan.

  1. Unfortunately, user judgment is not always sound, and Mr. Sorenson’s algorithm mucks up the search engine precisely because it “learns” from those user choices.

    Big Data? As the algorithm incorporates inaccuracies, and then prompts more people to confirm those “judgments” in making choices of their own… Big Bad Data is what it becomes.

    1. Derrick Harris Wednesday, June 13, 2012

      That’s a good point, although how would you solve it to account for poor judgment? It seems like a fair assumption when building a model that user behavior is determinative of accuracy.

      1. If I spend 2 hours watching a PBS Civil War documentary on Netflix, or plunk down $20 on Battle Cry of Freedom at Amazon, it would indeed be a fair assumption to conclude that I am interested in the Civil War.

        But if I attach a Social Security Death Index to Katherine Compton in my family tree… because the name matches, and I’m excited to suddenly “know” her dates of birth and death, and, hey, the Ancestry hints are telling me that other users have attached the same SSDI to their Katherine Compton too… does that make my choice to attach the record a fair assumption with regard to historical fact? No. It does not.

        *IF* Ancestry’s users were all painstakingly careful, deliberate, well-reasoned constructors of family trees, then Mr. Sorenson’s model would make sense. But they are not. Ken Piper (in the comment below) is right. Ancestry’s user-submitted trees are rife with inaccuracies.

        To answer your question – the only way I can think of to solve this problem is to leave user choices out of the algorithm altogether.

  2. I agree that I worry about user judgement. Aside from record images, the bulk of “data” on Ancestry is submitted family trees with a horrible record of accuracy. I have given up fighting the wave of citations pointing to other users as a record. The inaccuracies become “fact” and when you truly find the correct fact you are drowned out by the glut of bad data. It will only get worse.

    1. On the bright side, though, one day we may ALL be descended from Henry VIII and Charlemagne! ;)

      1. Derrick Harris Wednesday, June 13, 2012

        LOL.

      2. Maybe Charlemagne, but not Henry VIII since he has no descendants.

  3. One must keep in mind that not only are there errors in the online family trees, but also in the records used to make them. Though I truly like the automation of trees and data, I am very careful to verify the data before attaching and I rarely allow a tree to become public precisely because of the error factor.

    Automation has unfortunately led to what a scientist would call bad science. Attach, attach, attach and publish should be changed to verify, verify, verify. Alas, too many want instant gratification.

    J

