Yahoo data scientist: It’s Romney-Christie or Gingrich-Rubio

According to a predictive analysis experiment by a Yahoo data scientist, U.S. voters can expect to see either a Mitt Romney-Chris Christie or a Newt Gingrich-Marco Rubio ticket face off against Obama-Biden in this year’s presidential election. The experiment, which author David Pennock explained on Yahoo’s The Signal blog Monday morning, highlights both the strengths and weaknesses of using web data to predict human behavior.

The point of Pennock’s experiment is to determine likely vice-presidential candidates based on what the web is saying. He found there’s a 25 percent chance Romney would pick Christie, currently the governor of New Jersey, whereas there’s a 30 percent chance Gingrich would choose Florida senator Rubio. Interestingly, although Romney and Rubio are anti-correlated (i.e., Rubio’s chances of being VP go down as Romney’s chances of being president rise, and vice versa), Rubio is so popular there’s still a 22 percent chance Romney would choose him. Christie, on the other hand, sees his chances drop to a mere 5 percent if Gingrich wins the Republican nomination.

Essentially, Pennock is using data from prediction services Intrade and PredictWise on who’s the most likely VP, and extrapolating from there to determine who might get the nod if any given candidate wins the nomination. Statistically, Pennock’s conclusions are probably accurate, but he does make sure to note they’re just the result of “a statistical correspondence, and an extrapolated one at that, not a proven cause-effect relationship.”
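
To make the idea of an “extrapolated statistical correspondence” concrete, here is a rough sketch of one way such an estimate could be produced. It assumes a simple straight-line fit between a candidate’s nomination price and a would-be running mate’s VP price, then reads the line off at the point where the nomination is certain. That method is my assumption for illustration, not necessarily what Pennock did, and the prices below are made-up numbers, not real Intrade or PredictWise data. The function names are hypothetical, too.

```python
# Illustrative sketch only: the prices are invented, and the linear-fit
# approach is an assumption, not a description of Pennock's actual method.

# Hypothetical daily market prices, expressed as probabilities.
nominee_price = [0.60, 0.65, 0.70, 0.75, 0.80]  # P(candidate wins nomination)
vp_price = [0.10, 0.12, 0.14, 0.16, 0.18]       # P(person is the VP pick)


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def extrapolate_conditional(xs, ys, at=1.0):
    """Estimate the VP price if the nomination price reached `at`.

    The result is clamped to [0, 1] because a straight line can
    extrapolate outside the range of valid probabilities.
    """
    slope, intercept = fit_line(xs, ys)
    estimate = intercept + slope * at
    return min(1.0, max(0.0, estimate))


slope, _ = fit_line(nominee_price, vp_price)
chance_if_nominated = extrapolate_conditional(nominee_price, vp_price)
print(f"slope: {slope:.2f}, extrapolated VP chance: {chance_if_nominated:.0%}")
```

A negative slope would correspond to the anti-correlation described above, where one person’s VP chances fall as a particular candidate’s nomination chances rise. Note that the fit only ever sees prices within the observed range, so the value at certainty is a projection beyond the data, which is exactly why the caveat about extrapolation matters.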

For example, Pennock claims the results “are based solely on data unaided–and untainted–by political intuition,” but that’s not necessarily the case. Depending on what data sources he, Intrade and PredictWise are using, political intuition could play a major role in who’s correlated with whom. If my gut, no matter how uninformed, tells me Marco Rubio would be a great vice-presidential candidate and I write it or tweet it, I’ve likely influenced the data set with my intuitions, however data-driven the analysis itself is.

Furthermore, there’s really no accounting for the human brain. Although Intrade had Sarah Palin as John McCain’s likely VP candidate on the day she was announced, its confidence varied greatly throughout the day as rumors swirled, and I’m guessing it didn’t have her rated highly this far out. Who knows what will change between now and August and whose name will start cropping up? Actually, Intrade has the chances of Palin being picked this year at 0.2 percent as I write this, but perhaps she’ll surge again over the summer.

Predictive models can be very beneficial, and I think Pennock’s analysis (as well as those from Intrade and PredictWise) is very telling about reality as it exists now. But unlike in the world of machine data, where a series of particular events might be highly suggestive of a particular outcome down the road, reality can change in a hurry when fickle humans are involved.