Summary:

A recent study found strong correlations between people’s Facebook likes and a number of personal characteristics such as sexual orientation and intelligence. But relying on correlations as proof of anything is a questionable practice.

You might have heard recently about a study finding that liking “curly fries” on Facebook correlates strongly with high intelligence. Publications such as Wired have written about it. Quid Founder and CEO Sean Gourley cited it during a presentation at Structure: Data last week. A faction of the European Union parliament even pointed to the study as yet another reason to prohibit data mining by web companies.

However, if you’re like me, hearing anybody repeat that curly fries data point as fact likely sends a shiver down your spine. It’s not that it’s not true (it very well might be), but that it’s nearly useless information without more background.

That’s right, the old correlation versus causation argument is front and center once again. In the whole big data world, it’s probably the biggest fallacy there is, no matter how you look at it. No, getting value from big data doesn’t always require giving greater credence to correlation than to causation. And, no, relying on correlation isn’t inherently an ethically or scientifically questionable practice.

Really, the choice between relying on correlation or striving to find causation probably depends on what you’re trying to do.

When there’s nothing at stake, correlate away

Let’s be honest: If all I’m concerned with doing is boosting clickthroughs, selling more products or predicting the movies you want to see, correlations probably will work just fine. I don’t really care why, for example, Mac users book more-expensive rooms on Orbitz — I just care that they do.

You visit my site, my system sees you’re using a Mac (or that you like curly fries, or any other attribute it can associate with you) and it shows you content that it thinks you’ll want to see. It’s not a perfect approach, but it’s probably far better than the old method of just showing everybody the exact same content.

And when you’re collecting potentially petabytes of user data and trying to serve ads in near real time, strong correlations might be about the best things you can hope to find. It’s a volume-and-velocity business, and heavy examinations of why any two (or more) things are related to one another might not always provide a high return on investment.

A more extreme example of when correlations might suffice would be something like machine-to-machine systems that need to make decisions in real time in order to prevent disasters. The people charged with running these systems might not know why a certain series of events often precedes a particular outcome, but it’s better to be safe than sorry.

You can’t make a difference — or real decisions — with correlations

But if you’re trying to use big data to make a meaningful difference in the world or to make decisions that can have significant real-world consequences, mere correlations probably won’t cut it. This is what Evgeny Morozov warns about in relation to crime in a recent New York Times column. It’s what Gourley had in mind when talking about data science versus data intelligence. It’s also why the current discussion around machine learning almost always includes a human aspect.

Many of the reasons for not acting on correlations alone are based on privacy and a whole collection of civil, constitutional and human rights. You simply can’t profile and then arrest people, for example, based on what their Likes suggest they might be. You probably shouldn’t make decisions about people’s financial, health or general well-being based on mere correlations, either.

Heck, I wouldn’t even serve ads that delve into personal information such as health, sexual orientation or intelligence without a very strong reason to believe I was accurate (and express consent to serve those ads). And the Facebook-curly-fries study is full of correlations that could be potential landmines, a small portion of which are visible in the chart below.

More correlations from the "curly fries" study. Source: Proceedings of the National Academy of Sciences

But these are all situations where the fear of incorrectly profiling someone occasionally — and being sued as a result — might overpower the desire to do good most of the time. The data Darwinism that my colleague Om Malik wrote about recently extends beyond just peer reviews and social-media ratings, and one shouldn’t take the role of playing God (or catalyst for evolutionary change, to continue the Darwin metaphor) lightly.

Sometimes, though, correlations aren’t enough because you really want to solve a problem or perhaps build a great product. As Gourley explained at Structure: Data, using correlative data to predict insurgent attacks in a place like Iraq is relatively easy, but predicting the likelihood of events doesn’t stop them. Stopping them requires really understanding and addressing the root causes of the attacks.

The same goes for stopping disease outbreaks, figuring out why programmers make more mistakes during certain seasons, stopping gun violence, or just capitalizing on that knowledge about curly fries or hotel-room bookers in order to build products that touch upon the deeper rationales for liking those things. You can fight the symptoms, so to speak, or you can cure the disease.

So feel free to try selling a documentary about Dostoevsky to the next guy you see eating curly fries, but don’t expect him to care. It might be that there’s some strong connection between curly fries and intelligence; of course, it might also be that intelligent people, entirely coincidentally, tend to live within walking distance of an Arby’s. But no one has asked about that.

Feature image courtesy of Shutterstock user Tobias Arhelger.

  1. This is brilliant!

  2. Ferenc Huszar Tuesday, March 26, 2013

    I believe the argument presented in this article is flawed.

    I am usually a vocal proponent of the “correlation is not causation” point, as evidenced by a few blog posts and comments that I have written online on this topic. See e.g. http://bit.ly/Zp4D61. However, in this case the authors did NOT make a mistake.

    Misinterpreting correlations vs. causation in the context of this paper would only be a problem if someone interpreted the study as ‘if I want to make people smart, I should feed them curly fries’. I think nobody is as stupid as that.

    Correlations and predictive statements can be exploited in decision making (without any statistical inconsistencies) even if they do not represent cause-effect relationships. Correlations are not useless, and in the context of this study, predictability was the interesting question; nobody hoped to uncover causal relationships.

    Here is how correlations can be useful: let’s say you want to target smart people on Facebook. You don’t observe who is smart and who is not, but you do observe whether they like curly fries or not, and you know that the two properties are positively correlated. If you therefore target people who like curly fries, then statistically, you will be targeting a group with a higher share of smart people. It doesn’t matter if there’s causation in the background or not.
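The targeting argument above can be checked with a few lines of arithmetic. This is a minimal sketch using Bayes' rule; the three rates are invented purely for illustration and do not come from the study, and the function name `share_smart_among_likers` is hypothetical.

```python
def share_smart_among_likers(p_smart, p_like_given_smart, p_like_given_not_smart):
    """Bayes' rule: P(smart | likes curly fries)."""
    # Total share of users who like curly fries, smart or not.
    p_like = p_like_given_smart * p_smart + p_like_given_not_smart * (1 - p_smart)
    return p_like_given_smart * p_smart / p_like

# Hypothetical rates, invented for illustration only (not from the study).
BASE_RATE = 0.20          # share of all users who are "smart"
LIKE_IF_SMART = 0.30      # share of smart users who like curly fries
LIKE_IF_NOT_SMART = 0.10  # share of other users who like curly fries

targeted = share_smart_among_likers(BASE_RATE, LIKE_IF_SMART, LIKE_IF_NOT_SMART)
lift = targeted / BASE_RATE

print(f"Smart share among everyone: {BASE_RATE:.1%}")
print(f"Smart share among likers:   {targeted:.1%}")
print(f"Lift from targeting likers: {lift:.2f}x")
```

Under these assumed rates, the audience of curly-fries likers contains a larger share of smart users than the overall population, and nothing in the calculation requires a causal link in either direction.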

    There is nothing wrong with the statements
    1) ‘Variable X is positively correlated with variable Y, therefore if I select a subsample where X is large I can expect Y to be large’
    2) ‘Observing variable X allows me to predict variable Y accurately’
    The causation-correlation issue comes into the picture when you say
    3) ‘Variable X is positively correlated with variable Y, therefore if I make X larger, Y will increase’

    (1) and (2) are perfectly valid, valuable and coherent; (3) is the problematic one.
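The difference between selecting on X and changing X can be made concrete with a small simulation. This is only a sketch under a stated assumption: a hidden factor `z` drives both X and Y, and X has no effect on Y at all. The names `simulate`, `pearson` and `boost_x` are hypothetical, and the model is not taken from the study.

```python
import random
import statistics

def simulate(n=20_000, boost_x=0.0, seed=0):
    """Draw (x, y) pairs where a hidden factor z drives both; x never affects y."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z = rng.gauss(0, 1)                # hidden common cause (assumption)
        x = z + rng.gauss(0, 1) + boost_x  # boost_x models forcing X upward
        y = z + rng.gauss(0, 1)            # y depends on z only, never on x
        pairs.append((x, y))
    return pairs

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5

observed = simulate()
xs, ys = zip(*observed)
corr = pearson(xs, ys)

# Statement (1): select the subsample where X is large and look at Y there.
mean_y_all = statistics.fmean(ys)
mean_y_selected = statistics.fmean(y for x, y in observed if x > 1.0)

# Statement (3): push X up for everyone (same seed replays the same population).
intervened = simulate(boost_x=2.0)
xs_after, ys_after = zip(*intervened)
mean_x_shift = statistics.fmean(xs_after) - statistics.fmean(xs)
mean_y_shift = statistics.fmean(ys_after) - mean_y_all

print(f"correlation(X, Y):          {corr:.2f}")
print(f"mean Y overall:             {mean_y_all:.2f}")
print(f"mean Y where X > 1:         {mean_y_selected:.2f}")
print(f"shift in mean X after push: {mean_x_shift:.2f}")
print(f"shift in mean Y after push: {mean_y_shift:.2f}")
```

In this toy model, the subsample with large X has a clearly higher average Y, so selection and prediction work. Raising X directly leaves Y where it was, because Y was only ever tied to the hidden factor.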

    To give you an example of a really flawed argument, look at the bitly blog post about the best time to tweet: http://bit.ly/J37zdu. The article observes that there is a correlation between when you tweet and how many clicks you get. Thus far this is OK. Then it concludes that you should tweet at certain times and expect an increase in the clicks you get on your links. This is the problematic part, as pointed out by many commenters.

    1. Ferenc, I believe your passionate and detailed defense of correlation vs. causation is unwarranted in this case. I really don’t see the author claiming that correlations are useless, or confusing the two. In fact, there’s an entire section devoted to WHEN correlations are useful (“When there’s nothing at stake, correlate away”). He prefaces this section with, “Really, the choice between relying on correlation or striving to find causation probably depends on what you’re trying to do,” and proceeds to give a very valid explanation of when you’d rely on correlation and when you have to dig deeper to find causation.

      To me, this just proves that data interpretation is not entirely an objective, scientific activity because it is done by humans and not by robots. Everyone brings their own experience and cognitive biases into how they interpret data (text in this case, but could be charts or numbers too).

      What I took away from reading the article:
      Big Data can indeed provide correlations, some useful, and some that make you go “Huh?” This has been true of data mining techniques as well, so nothing new here. Correlations are no doubt helpful and may further our understanding of the underlying relationships, but outside of targeted ads, they may not be actionable in many cases. So to solve problems or make decisions you do need something more, like an experiment that would test your hypotheses and correlations. Even then, there would always remain areas such as medicine and law, where we would require nothing short of causality. In the case of an individual whom your crime detection model flags as a possible criminal, you’d keep them under observation rather than locking them up right away.

  3. Ferenc Huszar Tuesday, March 26, 2013

    You raise interesting moral questions about whether acting on data when there is no provable causal relationship is unethical.

    So if statistically, say, drivers of rainbow-painted cars are more likely to carry an illegal gun, the police are supposed to ignore this knowledge because causality cannot be proven?

    Mathematically, variable X can contain information about variable Y even if there is no causal relationship between them. So you’re suggesting that whenever the police observe a variable X about someone, and infer that Y may therefore be true, they should ignore the information if there is no provable causal link between the two variables.

    You could equally argue that ignoring information and the results of statistical inference is immoral and unethical from society’s perspective. Take, for example, your proposition “You probably shouldn’t make decisions about people’s financial, health or general well being based on mere correlations, either.” One can probably argue that this stance is both immoral and ineffective.

    Statistical inference does make sense and should not be ignored even if it is not based on a causal relationship.

    1. Derrick Harris Tuesday, March 26, 2013

      Ferenc,

      Regarding correlation vs. causation, I’d say there needs to be very strong correlative evidence before acting on things that could affect someone’s life. That’s why, as in the post I wrote today about using Markov models to map the spread of cancer cells, the study’s authors note that they’ve just provided a framework that should be analyzed against more-targeted data sets and other techniques to explain the results they found in any given situation.

      Regarding your legal/ethical questions, I’d say discretion is absolutely the right answer. The alternative is called racial (or pick some other descriptor) profiling, and that kind of stuff gets police departments sued and the Justice Department brought in. You don’t have to ignore the statistical inferences — they probably should put someone on notice — but they’re not grounds for presuming wrongdoing and then proactively searching/pulling over/whatever.

      But if I were just serving ads, then, yeah, it’s probably enough to show someone an ad for Fritos.
