Happy ZIP codes, longer lives

Scientists say tweets predict heart disease and community health

University of Pennsylvania researchers have found that the words people use on Twitter can help predict the rate of heart disease deaths in the counties where they live. Places where people tweet happier language about happier topics show lower rates of heart disease death when compared with Centers for Disease Control statistics, while places with angry language about negative topics show higher rates.

The findings of this study, which was published in the journal Psychological Science, cut across fields such as medicine, psychology, public health and possibly even civil planning. It’s yet another affirmation that Twitter, despite any inherent demographic biases, is a good source of relatively unfiltered data about people’s thoughts and feelings, well beyond the scale and depth of traditional polls or surveys. In this case, the researchers used approximately 148 million geo-tagged tweets from 2009 and 2010 from more than 1,300 counties that contain 88 percent of the U.S. population.

(How to take full advantage of this glut of data, especially for business and governments, is something we’ll cover at our Structure Data conference with Twitter’s Seth McGuire and Dataminr’s Ted Bailey.)


What’s more, at the county level, the Penn study’s findings about language sentiment turn out to be more predictive of heart disease than any other individual factor — including income, smoking and hypertension. A predictive model combining language with those other factors was the most accurate of all.

That’s a result similar to recent research comparing Google Flu Trends with CDC data, although Flu Trends is an ongoing project that has already been collecting data for years, and the search queries it collects are much more directly related to influenza than the Penn study’s tweets are to heart disease.

That’s likely why the Penn researchers suspect their findings will be more relevant to community-scale policies or interventions than anything at an individual level, despite previous research that shows a link between emotional well-being and heart disease in individuals. Penn professor Lyle Ungar is quoted in a press release announcing the study’s publication:

“We believe that we are picking up more long-term characteristics of communities. The language may represent the ‘drying out of the wood’ rather than the ‘spark’ that immediately leads to mortality. We can’t predict the number of heart attacks a county will have in a given timeframe, but the language may reveal places to intervene.”

The researchers’ work is part of the university’s Well-Being Project, which has also used Facebook users’ language to build personality profiles.

5 Responses to “Scientists say tweets predict heart disease and community health”

  1. Gil Bashe

    Seems a stretch clinically. Twitter and conversation may reflect other factors in life – from diet, exercise, access to medical care and stress. The tweet is the external expression of life and not causal. Hotspots are known.

  2. Sunil Bajpai

    Is there a correlation between spread of infectious disease and words on Twitter? Between rates of unemployment and negative sentiment? If so should we infer that unhappiness causes infection and joblessness? Maybe unhappy people get fired or have their immune system compromised?

  3. Peter Fretty

    Great example of how unstructured data can make an impact through organizational data models. According to a recent IDG SAS survey, there is still a lot of room for growth when it comes to leveraging unstructured data. Specifically, only 31 percent of firms who have integrated unstructured data into their strategies are currently leveraging social media sources.

    Peter Fretty, IDG blogger working on behalf of SAS

  4. snuggles

    The Pearson r isn’t that different when comparing geotagged tweets vs no twitter at all. I’d love to see that original data set because I think there’s some shenanigans afoot.