How two scientists are using the New York Times archives to predict the future

Researchers at Microsoft (s MSFT) and the Technion-Israel Institute of Technology are creating software that analyzes 22 years of New York Times archives, Wikipedia and about 90 other web resources to predict future disease outbreaks, riots and deaths — and hopefully prevent them.

The new research is the latest in a number of similar initiatives that seek to mine web data to predict all kinds of events. Recorded Future, for instance, analyzes news, blogs and social media to “help identify predictive signals” for a variety of industries, including financial services and defense. Researchers are also using Twitter and Google (s GOOG) to track flu outbreaks.

from “Mining the Web to Predict Future Events,” Horvitz and Radinsky,

Eric Horvitz of Microsoft Research and Kira Radinsky of the Technion-Israel Institute of Technology describe their work in a newly released paper, “Mining the Web to Predict Future Events” (PDF). For example, they examined how news about natural disasters like storms and droughts could be used to predict cholera outbreaks in Angola. Following those weather events, “alerts about a downstream risk of cholera could have been issued nearly a year in advance,” they write.

Horvitz and Radinsky acknowledge that epidemiologists look at some of the same relationships, but “such studies are typically few in number, employ heuristic assessments, and are frequently retrospective analyses, rather than aimed at generating predictions for guiding near-term action.” They outline the advantages that software has over humans in this area:

  • Learning: Software “has the ability to learn patterns from large amounts of data, can monitor numerous information sources, can learn new probabilistic associations over time, and can continue to do real-time monitoring, prediction, and alerting on increases in the likelihoods of forthcoming concerning events.”
  • Tireless researching: Software, with its “long tentacles into historical corpora and real-time feeds,” can dig up data that humans might never find because they’re too focused on “knowledge that is easily discovered in studies or available from experts.”
  • Lack of bias: Software can assist “when inferences from data run counter to expert expectations,” or when “there is a significantly lower likelihood of an event than expected by experts based on the large set of observations and feeds being considered in an automated manner.”
  • Greater access to news: “A system monitoring likelihoods of concerning future events typically will have faster and more comprehensive access to news stories that may seem less important on the surface (e.g., a story about a funeral published in a local newspaper that does not reach the main headlines), but that might provide valuable evidence in the evolution of larger, more important stories (e.g., massive riots).”

One problem the researchers faced in developing their software model is that tragic events in poor African countries are often not widely reported. So they taught the software to generalize somewhat: “Instead of considering only ‘Rwanda cholera outbreak,’ an event with a small number of historical cases, we consider more general events of the form: ‘[Country in Africa] cholera outbreak.’ We turn to world knowledge available on the Web…[that] maps Rwanda to the following concepts: Republics, African countries, Landlocked countries, Bantu countries, etc.”
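To make that generalization step concrete, here is a minimal sketch in Python. It is not the researchers' code: the `CONCEPTS` map, the `generalize` and `count_events` helpers, and the tiny event history are all hypothetical stand-ins, and the counts are made up for illustration. The idea it shows is only that a rarely reported event can borrow evidence from broader classes such as “[African countries] cholera outbreak.”

```python
from collections import defaultdict

# Hypothetical concept map, standing in for the "world knowledge
# available on the Web" that the paper describes.
CONCEPTS = {
    "Rwanda": ["Republics", "African countries", "Landlocked countries"],
    "Angola": ["Republics", "African countries"],
    "Bangladesh": ["Republics", "Asian countries"],
}


def generalize(country, event):
    """Return the specific event key plus one key per broader concept."""
    keys = [f"{country} {event}"]
    keys += [f"[{concept}] {event}" for concept in CONCEPTS.get(country, [])]
    return keys


def count_events(history):
    """Count each reported event under its specific and generalized keys."""
    counts = defaultdict(int)
    for country, event in history:
        for key in generalize(country, event):
            counts[key] += 1
    return counts


# Made-up history for illustration only; not real outbreak data.
history = [
    ("Angola", "cholera outbreak"),
    ("Rwanda", "cholera outbreak"),
    ("Angola", "cholera outbreak"),
    ("Bangladesh", "cholera outbreak"),
]

counts = count_events(history)
for key in sorted(counts):
    print(key, counts[key])
```

In this toy history, the Rwanda-specific event has a single case, while the generalized “[African countries] cholera outbreak” key pools three, which gives a learner more to work with. How the real system weighs those generalized events is described in the paper, not here.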

Horvitz and Radinsky also taught the software what to ignore: It “was able to recognize that the drought experienced in New York City on March 1989, published in the NYT under the title: ‘Emergency is declared over drought’ would not be associated with a disease outbreak…The system estimates that, for droughts to cause cholera with high probability, the drought needs to happen in dense populations (such as the refugee camps in Angola and Bangladesh) located in underdeveloped countries that are proximal to bodies of water.”

“I truly view this as a foreshadowing of what’s to come,” Horvitz told the MIT Technology Review. “Eventually this kind of work will start to have an influence on how things go for people.” He said Microsoft isn’t commercializing the research yet, but that it will continue, and he wants to get more “data further back in time.”

8 Responses to “How two scientists are using the New York Times archives to predict the future”

  1. clayusmcret

    The NY Times only covers issues supportive of their editorial board’s political agenda. Their software will miss large amounts of unpopular diseases and disasters in their research.

    • Cory Mountz

      “Researchers at Microsoft and the Technion-Israel Institute of Technology are creating software that analyzes 22 years of New York Times archives, Wikipedia and about 90 other web resources to predict future disease outbreaks, riots and deaths — and hopefully prevent them.”

  2. Vicki McLean

    These hackers must be connected to the black hat HAARP (high frequency active AURORA research program), which is a US military weapon located in at least 13 sites across the world that created hurricanes Katrina, Rita, Ike, Irene and Sandy, Joplin tornadoes, Alaskan tornadoes, Japan earthquake, 2010 lightning strikes in Covington, La in just 30 minutes on the day of the Aurora shootings. Hackers trying to stop the destruction of HAARP all over the world was the reason for Aurora, Newtown and Dunblane, Scotland massacres!
    Linked to US Patent 6715341, which originated out of Heriot-Watt University, which was falsified and used to develop HAARP along with the theft of the Apple operating system as early as 1993! Both true owners, being James McLean and Steve Jobs, were murdered with parasite intentionally misdiagnosed as cancer and then the biological weapons being chemo and radiation introduced in the US as early as 1940 by Nazi spies’ Operation Paperclip to hasten these owners’ murders! How can the white hat hackers of this world unite to crash the evilly used HAARP systems located in 13 sites across the world?

  3. Gary Roberts

    This is a first step by companies to respond to our dictator. Providing research based on the needs of this planet’s inhabitants hopefully will lead to an inventory of our planet’s finite resources in an effort to know how to best distribute resources to all the inhabitants of our planet. Forty percent of our planet’s resources are controlled and absorbed by one percent of our planet’s inhabitants. The hold of resources by the forty percent is based on a delusional monetary Ponzi scheme. The reality that is eluding everyone is that currency has no tangible value. Money lost its value when it was divorced from the gold standard in the ’70s. I know I’m nuts, but not as nuts as the millions of people buying an illusion that is our monetary system, that today is used to distribute resources on our planet. Our planet is the dictator we need to recognize and to whom we need to respond.

  4. Steve Ardire

    Horvitz and Radinsky’s approach is similar to a few others I know, and Chris Ahlberg of Recorded Future is spot on saying that turning prototypes into a viable product requires quite a bit of development. Nonetheless, this current ‘small market’ for predictive tools is growing fast!