In the past year, big data has emerged as one of the most closely watched trends in IT. Organizations today generate more data in a single day than the entire Internet generated as recently as 2000. The explosion of “big data”, much of it in complex and unstructured formats, has presented companies with a tremendous opportunity to leverage their data for better business insights through analytics.
Wal-Mart was an early pioneer in this field, using predictive analytics to identify customer preferences on a regional basis and stock its branch locations accordingly. It was an incredibly effective tactic that yielded strong ROI and allowed the company to separate itself from the retail pack. Other industries took notice of the success Wal-Mart gleaned from processing and analyzing its data, and began to employ similar tactics.
While data analytics was once considered a competitive advantage, it’s increasingly seen as a necessity for enterprises, to the point that those not employing some kind of analytics are at a competitive disadvantage. Driven by the rise of modern statistical languages like R, there’s been a surge in enterprises hiring data analysts, which has in turn given rise to the larger data science movement. Data is a huge asset for enterprises, and they’re beginning to treat it accordingly.
For all the talk about the need to effectively analyze your data, though, there’s been relatively little written about how organizations are using data to achieve actionable results. With that in mind, here are five use cases involving analyses of large data sets that brought about valuable new insight:
- NYU Ph.D. student conducts comprehensive analysis of Wikileaks data for greater insight into the Afghanistan conflict: Drew Conway is a Ph.D. student at New York University who also runs the popular, data-centric Zero Intelligence Agents blog. Last year, he analyzed several terabytes’ worth of Wikileaks data to determine key trends in U.S. and coalition troop activity in Afghanistan. Conway used the R statistics language first to sort the overall flow of information in the five Afghanistan regions, categorized by type of activity (enemy, neutral, ally), and then to identify key patterns in the data. His findings gave credence to a number of popular theories on troop activity there: that there were seasonal spikes in conflict with the Taliban, and that most coalition activity stemmed from the “Ring Road” that surrounds the capital, Kabul, to name two. Through this work, Conway helped the public glean additional insight into the state of affairs for American troops in Afghanistan and the high degree of combat they experienced there.
- International non-profit organization uses data science to confirm Guatemalan genocide: Benetech is a non-profit organization that has been contracted by the likes of Amnesty International and Human Rights Watch to address controversial geopolitical issues through data science. Several years ago, it was contracted to analyze a massive trove of secret files from Guatemala’s National Police that were discovered in an abandoned munitions depot. The documents, of which there were over 80 million, detailed state-sanctioned arrests and disappearances during the country’s civil conflict, which lasted from 1960 to 1996. There had long been whispers of a genocide against the country’s Mayan population during that period, but no hard evidence had previously emerged to verify these claims. Benetech’s scientists drew a random sample of the documents and analyzed its content for details on missing victims of the conflict. After exhaustive analysis, Benetech was able to come to the grim conclusion that genocide had in fact occurred in Guatemala. In the process, it was able to give closure to grieving relatives who had wondered about the fate of their loved ones for decades.
- Statistician develops innovative metrics tracking for baseball players, gains widespread recognition and a job with the Boston Red Sox: Bill James (he of Moneyball fame) is a well-known figure in the worlds of both baseball and statistics at this point, but that has not always been the case. James, a classically trained statistician and avid baseball fan, began publishing research in the early 1970s that took a more quantitative approach to analyzing the performance of baseball players. His work focused on providing specific metrics that could empirically support or refute specific claims about players, be it the number of runs they contributed in a given season or how their defensive abilities contributed to or detracted from a team’s success. James’ approach became known as sabermetrics and has since expanded to incorporate a wide range of quantitative analyses for measuring baseball performance. Over time, sabermetrics has gained such wide recognition in baseball that it’s now employed by all 30 Major League Baseball teams for tracking player metrics. In 2003, James was named Senior Advisor of Baseball Operations by the Boston Red Sox, a position he holds to this day.
- U.S. government uses R to coordinate disaster response to BP oil spill: In the early days of last year’s Deepwater Horizon disaster, the rate of oil flowing from the spill was of primary concern; estimating it accurately was key to coordinating the scale and scope of the U.S. government’s response to the emergency. The National Institute of Standards and Technology (NIST) was charged with making sense of the varying estimates from both BP and independent third parties. To do so, NIST used the open source R language to run an uncertainty analysis that harmonized the estimates from the various sources into actionable intelligence around which disaster response efforts could be coordinated.
- Medical diagnostics company analyzes millions of lines of data to develop first non-intrusive test for predicting coronary artery disease: CardioDX is a relatively small, Palo Alto, Calif.-based company that performs genomic research. One of their major initiatives over the past several years was developing a predictive test that could identify coronary artery disease in its most nascent stages. To do so, researchers at the company analyzed over 100 million gene samples to ultimately identify the 23 primary predictive genes for coronary artery disease. The resulting test, known as the “Corus CAD Test,” was recognized as one of the “Top Ten Medical Breakthroughs of 2010” by TIME Magazine.
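To give a flavor of the statistics behind these stories, consider the kind of sampling at the heart of the Benetech example: estimating how often something occurs across a huge archive from a random sample of it. Benetech’s actual methodology was far more sophisticated and isn’t detailed here; this is a minimal Python sketch of the basic idea, using entirely synthetic data (the archive size, sample size, and 5% rate are invented for illustration):

```python
import random

def estimate_total(population_size, sample_indicators, z=1.96):
    """Estimate how many items in a population have some attribute,
    given a simple random sample of 0/1 indicators (1 = has attribute).
    Returns (point_estimate, margin_of_error) via the normal approximation.
    """
    n = len(sample_indicators)
    p = sum(sample_indicators) / n   # sample proportion
    se = (p * (1 - p) / n) ** 0.5    # standard error of the proportion
    return population_size * p, population_size * z * se

# Synthetic "archive": 80,000 records, roughly 5% of which mention the
# event of interest. (Invented numbers, not the Guatemalan police data.)
random.seed(42)
archive = [1 if random.random() < 0.05 else 0 for _ in range(80_000)]

# Sample 2,000 records and project the count back to the full archive.
sample = random.sample(archive, 2_000)
estimate, moe = estimate_total(len(archive), sample)
print(f"Estimated count: {estimate:.0f} +/- {moe:.0f} "
      f"(true count: {sum(archive)})")
```

The payoff is that reading 2,000 documents, rather than 80 million, yields a defensible estimate with a quantified margin of error.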
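The NIST example, for its part, involves reconciling several conflicting estimates, each with its own uncertainty, into one number that can drive decisions. NIST’s actual R analysis isn’t described in detail here; the Python sketch below illustrates one common approach to the problem, inverse-variance weighting, using made-up flow-rate figures:

```python
def pool_estimates(estimates):
    """Combine independent (value, standard_error) estimates into a single
    pooled estimate using inverse-variance weighting: estimates with
    smaller uncertainty get proportionally more weight.
    Returns (pooled_value, pooled_standard_error).
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * value for w, (value, _) in zip(weights, estimates)) / total
    return pooled, (1.0 / total) ** 0.5

# Hypothetical flow-rate estimates (thousands of barrels per day) with
# their standard errors -- invented numbers, not the actual spill data.
sources = [(60.0, 10.0), (45.0, 15.0), (55.0, 5.0)]
value, se = pool_estimates(sources)
print(f"Pooled estimate: {value:.1f} +/- {se:.1f}")
```

Note that the pooled standard error is smaller than any single source’s, which is exactly why harmonizing multiple imperfect estimates beats picking one of them.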
These are but a few brief examples of the exciting work that’s being undertaken in the rapidly growing discipline of data science. More and more, data analysis is being relied on to provide context for critical business decisions, a trend that promises to increase as data sets grow larger and more complex and scientists continue to push the limits of statistical innovation.
David Smith is vice president of community at Revolution Analytics, a company founded in 2007 to foster R analytics by creating programs to make it easier for data scientists to analyze large amounts of data.