Big data is often talked about as a phenomenon that lets organizations create narratives from their volumes of data. That is an apt characterization when we are talking about connecting the dots among disparate and possibly disconnected data sets. However, when we are talking about anything involving human beings — customer behavior, the spread of disease, attitudes toward products or people — all that many current analytical efforts deliver is the end of the book, or what happened.
The end is a fine place to start with regard to big data, but working backward — that is, figuring out why people made the decisions they made — might prove more valuable for everyone involved, even if it requires some manual labor beyond capturing data or obtaining interesting data sets.
Three recent findings, all from entirely different fields, serve as prime examples of what I am talking about.
1. Winter storms mean more illness. So what?
A team of researchers from the University of Central Florida recently won SAS’ Data Mining Shootout by analyzing weather data, hospital check-in data and diagnostic data to determine that winter storms have the most significant impact on our health compared with other types of storms, such as windstorms and floods. The results were insightful, probably most of all to hospitals, which might want to increase staffing or otherwise prepare when they know a winter storm is coming. But those results don’t tell us anything about why winter storms have such an effect.
It might be useful to know what patients were doing during that storm, or immediately before or after, that resulted in their getting sick enough to require hospital care. Was it because they were stuck indoors in close proximity to other sick people? Maybe it was because many people are more willing to spend time in a hospital during the doldrums of winter than they are during the three more-enjoyable seasons of the year.
Without some qualitative data to go along with the numbers, it is difficult for individuals to know how to change their behavior to avoid hospital trips or for health agencies to advocate for changes in citizens’ behavior. So in this case, knowing the end of the story is helpful, but knowing the beginning could actually affect public health.
2. Why do Mac users book more-expensive hotels?
In November, Orbitz CEO Barney Harford noted how his company’s big data efforts yielded the insight that Mac users spend an average of $20 more per night on hotel rooms than do Windows users. The obvious takeaway if you are Orbitz is to present Mac users with pricier hotel rooms by default when they are not specifically sorting by price, location or some other factor. Statistically, they are more likely to pay more, so Orbitz might as well cater to their perceived preferences and make a little more money in the process.
After I wrote about this on GigaOM, several people commented, both on the post and to me on Twitter, that this is not surprising at all. Macs cost so much more than comparable PCs, they reasoned, that Mac users must be rich and/or innately willing to spend more on everything. I happen to think this is faulty logic, but it does raise a question worth examining: Why do people — even those who are far from rich — pay the Mac premium? In my own experience, it was a combination of form, function and perceived value for my money, although the sticker price did make it a difficult decision.
Assuming other Mac users share those sentiments, it would seem there is an exploitable trend that goes beyond the conclusion of “Mac users spend more.” Perhaps Orbitz and other e-commerce sites — maybe even brick-and-mortar retailers — that want to cater to Mac users (or like-minded consumers) could take extra steps to present them with products that meet the actual criteria that led them to purchase Macs in the first place. I don’t think Mac users are going to buy garbage just because it costs more, but perhaps they will pay for quality, whatever the price.
However, unless Orbitz and other retailers have untapped sources of qualitative data on buying decisions or derive some clever methods for determining a causal relationship other than “Macs equal expensive hotel rooms,” they might need to look elsewhere to figure out the reason behind the result.
3. Do programmers get the holiday blues?
Fergal Glynn at Veracode recently wrote a blog post highlighting his findings that, at least for the code his company tests, the frequency of flaws peaks sharply in October and November before finally settling down again in December. He speculates about several possible reasons:
Maybe the build-up to Thanksgiving has developers distracted? Are developers adjusting after the Summer break when “the living is easy” and the roads are quiet? Fall brings the extra pressure of dropping kids at school and rushing in the evenings to pick them up after sports. There is also the added pressure to produce a high volume of code to meet end of year deadlines and releases. Scientific literature is full of studies on the effects of stress on higher cognitive activities, and it’s reasonable to assume that application developers, like most of us, may respond to added pressure by making more mistakes in the code they write.
Glynn’s study was actually inspired by another blog post, written for Fast Company by CloudFlare’s Shawn Graham. CloudFlare’s analysis shows a significant spike in activity around certain holidays — Halloween, May Day, Mother’s Day and Veterans Day — and significant drops on others — National Back to Church Day and China National Day — along with minor drops around Easter and Thanksgiving. Addressing the decreases, Graham suggests the following explanation:
One could be the result of divine intervention from a higher cyber power. The other could be the result of a spike in infected computers in China that are offline and therefore can’t be used to launch attacks at that time.
These explanations make sense, of course, but they lack authority without some qualitative data to back them up. It might be difficult to verify the reasons behind spikes and drops in cyberattacks, but it could be relatively easy to figure out why programmers seem to have a preholiday breakdown. If companies know the why, not just the what, they can take proactive steps to address the problems, increasing efficiency by not having to fix buggy code and perhaps improving morale among their development teams.
A focus group for the 21st century
Although all of these findings can have a meaningful impact by themselves, it seems they could have a much greater impact if interested parties were able to dig a little deeper into why they found what they found. One option is to use different data sets as proxies for actual customer information (as town-car service Uber did very well, using crime data as a proxy for information on where people work and play). However, this method involves substituting assumptions for direct evidence of causation. Getting actual information from the persons of interest would be much more effective.
The answers might be easy enough to find when you are talking about activity within an organization. Trends such as seasonally spotty code or increased sick days could be explained by simply talking to employees and determining what was going on in their work and/or personal lives during any given time frame. If the responses were captured digitally, one could even turn analytics tools on them to search for trends that might not be apparent to an individual reading or listening to hundreds of responses.
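As a rough illustration of what “turning analytics tools on the responses” could look like, here is a minimal sketch in Python that tallies the most common words in free-text employee responses by month. All names, the stop-word list and the sample responses are hypothetical; a real system would use proper text analytics rather than word counting.

```python
from collections import Counter, defaultdict

def seasonal_themes(responses, top_n=3):
    """Return the most common non-trivial words per month.

    `responses` is a list of (month, text) pairs. The stop-word list
    is a tiny illustrative placeholder.
    """
    stop = {"the", "a", "an", "i", "my", "me", "and", "to", "of", "in"}
    by_month = defaultdict(Counter)
    for month, text in responses:
        words = [w.strip(".,!?").lower() for w in text.split()]
        by_month[month].update(w for w in words if w and w not in stop)
    return {m: [w for w, _ in c.most_common(top_n)] for m, c in by_month.items()}

# Hypothetical responses to "What was going on for you that month?"
responses = [
    ("Nov", "Deadline pressure and school pickups left me rushed"),
    ("Nov", "End-of-year deadline crunch, code reviews rushed"),
    ("Jul", "Quiet roads, relaxed schedule"),
]
print(seasonal_themes(responses))
```

Even this crude tally would surface that “deadline” and “rushed” dominate the November responses, the kind of pattern a person skimming hundreds of answers might miss.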
In the case of hospital visits after winter storms, perhaps the answer is to share information among researchers and health practitioners. Made aware of such findings, doctors could start asking patients about more than just their symptoms. If those same researchers went back and viewed doctor-collected data in aggregate, they would likely see trends start emerging about why certain storms, or any other phenomena, have the health effects they do.
It is also probably fairly simple to dig into the underlying reasons for particular behaviors when organizations can tie them to internal promotions or activities. If, for example, sales of corn syrup and social-media engagement both jumped along with a preholiday sale on baked goods and a Twitter campaign based on sharing candy recipes, a grocery store would be able to connect the dots.
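Connecting those dots is, at its simplest, a correlation check. The sketch below computes a Pearson correlation coefficient from first principles over entirely made-up daily figures for the hypothetical baked-goods promotion; the numbers are illustrative, not real data.

```python
# Hypothetical daily figures for the week of the candy-recipe Twitter
# campaign: recipe shares alongside corn-syrup unit sales.
engagement = [12, 15, 40, 85, 90, 60, 30]
sales      = [110, 115, 160, 240, 250, 200, 150]

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(engagement, sales)
print(f"correlation: {r:.2f}")  # a value near 1.0 means the two moved together
```

A high coefficient only shows the two series moved together, of course; the grocery store can claim a causal story here precisely because it knows what internal activity kicked off the spike.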
But what about those trends that are both anonymous and broad, such as in the Mac users example above? Trends like these present a trickier situation, because it would likely be infeasible to reach out to millions of buyers in order to get their input, and privacy implications almost certainly would restrict such activity anyhow. Open-ended survey questions or other tactics might be a great way to find qualitative data that can be linked to and explain quantitative data, but response rates could be low and possibly skewed, because all the respondents already share the personality trait of being willing to write down their thoughts for merchants to analyze.
Erik Huddleston, the CTO of social-business consultancy Dachis Group, thinks the answer might lie in using Twitter and other social networks as a post hoc focus group. Different forms of social media are already famously used as worldwide focus groups for gauging consumer sentiment — Twitter, in particular, is already used to monitor everything from the popularity of World Series participants to which stocks might get hot — but they also could be used to obtain psychographic information about a particular demographic that displays particular behaviors.
Assuming an organization can’t (or shouldn’t) track down specific individuals, it still could create a sort of long-running focus group based on demographic factors it already knows contribute to certain behaviors. At that point, it is a matter of monitoring discussions for information to help explain the behavior that has already been uncovered (e.g., Mac users’ booking pricier hotel rooms). That might be easy enough — individuals might just come out and say why they love something — or it might require some natural-language processing and semantic algorithms to identify related terms and their context or maybe even to uncover relatively unrelated trends among that demographic that speak to their general habits and behaviors.
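At its simplest, the kind of monitoring described above might start as keyword matching before graduating to real natural-language processing. The sketch below tallies how often candidate “reason” words co-occur with mentions of a demographic marker in a stream of social posts. The posts, term lists and function name are all hypothetical.

```python
from collections import Counter

def why_themes(posts, demographic_term, reason_terms):
    """Tally how often candidate 'reason' words co-occur with a
    demographic marker (e.g. 'mac') in social posts.

    Keyword matching is a crude stand-in for the NLP and semantic
    analysis a production system would need.
    """
    counts = Counter()
    for post in posts:
        words = {w.strip(".,!?").lower() for w in post.split()}
        if demographic_term in words:
            counts.update(words & reason_terms)
    return counts

# Hypothetical posts from users in the demographic of interest.
posts = [
    "Love the design of my Mac, worth every penny",
    "My Mac just works, the quality is unmatched",
    "Booked a cheap flight, no complaints",
]
themes = why_themes(posts, "mac", {"design", "quality", "price", "worth"})
print(themes.most_common())
```

On this toy input, “design,” “quality” and “worth” would each register against Mac mentions while the unrelated post is ignored — a first pass at the psychographic signal Huddleston describes, before any semantic algorithms enter the picture.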
However it is accomplished, it seems like there should be a lot of organizations in the business of determining the hows and whys of people’s behaviors. No doubt, some forward-thinking organizations are already headed down this path, but I haven’t seen evidence to suggest such efforts are commonplace. Big data has already proved its worth in helping organizations uncover trends that tell the ending of the story — Act A has Effect 1, or Consumer B likes Product 2 — but changing certain behaviors or capitalizing on them across a broader spectrum might require figuring out how we got there.