OK, everyone gets it — Hadoop is great and companies need to start analyzing all their data lest they lose their competitive edge. Gotcha. But there’s a heck of a lot more to think about when it comes to big data, ranging from where companies will actually find workers to how they’ll deal with an impending policy showdown over privacy.
Last week, I had a chance to stop by the Churchill Club’s “The Big Data Effect” event, the highlight of which was a panel discussion (watchable here) featuring SAS SVP and CTO Keith Collins; Factual Founder and CEO Gil Elbaz (who also founded Applied Semantics); venture capitalist Ping Li of Accel Partners; EMC Greenplum Co-Founder and CTO Luke Lonergan; and @WalmartLabs leader and SVP of Walmart global e-commerce Anand Rajaraman (who also co-founded Junglee and Kosmix). They offered some great insights into where big data is headed and how organizations might best plan to capitalize on it.
1. Lots of data must change the way you think about data
Aside from learning what questions to ask of the new types of data they capture, companies just getting started with big data have to fundamentally rethink how they go about their analytics efforts. Rajaraman, who also teaches data mining at Stanford University, illustrated this problem with a classroom example. When he tells students they have 10TB of data to work with, the first thing many of them want to do is sample it, he said. Such a strategy harks back to the legacy world, where time and capacity constraints made sampling data sets the norm. The value in big data, he said, is in utilizing all the data to obtain the truest possible insights.
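A toy illustration of Rajaraman's point, using a synthetic data set of my own invention (the panel gave no code): rare-but-important signals, like fraudulent transactions, can vanish entirely from a small sample while a full scan always finds them.

```python
import random

random.seed(42)

# Synthetic "big" data set: 1,000,000 transactions, only 20 fraudulent.
transactions = ["ok"] * 999_980 + ["fraud"] * 20
random.shuffle(transactions)

# Legacy approach: analyze a 0.1% sample. With an expected hit count of
# 0.02, the rare signal is almost always missed entirely.
sample = random.sample(transactions, 1_000)
print("fraud cases in sample:", sample.count("fraud"))

# Big-data approach: scan everything. The signal is always there.
print("fraud cases in full set:", transactions.count("fraud"))
```

The sample tells you fraud barely exists; the full scan tells you the truth.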
2. Data has character and value, however difficult to define
For Elbaz, whose business at Factual is providing public access to large data sets, there’s a fundamental issue around how to differentiate between pieces of data. For starters, it’s important to be able to characterize data in a way that lets its consumers know how much stock to put in it. If it comes from an untrusted source or is of questionable accuracy, it might be wise to let users know of the low confidence level. Or, if high quality is all that matters, figure out a way to weed out such data within the service itself.
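Both options Elbaz described can be sketched with a minimal record shape. This is purely illustrative — the field names and confidence scale are my assumptions, not Factual's actual schema — but it shows the two strategies: surface a confidence score to consumers, or filter low-confidence data inside the service.

```python
from dataclasses import dataclass

# Hypothetical record shape: each datum carries its provenance and a
# confidence score so consumers know how much stock to put in it.
@dataclass
class Datum:
    value: str
    source: str
    confidence: float  # 0.0 (untrusted) .. 1.0 (verified)

records = [
    Datum("123 Main St", source="verified_feed", confidence=0.95),
    Datum("125 Main St", source="web_scrape", confidence=0.40),
]

# Option 1: expose the confidence level and let consumers decide.
for r in records:
    print(f"{r.value} (confidence {r.confidence:.0%})")

# Option 2: weed out low-confidence data within the service itself.
high_quality = [r for r in records if r.confidence >= 0.8]
print("records kept:", len(high_quality))  # 1
```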
Elbaz also noted a lack of consensus on metrics for valuing data. Generally speaking, fresh data is probably better, he said, although there certainly are older data sets that can be very valuable. And is scarce data necessarily more valuable than widely accessible data? Assuming data marts become common, there’ll likely have to be a way to determine what data is worth what price so that data merchants can distinguish their wares and data consumers can justify the prices they pay.
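Since the panel noted no consensus metric exists, here is one entirely hypothetical way a data merchant might make the freshness-versus-scarcity trade-off concrete. The function and its weights are my invention, just to show the shape such a valuation could take.

```python
# Hypothetical valuation sketch: newer data and harder-to-find data both
# score higher. The weighting is arbitrary — the point is only that a
# marketplace would need *some* agreed-upon function like this.
def data_value_score(age_days: int, num_sellers: int, base_value: float = 1.0) -> float:
    freshness = 1.0 / (1.0 + age_days / 365.0)  # decays as data ages
    scarcity = 1.0 / num_sellers                # exclusive data is worth more
    return base_value * freshness * scarcity

fresh_exclusive = data_value_score(age_days=30, num_sellers=1)
stale_commodity = data_value_score(age_days=730, num_sellers=10)
print(fresh_exclusive > stale_commodity)  # True
```

Of course, as Elbaz pointed out, some old data sets remain very valuable — which is exactly why a single formula like this wouldn't survive contact with a real market.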
3. Getting privacy right won’t be easy
When asked about privacy, SAS’s Collins made an observation I’ve made before, which is that attempts by governments to regulate it could hurt the practice of analytics. A big part of the analytic process, he explained, is continuous experimentation to follow new leads or to determine whether data-based applications are living up to their full potential. In areas such as fraud detection and medical research, experimentation is critical. It’s simply not feasible to obtain informed consent or get agency approval every time an organization wants to use data in a new way.
Lonergan seemed to agree somewhat, suggesting that perhaps a better understanding of data practices would help citizens feel more comfortable about privacy and about sharing information. In health care, for example, he thinks people might be more willing to share even personal data based on the understanding that they’ll benefit from it later. I think this is true across many fields, but I also think many people conflate privacy ethics with data security, which is a whole other issue.
With regard to web privacy, Elbaz had a particularly interesting suggestion that perhaps could use some discussion by the greater web community. He envisioned a universal privacy standard similar to open source code, which would be updated in a similar manner (think Privacy Standard 1.0, 1.1, etc.) and that would let visitors know what to expect. No one would be forced to adopt such a standard, of course, but consumers who don’t want to read new policies might view deviations with a skeptical eye.
4. Follow the regulations
Collins doesn’t think regulations are all bad, though. There’s usually a value proposition surrounding particularly burdensome regulations, he said, because the more data companies are forced to retain, the more opportunity there is to analyze it. That seems to go for both organizations trying to make lemonade out of lemons, as well as for big data vendors looking for new industries to target.
5. Data is the fourth paradigm of science
Previously, the widely accepted scientific paradigms were empiricism, theory and simulation (or computation), but Walmart’s Rajaraman noted an increasingly accepted belief that data is now the fourth paradigm. What he meant is that rather than relying on empirical evidence in the form of natural phenomena or even simulating complex phenomena on computers, data — and lots of it — helps provide new levels of truth. One could simply analyze it and let the data tell its own story, or use it to augment a computer simulation to a much higher degree of accuracy. One need only look at the proliferation of high-speed research networks to see evidence of a paradigm shift in which access to and analysis of incredible amounts of data is the foundation of many major scientific efforts.
However, if all this is a little too deep, Rajaraman also shared a more mundane insight on the value of data: Walmart now carries kits for making cake pops (essentially, cake on a lollipop stick) because sentiment analysis of Twitter data showed that people love cake pops. Go, big data!
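The cake-pop story boils down to a simple idea: mine social posts for product sentiment and let the tally guide merchandising. A toy keyword tally — emphatically not Walmart's actual pipeline, just the shape of the idea:

```python
# Toy sentiment scorer: count positive vs. negative keywords in posts
# mentioning the product. Real systems are far more sophisticated, but
# the merchandising signal works the same way.
POSITIVE = {"love", "amazing", "delicious", "want"}
NEGATIVE = {"hate", "gross", "awful"}

tweets = [
    "I love cake pops so much",
    "cake pops are delicious",
    "ugh I hate waiting in line",
]

def sentiment(text: str) -> int:
    words = set(text.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

# Only posts that actually mention the product count.
mentions = [t for t in tweets if "cake pops" in t.lower()]
avg = sum(sentiment(t) for t in mentions) / len(mentions)
print("stock the cake-pop kits:", avg > 0)  # True
```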