For all the talk about using big data and data science to solve the world’s problems — and even all the talk about big data as one of the world’s problems — it seems like we still have a long way to go.
Earlier this week, an annual conference on data mining, KDD 2014 for short, took place in New York with the stated goal of highlighting “data science for social good.” It’s a noble goal and, indeed, the event actually did highlight a lot of research and even some real-world examples of how data can help solve various problems in areas ranging from health care to urban planning. But the event also highlighted — to me, at least — some very real obstacles that stand in the way of using data science to solve society’s problems in any meaningful way at any meaningful scale.
Most of these obstacles have little to do with the data itself. It’s easier to gather and easier to analyze than ever before. Rather, the problem is that data scientists and researchers — even those who really care about tackling important issues — can often have a difficult time overcoming the much more powerful forces fighting against them.
We’ve covered all sorts of research projects over the past few years that looked into how data might be applied to various problems (bullying and HIV prevention are among the more interesting examples) and even a handful of projects that have actually been deployed in the real world. But the reality appears to be that most of them remain as research, promising proofs of concept that are rarely applied to analyzing actual data or helping actual people. Save for a few exemplars and areas with a lot of easy money at stake — there are all sorts of startups and large vendors tackling health care and agriculture, for example — there’s just not a lot of action.
I think there are three big forces fighting against the successful implementation of these data science techniques: fear, politics and the law. And although they’re all distinct in some ways, they’re also very closely connected.
Fear of the unknown
From a consumer perspective, the fear of the unknown is probably the biggest issue facing data mining efforts. Most consumers feel they need to be wary of all the data companies like Google and Facebook are collecting about them (and maybe they should be wary), but many consumers also don’t know exactly how those companies’ data mining efforts work or why exactly they’re so scary.
That attitude carries over into other areas. Dan Wagner, the co-founder and CEO of Civis, a Chicago-based startup that helps organizations (including a lot of nonprofit agencies) solve problems via data analysis, said during a talk at KDD on Sunday that data scientists are currently limited in their ability to work on some of the biggest problems in areas such as genomics, education and crime because people aren’t very willing to hand over the potentially sensitive data it might require to make truly groundbreaking findings.
Indeed, the backlash against a student-data collection company called inBloom was so great earlier this year that the company was forced to shut down in April. There has also been a lot of debate about medical privacy, in terms of both the laws regulating what hospitals can share and what people should be willing to fork over on their own.
Text on inBloom’s website explained its decision to shut down:
The use of technology to tailor instruction for individual students is still an emerging concept and inBloom provides a technical solution that has never been seen before. As a result, it has been the subject of mischaracterizations and a lightning rod for misdirected criticism. In New York, these misunderstandings led to the recent passage of legislation severely restricting the education department from contracting with outside companies like inBloom for storing, organizing, or aggregating student data, even where those companies provide demonstrably more protection for privacy and security than the systems currently in use.
I think part of the fear (which probably has only been bolstered by all the past year’s NSA revelations) is that there’s a trend toward ever-greater collections of data and ever more-detailed pictures of individuals. This is exemplified in a recent blog post by author and technology critic Nick Carr, who suggested that Google and Facebook should be more measured in, well, the number of things they measure. It’s a valid criticism when it comes to advertising, but I think it’s possibly overstated, especially when it comes to data mining in the name of social welfare.
Rather, one might argue the goal of data-collection efforts (whether it’s Google, your doctor or your school doing the collecting) is actually to minimize the number of things being measured. It’s just that we’re so early in this era of being able to measure and analyze so much that no one really knows the one variable, or handful of variables, in any given instance that someone can tweak to have the greatest effect. Are more police the answer to lower crime, or is there something else that would be easier and more effective? Are there things poor school districts can do to help overcome a lack of money and many children’s less-than-ideal home lives?
The truth is that in many cases we just don’t know unless we’re willing to part with some data early on, even if it’s data we’d rather not share with government or nonprofit agencies. Civis’s Wagner said his company is able to help overcome some of these issues by building trust slowly, perhaps solving an easy problem requiring relatively little or anonymous data first, and building from there.
However, he also acknowledged during an interview after his talk, “A lot of practitioners in this work can do a better job communicating what we’re doing.”
Politics: Where even the best data doesn’t matter
Politics is a trickier hurdle to overcome than fear. From small towns to Washington, D.C., government agencies often handle issues beyond the scale or jurisdiction of even the largest nonprofit agencies. Unfortunately, all the data and the best studies in the world often don’t seem to matter when elections are at stake.
Guns, drugs, income inequality … the list could go on. Rather than talking about the myriad studies or the new types of data we could gather to find out even more, politicians often fall back on ideological arguments meant to appease voters and campaign contributors. How many times, for example, have you heard a politician talking about minimum wage or economic mobility mention the findings of the Equality of Opportunity project?
During a panel discussion at KDD, Jens Ludwig, a University of Chicago economist and director of the school’s Crime Lab project, described the challenge of getting politicians to react to data as one of figuring out their mental models. For example, the Crime Lab conducted a research project that showed investing in social programs can have a beneficial effect on crime rates, and Chicago Mayor Rahm Emanuel jumped on that finding in order to push for increased spending on such programs.
However, Ludwig noted, it’s still unclear which thresholds and burdens of proof will elicit that kind of reaction. Will only entirely scientific, randomized experiments pass muster? Must politicians think backing the findings will please 51 percent of voters? Editorial boards?
He’s confident, much more so than fellow panelist Gavin Schmidt from the NASA Goddard Institute for Space Studies, that there’s a point where good enough data can actually sway politicians from their preconceived (or preordained) notions, but he thinks doing so will require a change in many projects’ designs. Rather than being about gathering the best data or showing strong correlations, Ludwig thinks they’ll have to be designed to show rigorous causal effects.
He said that, compared with the field of medicine, the field of violence prevention is still in the era of applying leeches. That’s in part because data surrounding the effectiveness of strategies such as stop-and-frisk and broken windows policing is too correlative and not causal enough. Tying them to subsequent drops or increases in crime doesn’t really answer the question of why crimes occur and whether the policies are necessary to prevent them.
Still, a skeptic would be fast to point to climate change as an area where seemingly all the causal data in the world doesn’t seem to matter much.
However, politicians and civil servants do more than just manage budgets or decide policies; they also make and enforce laws. And here, too, big data faces some major challenges, although more with regard to how we keep it in check than how we use it to solve societal problems.
During a four-hour KDD workshop about data ethics, presenters and participants spoke about some of the perils of big data, particularly around minimizing data collection, protecting privacy and reducing the risk of data-based discrimination. These concerns go hand-in-hand: the more data companies collect about people, the easier it is to draw inferences about who they are and what they’re into. The easier it is to do that, the easier it is to start discriminating, intentionally or not, based on factors such as race, sex, income or health. It’s also possible to make mistaken inferences that could negatively affect consumers.
The White House addressed a lot of these same concerns in a May report, but I’m not holding my breath that we’ll see much meaningful legislation or regulation come even from that.
One presenter, Josh Cowls from the Oxford Internet Institute, asked the question of why certain things we take for granted in the physical world now creep us out in the digital world — things such as targeted advertising or subtle attempts to influence our attitudes. (I think there are several answers, for what it’s worth.) Troy Raeder from ad-targeting startup Dstillery noted certain things his company’s models don’t track because of possible legal issues. An attendee expressed a desire for more clear-cut laws about data mining so it would be easier to conduct research and roll out products without a fear of doing something illegal or, at least, fineable.
Mark Latonero, who heads up the University of Southern California’s Technology & Human Trafficking project, spoke about the work his team is doing to try to identify potential victims of human trafficking by analyzing data from online classified ads and other sources. It’s potentially very important work, and while some other projects (and Google) are working with law enforcement on similar efforts, Latonero’s group isn’t. This is partially because universities typically like to stick to research rather than implementation, but one could also foresee some serious legal challenges to investigations based on data mining rather than evidence or specific complaints.
What’s so troubling is that despite so many valid concerns about big data, and so many good ideas for addressing them, there’s still not a whole lot of momentum toward actually getting those ideas out of academic papers and codifying anything in any uniform way. Mostly, privacy violations are dealt with on an ad hoc basis, via lawsuits that reap lots of cash for lawyers and not much for the plaintiffs, or maybe via the occasional FTC settlement.
The problem has a lot to do with fear, if you ask me. Fear that if we crack down too hard on web privacy, we’ll stifle innovation in an area touted as one of America’s great economic hopes. Fear that if we write rules too focused on today’s concerns (or, more likely, yesterday’s concerns) they’ll be obsolete in a year. Lawmakers are paralyzed because they just don’t know how to address the issue in an effective manner. In the meantime, it’s business as usual.
I don’t blame them. I don’t have the answers either. But if we actually believe that big data can help solve some of our toughest problems, or that big data can create new problems of its own, we do need to find a collective will to figure things out. Otherwise, there will continue to be a lot of good ideas and a lot of interesting studies offset by a whole lot of inertia on the ground.