
Summary:

A new research paper shows just how easy it is to identify individuals based on supposedly anonymous mobile-phone data, and this isn’t the first time supposedly anonymous data really wasn’t. But how do we balance the need for privacy with the value of these datasets?

Anonymous data is one of the staples of the big data movement, but there’s a dark side.

In theory, data from mobile phones lets us do things like map traffic patterns, while web-behavior data can be a boon to researchers and others trying to make sense of how people conduct their online lives. The thing is, it’s damn hard to keep that data anonymous. Perhaps all we can hope for is to keep potentially sensitive data out of the wrong hands.

The latest proof of how hard it is to anonymize data came earlier this week, when a group of MIT researchers published a paper based on their analysis of 1.5 million cell phone traces over 15 months inside a “small European country.” A press release highlighting the paper’s publication nicely sums up the findings, which are somewhat startling:

“Researchers … found that just four points of reference, with fairly low spatial and temporal resolution, was enough to uniquely identify 95 percent of them.

“In other words, to extract the complete location information for a single person from an ‘anonymized’ data set of more than a million people, all you would need to do is place him or her within a couple of hundred yards of a cellphone transmitter, sometime over the course of an hour, four times in one year. A few Twitter posts would probably provide all the information you needed, if they contained specific information about the person’s whereabouts.”

And assuming you’re concerned about protecting privacy, it gets worse:

“[T]he probability of identifying someone goes down if the resolution of the measurements decreases, but less than you might think. Reporting the time of each measurement as imprecisely as sometime within a 15-hour span, or location as imprecisely as somewhere amid 15 adjacent cell towers, would still enable the unique identification of half the people in the sample data set.”

All it takes to get started is a few pieces of data against which to compare the anonymized mobile data. “For re-identification purposes,” the authors write in the paper, titled Unique in the Crowd: The Privacy Bounds of Human Mobility, “outside observations could come from any publicly available information, such as an individual’s home address, workplace address, or geo-localized tweets or pictures.”
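To make that matching step concrete, here is a minimal Python sketch. It is not the researchers' code or method, just an illustration under stated assumptions: each trace is a set of (tower, hour) points, the pseudonyms and toy data are invented, and grouping "adjacent" towers is crudely approximated by integer-dividing a made-up tower ID.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# A point is (tower_id, hour); a trace is the set of points logged for one
# pseudonymous user. Both representations are assumptions for illustration.
Point = Tuple[int, int]


def coarsen(point: Point, towers_per_cell: int = 1, hours_per_bin: int = 1) -> Point:
    """Lower the spatial and temporal resolution by integer bucketing."""
    tower, hour = point
    return (tower // towers_per_cell, hour // hours_per_bin)


def candidates(
    traces: Dict[str, FrozenSet[Point]],
    observations: Iterable[Point],
    towers_per_cell: int = 1,
    hours_per_bin: int = 1,
) -> List[str]:
    """Return the pseudonyms whose coarsened trace contains every observation."""
    wanted = {coarsen(p, towers_per_cell, hours_per_bin) for p in observations}
    matches = []
    for pseudonym, trace in traces.items():
        coarse_trace = {coarsen(p, towers_per_cell, hours_per_bin) for p in trace}
        if wanted <= coarse_trace:  # all outside observations appear in the trace
            matches.append(pseudonym)
    return sorted(matches)


# Invented toy data: three "anonymized" users.
traces = {
    "u1": frozenset({(3, 8), (7, 12), (3, 19), (9, 30)}),
    "u2": frozenset({(3, 8), (5, 12), (3, 19), (2, 31)}),
    "u3": frozenset({(4, 9), (7, 12), (6, 20), (9, 30)}),
}

# Two outside observations, e.g. from geo-tagged posts (hypothetical).
known = [(3, 8), (7, 12)]

print(candidates(traces, known))                                      # ['u1']
print(candidates(traces, known, towers_per_cell=4, hours_per_bin=4))  # ['u1', 'u2']
```

In this toy set, two observations single out one user at full resolution, and blurring the data only widens the pool to two candidates. That is the same pattern the paper reports at scale: coarser data helps, but less than you might think.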


Have data, can track. Source: Nature Scientific Reports

We’ve been down this road before

From the Netflix paper. Source: University of Texas


This news might ring a bell to anyone who follows the world of web data. After Netflix released anonymous user data as part of its Netflix Prize competition in 2007, researchers were able to de-anonymize it using publicly available movie reviews from IMDB. In 2006, AOL released a bounty of supposedly anonymous search data for research purposes, but it was quickly mirrored onto public web sites and people began picking individual searchers out of the sea of anonymous identification numbers.

There are plenty of non-digital examples, too. The Unique in the Crowd authors point to one case where a medical database was analyzed against a voter list to discover a governor’s health records. In a 2007 post for Wired, security expert Bruce Schneier cited a couple of analyses of census data, including one using 1990 census data and proving that 87 percent of Americans could be identified using just their ZIP code, sex and date of birth.
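The mechanics behind both examples are simple enough to sketch in a few lines of Python. This is an illustration only, not the method used in either analysis: the field names, records and people below are invented, and the sketch assumes both data sets store ZIP code, sex and date of birth in the same format.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (zip, sex, date_of_birth): the three fields used as a quasi-identifier.
QuasiId = Tuple[str, str, str]


def quasi_id(record: Dict[str, str]) -> QuasiId:
    return (record["zip"], record["sex"], record["dob"])


def unique_fraction(records: List[Dict[str, str]]) -> float:
    """Share of records whose quasi-identifier appears exactly once."""
    counts = Counter(quasi_id(r) for r in records)
    unique = sum(1 for r in records if counts[quasi_id(r)] == 1)
    return unique / len(records)


def link(anonymous: List[Dict[str, str]], named: List[Dict[str, str]]) -> Dict[str, str]:
    """Attach a name to a diagnosis when the quasi-identifier is unique in both sets."""
    anon_counts = Counter(quasi_id(r) for r in anonymous)
    named_counts = Counter(quasi_id(r) for r in named)
    by_key = {quasi_id(r): r for r in anonymous}
    linked = {}
    for person in named:
        key = quasi_id(person)
        if anon_counts[key] == 1 and named_counts[key] == 1:
            linked[person["name"]] = by_key[key]["diagnosis"]
    return linked


# Invented "anonymized" medical records (no names).
medical = [
    {"zip": "00001", "sex": "M", "dob": "1950-01-01", "diagnosis": "condition A"},
    {"zip": "00001", "sex": "F", "dob": "1962-03-14", "diagnosis": "condition B"},
    {"zip": "00002", "sex": "F", "dob": "1980-11-02", "diagnosis": "condition C"},
    {"zip": "00002", "sex": "F", "dob": "1980-11-02", "diagnosis": "condition D"},
]

# Invented public voter list (names, no diagnoses).
voters = [
    {"name": "Alice Example", "zip": "00001", "sex": "F", "dob": "1962-03-14"},
    {"name": "Carol Example", "zip": "00002", "sex": "F", "dob": "1980-11-02"},
]

print(unique_fraction(medical))  # 0.5
print(link(medical, voters))     # {'Alice Example': 'condition B'}
```

Only the person whose combination of fields is unique gets re-identified here; the other voter shares her combination with two medical records and stays ambiguous. The census finding suggests that kind of ambiguity is the exception rather than the rule.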

And then there are those fitness-tracking devices. At our Structure: Data conference last week, Central Intelligence Agency CTO Ira “Gus” Hunt gave the audience (the whole world, really) a scare by noting that it’s possible to identify someone based solely on his gait. That kind of information might not get people lining up for web-connected pedometers and other fitness devices.

Any type of de-anonymization is only exacerbated in an era of social media. The University of Texas researchers who decoded the Netflix data were able to speculate on individuals’ political positions, sexual orientation and other characteristics, but we now give that information away for free on sites like Facebook, Twitter, Foursquare, you name it. If you’re inclined to stalk someone, steal identities or engage in any other malicious undertaking, access to names, photos, interests, location, checkins and other information makes for a hearty personal-data stew, and it just takes one piece to get the rest.

A choice between privacy and a better world?

However, if we can get past the inherent privacy concerns, these types of anonymous, aggregate data sets can be incredibly valuable. Companies such as Google, Apple and INRIX are using smartphones and in-vehicle devices to map traffic patterns and how people move throughout cities in efforts to improve both commute times and urban planning. Social scientists accessing data from companies such as Google and Facebook could learn a lot about the intricacies of online behavior. And predictive analytics platforms such as Kaggle present an opportunity to optimize everything from business processes to health care.

Source: INRIX


The holy grail of anonymous data lies in genomics and the hope that lots and lots of quality data will help researchers discover cures for diseases like cancer. Because of the relative uniqueness of each individual cancer case, researchers hope a massive pool of data on sequenced genomes will help them spot patterns and commonalities that no amount of traditional lab work will uncover.

Further complicating things is the fact that the companies delivering our favorite web services rely on our personal data to make money. Whether we like it or not, targeted advertising pays the bills for free services, and doing targeted advertising well requires a lot of personal data. One could argue that a major focus of the data science movement that has taken the world by storm is stitching together various pieces of anonymous data from across the web in order to create holistic images of consumers.

In fact, web companies have gotten so good at de-anonymizing data that the Federal Trade Commission has all but abandoned the term “personally identifiable information.” In a 2010 report on online privacy, the agency wrote that any guidelines it proposes will likely apply

“to those commercial entities that collect data that can be reasonably linked to a specific consumer, computer, or other device. This concept is supported by a wide cross-section of roundtable participants who stated that the traditional distinction between PII and non-PII continues to lose significance due to changes in technology and the ability to re-identify consumers from supposedly anonymous data.”

“Going forward,” the Unique in the Crowd authors conclude, “the importance of location data will only increase and knowing the bounds of individual’s privacy will be crucial in the design of both future policies and information technologies.” This rings equally true for every other type of personal data, especially given the relative ease with which they can be analyzed against each other to create a whole that is greater than the sum of its parts.

One has to wonder, though, what types of policies and technologies will come about to keep data anonymous and available to the people who need it while still maintaining its utility. Privacy is important, but is it worth the opportunity costs of not trying to solve the types of problems that large, anonymous data sets are ideal for solving? If true anonymization is really that difficult, perhaps the best bet is just to double down on security and try to ensure that valuable data — anonymous or not — doesn’t get into the wrong hands.


  1. There is:
    Privacy by Design in the Era of Big Data

    http://jeffjonas.typepad.com/jeff_jonas/2012/06/privacy-by-design-in-the-era-of-big-data.html

    Which I think can be broken by differential timing, like MIT used. I have to talk to Jeff again, but he’s aware of it (since he published, what took MIT that long?). The trick in the long run is to have a machine understand the data needed to answer a question; right now data is too public, or people don’t understand that data has connections.

  2. Statspotting.com Friday, March 29, 2013

    Here is the irony: the content is targeted to a crowd that has willingly shared its identity with a company that would sell their data for billions. Facebook – Why did we do it?

    http://statspotting.com/facebook-why-did-we-do-it/

  3. the answer is no: anonymous means “invisible” – that is, no data. “secure” merely means that any entities that have your data use it *as they see fit*, which is no one’s definition of anonymous.

    there’s no sense in whinging about how non-anonymous our public behavior has become. (why would anyone ever think public behavior was anonymous in the first place?) what we *should* be clamoring for is a better set of laws about the use of data. more specifically, the presentation of incorrect data – what people back in the day used to call “slander” :)

    we can’t now and never will be able to control the information collected about us based on our public behavior. as companies trade more and more data about us, they need to be forced to do it responsibly – that is, massively punitive, *criminal* penalties for trading false data. there is, however, no ethical basis for laws restricting the trade in actually true data.

  4. Consumer access to and control of their personal data is the only solve for this conundrum. Enliken conducted a survey (http://enliken.com/discover/) in which they discovered that data tracking companies have ~50% accurate data on consumers. Despite being a $100bn+ business, the ad infrastructure in the U.S. is wildly inefficient if it is operating with such an inaccurate data set.
    With consumers having control over their data (control, meaning the ability to turn on and off the spigot of their data when and to whom they see fit), it would serve the purpose of de-leveraging the ad infrastructure, data trackers, and others that traffic in personal data. Merchants and service platforms that are currently funding the ad-targeting business would have no need to purchase 3rd party data that is 50% accurate if they are able to “purchase” it directly from consumers… who happen to be their target anyway.
    I don’t think passing laws necessarily solves the privacy problem. In fact, more laws are probably the last thing we need. I think the problem will and should be solved by platforms and technologies.

  5. In my personal opinion, once you go out there and allow social networking sites to access your information, there goes your privacy out the door. Losing it is the price to pay for all these mobile technology developments that people enjoy nowadays.

  6. Derrick Harris Friday, March 29, 2013

    These are all good comments, thanks. I tend to agree that we do inherently sacrifice some privacy when we use web services, but I don’t necessarily think it’s such a bad thing (depending on the service provider).

    I, for one, like free Gmail and even services such as Google Now — and I also trust Google to keep my data safe. I also think there’s value in using web, mobile and other data for broader research purposes.

    My concern is how we keep that data safe in the latter situations, where showing it to anyone outside the organization originally entrusted with it immediately exposes it to being de-anonymized.

    1. The guys at Kaggle just sent me this paper about how the data in its Heritage Health Prize contest was designed to thwart de-anonymization: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3374547/ .

  7. Thanks for this great post. As long as figuring out a user’s identity from cellphone location data takes the resources and expertise of a team of MIT researchers, I suspect most of us will take comfort in the idea that we have anonymity for all practical purposes. But what if that comes to an end? Cell phone and Web browsing data are problematic enough – the convergence between public surveillance and technologies like facial recognition and Google Glass is poised to take this issue to the next level.

    In short, I don’t think anonymization is pointless, but it’s only a small part of the effort businesses and other organizations need to make in order to guard individual privacy. Otherwise, the backlash could turn off the spigots of data for a lot of useful activities, as you point out.
