Summary:

How does the NSA analyze all the data it’s collecting from cell phone users? With a massive database system built with just such scale and workloads in mind.

The NSA has more data than we can fathom. Scary, but it's unlikely anyone's data really stands out.

The National Security Agency might not have the names of Verizon’s wireless customers, but the agency probably can figure out what they’re up to if it’s so inclined. The metadata Verizon has provided the NSA — phone numbers, numbers called, duration of calls, location — is a veritable treasure trove to an organization with the right analytic skills and the right tools. The NSA has both.

There are numerous methods the NSA could use to extract some insights from what must be a mind-blowing number of phone calls and text messages, but graph analysis is likely the king. As we’ve explained numerous times over the past few months, graph analysis is ideal for identifying connections among pieces of data. It’s what powers social graphs, product recommendations and even some fairly complex medical research.

My LinkedIn social graph

But now it has really come to the fore as a tool for fighting crime (or intruding on civil liberties, however you want to look at it). The NSA is storing all those Verizon (and, presumably, other carrier) records in a massive database system called Accumulo, which it built itself (on top of Hadoop) a few years ago because there weren’t any other options suitable for its scale and requirements around stability or security. The NSA is currently storing tens of petabytes of data in Accumulo.

For a more thorough description of Accumulo and the NSA infrastructure, read our post “Under the covers of the NSA’s big data effort.”

In graph parlance, vertices (also called nodes) are the individual data points (e.g., phone numbers or social network users) and edges are the connections among them. In late May, the NSA released a slide presentation detailing how fast Accumulo is able to process a 4.4-trillion-node, 70-trillion-edge graph. By way of comparison, the graph behind Facebook’s Graph Search feature contains billions of nodes and trillions of edges. (In the low trillions, from what I understand.)
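To make the vertex-and-edge idea concrete, here is a minimal Python sketch that turns a handful of call records into a graph. The record layout, the phone numbers and the `build_call_graph` helper are all assumptions made for illustration; nothing here reflects how Accumulo or the NSA actually stores data.

```python
from collections import defaultdict

# Hypothetical call records: (caller, callee, duration_seconds).
# The numbers are invented for illustration.
records = [
    ("555-0101", "555-0102", 120),
    ("555-0101", "555-0103", 45),
    ("555-0102", "555-0103", 300),
    ("555-0101", "555-0102", 60),
]


def build_call_graph(records):
    """Return an undirected adjacency map: number -> {neighbor: call_count}."""
    graph = defaultdict(lambda: defaultdict(int))
    for caller, callee, _duration in records:
        # Each phone number is a vertex; each pair that has talked is an edge.
        graph[caller][callee] += 1
        graph[callee][caller] += 1
    return graph


graph = build_call_graph(records)
vertex_count = len(graph)
# Every edge is stored once per endpoint, so halve the total.
edge_count = sum(len(neighbors) for neighbors in graph.values()) // 2
```

The call count kept on each edge is one simple way to express how strong a connection is; a real system would presumably weigh much more than that.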

So, yes, the NSA is able to easily analyze the call and text-message records of hundreds of millions of mobile subscribers. It’s also building out some massive data center real estate to support all the data it’s collecting.

How might a graph analysis work within the NSA? The easy answer, which the government has acknowledged, is to figure out who else is in contact with suspected terrorists. If there’s a strong connection between you and Public Enemy No. 1, the NSA will find out and get to work figuring out who you are. That could be via a search warrant or wiretap authorization, or it could conceivably figure out who someone likely is by using location data.
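Finding everyone within a few "hops" of a known number is a textbook graph problem, and breadth-first search is one standard way to solve it. The sketch below is only an illustration under assumed data: the `contacts` map, the labels in it and the `within_hops` function are hypothetical, and it is not a description of the agency's actual method.

```python
from collections import deque

# Hypothetical adjacency map: number -> set of numbers it has exchanged calls with.
contacts = {
    "suspect": {"a", "b"},
    "a": {"suspect", "c"},
    "b": {"suspect"},
    "c": {"a", "d"},
    "d": {"c"},
}


def within_hops(graph, start, max_hops):
    """Breadth-first search: map each number within max_hops of start to its hop distance."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == max_hops:
            continue  # Do not expand past the hop limit.
        for neighbor in graph.get(current, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    del distances[start]  # The starting number is not its own contact.
    return distances


two_hop = within_hops(contacts, "suspect", 2)
```

With the sample data, `a` and `b` are one hop from the suspect and `c` is two hops away, while `d` falls outside the two-hop limit.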

Having such a big database of call records also provides the NSA with an easy way to go back and find out information about someone should their number pop up in a future investigation. Assuming the number is somewhere in their index, agents can track it down and get to work figuring out who it’s related to and from where it has been making calls.

Presumably, agents could begin with location data, too. If a bomb went off at Location X, bringing up all the numbers making calls from towers in that area might be a good starting point for investigation. Tracking someone’s movement from location data could be helpful, too.
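A location-first query of that kind amounts to filtering records by tower and time window. Here is a rough Python sketch; the tower IDs, timestamps, record layout and the `numbers_near` helper are all invented for illustration and are not drawn from any real system.

```python
from datetime import datetime, timedelta

# Hypothetical records: (number, tower_id, call_start).
calls = [
    ("555-0101", "tower-7", datetime(2013, 6, 1, 14, 5)),
    ("555-0102", "tower-7", datetime(2013, 6, 1, 9, 30)),
    ("555-0103", "tower-8", datetime(2013, 6, 1, 14, 10)),
    ("555-0104", "tower-2", datetime(2013, 6, 1, 14, 0)),
]


def numbers_near(calls, towers, event_time, window):
    """Return sorted, de-duplicated numbers that used any of `towers` within `window` of `event_time`."""
    return sorted({
        number
        for number, tower, started in calls
        if tower in towers and abs(started - event_time) <= window
    })


hits = numbers_near(
    calls,
    towers={"tower-7", "tower-8"},
    event_time=datetime(2013, 6, 1, 14, 0),
    window=timedelta(minutes=30),
)
```

The resulting list of numbers could then serve as the starting set for the kind of connection tracing described earlier.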

If this all sounds a little creepy, maybe it should. After all, the world’s biggest, baddest intelligence agency can pretty much figure out who you are, who you know and where you go. And unlike web and retail companies that collect and analyze so much data about us, the government can put you in jail.

It might be even creepier when you consider how much other data law enforcement agencies can collect about you without a warrant.

However, someone familiar with NSA policy told me, the good news is that the vast majority of people are still anonymous even in this sea of data: There’s just too much data to care until someone pops up in the bad guys’ networks or gets on the agency’s radar.

  1. “However, someone familiar with NSA policy told me, the good news is that the vast majority of people are still anonymous even in this sea of data: There’s just too much data to care until someone pops up in the bad guys’ networks or gets on the agency’s radar.”

    I’m a writer – that’s a sort of “but, hey… on the flipside, you know…” – ;)
    You don’t have to agree, but I’ll say it here… these policies and things are a pile of crap.
    Not to say we’re living “1984”. Not at all…

    I’d say it’s more like the bastard child of a grotesque affair between “1984” and ‘Brave New World’.
    Thank you, Om, for writing quality while TechCrunch, Mashable, Ars et al have become bloated techbloids.

    1. Derrick Harris Thursday, June 6, 2013

      You caught me on that one, but I also do take some solace in this knowledge. Even with targeted ads, etc., it’s just a system spotting some specific behavior or characteristic and acting. It doesn’t care who’s who. Of course, the data is there if someone — a person — were so inclined to go digging.

  2. The “system” is as weak as it is powerful. Do you want to be a badass? Easy… no internet, no cell phone.

    1. I guess you would still pop up on the agency’s radar when you are caught on any CCTV camera.

  3. So this is not real-time analysis of data? It’s a graph database which is updated every (day? week? month?)

    1. Derrick Harris Friday, June 7, 2013

      From what I’ve been told, it’s pattern and network analysis. Is there a real-time system in place? Perhaps.

  4. If they’re using graph analysis, what will it mean for me if someone from a “bad” cluster dials the wrong number and reaches me? Will I be included in the bad cluster? I imagine spam has to be a similar if not worse problem, since we’re all in regular contact with criminal organizations. And what about stolen phones, track phones, hijacked emails, SIM card swapping? Telecommunication practice doesn’t seem as stable and structured as the graph analysis depicts.

  5. Christo writes: “If they’re using graph analysis, what will it mean for me if someone from a “bad” cluster dials the wrong number and reaches me. Will I be included in the bad cluster?”

    There are a lot of ways that one can imagine things going wrong… but that doesn’t mean that this is an invalid or inherently invasive method of investigation IN PRINCIPLE. All it means is that it can be done poorly.

    Even putting “rights” aside, any statistician will tell you that a single call from someone in a “bad cluster” would never be enough to reliably associate you with that cluster, and one has to expect that the analyses done by the NSA would be smarter than that. Could they screw up? Of course. Does that mean that this type of thing is inherently bad? Not at all.

  6. By the way, fantastic article. FYI: I used your graph, crediting you and linking to your article, on my little political satire website here:

    http://liberalbias.com/post/2277/obama-nsa-spying-privacy-wiretapping-omg-freak-out/

  7. jonathanstray Friday, June 7, 2013

    This is a fascinating article, but I do not believe that big graph algorithms are what the NSA most usually does with our data, or the tool that it most commonly uses in counter-terrorism work.

    First, that Accumulo slide deck makes clear that graph algorithms at this scale are research work. Of course the NSA could be misleading us here, but it’s significant to note that the only algorithm actually discussed is nothing more than breadth-first search. To do anything sophisticated, like community detection, would be massively more complex, and probably impractical unless new algorithms can be found with linear running time.

    Second, pattern detection via data mining is not how intelligence work typically proceeds. Much more likely, an analyst starts with specific suspects and traces the connections around each one, which is known as link analysis. This article suggests that the NSA has been experimenting with big graph algorithms for years, but users have not found them useful.

    Finally, the false positive problem is overwhelming when you’re trying to detect terrorists. Predicting terrorism is not like predicting that someone might like Doritos: there might be millions of people who like nacho cheese flavor, but only a dozen actual terrorists in the world. Even a method which is 99% accurate (only a 1% false positive rate) will be swamped by false hits, far more than true hits. This problem is a staple of statistics and data mining literature, and is known as the base rate fallacy. See e.g. this report on data mining for counterterrorism, and this simplified discussion from the BBC.

    All of this leads me to believe that, while spiffy, these sorts of big graph algorithms are far from the mainstream use of the data that the NSA collects.

    1. Derrick Harris Friday, June 7, 2013

      I’m going to write a bit more about this, but I would say you’re right and wrong. The article you listed seems to stop at 2006, but a lot has changed since then, I assure you.

      As for the actual scope of any given investigation, there’s pattern recognition and then there are those link analyses, etc., when you have a specific entity in mind. Or just simple queries: Who called location A from Location B at Time C? And then you can dig in from there.

    2. You are correct, but with the Patriot Act it does not matter if there are errors; you can be “disappeared” anyway.

  8. I do wonder what kind of action could get you “on their radar”. Both now and 20 years from now. Overt criminal activity only? Or maybe just a string of seemingly-inoffensive words, analyzed by the kind of software that filters job applications, that one day flags you as a potential troublemaker, whistleblower or political opponent of whichever party is pulling the strings at that point in time? You might find yourself on some future, society-wide version of the No Fly List, with no idea how you got on it, and no procedure for taking your name off it.

  9. Check out the correlation and attribution capabilities of a Hewlett Packard product called ArcSight and its CORR engine.

  10. I liked your article; however, your last paragraph contains what I believe is the crux of the matter and the difference between 1984/Enemy of the State and protecting the US. I think it should be emphasized rather than tacked on. Also, people forget that actual human beings, all of them US citizens and most of them military, work at the NSA. It’s not some Skynet-esque unfeeling, robotic surveillance hunter-killer.

