Sourcefire is now monitoring 2 million endpoints as part of its Immunet anti-malware product, and Hadoop is doing the heavy lifting of analyzing the hundreds of terabytes of data those endpoints are pumping into the company’s centralized data store. This just goes to prove my point that security and big data are a match made in heaven.
As is the case with many security services, Immunet isn’t a Hadoop product as much as it is a product that uses Hadoop (a sign that the technology is maturing). Two million endpoints generate a lot of data (in number of items, at least, if not in volume), and it takes some special tools to store and process all that information. The more Sourcefire can learn about threats, the better it can protect its users. Users don’t care about Hadoop, NoSQL or any other IT buzzwords (they just care that their computers are safe), but companies like Sourcefire certainly do.
According to an email from Zulfikar Ramzan, Sourcefire’s chief scientist within its Cloud Technology Group, “Hadoop is one of the more prominent technologies we use, though we have also built some custom technologies as well.” He said the company also constantly evaluates new technologies such as NoSQL databases, but the trick is finding tools that are “well baked enough for use in production environments” and that meet specific needs. Hadoop, for example, is great for general data mining purposes, but Sourcefire has custom-built some tools for real-time workloads that the batch-oriented Hadoop can’t readily address.
As for what it does with all that data (hundreds of terabytes and growing), Ramzan said Sourcefire is primarily concerned with detecting anomalies, whitelisting “clean” files and identifying trends. The first two are pretty self-explanatory: it’s important to detect new threats based on suspicious activity, but it’s also important not to falsely label known safe files as infected. The last, however, is a situation where big data techniques can take traditional security analytics to the next level. Ramzan wrote, “[W]e mine data for threat related trends — such as which threats are the most popular, what geographic regions are seeing disproportionate threat activity, what global threat characteristics we are seeing, etc.”
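To make the trend-mining idea concrete, here is a minimal sketch in Python of the kind of batch aggregation that Hadoop-style jobs are suited to. It is not Sourcefire's code: the event records, the threat and region names, and the functions `map_phase`, `reduce_phase`, `threat_popularity` and `regional_skew` are all hypothetical, and everything runs in memory rather than on a cluster.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Hypothetical detection events as (threat_name, region) pairs.
# This is made-up sample data, not real telemetry.
EVENTS = [
    ("Trojan.A", "US"), ("Trojan.A", "US"), ("Worm.B", "US"),
    ("Trojan.A", "BR"), ("Worm.B", "BR"), ("Worm.B", "BR"), ("Worm.B", "BR"),
    ("Adware.C", "DE"),
]

def map_phase(events):
    """Emit one ((threat, region), 1) pair per event, as a mapper would."""
    for threat, region in events:
        yield (threat, region), 1

def reduce_phase(pairs):
    """Sort by key (standing in for the shuffle), then sum counts per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {key: sum(n for _, n in group)
            for key, group in groupby(ordered, key=itemgetter(0))}

def threat_popularity(counts):
    """Total detections per threat across all regions."""
    totals = Counter()
    for (threat, _region), n in counts.items():
        totals[threat] += n
    return totals

def regional_skew(counts, threat):
    """Ratio of a threat's share of detections in each region to its
    global share. Values above 1 mean the region sees more of the
    threat than the global average would predict."""
    region_totals = Counter()
    threat_in_region = Counter()
    for (t, region), n in counts.items():
        region_totals[region] += n
        if t == threat:
            threat_in_region[region] += n
    threat_total = sum(threat_in_region.values())
    if threat_total == 0:
        return {}  # threat never observed, so no skew to report
    global_share = threat_total / sum(region_totals.values())
    return {region: (threat_in_region[region] / total) / global_share
            for region, total in region_totals.items()}

COUNTS = reduce_phase(map_phase(EVENTS))
print(threat_popularity(COUNTS).most_common(1))
print(regional_skew(COUNTS, "Worm.B"))
```

In a real deployment the map and reduce steps would run in parallel across many machines and the events would come from endpoint telemetry, but the shape of the computation (count by key, then compare local shares against the global share) is the same.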
Presumably, Sourcefire also utilizes its repository of threat data and analytics tools to power its line of intrusion-detection-and-prevention products designed for business users rather than consumers. ipTrust is already doing just that, using Hadoop, Cassandra and other tools in a system that assigns reputation scores to IP addresses attempting to access a company’s network.
Sometimes, however, Hadoop isn’t the answer even in decidedly big data environments. CloudFlare CEO Matthew Prince recently explained that his company, which provides network security and performance for web sites, tried Hadoop early but found it didn’t scale or perform up to the company’s needs. Those needs involve a constant stream of traffic data, and rather than spend money it didn’t have as an early-stage startup trying to make Hadoop fit, CloudFlare built its own tool. The result was “SortaSQL,” a hybrid key-value store that combines the PostgreSQL database with the Kyoto Cabinet key-value store.
But Hadoop or not, the song remains the same: companies building cloud-based security services have data needs that legacy database software can’t handle. They also want to do new things with the data to glean even more insights from it. Given the incredible number of Internet-connected devices on the planet (mobile phones, for example, now outnumber people in the United States), it seems there’s a big business in giving security companies the big data tools they need to do their jobs without making them reinvent the wheel at the data-platform layer.
Image courtesy of Geograph user Ross.