Hadoop, the much-hyped software for processing large amounts of data on commodity hardware, has its roots in indexing tons of webpages for a search engine, not handling credit card numbers. And it wasn’t developed from the start with security in mind.
While enterprises have given Hadoop a try, concerns around security have limited adoption, as my colleague Derrick Harris explained in his history of Hadoop. Now multiple vendors of Hadoop distributions — Cloudera, Intel and others — are making or planning initiatives that could bolster security in a few powerful ways.
Patents and patches
This year many companies are showing interest in Hadoop but have expressed reservations about security, said Jim Vogt, president and CEO of Zettaset, which provides security features on top of Hadoop distributions. “When you start talking about security, you’re seriously considering adopting that technology in an enterprise, in a broader market,” Vogt said. (Security in the cloud also happens to be a hot topic, and we’ll be discussing it at our Structure conference in London on Sept. 19.)
And so vendors are answering the call. Vogt said his company is in the midst of patenting methods for managing and controlling encryption keys distributed across multiple servers in a Hadoop cluster. To mitigate any performance hit from implementing security in Hadoop, the company will also roll out a system next year for prioritizing where data in a cluster gets stored. Hot data would land on an SSD instead of a spinning disk, so jobs finish faster, Vogt said. Of course, Zettaset and other security vendors stand to gain if people feel their infrastructure is insecure, so it makes sense for them to point out existing shortcomings. Nevertheless, Zettaset is making moves right alongside developers working on open-source Apache projects.
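The hot-data prioritization Vogt describes could look roughly like this toy sketch. The threshold, tier names and access-count metric are my assumptions for illustration, not Zettaset's actual design:

```python
# Toy heat-based tiering: frequently accessed ("hot") data goes to SSD,
# everything else to spinning disk. The threshold is an assumed knob.
def pick_tier(accesses_last_day: int, hot_threshold: int = 100) -> str:
    """Return the storage tier for a block based on recent access count."""
    return "ssd" if accesses_last_day >= hot_threshold else "hdd"
```

A real system would track access statistics per block or file and migrate data between tiers in the background, but the placement decision reduces to a policy like this one.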
As Charles Zedlewski, vice president of product at Cloudera, explained to me, there are four major buckets in the world of security:
- Authentication. “If you are a user of the system, how could I prove you are who you say you are?” Zedlewski said.
- Authorization. That lets a person control “what you can see or what you can do to a specific piece of data,” Zedlewski said.
- Audit. “If I want to see if I ever had a breach, how can I go back and look at what happened?” Zedlewski said. Auditing can provide documentation to meet regulatory requirements, too.
- Encryption. “You need some other way to (secure) the data, which is an added safeguard you could have,” he said.
As it stands now, raw Apache Hadoop offers some of these features across MapReduce, HBase, Hive and other Hadoop programs. For instance, he said, “There is strong authentication today. I think the main thing that we’ve seen from customers in terms of what we need to improve there is making it more usable, making it easier to just set up and configure.”
Encryption is another story. While data is moving around the network, it can be encrypted; that ability has been available for two years, Zedlewski said. But when it comes to keeping data encrypted while it’s at rest, sitting in storage, that’s when some companies turn to off-the-shelf encryption products from security vendors such as Gazzang and Vormetric.
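To make the at-rest distinction concrete, here is a toy sketch of encrypting data before it hits storage, using only Python’s standard library. It derives a keystream from HMAC-SHA256 purely as a stand-in for a real cipher; an actual deployment would use a vetted authenticated cipher (such as AES-GCM) and proper key management, which is exactly what the vendors above sell:

```python
import hashlib
import hmac
import os

def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Run HMAC-SHA256 as a PRF in counter mode to produce a keystream."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(8, "big"),
                        hashlib.sha256).digest()
        counter += 1
    return out[:length]

def encrypt_at_rest(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt a record before writing it; the random nonce is stored with it."""
    nonce = os.urandom(16)
    ct = bytes(p ^ k for p, k in
               zip(plaintext, _keystream(key, nonce, len(plaintext))))
    return nonce + ct

def decrypt_at_rest(key: bytes, blob: bytes) -> bytes:
    """Recover the plaintext after reading the stored nonce-plus-ciphertext."""
    nonce, ct = blob[:16], blob[16:]
    return bytes(c ^ k for c, k in zip(ct, _keystream(key, nonce, len(ct))))
```

The point of the sketch: whatever lands on disk is ciphertext, so a stolen drive or an over-privileged process reading raw files sees nothing useful without the key.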
Cloudera is looking at building encryption into its distribution natively, so its customers don’t have to do business with yet another company. And because Cloudera is seen as a Hadoop market leader, that could be a much appreciated addition.
Zedlewski thinks Hadoop is least mature in the authorization department. The company wants to let customers easily dial in how granular authorization is, per record or per field in a given table. If there is a table of 10,000 credit card numbers, for example, “I can actually say, ‘Based on your privileges, you can only look at 50 records at a time, a specific range of values.’ Now that opens up (the data) to more people,” he said. In other words, fine-grained authorization means more of a company’s employees can be granted access, not just the select few who have earned the most trust.
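A minimal sketch of what record- and field-level authorization could mean in practice. The privilege model, role names and masking rule here are hypothetical, invented for illustration; they are not any Hadoop or Cloudera API:

```python
# Hypothetical table of 10,000 credit card records, as in Zedlewski's example.
RECORDS = [{"id": i, "card": f"4000-0000-0000-{i:04d}"} for i in range(10_000)]

# Assumed per-role privileges: a row cap plus a set of fields to mask.
PRIVILEGES = {
    "analyst": {"max_records": 50, "mask_fields": {"card"}},
    "auditor": {"max_records": 10_000, "mask_fields": set()},
}

def query(user: str, records: list[dict]) -> list[dict]:
    """Return only the rows, and only the fields, this user's privileges allow."""
    priv = PRIVILEGES[user]
    visible = records[: priv["max_records"]]
    return [
        {k: ("****" if k in priv["mask_fields"] else v) for k, v in row.items()}
        for row in visible
    ]
```

Under this model an analyst sees 50 rows with card numbers masked, while an auditor sees everything, which is the sense in which fine-grained controls open the data to more people safely.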
The year of the Rhino
Around three months ago, Intel — one of the latest entrants to the distribution-vendor category — came out with a wish list for Hadoop security under the name Project Rhino. The intent is to toughen up the gentle-giant elephant that is Hadoop’s mascot for enterprise use.
On authentication, a new internal system that doesn’t rely on an external source such as Kerberos made the list, along with better single sign-on capability. So did an authorization mechanism that can work across the many Hadoop programs, from batch-processing MapReduce to the HBase NoSQL database. Engineers from Intel and other organizations have been working on these projects, said Vin Sharma, an open-source software strategist for the chip maker. These features will be added to Intel’s distribution, and other distributions will be able to pick them up as patches, Sharma said.
Several engineers from Hortonworks have been active this year in an Apache incubator project called Knox. It’s constructing a big virtual fence around the servers that make up a Hadoop cluster, with a single secure gateway through which the many Hadoop services can be reached, Shaun Connolly, Hortonworks’ vice president of corporate strategy, explained to me.
Hortonworks and others are also working on creating a system for automating the execution of policies for the retention of data in a Hadoop cluster, Connolly said.
MapR intends to add key management for encryption, as well as encryption of data at rest, said Jack Norris, the company’s chief marketing officer. And like Cloudera, MapR wants to make difficult aspects of security easier to implement, specifically encryption of data in transit and authentication.
“If you look at some of our competitors, there are 100 pages of instructions on how to deploy,” Norris said. “The result is people just don’t do it. When they have security requirements, they don’t want to go through that pain. That’s really a problem.”
Feature image courtesy of Flickr user mournjargon.