Summary:

Big data vendor Cloudera is no Google when it comes to data center footprint, but the cost and complexity of its infrastructure are growing with each passing year. Cloudera VP of engineering Peter Cooper-Ellis explains how better data centers and cloud computing help ease the burden.

The demands of running a big data software business are nowhere near those of running a huge web platform like Facebook or Google, but they’re not insignificant either. For Hadoop vendor Cloudera, the growing popularity of its technology means larger customers, more hardware and software partners, and, as a result, more servers and more complexity in its data center. It’s enough to make someone crave the cloud.

Right now, Cloudera Vice President of Engineering Peter Cooper-Ellis explained, the company uses a three-pronged strategy when it comes to the workloads that run its business. For product development, testing and debugging, it prefers to use the Amazon Web Services cloud. The company’s “bread-and-butter engineering work,” however, runs on its own gear within its own data centers.

That’s where things get complicated. Cloudera’s data center has more than 1,000 servers, with about half as part of a private cloud environment (running the CloudStack and OpenStack platforms) and the other half as bare-metal machines. Certain workloads, such as the core aspects of building and packaging its Hadoop distribution, run on the private cloud. Benchmarking, proofs of concept for customers, and other performance- or consistency-dependent workloads run on bare metal.

Managing its data center resources already costs Cloudera millions of dollars per year, and every new partnership (like those with VMware, EMC and Dell, for example) or big customer means more configurations it has to support and more gear it has to buy.

Peter Cooper-Ellis

“Every year, we add more complexity,” Cooper-Ellis said. “My preference would be to run everything in the cloud if we could, but we just can’t do it.”

Partially, this is because of the types of workloads inherent to the business of building and selling enterprise software – long-running (nightly, or even weeks-long) product tests that aren’t well suited to the ephemeral nature of compute instances in the AWS cloud.

The other big concern is cost. “[W]here we have a long-term, long-running workload,” he said, “Amazon gets really expensive for us.”

However, engineering and customer support aren’t the sole drivers of Cloudera’s data center demands. The company also uses its own products, Cooper-Ellis said, managing terabytes of data that it collects via usage logs from customer installations, as well as from forums, mailing lists and internal records. It analyzes all that data in order to predict problems and help support staff troubleshoot customers’ issues.

So instead of longing too much for something it can’t yet have, Cloudera is doing what it can to keep the cost and complexity of its data center in check. For example, it recently decided to improve energy efficiency by switching providers and is now running in a Santa Clara, California, data center operated by Vantage Data Centers. Vantage President and CEO Sureel Choksi told me tech companies are among its largest group of customers, and also tend to be the ones most concerned with efficiency and PUE ratings.

The Vantage campus in Santa Clara, where Cloudera’s data center is now located. Source: Vantage

Cloudera is also trying to move as many customers as it can onto “a short list of reference architectures and use cases,” Cooper-Ellis said. The more standardized the configurations its customers are running, the fewer configurations Cloudera has to build its Hadoop distribution to support and deploy in its own data center. As part of this effort, Cloudera wants to help more customers move their Hadoop environments into the cloud.

Companies are starting to put meaningful Hadoop workloads in the cloud, Cooper-Ellis said, so getting Cloudera products optimized for cloud computing environments is a top-level engineering initiative for 2014. Cloudera launched with much more of an emphasis on cloud-based Hadoop deployments, but users weren’t really ready for that back in 2009. The plan now is to partner with third-party cloud providers rather than host its own Hadoop service, although it might do the latter for limited-time trials, he noted. (Pivotal presently hosts a 1,000-node Hadoop testbed cluster in the SuperNAP data center in Las Vegas.)

These types of deals help standardize the configurations users run, and ideally can turn some users running free, unsupported Hadoop software in the cloud into paying customers. Cloudera’s competitors, including Hortonworks and MapR, already have strong cloud partnerships in place – Hortonworks with Microsoft and Rackspace, and MapR with AWS and Google. Cloudera announced partnerships with a handful of cloud providers in October, and in December certified its software to run on Amazon’s cloud.

As Cooper-Ellis said, “It’s now time to put the cloud in Cloudera.”