The term “big data” conjures up images of huge server clusters running within data centers at Facebook, Yahoo and Google. It should also raise images of engineers figuring out how to run these systems at scale and monitor them. Some might argue that, as big data catches on with mainstream customers, hardware vendors are trying to take the guesswork out of that process. But vendors pushing appliances tell another story.
On the surface, appliances might appear a tough sell because they mean paying for value-added services that deliver higher profit margins to hardware vendors. One of the big promises of new frameworks like Hadoop, after all, is that they’re relatively risk-free investments because they’re designed to run in parallel across mere commodity boxes.
But in the vendors’ version of the story, appliances are almost necessary because big data analytics is a critical area of IT investment. If companies are going to do it right — and that’s assuming they don’t have expertise in parallel processing or scale-out systems — it’s probably best to just pay experts to deliver a system built from the ground up to do that particular job, and to handle any subsequent issues via support services.
Already, Oracle, EMC, Dell, HP, IBM, Teradata and Kognitio are among the vendors pushing big data appliances designed for Hadoop, massively parallel databases, online transaction processing and/or other analytic workloads. Does this momentum suggest that vendors are onto something about how big data software must be delivered, or just that they smell an opportunity to turn a profit?
Vendors are probably right (this time)
There’s plenty of reason to think that vendor interests actually align with most customers’ interests. After all, however you slice it, cluster and distributed system management is complex. Especially if the goal is putting a system into production, most organizations will want to ensure it not only stays up and running, but actually delivers the performance that most people expect will accompany their big data efforts.
Organizations have two choices: research best practices, then buy best-of-breed servers, storage, interconnects and management software and hope everything works; or pay someone to drop in a preconfigured cluster and handle the deployment. Specialized appliances will cost more on paper, but harder-to-quantify costs, such as the man-hours lost to planning, deploying and troubleshooting, also need to be taken into account.
Just look at the statistics. In the case of Hadoop, the technology arguably driving the big data revolution, 44 percent of respondents to a 2010 survey ranked a steep learning curve among the biggest obstacles to adoption. According to a recent survey by Ventana Research, about one-third of respondents are dissatisfied with operational issues such as performance and scalability.
There’s also an overall lack of big data skills. Surveys of Hadoop users regularly find a lack of in-house knowledge, and some studies have actually quantified a shortage in other areas. The Bureau of Labor Statistics predicts an ever-growing demand for workers with deep analytical skills, and the McKinsey Global Institute predicts a shortage of 1.5 million business intelligence analysts alone by 2018. That’s not to mention likely shortages in areas such as machine learning and complex statistics.
Vendors selling in the big data space, as well as web pioneers like Google and Facebook, are hiring the best-possible engineers to make their products better-performing and easier to manage, but that leaves precious few skilled workers for everyone else. Assuming user companies don’t have unlimited budgets for big data, one train of thought would be to spend on teaching employees how to ask the right questions of their data rather than how to build, deploy and manage the systems.
Another factor in favor of appliances is the integration of big data environments and software such as Hadoop, data warehouses and/or BI tools as companies realize that data silos are a waste of good information. Traditionally, giving Hadoop and an analytic database access to each other’s data requires a specially designed connector and potentially high-latency travel across the cable connecting the two environments. This is beginning to change, however.
Already, EMC Greenplum is touting a Data Computing Appliance that puts the Greenplum Database and Hadoop, as well as partner analytics software, into a single rack sharing the same backbone. Teradata is about to start selling its Aster MapReduce appliance, which is based on the Aster SQL-MapReduce software but has built-in connectors to both Teradata Database and Hadoop environments. Oracle recently announced two new appliances: one for business intelligence (which might be unnecessary, actually) and another that includes an open source Hadoop distribution, a NoSQL database and an implementation of the R statistical-analysis software.
Buying an appliance doesn’t negate the need for server administration, but it does take a lot of the guesswork out of it. With something as inherently complex as big data, prepackaged infrastructure and vendor support make it easier to get started doing analytics once the decision has been made to go down that road. Apples to apples, commodity hardware and free software will be cheaper than an appliance and commercial software every time, but time is money too.
Converged infrastructure shows the way
My arguments might seem like undue praise for big data appliances, but the recent rise of converged infrastructure in the server and storage worlds could be telling about what mainstream buyers are looking for. Questioned by pundits and analysts when Cisco first started selling its Unified Computing System in 2009, these systems that combine servers, storage, networking gear and management software into a single rack have taken the IT world by storm.
Servers have been Cisco’s biggest growth sector over the past few quarters, and (despite some alleged dissension in the ranks) its VCE joint venture with EMC has been selling lots of multi-million-dollar Vblock systems. HP has made its BladeSystem Matrix converged infrastructure system a centerpiece of its private cloud effort. Even Dell, which was dismissive of such products, is selling one under the vStart moniker.
Colin Fletcher, senior solutions manager for vStart, told me that when Dell announced vStart, some customers just wanted a product that was “ready to run” right out of the box with no complex assembly required. As it turns out, configuring systems that optimally address the performance requirements of virtualized applications, which are the targets of most converged infrastructure, isn’t always easy, either, even a decade after VMware introduced x86 server virtualization.
That mindset would seem to bode well for big data appliances, which certainly serve a narrower purpose than do many converged infrastructure systems. If companies are willing to spend money on turnkey systems for mission-critical applications — an area where they’d arguably want best-of-breed components and to avoid lock-in as much as possible — they’re probably more likely to do the same for less-important big data applications. It’s arguable that the opposite is true, that companies will be willing to take a chance on big data systems because they’re not mission-critical. But that might represent too great a potential outlay of human resources given the relative importance of the systems.
What about big data in the cloud?
As anyone who reads my writing on the GigaOM network knows, I’m also an ardent supporter of running big data workloads in the cloud. The idea of buying a specialized appliance to do a job that one could conceivably do without owning any hardware might appear contrary, but I actually think the two delivery models are parallels, although not on the same plane.
At the highest level, both appliances and the cloud are about removing some of the complexity associated with big data. Cloud computing certainly takes it to another level by eliminating the need to configure or operate anything, but as we’ve seen with computing in general, not everyone can justify a move to the cloud. For security reasons, performance reasons or to save the hassle of migrating data across the Internet, some companies will never run big data workloads in the cloud. At this point, appliances are a good bet for achieving the cloud-like benefit of putting as many resources as possible into analytics rather than into system configuration.
And looking forward, we’ll likely start seeing cloud providers actually embrace the appliance model to some degree, even for big data applications. A big data appliance in the cloud might be a section of infrastructure designed specifically for big data applications and priced accordingly by providers. Amazon Web Services already enables this to some degree by letting customers run Elastic MapReduce jobs on its Cluster Compute Instances that feature high-end processors and a 10 GbE backbone.
In the end, to appliance or not to appliance is all about balancing cost against performance and, in the physical world, operational effort. For specialized workloads such as big data analytics that might require specialized infrastructure to run optimally, appliances are at least worthy of serious consideration. There’ll still be plenty of work to do actually writing applications and creating algorithms.
Oh, and Teradata, Netezza and Oracle have been doing just fine selling database appliances over the past few years. Now that big data is on the tip of everyone’s tongue, it’s difficult to imagine the thirst for appliances has somehow dried up.