Across a wide range of industries from health care and financial services to manufacturing and retail, companies are realizing the value of analyzing data with Hadoop. With access to a Hadoop cluster, organizations are able to collect, analyze, and act on data at a scale and price point that earlier data-analysis solutions typically cannot match.
While some have the skill, the will, and the need to build, operate, and maintain large Hadoop clusters of their own, a growing number of Hadoop’s prospective users are choosing not to make sustained investments in developing an in-house capability. An almost bewildering range of hosted solutions is now available to them, all described in some quarters as Hadoop as a Service (HaaS). These range from relatively simple cloud-based Hadoop offerings by Infrastructure-as-a-Service (IaaS) cloud providers including Amazon, Microsoft, and Rackspace through to highly customized solutions managed on an ongoing basis by service providers like CSC and CenturyLink. Startups such as Altiscale are completely focused on running Hadoop for their customers. As they do not need to worry about the impact on other applications, they are able to optimize hardware, software, and processes in order to get the best performance from Hadoop.
In this report we explore a number of the ways in which Hadoop can be deployed, and we discuss the choices to be made in selecting the best approach for meeting different sets of requirements.
Key findings include:
- Hadoop is designed to perform at scale, and large Hadoop clusters behave differently from the small groups of machines developers typically use to learn.
- There are a range of models for running a Hadoop cluster, from building in-house talent and infrastructure to adopting one of several Hadoop-as-a-Service solutions.
- Competing HaaS products bring different costs and benefits, making it important to understand your requirements and their strengths and weaknesses. Some offer an environment in which a customer can run — and manage — Hadoop while others take responsibility for ensuring that Hadoop is available, maintained, patched, scaled, and actively monitored.
2 From pilot to production
Interest in Hadoop continues to grow, with investors and analysts making bold claims about the likely market opportunity over the next few years. Hortonworks CEO Rob Bearden claims his company’s Hadoop-based solutions will generate $1 billion in revenue by 2017 or 2018. Competitor Cloudera recently secured a $900 million funding round from investors willing to make a big bet on the Hadoop market. Recent analysis suggests that the global market for Hadoop software and services could exceed $20 billion by 2018.
Despite the investment and the revenue predictions, the core of any Hadoop solution remains freely available open-source software. Many organizations that are beginning to explore the opportunity offered by Hadoop aren’t spending any new money at all. They’re identifying surplus machines, downloading a Hadoop distribution for free, installing it, and having a play. There aren’t comprehensive publicly available surveys to draw upon, but anecdotal evidence would certainly suggest that these small Hadoop experiments are widespread.
Finding a spare server or two may be relatively straightforward, as is making the time to install Hadoop with its default configuration and then locate and load some experimental data. In some cases, these activities may well be taking place without any explicit knowledge or approval from elsewhere in the organization. Even with more formal recognition, these small-scale experiments are unlikely to have significant resource implications. But other than giving a developer or two some superficial familiarity with Hadoop and its components, these initial experiments are also unlikely to deliver much value or actionable insight to the organization.
Hadoop’s capabilities are revealed as data volumes increase. Managing data and clusters at scale presents different challenges to those associated with running test data through a couple of machines. Again and again, organizational deployments of Hadoop fail as they simplistically try to replicate processes and procedures tested on one or two machines across more-complex clusters.
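Part of why behavior changes at scale lies in Hadoop's processing model itself. As a purely illustrative sketch (plain single-machine Python, not Hadoop code), the MapReduce pattern behind classic Hadoop jobs looks like this; the shuffle step in the middle is the phase that, on a real cluster, becomes heavy inter-node network traffic:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, emitting (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key; on a real cluster this is the network-heavy step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# The classic word-count example, run entirely in one process
lines = ["big data with hadoop", "hadoop at scale", "big clusters"]
mapper = lambda line: [(word, 1) for word in line.split()]
reducer = lambda word, counts: sum(counts)

counts = reduce_phase(shuffle(map_phase(lines, mapper)), reducer)
print(counts["hadoop"])  # 2
print(counts["big"])     # 2
```

Running all three phases in one process is trivial; distributing them across hundreds of nodes, with the shuffle moving data between machines, is where the operational complexity described in this section comes from.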
Scaling Hadoop, either as part of a meaningful pilot project or to deliver production enterprise workloads, is a challenging undertaking, and managing the complex interactions among parallel nodes remains a complex and often largely manual process. Each node must be actively monitored throughout its commitment to a particular workload, and Hadoop’s often arcane errors and failure modes can rarely be resolved automatically. Network design also impacts the effective performance of a cluster, with most corporate networks poorly configured for the rapid and high-volume inter-node data transfers typical of an effective Hadoop cluster. For some, it makes sense to build organizational competence and scalable infrastructure in house. For others, it is more appropriate to trust third parties with running the infrastructure, developing the competencies, or even delivering services that combine the two. For a primer on Hadoop, please see Gigaom Research’s report “Hadoop in the enterprise: how to start small and grow to success.”
3 Scaling to DIY
Building a big data capability inside an organization and running it at scale is possible. Technology companies such as Facebook, Yahoo, and eBay do it routinely and cite the benefits of being able to increase efficiency and reduce cost by controlling everything from hardware and networking to software and applications. Universities, pharmaceutical firms, banks, and retailers also recognize the value of adding big data capabilities to their existing IT infrastructure, making the investment in technology and skills to ensure that they can analyze and respond to data inside their own data centers and on their own terms.
Organizations should bear in mind a number of considerations when deciding whether developing an in-house big data capability is the most sensible strategy.
Is an in-house implementation critical?
For some organizations, the ability to process, interpret, and act on data at scale is critical and core to their current and future business. There may be a strategic value in ensuring that the necessary skills and technologies are developed and maintained in house rather than entrusted to third parties and potential competitors. That perceived strategic value would, in some cases, outweigh the cost and complexity of acquiring and retaining skills and infrastructure. Alternatively, an organization may recognize that the effective manipulation of data is critical to its future business success while also recognizing that building that capability internally would only distract from other areas where it can differentiate and add value. It might therefore choose to identify and engage closely with a trusted partner with whom it could work for an extended period.
Deciding which approach to take in addressing data as a core business requirement will largely depend on wider corporate attitudes to partnership, the state of internal skills, and the perceived value of retaining internal IT capabilities.
Can you get the skills?
Managing Hadoop at scale requires specialist skills in software, network management, and cluster configuration. These skills are currently in demand, making it expensive to hire and difficult to retain staff. Once a cluster is operational, a further set of skills is required to work with the data, undertake analysis, and meaningfully convey insights and actions to the rest of the business. These skills, too, are in high demand. Hadoop clusters are complex, and they require constant active management and maintenance. The open-source projects developing much of the code release frequent updates that need to be applied without adversely affecting the performance of the cluster or the business applications that depend on it.
It is certainly possible to hire staff with some or all of these skills. It is also possible to train existing staff so that they gain a basic understanding that can then be developed over time. In many cases, finding consultants or Hadoop-as-a-Service providers that already have the necessary skills available for hire may be cheaper or more straightforward.
Can you afford the infrastructure?
Hadoop clusters comprising hundreds or thousands of nodes require a significant capital investment, which may be difficult to justify without a clear understanding of the use case now and in the future. Some big data activities, such as routine and repetitive analysis of the same server logs, will deliver value to the business while having consistent and predictable hardware requirements. More-exploratory data science is likely to have far less predictable infrastructure needs, meaning that the Hadoop cluster runs the risk of being largely idle a lot of the time and seriously under-specified at other times. It may be more effective to outsource some or all of an organization’s Hadoop requirement to a partner with the capacity and capability to elastically scale both up and down.
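That utilization risk can be made concrete with some simple arithmetic. The figures below are entirely hypothetical and only illustrate the shape of the trade-off: an owned cluster is sized and paid for at peak around the clock, while an elastic service bills for capacity actually used.

```python
# Hypothetical figures for illustration only; real pricing varies widely.
OWNED_COST_PER_NODE_HOUR = 0.30    # amortized hardware, power, space, and staff
ELASTIC_COST_PER_NODE_HOUR = 0.90  # pay-as-you-go premium

def annual_cost_owned(peak_nodes, hours_per_year=24 * 365):
    # An owned cluster must be sized for peak demand and is paid for
    # around the clock, busy or idle.
    return peak_nodes * hours_per_year * OWNED_COST_PER_NODE_HOUR

def annual_cost_elastic(node_hours_used):
    # An elastic service bills only for the node-hours actually consumed.
    return node_hours_used * ELASTIC_COST_PER_NODE_HOUR

# A cluster sized for a 100-node peak but busy only 20% of the time:
peak_nodes = 100
busy_node_hours = 0.20 * 24 * 365 * peak_nodes

print(annual_cost_owned(peak_nodes))         # roughly 262,800
print(annual_cost_elastic(busy_node_hours))  # roughly 157,700
```

With these made-up prices, owned capacity only becomes cheaper once sustained utilization exceeds the price ratio (here one third); below that, the idle hours dominate the bill.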
Are you really in this for the long haul?
Investments of time, money, and attention to build an internal big data capability may be justifiable. But in the early stages of big data adoption it can be difficult to realistically assess both the cost of a big data deployment and the value it will deliver to the business.
- Hadoop requires constant proactive monitoring and maintenance, as well as operations staff with a particular set of skills. The Hadoop project is in a state of flux as it evolves, improves, and gains newer capabilities such as YARN. Keeping pace with this rate of change will require a significant and ongoing commitment.
- Is the money spent on training around one Hadoop distribution justified, or might you later switch to a slightly different distribution from another supplier and have to at least partially retrain?
- Is the investment in a Hadoop cluster to meet your current challenges (with an emphasis on processors and memory, perhaps) fit for purpose as requirements change and another factor (such as the speed with which you can ingest streaming data from factory sensors) becomes more important? As workloads change from one job to the next, it will be necessary to reconfigure the cluster accordingly. Updates to Hadoop, plus completely new features like Spark, may also have an impact on the design and composition of the cluster.
- Do you really know how big the data sets are going to be or how computationally intensive your analyses will become? One customer of a Hadoop-as-a-Service offering reported that it regularly ingests around 100 million items per month but that this jumped unexpectedly to 400 million earlier this year. An elastic service coped with the spike in traffic in ways that more tightly constrained local infrastructure may not have managed.
Does the business have a strategic, shared, and robust vision for the role of big data within the organization and a willingness to invest for years in the in-house skills and infrastructure this will require? Or are different parts of the business exploring, learning, and gaining insight into their different requirements? Short and focused engagements with established third parties may prove a more effective way to gain insight and experience, at least until some longer-term use cases (and their associated requirements) become better understood.
4 Hadoop as a Service
Organizations that wish to work with partners in developing and running their Hadoop capability have a number of options to explore. Each class of service will have particular strengths as well as use cases for which it is less well-suited.
Let others manage the hardware
Any of the main cloud IaaS providers can be used to host a Hadoop cluster by simply renting a number of virtual machines and manually installing Hadoop on these. This approach may prove cost-effective in two main scenarios:
- Proficient users of their chosen IaaS provider who want to create a small cluster for a short time to aid their Hadoop learning
- Proficient users of their chosen IaaS provider who are also experienced with Hadoop and have a specific short-term data-processing requirement and no available capacity on their existing clusters
Hadoop does not always run well on the cheaper instance types offered by most public cloud providers. Noisy neighbors (other tenants' virtual machines placing load on the same physical infrastructure) and unpredictable network performance can both cause problems in a Hadoop cluster, leading to the perception that public clouds are not suitable for running Hadoop.
Larger and more-powerful virtual machine instances and increased use of solid state storage options only partially address these challenges. Amazon Web Services (AWS), for example, offers the I2 and HS1 virtual machine instance types that deliver the performance required for effective use of Hadoop. Google is also attracting attention for the speed at which its Compute Engine can run Hadoop. The design decisions that led to the separation of compute (like Amazon’s EC2) from storage (Amazon’s S3) in most public clouds create a performance bottleneck for those wishing to transfer large volumes of data. This issue is exacerbated because IaaS data center networks are not normally configured to favor high levels of data transfer between machines inside the data center. These issues are particularly significant for users of Hadoop, where high levels of data transfer are a core aspect of the way a cluster operates and because the native Hadoop Distributed File System (HDFS) was designed on the assumption that compute and storage will normally be located as close to one another as possible. The need to transfer and possibly convert data from a storage system such as S3 before a Hadoop job can even begin processing increases cost and delay in ways that may be significant.
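The cost of that staging step is easy to underestimate. A rough back-of-envelope estimate (the sustained transfer rate here is an assumption; real throughput depends on instance type, copy parallelism, and network conditions) shows how quickly it grows with data volume:

```python
def staging_hours(data_tb, effective_gbit_per_s=1.0):
    """Hours needed to copy data_tb terabytes into the cluster at an
    assumed sustained rate of effective_gbit_per_s gigabits per second."""
    bits = data_tb * 8 * 1000**4                   # terabytes -> bits (decimal units)
    seconds = bits / (effective_gbit_per_s * 1000**3)
    return seconds / 3600

# Copying 50 TB from object storage at an assumed sustained 1 Gbit/s:
print(round(staging_hours(50), 1))  # roughly 111 hours, i.e., more than 4.5 days
```

At that assumed rate, the first map task cannot even start for days; parallel copies shorten the wait, but the network and storage-service limits discussed above still apply.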
IaaS providers like AWS, Google, and Rackspace offer a viable way to run Hadoop jobs, particularly for Hadoop-savvy customers who are already invested in a particular IaaS solution. The costs, trade-offs, and performance profiles of typical IaaS offerings do not make them an obvious choice for new customers who would then need to learn both Hadoop’s and the IaaS provider’s systems. Further, their cost models may not make them the cheapest option for sustained and intensive utilization, and the customer remains responsible for installing, configuring, and running Hadoop. The cost and time involved in importing or exporting large data volumes may be a cause for concern in some cases.
Service level agreements (SLAs) for IaaS will only cover the availability and performance of the underlying hardware, and they will have no bearing on the functioning of the Hadoop cluster running on that hardware.
AWS also offers richer services like Elastic MapReduce (EMR) that run on top of AWS infrastructure, and competitors such as Microsoft and Rackspace have their own equivalents. Third-party HaaS providers like Qubole and Treasure Data also rely on AWS, building their managed offerings on top of AWS storage (S3) and compute (EC2) products. Although Elastic MapReduce does offer the ability to scale up and down (the elasticity its name suggests), it is worth noting that neither it nor its competitors currently offer the easy and rapid scaling that customers may have come to expect from simpler IaaS services like EC2. The complexity of a Hadoop cluster, its interconnections, and its management mean that elastic scaling may take some time to configure, with the potential to impact the performance of jobs already running on the cluster. We’ll explore these options in more detail further on.
Let others install Hadoop
Most major IaaS providers now offer their own take on HaaS, pre-installing Hadoop on their own infrastructure and making it available for consumption as a service, usually managed and billed through the same APIs and web interfaces as their other products. The core components of Hadoop, plus any extensions or additional capabilities, have all been tested on the cloud provider’s hardware and are demonstrably able to work together. Configuration options will normally have been selected in order to achieve the best possible performance from the available combination of compute, storage, networking, and software. Amazon’s Elastic MapReduce (EMR) is the best-known of these, but Microsoft has HDInsight and Rackspace offers its Big Data Platform in partnership with Hortonworks.
These solutions remove the need for detailed knowledge of installing, configuring, and updating Hadoop, and a basic cluster will typically be available for use in minutes. The customer will still need the skills to operate the cluster, and some manual intervention will be required to scale available compute resources up and down, to reconfigure the cluster for different types of workload, and to generally ensure that the best combination of tools is being made available for each different workload. It is also important to note that these Hadoop solutions typically run on the IaaS providers’ existing hardware and networking infrastructure. While this will no doubt be optimized as far as possible to cope with Hadoop-style workloads, the underlying limitations in compute power, network topology, and separation between storage and compute discussed above will still need to be considered.
SLAs for these products normally cover the availability and performance of the underlying hardware, not the functioning of the Hadoop cluster running on that hardware.
Although customers will normally have some scope for making configuration decisions, they will probably not be able to see or modify every aspect of the cluster. This may limit their ability to tailor the cluster to their specific requirements. It also prevents them from being able to replicate the configuration elsewhere (on their own hardware or with another provider) if they require a redundant backup or if they choose to migrate.
Sellpoints, the Emeryville, California–based provider of online sales network technology, is using Hadoop to power analysis of log data as part of a new product offering. Already a heavy user of AWS products like S3, EC2, and RDS but without any existing Hadoop capability, the company explored both an AWS solution (comprising EMR and AWS’ data warehouse offering, Redshift) and a fully supported pure Hadoop alternative from Altiscale. In proofs of concept built to process around 10 million lines of log data, Redshift’s raw query performance unsurprisingly gave the AWS combination an advantage over Altiscale’s Hadoop system on individual queries. But with most of the processing time spent on machine-learning tasks inside Hadoop, Redshift’s speed advantage had little impact on the time required to complete most real-world workloads considered by the pilot. For these particular workloads, Altiscale’s elasticity and dedicated support proved more compelling than extending Sellpoints’ existing relationship with AWS, and the system is now analyzing several billion lines of log data collected from over 100 million unique visitors to customer properties.
Let others manage Hadoop
A third option — and the one Sellpoints selected — is to trust a third party to manage hardware, software, and the complex interactions within and between the two. At one extreme, managed service providers (MSPs) like IBM and CSC engage in professional-services engagements to design, procure, run, and maintain a dedicated Hadoop capability in a customer data center, in the cloud, or in a co-location or managed hosting facility, in some engagements doing enough to remove any real requirement for Hadoop operational skills on the customer’s side. At the other extreme, startups like Altiscale take responsibility for ensuring that Hadoop performs to its full potential but mostly leave the use of Hadoop to perform data analysis and data science to their customers. These customers are typically already reasonably skilled with Hadoop and recognize the value of not having to invest effort in keeping it running.
SLAs for all of these solutions are normally comprehensive, covering everything associated with ensuring that the Hadoop deployment delivers as specified.
From nodes to jobs
Many of today’s conversations about deploying Hadoop tend to begin — and, often, end — with counting Hadoop nodes. Providers of fully managed Hadoop solutions argue that this emphasis on basic infrastructure is unhelpful and that customers should be more concerned with the size and complexity of the business problem they face. Just as an IaaS provider like AWS elastically scales basic infrastructure when demand increases, a managed Hadoop solution should elastically scale the resources and nodes available as the job requires. Customers of managed Hadoop solutions should not need to concern themselves with operating the Hadoop cluster or its infrastructure at all, with the required storage and compute resources simply provided on demand; monitored, maintained, and available throughout the processing of a workload; and then returned to the pool once the task finishes. The managed nature of these solutions means a quite different experience for customers, removing any requirement for them to interact with — or even know much about — the details of Hadoop or the infrastructure on which it is running. Altiscale goes a step further, removing the need to even interact with Hadoop’s management interfaces to start and stop jobs. Instead, customers use the data science tools they are already familiar with (e.g., Hive, Pig, R) to submit jobs to the cluster via YARN API calls.
This difference in emphasis is also reflected in the way that companies like Altiscale can bill for usage. Rather than charging for a set number of virtual machines, customers pay for the volume of data under management. Behind the scenes, the infrastructure scales up and down as customers run jobs with different requirements.
Keeping the lights on
Hadoop clusters require active and ongoing management to ensure that both individual nodes and the cluster as a whole are performing as expected. Frequent updates and patches to the software add an additional layer of complexity, as changes must be applied to large numbers of machines at once. Both the installers of Hadoop (Amazon Elastic MapReduce, Microsoft’s HDInsight) and the managers of Hadoop (Altiscale, CSC, Qubole, Treasure Data) will take responsibility for this aspect of cluster management. Some managers go further, ensuring that customer workloads are successfully migrated to and optimized for new versions of Hadoop.
Complex customer workloads bring additional challenges that the likes of EMR and HDInsight will typically avoid addressing, while Altiscale is one company that actively monitors running jobs, detecting potential failures, advising on optimization techniques, and more. For some customers this additional level of monitoring and feedback brings real value, especially as compute-intensive jobs may run for long periods of time. Even those customers with existing Hadoop experience recognize the value of proactive monitoring for these jobs and assistance in diagnosing and resolving failures and other issues as they occur.
5 At your service
Hadoop holds great promise in tackling a wide range of business challenges, and the market is responding with large valuations and an explosion in adoption. The technology is and will remain complex for many adopters, despite the efforts of contributors to the Apache Hadoop project codebase and the increasingly polished distributions emerging from Hadoop companies like Cloudera, MapR, and Hortonworks. There will always be those who are happy and comfortable to install, manage, and optimize the core software and hardware across a Hadoop cluster, and this process will undoubtedly continue to get easier. There are also those for whom this is not a sensible use of time and resources. Some lack internal Hadoop skills or the desire to acquire and refresh them, and managed service contracts with MSPs like IBM or CSC may be the most sensible way to proceed. Others have gained their Hadoop expertise the hard way, by building and running clusters of their own, and they recognize that continuing to do so is not the most effective way to leverage their particular set of skills, resources, and business requirements. For them, companies like Altiscale, Qubole, and Treasure Data follow different paths to provide what they need: optimized, elastic, monitored, and maintained Hadoop clusters that are essentially available when needed.
The Hadoop project itself continues to grow and to add new capabilities and ways of working. The addition of YARN in Hadoop 2.0 and the growing interest in Spark, for example, both represent significant advances in Hadoop’s capabilities that also carry significant implications for Hadoop users wishing to take full advantage of advances emerging from the community. Supported Hadoop services typically track and respond to additions to the project code, although not always as quickly as some customers might wish. At the time of writing, for example, Treasure Data and Qubole are still operating on the 1.x version of Hadoop and have not yet gained the additional capabilities offered by innovations like YARN.
One customer with an established in-house Hadoop capability turned to Altiscale for help with data from a new product. Data volumes were expected to exceed the capacity available internally, and they were also expected to grow significantly and unpredictably. Altiscale’s elastic Hadoop cluster gave the company the ability to scale with demand, and Altiscale engineers worked with it to optimize processes and workflows that proved less capable at volume than they had been in the early phases of the project. Although data volumes are now predictable and processes now optimized, the company plans to continue running this particular cluster with Altiscale, citing the real business value of an ongoing and proactive relationship with Altiscale’s cluster-monitoring team.
6 Making the choice
Hadoop offers a powerful set of capabilities, enabling adopters to analyze data at a scale and price point that earlier technologies simply cannot match.
A variety of deployment models mean that prospective adopters of Hadoop have a range of options available to them that are capable of meeting most combinations of budget, skill, and problem set. Whether self-installed locally or on an IaaS provider’s cloud, offered as a higher-level service by an IaaS provider, or managed with varying degrees of sophistication by traditional managed service providers or pure-play supporters of Hadoop as a Service, there is business value and strategic insight to be derived from Hadoop-based data analysis.
Each model of implementation offers a viable means of running routine and relatively straightforward Hadoop jobs, such as those comprising a small number of analyses over a data set of reasonably consistent size. As data sets become more complex, as their size fluctuates unpredictably, and as the number, scope, and form of any analyses grows and changes, Hadoop’s users tend to face a decision: invest heavily and consistently in building a robust in-house capability or partner with one of the higher-level HaaS players. Mid-tier solutions such as AWS’ Elastic MapReduce, although often compelling and cost-effective, typically perform less well in these increasingly complex scenarios. As Hadoop and the infrastructure it is running on are placed under greater load, apparently inconsequential early compromises around the network connections among nodes, latency between storage and compute, and configuration decisions concerning minutiae deep inside the system become increasingly significant.
For many, their routine, predictable, and infrequent use of Hadoop will rarely stretch the capabilities of almost any Hadoop host. But for those regularly pushing Hadoop to deliver value from complex ad hoc queries run across large and changing data sets, there is a real decision to make: invest deep, long, and heavily in building and sustaining an in-house Hadoop capability to rival the web’s giants, or find a partner who has already done that work and build on its foundations.
7 Key takeaways
- Hadoop is designed to perform at scale, and large Hadoop clusters behave differently from the small groups of machines developers typically use to learn.
- While Hadoop can be installed on most types of infrastructure, clusters perform best when network interconnections among nodes and the per-node mix of memory and CPU have been optimized to the particular requirements of memory-intensive parallel computation. General-purpose infrastructure from mainstream IaaS and hosting providers is unlikely to deliver the best combination of price and performance.
- There are a range of models for running a Hadoop cluster, from building in-house talent and infrastructure to adopting one of several Hadoop-as-a-Service (HaaS) solutions.
- HaaS providers come in many forms, from those that simply install Hadoop on machines from mainstream IaaS or hosting providers to those that pay far more attention to the whole experience and data life cycle by actively monitoring the design, submission, processing, and reporting of every workload on purpose-built infrastructures.
- Competing HaaS products bring different costs and benefits, making it important to understand your requirements and their strengths and weaknesses.
- The varieties of HaaS deliver different levels of availability, performance, and SLA. They also require different levels of internal resource and expertise to effectively exploit them.
8 About Paul Miller
Paul Miller is an analyst and consultant, based in the East Yorkshire (U.K.) market town of Beverley and working with clients worldwide. He helps clients understand the opportunities (and pitfalls) around cloud computing, big data, and open data, as well as presenting, podcasting, and writing for a number of industry channels. His background includes public policy and standards roles, several years in senior management at a U.K. software company, and a Ph.D. in archaeology.
Paul was the curator for Gigaom Research’s Infrastructure and Cloud Computing channel during 2011, routinely acts as moderator for Gigaom Research webinars, and has authored a number of underwritten research papers such as this one.
9 About Gigaom Research
Gigaom Research gives you insider access to expert industry insights on emerging markets. Focused on delivering highly relevant and timely research to the people who need it most, our analysis, reports, and original research come from the most respected voices in the industry. Whether you’re beginning to learn about a new market or are an industry insider, Gigaom Research addresses the need for relevant, illuminating insights into the industry’s most dynamic markets.
Visit us at: research.gigaom.com.