Hadoop-based solutions are increasingly encroaching on the traditional systems that still dominate the enterprise-IT landscape. While Hadoop has proved its worth, neither its wholesale replacement of existing systems nor the expensive and unconstrained build-out of a parallel and entirely separate IT stack makes good sense for most businesses. Instead, Hadoop should normally be deployed alongside existing IT and within existing processes, workflows, and governance structures. Rather than initially embarking on a completely new project in which return on investment may prove difficult to quantify, there is value in identifying existing IT tasks that Hadoop may demonstrably perform better than the existing tools. ELT offload from the traditional enterprise data warehouse (EDW) represents one clear use case in which Hadoop typically delivers quick and measurable value, familiarizing enterprise-IT staff with the tools and their capabilities, persuading management of their demonstrable value, and laying the groundwork for more-ambitious projects to follow. This paper explores the pragmatic steps to be taken in introducing Hadoop into a traditional enterprise-IT environment, considers the best use cases for early experimentation and adoption, and discusses the ways Hadoop can then move toward mainstream deployments as part of a sustainable enterprise-IT stack. Key findings include:
- Hadoop is an extremely capable and highly flexible tool. Early in an implementation, the combination of that flexibility with a poorly scoped project creates a real risk of scope creep and project bloat, decreasing the probability of success.
- It makes sense to apply Hadoop to well-understood problems as a pilot, creating opportunities to measure return on investment while allowing all concerned parties to concentrate on learning the platform.
- Hadoop can be an efficient and cost-effective tool for offloading some of the data processing currently done inside EDW, freeing up capacity and creating the sort of measurable challenge early adopters of Hadoop need.
- Hadoop augments — but does not replace — the EDW.
2 What is Hadoop?
Now a central piece of most conversations about big data, Hadoop began in 2005 as a Yahoo project to exploit data-processing ideas made popular by Google’s work on MapReduce, published the previous year. The initial impetus was to create a technical solution capable of processing very large data volumes across clusters of commodity servers. The MapReduce model provides an effective method to divide data into manageable chunks, store and process each chunk on a different server, and then combine the result for reporting. Recognizing that there is a high probability of one or more servers (or nodes) failing when a cluster comprises hundreds or thousands of cheap commodity nodes, the model is designed to be extremely fault-tolerant. Individual nodes can fail without significantly affecting the rest of the cluster.
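The division of labor described above can be sketched in a few lines of pure Python. This is an illustration of the MapReduce programming model only, not the Hadoop API: the map, shuffle, and reduce phases shown here as ordinary functions are what Hadoop distributes across the nodes of a cluster, and the word-count task is the model's canonical example.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) pair for each word in one chunk of input.
    # Hadoop runs many copies of this function, one per chunk, in parallel.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does between
    # the map and reduce phases (moving data between nodes as needed).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all the values for one key into a single result.
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts["the"] == 2
```

Because each map call touches only its own chunk and each reduce call only its own key, individual tasks can be rerun on another node if a server fails, which is the basis of the fault tolerance described above.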
The core Hadoop software is freely available under an open-source license, and its ongoing development is now under the auspices of the Apache Software Foundation. The main Apache Hadoop project includes four principal modules (of which Hadoop MapReduce is one), and a growing collection of related projects at Apache are important in extending the software to include data warehousing, machine learning, stream processing, and other data-processing functions.
Apache Hadoop is freely available for download and use, but a range of third parties has emerged to develop their own versions (or distributions) of the code. Companies like Cloudera, Hortonworks, and MapR are working to extend the capabilities of Apache Hadoop and to make it easier to deploy in mainstream applications. As well as marketing the advantages of its own distribution, each company also contributes much of its work back to the Apache Hadoop project itself.
3 Where does Hadoop make sense?
Hadoop continues to be heavily deployed at the internet companies where its underlying principles were initially developed. Companies such as eBay, Yahoo, and Facebook all run a number of large Hadoop clusters (often comprising thousands of machines) with petabytes of data under management and available for analysis. Hadoop is able to store and process structured, unstructured, and semi-structured data almost regardless of file format or source, making it valuable in extracting insights from a complex set of inputs. Internet companies, for example, are often analyzing the content submitted by their users, the behavior of those users as they move through the site (clickstream and log analysis), and their responses to material displayed to them (A/B testing and ad-impression measurement).
Beyond these early adopters, Hadoop is also receiving significant interest from across a wide range of other industries, from financial services and retail to travel, defense, pharmaceuticals, and oil and gas exploration. In each case, the ability to cost-effectively manipulate large volumes of mixed data is proving a compelling proposition. Few in these industries currently address the scale of data routinely processed by the likes of Facebook and Yahoo. Instead, pilot deployments tend to use tens of nodes with production workloads sometimes expanding to run across a few hundred machines.
Hadoop’s cost-effectiveness and flexibility make it a compelling tool in a wide range of industries and in tackling a broad set of potential use cases. However, the lack of clear constraints on a Hadoop implementation can make it difficult for new adopters to fully understand the tool’s strengths and weaknesses or its measurable impact on the day-to-day operation of their business. Without clear scoping and constraints, early projects will wander and fail to deliver demonstrable returns.
4 What are the barriers to adoption in the enterprise?
Hadoop’s ability to cost-effectively analyze large volumes of complex data from across an organization’s different data silos is driving high levels of interest from businesses of all types. However, Hadoop’s strengths are the result of a rather different approach to designing, deploying, and using IT resources. The disconnect between the way in which traditional business intelligence and enterprise data warehouse (EDW) approaches are run and the optimal workflow for Hadoop is worth considering prior to any significant investment in Hadoop.
Structure isn’t everything
Some of the easiest data sets to consume within Hadoop are the so-called unstructured (more accurately semi-structured) data produced by application logs, social media interactions, and the like. But most of the data in a modern enterprise organization that has recognizable and quantifiable value will be highly structured in nature, detailing stock levels, customer contacts, sales figures, and so on. It will normally be stored on a mainframe or in a highly structured database or EDW environment, and it will typically be defined by one or more tightly controlled schema definitions. Hadoop can also add value to data of this type, as we shall see, but there may be additional steps involved in getting the data into a form ready for processing in Hadoop.
A meeting of cultures, a shortage of skills
For years custodians of enterprise data have been trained to work in a highly structured and often tightly regulated manner. Skills in database administration, compliance, audit, and system integration remain highly sought after. Complex processes have grown up over many years, and organizations are often heavily reliant upon ensuring that change is slow, considered, and tightly managed. Safeguards developed to ensure business continuity and legal compliance can appear intransigent and inexplicably slow to those more used to the lightweight processes and risk-taking attitudes more typically encouraged at internet companies, startups, and within open-source development projects.
Building and running Hadoop clusters, and making effective use of the tools in the Hadoop ecosystem, require a set of skills that is not yet common among enterprise-IT professionals. Establishing, scaling, and maintaining an effective Hadoop capability poses particular challenges that are different from those typically faced by administrators of tightly integrated premium database and EDW products. Hadoop may be free to download and it may be relatively straightforward for experienced IT staff to deploy a small test cluster, but operating in production and at scale is a far more complex proposition. In production, too, the Hadoop cluster may need to be integrated with existing compliance, audit, and security processes.
A number of companies now offer various flavors of Hadoop as a Service, ranging from simple outsourcing of hardware and basic software configuration (like Amazon’s Elastic MapReduce or Microsoft’s HDInsight) to richer professional-services engagements with a growing range of systems integrators. Each of these removes part of the need for local skills and knowledge, but none removes the requirement to understand what Hadoop may be capable of in particular situations.
The infrastructure deficit
One of Hadoop’s strengths is that it does not require expensive high-performance servers to run. Instead, it is designed to operate in a fault-tolerant fashion across clusters of commodity hardware. The hardware may be cheap and readily available from affordable providers such as Dell, but that does not mean that the enterprise procurement, deployment, and maintenance procedures are well-evolved to cope with acquiring hardware of that type at volume. While it is usually possible to find a few machines for a pilot deployment, any move toward production workloads may require a reevaluation of the processes by which hardware is purchased and maintained.
That’s not how we do things here
Perhaps the biggest barrier to adoption, and one that touches on each of the issues discussed above, is resistance to change. Processes, attitudes, and systems evolve to meet a particular set of circumstances, and unless there is some clear and explicit driver for change, it can be extremely difficult to evolve these. Offering tangible demonstrations of the ways in which Hadoop or any other innovation adds value will often be key.
5 What are the risks of adoption in the enterprise?
Despite the barriers to enterprise adoption of Hadoop explored above, there is a clear enthusiasm for realizing some of the value promised by today’s big data hype. In shadow-IT deployments, cautiously approved experiments, and overly ambitious proofs of concept, many enterprises are getting their first experiences of Hadoop. In general, realistic and tightly scoped projects are far more likely to deliver insights that the business can use in shaping its future big data strategy than open-ended, vague, or overly ambitious ones. Many organizations fail to successfully move beyond these initial pilots, for a number of reasons, including those discussed below.
Unclear and unrealistic expectations
Perhaps the biggest risks to adopting a Hadoop-based big data strategy within the enterprise are unclear and unrealistic expectations about the problems to be tackled and the value to be gained. Big data and Hadoop are surrounded by hype in the technology press and the mainstream media. Executives signing off on an unscoped big data strategy with the expectation that transformative market insights or monetizable customer trends will simply emerge are likely to be sadly disappointed.
The technology may be performing well and it may be adding value, but it is unlikely to achieve the wide-ranging effects that casual followers of the hype are probably expecting. It is far better to under-promise and over-deliver than to allow employees and executives to measure small but reasonable successes against the inflated expectations of media hype.
Lack of measurable return on investment
When implementing any new technology or process within an established line of business, it’s important to establish metrics suitable for assessing the success or failure of the innovation. Too many companies take their first exploratory steps with Hadoop by tackling new or experimental areas of the business. Without a baseline of the time, cost, or effort required when using more-traditional techniques, it is difficult to effectively assess whether or not Hadoop is adding value. The learning process and the experimental nature of the problems being addressed combine to create a mass of perhaps contradictory signals that are difficult to quantify and assess.
Especially in the initial stages of Hadoop adoption, there may be value in identifying tightly scoped and well-understood problems or processes that the organization already addresses by other means. When already familiar with the data and the problem space, teams can focus their learning on Hadoop itself. Already familiar with the time, cost, effort, and outcomes from existing approaches, decision-makers can focus their evaluation on the added value delivered through the use of Hadoop. Both the development team and the decision-makers have far more opportunity to realistically measure the benefits of using Hadoop instead of the tools and processes it replaces. As we will explore below, many early deployments of Hadoop concentrate on augmenting existing data-processing tasks related to an enterprise data warehouse for this reason. The costs and benefits of existing ELT and ETL processes are well-understood, and a successful Hadoop deployment in this area can quickly deliver benefits such as freeing up finite server resources on expensive EDW hardware.
As familiarity with Hadoop grows, it is easier to consider deploying it to tackle additional challenges that are either newly discovered or were considered too difficult to contemplate with traditional techniques. Hadoop’s value is understood by that stage, and measurement and evaluation can focus on the results that Hadoop is delivering in the new area of exploration.
Creation of yet another silo
Data for analysis in a new Hadoop cluster is usually collected from existing sources, including an established EDW, archives of log files, a feed of social media data from Gnip or DataSift, and so on. It makes sense to bring this data into the Hadoop cluster for processing and analysis, but care must always be taken to:
- Avoid inadvertently bypassing or deliberately ignoring data protection, compliance, audit, and other safeguards typically built into enterprise systems such as the EDW
- Prevent the new data pool from stagnating, holding versions of data far older than those the rest of the organization’s systems are using
- Avoid unnecessarily maintaining a separate copy of data for longer than analysis and processing require
- Ensure that modifications and enhancements to source data can be fed back into the systems it came from, elsewhere in the organization
- Make findings and insights from the Hadoop cluster easily available for incorporation into other systems, elsewhere in the organization
The costs of scale
Although based on commodity hardware, the costs of a Hadoop cluster can rapidly grow as the number of nodes (servers) increases to hundreds or thousands of machines. Although the incremental cost of adding additional servers may appear low, those costs are not insignificant once operating at scale. It is therefore worth investing time and effort in the early stages of a project to maximize the performance and efficiency of each and every node. The cost of managing the cluster also grows as the interconnections among pieces become increasingly complex, and specialist cluster-management solutions may be required to address the rapidly growing complexity of the challenge. A network capable of routinely transferring small pilot data sets will quickly begin to struggle when large numbers of machines are loading, analyzing, and reporting on terabytes or petabytes of data.
Without a clear understanding of the costs and benefits involved, a viable experiment on a small cluster can easily become a costly failure in production.
Failure to upskill
Just as physical infrastructure can struggle to make the transition from pilot to production, so too can the teams responsible for maintaining and extracting insight from a sustained Hadoop investment. The skills involved in playing with some data for a pilot are unlikely to be the skills required to build consensus, identify priority areas of study, engage with stakeholders across the business, and present results in ways that drive the business forward.
All organizations face challenges in moving small pilots into large-scale and business-wide production deployments. For some, the answer will be to nurture and develop the skills of existing employees. Others may choose to hire additional talent. A third group will choose to identify external partners capable of making many of the design, management, and operational decisions on an ongoing basis.
Wherever those skills come from (and each of these three approaches will suit different companies) they will be critical in ensuring a smooth transition from an interesting pilot to a sustainable production deployment of Hadoop.
6 Mitigating risk by retaining focus
As we have discussed, early experiments with Hadoop should typically meet the following criteria:
- Well-scoped, with a shared understanding of the data, its sources, and the expected outcomes
- Familiar data, processes, and questions, allowing attention to focus on learning and evaluating Hadoop
- Finite, with a clear and tangible end in sight
- Affordable, with an input of infrastructure and staff resource justified by the expected outcomes
In addition, if early work can deliver quick and tangible benefits to the business, so much the better.
Using Hadoop to support existing EDW installations is one use case that meets these criteria, making it a strong candidate for early experimentation with Hadoop across a wide range of established businesses. The Hadoop cluster does not replace existing investment in the EDW, but it does remove some of the workload from expensive and finite EDW server resources. This enables the organization to make better use of its existing EDW investment while also gaining experience with using Hadoop to solve real problems.
ELT and ETL
Although sometimes lampooned as a lumbering dinosaur or the place where data goes to die, the enterprise data warehouse remains an important class of software that powers reporting and analysis functions core to the operation of most larger businesses. Data bound for the EDW is routinely extracted from operational systems in business units such as sales, marketing, or manufacturing before being modified for upload to the EDW through a process or set of processes commonly referred to as ETL: extract, transform, load.
[Figure. Source: Syncsort, Gigaom Research]
The extract step covers the process or processes required to pull data out of sources such as a sales team’s customer-relationship management system, an external social media stream, a mainframe-based ERP solution, or monitoring data from sensors on a factory floor.
Data originating in different systems will be almost impossible to compare and combine without some additional effort. The transform step usually involves a number of processes that alter data formats if necessary, apply data-quality and integrity-checking rules, harmonize data structures (making sure dates in each system are formatted the same way), join data from different sources, sort data in different ways (alphabetically, chronologically), and implement any mappings to or from code lists, thesauri, and other standardization tools. Many of these transformations will be logically straightforward, but together they can combine to create a significant computational task.
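A short Python sketch can make the transform step concrete. The field names, date formats, and code list below are illustrative assumptions rather than a real schema; the point is that each individual rule (harmonizing a date format, mapping a code to a standard value, sorting) is simple, yet a production pipeline applies thousands of such rules to millions of rows.

```python
from datetime import datetime

# A code list mapping source-system abbreviations onto the standardized
# values the warehouse expects (an illustrative example, not a standard).
COUNTRY_NAMES = {"UK": "United Kingdom", "US": "United States"}

def transform(record, date_format):
    # Harmonize dates from the source system's format to ISO 8601 so that
    # records from different systems sort and compare correctly.
    record["order_date"] = (
        datetime.strptime(record["order_date"], date_format).strftime("%Y-%m-%d")
    )
    # Map short codes onto standardized values, passing unknowns through.
    record["country"] = COUNTRY_NAMES.get(record["country"], record["country"])
    return record

# Two hypothetical sources formatting the same information differently.
crm_rows = [{"order_date": "03/01/2014", "country": "UK"}]  # DD/MM/YYYY
erp_rows = [{"order_date": "2014-01-05", "country": "US"}]  # already ISO 8601

# Join the harmonized streams and sort chronologically for loading.
harmonized = [transform(r, "%d/%m/%Y") for r in crm_rows] + [
    transform(r, "%Y-%m-%d") for r in erp_rows
]
harmonized.sort(key=lambda r: r["order_date"])
```

Each rule here is logically trivial, but run against every row of every source on every load cycle, this is exactly the computational burden that the transform step imposes.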
Once transformed in order to be both comparable and consistent, data from the different source databases must be loaded into the EDW for analysis, reporting, and long-term storage.
There is some difference of opinion about where the transformation step should take place. The most common approach, historically, has been to transform data before loading it into the EDW: ETL. But there is growing interest in approaches that load data in whatever form it arrives and move the transformation work into the destination system (typically the EDW): ELT, or extract, load, transform. This approach can prove more flexible, with the EDW able to transform data into a variety of formats as requirements change. There is also a perception that the ELT process is better suited to fluid environments, where there may be a need for data to both flow into the EDW and back out from the EDW to source systems.
Keeping the EDW busy
ELT may increase business agility, but it also places significant load on the SQL databases at the heart of the modern EDW, databases that are rarely optimized to deal with sequential (and often manually coded) data transformations. Mike Koehler, the CEO of EDW powerhouse Teradata, was quoted earlier this year as suggesting that 20 percent to 40 percent of the load on a Teradata EDW is related to data transformations. At least some of this work might be more cost-effectively done elsewhere.
Save time and money: Offload to Hadoop
Hadoop clusters are often estimated to be 10 to 100 times cheaper than a high-end EDW for processing data. For mission-critical reporting and preservation functions, the additional cost of the EDW can clearly continue to be justified by most businesses, but with many in the industry accepting a figure of around 70 percent of all data warehouses being performance- and capacity-constrained, there is clear value in finding more-cost-effective ways to free up capacity on expensive and stretched EDW infrastructure.
[Figure. Source: Syncsort, Gigaom Research]
Hadoop can be well-suited to tackling the data transformations and batch processes that inefficiently consume so many of the resources available to an EDW. Jobs that take a long time to run and those that include semi-structured data from logs and clickstream analyses may prove particularly suitable for early offloading to Hadoop. The resulting data is still available for analysis and storage within the EDW, but it has reached a usable form far more quickly and cheaply than would have been possible inside the EDW itself.
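One low-friction way to offload such a job is Hadoop Streaming, which lets any executable act as a mapper or reducer by piping records through standard input and output. The sketch below summarizes page requests from raw clickstream logs; the log layout (space-separated fields with the URL in the third position) is an illustrative assumption, not a standard format.

```python
def mapper(lines):
    # Map: emit "url<TAB>1" for each request found in the raw log lines,
    # skipping lines too short to contain a URL field.
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            yield fields[2] + "\t1"

def reducer(lines):
    # Reduce: Hadoop delivers mapper output sorted by key, so all counts
    # for one URL arrive together; sum them and emit one total per URL.
    current, total = None, 0
    for line in lines:
        url, count = line.rsplit("\t", 1)
        if url != current:
            if current is not None:
                yield current + "\t" + str(total)
            current, total = url, 0
        total += int(count)
    if current is not None:
        yield current + "\t" + str(total)
```

In an actual Streaming job, each function would read `sys.stdin` and write `sys.stdout` as a standalone script, submitted via the Hadoop Streaming JAR’s `-mapper` and `-reducer` options. The compact per-URL totals, rather than the raw logs, are then loaded into the EDW.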
7 Building on success
Using ETL or ELT offload to prove the value of Hadoop inside a business makes sense for a number of reasons, as we have explored. It delivers real, immediate, and measurable value to the business while freeing up expensive and constrained resources within the EDW and giving staff the opportunity to develop Hadoop skills.
Once demonstrably successful, organizations may well choose to extend the scope of their offload work to take more of the processing overhead away from the EDW. The EDW itself does not disappear, but the need for additional investment in expensive EDW hardware just to keep pace with growing data volumes is likely to diminish. With a clearer understanding of Hadoop’s capabilities and ready access to Hadoop experience in-house or from a trusted partner, executives are then in a far better position to begin exploring the range of less-tangible uses to which Hadoop might be put within their business.
8 Key takeaways
- Hadoop’s cost-effectiveness and flexibility make it a compelling tool in a wide range of industries and in tackling a broad set of potential use cases. However, the lack of clear constraints on a Hadoop implementation can make it difficult for new adopters to fully understand the tool’s strengths and weaknesses or its measurable impact on the day-to-day operation of their business. Without clear scoping and constraints, early projects will wander and fail to deliver demonstrable returns.
- It makes sense to apply Hadoop to known and well-understood problems in the first instance, creating opportunities to measure return on investment and ensuring that all concerned can concentrate on learning Hadoop rather than having to worry about learning new data sets, business environments, and so on.
- The enterprise data warehouse remains a core asset inside many larger organizations, but continued growth in data volumes is creating a situation in which most EDW deployments are at or very near capacity. Hadoop can be an efficient and cost-effective tool for offloading some of the data processing currently done inside the EDW, freeing up EDW capacity and creating the sort of measurable challenge early adopters of Hadoop need to work with.
- Hadoop does not replace the EDW; it augments it.
9 About Paul Miller
Paul Miller is an analyst and consultant based in the East Yorkshire (U.K.) market town of Beverley who works with clients worldwide. He helps clients understand the opportunities and pitfalls around cloud computing, big data, and open data as well as presents, podcasts, and writes for a number of industry channels. His background includes public policy and standards roles and several years in senior management at a U.K. software company.
Miller was the curator for Gigaom Research’s infrastructure and cloud computing channel during 2011, routinely acts as moderator for Gigaom Research webinars, and has authored a number of underwritten research papers such as this one.
10 About Gigaom Research
Gigaom Research gives you insider access to expert industry insights on emerging markets. Focused on delivering highly relevant and timely research to the people who need it most, our analysis, reports, and original research come from the most respected voices in the industry. Whether you’re beginning to learn about a new market or are an industry insider, Gigaom Research addresses the need for relevant, illuminating insights into the industry’s most dynamic markets.
Visit us at: research.gigaom.com.