Analyst Report: Bringing Hadoop to the mainframe

1 Summary

According to market leader IBM, there is still plenty of work for mainframe computers to do. Indeed, the company frequently cites figures indicating that 60 percent or more of global enterprise transactions are currently undertaken on mainframes built by IBM and its remaining competitors, such as Bull, Fujitsu, Hitachi, and Unisys. The figures suggest that a wealth of data is stored and processed on these machines, but as businesses around the world increasingly turn to clusters of commodity servers running Hadoop to analyze the bulk of their data, the cost and time typically involved in extracting data from mainframe-based applications become a cause for concern.

By finding more-effective ways to bring mainframe-hosted data and Hadoop-powered analysis closer together, the mainframe-using enterprise stands to benefit from both its existing investment in mainframe infrastructure and the speed and cost-effectiveness of modern data analytics, without necessarily resorting to relatively slow and resource-expensive extract, transform, load (ETL) processes to endlessly move data back and forth between discrete systems.

Key findings include:

  • Mainframes still account for 60 percent or more of global enterprise transactions.
  • Traditional ETL processes can make it slow and expensive to move mainframe data into the commodity Hadoop clusters where enterprise data analytics processes are increasingly being run.
  • In some cases, it may prove cost-effective to run specific Hadoop jobs on the mainframe itself.
  • In other cases, advances in Hadoop’s stream-processing capabilities can offer a more cost-effective way to push mainframe data to a commodity Hadoop cluster than traditional ETL.
  • The skills, outlook, and attitudes of typical mainframe system administrators and typical data scientists are quite different, creating challenges for organizations wishing to encourage closer cooperation between the two groups.

2 Hadoop in the enterprise

Hadoop[1] is used inside the enterprise to process data for a wide range of use cases, from analyzing server logs to optimizing supply chains and modeling financial fraud. These analyses have tended to focus on unstructured data (such as social media conversations) and semi-structured data (such as transaction and log records), and they are most often run as batch jobs to analyze historical data. Much of this source data is stored in existing enterprise IT systems, including mainframe-based databases and enterprise data warehouses (EDW). Processes to extract data from existing systems, transform it into common formats for analysis, and then load it into a further system to discover and act on insights are typically referred to as ETL (extract, transform, load) or ELT (extract, load, transform). Hadoop can play a role in lowering the costs associated with these processes, but it is still time-consuming to continually move data from one system and format to another.
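
The difference between ETL and ELT is simply one of ordering, and the minimal Python sketch below makes that explicit. The function names and sample records are invented placeholders rather than a real integration API: in the ETL case the transformation happens in a staging area before the data reaches the analysis platform, while in the ELT case raw data is loaded first and transformed afterwards using the target platform’s own, comparatively cheap and parallel, compute.

    # Conceptual sketch only: function names and data are hypothetical
    # placeholders, not a real integration API.

    def extract(source):
        """Pull raw records out of the system of record (e.g., a mainframe database)."""
        return list(source)

    def transform(records):
        """Normalize records into a common format for analysis."""
        return [str(r).strip().lower() for r in records]

    def load(records, target):
        """Write records into the analysis platform (e.g., an HDFS-backed store)."""
        target.extend(records)

    # ETL: transform in a staging area *before* the data reaches the cluster.
    etl_store = []
    load(transform(extract(["Record-1 ", "Record-2 "])), etl_store)

    # ELT: load the raw data first, then let the cluster's own compute
    # perform the transformation.
    elt_store = []
    load(extract(["Record-3 ", "Record-4 "]), elt_store)
    elt_results = transform(elt_store)

    print(etl_store, elt_results)

Either way, every additional copy or conversion adds time and cost, which is the crux of the problem the rest of this report addresses.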

More recent developments in the Hadoop community have introduced a richer set of capabilities for the analysis of near-real-time, real-time, and even streaming data sets. As these become increasingly interesting to mainstream enterprise adopters of Hadoop, the delays associated with traditional ETL and ELT processes present a more significant performance bottleneck.

[1] For a primer on Hadoop, please see Gigaom Research’s report “Hadoop in the enterprise: how to start small and grow to success.”

3 Remember the mainframe?

In 2014 IBM is celebrating the fiftieth birthday of the mainframe, in recognition of the April 1964 launch of the company’s modular and upgradeable System/360 family.

Although designed to be highly reliable, secure, and backwards-compatible with legacy applications, mainframes have seen their market squeezed by the rise of cheaper and more flexible forms of computing, typified by the desktop PC and commodity servers running Windows and Linux. NASA is among a growing number of organizations to have moved away from mainframes completely, but for the U.K. Met Office and 71 percent of the Fortune 500, the machines continue to deliver value in tackling specific workloads. The role of the mainframe may be declining at each of these sites, and new sales continue to fall, but the big machines show no real sign of disappearing completely any time soon.

According to IBM, 92 of the top 100 banks and 23 of the leading 25 retailers are among those relying on mainframes to process data concerning core aspects of their business. Taken together, over 3,000 of the world’s largest public and private institutions reportedly run 10,000 mainframe machines alongside other types of IT infrastructure.

This investment in mainframe technologies is substantial, and for many organizations mainframe-based legacy applications are fundamental to the ongoing operation of their business. The earliest adopters of email have long since migrated from mainframe-based email services to Microsoft Exchange, Lotus Notes, or cloud-based alternatives like Google Apps. ERP systems like SAP have also mostly moved off mainframes to client/server-based alternatives, and completely new projects generally begin life in cloud or otherwise virtualized development architectures. But the core — and often bespoke or extremely customized — applications at the heart of financial processing, weather forecasting, patient tracking, or insurance underwriting remain tied to the mainframes on which they have organically evolved over decades. Mainframe-focused development languages like COBOL and Fortran, mainframe-based assumptions and practices around security and resilience, and tightly interconnected code and business logic mean that, in many circumstances, the cost of migrating these applications to an alternative architecture is simply too great to consider. Organizations cannot afford to operate without these applications during any migration process, and the risks posed by a botched or incomplete migration are often considered to be potentially catastrophic to the business.

Instead of seriously contemplating moving these workloads, organizations are paying attention to new ways in which they can gain even greater value from their historical and ongoing investment in mainframes.

The enterprise data stored in mainframe-based systems is extremely valuable to the organization, and timely and cost-effective analysis of this data can have a significant impact on managers’ ability to monitor performance and respond to both internal and external trends. Elsewhere in the business, traditional business intelligence (BI) and data-analysis tools are increasingly being supplemented by Hadoop-based approaches. These typically make it cost-effective to collect, store, and analyze far larger data volumes, shifting from observing a sample of core logs or key performance indicators (KPIs) to modeling potential outcomes based on all of the easily available data. Despite the business value of the data stored in the mainframe, the cost and effort involved in extracting data for analysis elsewhere can prove prohibitive. In extreme cases, the business makes strategic data-based decisions without the benefit of the core data locked inside its own mainframe-based systems.

Two alternative but complementary approaches are available to assist in overcoming this challenge, ensuring that data from mainframe-based systems is available for analysis and exploration alongside data from other areas of the business:

  • Conducting Hadoop-based data analysis locally on the mainframe
  • Improving the mechanisms by which data can be extracted from the mainframe for analysis in traditional cluster-based Hadoop deployments


4 Mainframe, meet Hadoop

In a number of circumstances there may be value in analyzing mainframe-based data on the mainframe itself, without extracting it for processing and analysis elsewhere.

Working an asset

Long-term mainframe implementations represent a sizeable and ongoing commitment, including investment in mainframe hardware, software, support, and staff with specific skills in maintaining and operating the system. These large machines are normally purchased or leased on long-term contracts, and capacity will typically be specified to cope with anticipated peaks in demand. Expensive infrastructure can, therefore, be significantly underutilized during non-peak periods. Figures from Mark Levin at Metrics Based Assessments, for example, suggest that utilization levels reached 83 percent in one study of best practice, with average weekend loads as low as 38 percent. While an underutilized cluster of servers might be partially shut down to save power and cooling costs, a mainframe is essentially a fixed cost; it is running (and being powered, cooled, maintained, and supported) regardless of the load being placed on it.

Where Hadoop jobs are not particularly time-sensitive, there may be value in utilizing spare cycles on an existing mainframe investment instead of incurring the additional expense of procuring, deploying, and maintaining a separate Hadoop cluster.

Respecting the borders

Mainframes run applications in contexts such as health care, financial services, and government, processing transactions involving the personal details of millions of individual patients, customers, and citizens. These workflows tend to be heavily regulated, and compliance regimes or legislative frameworks may govern the ways in which data can be used, reused, or moved around. Institutional security policies, too, may treat the mainframe and the data it holds differently from other parts of the enterprise IT estate, placing it in a separate and secure area of the network.

It may therefore be technically or practically difficult to extract sensitive personal data from the mainframe for processing, making data analysis on the mainframe itself a more logical first step; a mainframe-based pilot of Hadoop may be the easiest way to demonstrate value. The results of that pilot might suggest that future analysis continue to be conducted on the mainframe, or it might provide evidence that can then be used to build a case for the technical, procedural, and organizational changes required to simplify the process of extracting data from the mainframe.


The majority of Hadoop processes today are still essentially batch jobs; a large set of historical data is loaded into Hadoop’s MapReduce engine and chopped into manageable pieces for processing, and the results are then combined to reveal the answer. These batch processes tend to be routine, based on a known and finite set of data points, and the desired outcome tends to be well-understood in advance. For example (the first of these is sketched in code after the list):

  • Process monthly sales figures to find the best-selling products and the best-performing stores. Compare to previous months.
  • Analyze this week’s credit card transactions. Compare to fraud patterns detected in previous weeks to identify potentially fraudulent behavior.
  • Analyze this year’s income from insurance premiums and expenditure on health care. Identify patterns and trends to modify policies and maintain profitability.
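
To make the batch pattern concrete, here is a minimal sketch of the first example implemented as a Hadoop Streaming job, in which the mapper and reducer are ordinary scripts that read standard input and emit tab-separated key/value pairs. The input layout (CSV lines of store, product, and quantity sold), the script name, and the HDFS paths are illustrative assumptions, not details taken from any particular deployment.

    #!/usr/bin/env python
    """Minimal Hadoop Streaming job: total quantity sold per product.

    Assumed input layout, one CSV line per sale: store_id,product_id,quantity
    Illustrative invocation (paths and jar location vary by installation):

      hadoop jar hadoop-streaming.jar \
          -files sales_job.py \
          -mapper "python sales_job.py map" \
          -reducer "python sales_job.py reduce" \
          -input /data/sales/2014-03 -output /reports/top-products-2014-03
    """
    import sys

    def run_mapper():
        # Emit "product_id<TAB>quantity" for every well-formed sale record.
        for line in sys.stdin:
            parts = line.strip().split(",")
            if len(parts) != 3:
                continue  # skip malformed records
            _store, product, qty = parts
            print("%s\t%s" % (product, qty))

    def run_reducer():
        # Streaming sorts mapper output by key, so totals can be accumulated
        # product by product.
        current, total = None, 0
        for line in sys.stdin:
            product, qty = line.rstrip("\n").split("\t")
            if product != current:
                if current is not None:
                    print("%s\t%d" % (current, total))
                current, total = product, 0
            total += int(qty)
        if current is not None:
            print("%s\t%d" % (current, total))

    if __name__ == "__main__":
        run_reducer() if sys.argv[1:] == ["reduce"] else run_mapper()

The same pair of functions can be tested locally, without a cluster, by piping a sample file through "python sales_job.py map | sort | python sales_job.py reduce", which is one reason the streaming model is popular for this kind of routine, well-understood batch work.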

There is also interest in enabling more-exploratory interaction with the data, which may require ready access to every piece of information in the system. Extracting all of an organization’s mainframe-based data may be too computationally expensive or administratively painful to be worth contemplating, just to support unconstrained and unquantified exploration of the data set. The potential benefits are simply too poorly understood to justify the investment of time and effort.

By utilizing spare cycles on the mainframe itself, exploratory data analysis may be undertaken without significant new investment and without having to make a case for the resources necessary to support large-scale data extraction to another system.

Shortening the cycle

As well as using Hadoop for batch analysis of historical data, organizations see value in analyzing more-recent data rapidly enough for the insights to inform future behavior. For example,

  • Analyze today’s sales to inform overnight shipments of replacement stock
  • Analyze up-to-date information on room availability to decide how many hotel rooms to discount with third-party booking sites

In these situations, it may take too long to extract data for analysis elsewhere; analysis on the system of record (i.e., the mainframe) may be the only cost-effective way to complete the processing in sufficient time for it to inform near-future behavior.

Hadoop on the mainframe

Hadoop is typically associated with clusters of x86-based machines running Linux, and that remains the dominant model. Individuals and small teams continue to experiment with Hadoop on alternative platforms, including the Raspberry Pi, Oracle’s SPARC-based servers, and machines using low-power ARM processors. Veristorm’s vStorm Enterprise productizes the ability to install Hadoop on mainframes running Linux, addressing the requirements explored above. The hardware’s virtualization capabilities ensure that data in applications running on alternative mainframe operating systems can also be ingested into Hadoop.

Alternatively, organizations can look at ways to get data off the mainframe more quickly for effective integration into their existing data-analytics capabilities. Competitors such as Syncsort and Informatica are more concerned with simplifying this process of getting data off the mainframe for analysis elsewhere, as we will now discuss.

5 Hadoop, meet mainframe

In other circumstances, it makes more sense to simplify the process of extracting data from the mainframe for analysis elsewhere. This may either relate to traditional batch processing of historical data or an attempt to analyze data in real time as it streams from the systems of record.


Most enterprise deployments of Hadoop are designed to analyze data that originates in other enterprise systems, including mainframe-based applications, an enterprise data warehouse, and so on. Established providers of enterprise IT systems therefore have an interest in simplifying the process of getting data from their systems into Hadoop, and Pivotal, Teradata, IBM, Microsoft, SAP, and others all provide tools designed to simplify and accelerate this process. Specifically in the area of data extraction from mainframes, Software AG, Syncsort, Veristorm, and Informatica are among those offering a range of tools, utilities, and workflows.

At a technical level, dumping data from the mainframe is reasonably well understood, and implementations tend to be robust. Challenging issues remain, however, and these are best addressed through policy, training, and effective communication among the relevant stakeholders.

Security and compliance

As discussed above, applications on mainframes often contain personal, confidential, or otherwise sensitive data. Network configuration, access control, and the manner in which the application itself was architected on the mainframe are all part of the process by which this data is secured. When data is extracted from the mainframe for processing elsewhere, these safeguards may be actively or inadvertently bypassed, potentially placing sensitive data (and corporate reputations) at risk. These risks apply both to data moving across the network from the mainframe to a staging area or other system and to data at rest outside the mainframe.

Enterprise-grade Hadoop distributions such as those from Cloudera and Hortonworks are increasingly paying attention to security and compliance concerns, but organizations used to placing undue faith in the security baked into their mainframe deployment need to pay particular attention to the different assumptions and policies associated with a cluster-based application like Hadoop. When integrating mainframe data with Hadoop, one important security consideration is ensuring that the associated metadata is properly accounted for in accordance with existing governance policies.

Availability of resources

Mainframe resources are expensive and finite. Variations in load mean that there will often be spare capacity for running jobs that are not time-critical, as discussed above. Regularly and predictably allocating a smaller set of resources to support extracting data from mainframe databases like DB2 may actually prove more difficult than finding that occasional spare capacity, with implications for capacity planning and load management on the mainframe itself.


Traditionally, extracting data from a mainframe-based application or database requires a number of steps, including:

  • Identifying the set of data to be transferred
  • Extracting the data from the host application
  • Moving the data over the network to a separate staging or processing area
  • Transforming the data into formats, syntaxes, layouts, or orders likely to be required during analysis
  • Moving the data over the network to the Hadoop cluster
  • Loading the data
  • Performing the analysis

This process can be time-consuming, and it needs to be repeated whenever the source data — or the desired formats and syntaxes — changes.
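
As an illustration of the transformation step, the sketch below converts a fixed-length EBCDIC unload file into UTF-8 CSV ready to be copied into HDFS, using the cp037 code page codec that ships with Python. The record layout and file names are invented for the example; real mainframe extracts frequently contain packed-decimal (COMP-3) and binary fields, which call for a copybook-aware conversion tool of the kind the vendors mentioned above aim to provide, rather than a simple character decode.

    """Sketch of one transformation step: EBCDIC fixed-width records to CSV.

    The layout below (account 10 bytes, name 20 bytes, balance 8 bytes, all
    character data) is an invented example, not a real copybook.
    """
    import csv

    FIELDS = [("account", 10), ("name", 20), ("balance", 8)]
    RECORD_LEN = sum(width for _, width in FIELDS)

    def convert(ebcdic_path, csv_path):
        with open(ebcdic_path, "rb") as src, open(csv_path, "w", newline="") as dst:
            writer = csv.writer(dst)
            writer.writerow([name for name, _ in FIELDS])
            while True:
                record = src.read(RECORD_LEN)
                if len(record) < RECORD_LEN:
                    break  # end of file (or a truncated trailing record)
                text = record.decode("cp037")  # EBCDIC code page 037 to text
                row, offset = [], 0
                for _, width in FIELDS:
                    row.append(text[offset:offset + width].strip())
                    offset += width
                writer.writerow(row)

    if __name__ == "__main__":
        convert("accounts.ebcdic", "accounts.csv")
        # The result could then be staged into the cluster with, for example:
        #   hdfs dfs -put accounts.csv /staging/accounts/

Each of these steps is another place where delay, cost, and error can creep in, which is why the streaming approaches described below are attractive.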


Newer capabilities in and around the Hadoop project, including YARN and Storm, take Hadoop beyond the batch-dominated processing model of MapReduce to embrace the processing of data streams. This capability is particularly powerful when applied to real-time data, but it also delivers benefits for those who wish to extract data from mainframe-based applications for processing in a commodity Hadoop cluster.
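
The shift is easiest to see in miniature. The Python sketch below is not Storm’s API; it simply contrasts a batch job, which waits for a complete and finite data set before aggregating it, with a streaming consumer that maintains a running aggregate and can act on it as each record arrives. The record source and names are invented for illustration.

    """Conceptual contrast between batch and stream processing (not Storm's API)."""
    import itertools
    import random
    from collections import Counter

    def records():
        """Simulate an unbounded feed of (store, amount) transactions,
        standing in for data replicated off a system of record."""
        while True:
            yield random.choice(["store-1", "store-2"]), random.randint(1, 100)

    def batch_totals(finite_batch):
        """Batch style: the whole data set must be present before work starts."""
        totals = Counter()
        for store, amount in finite_batch:
            totals[store] += amount
        return totals

    def stream_totals(stream, limit=300, report_every=100):
        """Streaming style: keep a running aggregate and act on it continuously."""
        totals = Counter()
        for i, (store, amount) in enumerate(stream, start=1):
            totals[store] += amount
            if i % report_every == 0:
                print("after %d records: %s" % (i, dict(totals)))
            if i >= limit:  # stop the demo; a real stream never ends
                return totals

    if __name__ == "__main__":
        print("batch:", dict(batch_totals(itertools.islice(records(), 300))))
        print("stream:", dict(stream_totals(records())))

In the streaming case, results are available and actionable long before the "end" of the data set, which never really arrives; that is the property that lets mainframe data flow continuously into a commodity cluster rather than through periodic bulk transfers.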

Bypassing ETL

By streaming data off the mainframe to a Hadoop cluster running on commodity hardware, organizations bypass the expense and delay of a traditional ETL or ELT process. Data is delivered in a more timely manner, and some of the security concerns associated with multiple ETL staging areas disappear altogether, because the approach avoids the need for separate staging.

Tight integration between the Hadoop distributions and the providers of mainframe-based tools can, if implemented correctly, mean that the data scientists analyzing data in Hadoop need no particular mainframe knowledge; the data simply arrives in Hadoop as a stream, just like data coming from other enterprise systems. Veristorm’s solution, for example, is certified by Cloudera and offered in partnership with Hortonworks.


6 Ops, meet the data scientist

The stereotypes of the mainframe system administrator and the new generation of data scientist are extreme, and they make the two groups appear almost incompatible.

  • One group lavishes care and attention on physical infrastructure and pays a great deal of attention to policies, procedures, and security.
  • The other sees infrastructure as a disposable commodity, available when needed and discarded after use. For them, data (and the stories it can tell) is what matters. Security, compliance, and antiquated laws may be seen simply as inconveniences, to be ignored where possible and worked around when necessary.

The reality is, of course, far more nuanced, but the competing priorities of maintaining a system and achieving results can create challenges.

More specifically, data scientists are unlikely to be familiar with the programming languages, applications, or strengths and weaknesses of a mainframe.

Processes that encourage each group to leverage its own strengths are likely to prove most beneficial to the organization as a whole, reducing opportunities for conflict, decreasing the need for training, and making use of existing skills. Streaming data from the mainframe to Hadoop running on commodity hardware is one example of this; the system administrators concentrate on keeping the mainframe performing at its best, and the data scientists use the tools and infrastructure they already know to simply consume a new stream of data. For them, the fact that this stream originates on a mainframe and not an enterprise data warehouse, an array of sensors, or a national network of cash registers is almost irrelevant; it is simply a stream of data to be analyzed.

7 Conclusion

Even as newer workloads continue their move to commodity x86 hardware in the enterprise data center or the cloud, the venerable mainframe remains a vital piece of IT infrastructure for many global corporations. The long-term investment in hardware, software, processes, and skilled staff is significant, and organizations are keen to maximize the return on their investment by minimizing idle capacity and by ensuring that data from mainframe systems plays a full role in the organization’s analyses and decision-making. Two complementary but different approaches are available to enterprises wishing to maximize the value of existing investments in mainframe infrastructure and Hadoop-based data analysis.

In situations where an existing mainframe has idle capacity, it is possible to apply an organization’s Hadoop experience and workflows to data on the mainframe itself, removing the need to export data for analysis elsewhere. This approach is generally used for analyzing specific mainframe-hosted workloads when the mainframe has spare capacity, and it is unlikely to be considered a realistic replacement for the commodity Hadoop cluster.

A more broadly applicable (and, often, more cost-effective) solution is to accelerate the rate at which mainframe-hosted data can be extracted for analysis on the commodity hardware more typically associated with Hadoop. Advances in and around the Apache Hadoop project itself, such as YARN and Storm, are beginning to deliver the capabilities that third-party vendors such as Veristorm need to accelerate workflows that once depended on resource- and time-expensive ETL processes.

It would be impractical to suggest that the mainframe could or should be anyone’s first choice as a means to run Hadoop. Commodity clusters of x86-based servers remain a more cost-effective solution in almost every case. But where it is still in use, the mainframe is often the repository of key institutional data and core transactional workflows. As decision-making processes elsewhere in the organization become increasingly dependent on data-based inputs, it is important that the value locked inside the mainframe is able to play a full role.

8 Key takeaways

  • Mainframes account for 60 percent or more of global enterprise transactions, and they continue to run core processes in financial services, health care, and the public sector.
  • Traditional ETL processes can make it slow and expensive to move mainframe data into the commodity Hadoop clusters where enterprise data analytics are increasingly being done, creating the risk that business decisions will be made without the benefit of mainframe-based data.
  • In some cases it may prove cost-effective to run specific Hadoop jobs on the mainframe itself, consuming spare cycles during non-peak periods.
  • In other cases, advances in Hadoop’s stream-processing capabilities offer a more cost-effective way to push mainframe data to a commodity Hadoop cluster than traditional ETL.
  • The skills, outlook, and attitudes of typical mainframe system administrators and typical data scientists are quite different, creating challenges for organizations wishing to encourage closer cooperation between the two groups.
  • The mainframe remains an important piece of the enterprise IT estate. If an organization is embracing Hadoop to support data-based decision-making, then core data from the organization’s mainframe needs to be part of that equation.

9 About Paul Miller

Paul Miller is an analyst and consultant based in the East Yorkshire (U.K.) market town of Beverley, and he works with clients worldwide. He helps clients understand the opportunities and pitfalls around cloud computing, big data, and open data, and he presents, podcasts, and writes for a number of industry channels. His background includes public policy and standards roles, several years in senior management at a U.K. software company, and a Ph.D. in Archaeology.

Miller was the curator for Gigaom Research’s infrastructure and cloud computing channel during 2011, routinely acts as a moderator for Gigaom Research webinars, and has authored a number of underwritten research papers such as this one.

10 About Gigaom Research

Gigaom Research gives you insider access to expert industry insights on emerging markets. Focused on delivering highly relevant and timely research to the people who need it most, our analysis, reports, and original research come from the most respected voices in the industry. Whether you’re beginning to learn about a new market or are an industry insider, Gigaom Research addresses the need for relevant, illuminating insights into the industry’s most dynamic markets.

Visit us at:

11 Copyright

© Knowingly, Inc. 2014. "Bringing Hadoop to the mainframe" is a trademark of Knowingly, Inc. For permission to reproduce this report, please contact