Apache Hadoop: Is one cluster enough? by Paul Miller
The open-source Apache Hadoop project continues its rapid evolution and is now capable of far more than its traditional use case of running a single MapReduce job on a single large volume of data. Projects like Apache YARN expand the types of workloads for which Hadoop is a viable and compelling solution, leading practitioners to think more creatively about the ways data is stored, processed, and made available for analysis.
Enthusiasm is growing in some quarters for the concept of a “data lake”: a single repository of data stored in the Hadoop Distributed File System (HDFS) and accessed by a number of applications for different purposes. Most of the prominent Hadoop vendors provide persuasive examples of this model at work but, unsurprisingly, the complexities of real-world deployment do not always fit neatly into the idealized model of a single (huge) cluster working with a single (huge) data lake.
In this report we discuss some of the circumstances in which more complex requirements may exist, and explore a set of solutions emerging to address them.