Blog Post

MapReduce vs. SQL: It's Not One or the Other

A study released today by a team of leading database experts, among them Structure 09 speaker Michael Stonebraker, has been generating buzz for its assertion that clustered SQL database management systems (DBMS) actually perform significantly better for most tasks than does cloud golden child MapReduce. But how shocked should we be, really? After all, choosing a parallel data strategy is not an all-or-nothing proposition.

Google (s goog) built MapReduce to handle its particular needs, which are a far cry from the needs of most businesses. Database analyst Curt Monash told Computerworld that the study just reinforced his belief that MapReduce is better for limited tasks like text searching or data mining — you know, the things Google does on an epic scale. For tasks that require relational database capabilities at web scale, database sharding has become a favorite practice. I’ve heard Google itself uses SQL, MapReduce and/or sharding depending on the task. Companies like Aster Data Systems and Greenplum give companies the functionality of both MapReduce and SQL in one user-friendly package.
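As a rough illustration of how the two models can express the same task, the sketch below counts page clicks twice: once with a SQL GROUP BY (using Python's built-in sqlite3 module) and once with hand-written map, shuffle and reduce steps. The click data, table name and function names are all made up for the example. The single-process "MapReduce" here only mimics the programming model; it is not how Hadoop or Google's implementation actually runs.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (visitor, page) click records.
CLICKS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/pricing"),
    ("carol", "/home"),
    ("bob", "/pricing"),
    ("alice", "/home"),
]


def count_with_sql(rows):
    """Count clicks per page with a declarative GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE clicks (visitor TEXT, page TEXT)")
        conn.executemany("INSERT INTO clicks VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT page, COUNT(*) FROM clicks GROUP BY page"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


def map_phase(record):
    """Emit one (key, value) pair per input record."""
    _visitor, page = record
    yield page, 1


def reduce_phase(key, values):
    """Combine all values that share a key."""
    return key, sum(values)


def count_with_mapreduce(rows):
    """Count clicks per page with explicit map, shuffle and reduce steps."""
    # Shuffle: group mapped values by key, as a framework would between phases.
    groups = defaultdict(list)
    for record in rows:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(count_with_sql(CLICKS))
    print(count_with_mapreduce(CLICKS))
```

Both functions return the same per-page counts. The SQL version leaves the grouping strategy to the database engine, while the map/reduce version spells out each step in code, which is roughly the trade-off between the two camps.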

I think MapReduce (and its variants, like Hadoop) has received a lot of unnecessary adoration thanks to the fervor over cloud computing. Some people, it seems, associated Google, Yahoo (s yhoo) and their web brethren with cloud computing, and thus surmised that in order to do cloud computing, you must do exactly what Google and Yahoo do. This, of course, is not the case. From a business perspective, cloud computing is just as much about saving money and making life easier as it is about doing massive amounts of computing. If you don’t have unique computing needs like the web giants, but just want to eliminate the joys of owning and managing machines, there are plenty of clustered SQL solutions available in the cloud. Like most things in life, it’s just a matter of finding the right tool for the job.

16 Responses to “MapReduce vs. SQL: It's Not One or the Other”

  1. The fundamental difference is this:

    A PL/SQL function is not parallelizable by default when executed inside a DBMS. When (not if) that changes, a DBMS will be able to do what MapReduce can do.

  2. Unfortunately the paper only really tested MapReduce’s ability to replicate relational functions. It’s not designed for efficient joining or indexing of any kind, so obviously there will be huge disparities in performance.

    I have a presentation posted online for anyone looking for a primer on Hadoop and HBase and how they compare to relational databases. I address some of the limitations of each and what their strong points are.

  3. To some extent the paper’s point is correct. Hadoop’s MapReduce implementation is anything but speedy — i.e. it is rumored to be an order of magnitude slower than Google’s internal MapReduce implementation. Users of Hadoop need to be willing to throw as much as 10 times the hardware at a problem to match any of the better MPP database implementations. That means buying 1000 Hadoop servers to keep pace with 100 MPP database servers. That’s an enormous cost in terms of power, capital expenditure, datacenter space, and more.

    Setting aside performance questions for a moment, there are good reasons why many programmers prefer to express their problems in MapReduce rather than SQL. And likewise why DBAs and analysts generally prefer SQL rather than getting their hands dirty writing code. Each is trained to approach problems in a certain way, and they prefer the mode of expression that best fits with their skills and experience.

    The good news is that Hadoop isn’t synonymous with MapReduce from a performance perspective. Here at Greenplum we’ve implemented MapReduce natively on our parallel dataflow engine, using the same building blocks used to execute SQL at high performance and massive scale. That means that users get the best of both worlds (the ability to analyze their data using SQL, MapReduce or both together in the same program) with industry-leading performance in either case.

  4. Everyone’s stuck in the speeds, feeds, and optimizations, but the point is money. First, money in the sense of building great businesses and self-cannibalizing them to keep them great. Second, professional analysts like Monash always stick up for their clients, in this case the database incumbents. Stonebraker’s motives are likely purer. He’s one of the biggest brains the Bay Area has ever produced, but I’m going to speculate he’s emotionally over-invested in structured DBs.

    Most of the tasks being benchmarked were optimized for SQL DBs because they were the most cost-effective systems when those business processes were designed. We will soon be assigning them to the slowly declining but still profitable category of “legacy systems.” As with any other transition, the change will not be universal. Mainframes are still the best for a number of tasks but no longer for the bulk of them.

    Lookery and hundreds of other companies, many cloud-hosted, are building new business processes that are optimized for MapReduce and similar architectures. In many — even most — cases, these business processes will be far more cost-effective than the ones they will replace. New incumbents will arise, new benchmarks written, and new statistics reported by analysts with new biases.

    Companies invested in SQL apps that can be replaced, most often indirectly, by MapReduce-esque apps need to start self-cannibalizing.

    • The paper does not disclose that Mr. Stonebraker is also a founder of Vertica, the column-based DBMS used in the experiment against Hadoop and the other, unnamed “DBMS-X”.

      According to the paper’s conclusions: “DBMS-X was 3.2 times faster than Map/Reduce and Vertica was 2.3 times faster than DBMS-X”.

  5. Google did not invent MapReduce; they implemented it for their Bigtable data store and search backbone, and applied it to other projects where needed.

    Also, Hadoop isn’t a variant of MapReduce; it turns tasks into MapReduce jobs to be carried out over large data sets.

  6. I think a point that a lot of people miss is that MapReduce and SQL are theoretically interchangeable but in practice are narrow logical optimizations for specific problems. MapReduce scales to much larger data sets than SQL, but at the price of being hopelessly inefficient for many complex relational operations. SQL can offer good performance across a very broad range of complex relational operations, but has much steeper scalability limits. You pick the right tool for the job, and there are a number of companies now like Aster and Greenplum that offer solutions to balance the tradeoffs between both models.

    More interestingly, the tradeoff is not inherent to databases but the result of a longstanding unsolved algorithm problem in computer science; the MapReduce/SQL dichotomy is the result of workaround hacks. If a company solved that underlying problem, you could build a distributed database system with capabilities markedly superior to the current crop of both Hadoop/Google-type databases and clustered SQL databases. There are a lot of applications and data sets that are inadequately served by either existing database model.

    Great stuff and a welcome improvement in the market, but MapReduce and SQL are not likely to be the last word.

    • The new Enterprise Architecture is going to be “several” RDBMS point applications, like core manufacturing systems and servicing (billing, CRM, etc.), plus your enterprise-wide single version of truth (master data) in a parallel DB like Vertica or SimpleDB or some variant of Hadoop.

      This is kinda what the enterprise is geared up for today, with EDW and core applications as separate departments. The difference is going to be the near-real-time nature of the EDW, which will provide read-only views for the core applications to consume.