
Supercomputers, Hadoop, MapReduce and the Return to a Few Big Computers


Yahoo announced yesterday it would collaborate with CRL to make supercomputing resources available to researchers in India. The announcement comes on the heels of Yahoo’s Feb. 19 claim to have the world’s largest Hadoop-based application now that it’s moved the search webmap to the Hadoop framework.

There are a number of Big Computing problems today. In addition to Internet search, cryptography, genomics, meteorology and financial modeling all require huge computing resources. In contrast to purpose-built mainframes like IBM’s Blue Gene, many of today’s biggest computers layer a framework atop commodity machines.

Google has MapReduce and the Google File System. Yahoo now uses Apache Hadoop. The SETI@home screensaver was a sort of supercomputer. And hacker botnets, such as Storm, may have millions of innocent nodes ready to work on large tasks. Big Computing is still big; it’s just built from lots of cheap pieces.

But supercomputing is heating up, driven by two related trends: On-demand computing makes it easy to build a supercomputer, if only for a short while; and software as a service means fewer instances of applications serving millions of users from a few machines. What happens next is simple economics.

Frameworks like Hadoop scale extremely well. But they still need computers. With services like Amazon’s EC2 and S3, however, those computers can be rented by the minute for large tasks. Derek Gottfrid of the New York Times used Hadoop and Amazon to create 11 million PDF documents. Combine on-demand computing with a framework to scale applications and you get true utility computing. With Sun, IBM, Savvis and others introducing on-demand offerings, we’ll soon see everyone from enterprises to startups to individual hackers buying computing instead of computers.

At the same time, Software-as-a-Service models are thriving. Companies like RightNow and Taleo replaced enterprise applications with web-based alternatives and took away deployment and management headaches in the process. To stay alive, traditional software companies (think Oracle and Microsoft) need to change their licensing models from per-processor to per-seat or per-task. Once they do this, simple economies of scale dictate that they’ll run these applications in the cloud, on behalf of their clients. And when you’ve got that many users for an application, it’s time to run it as a supercomputing cluster.

Maybe we’ll only need a few big computers, after all. And, of course, billions of portable devices to connect to them.

Interested in Web Infrastructure? Attend our upcoming conference, Structure08 on June 25th in San Francisco

4 Responses to “Supercomputers, Hadoop, MapReduce and the Return to a Few Big Computers”

  1. One reader sent me some comments in private that I thought deserved a public airing.

As he points out:

“Microsoft already has per-seat licensing. In fact, depending on the product, you can get both options. Windows Server is per-seat or per-server. Applications like SQL Server are per-seat (user or device CALs) or per-processor.

The choice becomes interesting when dealing with VMs. In a virtual environment, licenses for apps like SQL Server are based on virtual processors, not physical processors.

E.g.: Imagine a dual quad-core blade running VMware VI3 with 20 VMs, one of which runs SQL Server. The SQL Server VM has been assigned 2 virtual processors (= 2 physical cores), and you have 200 users accessing both servers throughout the day, with an average of 100 concurrent users connected to either SQL Server at any one time (per-seat user CALs are concurrent).

    In the scenario above, with MS licensing, you have these two options:

1) Per-seat user CALs cost $190 each * 100 = $19,000 to handle 100 concurrent users
2) Per-processor MS SQL Standard licenses cost ~$6,000 each * 2 (1 per virtual processor) = $12,000 to handle unlimited users

    Take your pick!

    Per seat is cheaper for a small number of users with relatively low processor usage, but if you expand beyond a couple of hundred users, it may not be the best option. With high processor usage and a small number of SQL servers, things might remain in favor of per-seat…

    Scenario 2:

Picture the same application running on the same dual quad-core blade, but now the SQL Server VM is assigned 4 virtual processors… That’s 4 cores assigned to the same VM. At $6,000 per virtual processor, that’s now $24,000. If you have the same 100 concurrent users accessing the system, this is a much more expensive model, especially if you have multiple SQL Servers that these users access at the same time (the same user CAL can be used to access multiple servers simultaneously, whereas a per-processor license is specific to each SQL installation/processor combination)…

That being said, you could put one super SQL Server on the blade without VI3 and need only 2 per-processor licenses (licensing is per physical processor in this case) at a cost of $12,000. This setup would handle hundreds of concurrent users.

Now factor in VI3 licenses, OS licenses, and any other per-processor or per-user licenses for the software running on this blade.”

    To clarify, in my comment, “traditional software companies (think Oracle and Microsoft) need to change their licensing models from per-processor to per-seat or per-task” I was referring more to application back-end software like databases, message queues, directory servers and the like which are generally billed out based on processor capacity. But his point is well taken — and the complexity that comes with such licensing options is one reason many users want a simpler billing model.
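For readers who want to play with the numbers, the break-even arithmetic in the comment above can be sketched in a few lines. The prices are the figures quoted in the comment ($190 per user CAL, ~$6,000 per virtual processor), not current list prices, and the function names are just for illustration:

```python
# Illustrative break-even sketch for the two licensing scenarios above.
# Prices come from the reader's comment, not a current price list.

CAL_PRICE = 190     # per concurrent user (per-seat user CAL)
PROC_PRICE = 6000   # per (virtual) processor license, unlimited users


def per_seat_cost(concurrent_users):
    """Total cost under per-seat CAL licensing."""
    return CAL_PRICE * concurrent_users


def per_processor_cost(virtual_processors):
    """Total cost under per-processor licensing."""
    return PROC_PRICE * virtual_processors


# Scenario 1: 100 concurrent users, SQL Server VM on 2 virtual processors.
print(per_seat_cost(100))      # 19000 -- per-processor wins here
print(per_processor_cost(2))   # 12000

# Scenario 2: same 100 users, but the VM is assigned 4 virtual processors.
print(per_processor_cost(4))   # 24000 -- now per-seat wins

# Break-even: per-seat stays cheaper while
#   users * $190 < virtual_processors * $6,000,
# i.e. below roughly 31.6 users per licensed virtual processor.
print(round(PROC_PRICE / CAL_PRICE, 1))  # 31.6
```

Which model wins thus depends entirely on the ratio of concurrent users to licensed processors, which is exactly the complexity the comment is describing.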