
Microsoft, Now Loving Hadoop


Last week, OStatic noted the rumor, first reported by VentureBeat, that Microsoft intended to buy Silicon Valley semantic search engine Powerset for $100 million. Lo and behold, Microsoft and Powerset are confirming today that an acquisition agreement has been signed. The terms of the deal have not been disclosed, but the rumored $100 million figure was in line with valuations put on Powerset based on its early financing.

Powerset’s search technology uses the open-source, cluster-based technology Hadoop, which provides fast answers to queries by using the resources of many computers. Hadoop, a project from the Apache Software Foundation, is also behind Yahoo’s search.

Natural language search got a bad rap early in the rise of the web as players such as AskJeeves stumbled, but clustered query technology like Hadoop’s may represent a game-changer. Microsoft, of course, has been desperately trying to catch up in search, where it is a distant third to Google and Yahoo. It won’t be surprising to see large portions of Microsoft’s Live Search start to depend on Powerset, and in so doing, depend on open-source upstart Hadoop.

7 Responses to “Microsoft, Now Loving Hadoop”

  1. Has anyone tried NCache? It is a distributed object and session caching solution for .NET apps. We have been using it for the last 2 years. I would recommend it to anyone interested in distributed caching.

  2. Johnny

    If by “used in production” you mean you can buy it, then no. But it’s used by several teams at Microsoft, specifically, the Live Search team. I learned about it from Greg Linden’s blog. He actually helped design the original recommendation engine at Amazon, and now he works at Microsoft. He speaks frequently about Hadoop on his blog.

    I guess it depends how much your organization values its time. And how much more ground it can cover by paying the Windows “tax”. Dryad combined with LINQ is a powerful value proposition. It lets you do more with less.

    “For production use, the query languages don’t matter, because the developer would rather have control. However, for researchers, it dramatically lowers the cost of arbitrary queries.” So whenever you deal with a database, you forgo “SQL” and use the raw API instead? Didn’t think so. As a software engineer, I prefer to use SQL, even for reports that get run again and again. A query language lets you do more with less.

    I’ve used MySQL and even found a bug or two in it. While I have access to the source code, rather than fixing it myself I submitted a bug report, and I’m going to let one of the MySQL engineers fix it. A one-line change can have rippling effects across the entire code base, so it’s best to let the people who have contextual knowledge fix it. I have also found bugs in Microsoft SQL Server. I don’t have access to the code there, but, just like I did for MySQL, I submitted a bug report. There’s no difference.

    Companies don’t care about “openness”, they care about the bottom line. And if a Microsoft stack makes the number on the bottom line larger – they’re going to use that. It’s simple Econ 101.

  3. Is Dryad actually used in production? It looked like it was only a research project. Hadoop on the other hand is used by many companies in production (

    There are at least 3 query languages built on top of Hadoop. Pig is already an Apache project. Hive, from Facebook, is being added as a Hadoop contrib module ( IBM has also released Jaql. For production use, the query languages don’t matter, because the developer would rather have control. However, for researchers, it dramatically lowers the cost of arbitrary queries.

    With infrastructure, open is good. Porting your applications to Dryad (if you could get it) would lock you in to a lot of Microsoft at a very deep level. Unless you are Microsoft, deploying Windows across a cluster is a huge unnecessary cost. But more importantly, all of the users have control with Hadoop. They can find and fix bugs that are blocking them. That is a critical requirement.

    In terms of Dryad being as good as Hadoop, I don’t have any evidence of that. Hadoop is used to build the Yahoo Search webmap ( Hadoop won the terabyte sort benchmark, that was started by Microsoft (

  4. Johnny Fry

    Microsoft actually already has a massively parallel processing platform, it’s called Dryad. It’s actually superior to Hadoop in that the interaction with it all happens over LINQ, specifically, PLINQ. So as a developer, you just write LINQ queries (probably using special extension methods) and it gets executed on N machines. Which is actually really sweet, because anyone can write LINQ queries now – you just keep writing the same queries you always have using LINQ and they get executed on Dryad.

    This is actually in stark contrast to Hadoop. Facebook had to write their own SQL-like language for it. Let’s see if FB open sources that. My hypothesis? They won’t. It’s way too much of a “secret sauce”.

    As for Velocity, you’re right that it is the Microsoft version of Memcached. As it stands right now, it is pretty much on par with Memcached, but Microsoft has already committed to more features (such as fail-over and searching) than Memcached has.

    Both Dryad and Velocity are as good as, or even better than, their open-source counterparts. Velocity (I suspect) will be free. And Dryad will probably be incorporated into SQL Server. Far from slightly worse.

    You’re right in assuming it will only run on Windows, though. Microsoft has to maintain value on their platform. They will both probably be closed source as well (although anything is possible). But I can’t say I’ve had the need or desire to modify Hadoop’s source.

  5. Or, Microsoft could do what they always do and re-invent a version of Hadoop that is slightly worse than the open version and only works with their operating system.

    If you think this scenario sounds far-fetched, think again — they’re currently doing the same thing with memcache.

  6. Hadoop is low-level infrastructure for data storage and retrieval, something I’d expect Microsoft would already have in place for Live Search and their cloud computing initiatives.

    It could be that Hadoop is vastly better than what they have and they’ll adopt it, but I think it is more likely that they pour Powerset’s secret-sauce over their existing infrastructure.