Hadoop: Supercomputing for the Rest of Us

Over on the OStatic blog, which is dedicated to open source technology, Reuven Lerner has a post on Hadoop that may be of interest to many web workers. If you’re unfamiliar with Hadoop, it’s an open source implementation of Google’s all-important MapReduce technology, which takes advantage of huge clusters of computers to produce fast results for queries and much more.

Sponsored by the Apache Software Foundation, Hadoop lets you specify that one or more computers are ready and willing to perform a processing operation. Drafting many computers into service at once has huge implications. Among several possible applications, it could allow for much more efficient web crawling, and faster trawling of large data sets such as customer databases.

You’ve probably tuned into the buzz surrounding “cloud computing.” Hadoop is built on the idea that a complex job can be split apart automatically, with the resulting tasks handed out to thousands of available computers. If that reminds you of how the SETI@home project worked, the two are indeed somewhat similar.
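To make the task-splitting idea concrete, here is a toy, single-machine sketch of the map/reduce model using a word count, the classic example. This is not Hadoop’s API; in a real Hadoop job, the map calls would run on many machines at once and the framework would handle the shuffle between phases.

```python
from collections import defaultdict

def map_phase(document):
    # Mapper: emit a (word, 1) pair for every word.
    # Each document can be mapped independently, on a separate machine.
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(key, values):
    # Reducer: combine all the counts collected for one word.
    # Each key can likewise be reduced independently.
    return key, sum(values)

def run_mapreduce(documents):
    # Shuffle: group intermediate values by key (Hadoop does this for you).
    intermediate = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            intermediate[key].append(value)
    return dict(reduce_phase(k, v) for k, v in intermediate.items())

docs = ["the web is big", "the web grows"]
print(run_mapreduce(docs))
# → {'the': 2, 'web': 2, 'is': 1, 'big': 1, 'grows': 1}
```

Because neither the mappers nor the reducers share any state, the framework is free to scatter them across however many computers happen to be available.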

In the comments to Reuven’s Hadoop post, one reader says: “Web crawls – break out your target list of sites to crawl and have the spiders go nuts.” Another says: “Collaborative filtering – you need to mine through reams of customer data and build data structures that can be queried later. A lot of the crunching can be parallelized.” You might also like to check out the innovative Hadoop application the New York Times came up with.

Sooner or later, the power of this kind of distributed computing is going to need to be freely and easily available to everyone who works on the web. That is inevitable, because leveraging supercomputer-level processing power will become necessary to stay competitive.

Do you think thousands of computers working in parallel on our tasks will change the way we work?
