Depending on how one defines its birth, Hadoop is now 10 years old. In that decade, Hadoop has gone from being the hopeful answer to Yahoo’s search-engine woes to a general-purpose computing platform that’s poised to be the foundation for the next generation of data-based applications.
On its own, Hadoop is a software market that IDC predicts will be worth $813 million in 2016 (although that number is likely very low), but it’s also driving a big data market the research firm predicts will hit more than $23 billion by 2016. Since Cloudera launched in 2008, Hadoop has spawned dozens of startups and spurred hundreds of millions of dollars in venture capital investment.
In this four-part series, we’ll explain everything anyone concerned with information technology needs to know about Hadoop. Part I is the history of Hadoop from the people who willed it into existence and took it mainstream. Part II is more graphic: a map of the now-large and complex ecosystem of companies selling Hadoop products. Part III is a look into the future of Hadoop that should serve as an opening salvo for much of the discussion at our Structure: Data conference March 20-21 in New York. Finally, part IV will highlight some of the best Hadoop applications and seminal moments in Hadoop history, as reported by GigaOM over the years.
[soundcloud url="http://api.soundcloud.com/tracks/80972101%3Fsecret_token%3Ds-RbbVK" params="" width="100%" height="166" iframe="true" /]
Wanted: A better search engine
Almost everywhere you go online now, Hadoop is there in some capacity. Facebook, eBay, Etsy, Yelp, Twitter, Salesforce.com — you name a popular web site or service, and the chances are it’s using Hadoop to analyze the mountains of data it’s generating about user behavior and even its own operations. Even in the physical world, forward-thinking companies in fields ranging from entertainment to energy management to satellite imagery are using Hadoop to analyze the unique types of data they’re collecting and generating.
Everyone involved with information technology at least knows what it is. Hadoop even serves as the foundation for new-school graph and NoSQL databases, as well as bigger, badder versions of relational databases that have been around for decades.
But it wasn’t always this way, and today’s uses are a long way off from the original vision of what Hadoop could be.
When the seeds of Hadoop were first planted in 2002, the world just wanted a better open-source search engine. So then-Internet Archive search director Doug Cutting and University of Washington graduate student Mike Cafarella set out to build it. They called their project Nutch, and it was designed with that era’s web in mind.
Looking back on it today, early iterations of Nutch were kind of laughable. About a year into their work on it, Cutting and Cafarella thought things were going pretty well because Nutch was already able to crawl and index hundreds of millions of pages. “At the time, when we started, we were sort of thinking that a web search engine was around a billion pages,” Cutting explained to me, “so we were getting up there.”
There are now about 700 million web sites and, according to Wired’s Kevin Kelly, well over a trillion web pages.
But getting Nutch to work wasn’t easy. It could only run across a handful of machines, and someone had to watch it around the clock to make sure it didn’t fall down.
“I remember working on it for several months, being quite proud of what we had been doing, and then the Google File System paper came out and I realized ‘Oh, that’s a much better way of doing it. We should do it that way,'” reminisced Cafarella. “Then, by the time we had a first working version, the MapReduce paper came out and that seemed like a pretty good idea, too.”
“What they spent a lot of time doing was generalizing this into a framework that automated all these steps that we were doing manually,” Cutting explained.
[soundcloud url="http://api.soundcloud.com/tracks/80972106%3Fsecret_token%3Ds-gmRg8" params="" width="100%" height="166" iframe="true" /]
Raymie Stata, founder and CEO of Hadoop startup VertiCloud (and former Yahoo CTO), calls MapReduce “a fantastic kind of abstraction” over the distributed computing methods and algorithms most search companies were already using:
“Everyone had something that pretty much was like MapReduce because we were all solving the same problems. We were trying to handle literally billions of web pages on machines that are probably, if you go back and check, epsilon more powerful than today’s cell phones. … So there was no option but to latch hundreds to thousands of machines together to build the index. So it was out of desperation that MapReduce was invented.”
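To make the abstraction concrete, here’s a toy sketch of the word-count pattern MapReduce generalized — the programmer writes only a map step and a reduce step, while the framework handles distributing the work and grouping intermediate results. This is an illustration, not code from Nutch or Hadoop; all function and variable names here are invented:

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs, as a mapper would on each machine."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Group values by key -- the step a MapReduce framework
    automates across a cluster instead of on one machine."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word, as reducers would in parallel."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["the web is big", "the web is the web"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # word frequencies across all documents
```

Everything here runs on one machine, of course; the insight of the Google papers was that the same three-phase shape lets a framework spread the map and reduce steps across thousands of machines, with failures and retries handled automatically.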
Over the course of a few months, Cutting and Cafarella built up the underlying file systems and processing framework that would become Hadoop (in Java, notably, whereas Google’s MapReduce used C++) and ported Nutch on top of it. Now, instead of having one guy watch a handful of machines all day long, Cutting explained, they could just set it running on between 20 and 40 machines that he and Cafarella were able to scrape together from their employers.
[soundcloud url="http://api.soundcloud.com/tracks/80972114%3Fsecret_token%3Ds-yCIvx" params="" width="100%" height="166" iframe="true" /]
Bringing Hadoop to life (but not in search)
Anyone vaguely familiar with the history of Hadoop can guess what happens next: In 2006, Cutting went to work for Yahoo, which was equally impressed by the Google File System and MapReduce papers and wanted to build open source technologies based on them. They spun out the storage and processing parts of Nutch to form Hadoop (named after Cutting’s son’s stuffed elephant) as an open-source Apache Software Foundation project, while the Nutch web crawler remained its own separate project.
“This seemed like a perfect fit because I was looking for more people to work on it, and people who had thousands of computers to run it on,” Cutting said.
Cafarella, now an associate professor at the University of Michigan, opted to forgo a career in corporate IT and focus on his education. He’s happy as a professor — and currently working on a Hadoop-complementary project called RecordBreaker — but, he joked, “My dad calls me the Pete Best of the big data world.”
Ironically, though, the 2006-era Hadoop was nowhere near ready to handle production search workloads at webscale — the very task it was created to do. “The thing you gotta remember,” explained Hortonworks Co-founder and CEO Eric Baldeschwieler (who was previously VP of Hadoop software development at Yahoo), “is at the time we started adopting it, the aspiration was definitely to rebuild Yahoo’s web search infrastructure, but Hadoop only really worked on 5 to 20 nodes at that point, and it wasn’t very performant, either.”
Stata recalls a “slow march” of horizontal scalability, growing Hadoop’s capabilities from the single digits of nodes into the tens of nodes and ultimately into the thousands. “It was just an ongoing slog … every factor of 2 or 1.5 even was serious engineering work,” he said. But Yahoo was determined to scale Hadoop as far as it needed to go, and it continued investing heavy resources into the project.
It actually took years for Yahoo to move its web index onto Hadoop, but in the meantime the company made what would be a fortuitous decision to set up what it called a “research grid” for the company’s data scientists, to use today’s parlance. It started with dozens of nodes and ultimately grew to hundreds as they added more and more data and Hadoop’s technology matured. What began life as a proof of concept fast became a whole lot more.
“This very quickly kind of exploded and became our core mission,” Baldeschwieler said, “because what happened is the data scientists not only got interesting research results — what we had anticipated — but they also prototyped new applications and demonstrated that those applications could substantially improve Yahoo’s search relevance or Yahoo’s advertising revenue.”
Shortly thereafter, Yahoo began rolling out Hadoop to power analytics for various production applications. Eventually, Stata explained, Hadoop had proven so effective that Yahoo merged its search and advertising into one unit so that Yahoo’s bread-and-butter sponsored search business could benefit from the new technology.
And that’s exactly what happened, though production use brought new demands: data scientists didn’t need things like service-level agreements, but business leaders did, so, Stata said, Yahoo implemented some scheduling changes within Hadoop. And although data scientists didn’t need security, Securities and Exchange Commission requirements mandated a certain level of it once Yahoo moved its sponsored search data onto the platform.
“That drove a certain level of maturity,” Stata said. “… We ran all the money in Yahoo through it, eventually.”
Hadoop’s transformation into the technology “behind every click” (or every batch process, technically) at Yahoo was pretty much complete by 2008, Baldeschwieler said. That meant doing everything from these line-of-business applications to spam filtering to personalized display decisions on the Yahoo front page. By the time Yahoo spun out Hortonworks into a separate, Hadoop-focused software company in 2011, Yahoo’s Hadoop infrastructure consisted of 42,000 nodes and hundreds of petabytes of storage.
[soundcloud url="http://api.soundcloud.com/tracks/80972099%3Fsecret_token%3Ds-g7Wo5" params="" width="100%" height="166" iframe="true" /]
From the classroom …
However, although Yahoo was responsible for the vast majority of development during its formative years, Hadoop didn’t exist in a bubble inside Yahoo’s headquarters. It was a full-on Apache project that attracted users and contributors from around the world — contributors like Tom White, a Welshman who wrote O’Reilly Media’s book Hadoop: The Definitive Guide despite being, in Cutting’s description, just a guy who liked software and played with Hadoop at night.
Up in Seattle in 2006, a young Google engineer named Christophe Bisciglia was using his 20 percent time to teach a computer science course at the University of Washington. Google wanted to hire new employees with experience working on webscale data, but its MapReduce code was proprietary, so it bought a rack of servers and used Hadoop as a proxy.