<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>GigaOM &#187; mapreduce</title>
	<atom:link href="http://gigaom.com/tag/mapreduce/feed/" rel="self" type="application/rss+xml" />
	<link>http://gigaom.com</link>
	<description></description>
	<lastBuildDate>Wed, 22 May 2013 12:19:06 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='gigaom.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://0.gravatar.com/blavatar/0db8f6557d022075dbbf010c54d46d93?s=96&#038;d=http%3A%2F%2Fs2.wp.com%2Fi%2Fbuttonw-com.png</url>
		<title>GigaOM &#187; mapreduce</title>
		<link>http://gigaom.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://gigaom.com/osd.xml" title="GigaOM" />
	<atom:link rel='hub' href='http://gigaom.com/?pushpress=hub'/>
		<item>
		<title>Google donates patents to protect cloud software from lawsuits</title>
		<link>http://gigaom.com/2013/03/28/google-donates-patents-to-protect-cloud-software-from-lawsuits/</link>
		<comments>http://gigaom.com/2013/03/28/google-donates-patents-to-protect-cloud-software-from-lawsuits/#comments</comments>
		<pubDate>Thu, 28 Mar 2013 14:06:52 +0000</pubDate>
		<dc:creator>Jeff John Roberts</dc:creator>
				<category><![CDATA[Google]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[Patent pledge]]></category>
		<category><![CDATA[patents]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=625211</guid>
		<description><![CDATA[Google announced a "patent pledge" in which it will donate 10 patents related to MapReduce to protect the emerging cloud and big data industry from lawsuits.]]></description>
				<content:encoded><![CDATA[<p>Google unveiled a &#8220;patent pledge&#8221; that it hopes will shield cloud software and big data developers from the type of litigation that has engulfed the mobile phone industry. The pledge, which is like a non-aggression pact, covers ten patents related to Google&#8217;s MapReduce technology.</p>
<p><a href="http://www.google.com/patents/opnpledge/pledge/">The pledge</a>, which Google <a href="http://google-opensource.blogspot.com/2013/03/taking-stand-on-open-source-and-patents.html">announced</a> on Thursday, says that developers are free to use or sell the technology described in the patents without fear of future lawsuits. The shield applies, however, only to projects based on open source software that is available to all.</p>
<p>Google&#8217;s patent pledge appears intended to complement the open-source software licenses that allow programmers to build on each other&#8217;s work. Such licenses, like the GNU General Public License, grant anyone the right under copyright law to use designated blocks of software code; these rights, however, can be undercut by competing patent rights.</p>
<p>Among the ten patents in Google&#8217;s pledge is a <a href="http://arstechnica.com/information-technology/2010/01/googles-mapreduce-patent-what-does-it-mean-for-hadoop/">controversial one issued in 2010</a> that covers a form of parallel processing known as MapReduce. The patent gave rise to fears that Google would be able to monopolize tools <a href="http://gigaom.com/2012/02/06/what-it-really-means-when-someone-says-hadoop/">like Hadoop</a>, which is an integral part of the so-called &#8220;big data&#8221; revolution that is fueling a wide range of new products and services. Google&#8217;s pledge appears intended to allay that fear.</p>
<p>In a phone interview, a person at Google familiar with the project explained that the MapReduce patent pledge is intended to help the emerging big data and cloud software industry avoid a litigation train-wreck like the one that befell the mobile industry. (In recent years, an arms race of patents covering smartphones has led to a relentless series of global lawsuits which have limited the spread of software technology and increased prices for consumers.)</p>
<p>Google suggests it will add other patents to its non-aggression pool and is inviting others to do the same. In theory, this will lead to an open and expanding workshop of tools for cloud developers; however, there is no guarantee it will work out this way.</p>
<p>One problem is that the pledge will have little effect against patent trolls <a href="http://gigaom.com/2012/08/11/malaria-is-no-excuse-for-patent-trolling-mr-myhrvold/">like Intellectual Ventures</a>, which buy up old patents and use them to file lawsuits against productive companies. The trolls are largely immune from retaliation because they operate through shell companies and don&#8217;t actually make any products that can be the subject of a counter-suit.</p>
<p>The Google source said the pledge may not be effective against trolls but that it may curtail the practice of &#8220;privateering&#8221; &#8212; where major companies give patents to trolls in order to harass rivals or in return for a cut of the proceeds the trolls obtain. This person said that, under the terms of the pledge, Google reserves the right to sue anyone who financially benefits from such lawsuits.</p>
<p>There is also the question of whether the Google pledge is legally enforceable. Typically, promises to the world at large don&#8217;t carry any legal force because they lack what lawyers call &#8220;consideration.&#8221; The Google source, however, said those who rely on the pledge could likely prevent Google from going back on the pledge through a doctrine called &#8220;promissory estoppel.&#8221;</p>
<p>In the bigger picture, the Google patent pledge represents part of a growing effort among Silicon Valley companies to rein in a patent system that many believe has become over-extended. Twitter, for instance, last year introduced an <a href="http://paidcontent.org/2012/04/17/twitter-promotes-patent-peace-with-innovators-agreement/">employment contract that promises its engineers</a> that their inventions won&#8217;t be used to fuel the patent wars.</p>
<p><em style="font-size:13px;line-height:19px;">(Image by <a id="portfolio_link" href="http://www.shutterstock.com/gallery-172762p1.html">alphaspirit</a> via Shutterstock)</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/03/28/google-donates-patents-to-protect-cloud-software-from-lawsuits/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/03/shutterstock_125562701.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/03/shutterstock_125562701.jpg?w=150" medium="image">
			<media:title type="html">Shield, umbrella, protect</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/05dfcf765f1554b08954bb9e1ee63363?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">jeffjohnroberts</media:title>
		</media:content>
	</item>
		<item>
		<title>The history of Hadoop: From 4 nodes to the future of data</title>
		<link>http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/</link>
		<comments>http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/#comments</comments>
		<pubDate>Mon, 04 Mar 2013 13:00:43 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[big data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[VertiCloud]]></category>
		<category><![CDATA[WibiData]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=613362</guid>
		<description><![CDATA[In the first of our four-part multi-media series on Hadoop, the people who helped build Hadoop talk about its birth, its promise and the challenges in moving it from webscale to just large-scale.]]></description>
				<content:encoded><![CDATA[<p>Depending on how one defines its birth, <a href="http://hadoop.apache.org/">Hadoop</a> is now 10 years old. In that decade, Hadoop has gone from being the hopeful answer to Yahoo’s search-engine woes to a general-purpose computing platform that’s poised to be the foundation for the next generation of data-based applications.</p>
<p>Alone, Hadoop is a software market that IDC <a href="http://gigaom.com/2012/05/07/all-aboard-the-hadoop-money-train/">predicts will be worth $813 million</a> in 2016 (although that number is likely very low), but it’s also driving a big data market the research firm <a href="http://gigaom.com/2013/01/08/idc-says-big-data-will-be-24b-market-in-2016-i-say-its-bigger/">predicts will hit more than $23 billion</a> by 2016. Since Cloudera launched in 2008, Hadoop has spawned dozens of startups and <a href="http://gigaom.com/2012/11/09/a-few-stats-rumors-and-stories-on-on-hadoops-rapid-growth/">spurred hundreds of millions in venture capital investment</a>.</p>
<p>In this four-part series, we’ll explain everything anyone concerned with information technology needs to know about Hadoop. Part I is the history of Hadoop from the people who willed it into existence and took it mainstream. Part II is more graphic: <a href="http://gigaom.com/2013/03/05/the-hadoop-ecosystem-the-welcome-elephant-in-the-room-infographic/">a map of the now-large and complex ecosystem</a> of companies selling Hadoop products. <a href="http://gigaom.com/2013/03/07/5-reasons-why-the-future-of-hadoop-is-real-time-relatively-speaking/">Part III is a look into the future of Hadoop</a> that should serve as an opening salvo for much of the discussion <a href="http://event.gigaom.com/structuredata/?utm_source=data&amp;utm_medium=editorial&amp;utm_campaign=intext&amp;utm_term=613362+the-history-of-hadoop-from-4-nodes-to-the-future-of-data&amp;utm_content=dharrisstructure">at our Structure: Data conference</a> March 20-21 in New York. Finally, <a href="http://gigaom.com/2013/03/08/hadoop-through-the-years-a-gigaom-retrospective/">part IV will highlight some of the best Hadoop applications and seminal moments in Hadoop history</a>, as reported by GigaOM over the years.</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="http://w.soundcloud.com/player?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F80972101%253Fsecret_token%253Ds-RbbVK"></iframe>
<h2 id="wanted-a-better-search-engine">Wanted: A better search engine</h2>
<p>Almost everywhere you go online now, Hadoop is there in some capacity. <a href="http://gigaom.com/2012/06/13/how-facebook-keeps-100-petabytes-of-hadoop-data-online/">Facebook</a>, <a href="http://gigaom.com/2012/01/31/under-the-covers-of-ebays-big-data-operation/">eBay</a>, <a href="http://gigaom.com/2011/11/02/how-etsy-handcrafted-a-big-data-strategy/">Etsy</a>, <a href="http://gigaom.com/2012/12/02/pinterest-flipboard-and-yelp-tell-how-to-save-big-bucks-in-the-cloud/">Yelp</a>, <a href="http://gigaom.com/2012/03/07/how-twitter-is-doing-its-part-to-democratize-big-data/">Twitter</a>, <a href="http://gigaom.com/2012/09/17/5-ideas-to-help-everyone-make-the-most-of-big-data/">Salesforce.com</a> — you name a popular web site or service, and the chances are it’s using Hadoop to analyze the mountains of data it’s generating about user behavior and even its own operations. Even in the physical world, forward-thinking companies in fields ranging from <a href="http://gigaom.com/2012/09/16/how-disney-built-a-big-data-platform-on-a-startup-budget/">entertainment</a> to <a href="http://gigaom.com/2012/10/11/the-rent-is-too-damn-high-but-big-data-means-the-power-bill-isnt/">energy management</a> to <a href="http://gigaom.com/2012/04/17/satellite-imagery-and-hadoop-mean-70m-for-skybox/">satellite imagery</a> are using Hadoop to analyze the unique types of data they’re collecting and generating.</p>
<p>Everyone involved with information technology at least knows what it is. Hadoop even serves as the foundation for new-school <a href="http://incubator.apache.org/giraph/">graph</a> and <a href="http://hbase.apache.org/">NoSQL databases</a>, as well as <a href="http://gigaom.com/2012/07/24/how-one-startup-wants-to-inject-hadoop-into-your-sql/">bigger, badder versions of relational databases</a> that have been around for decades.</p>
<p>But it wasn’t always this way, and today’s uses are a long way off from the original vision of what Hadoop could be.</p>
<div id="attachment_616209" class="wp-caption alignleft" style="width: 210px"><img alt="Doug Cutting" src="http://gigaom2.files.wordpress.com/2013/03/cutting.jpg?w=708"   class="size-full wp-image-616209"><p class="wp-caption-text">Doug Cutting</p></div>
<p>When the seeds of Hadoop were first planted in 2002, the world just wanted a better open-source search engine. So then-Internet Archive search director Doug Cutting and University of Washington graduate student Mike Cafarella set out to build it. They called their project <a href="http://nutch.apache.org/">Nutch</a> and it was designed with that era’s web in mind.</p>
<p>Looking back on it today, early iterations of Nutch were kind of laughable. About a year into their work on it, Cutting and Cafarella thought things were going pretty well because Nutch was already able to crawl and index hundreds of millions of pages. “At the time, when we started, we were sort of thinking that a web search engine was around a billion pages,” Cutting explained to me, “so we were getting up there.”</p>
<p>There are now about 700 million web sites and, <a href="http://articles.cnn.com/2011-09-12/tech/web.index_1_internet-neurons-human-brain?_s=PM%3ATECH">according to Wired’s Kevin Kelly</a>, well over a trillion web pages.</p>
<p>But getting Nutch to work wasn’t easy. It could only run across a handful of machines, and someone had to watch it around the clock to make sure it didn’t fall down.</p>
<div id="attachment_616210" class="wp-caption alignright" style="width: 251px"><img alt="Mike Cafarella" src="http://gigaom2.files.wordpress.com/2013/03/cafarella241.jpg?w=708"   class="size-full wp-image-616210"><p class="wp-caption-text">Mike Cafarella</p></div>
<p>“I remember working on it for several months, being quite proud of what we had been doing, and then the Google File System paper came out and I realized ‘Oh, that’s a much better way of doing it. We should do it that way,’” reminisced Cafarella. “Then, by the time we had a first working version, the MapReduce paper came out and that seemed like a pretty good idea, too.”</p>
<p>Google released the <a href="http://research.google.com/archive/gfs.html">Google File System paper</a> in October 2003 and the <a href="http://research.google.com/archive/mapreduce.html">MapReduce paper</a> in December 2004. The latter would prove especially revelatory to the two engineers building Nutch.</p>
<p>“What they spent a lot of time doing was generalizing this into a framework that automated all these steps that we were doing manually,” Cutting explained.</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="http://w.soundcloud.com/player?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F80972106%253Fsecret_token%253Ds-gmRg8"></iframe>
<p>Raymie Stata, founder and CEO of Hadoop startup <a href="http://verticloud.com/">VertiCloud</a> (and former Yahoo CTO), calls MapReduce “a fantastic kind of abstraction” over the distributed computing methods and algorithms most search companies were already using:</p>
<blockquote id="quote-everyone-had-somethi"><p>“Everyone had something that pretty much was like MapReduce because we were all solving the same problems. We were trying to handle literally billions of web pages on machines that are probably, if you go back and check, epsilon more powerful than today’s cell phones. … So there was no option but to latch hundreds to thousands of machines together to build the index. So it was out of desperation that MapReduce was invented.”</p></blockquote>
<div id="attachment_616201" class="wp-caption aligncenter" style="width: 718px"><img alt="MapReduce diagram, from the Google paper" src="http://gigaom2.files.wordpress.com/2013/03/index-auto-0008-0001.gif?w=708&#038;h=489" width="708" height="489" class="size-large wp-image-616201"><p class="wp-caption-text">Parallel processing in MapReduce, from the Google paper</p></div>
<p>Over the course of a few months, Cutting and Cafarella built up the underlying file systems and processing framework that would become Hadoop (in Java, notably, whereas Google’s MapReduce used C++) and ported Nutch on top of it. Now, instead of having one guy watch a handful of machines all day long, Cutting explained, they could just set it running on between 20 and 40 machines that he and Cafarella were able to scrape together from their employers.</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="http://w.soundcloud.com/player?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F80972114%253Fsecret_token%253Ds-yCIvx"></iframe>
<h2 id="bringing-hadoop-to-life-but-no">Bringing Hadoop to life (but not in search)</h2>
<p>Anyone vaguely familiar with the history of Hadoop can guess what happens next: In 2006, Cutting went to work with Yahoo, which was equally impressed by the Google File System and MapReduce papers and wanted to build open source technologies based on them. They spun out the storage and processing parts of Nutch to form Hadoop (named after Cutting’s son’s stuffed elephant) as an open-source Apache Software Foundation project and the Nutch web crawler remained its own separate project.</p>
<p>“This seemed like a perfect fit because I was looking for more people to work on it, and people who had thousands of computers to run it on,” Cutting said.</p>
<p>Cafarella, now <a href="http://web.eecs.umich.edu/~michjc/bio.html">an associate professor at the University of Michigan</a>, opted to forgo a career in corporate IT and focus on his education. He’s happy as a professor — and currently working on a Hadoop-complementary project called <a href="http://cloudera.github.com/RecordBreaker/">RecordBreaker</a> — but, he joked, “My dad calls me the Pete Best of the big data world.”</p>
<p>Ironically, though, the 2006-era Hadoop was nowhere near ready to handle production search workloads at webscale — the very task it was created to do. “The thing you gotta remember,” explained Hortonworks Co-founder and CEO Eric Baldeschwieler (who was previously VP of Hadoop software development at Yahoo), “is at the time we started adopting it, the aspiration was definitely to rebuild Yahoo’s web search infrastructure, but Hadoop only really worked on 5 to 20 nodes at that point, and it wasn’t very performant, either.”</p>
<div id="attachment_616234" class="wp-caption aligncenter" style="width: 718px"><a href="http://www.flickr.com/photos/yodelanecdotal/4746014041/sizes/l/in/photostream/"><img alt="Baldeschwieler at Hadoop Summit 2010. Source: Yodel Anecdotal" src="http://gigaom2.files.wordpress.com/2013/03/4746014041_7a80b97c2e_b.jpg?w=708&#038;h=472" width="708" height="472" class="size-large wp-image-616234"></a><p class="wp-caption-text">Baldeschwieler at Hadoop Summit 2010. Source: Yodel Anecdotal</p></div>
<p>Stata recalls a “slow march” of horizontal scalability, growing Hadoop’s capabilities from the single digits of nodes into the tens of nodes and ultimately into the thousands. “It was just an ongoing slog … every factor of 2 or 1.5 even was serious engineering work,” he said. But Yahoo was determined to scale Hadoop as far as it needed to go, and it continued investing heavy resources into the project.</p>
<p>It actually took years for Yahoo to move its web index onto Hadoop, but in the meantime the company made what would be a fortuitous decision to set up what it called a “research grid” for the company’s data scientists, to use today’s parlance. It started with dozens of nodes and ultimately grew to hundreds as they added more and more data and Hadoop’s technology matured. What began life as a proof of concept fast became a whole lot more.</p>
<p>“This very quickly kind of exploded and became our core mission,” Baldeschwieler said, “because what happened is the data scientists not only got interesting research results — what we had anticipated — but they also prototyped new applications and demonstrated that those applications could substantially improve Yahoo’s search relevance or Yahoo’s advertising revenue.”</p>
<p>Shortly thereafter, Yahoo began rolling out Hadoop to power analytics for various production applications. Eventually, Stata explained, Hadoop had proven so effective that Yahoo merged its search and advertising into one unit so that Yahoo’s bread-and-butter sponsored search business could benefit from the new technology.</p>
<div id="attachment_616207" class="wp-caption aligncenter" style="width: 718px"><a href="http://www.flickr.com/photos/joeywan/2467450286/"><img alt="Cutting (center) flanked by Baldeschwieler and Om Malik at GigaOM's Hadoop Meetup in 2008." src="http://gigaom2.files.wordpress.com/2013/03/2467450286_db547ef9ef_b.jpg?w=708&#038;h=365" width="708" height="365" class="size-large wp-image-616207"></a><p class="wp-caption-text">Cutting (center) flanked by Baldeschwieler and Om Malik at GigaOM’s Hadoop Meetup in 2008.</p></div>
<p>And <a href="http://gigaom.com/2010/06/29/yahoo-secures-and-tames-hadoop-with-new-tools/">that’s exactly what happened</a>, because although data scientists didn’t need things like service-level agreements, business leaders did. So, Stata said, Yahoo implemented some scheduling changes within Hadoop. And although data scientists didn’t need security, Securities and Exchange Commission requirements mandated a certain level of security when Yahoo moved its sponsored search data onto it.</p>
<p>“That drove a certain level of maturity,” Stata said. “… We ran all the money in Yahoo through it, eventually.”</p>
<p>The transformation into Hadoop being “behind every click” (or every batch process, technically) at Yahoo was pretty much complete by 2008, Baldeschwieler said. That meant doing everything from these line-of-business applications to spam filtering to personalized display decisions on the Yahoo front page. By the time Yahoo spun out Hortonworks into a separate, Hadoop-focused software company in 2011, Yahoo’s Hadoop infrastructure consisted of 42,000 nodes and hundreds of petabytes of storage.</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="http://w.soundcloud.com/player?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F80972099%253Fsecret_token%253Ds-g7Wo5"></iframe>
<h2 id="from-the-classroom">From the classroom …</h2>
<p>However, although Yahoo was responsible for the vast majority of development during its formative years, Hadoop didn’t exist in a bubble inside Yahoo’s headquarters. It was a full-on Apache project that attracted users and contributors from around the world. Guys like Tom White, a Welshman who actually wrote O’Reilly Media’s book <i>Hadoop: The Definitive Guide</i> despite being what Cutting describes as a guy who just liked software and played with Hadoop at night.</p>
<p>Up in Seattle in 2006, a young Google engineer named Christophe Bisciglia was using his 20 percent time to teach a computer science course at the University of Washington. Google wanted to hire new employees with experience working on webscale data, but its MapReduce code was proprietary, so it bought a rack of servers and used Hadoop as a proxy.</p>
<p><a href="http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/2/">Go to page 2 (of 2) on GigaOM .</a></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/03/gigaom-hadoop-icon-final.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/03/gigaom-hadoop-icon-final.jpg?w=150" medium="image">
			<media:title type="html">gigaom hadoop icon final</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/03/cutting.jpg" medium="image">
			<media:title type="html">Doug Cutting</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/03/cafarella241.jpg" medium="image">
			<media:title type="html">Mike Cafarella</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/03/index-auto-0008-0001.gif?w=708" medium="image">
			<media:title type="html">MapReduce diagram, from the Google paper</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/03/4746014041_7a80b97c2e_b.jpg?w=708" medium="image">
			<media:title type="html">Baldeschwieler at Hadoop Summit 2010. Source: Yodel Anecdotal</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/03/2467450286_db547ef9ef_b.jpg?w=708" medium="image">
			<media:title type="html">Cutting (center) flanked by Baldeschwieler and Om Malik at GigaOM&#039;s Hadoop Meetup in 2008.</media:title>
		</media:content>
	</item>
		<item>
		<title>SQL is what&#8217;s next for Hadoop: Here&#8217;s who&#8217;s doing it</title>
		<link>http://gigaom.com/2013/02/21/sql-is-whats-next-for-hadoop-heres-whos-doing-it/</link>
		<comments>http://gigaom.com/2013/02/21/sql-is-whats-next-for-hadoop-heres-whos-doing-it/#comments</comments>
		<pubDate>Thu, 21 Feb 2013 18:29:31 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[analytics]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[data warehouse]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=612289</guid>
		<description><![CDATA[More and more companies and open source projects are trying to let users run SQL queries from inside Hadoop itself. Here's a list of what's available and, on a high level, how they work.]]></description>
				<content:encoded><![CDATA[<p>When we first began putting together the schedule for <a href="http://event.gigaom.com/structuredata?utm_source=data&amp;utm_medium=editorial&amp;utm_content=dharrisstructure&amp;utm_campaign=intext&amp;utm_term=612289+sql-is-whats-next-for-hadoop-heres-whos-doing-it">Structure: Data</a> several months ago, we knew that running SQL queries on Hadoop would be a big deal — we just didn’t know how big a deal it would actually become. Fast-forward to today, a mere month away from the event (March 20-21 in New York), and the writing on the wall is a lot clearer. SQL support isn’t the end-game for Hadoop, but it’s the feature that will help Hadoop find its way into more places in more companies that understand the importance of next-generation analytics but don’t want to (or can’t yet) re-invent the wheel by becoming MapReduce experts.</p>
<p>In fact, there are now so many products and projects pushing SQL queries and interactive data analysis on Hadoop — including two more announced this week — that it’s getting hard to keep track. But I’ll do my best.</p>
<p>Of course, Facebook began this whole movement to bring SQL database-like functionality to Hadoop when it created Hive in 2009. Hive, <a href="http://hive.apache.org/">now an Apache project</a>, includes a data-management layer and SQL-like query language called HiveQL. It has proven rather useful and popular over the years, but Hive’s reliance on MapReduce makes it somewhat slow by nature — MapReduce scans the entire data set and moves a lot of data over the network while processing a job — and there hasn’t been much effort to package it in a manner that might attract mainstream users.</p>
<p>And keep in mind that these next-generation SQL-on-Hadoop tools aren’t just business intelligence or database products that can access data stored in Hadoop; EMC Greenplum, HP Vertica, IBM Netezza, ParAccel, Microsoft SQL Server and Teradata/Aster Data (which this week <a href="http://www.asterdata.com/news/teradata-aster-discovery-platform-offers-powerful-data-science-solution.php">released some cool new features</a> for just this purpose) all allow some sort of access to Hadoop data. Rather, these are applications, frameworks and engines that let users query Hadoop data from inside Hadoop, sometimes by re-architecting the underlying compute and data infrastructures. The beauty of this approach is that data is usable in its existing form and, in theory, doesn’t require two separate data stores for analytic applications.</p>
<h2 id="data-warehouses-and-bi-the-str">Data warehouses and BI: The Structure: Data set</h2>
<p><a href="http://structuredata2013-editgraphic.eventbrite.com/"><img alt="Structure:Data: Put data to work. 60+ big data experts speaking. March 20-21, 2013, New York City. Register now." src="http://gigaom2.files.wordpress.com/2013/02/structure-data_in-article-banners_300x2001.png?w=708"   class="alignleft size-full wp-image-610577"></a>I’m highlighting this group of companies first, not because I think they’re the best (although they might well be), but because I’m truly excited about the panel they’ll be featured on at our conference next month. The panel is moderated by Facebook engineering manager Ravi Murthy, a guy who knows his way around a database, so they’ll have to answer some tough questions from one of the most advanced and most aggressive users of Hadoop and analytics tools out there:</p>
<p><strong><a href="http://incubator.apache.org/drill/">Apache Drill</a>: </strong>Drill is a MapR-led effort to create a Google Dremel-like (or BigQuery-like) interactive query engine on top of Hadoop. First <a href="http://gigaom.com/2012/08/17/for-fast-interactive-hadoop-queries-drill-may-be-the-answer/">announced in August 2012</a>, the project is still under development and in the incubator program within Apache. According to its web site, “One explicitly stated design goal is that Drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.”</p>
<p><strong><a href="http://hadapt.com">Hadapt</a>:</strong> Hadapt, which actually <a href="http://gigaom.com/2011/03/23/making-hadoop-work-in-more-places-with-hadapt/">launched at Structure: Data in 2011</a>, was the first of the SQL-on-Hadoop vendors and stands out in that it has a real product on the market and real users in production. Its architecture includes tools for advanced SQL functions, a split-execution engine for MapReduce and relational tasks, and both HDFS and relational storage. In October 2012, the company <a href="http://gigaom.com/2012/10/16/hadapt-does-big-love-for-big-data-and-hints-at-hadoops-future/">announced a tight integration with Tableau Software</a> around advanced visual analytics.</p>
<p><a href="http://gigaom2.files.wordpress.com/2013/02/had_graphic2-scaled.jpg"><img alt="HAD_Graphic2-scaled" src="http://gigaom2.files.wordpress.com/2013/02/had_graphic2-scaled.jpg?w=708"   class="aligncenter size-full wp-image-612351"></a></p>
<p><strong><a href="http://gigaom2.files.wordpress.com/2013/02/platforaarch.jpg"><img alt="platforaarch" src="http://gigaom2.files.wordpress.com/2013/02/platforaarch.jpg?w=92&#038;h=150" width="92" height="150" class="alignright size-thumbnail wp-image-612755"></a><a href="http://platfora.com">Platfora</a>: </strong>Technically not a SQL product, Platfora is red-hot right now and is trying to re-imagine the world of business intelligence for a big data world. Essentially an HTML5 canvas laid atop Hadoop and an in-memory, massively parallel processing engine, the company’s software, which <a href="http://gigaom.com/2012/10/23/platfora-shows-a-whole-new-way-to-do-business-intelligence-on-big-data/">it unveiled in October</a>, is designed to make analyzing data stored in Hadoop a fast and visually intuitive process.</p>
<p><strong><a href="http://www.qubole.com">Qubole</a>:</strong> Qubole is an interesting case in that it’s essentially a cloud-based version of the popular <a href="http://hive.apache.org/">Apache Hive</a> framework <a href="http://gigaom.com/2012/06/06/exclusive-the-brains-behind-hive-launch-on-demand-hadoop-service/">launched by the guys who created Hive while working at Facebook</a>. Qubole claims its auto-scaling abilities, optimized Hadoop code and columnar data cache make its service run much faster than Hive alone, and that running on Amazon Web Services makes it easier than maintaining a physical cluster.</p>
<p><a href="http://gigaom2.files.wordpress.com/2013/02/cache.jpg"><img alt="cache" src="http://gigaom2.files.wordpress.com/2013/02/cache.jpg?w=708&#038;h=456" width="708" height="456" class="aligncenter size-full wp-image-612765"></a></p>
<h2 id="data-warehouses-and-bi-the-res">Data warehouses and BI: The rest</h2>
<p><strong><a href="http://www.citusdata.com/">Citus Data</a>:</strong> Citus Data’s CitusDB isn’t just about Hadoop, but rather <a href="http://gigaom.com/2013/02/19/citusdb-today-sql-on-hadoop-tomorrow-the-world/">wants to bring the power of its distributed Postgres implementation to all types of data</a>. It relies on Postgres’s foreign data wrappers feature to convert disparate data types into the database’s native format, and then on its own distributed-processing technology to carry out queries in seconds or less. Because of its Postgres foundation, CitusDB can join data from different data sources and retains all the native features that come with that database.<strong> </strong></p>
<p><a href="http://gigaom2.files.wordpress.com/2013/02/citus_hadoop_architecture1.png"><img alt="citus_hadoop_architecture" src="http://gigaom2.files.wordpress.com/2013/02/citus_hadoop_architecture1.png?w=708"   class="aligncenter size-full wp-image-612399"></a></p>
<p><strong><a href="http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/">Cloudera Impala</a>:  </strong>Cloudera’s Impala <a href="http://gigaom.com/2012/10/24/cloudera-makes-sql-a-first-class-citizen-in-hadoop/">might just be the most-important SQL-on-Hadoop effort</a> around because of Cloudera’s expansive installation and partner footprints. It’s a massively parallel processing engine that bypasses MapReduce to enable interactive queries on data stored in either HDFS or HBase, using the same variant of SQL that Hive uses. However, because Cloudera doesn’t build applications, it’s relying on higher-level BI and analytics partners to provide the user interface.</p>
<p><a href="http://gigaom2.files.wordpress.com/2013/02/impala.png"><img alt="impala" src="http://gigaom2.files.wordpress.com/2013/02/impala.png?w=708"   class="aligncenter size-full wp-image-612405"></a></p>
<p><strong><a href="http://karmasphere.com">Karmasphere</a>: </strong>Karmasphere is one of the first startups to build an analytic application atop Hadoop, and in <a href="http://gigaom.com/2012/06/11/is-2013-the-year-hadoop-uptake-turns-into-a-tornado/">its 2.0 release last year</a> the company added support for SQL queries of data in HDFS. Like Hive, Karmasphere still relies on MapReduce to process queries, which means it’s inherently slower than newer approaches. However, unlike Hive, Karmasphere allows for parallel queries to run at the same time and includes a visual interface for writing queries and filtering results.</p>
<p><a href="http://gigaom2.files.wordpress.com/2013/02/multiple-large.png"><img alt="multiple-large" src="http://gigaom2.files.wordpress.com/2013/02/multiple-large.png?w=708&#038;h=307" width="708" height="307" class="aligncenter size-large wp-image-612778"></a></p>
<p><strong><a href="http://www.cascading.org/lingual/">Lingual</a>:</strong> Lingual is a new open source project from Concurrent <em>(see disclosure)</em>, the parent company of the Cascading framework for Hadoop. <a href="http://www.marketwire.com/press-release/concurrent-enables-sql-users-build-big-data-applications-on-hadoop-less-than-30-seconds-1759041.htm">Announced on Wednesday</a>, Lingual runs on Cascading and gives developers and analysts a true ANSI SQL interface from which to run analytics or build applications. Lingual is compatible with traditional BI tools, JDBC  and the Cascading family of APIs.<strong> </strong></p>
<p><strong><a href="https://github.com/forcedotcom/phoenix">Phoenix</a>: </strong>Phoenix is a new and relatively unknown open source project that comes out of Salesforce.com and aims to allow fast SQL queries of data stored in HBase, the NoSQL database built atop HDFS. Its stated mission: “Become the standard means of accessing HBase data through a well-defined, industry standard API.” Users interact with it through JDBC interfaces, and its developers claim sub-second response times for small queries and response times of seconds for queries over tens of millions of rows.</p>
<div id="attachment_612413" class="wp-caption aligncenter" style="width: 718px"><a href="http://gigaom2.files.wordpress.com/2013/02/squirrel-copy.jpg"><img alt="A sample of Phoenix via the SQuirreL client" src="http://gigaom2.files.wordpress.com/2013/02/squirrel-copy.jpg?w=708&#038;h=496" width="708" height="496" class="size-large wp-image-612413"></a><p class="wp-caption-text">A sample of Phoenix via the SQuirreL client</p></div>
<p><strong><a href="http://gigaom2.files.wordpress.com/2013/02/shark.jpg"><img alt="shark" src="http://gigaom2.files.wordpress.com/2013/02/shark.jpg?w=300&#038;h=219" width="300" height="219" class="alignright size-medium wp-image-612439"></a><a href="http://shark.cs.berkeley.edu/">Shark</a>:</strong> Shark isn’t technically Hadoop, but it’s cut from the same cloth. <em>Shark</em>, in this case, stands for “Hive on Spark,” with Hive meaning the same thing it does to Hadoop, but with Spark <a href="http://spark-project.org/">being an in-memory platform</a> designed to run parallel-processing jobs 100 times faster than MapReduce (a speed improvement over traditional Hive that Shark also claims). Shark also includes APIs for turning query results into a type of data format amenable to machine learning algorithms. Both Shark and Spark are developed by the University of California, Berkeley’s <a href="https://amplab.cs.berkeley.edu/projects/">AMPLab</a>.</p>
<p><strong><a href="http://gigaom2.files.wordpress.com/2013/02/screen-shot-2013-02-19-at-5-37-01-pm-300x235.png"><img alt="Screen-Shot-2013-02-19-at-5.37.01-PM-300x235" src="http://gigaom2.files.wordpress.com/2013/02/screen-shot-2013-02-19-at-5-37-01-pm-300x235.png?w=708"   class="alignright size-full wp-image-612322"></a><a href="http://hortonworks.com/blog/100x-faster-hive/">Stinger Initiative</a>: </strong>Launched on Wednesday (along with <a href="http://hortonworks.com/blog/introducing-knox-hadoop-security/">a security gateway called Knox</a> and a <a href="http://hortonworks.com/blog/introducing-tez-faster-hadoop-processing/">faster, simpler processing framework called Tez</a>), the Stinger Initiative is a Hortonworks-led effort to make Hive faster (up to 100x) and more functional. Stinger adds more SQL analytics capabilities to Hive, but the most important aspects are infrastructural: an optimized execution engine, a columnar file format and the ability to avoid MapReduce bottlenecks by running atop Tez.</p>
<h2 id="operational-sql">Operational SQL</h2>
<p><strong><a href="http://drawntoscale.com/">Drawn to Scale</a>:</strong> Drawn to Scale is a startup that has <a href="http://gigaom.com/2012/07/24/how-one-startup-wants-to-inject-hadoop-into-your-sql/">built an operational SQL database on top of HBase</a>. The key word here is database, as its product, called Spire, is modeled after Google’s F1 and designed to power transactional applications as well as analytic ones. Spire has a fully distributed index and queries are sent only to the node with the relevant data, so reads and writes are fast and the system can handle lots of concurrent users without falling down.</p>
<p><a href="http://gigaom2.files.wordpress.com/2013/02/spirearchitecture-015.png"><img alt="SpireArchitecture.015" src="http://gigaom2.files.wordpress.com/2013/02/spirearchitecture-015-e1361407038325.png?w=708&#038;h=438" width="708" height="438" class="aligncenter size-large wp-image-612477"></a></p>
<p><strong><a href="http://gigaom2.files.wordpress.com/2013/02/splice.jpg"><img alt="splice" src="http://gigaom2.files.wordpress.com/2013/02/splice.jpg?w=300&#038;h=166" width="300" height="166" class="alignright size-medium wp-image-612669"></a><a href="http://www.splicemachine.com">Splice Machine</a>: </strong>Database startup Splice Machine is also trying to get into the operational space by building its Splice SQL Engine atop the naturally distributed HBase database. Splice Machine focuses its message on transactional integrity, which is really where it separates itself from scalable NoSQL databases and analytics-focused SQL-on-Hadoop efforts. It relies on HBase’s auto-sharding feature to make scaling an easy process.</p>
<p><a href="http://structuredata2013-editgraphic.eventbrite.com"><img src="http://gigaom2.files.wordpress.com/2013/02/structure-data_in-article-banner_590x1101.png?w=708" alt="Structure:Data: Put data to work. 60+ big data experts speaking. March 20-21, 2013, New York City. Register now."   class="aligncenter size-full wp-image-610578"></a></p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-73008p1.html">Shutterstock user hauhu</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/02/21/sql-is-whats-next-for-hadoop-heres-whos-doing-it/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/02/shutterstock_37622056.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/02/shutterstock_37622056.jpg?w=150" medium="image">
			<media:title type="html">sql statement</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/02/structure-data_in-article-banners_300x2001.png" medium="image">
			<media:title type="html">Structure:Data: Put data to work. 60+ big data experts speaking. March 20-21, 2013, New York City. Register now.</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/02/had_graphic2-scaled.jpg" medium="image">
			<media:title type="html">HAD_Graphic2-scaled</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/02/platforaarch.jpg?w=92" medium="image">
			<media:title type="html">platforaarch</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/02/cache.jpg" medium="image">
			<media:title type="html">cache</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/02/citus_hadoop_architecture1.png" medium="image">
			<media:title type="html">citus_hadoop_architecture</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/02/impala.png" medium="image">
			<media:title type="html">impala</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/02/multiple-large.png?w=708" medium="image">
			<media:title type="html">multiple-large</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/02/squirrel-copy.jpg?w=708" medium="image">
			<media:title type="html">A sample of Phoenix via the SQuirreL client</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/02/shark.jpg?w=300" medium="image">
			<media:title type="html">shark</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/02/screen-shot-2013-02-19-at-5-37-01-pm-300x235.png" medium="image">
			<media:title type="html">Screen-Shot-2013-02-19-at-5.37.01-PM-300x235</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/02/spirearchitecture-015-e1361407038325.png?w=708" medium="image">
			<media:title type="html">SpireArchitecture.015</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/02/splice.jpg?w=300" medium="image">
			<media:title type="html">splice</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/02/structure-data_in-article-banner_590x1101.png" medium="image">
			<media:title type="html">Structure:Data: Put data to work. 60+ big data experts speaking. March 20-21, 2013, New York City. Register now.</media:title>
		</media:content>
	</item>
		<item>
		<title>While we waste four cores, scientists use a million at a time</title>
		<link>http://gigaom.com/2013/01/28/while-we-waste-four-cores-scientists-use-a-million-at-a-time/</link>
		<comments>http://gigaom.com/2013/01/28/while-we-waste-four-cores-scientists-use-a-million-at-a-time/#comments</comments>
		<pubDate>Mon, 28 Jan 2013 18:46:47 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[Multi-core]]></category>
		<category><![CDATA[parallel processing]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[smartphones]]></category>
		<category><![CDATA[supercomputer]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=604968</guid>
		<description><![CDATA[A group of Stanford researchers recently ran a complex fluid dynamics workload across more than a million cores on the Sequoia supercomputer. It's an impressive feat and might foretell a future where parallel programming becomes commonplace even on our smartphones.]]></description>
				<content:encoded><![CDATA[<p>Chances are, the quad-core processor powering your desktop computer or high-end laptop is vastly underworked. But it’s not your fault: <a href="http://thecodist.com/article/writing-multithreaded-code-is-like-juggling-chainsaws">Writing code that executes in parallel is difficult</a>, so most consumer applications (save for some compute-intensive video games that really need help, for example) continue to run on just one core at a time. Which makes it all the more impressive that a group of Stanford researchers <a href="http://engineering.stanford.edu/news/stanford-researchers-break-million-core-supercomputer-barrier">recently ran a jet-engine-noise simulation across 1 million cores simultaneously</a>.</p>
<p>As anyone even casually familiar with parallel processing knows, running applications across more nodes generally means jobs execute faster because the nodes share the computing workload. The more cores, the faster a well-parallelized job runs. This is what makes Hadoop, for example, <a href="http://gigaom.com/2012/02/06/what-it-really-means-when-someone-says-hadoop/">so great at processing large chunks of data</a>. The MapReduce framework on which it’s based divvies up the work across nodes, and the partial results they produce are stitched back together as the result of a job.</p>
<p>But even Hadoop can only scale to tens of thousands of nodes and, because of its focus on “nodes,” actually <a href="http://gigaom.com/2011/06/28/hadoop-may-be-hot-but-it-needs-to-be-useful/">isn’t really good at utilizing multi-core processors</a> to their fullest (expect to hear more about the limitations of Hadoop at our <a href="http://event.gigaom.com/structuredata/?utm_source=cloud&amp;utm_medium=editorial&amp;utm_campaign=intext&amp;utm_term=604968+while-we-waste-four-cores-scientists-use-a-million-at-a-time&amp;utm_content=dharrisstructure">Structure: Data conference</a> March 20-21 in New York). The IBM-built Sequoia supercomputer (housed at Lawrence Livermore National Laboratory) that the Stanford team used consists of 98,304 processors (or nodes), each containing 16 computing cores. That’s a grand total of 1,572,864 cores, and the researchers were able to use the majority of them, which they claim is a record of some sort.</p>
<div id="attachment_605040" class="wp-caption aligncenter" style="width: 510px"><a href="http://gigaom2.files.wordpress.com/2013/01/seq_config.jpg"><img alt="Sequoia, decomposed" src="http://gigaom2.files.wordpress.com/2013/01/seq_config.jpg?w=708"   class="size-full wp-image-605040"></a><p class="wp-caption-text">Sequoia, decomposed</p></div>
<p>But record or not, that’s an incredibly complex undertaking. Programming the jet-engine simulation meant figuring out how to divvy the code into more than a million different tasks that could run across tens of thousands of nodes and 16 cores within each of those nodes. If even one of those processes is buggy, it could slow down or ruin the whole simulation.</p>
<p>Even in the world of supercomputing, where systems <a href="http://www.top500.org/list/2012/11/">now regularly contain hundreds of thousands of cores</a> (some of them special-purpose GPU co-processors), there’s a shortage of programming talent to actually use them all to their fullest potential. As my colleague Stacey Higginbotham <a href="http://gigaom.com/2011/09/02/supercomputings-problem-isnt-power-its-software/">explained some time ago</a>, the world of high-performance computing is hurtling toward exascale computing, but a bigger problem than energy consumption might be finding applications that need that much computing power and the algorithms capable of operating at that scale.</p>
<p>Still, the implications of advances in parallel programming are huge — like potentially life-altering huge. This is true not only <a href="http://gigaom.com/2010/05/21/introducing-the-worlds-most-powerful-supercomputer-for-climate-research/">because of the scientific questions we’ll soon be able to answer</a> at speeds inconceivable even a decade ago, but also because of the computing power we’ll all soon be carrying around in our pockets and purses. If you think those multi-core smartphones and tablets are great now because they can run multiple applications at the same time, just wait until their processors are even bigger and badder <a href="http://www.wired.com/gadgetlab/2012/03/how-a-quad-core-chip-would-supercharge-ipad-performance/">and we have more applications</a> — photo- and video-editing, computer-aided design, games and who knows what else — that can actually get the most out of them.</p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/01/28/while-we-waste-four-cores-scientists-use-a-million-at-a-time/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/01/dawn_complete_high-res.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/01/dawn_complete_high-res.jpg?w=150" medium="image">
			<media:title type="html">Sequoia</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/01/seq_config.jpg" medium="image">
			<media:title type="html">Sequoia, decomposed</media:title>
		</media:content>
	</item>
		<item>
		<title>Facebook open sources Corona &#8212; a better way to do webscale Hadoop</title>
		<link>http://gigaom.com/2012/11/08/facebook-open-sources-corona-a-better-way-to-do-webscale-hadoop/</link>
		<comments>http://gigaom.com/2012/11/08/facebook-open-sources-corona-a-better-way-to-do-webscale-hadoop/#comments</comments>
		<pubDate>Thu, 08 Nov 2012 20:01:15 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[big data]]></category>
		<category><![CDATA[Corona]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[scalability]]></category>
		<category><![CDATA[Web Infrastructure]]></category>
		<category><![CDATA[webscale]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=582252</guid>
		<description><![CDATA[Facebook has open sourced a new system called Corona for scheduling and managing Hadoop jobs. Corona attempts to do away with many of the problems that come along with massive-scale Hadoop operations, and soon looks to take Facebook's Hadoop deployment beyond just MapReduce.]]></description>
				<content:encoded><![CDATA[<p>Facebook is at it again, building more software to make Hadoop a better way to do big data at web scale. Its latest creation, which the company <a href="https://github.com/facebook/hadoop-20/tree/master/src/contrib/corona">has also open sourced</a>, is called Corona and aims to make Hadoop more efficient, more scalable and more available by re-inventing how jobs are scheduled.</p>
<p>As with <a href="http://www.facebook.com/note.php?note_id=468211193919">most of its changes to Hadoop over the years</a> &#8212; including the <a href="http://gigaom.com/cloud/how-facebook-keeps-100-petabytes-of-hadoop-data-online/">recently unveiled AvatarNode</a> &#8212; Corona came to be because Hadoop simply wasn&#8217;t designed to handle Facebook&#8217;s scale or its broad usage of the platform. What kind of scale are we talking about? According to Facebook engineers Avery Ching, Ravi Murthy, Dmytro Molkov, Ramkumar Vadali, and Paul Yang <a href="https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920">in a blog post detailing Corona on Thursday</a>, the company&#8217;s largest cluster is more than 100 petabytes; it runs more than 60,000 Hive queries a day; and its data warehouse has grown 2,500x in four years.</p>
<p>Further, Ching and company note &#8212; echoing something Facebook VP of Infrastructure Engineering Jay Parikh told me in September when <a href="http://gigaom.com/data/for-the-future-of-big-data-startups-look-to-facebook/">discussing the future of big data startups</a> &#8212; Hadoop is responsible for a lot of how Facebook runs both its platform and its business:</p>
<blockquote><p>Almost every team at Facebook depends on our custom-built data infrastructure for warehousing and analytics, with roughly 1,000 people across the company &#8212; including both technical and non-technical personnel &#8212; using these technologies every day. Over half a petabyte of new data arrives in the warehouse every 24 hours, and ad-hoc queries, data pipelines, and custom MapReduce jobs process this raw data around the clock to generate more meaningful features and aggregations.</p></blockquote>
<h2>So, what is Corona?</h2>
<p>In a nutshell, Corona represents a new system for scheduling Hadoop jobs that makes better use of a cluster&#8217;s resources and also makes it more amenable to multitenant environments like the one Facebook operates. Ching et al. explain the problems and the solution in some detail, but the short explanation is that Hadoop&#8217;s JobTracker node is responsible for both cluster management and job scheduling, and it has a hard time keeping up with both tasks as clusters grow and the number of jobs sent to them increases.</p>
<p>Further, job-scheduling in Hadoop involves an inherent delay, which is problematic for small jobs that need fast results. And a fixed configuration of &#8220;map&#8221; and &#8220;reduce&#8221; slots means Hadoop clusters run inefficiently when jobs don&#8217;t fit into the remaining slots or when they&#8217;re not MapReduce jobs at all.</p>
<p>Corona resolves some of these problems by creating individual job trackers for each job and a cluster manager focused solely on tracking nodes and the amount of available resources. Thanks to this simplified architecture and a few other changes, the latency to get a job started is reduced and the cluster manager can make fast scheduling decisions because it&#8217;s not also responsible for tracking the progress of running jobs. Corona also incorporates a feature that divvies a cluster into resource pools to ensure every group within the company gets its fair share of resources.</p>
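<p>To make the resource-pool idea concrete, here is a minimal Python sketch of max-min fair sharing, one common way to divide a cluster among competing pools. It is an illustration only and not Corona&#8217;s actual scheduler: the <code>fair_share</code> function, the pool names and the slot counts are all hypothetical.</p>

```python
def fair_share(capacity, demands):
    """Split `capacity` slots among pools using max-min fairness.

    `demands` maps a (hypothetical) pool name to the number of slots it wants.
    No pool receives more than it asks for; capacity a pool does not need is
    re-divided among the pools that still want more.
    """
    allocation = {pool: 0 for pool in demands}
    remaining = capacity
    unsatisfied = {pool for pool, want in demands.items() if want > 0}

    while remaining > 0 and unsatisfied:
        # Offer every unsatisfied pool an equal share of what is left.
        share = remaining // len(unsatisfied)
        if share == 0:
            # Fewer slots than pools: hand out one slot each, in name order.
            for pool in sorted(unsatisfied)[:remaining]:
                allocation[pool] += 1
            break
        for pool in sorted(unsatisfied):
            grant = min(share, demands[pool] - allocation[pool])
            allocation[pool] += grant
            remaining -= grant
        # Pools whose demand is now met drop out of the next round.
        unsatisfied = {p for p in unsatisfied if allocation[p] != demands[p]}
    return allocation


# Hypothetical example: three teams competing for a 100-slot cluster.
pools = fair_share(100, {"ads": 20, "search": 50, "growth": 80})
```

<p>In this example the small pool gets everything it asked for, and the two larger pools split the remainder evenly rather than in proportion to how much they requested.</p>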
<p><a href="http://gigaom2.files.wordpress.com/2012/11/corona.jpg"><img  title="corona" alt="" src="http://gigaom2.files.wordpress.com/2012/11/corona.jpg?w=708"   class="aligncenter size-full wp-image-582351" /></a></p>
<p>The results have lived up to expectations since Corona went into full production in mid-2012: the average time to refill idle resources improved by 17 percent; resource utilization over regular MapReduce improved to 95 percent from 70 percent (in a simulation cluster); resource unfairness dropped to 3.6 percent with Corona versus 14.3 percent with traditional MapReduce; and latency on a test job Facebook runs every four minutes has been</p>
<p>Despite the hard work put into building and deploying Corona, though, the project still has a way to go. One of the biggest improvements currently being developed is to enable resource management based on CPU, memory and other job requirements rather than just the number of &#8220;map&#8221; and &#8220;reduce&#8221; slots needed. This will open Corona up to running non-MapReduce jobs, therefore making a Hadoop cluster more of a general-purpose parallel computing cluster.</p>
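<p>The difference between slot-based and resource-based scheduling can be sketched in a few lines of Python. This is a simplified illustration under assumed numbers, not Facebook&#8217;s implementation: the <code>Task</code> class, both placement functions and the node size (16 CPUs, 64 GB, 8 slots) are hypothetical.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    cpus: int
    memory_gb: int


def place_by_slots(slots, tasks):
    """Slot model: every task takes one slot, whatever it really needs."""
    return min(slots, len(tasks))


def place_by_resources(cpus, memory_gb, tasks):
    """Resource model: admit tasks in order while CPU and memory both fit."""
    placed = 0
    for task in tasks:
        if task.cpus > cpus or task.memory_gb > memory_gb:
            continue  # this task does not fit; try the next one
        cpus -= task.cpus
        memory_gb -= task.memory_gb
        placed += 1
    return placed


# Hypothetical workloads for a node with 16 CPUs, 64 GB of RAM and 8 slots.
small = [Task(cpus=1, memory_gb=2)] * 20
large = [Task(cpus=4, memory_gb=32)] * 20

small_by_slots = place_by_slots(8, small)
small_by_resources = place_by_resources(16, 64, small)
large_by_slots = place_by_slots(8, large)
large_by_resources = place_by_resources(16, 64, large)
```

<p>With small tasks, the slot model stops at eight and leaves half the CPUs idle, while the resource model keeps going until the CPUs run out. With large tasks, the slot model would still admit eight and oversubscribe memory, while the resource model admits only the two that fit.</p>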
<p>Facebook is also trying to incorporate online upgrades, which would mean a cluster doesn&#8217;t have to come down every time part of the management layer undergoes an update.</p>
<h2>Why Facebook sometimes must re-invent the wheel</h2>
<p>Anyone deeply familiar with the Hadoop space might be thinking that a lot of what Facebook has done with Corona sounds familiar &#8212; and that&#8217;s because it kind of is. The <a href="http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/">Apache YARN project</a> that has been integrated into the latest version of Apache Hadoop similarly splits the JobTracker into separate cluster-management and job-tracking components, and already allows for non-MapReduce workloads. Further, there <a href="http://gigaom.com/cloud/the-unsexy-side-of-big-data-6-tools-to-manage-your-hadoop-cluster/">is a whole class of commercial and open source cluster-management tools</a> that have their own solutions to the problems Corona tries to solve, including <a href="http://incubator.apache.org/mesos/index.html">Apache Mesos</a>, which is <a href="http://gigaom.com/cloud/twitter-backs-fave-big-data-projects-with-apache-sponsorship/">Twitter&#8217;s tool of choice</a>.<br />
However, anyone who&#8217;s familiar with Facebook knows the company isn&#8217;t likely to buy software from anyone. It also has reached a point of customization with its Hadoop environment where even open-source projects from Apache won&#8217;t be easy to adapt to Facebook&#8217;s unique architecture. From the blog post:</p>
<blockquote><p>It’s worth noting that we considered Apache YARN as a possible alternative to Corona. However, after investigating the use of YARN on top of our version of HDFS (a strong requirement due to our many petabytes of archived data) we found numerous incompatibilities that would be time-prohibitive and risky to fix. Also, it is unknown when YARN would be ready to work at Facebook-scale workloads.</p></blockquote>
<p>So, Facebook plods forward, a Hadoop user without equal (save for maybe Yahoo) left building its own tools in isolation. What will be interesting to watch as Hadoop adoption picks up and more companies begin building applications atop it is how many actually utilize the types of tools that companies like Facebook, <a href="http://gigaom.com/cloud/how-twitter-is-doing-its-part-to-democratize-big-data/">Twitter</a> and <a href="http://gigaom.com/data/quantcast-releases-bigger-faster-stronger-hadoop-file-system/">Quantcast</a> have created and open sourced. They might not have commercial backers behind them, but they&#8217;re certainly built to work well at scale.</p>
<p><em>Feature image courtesy of Shutterstock user <a href="http://www.shutterstock.com/gallery-10991p1.html">Johan Swanepoel</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2012/11/08/facebook-open-sources-corona-a-better-way-to-do-webscale-hadoop/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2012/11/shutterstock_42996799.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2012/11/shutterstock_42996799.jpg?w=150" medium="image">
			<media:title type="html">herd of elephants</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/11/corona.jpg" medium="image">
			<media:title type="html">corona</media:title>
		</media:content>
	</item>
		<item>
		<title>Real-time query for Hadoop democratizes access to big data analytics</title>
		<link>http://pro.gigaom.com/2012/11/real-%c2%adtime-query-for-hadoop-democratizes-access-to-big-data-analytics/</link>
		<comments>http://pro.gigaom.com/2012/11/real-%c2%adtime-query-for-hadoop-democratizes-access-to-big-data-analytics/#comments</comments>
		<pubDate>Wed, 07 Nov 2012 07:55:09 +0000</pubDate>
		<dc:creator>George Gilbert</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[batch-processing]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Cloudera Impala]]></category>
		<category><![CDATA[data processing]]></category>
		<category><![CDATA[data warehousing]]></category>
		<category><![CDATA[data-analytics]]></category>
		<category><![CDATA[database management systems]]></category>
		<category><![CDATA[emc-greenplum]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Hewlett-Packard]]></category>
		<category><![CDATA[HP]]></category>
		<category><![CDATA[Mapr]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[real-time queries]]></category>
		<category><![CDATA[splunk]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://pro.gigaom.com/?p=157731</guid>
		<description><![CDATA[The delivery of real-time query makes Hadoop accessible to more users — and by orders of magnitude. Its significance goes well beyond delivering a database management system (DBMS) kind of query engine that other products have had for decades. Rather, Hadoop as a platform now supports a whole new paradigm of analytics. With the introduction of real-time query, Hadoop has taken a major step toward unifying the majority of big data analytic applications onto one platform. This research paper targets information technology professionals who have in-depth experience with traditional RDBMS and seek to understand where the Hadoop ecosystem and big data analytics fit.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gigaom.com&#038;blog=14960843&#038;post=581587&#038;subd=gigaom2&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<br />  <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gigaom.com&#038;blog=14960843&#038;post=581587&#038;subd=gigaom2&#038;ref=&#038;feed=1" width="1" height="1" /><p><a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/1008864/GigaOM_RSS_300x250&#038;sz=300x250&#038;c=541676"><img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/1008864/GigaOM_RSS_300x250&#038;sz=300x250&#038;c=541676" /></a></p><p><strong>Related research and analysis from GigaOM Pro:</strong><br />Subscriber content. <a href="http://pro.gigaom.com/?utm_source=pro&utm_medium=editorial&utm_campaign=auto3&utm_term=581587+real-%25c2%25adtime-query-for-hadoop-democratizes-access-to-big-data-analytics&utm_content=techstrategypartners">Sign up for a free trial</a>.</p><ul><li><a href="http://pro.gigaom.com/2012/04/infrastructure-q1-cloud-and-big-data-woo-the-enterprise/?utm_source=pro&utm_medium=editorial&utm_campaign=auto3&utm_term=581587+real-%25c2%25adtime-query-for-hadoop-democratizes-access-to-big-data-analytics&utm_content=techstrategypartners">Infrastructure Q1: Cloud and big data woo enterprises</a></li><li><a href="http://pro.gigaom.com/2012/03/a-near-term-outlook-for-big-data/?utm_source=pro&utm_medium=editorial&utm_campaign=auto3&utm_term=581587+real-%25c2%25adtime-query-for-hadoop-democratizes-access-to-big-data-analytics&utm_content=techstrategypartners">A near-term outlook for big data</a></li><li><a href="http://pro.gigaom.com/2012/11/unlocking-big-datas-potential-with-search/?utm_source=pro&utm_medium=editorial&utm_campaign=auto3&utm_term=581587+real-%25c2%25adtime-query-for-hadoop-democratizes-access-to-big-data-analytics&utm_content=techstrategypartners">How search can unlock the power of big data</a></li></ul>]]></content:encoded>
			<wfw:commentRss>http://pro.gigaom.com/2012/11/real-%c2%adtime-query-for-hadoop-democratizes-access-to-big-data-analytics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:thumbnail url="http://pro.gigaom.com/files/2010/07/bronze-elephant.jpg?w=150" />
		<media:content url="http://pro.gigaom.com/files/2010/07/bronze-elephant.jpg?w=150" medium="image">
			<media:title type="html">bronze elephant</media:title>
		</media:content>

		<media:content url="http://1.gravatar.com/avatar/1ad3ba18460ee12eb7567dd78f23756f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">George Gilbert</media:title>
		</media:content>
	</item>
		<item>
		<title>5 trends that are changing how we do big data</title>
		<link>http://gigaom.com/2012/11/03/5-trends-that-are-changing-how-we-do-big-data/</link>
		<comments>http://gigaom.com/2012/11/03/5-trends-that-are-changing-how-we-do-big-data/#comments</comments>
		<pubDate>Sat, 03 Nov 2012 21:00:22 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[artificial intelligence]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Impala]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=578401</guid>
		<description><![CDATA[In just a few years, big data has turned from a buzzword and concept best left for large web companies into a force that drives much of our digital lives. Here are five technological trends that will change how data is processed and consumed going forward.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gigaom.com&#038;blog=14960843&#038;post=578401&#038;subd=gigaom2&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>It&#8217;s time to rethink the who, what, where, why and how of big data. After a surge of important news in the past couple weeks, we&#8217;re approaching a period of relative calm and can finally assess how the space has evolved in the past year. Here are five trends shaping up that should change almost everything about big data in the near future, including how it&#8217;s done, who&#8217;s doing it and where it&#8217;s consumed. Feel free to share the trends you&#8217;re seeing in the comments.</p>
<h2>The democratization of data science</h2>
<p>The amount of effort being put into broadening the talent pool for data scientists might be the most important change of all in the world of data. In some cases, it&#8217;s new education platforms (e.g., Coursera and Udacity) <a href="http://gigaom.com/data/why-becoming-a-data-scientist-might-be-easier-than-you-think/">teaching students fundamental skills</a> in everything from basic statistics to natural language processing and machine learning. Elsewhere, it&#8217;s products <a href="http://gigaom.com/data/how-0xdata-wants-to-help-everyone-become-data-scientists/">such as 0xdata</a> that aim to simplify and add scale to well-known statistical-analysis tools such as R, or products <a href="http://gigaom.com/2012/03/21/quid-structure-data-2012/">like Quid</a> that try to mask the finer points of concepts such as machine learning and artificial intelligence behind well-designed user interfaces and slick visual representations. Platforms such as Kaggle have opened the door to <a href="http://gigaom.com/cloud/kaggle-is-now-crowdsourcing-data-science-creativity/">crowdsourcing answers to tough predictive-modeling problems</a>.</p>
<span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='604' height='370' src='http://www.youtube.com/embed/e0WKJLovaZg?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0'></iframe></span>
<p>Whatever the avenue, though, the end result is that individuals who have a little imagination, some basic computer science skills and a lot of business acumen can now do more with their data. A few steps down the ladder, companies such as <a href="http://gigaom.com/cloud/data-hero-aims-to-turn-us-all-into-analytics-stars/">Datahero</a>, <a href="http://gigaom.com/europe/infogram-wants-to-help-you-make-beautiful-infographics/">Infogram</a> and <a href="https://www.statwing.com/">Statwing</a> are trying to make analytics accessible even to laypersons. Ultimately, all of this could result in a self-feeding cycle where more people start small, eventually work their way up to using and building advanced data-analysis products and techniques, and then equip the next generation of aspiring data scientists with the next generation of data applications.</p>
<h2>Hadoop&#8217;s MapReduce reduction</h2>
<p>Hadoop&#8217;s days as a platform solely for performing MapReduce jobs are officially over, and the change couldn&#8217;t have come fast enough. The evolution began with Apache Hadoop version 2.0 and its <a href="http://hortonworks.com/blog/introducing-apache-hadoop-yarn/">new YARN functionality</a> that allows for new processing frameworks, but solidified with the spate of projects and products &#8212; <a href="http://gigaom.com/data/cloudera-makes-sql-a-first-class-citizen-in-hadoop/">including Cloudera&#8217;s very popular commercial distribution</a> &#8212; that <a href="http://gigaom.com/data/batten-down-the-analysts-its-a-big-data-bi-storm/">now include a SQL query engine</a> or <a href="http://gigaom.com/data/platfora-shows-a-whole-new-way-to-do-business-intelligence-on-big-data/">other method for interactive analysis</a> running alongside MapReduce. That was a big item to check off the list of capabilities Hadoop must support, as <a href="http://gigaom.com/cloud/microsofts-hadoop-play-is-shaping-up-and-it-includes-excel/">data analysts need access to Hadoop data</a> in a manner they understand.</p>
<div id="attachment_576120" class="wp-caption aligncenter" style="width: 614px"><a href="http://gigaom2.files.wordpress.com/2012/10/segmentation.jpg"><img  title="Segmentation" alt="" src="http://gigaom2.files.wordpress.com/2012/10/segmentation.jpg?w=604&#038;h=431" height="431" width="604" class="size-large wp-image-576120" /></a><p class="wp-caption-text">Doing Hadoop-powered BI with Platfora</p></div>
<p>From this point on &#8212; <a href="http://gigaom.com/cloud/why-the-days-are-numbered-for-hadoop-as-we-know-it/">like with the Google MapReduce framework</a> on which Hadoop&#8217;s version of MapReduce was modeled &#8212; it seems likely we&#8217;ll see the latter grow less important. Presumably, the Hadoop community will focus more on using the platform&#8217;s distributed nature to support real-time processing and other new capabilities that make Hadoop a better fit in next-generation data applications. <a href="http://gigaom.com/cloud/the-state-of-hadoop-strong-and-poised-to-explode/">If Hadoop can&#8217;t fill the void</a>, there are plenty of people working on other technologies &#8212; <a href="http://gigaom.com/cloud/nodeable-gives-hadoop-a-real-time-boost-with-streamreduce/">Storm</a> and <a href="http://gigaom.com/data/metamarkets-open-sources-druid-its-in-memory-database/">Druid</a>, for example &#8212; that will gladly do so.</p>
<p>The HBase NoSQL database that&#8217;s built atop the Hadoop Distributed File System is a good example of what&#8217;s possible when Hadoop is freed from the MapReduce constraints. Large web companies such as <a href="http://gigaom.com/cloud/how-facebook-is-powering-real-time-analytics/">Facebook</a> and <a href="http://gigaom.com/cloud/under-the-covers-of-ebays-big-data-operation/">eBay</a> already use HBase to power transactional applications, and startups such as <a href="http://gigaom.com/cloud/how-one-startup-wants-to-inject-hadoop-into-your-sql/">Drawn to Scale</a> and <a href="http://gigaom.com/data/batten-down-the-analysts-its-a-big-data-bi-storm/">Splice Machine</a> have used HBase as the foundation for transactional SQL databases. More new products and projects, such as <a href="http://incubator.apache.org/giraph/">graph-processing framework Giraph</a>, will look for ways to leverage HDFS because it gives them a file system that&#8217;s scalable, free, relatively mature and, perhaps most importantly, tied into the ever-growing Hadoop ecosystem.</p>
<h2>Coming soon to an app near you</h2>
<p>Of course, all of this technological improvement is <a href="http://gigaom.com/2012/03/21/cloudera-structure-data-2012/">nothing without applications to take advantage of it</a>, so it&#8217;s good news that we&#8217;re seeing a wide range of approaches for making this happen. One of these approaches is making big data accessible to developers, which is where startups such as <a href="http://gigaom.com/data/ex-yahoo-facebook-big-data-vets-launch-paas-for-hadoop/">Continuuity</a>, <a href="http://gigaom.com/cloud/infochimps-makes-its-big-data-for-developers-platform-real-time/">Infochimps</a> and even <a href="http://gigaom.com/data/startup-precog-says-big-data-doesnt-need-to-be-so-complex/">Precog</a> (a big data BI engine, by nature) come into play. They make it relatively easy for developers to create applications that tie at least some functions into a big data backend, sometimes via a process as simple as writing a script or generating a piece of code that programmers can insert directly into their application&#8217;s code.</p>
<p>Another approach that&#8217;s picking up steam is simply to find a use case for big data, such as <a href="http://www.wibidata.com/">analyzing user behavior</a>, <a href="http://gigaom.com/cloud/hadoop-kills-zombies-too-is-there-anything-it-cant-solve/">network security</a>, <a href="http://gigaom.com/cloud/if-you-want-to-build-the-next-siri-ai-one-wants-to-help/">artificial intelligence</a> or <a href="http://gigaom.com/cloud/can-i-help-you-how-liveperson-decides-whos-worth-the-personal-touch/">customer service</a>, and turn it into a product or service that companies can buy and start using out of the box. These are things that early adopters such as Google, Facebook and others have had to build themselves but that others likely won&#8217;t have to. And everywhere you look, big data and data science are already <a href="http://gigaom.com/data/for-the-future-of-big-data-startups-look-to-facebook/">being rolled into many web and mobile applications</a>, from <a href="http://gigaom.com/cloud/is-this-data-scientist-a-consumers-best-friend/">deciding which products to buy</a> to <a href="http://gigaom.com/cloud/how-ancestry-com-is-using-big-data-to-map-time-place-and-people/">figuring out your long-lost relatives</a>. Somewhere, somehow, everyone surfing the web or using a mobile app is benefiting from big data.</p>
<h2>Machine learning is everywhere</h2>
<p>Machine learning has had something of a coming-out party in the past year and is now so prevalent it might be easy to mistake it for something that&#8217;s <em>not </em>difficult to do well. It&#8217;s easy to see why machine learning is so popular, though: In an age where consumers (and advertisers) want more personalization, and where computer systems are overwhelmed with data flying at them from all different directions, the prospect of <a href="http://gigaom.com/2012/03/21/machine-learning-structure-data-2012/">writing models that continuously discover patterns</a> among potentially countless data points has to be appealing.</p>
<p>Here&#8217;s a small sample of apps you&#8217;ve likely heard of, or that we&#8217;ve covered, that rely on machine learning to work their magic: <a href="http://gigaom.com/2012/10/02/prismatics-bradford-cross-first-we-understand-media-then-the-world/">Prismatic</a>, <a href="http://gigaom.com/2012/10/31/summly-wants-to-make-news-summaries-cool-ok/">Summly</a>, <a href="http://gigaom.com/data/how-trifacta-wants-to-teach-humans-and-data-to-work-together/">Trifacta</a>, <a href="http://gigaom.com/cloud/more-proof-that-big-data-security-are-soulmates/">CloudFlare</a>, <a href="http://gigaom.com/data/mit-researcher-says-he-can-predict-twitter-trends/">Twitter</a>, <a href="http://gigaom.com/2012/06/25/how-google-is-teaching-computers-to-see/">Google</a>, <a href="http://www.serversidemagazine.com/news/10-questions-with-facebook-research-engineer-andrei-alexandrescu/">Facebook</a>, <a href="http://gigaom.com/cleantech/a-khosla-backed-big-data-energy-startup-you-should-know-about/">Bidgely</a>, <a href="http://gigaom.com/2012/10/15/healthrageous-nabs-6-5m-for-personalized-data-driven-health-support/">Healthrageous</a>, <a href="http://gigaom.com/data/machine-learning-and-health-care-mean-6m-for-predilytics/">Predilytics</a>, <a href="http://gigaom.com/cloud/bloomreach-wants-to-save-your-site-with-big-data/">BloomReach</a>, <a href="http://gigaom.com/cloud/your-data-has-a-secret-but-you-yes-you-can-make-it-talk/">DataPop</a>, <a href="http://gigaom.com/data/it-pays-to-know-you-interest-graph-master-gravity-gets-10-6m/">Gravity</a>. I could go on for days, I think.</p>
<div id="attachment_580511" class="wp-caption aligncenter" style="width: 614px"><a href="http://gigaom2.files.wordpress.com/2012/11/prismatic-copy.jpg"><img  title="prismatic copy" alt="" src="http://gigaom2.files.wordpress.com/2012/11/prismatic-copy-e1351967058399.jpg?w=604&#038;h=333" height="333" width="604" class="size-large wp-image-580511" /></a><p class="wp-caption-text">Prismatic learning my interests</p></div>
<p>Now, it&#8217;s difficult to imagine a new tech company launching that doesn&#8217;t at least consider using machine learning models to make its product or service more intelligent. Heck, even Microsoft <a href="http://bits.blogs.nytimes.com/2012/11/01/microsofts-plan-to-sell-answers/">appears to be making a big bet on machine learning</a> as the foundation of a new revenue stream. The technology to store and process lots of data is out there, and the brainpower looks to be coming along as well. Soon, there will be few excuses for building applications that don&#8217;t learn as they go, for example, what users want to see, how systems fail or <a href="http://gigaom.com/cloud/big-data-making-even-call-centers-intelligent/">when customers are about to cancel a service</a>.</p>
<h2>Mobile data as the engine for AI</h2>
<p>Long before Skynet takes over and the machines turn on humans, our mobile phones will know better than us what we want to do. That&#8217;s because until <a href="http://gigaom.com/2012/10/28/google-glass-app-platform/">technologies like Google&#8217;s Project Glass</a> actually make their way into the wild, our phones and the apps on them are <a href="http://gigaom.com/data/why-were-all-big-data-now/">probably the richest source of personal data around</a>. And thanks to machine learning, <a href="http://gigaom.com/data/google-explains-how-more-data-means-better-speech-recognition/">speech recognition</a> and other technologies, they&#8217;re able to make a lot of sense of what they&#8217;re given.</p>
<p><a href="http://gigaom2.files.wordpress.com/2012/11/saga_now-copy.jpg"><img  title="saga_now-copy" alt="" src="http://gigaom2.files.wordpress.com/2012/11/saga_now-copy.jpg?w=708"   class="alignleft size-full wp-image-580512" /></a>They know where we go, who our friends are, what&#8217;s on our calendars and what we look at online. Thanks to a new generation of applications such as <a href="http://gigaom.com/apple/my-siri-wish-list-5-things-i-want-to-see/">Siri</a>, <a href="http://gigaom.com/cloud/siri-is-so-aloof-saga-wants-to-get-to-know-you/">Saga</a> and <a href="http://gigaom.com/2012/06/27/with-google-now-google-search-is-getting-ready-for-project-glass/">Google Now</a> trying to serve as personal assistants, our phones can understand what we say, know the businesses we frequent and the foods we eat, and the hours we&#8217;re at home, at work or out on the town. Already, their developers claim such apps can augment our limited vantage point by automatically telling us the best directions to our upcoming appointment, or the best place to get our favorite foods in a city the app knows we haven&#8217;t been to before.</p>
<p>The race is officially on to see who can build the smartest app, pull in the most data sources and figure out <a href="http://gigaom.com/2012/11/03/how-data-helped-me-love-the-small-screen/">how to best display it all on a 4-inch screen</a>.</p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-65904p1.html">Shutterstock user Sebastian Kaulitzki</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2012/11/03/5-trends-that-are-changing-how-we-do-big-data/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2012/11/shutterstock_3914395.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2012/11/shutterstock_3914395.jpg?w=150" medium="image">
			<media:title type="html">Brain on chip</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/10/segmentation.jpg?w=604" medium="image">
			<media:title type="html">Segmentation</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/11/prismatic-copy-e1351967058399.jpg?w=604" medium="image">
			<media:title type="html">prismatic copy</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/11/saga_now-copy.jpg" medium="image">
			<media:title type="html">saga_now-copy</media:title>
		</media:content>
	</item>
		<item>
		<title>Scaling Hadoop clusters: the role of cluster management</title>
		<link>http://pro.gigaom.com/2012/07/scaling-hadoop-clusters-the-role-of-cluster-management/</link>
		<comments>http://pro.gigaom.com/2012/07/scaling-hadoop-clusters-the-role-of-cluster-management/#comments</comments>
		<pubDate>Mon, 23 Jul 2012 07:01:22 +0000</pubDate>
		<dc:creator>Paul Miller</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Alcatel Lucent]]></category>
		<category><![CDATA[Ambari]]></category>
		<category><![CDATA[Apache Ambari]]></category>
		<category><![CDATA[Apache Software Foundation]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Chef]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[cluster management]]></category>
		<category><![CDATA[clusters]]></category>
		<category><![CDATA[Crowbar]]></category>
		<category><![CDATA[CSC]]></category>
		<category><![CDATA[Dell]]></category>
		<category><![CDATA[Dell Crowbar]]></category>
		<category><![CDATA[e-commerce]]></category>
		<category><![CDATA[emc-greenplum]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Ganglia]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Hadoop Common]]></category>
		<category><![CDATA[Hadoop Distributed File System]]></category>
		<category><![CDATA[Hbase]]></category>
		<category><![CDATA[Hortonworks]]></category>
		<category><![CDATA[Hortonworks Data Platform]]></category>
		<category><![CDATA[IBM]]></category>
		<category><![CDATA[lava]]></category>
		<category><![CDATA[loudera]]></category>
		<category><![CDATA[Mapr]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[OpenStack]]></category>
		<category><![CDATA[platform]]></category>
		<category><![CDATA[Platform LSF]]></category>
		<category><![CDATA[platform-computing]]></category>
		<category><![CDATA[procter-gamble]]></category>
		<category><![CDATA[Puppet]]></category>
		<category><![CDATA[Puppet Labs]]></category>
		<category><![CDATA[Rocks]]></category>
		<category><![CDATA[social networking]]></category>
		<category><![CDATA[social networks]]></category>
		<category><![CDATA[stackiq]]></category>
		<category><![CDATA[StackIQ Enterprise Data]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://pro.gigaom.com/?p=117860</guid>
		<description><![CDATA[Organizations are coping with the challenge of processing unprecedented volumes of data. However, the processes involved with using a large cluster to run applications like Hadoop are error-prone. So IT managers are turning to cluster-management solutions to automate tasks associated with cluster creation, management and maintenance.]]></description>
				<content:encoded><![CDATA[]]></content:encoded>
			<wfw:commentRss>http://pro.gigaom.com/2012/07/scaling-hadoop-clusters-the-role-of-cluster-management/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:thumbnail url="https://gigaom-pro-files.s3.amazonaws.com/files/2012/07/rockclimbing1.jpg?w=150" />
		<media:content url="https://gigaom-pro-files.s3.amazonaws.com/files/2012/07/rockclimbing1.jpg?w=150" medium="image">
			<media:title type="html">rockclimbing1</media:title>
		</media:content>

		<media:content url="http://1.gravatar.com/avatar/7c1b4afa924d36a76027fe2be0543eeb?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">cloudofdata</media:title>
		</media:content>
	</item>
		<item>
		<title>Because Hadoop isn&#8217;t perfect: 8 ways to replace HDFS</title>
		<link>http://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/</link>
		<comments>http://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/#comments</comments>
		<pubDate>Wed, 11 Jul 2012 21:50:13 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[appistry]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Ceph]]></category>
		<category><![CDATA[CleverSafe]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[EMC]]></category>
		<category><![CDATA[file systems]]></category>
		<category><![CDATA[GPFS]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[IBM]]></category>
		<category><![CDATA[Isilon]]></category>
		<category><![CDATA[Lustre]]></category>
		<category><![CDATA[Mapr]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[NetApp]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[scalability]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=541225</guid>
		<description><![CDATA[Hadoop is on its way to becoming the de facto platform for the next generation of data-based applications, but it's not without some flaws. Ironically, one of Hadoop's biggest shortcomings right now is also one of its biggest strengths going forward -- the Hadoop Distributed File System.]]></description>
				<content:encoded><![CDATA[<p><a href="http://gigaom2.files.wordpress.com/2012/07/achilles_heel.jpg"><img  title="achilles heel" src="http://gigaom2.files.wordpress.com/2012/07/shutterstock_16533076.jpg?w=300&#038;h=200" alt="" width="300" height="200" class="alignleft size-medium wp-image-541764" /></a>Hadoop is <a href="http://gigaom.com/cloud/the-state-of-hadoop-strong-and-poised-to-explode/">on its way to becoming the de facto platform</a> for the next generation of data-based applications, but it&#8217;s not without flaws. Ironically, one of Hadoop&#8217;s biggest shortcomings now is also one of its biggest strengths going forward &#8212; the Hadoop Distributed File System.</p>
<p>Within the Apache Software Foundation, HDFS is always improving in terms of performance and availability. Honestly, it&#8217;s probably fine for the majority of Hadoop workloads that are running in pilot projects, skunkworks projects or generally non-demanding environments. And technologies such as HBase that are built atop HDFS speak to its versatility <a href="http://gigaom.com/cloud/drawn-to-scale-raises-money-to-make-sql-big-data-ready/">as a storage system even for non-MapReduce applications</a>.</p>
<p>But if the growing number of options for replacing HDFS signifies anything, it&#8217;s that HDFS isn&#8217;t quite where it needs to be. Some Hadoop users have strict demands around performance, availability and enterprise-grade features, while others aren&#8217;t keen on its direct-attached storage (DAS) architecture. Concerns around availability might be especially valid for anyone (read &#8220;almost everyone&#8221;) who&#8217;s using an older version of Hadoop without the <a href="http://www.cloudera.com/blog/2012/03/high-availability-for-the-hadoop-distributed-file-system-hdfs/">High Availability NameNode</a>. Here are eight products and projects whose proprietors argue they can deliver what HDFS can&#8217;t:</p>
<p><strong>Cassandra (DataStax)<br />
</strong></p>
<p><a href="http://gigaom2.files.wordpress.com/2012/07/datastax_marketecture_a1-copy.jpg"><img  title="datastax_marketecture_A1 copy" src="http://gigaom2.files.wordpress.com/2012/07/datastax_marketecture_a1-copy.jpg?w=300&#038;h=263" alt="" width="300" height="263" class="alignright size-medium wp-image-541752" /></a>Not a file system at all but an open source, NoSQL key-value store, Cassandra has become a viable alternative to HDFS for web applications that rely on fast data access. <a href="http://www.datastax.com">DataStax</a>, a startup commercializing the Cassandra database, has <a href="http://gigaom.com/cloud/datastax-gets-11m-fuses-nosql-and-hadoop/">fused Hadoop atop Cassandra</a> to provide web applications fast access to data processed by Hadoop, and Hadoop fast access to data streaming into Cassandra from web users.</p>
<p><strong>Ceph<br />
</strong></p>
<p><a href="http://gigaom2.files.wordpress.com/2012/07/stack-copy.jpg"><img  title="stack copy" src="http://gigaom2.files.wordpress.com/2012/07/stack-copy.jpg?w=300&#038;h=279" alt="" width="300" height="279" class="alignright size-medium wp-image-541758" /></a>Ceph is an open source, multi-pronged storage system that was recently <a href="http://gigaom.com/cloud/inktank-launches-to-change-the-face-of-open-source-storage/"> commercialized by a startup called Inktank</a>. Among its features is a high-performance parallel file system that <a href="http://www.itworld.com/big-datahadoop/262612/ceph-extends-storage-open-scalability">some think makes it a candidate for replacing HDFS</a> (and then some) in Hadoop environments. Indeed, some researchers started <a href="www.soe.ucsc.edu/~carlosm/Papers/eestolan-nsdi10-abstract.pdf">looking at this possibility as far back as 2010</a>.</p>
<p><strong>Dispersed Storage Network (Cleversafe)<br />
</strong></p>
<p><a href="http://gigaom2.files.wordpress.com/2012/07/object-based-access-methods.gif"><img  title="object-based-access-methods" src="http://gigaom2.files.wordpress.com/2012/07/object-based-access-methods.gif?w=300&#038;h=208" alt="" width="300" height="208" class="alignright size-medium wp-image-541757" /></a>Cleversafe <a href="http://www.cleversafe.com/press-releases/cleversafe-first-to-deliver-breakthrough-capabilities-for-combined-storage-and-massive-computation">got into the HDFS-replacement business on Monday</a>, announcing a product that will fuse Hadoop MapReduce with the company&#8217;s Dispersed Storage Network system. By fully distributing metadata across the cluster (instead of relying on a single NameNode) and not relying on replication, Cleversafe says it&#8217;s much faster, more reliable and scalable than HDFS.</p>
<p><strong>GPFS (IBM)<br />
</strong></p>
<p><a href="http://gigaom2.files.wordpress.com/2012/07/gpfs.jpg"><img  title="gpfs" src="http://gigaom2.files.wordpress.com/2012/07/gpfs.jpg?w=300&#038;h=135" alt="" width="300" height="135" class="alignright size-medium wp-image-541756" /></a>IBM has been selling its General Parallel File System to high-performance computing customers for years (including within some of the world&#8217;s fastest supercomputers), and in 2010 it <a href="http://database-diary.com/2011/11/30/comparing-hdfs-and-gpfs-for-hadoop/">tuned GPFS for Hadoop</a>. IBM claims the GPFS-SNC (Shared Nothing Cluster) edition is so much faster than Hadoop in part because it runs at the kernel level as opposed to atop the OS like HDFS.</p>
<p><strong>Isilon (EMC)<br />
</strong></p>
<p><a href="http://gigaom2.files.wordpress.com/2012/07/isilon-hadoop.jpg"><img  title="isilon hadoop" src="http://gigaom2.files.wordpress.com/2012/07/isilon-hadoop.jpg?w=300&#038;h=199" alt="" width="300" height="199" class="alignright size-medium wp-image-541753" /></a>EMC has offered its own Hadoop distributions for more than a year, but in January 2012 it unveiled a new method for making HDFS enterprise-class &#8212; <a href="http://gigaom.com/cloud/emc-delivers-on-isilon-hadoop-bundle/">replace it with EMC Isilon&#8217;s OneFS file system</a>. Technically, as EMC&#8217;s Chuck Hollis <a href="http://chucksblog.emc.com/chucks_blog/2012/01/hdfs-coming-to-an-array-near-you.html">explained at the time</a>, because Isilon can read NFS, CIFS and HDFS protocols, a single Isilon NAS system can serve to intake, process and analyze data.</p>
<p><strong>Lustre</strong></p>
<p><a href="http://gigaom2.files.wordpress.com/2012/07/lustre.jpg"><img  title="lustre" src="http://gigaom2.files.wordpress.com/2012/07/lustre.jpg?w=300&#038;h=205" alt="" width="300" height="205" class="alignright size-medium wp-image-541761" /></a><a href="http://wiki.lustre.org/index.php/Main_Page">Lustre</a> is a an open source high-performance file system that some claim can make for an HDFS alternative where performance is a major concern. Truth be told, I haven&#8217;t heard of this combination running anywhere in the wild, but HPC storage provider Xyratex <a href="http://www.xyratex.com/pdfs/whitepapers/Xyratex_white_paper_MapReduce_1-4.pdf">wrote a paper on the combination in 2011</a>, claiming a Lustre-based cluster (even with InfiniBand) will be faster and cheaper than an HDFS-based cluster.</p>
<p><strong>MapR File System<br />
</strong></p>
<p><a href="http://gigaom2.files.wordpress.com/2012/07/compsol-diag3-1.jpg"><img  title="compsol-diag3-1" src="http://gigaom2.files.wordpress.com/2012/07/compsol-diag3-1.jpg?w=300&#038;h=266" alt="" width="300" height="266" class="alignright size-medium wp-image-541754" /></a>The MapR File System is probably the best-known HDFS alternative, as it&#8217;s the basis of MapR&#8217;s increasingly popular &#8212; <a href="http://gigaom.com/cloud/investors-make-20m-bet-on-mapr-to-win-hadoop-war/">and well-funded</a> &#8212; Hadoop distribution. Not only does MapR claim its file system is two to five times faster than HDFS on average (although, <a href="http://www.mapr.com/products/only-with-mapr/scalable">really, up to 20 times faster</a>), but it has features such as mirroring, snapshots and high availability that enterprise customers love.</p>
<p><strong>NetApp Open Solution for Hadoop</strong></p>
<p><a href="http://gigaom2.files.wordpress.com/2012/07/netapp.jpg"><img  title="netapp" src="http://gigaom2.files.wordpress.com/2012/07/netapp.jpg?w=300&#038;h=279" alt="" width="300" height="279" class="alignright size-medium wp-image-541755" /></a>OK, the <a href="http://www.netapp.com/us/solutions/infrastructure/hadoop.html">NetApp Open Solution for Hadoop</a> isn&#8217;t so much an HDFS replacement as it is an HDFS <em>improvement</em>, <a href="http://gigaom.com/cloud/netapp-does-network-attached-hadoop/">according to NetApp and early partner Cloudera</a>. The offering still relies on HDFS, but it reenvisions the physical Hadoop architecture by putting HDFS on a RAID array. This, NetApp claims, means faster, more reliable and more secure Hadoop jobs.</p>
<p>This might be a good place to say rest in peace to two other HDFS alternatives that are effectively no longer with us &#8212; <a href="http://code.google.com/p/kosmosfs/">KosmosFS</a> (aka CloudStore) and <a href="http://gigaom.com/2010/03/15/appistry-joins-cloudscale-storage-fray-and-brings-hadoop-with-it/">Appistry CloudIQ Storage</a>. The former was created by Kosmix (<a href="http://gigaom.com/2011/09/14/what-media-companies-can-learn-from-walmart/">since bought by @WalmartLabs</a>) and released to the open source world in 2007, but no longer has an active community. The latter was an attempt by Appistry in 2010 to get a piece of the Hadoop pie with its computational storage technology, but the company has since switched its focus from selling the technology to <a href="http://gigaom.com/2012/03/22/appistry-structure-data-2012/">providing high-performance computing services based on it</a>.</p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-177808p1.html">Shutterstock user Panos Karapanagiotis</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2012/07/shutterstock_16533076.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2012/07/shutterstock_16533076.jpg?w=150" medium="image">
			<media:title type="html">achilles heel</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/07/shutterstock_16533076.jpg?w=300" medium="image">
			<media:title type="html">achilles heel</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/07/datastax_marketecture_a1-copy.jpg?w=300" medium="image">
			<media:title type="html">datastax_marketecture_A1 copy</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/07/stack-copy.jpg?w=300" medium="image">
			<media:title type="html">stack copy</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/07/object-based-access-methods.gif?w=300" medium="image">
			<media:title type="html">object-based-access-methods</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/07/gpfs.jpg?w=300" medium="image">
			<media:title type="html">gpfs</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/07/isilon-hadoop.jpg?w=300" medium="image">
			<media:title type="html">isilon hadoop</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/07/lustre.jpg?w=300" medium="image">
			<media:title type="html">lustre</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/07/compsol-diag3-1.jpg?w=300" medium="image">
			<media:title type="html">compsol-diag3-1</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/07/netapp.jpg?w=300" medium="image">
			<media:title type="html">netapp</media:title>
		</media:content>
	</item>
		<item>
		<title>Why the days are numbered for Hadoop as we know it</title>
		<link>http://gigaom.com/2012/07/07/why-the-days-are-numbered-for-hadoop-as-we-know-it/</link>
		<comments>http://gigaom.com/2012/07/07/why-the-days-are-numbered-for-hadoop-as-we-know-it/#comments</comments>
		<pubDate>Sat, 07 Jul 2012 17:30:54 +0000</pubDate>
		<dc:creator>Mike Miller, Cloudant</dc:creator>
				<category><![CDATA[big data]]></category>
		<category><![CDATA[BigQuery]]></category>
		<category><![CDATA[Dremel]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[graph databases]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Hbase]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Pregel]]></category>
		<category><![CDATA[real-time processing]]></category>
		<category><![CDATA[Storm]]></category>
		<category><![CDATA[Web Infrastructure]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=540391</guid>
		<description><![CDATA[For better or worse, Hadoop has become synonymous with big data. In just a few years it has gone from a fringe technology to the de facto standard. But is the enterprise buying into a technology whose best day has already passed?]]></description>
				<content:encoded><![CDATA[<p><a href="http://gigaom2.files.wordpress.com/2012/07/elephant-walking-away.jpg"><img  title="elephant walking away" src="http://gigaom2.files.wordpress.com/2012/07/elephant-walking-away-e1341677481803.jpg?w=300&#038;h=200" alt="" width="300" height="200" class="alignright size-medium wp-image-540408" /></a>Hadoop is everywhere. For better or worse, it has become synonymous with big data. In just a few years it has gone from a fringe technology to the de facto standard. Want to be big data or enterprise analytics or BI-compliant? You better play well with Hadoop.</p>
<p>It&#8217;s therefore far from controversial to say that Hadoop is firmly planted in the enterprise as the big data standard and will likely remain firmly entrenched for at least another decade. But, <a href="http://gigaom.com/cloud/democratizing-big-data-is-hadoop-our-only-hope/">building on some previous discussion</a>, I’m going to go out on a limb and ask, “Is the enterprise buying into a technology whose best day has already passed?”</p>
<h2>First, there were Google File System and Google MapReduce</h2>
<p>To study this question we need to return to Hadoop’s inspiration – Google’s MapReduce. Confronted with a data explosion, Google engineers Jeff Dean and Sanjay Ghemawat architected (and published!) two seminal systems: the <a href="http://research.google.com/archive/gfs.html">Google File System</a> (GFS) and <a href="http://research.google.com/archive/mapreduce.html">Google MapReduce</a> (GMR). The former was a brilliantly pragmatic solution to exabyte-scale data management using commodity hardware. The latter was an equally brilliant <em>implementation </em>of a long-standing design pattern applied to massively parallel processing of said data on said commodity machines.</p>
<p>GMR’s brilliance was to make big data processing approachable to Google’s typical user/developer and to make it fast and fault tolerant. Simply put, it boiled data processing at scale down to the bare essentials and took care of everything else. GFS and GMR became the core of the processing engine used to crawl, analyze, and rank web pages into the giant inverted index that we all use daily at google.com. This was clearly a major advantage for Google.</p>
<p>Enter reverse engineering in the open source world, and, voila, <a href="http://hadoop.apache.org">Apache Hadoop</a> &#8212; comprised of the Hadoop Distributed File System and Hadoop MapReduce &#8212; was born in the image of GFS and GMR. Yes, Hadoop is developing into an ecosystem of projects that touch nearly all parts of data management and processing. But, at its core, it is a MapReduce system. Your code is turned into map and reduce <em>jobs</em>, and Hadoop runs those <em>jobs</em> for you.</p>
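<p>To make the map-and-reduce split concrete, here is a minimal single-process sketch in Python of a word-count job. It is an illustration only, not Hadoop&#8217;s API: the names <code>map_phase</code>, <code>shuffle</code>, <code>reduce_phase</code> and <code>run_job</code>, as well as the two sample documents, are assumptions made up for this example. A real cluster would spread the map and reduce calls across many machines and perform the grouping step over the network.</p>

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit one (word, 1) pair for every token in a document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce step: collapse the list of values for one key into a total."""
    return word, sum(counts)


def run_job(documents):
    """Run map, shuffle and reduce in sequence over a dict of documents."""
    pairs = [
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    ]
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


# Hypothetical sample input, invented for this sketch.
documents = {"d1": "big data big cluster", "d2": "data cluster data"}
word_counts = run_job(documents)
print(word_counts)
```

<p>Note that <code>run_job</code> always reads every document, which is the batch-oriented behavior the rest of this piece takes issue with.</p>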
<h2>Then Google evolved. Can Hadoop catch up?</h2>
<p>Most interesting to me, however, is that GMR no longer holds such prominence in the Google stack. Just as the enterprise is locking into MapReduce, Google seems to be moving past it. In fact, many of the technologies I’m going to discuss below aren’t even new; they date back to the second half of the last decade, mere years after the seminal GMR paper was in print.</p>
<p><a href="http://gigaom2.files.wordpress.com/2012/07/wheel.jpg"><img  title="wheel" src="http://gigaom2.files.wordpress.com/2012/07/wheel.jpg?w=708" alt=""   class="aligncenter size-full wp-image-540411" /></a></p>
<p>Here are technologies that I hope will ultimately seed the post-Hadoop era. While many Apache projects and commercial Hadoop distributions are actively trying to address some of the issues below via technologies and features such as <a href="http://hbase.apache.org/">HBase</a>, <a href="http://hive.apache.org/">Hive</a> and <a href="http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/YARN.html">Next-Generation MapReduce (aka YARN)</a>, it is my opinion that it will require new, non-MapReduce-based architectures that leverage the Hadoop core (HDFS and Zookeeper) to truly compete with Google’s technology. (A more technical exposition with published benchmarks is available at <a href="http://www.slideshare.net/mlmilleratmit/gluecon-miller-horizon">http://www.slideshare.net/mlmilleratmit/gluecon-miller-horizon</a>.)</p>
<p><strong>Percolator for incremental indexing and analysis of frequently changing datasets</strong>. Hadoop is a big machine. Once you get it up to speed it’s great at crunching your data. Get the disks spinning forward as fast as you can. However, each time you want to analyze the data (say after adding, modifying or deleting data) you have to stream over the entire dataset. If your dataset is always growing, this means your analysis time also grows without bound.</p>
<p>So, how does Google manage to make its search results increasingly real-time? By displacing GMR in favor of an incremental processing engine called <a href="http://research.google.com/pubs/pub36726.html"><strong>Percolator</strong></a>. By dealing only with new, modified, or deleted documents and using secondary indices to efficiently catalog and query the resulting output, Google was able to dramatically decrease the time to value. As the authors of the Percolator paper write, “[C]onverting the indexing system to an incremental system … reduced the average document processing latency by a factor of 100.” This means that new content on the Web could be indexed 100 times faster than possible using the MapReduce system!</p>
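<p>The difference between rescanning everything and touching only what changed can be sketched in a few lines of Python. This is a toy inverted index, not Percolator itself (which, per its paper, is built on Bigtable with transactions and observers); the function names <code>build_index</code> and <code>apply_change</code> and the sample documents are assumptions invented for illustration.</p>

```python
from collections import defaultdict


def build_index(documents):
    """Batch approach: scan every document to build word -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in set(text.lower().split()):
            index[word].add(doc_id)
    return index


def apply_change(index, documents, doc_id, new_text=None):
    """Incremental approach: update only the postings of one changed document.

    Passing new_text=None deletes the document.
    """
    old_text = documents.pop(doc_id, "")
    for word in set(old_text.lower().split()):
        index[word].discard(doc_id)
        if not index[word]:
            del index[word]  # drop words that no longer appear anywhere
    if new_text is not None:
        documents[doc_id] = new_text
        for word in set(new_text.lower().split()):
            index[word].add(doc_id)


# Hypothetical sample corpus, invented for this sketch.
docs = {"a": "hadoop stores data", "b": "google indexes data"}
index = build_index(docs)

apply_change(index, docs, "a", "hadoop streams data")  # modify one document
apply_change(index, docs, "b")                         # delete one document

# The incrementally maintained index should agree with a full rebuild.
incremental_matches_rebuild = dict(index) == dict(build_index(docs))
print(incremental_matches_rebuild)
```

<p>The work done by <code>apply_change</code> depends on the size of the changed document, while <code>build_index</code> grows with the whole corpus, which is the gap described above.</p>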
<p>Coming from the Large Hadron Collider (an ever-growing big data corpus), this topic is near and dear to my heart. Some datasets simply never stop growing. It is why we baked a similar approach deep into the Cloudant data layer service, it is why trigger-based processing is now available in HBase, and it is a primary reason that <a href="http://gigaom.com/cloud/twitter-to-open-source-hadoop-like-tool/">Twitter Storm is gaining momentum</a> for real-time processing of stream data.</p>
<p><strong>Dremel for ad hoc analytics</strong>. Google and the Hadoop ecosystem worked very hard to make MapReduce an approachable tool for ad hoc analyses. From <a href="http://research.google.com/archive/sawzall.html">Sawzall</a> through <a href="http://pig.apache.org/">Pig</a> and Hive, many interface layers have been built. Yet, for all of the SQL-like familiarity, they ignore one fundamental reality – MapReduce (and thereby Hadoop) is purpose-built for organized data processing (<em>jobs</em>). It is baked from the core for workflows, not ad hoc exploration.</p>
<p>In stark contrast, many BI/analytics queries are fundamentally ad hoc, interactive, low-latency analyses. Not only is writing map and reduce workflows prohibitive for many analysts, but waiting minutes for jobs to start and hours for workflows to complete is not conducive to the interactive experience. Therefore, Google invented <a href="http://research.google.com/pubs/pub36632.html"><strong>Dremel</strong></a> (now <a href="http://gigaom.com/cloud/google-opens-up-its-biq-query-data-analytics-service-to-all/">exposed as the BigQuery product</a>) as a purpose-built tool to allow analysts to scan over petabytes of data in seconds to answer ad hoc queries and, presumably, power compelling visualizations.</p>
<div id="attachment_540412" class="wp-caption aligncenter" style="width: 614px"><a href="http://gigaom2.files.wordpress.com/2012/07/big_banner.jpg"><img  title="big_banner" src="http://gigaom2.files.wordpress.com/2012/07/big_banner.jpg?w=604&#038;h=230" alt="" width="604" height="230" class="size-large wp-image-540412" /></a><p class="wp-caption-text">Google BigQuery</p></div>
<p>Google&#8217;s Dremel paper says it is “capable of running aggregation queries over trillions of rows in seconds,” and the same paper notes that running identical queries in standard MapReduce is approximately 100 times slower than in Dremel. Most impressive, however, is real world data from production systems at Google, where the vast majority of Dremel queries complete in less than 10 seconds, a time well below the typical latencies of even beginning execution of a MapReduce workflow and its associated jobs.</p>
<p>Interestingly, I’m not aware of any compelling open source alternatives to Dremel at the time of this writing and consider this a fantastic BI/analytics opportunity.</p>
<p><strong>Pregel for analyzing graph data</strong>. Google MapReduce was purpose-built for crawling and analyzing the world’s largest graph data structure – the internet. However, certain core assumptions of MapReduce are at fundamental odds with analyzing networks of people, telecommunications equipment, documents and other graph data structures. For example, calculation of the single-source shortest path (SSSP) through a graph requires copying the graph forward to future MapReduce passes, an amazingly inefficient approach and simply untenable at scale.</p>
<p><a href="http://gigaom2.files.wordpress.com/2012/07/bigdata_goldenorb-graph-1.jpeg"><img  title="bigdata_goldenorb-graph (1)" src="http://gigaom2.files.wordpress.com/2012/07/bigdata_goldenorb-graph-1.jpeg?w=300&#038;h=156" alt="" width="300" height="156" class="alignleft size-medium wp-image-540413" /></a>Therefore, Google built <a href="http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html"><strong>Pregel</strong></a>, a large bulk synchronous processing application for petabyte -scale graph processing on distributed commodity machines. The results are impressive. In contrast to Hadoop, which often causes exponential data amplification in graph processing, Pregel is able to naturally and efficiently execute graph algorithms such as SSSP or PageRank in dramatically shorter time and with significantly less complicated code. Most stunning is the published data demonstrating processing on billions of nodes with trillions of edges in mere minutes, with a near linear scaling of execution time with graph size.</p>
<p>At the time of writing, the only viable option in the open source world is <a href="http://giraph.apache.org/">Giraph</a>, an early Apache incubator project that leverages HDFS and Zookeeper. There&#8217;s another project called <a href="http://goldenorbos.org/">Golden Orb</a> available on GitHub.</p>
<p>In summary, Hadoop is an incredible tool for large-scale data processing on clusters of commodity hardware. But if you’re trying to process dynamic data sets, run ad-hoc analytics or analyze graph data structures, Google’s own actions clearly demonstrate better alternatives to the MapReduce paradigm. Percolator, Dremel and Pregel make an impressive trio and comprise the new canon of big data. I would be shocked if they don’t have an impact on IT similar to that of Google’s original big three: GFS, GMR and BigTable.</p>
<p><em>Mike Miller (<a href="https://twitter.com/mlmilleratmit">@mlmilleratmit</a>) is chief scientist and co-founder at Cloudant, and Affiliate Professor of Particle Physics at the University of Washington.</em></p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-375532p1.html">Shutterstock user Jason Prince</a>; evolution of the wheel image courtesy of <a href="http://www.shutterstock.com/gallery-66151p1.html">Shutterstock user James Steidl</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2012/07/07/why-the-days-are-numbered-for-hadoop-as-we-know-it/feed/</wfw:commentRss>
		<slash:comments>30</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2012/07/elephant-walking-away-e1341677481803.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2012/07/elephant-walking-away-e1341677481803.jpg?w=150" medium="image">
			<media:title type="html">elephant walking away</media:title>
		</media:content>

		<media:content url="http://1.gravatar.com/avatar/4411542bbd7a2a9a2fc2a1b38809e45c?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">gigaguest</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/07/elephant-walking-away-e1341677481803.jpg?w=300" medium="image">
			<media:title type="html">elephant walking away</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/07/wheel.jpg" medium="image">
			<media:title type="html">wheel</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/07/big_banner.jpg?w=604" medium="image">
			<media:title type="html">big_banner</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/07/bigdata_goldenorb-graph-1.jpeg?w=300" medium="image">
			<media:title type="html">bigdata_goldenorb-graph (1)</media:title>
		</media:content>
	</item>
	</channel>
</rss>
