<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>GigaOM &#187; elastic-mapreduce</title>
	<atom:link href="http://gigaom.com/tag/elastic-mapreduce/feed/" rel="self" type="application/rss+xml" />
	<link>http://gigaom.com</link>
	<description></description>
	<lastBuildDate>Wed, 22 May 2013 12:14:11 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='gigaom.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://0.gravatar.com/blavatar/0db8f6557d022075dbbf010c54d46d93?s=96&#038;d=http%3A%2F%2Fs2.wp.com%2Fi%2Fbuttonw-com.png</url>
		<title>GigaOM &#187; elastic-mapreduce</title>
		<link>http://gigaom.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://gigaom.com/osd.xml" title="GigaOM" />
	<atom:link rel='hub' href='http://gigaom.com/?pushpress=hub'/>
		<item>
		<title>Netflix shows off how it does Hadoop in the cloud</title>
		<link>http://gigaom.com/2013/01/10/netflix-shows-off-its-hadoop-architecture/</link>
		<comments>http://gigaom.com/2013/01/10/netflix-shows-off-its-hadoop-architecture/#comments</comments>
		<pubDate>Fri, 11 Jan 2013 05:13:24 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[Amazon Web Services]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[elastic-mapreduce]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Netflix]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=600969</guid>
		<description><![CDATA[Netflix is at it again, this time showing off its homemade architecture for running Hadoop workloads in the Amazon Web Services cloud. It's all about the flexibility of being able to run, manage and access multiple clusters while eliminating as many barriers as possible.]]></description>
				<content:encoded><![CDATA[<p>Netflix is the undeniable king of computing in the cloud &#8212; running almost entirely on the Amazon Web Services platform &#8212; and its reign expands into big data workloads, too. In a Thursday evening blog post, the company <a href="http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html">shared the details of its AWS-based Hadoop architecture</a> and a homemade Hadoop Platform as a Service that it calls Genie.</p>
<p>That Netflix is a heavy Hadoop user is hardly news, though. In June, I <a href="http://gigaom.com/2012/06/14/netflix-analyzes-a-lot-of-data-about-your-viewing-habits/">explained just how much data Netflix collects</a> about users and some of the methods it uses to analyze that data. Hadoop is the storage and processing engine for much of this work.</p>
<p><a href="http://gigaom2.files.wordpress.com/2013/01/hadoop-nflx.jpg"><img  alt="hadoop nflx" src="http://gigaom2.files.wordpress.com/2013/01/hadoop-nflx.jpg?w=300&#038;h=259" width="300" height="259" class="alignleft size-medium wp-image-601005" /></a>As blog post author Sriram Krishnan points out, however, Hadoop is more than a platform on which data scientists and business analysts can do their work. Aside from the company&#8217;s 500-plus-node cluster of Elastic MapReduce instances, there&#8217;s another equally sized cluster for extract-transform-load (ETL) workloads &#8212; essentially, taking data from other sources and making it easy to analyze within Hadoop. Netflix also deploys various &#8220;development&#8221; clusters as needed, presumably for ad hoc experimental jobs.</p>
<p>And while Netflix&#8217;s data-analysis efforts are pretty interesting, the cloud makes its Hadoop architecture pretty interesting, too. For starters, Krishnan explains how using S3 as the storage layer instead of the Hadoop Distributed File System means, among other things, that Netflix can run all of its clusters separately while sharing the same data set. It does, however, use HDFS at some points in the computation process to make up for the inherently slower method of accessing data via S3.</p>
<p>Netflix also built its own PaaS-like layer for Amazon Elastic MapReduce, which it calls Genie. This lets engineers submit jobs via a REST API without having to know the specifics of the underlying infrastructure. This is important because it means Hadoop users can submit jobs to whatever clusters happen to be available at any given time (Krishnan goes into some detail about the resource-management aspects of Genie) without worrying about the sometimes-transient nature of cloud resources.</p>
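<p>To make the idea of routing jobs to &#8220;whatever clusters happen to be available&#8221; concrete, here is a minimal sketch in Python. It is not Genie&#8217;s API or Netflix&#8217;s logic: the <code>Cluster</code> fields, the cluster names and the least-loaded rule are all assumptions made for illustration.</p>

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    name: str          # hypothetical cluster label
    kind: str          # e.g. "query" or "etl" (labels assumed for illustration)
    running: bool      # cloud clusters can come and go
    pending_jobs: int  # crude load signal


def pick_cluster(clusters: List[Cluster], kind: str) -> Optional[Cluster]:
    """Return the least-loaded running cluster of the requested kind, or None."""
    candidates = [c for c in clusters if c.running and c.kind == kind]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c.pending_jobs)


clusters = [
    Cluster("query-a", "query", running=True, pending_jobs=12),
    Cluster("query-b", "query", running=True, pending_jobs=3),
    Cluster("etl-a", "etl", running=False, pending_jobs=0),
]

chosen = pick_cluster(clusters, "query")
print(chosen.name if chosen else "no cluster available")  # query-b
```

<p>The caller only names the kind of work; which cluster actually runs it is decided at submission time, so a cluster that has been shut down is simply never chosen.</p>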
<p>We&#8217;ve long been pushing the intersection of big data and cloud computing, although the reality is that there aren&#8217;t really a lot of commercial options that mix user-friendliness and heavy-duty Hadoop workload management. There&#8217;ll no doubt be more offerings in the future &#8212; <a href="http://gigaom.com/2012/08/07/infochimps-makes-its-big-data-for-developers-platform-real-time/">Infochimps</a> and <a href="http://gigaom.com/2012/10/23/ex-yahoo-facebook-big-data-vets-launch-paas-for-hadoop/">Continuuity</a> are certainly working in this direction, and <a href="http://gigaom.com/2012/11/30/why-amazon-thinks-big-data-was-made-for-the-cloud/">Amazon is also pushing its big data offerings forward</a> &#8212; but, for now, leave it to Netflix to build its own. (And if you&#8217;re interested in custom-built Hadoop tools, <a href="http://gigaom.com/2012/11/08/facebook-open-sources-corona-a-better-way-to-do-webscale-hadoop/">check out our recent coverage</a> of Facebook&#8217;s latest effort.)</p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/01/10/netflix-shows-off-its-hadoop-architecture/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/01/hadoop-nflx1-e1357879885117.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/01/hadoop-nflx1-e1357879885117.jpg?w=150" medium="image">
			<media:title type="html">hadoop nflx</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/01/hadoop-nflx.jpg?w=300" medium="image">
			<media:title type="html">hadoop nflx</media:title>
		</media:content>
	</item>
		<item>
		<title>Pinterest, Flipboard and Yelp tell how to save big bucks in the cloud</title>
		<link>http://gigaom.com/2012/12/02/pinterest-flipboard-and-yelp-tell-how-to-save-big-bucks-in-the-cloud/</link>
		<comments>http://gigaom.com/2012/12/02/pinterest-flipboard-and-yelp-tell-how-to-save-big-bucks-in-the-cloud/#comments</comments>
		<pubDate>Sun, 02 Dec 2012 21:30:39 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[Amazon EC2]]></category>
		<category><![CDATA[Amazon Web Services]]></category>
		<category><![CDATA[AWS re: Invent]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[elastic-mapreduce]]></category>
		<category><![CDATA[Flipboard]]></category>
		<category><![CDATA[Greg Scallan]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[iaas]]></category>
		<category><![CDATA[Pinterest]]></category>
		<category><![CDATA[Web Infrastructure]]></category>
		<category><![CDATA[webscale]]></category>
		<category><![CDATA[yelp]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=590008</guid>
		<description><![CDATA[At the AWS Re: Invent conference, engineers from Pinterest, Flipboard and Yelp detailed some of the strategies their companies employ in order to keep costs low as computing demand increases. The keys are keeping an eagle eye on usage and using the right types of resources.]]></description>
				<content:encoded><![CDATA[<p>Amazon Web Services can be a great platform for startups when they&#8217;re small, but costs can outpace revenue growth pretty quickly &#8212; especially if you&#8217;re offering a free consumer service. At AWS&#8217;s Re: Invent user conference last week, engineers from Pinterest, Flipboard and Yelp shared their impressive and sometimes ingenious techniques for keeping costs under control and their bottom lines healthy.</p>
<p>Pinterest Operations Engineer Ryan Park had the stage to himself for a session on Wednesday, while Flipboard Chief Architect Greg Scallan and Yelp Engineering Manager Jim Blomo teamed up with Kleiner Perkins Caufield Byers Partner Ray Bradford to form a trifecta of wisdom on Thursday.</p>
<h2>Know &#8212; and measure &#8212; your costs</h2>
<p>Flipboard&#8217;s Scallan had a paradoxical lesson for the audience when it comes to managing cloud-based infrastructure: Embrace the cloud, but be afraid of the cloud. Yes, it&#8217;s flexible and affordable if done right, but all it takes is poor planning or a handful of servers left running ad infinitum, and the costs can begin to grow out of control. That&#8217;s why Flipboard assigns members of its engineering team the title of &#8220;chief miser,&#8221; which means they&#8217;re the ones who determine whether applications are using the right resources and using them wisely.</p>
<p>Thanks to a variety of practices, including its miserly ways, Scallan said Flipboard is now running about 900 instances at any given time. That&#8217;s down from a peak of about 1,500.</p>
<div id="attachment_590210" class="wp-caption aligncenter" style="width: 614px"><a href="http://gigaom2.files.wordpress.com/2012/12/20121129_1528212.jpg"><img  alt="Some stats on Flipboard's AWS usage" src="http://gigaom2.files.wordpress.com/2012/12/20121129_1528212.jpg?w=604&#038;h=453" height="453" width="604" class="size-large wp-image-590210" /></a><p class="wp-caption-text">Some stats on Flipboard&#8217;s AWS usage</p></div>
<p>One way to help ensure this sort of lean operation is to understand your business inputs and outputs, Kleiner Perkins&#8217;s Bradford explained. He suggests companies ask, for example, what it costs them to serve a free user on their platform and how that changes with scale or affects the experience they can offer premium users. Pick metrics that really matter, he said (e.g., infrastructure cost per user per month) and then consider how long your current architecture can sustain that cost before it&#8217;s time to retool.</p>
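<p>As a rough illustration of the metric Bradford mentions, infrastructure cost per user per month can be tracked with a few lines of Python. The dollar amounts and user counts below are invented for the example and do not come from any of the companies discussed.</p>

```python
def cost_per_user(monthly_infra_cost: float, monthly_active_users: int) -> float:
    """Infrastructure cost per user per month."""
    if monthly_active_users <= 0:
        raise ValueError("monthly_active_users must be positive")
    return monthly_infra_cost / monthly_active_users


# (label, monthly infrastructure bill in dollars, monthly active users) - all invented
snapshots = [
    ("today", 30_000.0, 1_000_000),
    ("after 5x user growth", 120_000.0, 5_000_000),
]

for label, cost, users in snapshots:
    print(f"{label}: ${cost_per_user(cost, users):.4f} per user per month")
```

<p>Comparing the figure across snapshots shows whether the architecture is getting cheaper or more expensive per user as the service scales.</p>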
<h2>The secret weapon: Source your instances wisely</h2>
<p>Pinterest, Yelp and Flipboard all swear by <a href="http://aws.amazon.com/ec2/reserved-instances/">AWS&#8217;s pre-paid Reserved Instances</a> in order to save money over the long haul. In fact, Flipboard&#8217;s Scallan said, the e-reading startup sees cost savings of about 80 percent over three years by using heavy-duty Reserved Instances instead of on-demand instances for its base workloads, and the break-even point might be only eight or nine months. Pinterest&#8217;s Park cited savings of about 70 percent over three years using them.</p>
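<p>The break-even reasoning behind Reserved Instances can be sketched as follows. This is a simplified model that assumes a one-time upfront fee plus a discounted hourly rate and an instance that runs around the clock; the prices are placeholders, not actual AWS rates.</p>

```python
from typing import Optional

HOURS_PER_MONTH = 730  # average month, assuming the instance runs 24/7


def break_even_months(upfront: float, reserved_hourly: float,
                      on_demand_hourly: float) -> Optional[float]:
    """Months until the upfront fee is repaid by the lower hourly rate."""
    saving_per_month = (on_demand_hourly - reserved_hourly) * HOURS_PER_MONTH
    if saving_per_month <= 0:
        return None  # the reservation never pays for itself
    return upfront / saving_per_month


def savings_fraction(upfront: float, reserved_hourly: float,
                     on_demand_hourly: float, months: int) -> float:
    """Fraction saved versus on-demand pricing over the whole term."""
    on_demand_total = on_demand_hourly * HOURS_PER_MONTH * months
    reserved_total = upfront + reserved_hourly * HOURS_PER_MONTH * months
    return 1 - reserved_total / on_demand_total


# Placeholder prices, for illustration only
months = break_even_months(upfront=1000.0, reserved_hourly=0.05, on_demand_hourly=0.25)
saved = savings_fraction(1000.0, 0.05, 0.25, months=36)
print(f"break-even after {months:.1f} months, {saved:.0%} saved over three years")
```

<p>The model only holds for base workloads that really do run continuously; an instance that sits idle for part of the term pushes the break-even point later.</p>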
<div id="attachment_590209" class="wp-caption alignright" style="width: 310px"><a href="http://gigaom2.files.wordpress.com/2012/12/20121129_154538.jpg"><img  alt="20121129_154538" src="http://gigaom2.files.wordpress.com/2012/12/20121129_154538.jpg?w=300&#038;h=225" height="225" width="300" class="size-medium wp-image-590209" /></a><p class="wp-caption-text">The trick is queuing another job to take up the waste.</p></div>
<p>Yelp&#8217;s Blomo said his company is a heavy Elastic MapReduce (EMR) user, peaking at more than 350 Elastic MapReduce instances when many developers run their Hadoop jobs simultaneously or when it&#8217;s doing nightly analysis of its log files. In order to keep costs in check, Yelp uses Reserved Instances whenever possible to save on hourly bills and has implemented a job-flow pooling system to keep Hadoop jobs running continuously as resources become available. This helps avoid the situation where a job completes in 61 minutes, for example, thus triggering the charge for a full hour of resources even though it only used a minute&#8217;s worth of the second hour.</p>
<p>In order to best gauge when it should use what type of instance, Yelp <a href="http://engineeringblog.yelp.com/2012/07/introducing-emrio-optimize-your-aws-bills.html">created a tool called EMRio</a> that analyzes past usage to determine what resources are the most efficient choice for any given job.</p>
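<p>The 61-minute example comes down to rounding up to whole hours. The sketch below assumes per-instance billing in whole hours, rounded up, with a one-hour minimum; the job runtimes are invented, and the pooling shown is a simplification rather than Yelp&#8217;s actual system.</p>

```python
import math


def billed_hours(runtime_minutes: float) -> int:
    """Hours charged when every started hour is billed in full."""
    return max(1, math.ceil(runtime_minutes / 60))


def wasted_minutes(runtime_minutes: float) -> float:
    """Minutes paid for but not used."""
    return billed_hours(runtime_minutes) * 60 - runtime_minutes


print(billed_hours(61), wasted_minutes(61))  # 2 59

# Invented runtimes (minutes) for three jobs
jobs = [61, 45, 20]

separate = sum(billed_hours(j) for j in jobs)  # each job on its own cluster
pooled = billed_hours(sum(jobs))               # jobs run back to back on one cluster
print(separate, pooled)  # 4 3
```

<p>Queuing the next job onto an already-paid-for hour is what recovers the waste: three jobs billed separately cost four instance-hours here, while the same work pooled costs three.</p>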
<div id="attachment_590216" class="wp-caption aligncenter" style="width: 614px"><a href="http://gigaom2.files.wordpress.com/2012/12/emrio.jpg"><img  alt="emrio" src="http://gigaom2.files.wordpress.com/2012/12/emrio.jpg?w=604&#038;h=455" height="455" width="604" class="size-large wp-image-590216" /></a><p class="wp-caption-text">The results of EMRio</p></div>
<p>When it comes to optimizing costs on AWS, though, Pinterest appears to have it all figured out &#8212; even how to make use of the somewhat tricky <a href="http://aws.amazon.com/ec2/spot-instances/">Spot Instances</a> that are priced based on demand and can be terminated without notice if the market price outgrows a user&#8217;s bid. Park explained how Pinterest uses the heck out of Reserved Instances and created its own auto-scaling &#8220;watchdog&#8221; service that decides whether to use Spot Instances or on-demand instances when more resources are required.</p>
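<p>A decision of the kind Park describes for the &#8220;watchdog&#8221; service might look roughly like the sketch below. The rule, the parameter names and the prices are assumptions for illustration; Pinterest&#8217;s real service was not published in this form.</p>

```python
def choose_purchase_type(spot_price: float, on_demand_price: float,
                         max_bid: float, interruptible: bool) -> str:
    """Pick "spot" or "on-demand" for additional capacity.

    Spot capacity can be reclaimed when the market price exceeds the bid,
    so it is only considered for work that tolerates interruption.
    """
    if not interruptible:
        return "on-demand"
    if spot_price <= max_bid and spot_price < on_demand_price:
        return "spot"
    return "on-demand"


# Placeholder prices in dollars per instance-hour
print(choose_purchase_type(0.05, 0.25, max_bid=0.10, interruptible=True))   # spot
print(choose_purchase_type(0.40, 0.25, max_bid=0.10, interruptible=True))   # on-demand
print(choose_purchase_type(0.05, 0.25, max_bid=0.10, interruptible=False))  # on-demand
```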
<div id="attachment_590236" class="wp-caption alignleft" style="width: 310px"><a href="http://gigaom2.files.wordpress.com/2012/12/20121128_105509.jpg"><img  alt="Ryan Park dropping knowledge -- and graphs" src="http://gigaom2.files.wordpress.com/2012/12/20121128_105509.jpg?w=300&#038;h=225" height="225" width="300" class="size-medium wp-image-590236" /></a><p class="wp-caption-text">Ryan Park dropping knowledge &#8212; and graphs</p></div>
<p>Although Spot Instance prices <a href="http://gigaom.com/2011/12/27/how-to-deal-with-amazons-spot-server-price-spikes/">occasionally spike through the roof</a>, Park&#8217;s experience is that they typically remain stable and can result in &#8220;massive&#8221; savings if you know how to use them effectively. Using Spot Instances to power Pinterest&#8217;s approximately 80 front-end servers costs only about $20 per hour, he said. All told, Pinterest has reduced its daily computing bill to about $440 from about $1,200.</p>
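<p>Worked through in Python, Park&#8217;s figures imply the following per-server rate and overall reduction. Only the numbers quoted above are used; treating all 80 front-end servers as costing the same is an assumption.</p>

```python
front_end_servers = 80
spot_cost_per_hour = 20.0  # dollars per hour for the whole front-end fleet

per_server_hour = spot_cost_per_hour / front_end_servers
print(f"${per_server_hour:.2f} per server-hour")  # $0.25 per server-hour

daily_before = 1200.0  # dollars per day
daily_after = 440.0    # dollars per day

reduction = 1 - daily_after / daily_before
print(f"{reduction:.0%} lower daily bill")  # 63% lower daily bill
```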
<p>All this being said, though, Park, Blomo and Scallan all acknowledged that the flexibility of being able to mix on-demand, reserved and spot servers might not be all it&#8217;s cracked up to be if you don&#8217;t understand how they all work. Reserved Instances are inflexible in terms of size and region once you reserve them, and Spot Instances must be used wisely for jobs or applications that can handle their easy come, easy go nature. And now there&#8217;s even more to consider <a href="http://gigaom.com/cloud/want-to-buy-or-sell-amazon-instances-now-you-can/">because Reserved Instances can be resold</a> via AWS&#8217;s spot marketplace.</p>
<p>&#8220;It gets a little tricky,&#8221; Blomo said.</p>
<h2>Pick your challenges</h2>
<p>Although decisions such as database type and structure are largely architectural, there might be elements of cost efficiency at play, as well. Maybe Kleiner Perkins&#8217;s Bradford put it best while leading off the session with Scallan and Blomo. Bradford presented a slide containing a simple quote from Instagram Founder Mike Krieger: &#8220;Your users around the world don&#8217;t care that you wrote your own database.&#8221; Sometimes, Bradford added, it might be best to use what works &#8212; maybe even a managed service &#8212; rather than whatever&#8217;s trending highest on Hacker News.</p>
<p>Pinterest&#8217;s Park expressed a similar sentiment during his session, citing a lesson his team learned about trying out too many new databases. The site used to use MongoDB, Cassandra, Redis and other databases simultaneously, but learning all the new technologies and managing them became burdensome. Now, he said, Pinterest uses good, old-fashioned MySQL (granted, <a href="http://www.slideshare.net/eonarts/mysql-meetup-july2012scalingpinterest">it sharded MySQL 4,000 times</a>) and memcached &#8212; as well as Redis &#8212; because they have strong communities and new engineers are more likely to know how to work with them.</p>
<p>After explaining EMRio and some other custom-built Hadoop tools to the crowd, Yelp&#8217;s Blomo noted that companies should carefully consider whether the time and money it takes to build stuff will actually result in commensurate savings once those tools or systems are in production. That can require some tough balancing of criteria such as cost, performance, flexibility and user experience.</p>
<p>But it&#8217;s important to use human resources wisely. As Bradford said during his presentation, &#8220;There&#8217;s no free lunch when it comes to developer time.&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2012/12/02/pinterest-flipboard-and-yelp-tell-how-to-save-big-bucks-in-the-cloud/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2012/12/20121129_154230-e1354479446379.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2012/12/20121129_154230-e1354479446379.jpg?w=150" medium="image">
			<media:title type="html">Yelp chart</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/12/20121129_1528212.jpg?w=604" medium="image">
			<media:title type="html">Some stats on Flipboard&#039;s AWS usage</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/12/20121129_154538.jpg?w=300" medium="image">
			<media:title type="html">20121129_154538</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/12/emrio.jpg?w=604" medium="image">
			<media:title type="html">emrio</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/12/20121128_105509.jpg?w=300" medium="image">
			<media:title type="html">Ryan Park dropping knowledge -- and graphs</media:title>
		</media:content>
	</item>
		<item>
		<title>Why Amazon thinks big data was made for the cloud</title>
		<link>http://gigaom.com/2012/11/30/why-amazon-thinks-big-data-was-made-for-the-cloud/</link>
		<comments>http://gigaom.com/2012/11/30/why-amazon-thinks-big-data-was-made-for-the-cloud/#comments</comments>
		<pubDate>Fri, 30 Nov 2012 19:53:12 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[Amazon Web Services]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[elastic-mapreduce]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[high-performance computing]]></category>
		<category><![CDATA[supercomputers]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=589797</guid>
		<description><![CDATA[According to Amazon Web Services Chief Data Scientist Matt Wood, big data and cloud computing are nearly a match made in heaven. Limitless, on-demand and inexpensive resources open up new worlds of possibility, and a central platform makes it easy for communities to share huge datasets.]]></description>
				<content:encoded><![CDATA[<p>For Amazon Web Services Chief Data Scientist Matt Wood, the day isn&#8217;t spent performing data alchemy on behalf of his employer; he&#8217;s entertaining its customers. Wood helps AWS users build big data architectures that use the company&#8217;s cloud computing resources, and then takes what he learns about those users&#8217; needs and turns it into products &#8212; such as the Data Pipeline Service and <a href="http://gigaom.com/cloud/amazons-new-data-warehousing-service-takes-aim-at-old-guard-it-giants/">Redshift data warehouse</a> AWS announced this week.</p>
<div id="attachment_589879" class="wp-caption alignleft" style="width: 150px"><a href="http://gigaom2.files.wordpress.com/2012/11/20120820170634_matt-wood.jpg"><img  alt="Matt Wood" src="http://gigaom2.files.wordpress.com/2012/11/20120820170634_matt-wood.jpg?w=708"   class="size-full wp-image-589879" /></a><p class="wp-caption-text">Matt Wood</p></div>
<p>He and I sat down this week at AWS&#8217;s inaugural Re: Invent conference and talked about many things, including what he&#8217;s seen in the field and where cloud-based big data efforts are headed. Here are the highlights.</p>
<h2>The end of constraint-based thinking</h2>
<p>Not so long ago, computer scientists understood many of the concepts that we now call data science, but limited resources meant they were hamstrung in the types of analysis they could attempt to do. &#8220;That can be very limiting, very constraining when you&#8217;re working with data,&#8221; Wood said.</p>
<p>Now, however, data storage and processing resources are relatively inexpensive and abundant &#8212; so much so that they&#8217;ve actually made the concept of big data possible. Cloud computing has only made those resources cheaper and more abundant. The result, Wood said, is that people working with data are undergoing a shift from that mindset of limiting their data analysis to the resources they have available to one where they think about business needs first.</p>
<p>If they&#8217;re able to get past traditional notions of sampling and days-long processing times, he added, individuals can focus their attention on what they <em>can</em> do because they have so many resources available. He noted how Yelp gave developers relatively free rein early on in their use of Elastic MapReduce, saving them from having to formally request resources just &#8220;to see if the crazy idea [someone] had over coffee is going to play out.&#8221; Yelp was able to spot a shift in mobile traffic volume years ago and get a head start on its mobile efforts because of that, Wood added.</p>
<h2>Data problems aren&#8217;t just about scale</h2>
<p>Generally speaking, Wood said, solving customers&#8217; data problems isn&#8217;t just about figuring out how to store ever-greater volumes for ever-cheaper prices. &#8220;You don&#8217;t have to be at a petabyte scale in order to get some insight on who&#8217;s using your social game,&#8221; he said.</p>
<p>In fact, access to limitless storage and processing is a solution to one problem that actually creates another. Companies want to keep <em>all</em> the data they generate, and that creates complexity, Wood explained. As that data piles up in various repositories &#8212; perhaps in Amazon&#8217;s S3 and DynamoDB services, as well as on some physical machines within a company&#8217;s data center &#8212; moving it from place to place in order to reuse it becomes a difficult process.</p>
<p>Wood said AWS built its <a href="http://gigaom.com/cloud/amazon-preps-data-pipeline-service-to-automate-and-orchstrate-big-data-workflows/">new Data Pipeline Service</a> in order to address this problem. Pipelines can be &#8220;arbitrarily complex,&#8221; he explained &#8212; from running a simple piece of business logic against data to running whole batches through Elastic MapReduce &#8212; but the idea is to automate the movement and processing so users don&#8217;t have to build these flows themselves and then manually run them.</p>
<p><a href="http://gigaom2.files.wordpress.com/2012/11/aws_data_pipeline_console_1-copy.jpg"><img  alt="aws_data_pipeline_console_1 copy" src="http://gigaom2.files.wordpress.com/2012/11/aws_data_pipeline_console_1-copy.jpg?w=708"   class="aligncenter size-full wp-image-589908" /></a></p>
<h2>The cloud isn&#8217;t just for storing tweets</h2>
<p>People sometimes question the relevance of cloud computing for big data workloads, if only because any data generated on in-house systems has to make its way to the cloud over inherently slow connections. The bigger the dataset, the longer the upload time.</p>
<p>Wood said AWS is trying hard to alleviate these problems. For example, <a href="http://gigaom.com/cloud/is-consumer-content-up-next-for-aspera/">partners such as Aspera</a> and even some open source projects enable customers to move large files at fast speeds over the internet (Wood said he&#8217;s seen consistent speeds of 700 megabits per second). This is also why AWS has eliminated data-transfer fees for inbound data, has turned on parallel uploads for large files and <a href="http://gigaom.com/cloud/amazon-gives-users-dedicated-links-to-its-cloud/">created its Direct Connect program</a> with data center operators that provide dedicated connections to AWS facilities.</p>
<p>And if datasets are too large for all those methods, customers <a href="http://gigaom.com/2010/06/10/when-amazon-resorts-to-snail-mail-theres-a-business-opportunity/">can just send AWS their physical disks</a>. &#8220;We definitely receive hard drives,&#8221; Wood said.</p>
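<p>A quick calculation shows why shipping disks can still make sense. The sketch below takes the 700 megabits per second that Wood cites and applies it to a 200TB dataset, a size chosen purely for illustration; it also assumes the link sustains that rate for the whole transfer and uses decimal units.</p>

```python
def transfer_days(dataset_bytes: float, megabits_per_second: float) -> float:
    """Days needed to move a dataset at a sustained line rate."""
    bits = dataset_bytes * 8
    seconds = bits / (megabits_per_second * 1_000_000)
    return seconds / 86_400  # seconds per day


TB = 10**12  # decimal terabyte

print(f"{transfer_days(200 * TB, 700):.1f} days")  # 26.5 days
```

<p>Even at a rate most connections never sustain, a dataset of that size ties up the link for the better part of a month.</p>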
<h2>Collaboration is the future</h2>
<p>Once data makes its way to the cloud, it opens up entirely new methods of collaboration where researchers or even entire industries can access and work together on shared datasets too big to move around. &#8220;This sort of data space is something that&#8217;s becoming common in fields where there are very large datasets,&#8221; Wood said, citing as an example the <a href="http://www.1000genomes.org/">1000 Genomes project</a> dataset that AWS houses.</p>
<div id="attachment_419764" class="wp-caption aligncenter" style="width: 614px"><a href="http://gigaom2.files.wordpress.com/2011/10/dnanexus.jpg"><img  alt="DNAnexus's cloud-based architecture" src="http://gigaom2.files.wordpress.com/2011/10/dnanexus.jpg?w=604&#038;h=517" height="517" width="604" class="size-large wp-image-419764" /></a><p class="wp-caption-text">DNAnexus&#8217;s cloud-based architecture</p></div>
<p>As we&#8217;ve covered recently, <a href="http://gigaom.com/data/why-data-is-the-key-to-better-medicine-and-maybe-a-cure-for-cancer/">the genetics space is drooling over the promise of cloud computing</a>. The 1000 Genomes database is only 200TB, Wood explained, but very few project leads could get the budget to store that much data and make it accessible to their peers, much less the computation power required to process it. And even in fields such as pharmaceuticals, Amazon CTO Werner Vogels <a href="http://gigaom.com/cloud/amazons-vogels-on-21st-century-apps-and-it-life-events/">told me during an earlier interview</a>, companies are using the cloud to collaborate on certain datasets so they don&#8217;t have to spend time and money reinventing the wheel.</p>
<h2>No more supercomputers?</h2>
<p>Wood seemed very impressed with the work that AWS&#8217;s high-performance computing customers have been doing on the platform &#8212; work that previously would have been done on supercomputers or other physical systems. Thanks to AWS partner Cycle Computing, he noted, the Morgridge Institute at the University of Wisconsin <a href="http://gigaom.com/cloud/gene-research-in-the-cloud-could-help-cure-diseases-in-the-lab/">was able to perform 116 years worth of computing in just one week</a>. In the past, access to that kind of power would have required waiting in line until resources opened up on a supercomputer somewhere.</p>
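<p>To get a feel for that scale: if &#8220;116 years&#8221; is read as 116 years of single-core compute time (an assumption, since the unit is not spelled out here), finishing in one week implies roughly this much average parallelism.</p>

```python
def average_parallelism(compute_years: float, wall_clock_days: float,
                        days_per_year: float = 365.25) -> float:
    """Average number of units that must run concurrently."""
    return compute_years * days_per_year / wall_clock_days


print(round(average_parallelism(116, 7)))  # 6053
```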
<p>The collaborative efforts Wood discussed certainly facilitate this type of extreme computation, as do AWS&#8217;s continuous efforts to beef up its instances with more and more power. Whatever users might need, from the new 250GB RAM on-demand instances to <a href="http://gigaom.com/cloud/amazon-gets-graphic-with-cloud-gpu-instances/">GPU-powered Cluster Compute Instances</a>, Wood said AWS will try to provide it. Because cost sometimes matters, AWS has opened Cluster Compute Instances and Elastic MapReduce to its spot market for buying capacity on the cheap.</p>
<p>But whatever data-intensive workloads organizations want to run, many will now always look to the cloud. Because cloud computing and big data &#8212; Hadoop, especially &#8212; have come of age roughly in parallel with each other, Wood hypothesized, they often go hand-in-hand in people&#8217;s minds.</p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-641209p1.html">Shutterstock user winui</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2012/11/30/why-amazon-thinks-big-data-was-made-for-the-cloud/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2012/11/shutterstock_94487455-e1354305591139.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2012/11/shutterstock_94487455-e1354305591139.jpg?w=150" medium="image">
			<media:title type="html">cloud data</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/11/20120820170634_matt-wood.jpg" medium="image">
			<media:title type="html">Matt Wood</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/11/aws_data_pipeline_console_1-copy.jpg" medium="image">
			<media:title type="html">aws_data_pipeline_console_1 copy</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2011/10/dnanexus.jpg?w=604" medium="image">
			<media:title type="html">DNAnexus&#039;s cloud-based architecture</media:title>
		</media:content>
	</item>
		<item>
		<title>Rackspace versus Amazon: The big data edition</title>
		<link>http://gigaom.com/2012/10/29/rackspace-versus-amazon-the-big-data-edition/</link>
		<comments>http://gigaom.com/2012/10/29/rackspace-versus-amazon-the-big-data-edition/#comments</comments>
		<pubDate>Mon, 29 Oct 2012 17:19:49 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[Amazon Web Services]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[DynamoDB]]></category>
		<category><![CDATA[elastic-mapreduce]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[OpenStack]]></category>
		<category><![CDATA[Rackspace]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=577799</guid>
		<description><![CDATA[Rackspace is busy building a Hadoop service, giving the company one more avenue to compete with cloud kingpin Amazon Web Services. However, the two services -- along with several others on the market -- highlight just how different seemingly similar cloud services can be.]]></description>
				<content:encoded><![CDATA[<p>Rackspace has been on a tear over the past few months <a href="http://gigaom.com/cloud/rackspace-breaks-out-block-storage-with-disk-and-ssd-options/">releasing new features that map closely to the core features</a> of the Amazon Web Services platform, <a href="http://gigaom.com/cloud/rackspace-ceo-were-playing-a-different-game-than-amazon/">only with a Rackspace flavor</a> that favors service over scale. Its next target is <a href="http://aws.amazon.com/elasticmapreduce/">Amazon Elastic MapReduce</a>, which Rackspace will be countering with its own Hadoop service in 2013. If AWS and Rackspace are, indeed, the No. 1 and No. 2 cloud computing providers around, it might be easy enough to make a decision between the two platforms.</p>
<p>In the cloud, however, the choices are never as simple as black or white.</p>
<h2>Amazon versus Rackspace is a matter of control</h2>
<p>Discussing <a href="http://www.rackspace.com/blog/taking-elephants-to-the-openstack-cloud-a-new-initiative-with-hortonworks/">its forthcoming Hadoop service</a> during a phone call on Friday, Rackspace CTO John Engates highlighted the fundamental product-level differences between his company and its biggest competitor, AWS. Right now, for users, it&#8217;s primarily a question of how much control they want over the systems they&#8217;re renting &#8212; and Rackspace comes down firmly on the side of maximum control.</p>
<div id="attachment_578160" class="wp-caption alignright" style="width: 271px"><a href="http://gigaom2.files.wordpress.com/2012/10/jengates-1.jpg"><img  title="jengates 1" alt="" src="http://gigaom2.files.wordpress.com/2012/10/jengates-1.jpg?w=708"   class="size-full wp-image-578160" /></a><p class="wp-caption-text">John Engates</p></div>
<p>For Hadoop specifically, Engates said Rackspace&#8217;s service will &#8220;really put [users] in the driver&#8217;s seat in terms of how they&#8217;re running it&#8221; by giving them granular control over how their systems are configured and how their jobs run (courtesy of the OpenStack APIs, of course). Rackspace is even working on optimizing a portion of its cloud so the Hadoop service will run on servers, storage and networking gear designed specifically for big data workloads. Essentially, Engates added, Rackspace wants to give users the experience of owning a Hadoop cluster without actually owning any of the hardware.</p>
<p>&#8220;It&#8217;s not MapReduce as a service,&#8221; he added, &#8220;it&#8217;s more Hadoop as a service.&#8221;</p>
<p>The company partnered with Yahoo spinoff Hortonworks on this in part because of its expertise and in part because <a href="http://gigaom.com/cloud/hortonworks-teams-with-vmware-to-keep-hadoop-running/">its open source vision for Hadoop</a> aligns closely with Rackspace&#8217;s vision around OpenStack. &#8220;The guys at Hortonworks are really committed to the real open source flavor of Hadoop,&#8221; Engates said.</p>
<p>Rackspace&#8217;s forthcoming Hadoop service appears to contrast somewhat with Amazon&#8217;s three-year-old and <a href="http://gigaom.com/data/meet-the-combo-behind-etsy-airbnb-and-climate-corp-hadoop-jobs/">generally well-received Elastic MapReduce service</a>. The latter lets users write their own MapReduce jobs and choose the number and types of servers they want, but doesn&#8217;t give users system-level control on par with what Rackspace seems to be planning. For the most part, it comports with AWS&#8217;s tried-and-true strategy of giving users some control of their underlying resources, but generally trying to offload as much of the operational burden as possible.</p>
<p>Elastic MapReduce also isn&#8217;t open source, but is an Amazon-specific service designed around Amazon&#8217;s existing S3 storage system and other AWS features. When AWS did choose to offer a version of Elastic MapReduce running a commercial Hadoop distribution, <a href="http://gigaom.com/cloud/amazon-taps-mapr-for-high-powered-elastic-mapreduce/">it chose MapR&#8217;s high-performance but partially proprietary flavor of Hadoop</a>.</p>
<h2>It doesn&#8217;t stop with Hadoop</h2>
<p>Rackspace is also considering getting into the NoSQL space, perhaps with hosted versions of the open source Cassandra and MongoDB databases, and here too it likely will take a different tack than AWS. For one, Rackspace still has a dedicated hosting business to tie into, where some customers still run EMC storage area networks and NetApp network-attached storage arrays. That means Rackspace can&#8217;t afford to lock users into a custom-built service that doesn&#8217;t take their existing infrastructure into account or that favors raw performance over enterprise-class features.</p>
<p>Rackspace needs stuff that&#8217;s &#8220;open, readily available and not unique to us,&#8221; Engates said. Pointing specifically to <a href="http://gigaom.com/cloud/amazon-launches-home-grown-nosql-database/">AWS&#8217;s fully managed and internally developed DynamoDB service</a>, he suggested, &#8220;I don&#8217;t think it&#8217;s in the fairway for most customers that are using Amazon today.&#8221;</p>
<p>Perhaps, but <a href="http://aws.amazon.com/dynamodb/testimonials/">early DynamoDB success stories</a> such as IMDb, SmugMug and Tapjoy suggest the service isn&#8217;t without an audience willing to pay for its promise of a high-performance, low-touch NoSQL data store.</p>
<p><iframe src="http://www.youtube.com/embed/oz-7wJJ9HZ0" height="315" width="560"></iframe></p>
<h2>Which is better? Maybe neither</h2>
<p>There&#8217;s plenty of room for debate over whose approach is better, but the answer for many would-be customers might well be neither. When it comes to hosted Hadoop services, both Rackspace and Amazon have to contend with Microsoft&#8217;s <a href="https://www.hadooponazure.com/">newly available HDInsight service</a> on its Windows Azure platform, as well as IBM&#8217;s <a href="http://gigaom.com/cloud/ibm-doing-hadoop-as-a-service-in-its-cloud/">BigInsights service on its SmartCloud platform</a>. Google <a href="http://www.mapr.com/company/press-releases/mapr-and-google-compute-engine-set-new-world-record-for-hadoop-terasort">appears to have something cooking</a> in the Hadoop department, as well. For developers who think all these infrastructure-level services are too much work, higher-level services such as <a href="http://gigaom.com/cloud/exclusive-the-brains-behind-hive-launch-on-demand-hadoop-service/">Qubole</a>, <a href="http://gigaom.com/cloud/how-infochimps-wants-to-become-heroku-for-hadoop/">Infochimps</a> or <a href="http://gigaom.com/cloud/if-you-can-code-mortar-data-says-you-can-use-its-hadoop-service/">Mortar Data</a> might look more appealing.</p>
<p>The NoSQL space is <a href="http://gigaom.com/cloud/cloud-databases-101-who-builds-em-and-what-they-do/">rife with cloud services, too</a>, primarily focused on MongoDB but also including hosted Cassandra and CouchDB-based services.</p>
<p>In order to stand apart from the big data crowd, Engates said Rackspace is going to stick with its company-wide strategy of differentiation through user support. Thanks to its partnership with Hortonworks and the hybrid nature of OpenStack, for example, Rackspace is already helping customers deploy Hadoop in their private cloud environments while its public cloud service is still in the works. &#8220;We want to go where the complexity is,&#8221; he said, &#8220;where the customers value our [support] and expertise.&#8221;</p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-343612p1.html">Shutterstock user Graeme Shannon</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2012/10/29/rackspace-versus-amazon-the-big-data-edition/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2012/10/shutterstock_70904386.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2012/10/shutterstock_70904386.jpg?w=150" medium="image">
			<media:title type="html">Fighting elephants</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/10/jengates-1.jpg" medium="image">
			<media:title type="html">jengates 1</media:title>
		</media:content>
	</item>
		<item>
		<title>All aboard the Hadoop money train</title>
		<link>http://gigaom.com/2012/05/07/all-aboard-the-hadoop-money-train/</link>
		<comments>http://gigaom.com/2012/05/07/all-aboard-the-hadoop-money-train/#comments</comments>
		<pubDate>Mon, 07 May 2012 17:50:00 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[Amazon Web Services]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[elastic-mapreduce]]></category>
		<category><![CDATA[EMC]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Hortonworks]]></category>
		<category><![CDATA[IDC]]></category>
		<category><![CDATA[Mapr]]></category>
		<category><![CDATA[mapreduce]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=518419</guid>
		<description><![CDATA[Market research firm IDC released the first legitimate market forecast for Hadoop on Monday, claiming the ecosystem around the de facto big data platform will sell almost $813 million worth of software by 2016. But Hadoop's actual economic impact is likely much, much larger.]]></description>
				<content:encoded><![CDATA[<p><a href="http://gigaom2.files.wordpress.com/2012/05/bronze-elephant-e1317338128377.jpeg"><img title="bronze-elephant-e1317338128377" src="http://gigaom2.files.wordpress.com/2012/05/bronze-elephant-e1317338128377.jpeg?w=708" alt=""   class="alignleft size-full wp-image-518580"></a>Market research firm IDC <a href="http://www.businesswire.com/news/home/20120507005611/en/IDC-Releases-Worldwide-Hadoop-MapReduce-Ecosystem-Software-Forecast">released the first legitimate market forecast for Hadoop</a> on Monday, claiming the ecosystem around the de facto big data platform will sell almost $813 million worth of software by 2016. But IDC’s forecast doesn’t tell the whole story. Hadoop’s actual economic impact is likely much, much larger.</p>
<p>Viewed alone, IDC’s forecast offers impressive enough numbers — a $77 million Hadoop market growing at a 60 percent compound annual growth rate until it hits the $812.8 million mark in 2016. The number would be higher, the report concludes, if Hadoop’s open source status didn’t drive down the prices that vendors pushing proprietary products could charge. Indeed, with vendors such as Hortonworks pushing the all-open-source approach to Hadoop, others do have to keep their license fees in check.</p>
<p>According to separate emails from report co-authors Carl Olofson and Dan Vesset, IDC’s revenue calculations take into account software, maintenance and software-as-a-service revenue. That means they account for <a href="http://pro.gigaom.com/2012/04/sector-roadmap-hadoop-platforms-2012/?utm_source=cloud&amp;utm_medium=editorial&amp;utm_campaign=intext&amp;utm_term=518419+all-aboard-the-hadoop-money-train&amp;utm_content=dharrisstructure">the usual suspects such as Cloudera, MapR and Hortonworks at the distribution layer</a> (GigaOM Pro subscription req’d), as well as <a href="http://gigaom.com/cloud/what-it-really-means-when-someone-says-hadoop/">the myriad vendors working a layer above on Hadoop-based applications, databases and management tools</a>. IDC also accounted for cloud-based Hadoop services such as Amazon Web Services’ Elastic MapReduce (and, presumably, upstarts such as <a href="http://gigaom.com/cloud/how-infochimps-wants-to-become-heroku-for-hadoop/">Infochimps</a> and <a href="http://gigaom.com/cloud/if-you-can-code-mortar-data-says-you-can-use-its-hadoop-service/">Mortar Data</a>).</p>
<p>Time will tell if IDC’s forecast is accurate, but it’s definitely welcome. Hadoop isn’t a new version of the relational database, <a href="http://www.businessinsider.com/the-guy-that-invented-an-amazing-big-data-tech-says-oracle-and-microsoft-should-not-be-afraid-2012-5">as some have suggested</a>, but a whole new platform that sits beside and likely won’t replace legacy software. In theory, then, any money spent on Hadoop is new money. And because it began as an open source project and is coming of age in the cloud computing era, it’s not too easy to look to technologies past for guidance.</p>
<h2>An impending Hadoop explosion in the cloud</h2>
<p>It’s in the cloud where things could get particularly interesting. Olofson noted that he and Vesset estimated “little if any revenue for [Elastic MapReduce] in 2011,” although I’m not certain I agree. That service is actually quite popular and accounts for some serious Hadoop use — some users run <a href="http://gigaom.com/cloud/how-climate-corp-is-pitting-big-data-against-mother-nature/">several thousand nodes at a time</a>. If it’s generating even a few million dollars — just <a href="http://gigaom.com/cloud/dont-look-now-but-aws-might-be-a-billion-dollar-biz/">a fraction of AWS’s estimated overall revenue</a> — that’s a not-insignificant piece of a $77 million Hadoop space.</p>
<p>But cloud computing is already having a meaningful impact on Hadoop in other ways that will only expand. One is as the de facto deployment model of choice for many web startups, more and more of which are finding a way to make big data a part of their business model. Whether they’re <a href="http://gigaom.com/cloud/9-more-companies-putting-a-cloud-spin-on-big-data/">big data applications</a> or just <a href="http://gigaom.com/cloud/accel-forms-100m-fund-to-feed-big-data-apps/">applications that use big data</a>, they will likely use Hadoop, and they’re likely not going to pay a lot of money for it. If they’re not hosting and managing their own Apache Hadoop cluster, they’ll probably use a cloud-based Hadoop offering, which could mean significant growth in that segment of the ecosystem.</p>
<h2>The externalities of Hadoop</h2>
<p>Moreover, every company — startup or established — that offers a service powered in some part by Hadoop adds to the platform’s overall economic impact. Hadoop is a big data storage-and-processing framework as well as a positive-externality generator. <a href="http://gigaom.com/cloud/facebook-hadoop-cluster/">Facebook</a>, <a href="http://gigaom.com/cloud/how-twitter-is-doing-its-part-to-democratize-big-data/">Twitter</a>, <a href="http://gigaom.com/cloud/think-youre-unique-let-yahoos-data-trove-be-the-judge/">Yahoo</a>, <a href="http://gigaom.com/cloud/how-etsy-handcrafted-a-big-data-strategy/">Etsy</a>, <a href="http://gigaom.com/cloud/hadoop-kills-zombies-too-is-there-anything-it-cant-solve/">ipTrust</a>, <a href="http://gigaom.com/cloud/bloomreach-wants-to-save-your-site-with-big-data/">BloomReach</a>, <a href="http://gigaom.com/cloud/satellite-imagery-and-hadoop-mean-70m-for-skybox/">Skybox</a>, <a href="http://gigaom.com/cloud/how-climate-corp-is-pitting-big-data-against-mother-nature/">Climate Corporation</a>, <a href="http://gigaom.com/cloud/how-hadoop-can-help-keep-your-money-in-the-bank/">Zions Bancorporation</a> — every sale made, fraud thwarted or page view generated thanks to Hadoop means a healthier economy. The dollar amount directly attributable to Hadoop probably isn’t calculable, but it’s likely rather large and growing.</p>
<p>It’s difficult to even fathom the external effects of Hadoop in 2016. A projected $812.8 million market means Hadoop is becoming fairly ubiquitous, especially if one considers how many companies are using free software in addition to those paying for it. Already, I’ve been told, <a href="http://gigaom.com/cloud/hadoop-bigger-than-spring-jboss-and-mysql-combined/">the majority of Fortune 500 companies are at least experimenting with Hadoop</a>. It might not be a godsend, and might be just the starting point of a meaningful big data strategy, but Hadoop is going to have a major impact on how businesses do business.</p>
<p><em>Image <a href="http://creativecommons.org/licenses/by/2.0/">courtesy</a> of <a href="http://www.flickr.com/photos/rachelscott/932930452/" target="_blank">Flickr user RachScottHalls</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2012/05/07/all-aboard-the-hadoop-money-train/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2012/05/bronze-elephant-e1317338128377.jpeg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2012/05/bronze-elephant-e1317338128377.jpeg?w=150" medium="image">
			<media:title type="html">bronze-elephant-e1317338128377</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/05/bronze-elephant-e1317338128377.jpeg" medium="image">
			<media:title type="html">bronze-elephant-e1317338128377</media:title>
		</media:content>
	</item>
		<item>
		<title>2012: The Hadoop infrastructure market booms</title>
		<link>http://pro.gigaom.com/2012/04/sector-roadmap-hadoop-platforms-2012/</link>
		<comments>http://pro.gigaom.com/2012/04/sector-roadmap-hadoop-platforms-2012/#comments</comments>
		<pubDate>Thu, 26 Apr 2012 19:22:32 +0000</pubDate>
		<dc:creator>Jo Maitland</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Adaptive Computing]]></category>
		<category><![CDATA[Amazon]]></category>
		<category><![CDATA[apnatek]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[axceleon]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[bioteam]]></category>
		<category><![CDATA[BusinessObjects]]></category>
		<category><![CDATA[cascadeo]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[clustercorp]]></category>
		<category><![CDATA[Couchbase]]></category>
		<category><![CDATA[Cycle Computing]]></category>
		<category><![CDATA[data privacy]]></category>
		<category><![CDATA[data scientists]]></category>
		<category><![CDATA[data storage]]></category>
		<category><![CDATA[data visualization]]></category>
		<category><![CDATA[data-analytics]]></category>
		<category><![CDATA[data-security]]></category>
		<category><![CDATA[Datameer]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[db2]]></category>
		<category><![CDATA[elastic-mapreduce]]></category>
		<category><![CDATA[enterprise IT]]></category>
		<category><![CDATA[Foursquare]]></category>
		<category><![CDATA[Fujitsu]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hadoop-stack]]></category>
		<category><![CDATA[Hbase]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[Hortonworks]]></category>
		<category><![CDATA[hp-vertica]]></category>
		<category><![CDATA[IBM]]></category>
		<category><![CDATA[informatica]]></category>
		<category><![CDATA[Intel]]></category>
		<category><![CDATA[jaspersoft]]></category>
		<category><![CDATA[karmasphere]]></category>
		<category><![CDATA[legacy-systems]]></category>
		<category><![CDATA[lexisnexis]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[microstrategy]]></category>
		<category><![CDATA[namenode-file-system]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[nube-technologies]]></category>
		<category><![CDATA[oozie]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[pentaho]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[Platfora]]></category>
		<category><![CDATA[platform-computing]]></category>
		<category><![CDATA[Quantivo]]></category>
		<category><![CDATA[quest]]></category>
		<category><![CDATA[Rackspace]]></category>
		<category><![CDATA[RainStor]]></category>
		<category><![CDATA[razorfish]]></category>
		<category><![CDATA[SAP]]></category>
		<category><![CDATA[splunk]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[stack-iq]]></category>
		<category><![CDATA[tableau]]></category>
		<category><![CDATA[tco]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[the-apache-foundation]]></category>
		<category><![CDATA[the-mathworks]]></category>
		<category><![CDATA[think-big-analytics]]></category>
		<category><![CDATA[TicketMaster]]></category>
		<category><![CDATA[total-cost-of-ownership]]></category>
		<category><![CDATA[univa-ud]]></category>
		<category><![CDATA[unstructured data]]></category>
		<category><![CDATA[VoltDB]]></category>
		<category><![CDATA[Wipro]]></category>
		<category><![CDATA[Yahoo]]></category>
		<category><![CDATA[yelp]]></category>
		<category><![CDATA[zettaset]]></category>
		<category><![CDATA[zookeeper]]></category>

		<guid isPermaLink="false">http://pro.gigaom.com/?p=105677</guid>
		<description><![CDATA[There are now more than half a dozen commercial Hadoop distributions in the market, and almost every enterprise with big data challenges is tinkering with the Apache Foundation-licensed software. A new report examines the key disruptive trends shaping the Hadoop platform market.]]></description>
				<content:encoded><![CDATA[<p>For years, technologists have been promising software that will make it easier and cheaper to analyze vast amounts of data in order to revolutionize business. More than one solution exists, but today Hadoop is fast becoming the most talked about name in enterprises. There are now more than half a dozen commercial Hadoop distributions in the market, and almost every enterprise with big data challenges is tinkering with the Apache Foundation–licensed software. This report examines the key disruptive trends shaping the Hadoop platform market, from integration with legacy systems to ensuring data security, and where companies like Cloudera, IBM, Hortonworks and others will position themselves to gain share and increase revenue.</p>
]]></content:encoded>
			<wfw:commentRss>http://pro.gigaom.com/2012/04/sector-roadmap-hadoop-platforms-2012/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:thumbnail url="https://gigaom-pro-files.s3.amazonaws.com/files/2012/04/elephant.jpg?w=150" />
		<media:content url="https://gigaom-pro-files.s3.amazonaws.com/files/2012/04/elephant.jpg?w=150" medium="image">
			<media:title type="html">elephant</media:title>
		</media:content>

		<media:content url="http://1.gravatar.com/avatar/4f3860069d181dbeeb398304f5940a9e?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">gigaedit</media:title>
		</media:content>
	</item>
		<item>
		<title>It&#8217;s time for cloud security and big data to come together</title>
		<link>http://pro.gigaom.com/2012/03/its-time-for-cloud-security-and-big-data-to-come-together/</link>
		<comments>http://pro.gigaom.com/2012/03/its-time-for-cloud-security-and-big-data-to-come-together/#comments</comments>
		<pubDate>Thu, 01 Mar 2012 22:00:14 +0000</pubDate>
		<dc:creator>Jo Maitland</dc:creator>
				<category><![CDATA[pro-infrastructure]]></category>
		<category><![CDATA[Amazon]]></category>
		<category><![CDATA[Amazon Web Services]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[cloud security]]></category>
		<category><![CDATA[data-analytics]]></category>
		<category><![CDATA[elastic-mapreduce]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[RSA]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://pro.gigaom.com/?p=99721</guid>
		<description><![CDATA[RSA says the way to handle security threats is by harnessing big data, but the tools aren't out there yet for mainstream business [...]]]></description>
				<content:encoded><![CDATA[<p>RSA says the way to handle security threats is by harnessing big data, but the tools aren&#8217;t out there yet for mainstream business users.</p>
]]></content:encoded>
			<wfw:commentRss>http://pro.gigaom.com/2012/03/its-time-for-cloud-security-and-big-data-to-come-together/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/4411542bbd7a2a9a2fc2a1b38809e45c?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">gigaguest</media:title>
		</media:content>
	</item>
		<item>
		<title>How Etsy handcrafted a big data strategy</title>
		<link>http://gigaom.com/2011/11/02/how-etsy-handcrafted-a-big-data-strategy/</link>
		<comments>http://gigaom.com/2011/11/02/how-etsy-handcrafted-a-big-data-strategy/#comments</comments>
		<pubDate>Wed, 02 Nov 2011 17:23:44 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[Amazon Web Services]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Cascading]]></category>
		<category><![CDATA[Concurrent]]></category>
		<category><![CDATA[elastic-mapreduce]]></category>
		<category><![CDATA[Etsy]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[machine data]]></category>
		<category><![CDATA[server logs]]></category>
		<category><![CDATA[splunk]]></category>
		<category><![CDATA[Web Infrastructure]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=431705</guid>
		<description><![CDATA[E-commerce site Etsy has grown to 25 million unique visitors and 1.1 billion page views per month, and it's generating the data volumes to match. Using tools such as Hadoop and Splunk, Etsy is turning terabytes of data per day into a better product.]]></description>
				<content:encoded><![CDATA[<p><a href="http://gigaom2.files.wordpress.com/2011/11/etsy.jpg"><img  title="Etsy" src="http://gigaom2.files.wordpress.com/2011/11/etsy-e1320250561591.jpg?w=300&#038;h=199" alt="" width="300" height="199" class="alignleft size-medium wp-image-431827" /></a>Etsy, the e-commerce site specializing in handmade and vintage goods, has grown to more than 11 million users and draws 25 million unique visitors and 1.1 billion page views per month, and it&#8217;s generating the data volumes to match. Today, for example, Etsy <a href="http://www.marketwire.com/press-release/etsy-selects-splunk-managing-big-data-to-keep-small-businesses-up-and-running-1580916.htm">detailed some of its work with Splunk</a> to manage and analyze up to a terabyte of machine data per day.</p>
<p>This is a roughly 200x increase since 2007, when Etsy signed on with Splunk and was capturing a mere 5GB of data per day. Etsy&#8217;s usage of Splunk has probably evolved, too, from a focus on troubleshooting (i.e., noticing a problem and tracking down the cause) to a focus on what <a href="http://gigaom.com/cloud/how-splunk-is-riding-it-search-toward-an-ipo/">Splunk calls &#8220;operational intelligence.&#8221;</a> Because users can search and analyze server logs and other machine-generated data pretty much as it streams in, they can, for example, monitor traffic patterns in real time to uncover ongoing issues that might be causing visitors to drop off pages or leave the site.</p>
<p>Splunk isn&#8217;t Etsy&#8217;s only big data tool: the company is also a big Hadoop user. Etsy runs dozens of Hadoop workflows each night on Amazon&#8217;s cloud-based Elastic MapReduce Hadoop service. According to this very detailed (and technical) presentation (PDF <a href="http://assets.en.oreilly.com/1/event/61/Ephemeral%20Hadoop%20Clusters%20in%20the%20Cloud%20Presentation.pdf">here</a>, video <a href="http://www.youtube.com/watch?v=NF6zwHlbh_I&amp;feature=player_embedded">here</a>) explaining Etsy&#8217;s Hadoop usage, Etsy ran nearly 5,000 Hadoop jobs in May 2011 to analyze both internal operational data and external activity such as customer behavior. Etsy actually uses MATLAB within its Elastic MapReduce clusters to analyze the data and perform predictive analytics. The presentation also highlights Etsy&#8217;s experimentation with Tableau to visually display the results of its internal data after it has been cleaned up by Hadoop.</p>
<p>At the product level, Hadoop powers Etsy&#8217;s <a href="http://www.etsy.com/tastetest">Taste Test</a> feature that helps the site determine what products best suit a particular customer&#8217;s tastes. It also helps with a feature that analyzes Facebook profile information in order to let visitors shop for their friends. At Hadoop World next week, an <a href="http://www.hadoopworld.com/session/data-mining-for-product-search-ranking/">Etsy engineer will discuss</a> how Etsy uses Hadoop to improve its search recommendation engine.</p>
<p>Operationally, Hadoop helps Etsy analyze server logs to figure out what customers are doing on the site and how they&#8217;re accessing it.</p>
<p>Etsy, like so many other companies, especially on the web, is both drowning in data and trying to leverage it. That&#8217;s why we&#8217;re <a href="http://gigaom.com/cloud/so-much-hadoop-in-so-many-places/">seeing such a huge focus on Hadoop</a> among all varieties of enterprise data-management vendors, and why you can&#8217;t escape the omnipresent references to &#8220;big data.&#8221; As Etsy proves, though, dealing with big data requires a multi-pronged approach that goes well beyond simply deploying a Hadoop cluster and watching the insights pour in.</p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2011/11/02/how-etsy-handcrafted-a-big-data-strategy/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2011/11/etsy-e1320250561591.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2011/11/etsy-e1320250561591.jpg?w=150" medium="image">
			<media:title type="html">Etsy</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2011/11/etsy-e1320250561591.jpg?w=300" medium="image">
			<media:title type="html">Etsy</media:title>
		</media:content>
	</item>
	</channel>
</rss>
