<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>GigaOM &#187; Hadoop</title>
	<atom:link href="http://gigaom.com/tag/hadoop/feed/" rel="self" type="application/rss+xml" />
	<link>http://gigaom.com</link>
	<description></description>
	<lastBuildDate>Wed, 19 Jun 2013 02:03:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='gigaom.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://0.gravatar.com/blavatar/0db8f6557d022075dbbf010c54d46d93?s=96&#038;d=http%3A%2F%2Fs2.wp.com%2Fi%2Fbuttonw-com.png</url>
		<title>GigaOM &#187; Hadoop</title>
		<link>http://gigaom.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://gigaom.com/osd.xml" title="GigaOM" />
	<atom:link rel='hub' href='http://gigaom.com/?pushpress=hub'/>
		<item>
		<title>Accel Partners putting another $100M toward big data apps</title>
		<link>http://gigaom.com/2013/06/17/accel-partners-putting-another-100m-toward-big-data-apps/</link>
		<comments>http://gigaom.com/2013/06/17/accel-partners-putting-another-100m-toward-big-data-apps/#comments</comments>
		<pubDate>Tue, 18 Jun 2013 04:00:03 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[Accel Partners]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Visualization]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=658345</guid>
		<description><![CDATA[Accel has launched its Big Data Fund 2, a followup on the equally large fund the venture capital firm started in November 2011. Rather than seeking products that target data scientists, it wants those targeting business users.]]></description>
				<content:encoded><![CDATA[<p>Venture capital firm Accel Partners is doubling down on its big data investments, announcing on Monday evening that it&#8217;s launching its second $100 million fund dedicated to analytic software and applications. The aptly named Big Data Fund 2 follows on <a href="http://gigaom.com/2011/11/08/accel-forms-100m-fund-to-feed-big-data-apps/">the firm&#8217;s initial Big Data Fund</a> that it announced in November 2011.</p>
<p>Since then, Accel has put a name on the types of companies it&#8217;s seeking to fund with the new allocation &#8212; namely, those selling what it calls &#8220;data-driven software.&#8221; That&#8217;s a fancy way of saying that it&#8217;s not looking to fund infrastructure-level software such as Hadoop or NoSQL databases, but rather software that leverages these technologies and others in order to make analytics simpler. It wants to fund startups targeting business users rather than data scientists.</p>
<div id="attachment_614655" class="wp-caption alignleft" style="width: 310px"><a href="http://gigaom2.files.wordpress.com/2013/02/1z5o3444.jpg"><img  alt="Structure 2011: Avery Lyford – Chairman Elect, Churchill Club; Michael Goguen – Partner, Sequoia Capital; Satish Dharmaraj – Partner, Redpoint Ventures; Ping Li – Partner, Accel Partners; John Vrionis – Managing Director, Lightspeed Venture Partners" src="http://gigaom2.files.wordpress.com/2013/02/1z5o3444.jpg?w=300&#038;h=200" width="300" height="200" class="size-medium wp-image-614655" /></a><p class="wp-caption-text">Accel Partner Ping Li (second from right) at Structure 2011. (c) Pinar Ozger</p></div>
<p>This type of company isn&#8217;t too difficult to come by anymore. Just about everywhere you look, someone is trying to put a big data spin on an old problem or invent some new methods for doing business intelligence. Accel has recently funded a number of them including RelateIQ, <a href="http://gigaom.com/2012/11/19/opower-the-big-data-energy-player-to-beat/">Opower</a>, <a href="http://gigaom.com/2012/11/28/log-data-startup-sumo-logic-raises-30m/">Sumo Logic</a>  and <a href="http://gigaom.com/2013/02/06/exclusive-causata-raises-7-5m-and-steps-up-its-game-in-targeted-ads/">Causata</a>. Among the non-Accel-funded startups GigaOM has covered in just the past few months are <a href="http://gigaom.com/2013/01/16/has-ayasdi-turned-machine-learning-into-a-magic-bullet/">Ayasdi</a>, <a href="http://gigaom.com/2013/05/31/wise-io-wants-to-make-machine-learning-available-to-all/">Wise.io</a>, <a href="http://gigaom.com/2013/06/10/spinnakr-brings-data-science-spin-to-tracking-web-traffic/">Spinnakr</a>, <a href="http://gigaom.com/2013/03/17/statwing-wants-to-make-your-data-and-armchair-quarterback-dreams-come-true/">Statwing</a> and <a href="http://gigaom.com/2013/05/14/this-is-why-big-data-is-the-sweet-spot-for-saas/">BloomReach</a>.</p>
<p>All this interest in data-driven software is no doubt inspired by the proven utility and wildly successful initial public offerings by enterprise data software companies such as <a href="http://gigaom.com/2012/04/19/splunk-ipo-kills-lives-up-to-expectations/">Splunk</a> and <a href="http://gigaom.com/2013/05/17/tableau-closes-day-1-as-a-2-9-billion-public-company-up-64-percent/">Tableau</a>. Entrepreneurs can see the value in rethinking legacy business software or processes for the era of big data and cloud computing, and investors have dollar signs in their eyes as they <a href="http://gigaom.com/2013/01/18/alchemist-accelerator-shows-off-as-enterprise-investment-picks-up/">try to get a piece of the most-promising companies</a>.</p>
<p>As with all trends, much of this startup and investing activity will prove to be overkill, but there&#8217;s no denying the promise that the right products have for everyone involved. Businesses really are hurting for better ways to make sense of all the data they&#8217;re generating and being exposed to, and they&#8217;ll pay handsomely to software vendors that can solve the problem.</p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/06/17/accel-partners-putting-another-100m-toward-big-data-apps/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/03/shutterstock_125574617.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/03/shutterstock_125574617.jpg?w=150" medium="image">
			<media:title type="html">Big Data</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/02/1z5o3444.jpg?w=300" medium="image">
			<media:title type="html">Structure 2011: Avery Lyford – Chairman Elect, Churchill Club; Michael Goguen – Partner, Sequoia Capital; Satish Dharmaraj – Partner, Redpoint Ventures; Ping Li – Partner, Accel Partners; John Vrionis – Managing Director, Lightspeed Venture Partners</media:title>
		</media:content>
	</item>
		<item>
		<title>A real-time bonanza: Facebook&#8217;s Wormhole and Yahoo&#8217;s streaming Hadoop</title>
		<link>http://gigaom.com/2013/06/14/a-real-time-bonanza-facebooks-wormhole-and-yahoos-streaming-hadoop/</link>
		<comments>http://gigaom.com/2013/06/14/a-real-time-bonanza-facebooks-wormhole-and-yahoos-streaming-hadoop/#comments</comments>
		<pubDate>Fri, 14 Jun 2013 16:57:54 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[big data]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[real-time]]></category>
		<category><![CDATA[Storm]]></category>
		<category><![CDATA[stream processing]]></category>
		<category><![CDATA[Web Infrastructure]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=657636</guid>
		<description><![CDATA[This week, both Facebook and Yahoo detailed new efforts to manage real-time data flows within their myriad systems. Yahoo's work is an open source implementation of Storm designed to run on the same cluster as Hadoop and even share resources.]]></description>
				<content:encoded><![CDATA[<p>If you’re into systems that can share data among each other in real time, this has been a good week. On Tuesday, Yahoo <a href="http://developer.yahoo.com/blogs/ydn/storm-yarn-released-open-source-143745133.html">open sourced its version</a> of the popular Storm stream-processing software that’s able to run inside Hadoop clusters. Then, on Thursday, Facebook <a href="https://www.facebook.com/notes/facebook-engineering/wormhole-pubsub-system-moving-data-through-space-and-time/10151504075843920">detailed a system called Wormhole</a> that informs the platform’s myriad applications when changes have occurred in another, so that each one is working from the newest data possible.</p>
<p>The Yahoo work is actually pretty important. Among the features Hadoop users have been demanding from the platform is a transition from batch-processing-only mode <a href="http://gigaom.com/2013/03/07/5-reasons-why-the-future-of-hadoop-is-real-time-relatively-speaking/">into something that can actually deal with data in real time</a>. The reason for the demand is quite simple: Although being able to analyze or transform data minutes to hours after it’s generated is helpful for certain analytic tasks, it’s not too helpful if you want an application to be able to act on data as it hits the system.</p>
<p>A service like Twitter is a prime example of where Storm can be valuable. Twitter uses Storm to handle tweets so users’ Timelines stay up to date, and to run real-time analytics and spot emerging trends. In fact, <a href="http://gigaom.com/2011/08/04/twitter-to-open-source-hadoop-like-tool/">it was Twitter that open sourced Storm in 2011</a> after buying Storm creator Backtype in order to get access to the technology and its developers.</p>
<p>Among web companies, Storm has become quite popular as a stream-processing complement to Hadoop since then. And now Yahoo has made possible a much tighter integration between the two — even to the point that Storm can borrow cycles from batch-processing nodes if it needs some extra juice. That’s a valuable feature — just last week I heard Twitter engineer Krishna Gade bemoan Storm’s auto-scaling limitations during a talk at Facebook’s <a href="http://analyticswebscale.splashthat.com/">Analytics @ Web Scale</a> event.</p>
<div id="attachment_657687" class="wp-caption aligncenter" style="width: 718px"><a href="http://gigaom2.files.wordpress.com/2013/06/img_20130606_120037.jpg"><img alt="Krishna Gade talking Storm at the Facebook event." src="http://gigaom2.files.wordpress.com/2013/06/img_20130606_120037.jpg?w=708&#038;h=531" width="708" height="531" class="size-large wp-image-657687"></a><p class="wp-caption-text">Krishna Gade talking Storm at the Facebook event.</p></div>
<p>The Storm-on-Hadoop work is among the first of many promised improvements to come thanks to <a href="http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/">YARN</a>, a major update to the Apache Hadoop 2.0 code that lets Hadoop clusters run multiple processing frameworks simultaneously. Twitter <a href="http://gigaom.com/2012/04/19/twitter-backs-fave-big-data-projects-with-apache-sponsorship/">has been using the open source Mesos resource manager</a> to achieve the same general capabilities, but Gade’s colleague Dmitriy Ryaboy said during the same talk that the company plans to begin using YARN for some big data workloads when it upgrades to Hadoop 2.0. He expects — probably correctly — much more community effort will go toward continuously improving its capabilities and building applications for YARN.</p>
<p>Facebook’s Wormhole project isn’t open source (as far as I can tell), but its lessons are still valuable (and LinkedIn has <a href="http://blog.linkedin.com/2011/01/11/open-source-linkedin-kafka/">open sourced similar technologies named Kafka</a> and <a href="http://data.linkedin.com/projects/databus">Databus</a>). It’s what’s called a publish-subscribe system, which is essentially a concise way of saying that it manages communications between applications that publish information (e.g., updates to a database) and applications that subscribe to the information their fellow applications are publishing. At Facebook, for example, Wormhole sends changes made to Facebook’s master user database on to Graph Search so that search results are as up to date as possible, or to its Hadoop environment so analytics jobs have the newest data.</p>
<p><a href="http://gigaom2.files.wordpress.com/2013/06/wormhole.png"><img alt="wormhole" src="http://gigaom2.files.wordpress.com/2013/06/wormhole.png?w=708&#038;h=584" width="708" height="584" class="aligncenter size-large wp-image-657677"></a></p>
<p>Of course, like all things Facebook (its <a href="http://gigaom.com/2013/06/06/facebook-unveils-presto-engine-for-querying-250-pb-data-warehouse/">new Presto interactive query engine</a> comes to mind), Wormhole is built to scale. Latency is in the low milliseconds and, as blog post author Laurent Demailly notes:</p>
<blockquote id="quote-wormhole-processes-o"><p>“Wormhole processes over <b>1 trillion</b> messages every day (significantly more than 10 million messages every second). Like any system at Facebook’s scale, Wormhole is engineered to deal with failure of individual components, integrate with monitoring systems, perform automatic remediation, enable capacity planning, automate provisioning and adapt to sudden changes in usage pattern.”</p></blockquote>
<p>Although they were developed within separate companies, there’s actually a tie that binds Yahoo’s Storm-in-Hadoop work and Facebook’s Wormhole. As web companies grow from their initial applications into sprawling businesses composed of numerous applications and services, so too do their infrastructures. To address the differing needs of their various systems at the data level, the companies have begun breaking them down by their latency requirements (i.e., real-time, near real-time and batch, however they choose to word them) and then building tools such as Storm and Wormhole to manage the flow of data between the systems.</p>
<p>We’ve previously explained in some detail <a href="http://gigaom.com/2013/03/03/how-and-why-linkedin-is-becoming-an-engineering-powerhouse/">how LinkedIn</a> and <a href="http://gigaom.com/2013/03/28/3-shades-of-latency-how-netflix-built-a-data-architecture-around-timeliness/">Netflix</a> have built their data architectures around these principles, and we’ll hear a lot more about how they and other web companies are tackling this situation at <a href="http://event.gigaom.com/structure/?utm_source=data&amp;utm_medium=editorial&amp;utm_campaign=intext&amp;utm_term=657636+a-real-time-bonanza-facebooks-wormhole-and-yahoos-streaming-hadoop&amp;utm_content=dharrisstructure">Structure next week</a>. Among the speakers are senior engineers and technology executives from Facebook, Google, LinkedIn, Box, Netflix and Amazon.</p>
<p><em><strong>Update: </strong>This post was updated at 1:46 p.m. to clarify that Twitter is not eliminating Mesos for all its workloads. </em></p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-553555p1.html">Shutterstock user agsandrew</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/06/14/a-real-time-bonanza-facebooks-wormhole-and-yahoos-streaming-hadoop/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/06/shutterstock_122114275.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/06/shutterstock_122114275.jpg?w=150" medium="image">
			<media:title type="html">streaming real time</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/06/img_20130606_120037.jpg?w=708" medium="image">
			<media:title type="html">Krishna Gade talking Storm at the Facebook event.</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/06/wormhole.png?w=708" medium="image">
			<media:title type="html">wormhole</media:title>
		</media:content>
	</item>
		<item>
		<title>Ex-Yahoo CTO launches Altiscale, hardcore Hadoop as a service</title>
		<link>http://gigaom.com/2013/06/13/ex-yahoo-cto-launches-altiscale-hardcore-hadoop-as-a-service/</link>
		<comments>http://gigaom.com/2013/06/13/ex-yahoo-cto-launches-altiscale-hardcore-hadoop-as-a-service/#comments</comments>
		<pubDate>Thu, 13 Jun 2013 13:40:45 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[Altiscale]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=657333</guid>
		<description><![CDATA[Raymie Stata spent seven years working on the guts of Hadoop as a VP, chief architect and CTO at Yahoo. His new Hadoop startup, called Altiscale, has raised $12 million from some prominent investors.]]></description>
				<content:encoded><![CDATA[<p>Raymie Stata knows a lot about Hadoop. It was Stata who helped bring Hadoop creator Doug Cutting to Yahoo in 2006, and during a seven-year stint as chief architect and then CTO at Yahoo, Stata was instrumental in helping position Hadoop <a href="http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/">as the technology famously “behind every click”</a> at the web portal. Now, Stata is trying his hand at the Hadoop startup game, launching a new company called <a href="http://www.altiscale.com/">Altiscale</a> that recently closed a $12 million Series A round from Sequoia Capital and General Catalyst Partners, as well as Accel Partners, Jerry Yang’s AME Ventures and a few individual investors.</p>
<p>Altiscale is in some ways a manifestation of Stata’s seven years of experience helping turn Hadoop from a cute little project into a production system running across 42,000 nodes. It might not be pretty, but it gets the job done. And, thanks to the handful of former senior Yahoo, Google and LinkedIn engineers who joined Stata (who’s the company’s CEO) at Altiscale, the company knows Hadoop cold.</p>
<div id="attachment_657399" class="wp-caption aligncenter" style="width: 540px"><a href="http://gigaom2.files.wordpress.com/2013/06/team-photo.jpg"><img alt="Team Altiscale (Stata is middle row, second from left). Source: Altiscale" src="http://gigaom2.files.wordpress.com/2013/06/team-photo.jpg?w=708"   class="size-full wp-image-657399"></a><p class="wp-caption-text">Team Altiscale (Stata is middle row, second from left). Source: Altiscale</p></div>
<p>The deep knowledge of Hadoop shows itself in the product design and business model. The company is “all Hadoop, all the time,” he explained, and everything — including the hardware and the network — is optimized for particular aspects of Hadoop workloads and operations. Essentially, Stata told me, Altiscale wants to be companies’ Hadoop dial tone — when users need to run a job, the service should just be there ready to do it.</p>
<p>So, although Altiscale is a hosted service, it’s not exactly a<em> cloud</em> service as many people would define it. Rather than charge by the hour, for example, Stata’s experience suggests Hadoop services are best charged based on a monthly baseline usage with room even built in for reasonable overages. This is because companies familiar with Hadoop usually understand their baseline requirements, give or take a handful of additional jobs, and would prefer to be able to budget for that each month.</p>
<p>He compares traditional hourly cloud billing to cell-phone billing in the 1990s: “At the end of the month,” he joked, “you were typically surprised on the wrong side.” Altiscale is more like a wireless plan with a maximum number of minutes per month and some rollover minutes included. In fact, Stata said, “We’re pretty forgiving in terms of the limits. … As long as you’re not abusive, you don’t get charged more for it.”</p>
<p>And unlike many other Hadoop services, Altiscale isn’t immediately <a href="http://gigaom.com/2012/11/28/mortar-data-wants-to-become-a-hadoop-developers-best-friend/">going after developers who want to try their hand at big data</a> or deal with data through a whiz-bang interface. Rather, its initial audience is current Hadoop users (companies and data scientists) who know how the technology works but just want a better way to consume it. Right now, users access Altiscale by SSHing into a “desktop” environment (that’s actually hosted on Amazon Web Services) that gives them access to their favorite Hadoop tools such as MapReduce, Hive, Pig and Flume, as well as to data science tools such as R.</p>
<p>“We call that the scaling down problem,” Stata said.</p>
<p><a href="http://gigaom2.files.wordpress.com/2013/06/infographic.jpg"><img alt="altiscale" src="http://gigaom2.files.wordpress.com/2013/06/infographic.jpg?w=708"   class="aligncenter size-full wp-image-657407"></a></p>
<p>What that means is that it takes a lot of effort to build a true self-service model that greenhorn Hadoop users can dive right into, and Altiscale would be irrelevant if it waited to launch until it had figured that out. Part of that is a design problem, and part of that is a matter of Hadoop being designed to run better at scale. Plus, Stata added, the folks who got to first or second gear with Hadoop and then got stuck are way underserved right now.</p>
<p>However, although Altiscale might be about serving experienced Hadoop users with a more-managed experience, it’s not about serving legacy workloads. A lot of companies are using Hadoop today to somehow perform traditional enterprise data warehouse tasks or tie tightly into existing IT environments, he explained, but “we go after what I call ‘new data problems.’” That means online advertising and any workloads (server log analysis, smart grid data, logistics, etc.) relying heavily on lots of sensor- or machine-generated data that can stream right into Hadoop.</p>
<p>Stata acknowledges it won’t be easy trying to win customers away from established Hadoop vendors such as Cloudera, MapR and Hortonworks (which <a href="http://gigaom.com/2011/06/27/exclusive-yahoo-launching-hadoop-spinoff-this-week/">many of Stata’s former Yahoo comrades founded</a>), but, he told me a few months ago, he thinks it’s very doable. That’s because no matter how easy they make it to manage Hadoop, there’s a class of customers that’s just better served with a cloud service rather than trying to scale their operations staff and energy bill along with their Hadoop cluster.</p>
<p>“Self-managed Hadoop, essentially, is [those vendors’] ultimate goal,” Stata said. “Our goal is to just take on the management responsibility, to take on all those management things the Yahoos and Googles do under the covers and just run Hadoop as a managed service. The winds of change are in our favor.”</p>
<p><em>If you want to hear more about where Hadoop is headed, stop by our <a href="http://event.gigaom.com/structure/schedule/?utm_source=data&amp;utm_medium=editorial&amp;utm_campaign=intext&amp;utm_term=657333+ex-yahoo-cto-launches-altiscale-hardcore-hadoop-as-a-service&amp;utm_content=dharrisstructure">Structure conference</a> next week, where I’ll be discussing that topic with Google Fellow and MapReduce creator Jeff Dean. Other webscale speakers include Facebook VP of Engineering Jay Parikh, Box VP of Engineering Sam Schillace and Amazon CTO Werner Vogels.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/06/13/ex-yahoo-cto-launches-altiscale-hardcore-hadoop-as-a-service/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/06/team-photo1.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/06/team-photo1.jpg?w=150" medium="image">
			<media:title type="html">team-photo</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/06/team-photo.jpg" medium="image">
			<media:title type="html">Team Altiscale (Stata is middle row, second from left). Source: Altiscale</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/06/infographic.jpg" medium="image">
			<media:title type="html">altiscale</media:title>
		</media:content>
	</item>
		<item>
		<title>Under the covers of the NSA&#8217;s big data effort</title>
		<link>http://gigaom.com/2013/06/07/under-the-covers-of-the-nsas-big-data-effort/</link>
		<comments>http://gigaom.com/2013/06/07/under-the-covers-of-the-nsas-big-data-effort/#comments</comments>
		<pubDate>Sat, 08 Jun 2013 02:15:19 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[Accumulo]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Cybersecurity]]></category>
		<category><![CDATA[data privacy]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[intelligence]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[NSA]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[spying]]></category>
		<category><![CDATA[sqrrl]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=655599</guid>
		<description><![CDATA[There's much debate still to be had over the NSA's recently uncovered data-collection practices, but some of the technologies underlying them are out in the open. Here's what we know already.]]></description>
				<content:encoded><![CDATA[<p>The <a href="http://gigaom.com/2013/06/07/through-a-prism-darkly-tracking-the-ongoing-nsa-surveillance-story/">NSA&#8217;s data collection practices</a> have much of America &#8212; and certainly the tech community &#8212; on edge, but sources familiar with the agency&#8217;s technology are saying the situation isn&#8217;t as bad as it seems. Yes, the agency has a lot of data and can do some powerful analysis, but, the argument goes, there are strict limits in place around how the agency can use it and who has access. Whether that&#8217;s good enough is still an open debate, but here&#8217;s what we know about the technology that&#8217;s underpinning all that data.</p>
<h2 id="what-is-accumulo">What is Accumulo?</h2>
<p>The technological linchpin to everything the NSA is doing from a data-analysis perspective is <a href="http://en.wikipedia.org/wiki/Apache_Accumulo">Accumulo</a> &#8212; an open-source database the agency built in order to store and analyze huge amounts of data. Adam Fuchs knows Accumulo well because he helped build it during a nine-year stint with the NSA; he&#8217;s now co-founder and CTO of a company called <a href="http://www.sqrrl.com/">Sqrrl</a> that sells a commercial version of the database system. I spoke with him earlier this week, days before news broke of the NSA collecting data from Verizon and the country&#8217;s largest web companies.</p>
<div id="attachment_655914" class="wp-caption alignright" style="width: 310px"><a href="http://gigaom2.files.wordpress.com/2013/06/fuchs.jpg"><img  alt="fuchs" src="http://gigaom2.files.wordpress.com/2013/06/fuchs.jpg?w=300&#038;h=173" width="300" height="173" class="size-medium wp-image-655914" /></a><p class="wp-caption-text">Adam Fuchs</p></div>
<p>The NSA began building Accumulo in late 2007, Fuchs said, because they were trying to do automated analysis for tracking and discovering new terrorism suspects. &#8220;We had a set of applications that we wanted to develop and we were looking for the right infrastructure to build them on,&#8221; he said.</p>
<p>The problem was those technologies weren&#8217;t available. He liked what projects like HBase were doing by using Hadoop to mimic Google&#8217;s famous BigTable data store, but it still wasn&#8217;t up to the NSA requirements around scalability, reliability or security. So, they began work on a project called CloudBase, which eventually was renamed Accumulo.</p>
<p>Now, Fuchs said, &#8220;It&#8217;s operating at thousands-of-nodes scale&#8221; within the NSA&#8217;s data centers. There are multiple instances, each storing tens of petabytes (1 petabyte equals 1,000 terabytes or 1 million gigabytes) of data, and it&#8217;s the backend of the agency&#8217;s most widely used analytical capabilities. Accumulo&#8217;s ability to handle data in a variety of formats (a characteristic called <a href="http://stackoverflow.com/questions/15589184/what-does-being-schema-less-mean-for-a-nosql-database">&#8220;schemaless&#8221;</a> in database jargon) means the NSA can store data from numerous sources all within the database and add new analytic capabilities in days or even hours.</p>
<p>&#8220;It&#8217;s quite critical,&#8221; he added.</p>
<h2 id="what-the-nsa-can-and-cant-do-w">What the NSA can and can&#8217;t do with all this data</h2>
<p>As I <a href="http://gigaom.com/2013/06/06/heres-how-the-nsa-analyzes-all-that-call-data/">explained on Thursday</a>, Accumulo is especially adept at analyzing trillions of data points in order to build massive graphs that can detect the connections between them and the strength of the connections. Fuchs didn&#8217;t talk about the size of the NSA&#8217;s graph, but he did say the database is designed to handle months or years worth of information and let analysts move from query to query very fast. When you&#8217;re talking about analyzing call records, it&#8217;s easy to see where this type of analysis would be valuable in determining how far a suspected terrorist&#8217;s network might spread and who might be involved.</p>
<p>Stewart Baker, former NSA general counsel under George W. Bush, <a href="http://www.skatingonstilts.com/skating-on-stilts/2013/06/the-fisa-court-order-flap-take-a-deep-breath.html">wrote on his blog Thursday</a> that this type of data could also be used for general pattern recognition &#8212; the kinds of stuff that targeted advertisers love to do. Only, instead of the system serving someone an ad because of what they&#8217;ve been searching for and the operating system they&#8217;re using, Baker presented the hypothetical of &#8220;[an] American who makes a call to Yemen at 11 a.m., Sanaa time, hangs up after a few seconds, and then gets a call from a different Yemeni number three hours later.&#8221;</p>
<p>The big legal question here is around probable cause and whether the government should further investigate this caller based on call patterns similar to those of known terrorists, but the big data question is around false positives. Baker&#8217;s hypothetical might appear pretty cut and dry but, data scientist <a href="http://www.linkedin.com/in/turian">Joseph Turian</a> explains, call records in general probably don&#8217;t offer too strong of a signal and could lead to situations where innocent behavior patterns look a lot like nefarious ones. &#8220;But once you start connecting the dots with other pieces of information you have from other sources,&#8221; he said via email, &#8220;you can start making more predictions.&#8221;</p>
<p>This is where a program like PRISM, the NSA&#8217;s reported effort to collect data straight from the likes of Google, Facebook and Apple, could come into play. If you&#8217;re able to tie a name or web account to a phone number, you can figure out all sorts of information. If you can prove that certain people are radical Islamists, for example, you can start to infer more things about the others in that social graph.</p>
<p>And if Sqrrl&#8217;s capabilities are any indicator of what Accumulo is supporting within the NSA, the agency can perform a lot of simpler functions on its data as well. In addition to graph processing, said Ely Kahn, Sqrrl&#8217;s co-founder and VP of business development, their product includes pre-packaged analytic capabilities around SQL queries and full-text search, and also supports streaming data. This means Sqrrl&#8217;s version can support any number of interesting use cases &#8212; from processing data as it hits the system to keeping a massive index that can be searched in the same way someone searches the web.</p>
<h2 id="how-much-data-is-the-nsa-colle">How much data is the NSA collecting? Follow the money</h2>
<p>We&#8217;re not quite sure how much data the two programs that came to light this week are actually collecting, but the evidence suggests it&#8217;s not that much &#8212; at least from a volume perspective. Take the PRISM program that&#8217;s gathering data from web properties including Google, Facebook, Microsoft, Apple, Yahoo and AOL. It seems the NSA would have to be selective in what it grabs.</p>
<p>Assuming it includes every cost associated with running the program, the $20 million per year allocated to PRISM, <a href="http://www.washingtonpost.com/wp-srv/special/politics/prism-collection-documents/">according to the slides published by the</a> <em>Washington Post</em>, wouldn&#8217;t be nearly enough to store all the raw data &#8212; much less new datasets created from analyses &#8212; from such large web properties. Yahoo alone, I&#8217;m told, was spending over $100 million a year to operate its <a href="http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/">approximately 42,000-node Hadoop environment</a>, consisting of hundreds of petabytes, a few years ago. Facebook users <a href="http://gigaom.com/2012/08/22/facebook-is-collecting-your-data-500-terabytes-a-day/">are generating more than 500 terabytes of new data</a> every day.</p>
<p>Using about the least-expensive option around for mass storage &#8212; cloud storage provider Backblaze&#8217;s <a href="http://gigaom.com/2013/02/20/it-turns-out-a-lot-of-companies-like-building-their-own-storage-gear/">open source storage pod designs</a> &#8212; just storing 500 terabytes of Facebook data a day would cost more than $10 million in hardware alone over the course of a year. Using higher-performance hard drives or other premium gear &#8212; things Backblaze eschews because it&#8217;s concerned primarily about cost and scalability rather than performance &#8212; would cost even more.</p>
<p>Even at the Backblaze price point, though, which is pocket change for the NSA, the agency would easily run over $20 million trying to store too many emails, chats, Skype calls, photos, videos and other types of data from the other companies it&#8217;s working with.</p>
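<p>The back-of-the-envelope math behind that comparison is easy to reproduce. The sketch below uses only the figures cited above (500 terabytes a day, a hardware floor of $10 million a year and the reported $20 million PRISM budget). The variable names are my own, and it assumes decimal units and a 365-day year while ignoring power, staffing, replication and anything else beyond raw hardware.</p>

```python
# Figures quoted in the article
FACEBOOK_TB_PER_DAY = 500          # new Facebook data per day, in terabytes
HARDWARE_COST_FLOOR = 10_000_000   # "more than $10 million" for a year of that data
PRISM_BUDGET = 20_000_000          # reported annual PRISM allocation, in dollars

# Assumptions made for this sketch
DAYS_PER_YEAR = 365
TB_PER_PB = 1_000                  # decimal units, as the article uses

yearly_tb = FACEBOOK_TB_PER_DAY * DAYS_PER_YEAR
yearly_pb = yearly_tb / TB_PER_PB

# Lower bound on hardware cost per terabyte implied by the two figures above
implied_cost_per_tb = HARDWARE_COST_FLOOR / yearly_tb

# How much raw data the whole PRISM budget could hold at that implied price
budget_capacity_tb = PRISM_BUDGET / implied_cost_per_tb
budget_tb_per_day = budget_capacity_tb / DAYS_PER_YEAR

print(f"Facebook per year: {yearly_tb:,} TB ({yearly_pb} PB)")
print(f"Implied hardware cost: ${implied_cost_per_tb:,.2f} per TB")
print(f"Budget covers about {budget_capacity_tb:,.0f} TB, or {budget_tb_per_day:,.0f} TB per day")
```

<p>Run as written, it puts a year of Facebook data at 182,500 terabytes (182.5 petabytes), which implies hardware costs of roughly $55 per terabyte. At that price, the entire $20 million would cover only about 1,000 terabytes a day for a single year.</p>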
<p>Actually, it&#8217;s possible the intelligence community is taking advantage of the Backblaze designs. In September 2011, Backblaze CEO Gleb Budman says, he met with CIA representatives who discussed that agency&#8217;s five-year plan &#8220;to centralize data services into a large private cloud&#8221; and how Backblaze&#8217;s technology might fit into it. Its plans for analyzing this data, as illustrated in the slide below (and <a href="http://gigaom.com/2013/03/20/even-the-cia-is-struggling-to-deal-with-the-volume-of-real-time-social-data/">discussed by CIA CTO Ira &#8220;Gus&#8221; Hunt at Structure: Data</a> in March), seem to mirror what the NSA has in mind.</p>
<p><a href="http://gigaom2.files.wordpress.com/2013/06/cia-big-data.jpg"><img  alt="cia big data" src="http://gigaom2.files.wordpress.com/2013/06/cia-big-data.jpg?w=708&#038;h=531" width="708" height="531" class="aligncenter size-large wp-image-655904" /></a>Whatever type of gear the NSA is using, though, and however much it&#8217;s spending on the Verizon data or PRISM specifically, we do know the agency is spending a lot of money on its data infrastructure. There are those dozens (at least) of petabytes of overall data in Accumulo, and the agency is famously building a 1-million-square-foot, $1.5 billion data center in Utah. It <a href="http://www.datacenterknowledge.com/archives/2013/06/06/nsa-to-build-860-million-hpc-center-in-maryland/">recently began construction on a 600,000-square-foot, $860 million facility</a> in Maryland.</p>
<h2 id="policies-are-in-place">Policies are in place</h2>
<p>Sqrrl&#8217;s Kahn &#8212; who previously served as director of cybersecurity strategy at the National Security Staff in the White House &#8212; says even with all the effort it&#8217;s putting into data collection and analysis, the NSA really is concerned about privacy. Not only are there strict administrative and legal limitations in place about when the agency can actually search through collected data (something Stewart Baker <a href="http://www.skatingonstilts.com/skating-on-stilts/2013/06/stewart-baker-fisa-nsa-law.html">explains in more detail</a> in a Friday blog post), but Accumulo itself was designed with privacy in mind.</p>
<p>The system itself is designed to make sure there&#8217;s not a free-for-all on data, another individual familiar with Accumulo said.</p>
<p>It has what Kahn and Sqrrl CTO Fuchs described as &#8220;cell-level&#8221; security, meaning administrators can manage access to individual pieces of data within a table. Furthermore, Fuchs explained, those policies stick with the data as it&#8217;s transformed as part of the analysis process, so someone prohibited from seeing it won&#8217;t be able to see it just because it&#8217;s now part of a different dataset. When data would come into the NSA from the CIA, he said, there were policies in place around who could see it, and Accumulo helped enforce them.</p>
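<p>To make the idea concrete, here is a minimal sketch of cell-level visibility and of labels that travel with derived data. It is not Accumulo&#8217;s API: the <code>Cell</code> class, the <code>visible_cells</code> and <code>derive</code> functions, the labels and the sample records are all hypothetical, and it assumes the simplest possible rule, namely that a reader must hold every label attached to a cell. Accumulo&#8217;s own visibility expressions are, as I understand it, richer than that.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    labels: frozenset  # every label here is required to read the cell


def visible_cells(cells, authorizations):
    """Return only the cells whose labels are all held by the reader."""
    auths = frozenset(authorizations)
    return [cell for cell in cells if cell.labels.issubset(auths)]


def derive(row, column, value, sources):
    """Build a new cell that inherits the labels of every source cell."""
    inherited = frozenset().union(*(cell.labels for cell in sources))
    return Cell(row, column, value, inherited)


# Hypothetical data: one cell readable with the "nsa" label alone,
# one that also requires the "cia" label.
table = [
    Cell("record-1", "caller", "555-0100", frozenset({"nsa"})),
    Cell("record-1", "source_note", "from partner feed", frozenset({"nsa", "cia"})),
]

# A derived cell built from both inputs keeps the stricter combined labels.
summary = derive("summary-1", "note", "record-1 has a partner-feed note", table)

nsa_only = visible_cells(table + [summary], {"nsa"})
both = visible_cells(table + [summary], {"nsa", "cia"})
print([cell.column for cell in nsa_only])
print([cell.column for cell in both])
```

<p>In this toy version, a reader holding only the first label sees the caller cell and nothing else; the derived summary stays hidden because it inherited the stricter label from one of its inputs, which is the &#8220;policies stick with the data&#8221; behavior Fuchs described.</p>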
<p>Even agencies within the Department of Homeland Security are using or experimenting with Accumulo, Kahn added, because <a href="http://gigaom.com/2012/04/11/cispa-isnt-sopa-but-it-isnt-ideal-and-it-might-become-law/">proposed legislation</a> would put them in charge of ensuring privacy as cybersecurity data changes hands between the government and private corporations.</p>
<p>It&#8217;s ironic, he acknowledged, but Accumulo actually flips the presumed paradigm that stricter security and privacy regulations mean less sharing. That might be a shallow victory for citizens concerned about their civil liberties, but data collection and sharing don&#8217;t seem likely to stop any time soon. At least it&#8217;s something.</p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/06/07/under-the-covers-of-the-nsas-big-data-effort/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/02/shutterstock_37622056.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/02/shutterstock_37622056.jpg?w=150" medium="image">
			<media:title type="html">sql statement</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/06/fuchs.jpg?w=300" medium="image">
			<media:title type="html">fuchs</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/06/cia-big-data.jpg?w=708" medium="image">
			<media:title type="html">cia big data</media:title>
		</media:content>
	</item>
		<item>
		<title>Here&#8217;s how the NSA analyzes all that call data</title>
		<link>http://gigaom.com/2013/06/06/heres-how-the-nsa-analyzes-all-that-call-data/</link>
		<comments>http://gigaom.com/2013/06/06/heres-how-the-nsa-analyzes-all-that-call-data/#comments</comments>
		<pubDate>Thu, 06 Jun 2013 22:13:39 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[Accumulo]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[data privacy]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[graph analysis]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[intelligence]]></category>
		<category><![CDATA[NSA]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=654984</guid>
		<description><![CDATA[How does the NSA analyze all the data it's collecting from cell phone users? With a massive database system built with just such scale and workloads in mind.]]></description>
				<content:encoded><![CDATA[<p>The National Security Agency might not have the names of Verizon&#8217;s wireless customers, but the agency probably can figure out what they&#8217;re up to if it&#8217;s so inclined. The metadata Verizon has provided the NSA &#8212; phone numbers, numbers called, duration of calls, location &#8212; is a veritable treasure trove to an organization with the right analytic skills and the right tools. The NSA has both.</p>
<p>There are numerous methods the NSA could use to extract some insights from what must be a mind-blowing number of phone calls and text messages, but graph analysis is likely the king. As <a href="http://gigaom.com/2013/05/14/were-witnessing-the-rise-of-the-graph-in-big-data/">we&#8217;ve explained numerous times over the past few months</a>, graph analysis is ideal for identifying connections among pieces of data. It&#8217;s what powers social graphs, product recommendations and even some fairly complex medical research.</p>
<div id="attachment_645089" class="wp-caption aligncenter" style="width: 677px"><a href="http://gigaom2.files.wordpress.com/2013/05/lnkdmap-1.jpg"><img  alt="My LinkedIn social graph" src="http://gigaom2.files.wordpress.com/2013/05/lnkdmap-1.jpg?w=708"   class="size-full wp-image-645089" /></a><p class="wp-caption-text">My LinkedIn social graph</p></div>
<p>But now it has really come to the fore as a tool for fighting crime (or intruding on civil liberties, however you want to look at it). The NSA is storing all those Verizon (and, presumably, other carrier) records in a <a href="http://en.wikipedia.org/wiki/Apache_Accumulo">massive database system called Accumulo</a>, which it built itself (on top of Hadoop) a few years ago because there weren&#8217;t any other options suitable for its scale and requirements around stability or security. The NSA is currently storing tens of petabytes of data in Accumulo.</p>
<p><strong>For a more thorough description of Accumulo and the NSA infrastructure, read our post <a href="http://gigaom.com/2013/06/07/under-the-covers-of-the-nsas-big-data-effort/">&#8220;Under the covers of the NSA&#8217;s big data effort.&#8221;</a></strong></p>
<p>In graph parlance, vertices are the individual data points (e.g., phone numbers or social network users) and edges are the connections among them. In late May, the NSA <a href="http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf">released a slide presentation</a> detailing how fast Accumulo is able to process a 4.4-trillion-node, 70-trillion-edge graph. By way of comparison, the graph behind Facebook&#8217;s Graph Search feature <a href="https://www.facebook.com/notes/facebook-engineering/under-the-hood-indexing-and-ranking-in-graph-search/10151361720763920">contains billions of nodes and trillions of edges</a>. (In the low trillions, from what I understand.)</p>
<p>So, yes, the NSA is able to easily analyze the call and text-message records of hundreds of millions of mobile subscribers. It&#8217;s also <a href="http://www.datacenterknowledge.com/archives/2013/06/06/nsa-to-build-860-million-hpc-center-in-maryland/">building out some massive data center real estate</a> to support all the data it&#8217;s collecting.</p>
<p><a href="http://gigaom2.files.wordpress.com/2013/06/nsa.jpg"><img  alt="nsa" src="http://gigaom2.files.wordpress.com/2013/06/nsa.jpg?w=708&#038;h=530" width="708" height="530" class="aligncenter size-large wp-image-655344" /></a></p>
<p>How might a graph analysis work within the NSA? The easy answer, <a href="http://arstechnica.com/tech-policy/2013/06/white-house-spying-on-us-citizens-critical-tool-for-fighting-terror/#p3">which the government has acknowledged</a>, is to figure out who else is in contact with suspected terrorists. If there&#8217;s a strong connection between you and Public Enemy No. 1, the NSA will find out and get to work figuring out who you are. That could be via a search warrant or wiretap authorization, or it could conceivably <a href="http://gigaom.com/2013/03/28/when-theres-no-such-thing-as-anonymous-data-does-privacy-just-mean-security/">figure out who someone likely is by using location data</a>.</p>
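<p>For a sense of what &#8220;who else is in contact&#8221; means computationally, here is a toy sketch of contact chaining over call records. Everything in it is an assumption for illustration: the records are made up, the function names are mine, and a breadth-first search over an in-memory dictionary is simply the textbook way to find neighbors within a few hops, not a description of how Accumulo or the NSA actually does it.</p>

```python
from collections import defaultdict, deque

# Hypothetical call records: (caller, callee). Real metadata would also carry
# duration, time and location, which this sketch ignores.
call_records = [
    ("A", "B"),
    ("B", "C"),
    ("C", "D"),
    ("A", "E"),
    ("F", "G"),
]


def build_graph(records):
    """Vertices are phone numbers; an undirected edge joins two numbers that called."""
    graph = defaultdict(set)
    for caller, callee in records:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph


def contacts_within(graph, start, max_hops):
    """Breadth-first search returning {number: hops} for numbers near start."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        number = queue.popleft()
        if hops[number] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph[number]:
            if neighbor not in hops:
                hops[neighbor] = hops[number] + 1
                queue.append(neighbor)
    del hops[start]  # the starting number is not its own contact
    return hops


graph = build_graph(call_records)
print(sorted(contacts_within(graph, "A", 2).items()))
```

<p>Starting from number A with a two-hop limit, the search reaches B and E directly and C through B, while D (three hops away) and the unconnected pair F and G never show up. Scale the same idea to trillions of edges and you have the kind of workload the NSA slides describe.</p>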
<p>Having such a big database of call records also provides the NSA with an easy way to go back and find out information about someone should their number pop up in a future investigation. Assuming the number is somewhere in their index, agents can track it down and get to work figuring out who it&#8217;s related to and from where it has been making calls.</p>
<p>Presumably, agents could begin with location data, too. If a bomb went off at Location X, bringing up all the numbers making calls from towers in that area might be a good starting point for investigation. Tracking someone&#8217;s movement from location data could be helpful, too.</p>
<p>If this all sounds a little creepy, maybe it should. After all, the world&#8217;s biggest, baddest intelligence agency can pretty much figure out who you are, who you know and where you go. And unlike web and retail companies that <a href="http://gigaom.com/2013/06/05/will-the-latest-nsa-surveillance-scandal-be-a-wake-up-call-for-the-power-of-data/">collect and analyze so much data about us</a>, the government can put you in jail.</p>
<p>It might be even creepier when you consider <a href="http://gigaom.com/2012/01/24/supreme-court-sidesteps-digital-privacy-for-now/">how much other data law enforcement agencies can collect about you</a> without a warrant.</p>
<p>However, someone familiar with NSA policy told me, the good news is that the vast majority of people are still anonymous even in this sea of data: There&#8217;s just too much data to care until someone pops up in the bad guys&#8217; networks or gets on the agency&#8217;s radar.</p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/06/06/heres-how-the-nsa-analyzes-all-that-call-data/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/06/nsa1.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/06/nsa1.jpg?w=150" medium="image">
			<media:title type="html">nsa</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/lnkdmap-1.jpg" medium="image">
			<media:title type="html">My LinkedIn social graph</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/06/nsa.jpg?w=708" medium="image">
			<media:title type="html">nsa</media:title>
		</media:content>
	</item>
		<item>
		<title>Cloudera adds search to Hadoop distro and says it&#8217;s just getting started</title>
		<link>http://gigaom.com/2013/06/04/cloudera-adds-search-to-hadoop-distro-and-says-its-just-getting-started/</link>
		<comments>http://gigaom.com/2013/06/04/cloudera-adds-search-to-hadoop-distro-and-says-its-just-getting-started/#comments</comments>
		<pubDate>Tue, 04 Jun 2013 17:01:41 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[Apache Solr]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[enterprise-search]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[SOLR]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=653888</guid>
		<description><![CDATA[Cloudera's new search feature, based on the Apache Solr project, is the latest move by the company to expand the utility of its Hadoop distribution. It's also far from the last.]]></description>
				<content:encoded><![CDATA[<p>According to Cloudera CEO Mike Olson, his company has &#8220;decades&#8221; in front of it in which to enhance its Hadoop platform to become the go-to place for data storage and analysis. At a Tuesday event in San Francisco, Cloudera announced the latest feature meant to further that strategy &#8212; full-text search. It comes just weeks after the company&#8217;s Impala interactive SQL query engine <a href="http://gigaom.com/2013/04/30/with-impala-now-ga-clouderas-ceo-sizes-up-the-sql-on-hadoop-market/">became publicly available</a>.</p>
<p>The general idea behind adding search (something competitor MapR <a href="http://gigaom.com/2013/05/01/mapr-releases-m7-its-commercial-hbase-distro/">actually did in May</a>) is to let people without deep technical skills find the information they need within a Hadoop cluster in a way that&#8217;s familiar to them. &#8220;You don&#8217;t even have to understand what SQL is. You can just type words into a box,&#8221; Olson said during a recent phone call, comparing Cloudera&#8217;s search to the process of finding information online or within your Gmail history.</p>
<div id="attachment_603561" class="wp-caption alignright" style="width: 310px"><a href="http://gigaom2.files.wordpress.com/2013/01/1z5o1503.jpg"><img  alt="Structure Data 2012: Michael Olson – CEO, Cloudera" src="http://gigaom2.files.wordpress.com/2013/01/1z5o1503.jpg?w=300&#038;h=200" width="300" height="200" class="size-medium wp-image-603561" /></a><p class="wp-caption-text">Cloudera CEO Mike Olson at Structure: Data 2012<br />(c) 2012 Pinar Ozger pinar@pinarozger.com</p></div>
<p>&#8220;Think about it,&#8221; he added. &#8220;You got a petabyte of data, you can&#8217;t use folders anymore.&#8221;</p>
<p>Even though search is easier than SQL, it seems pretty obvious that hourly workers and front-desk staff probably won&#8217;t be rooting around in Hadoop searching for data (although that&#8217;s possible in theory if the right application were in place).</p>
<p>However, a couple of examples from the search feature&#8217;s private beta users (it&#8217;s now available in public beta and will be generally available in the third quarter) help illustrate what Olson is talking about and how it might apply in corporate settings. Agri-business giant Monsanto is using search to help index &#8212; and later find information from &#8212; its collections of images that track plant characteristics through their lifecycle, a process that used to require lots of manual work within a database not designed to handle images and metadata. Health care customer Explorys is using Cloudera&#8217;s search tool to consolidate and index its server logs so it can track down IT issues more easily and maintain SLAs for its applications.</p>
<p>Discussing MapR&#8217;s new search feature in April, VP of Marketing Jack Norris suggested a use case wherein users might use MapReduce to cluster a group of customers and then use search to drill down further into their behavior.</p>
<p>Cloudera&#8217;s search is powered by the <a href="http://lucene.apache.org/solr/">Apache Solr project</a>, which happens to be based on the Apache Lucene project that Cloudera Chief Architect Doug Cutting founded before <a href="http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/">he founded Hadoop</a>. Exact features of Cloudera Search, as well as a quote from private beta user Dell, are <a href="http://www.marketwire.com/press-release/cloudera-democratizes-apache-hadoop-enterprise-end-users-with-open-source-interactive-1798110.htm">available in the product&#8217;s press release</a>. MapR&#8217;s search is powered by <a href="http://www.lucidworks.com">LucidWorks</a>, a commercial search platform based on the Solr and Lucene projects.</p>
<p><a href="http://gigaom2.files.wordpress.com/2013/06/search.jpg"><img  alt="search" src="http://gigaom2.files.wordpress.com/2013/06/search.jpg?w=708&#038;h=390" width="708" height="390" class="aligncenter size-large wp-image-654152" /></a></p>
<p>Olson said Cloudera is dedicated to continually improving the capabilities of Solr now that it&#8217;s officially part of the company&#8217;s Hadoop distribution. When asked about predictive and semantic search like consumers now experience with Google and Microsoft Bing, he pointed to a feature called <a href="http://www.cloudera.com/content/cloudera/en/products/cloudera-navigator.html">Navigator</a> &#8212; which keeps track of who touched pieces of data, what systems it passed through, what types of queries people run on it and various other attributes &#8212; as the possible foundation for such features in an enterprise environment. He&#8217;s not sure exactly what that might look like in practice, but, Olson added, &#8220;I think there&#8217;s lots of opportunity for advancement there.&#8221;</p>
<h2 id="one-platform-to-rule-them-all">One platform to rule them all?</h2>
<p>The bigger picture here, though, is the encroachment of the open source Hadoop technology &#8212; whether sold by Cloudera, Hortonworks, MapR or whomever &#8212; into the lucrative data management and analytics space once (and still) dominated by vendors selling expensive software and big-iron systems. For now, Olson said, technologies like Cloudera Impala and the new search feature will be less functional than their legacy counterparts (Teradata for data warehousing and Autonomy for enterprise search, for example), but that could change over time.</p>
<p>&#8220;We have decades of life in front of this company in order to enhance [our platform],&#8221; Olson said.</p>
<p>Further, inertia can kick in as companies place more data into Hadoop, making it less appealing to move that data onto another system if the work can just as easily be done within Hadoop. When it comes to how many workloads and how much money Hadoop could ultimately steal from legacy vendors, Olson &#8212; <a href="http://gigaom.com/2013/05/29/why-hortonworks-is-riding-a-faster-hive-to-the-bitter-end/">like his peers at other Hadoop vendors</a> &#8212; is hedging for the time being. &#8220;I don&#8217;t want to make audacious and unsupportable claims,&#8221; he said. &#8220;&#8230; We can all make up numbers.&#8221;</p>
<p>However, he did take some credit for Teradata&#8217;s recent lackluster quarter(s), stating that even though Cloudera customers aren&#8217;t ripping out their legacy systems, they&#8217;re also not really investing more money into them. &#8220;It is true to say folks are looking at what they&#8217;re running on Teradata and rationalizing those decisions,&#8221; Olson said. &#8220;&#8230; [They're trying to] concentrate first-class spend on a first-class workload.&#8221;</p>
<p>For more on how the SQL-on-Hadoop community, specifically, intends to take on the legacy vendors, check out this panel from our Structure: Data conference in March.</p>
<span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='604' height='370' src='http://www.youtube.com/embed/neo6TE41I8I?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0'></iframe></span>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/06/04/cloudera-adds-search-to-hadoop-distro-and-says-its-just-getting-started/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/01/1z5o1503.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/01/1z5o1503.jpg?w=150" medium="image">
			<media:title type="html">Structure Data 2012: Michael Olson – CEO, Cloudera</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/01/1z5o1503.jpg?w=300" medium="image">
			<media:title type="html">Structure Data 2012: Michael Olson – CEO, Cloudera</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/06/search.jpg?w=708" medium="image">
			<media:title type="html">search</media:title>
		</media:content>
	</item>
		<item>
		<title>Why Hortonworks is riding a faster Hive to the bitter end</title>
		<link>http://gigaom.com/2013/05/29/why-hortonworks-is-riding-a-faster-hive-to-the-bitter-end/</link>
		<comments>http://gigaom.com/2013/05/29/why-hortonworks-is-riding-a-faster-hive-to-the-bitter-end/#comments</comments>
		<pubDate>Wed, 29 May 2013 23:14:51 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[analytics]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[data warehouse]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[ebay]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[Hortonworks]]></category>
		<category><![CDATA[SQL on Hadoop]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=650170</guid>
		<description><![CDATA[While the rest of the Hadoop world is trying to distance itself from Hive with new interactive engines, Hortonworks is trying to make it faster. It might actually be a sound strategy.]]></description>
				<content:encoded><![CDATA[<p>Hortonworks isn’t about to get off the Apache Hadoop elephant just because everyone around it is now trying to <a href="http://gigaom.com/2013/04/30/with-impala-now-ga-clouderas-ceo-sizes-up-the-sql-on-hadoop-market/">ride impalas</a>. The company released version 1.3 of its Hortonworks Data Platform on Wednesday, a major aspect of which is an improved iteration of <a href="http://hive.apache.org/">Apache Hive</a> that the company claims runs 50 times faster than the previous version. Over the next year or so, Hortonworks expects to improve the speed of Hive by 100x its previous limits — this while its competitors are all but leaving Hive in the dust in favor of newer, faster analytic systems.</p>
<p>If you’re unfamiliar with Hive, it’s a project that Facebook developed in 2008 to make Hadoop function more like a traditional enterprise data warehouse. Hive stores data inside the Hadoop Distributed File System in a structured format, and then allows users to query it using a language very similar to SQL. Until very recently, Hive has been the de facto method for querying (in a traditional sense) data stored in Hadoop, and it has proven immensely popular as more companies have begun tackling their big data woes with Hadoop.</p>
<h2 id="hive-wasnt-built-for-speed">Hive wasn’t built for speed</h2>
<p>However, as more companies got used to Hadoop, they also began to notice its shortcomings. One of them is around MapReduce, a powerful but not-exactly-speedy method of processing data that requires running the job across every node in the cluster in order to find the right data. Although the Hive interface is that of a SQL query, it relies on MapReduce as the processing engine.</p>
<p>(For more on how Hadoop and its flavor of MapReduce came to be, read <a href="http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/">this post on the history of Hadoop</a>. To see me speak with Google Fellow and MapReduce creator Jeff Dean about how far Google has moved from a MapReduce-centric computing model, <a href="http://event.gigaom.com/structure/?utm_source=data&amp;utm_medium=editorial&amp;utm_campaign=intext&amp;utm_term=650170+why-hortonworks-is-riding-a-faster-hive-to-the-bitter-end&amp;utm_content=dharrisstructure">come to Structure next month</a>.)</p>
<p>Users wanted faster, more-interactive query processing on top of Hadoop, similar to what they had grown accustomed to with data warehouse systems such as Teradata, Greenplum and Netezza. Hadoop vendors such as Cloudera (with Impala), MapR (with <a href="http://gigaom.com/2012/08/17/for-fast-interactive-hadoop-queries-drill-may-be-the-answer/">Drill</a>), IBM (with <a href="http://gigaom.com/2013/05/06/look-ibm-is-doing-sql-on-hadoop-too/">Big SQL</a>) — as well as <a href="http://gigaom.com/2013/02/21/sql-is-whats-next-for-hadoop-heres-whos-doing-it/">a spate of startups</a> — have obliged with their own new technologies that in various ways blend the familiarity of SQL with the scalability of Hadoop. EMC Greenplum, now Pivotal, has <a href="http://gigaom.com/2013/02/25/emc-to-hadoop-competition-see-ya-wouldnt-wanna-be-ya/">transplanted its existing database system</a> inside of Hadoop.</p>
<p>Even <a href="http://www.qubole.com/">Qubole,</a> a cloud-based startup from Hive creators Ashish Thusoo and Joydeep Sen Sarma, is <a href="http://gigaom.com/2013/04/23/hadoop-startup-qubole-raises-7m-for-hive-as-a-service/">keeping an eye on how projects such as Impala and Shark</a> (from <a href="http://gigaom.com/2013/04/17/welcome-to-berkeley-where-hadoop-isnt-nearly-fast-enough/">the University of California, Berkeley’s AMPLab</a>) might factor into its plans.</p>
<h2 id="giving-hive-a-better-stinger">Giving Hive a better “Stinger”</h2>
<p>Hortonworks, the Yahoo spinoff dedicated to driving the Apache Hadoop bus, is sticking with Hive. But it has a plan, and a point.</p>
<p>Essentially, VP of Products Bob Page told me during a recent briefing, “It just makes more sense from our view to have everything done in one place.” He means that Hive is the method by which most people are already comfortable using SQL to access Hadoop data, so there’s no use rocking the boat by adding yet another technology into the mix. Hortonworks will just make Hive faster to the point (100x) where it’s at least in the ballpark of what these entirely new systems are capable of doing, but where users still use the same tools for interactive and batch queries.</p>
<p>It has in place a three-phase plan, under the <a href="http://hortonworks.com/stinger/">“Stinger” codename</a>, in order to make this happen. The first phase, now available as part of the Hive 0.11 release, is a new set of analytic functions and a columnar file format that Page says has resulted in a 50x performance increase over the previous version. The next phase is to move <del>YARN</del> Hive off of MapReduce and onto a still-under-development processing framework called <a href="http://wiki.apache.org/incubator/TezProposal">Tez</a>.</p>
<p><a href="http://gigaom2.files.wordpress.com/2013/05/stinger.png"><img alt="stinger" src="http://gigaom2.files.wordpress.com/2013/05/stinger.png?w=708"   class="aligncenter size-full wp-image-650283"></a>“You’ll see phase two come to bear later this year,” Page said, once <a href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">YARN</a> — a new resource manager that lets Hadoop clusters run multiple processing engines simultaneously — is ready for production.</p>
<p>The third phase is a whole new vector query engine for Hive and new tools for intelligent query planning. Page didn’t have a target date in mind for that phase, except to note that “we’re not talking about a five-year cycle.”</p>
<h2 id="sql-isnt-the-end-game-for-hado">SQL isn’t the end game for Hadoop</h2>
<p>It would be easy to dismiss Page’s and Hortonworks’ optimism about Stinger as a sweet lemons type of rationalization — the company was founded around Apache Hadoop and can’t really go about developing entirely new products outside that foundation — but they also appear to have their eyes focused on a future where SQL isn’t too big a differentiator.</p>
<p>SQL is how people who have worked with data for the last 30 years can see where Hadoop fits in their environment, Page said, but the compelling thing about Hadoop “is it really unlocks a new way about how one thinks about storing and processing data.” Once YARN is ready to go, he added, there will be new avenues of innovation in areas like graph analysis and stream processing.</p>
<p>Page comes from a place of credibility when he talks about this evolution in thinking. Before coming to Hortonworks in March, he was vice president of analytics platform and delivery at eBay, <a href="http://gigaom.com/2012/01/31/under-the-covers-of-ebays-big-data-operation/">a company that knows its way around big data</a>. When people get all their data in one place, they want to do more things with it, he explained. The thinking becomes less about using Hadoop to lower cost and more about “How do I use Hadoop to increase my top line?”</p>
<p>Besides, Page noted (echoing the sentiment of just about everybody else in the Hadoop space, <a href="http://gigaom.com/2013/04/30/with-impala-now-ga-clouderas-ceo-sizes-up-the-sql-on-hadoop-market/">including Cloudera CEO Mike Olson</a>), even as companies turn Hadoop into their primary data store, it’s difficult to see Hadoop ever entirely replacing high-value relational data warehouse systems like Teradata. One could argue, then, that there’s no real purpose in trying too hard to match those systems in terms of capabilities.</p>
<p>eBay, he said, ran an in-depth analysis to see if it was economically or technologically feasible to collapse its big data workloads onto a single system. eBay has dozens of petabytes stored in Hadoop and <a href="http://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/">possibly more within various Teradata appliances</a>. The result: “We just couldn’t find a way in which we could justify collapsing everything we do into one system.”</p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-486163p1.html">Shutterstock user vblinov</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/05/29/why-hortonworks-is-riding-a-faster-hive-to-the-bitter-end/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/05/shutterstock_57327433.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/05/shutterstock_57327433.jpg?w=150" medium="image">
			<media:title type="html">wasps nest</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/stinger.png" medium="image">
			<media:title type="html">stinger</media:title>
		</media:content>
	</item>
		<item>
		<title>WibiData gets $15M to help it become the Hadoop application company</title>
		<link>http://gigaom.com/2013/05/23/wibidata-gets-15m-to-help-it-become-the-hadoop-application-company/</link>
		<comments>http://gigaom.com/2013/05/23/wibidata-gets-15m-to-help-it-become-the-hadoop-application-company/#comments</comments>
		<pubDate>Thu, 23 May 2013 11:31:17 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[big data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Hbase]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[OPower]]></category>
		<category><![CDATA[predictive analytics]]></category>
		<category><![CDATA[WibiData]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=648663</guid>
		<description><![CDATA[Startup WibiData has raised another $15 million and wants to turn the lessons it has learned in the field into generic software that can let anyone build predictive applications on Hadoop.]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.wibidata.com/">WibiData</a> &#8212; the big data startup from Cloudera Co-founder Christophe Bisciglia and Aaron Kimball &#8212; doesn&#8217;t have <em>overly</em> big plans. It only wants to become one of the first, if not the first, company selling off-the-shelf software that lets other companies build valuable, customer-facing applications on Hadoop. On Thursday, WibiData announced $15 million in Series B funding from Canaan Partners, as well as existing investors NEA and Google Chairman Eric Schmidt, to help make the goal a reality. </p>
<p>Kidding aside, that&#8217;s actually quite an ambitious goal in a Hadoop market that&#8217;s big and growing, but that&#8217;s characterized by expensive consulting arrangements and purpose-built applications. Even more so for companies that want to do something other than transform unstructured data into structured data (often called ETL) or run back-office analytics jobs. In fact, WibiData has spent the last 18 months doing just this type of deal, and Bisciglia says every single customer has already engaged with one of the big three Hadoop vendors (Cloudera, Hortonworks and MapR). </p>
<p>Home energy-management startup <a href="http://gigaom.com/2012/11/19/opower-the-big-data-energy-player-to-beat/">Opower</a> is a good example of this process. It&#8217;s actually one of Cloudera&#8217;s banner customers, but &#8220;when they wanted to take [their software-as-a-service tool] beyond batch analysis and ETL workloads,&#8221; Bisciglia said, Opower came to WibiData. So whereas the Opower service was originally focused on nightly data analysis comparing users&#8217; energy usage against that of other users, it&#8217;s now working on dynamic recommendations for users and letting them engage with the application in new ways.</p>
<div id="attachment_648685" class="wp-caption alignright" style="width: 310px"><a href="http://gigaom2.files.wordpress.com/2013/05/wibi-kiji.jpg"><img  alt="The WibiData architecture" src="http://gigaom2.files.wordpress.com/2013/05/wibi-kiji.jpg?w=300&#038;h=224" width="300" height="224" class="size-medium wp-image-648685" /></a><p class="wp-caption-text">The WibiData architecture</p></div>
<p>During these engagements, WibiData <a href="http://gigaom.com/2012/03/22/wibidata-structure-data-2012/">has been building up its core technology</a> for connecting those brawny back-office Hadoop environments to predictive customer-facing applications &#8211; a collection of HBase, data-formatting tools and machine learning algorithms that the company <a href="http://gigaom.com/2012/11/14/wibidata-open-sources-kiji-to-make-hbase-more-useful/">has been slowly open-sourcing under the Kiji banner</a>. It has also been learning the similarities among the applications it&#8217;s building for customers in the same field, figuring out what&#8217;s repeatable. What does any given company in the retail space, for example, need to get started on <a href="http://gigaom.com/2013/05/08/why-3-celebrity-data-scientists-are-willing-to-work-for-free-for-you/">its own recommendation engine</a>? </p>
<p>And now, Bisciglia says, WibiData is going to double down on building application software based on what it has learned. The first two industries it targets will likely be financial services and retail, two areas where the company has seen a lot of traction. He envisions the finished product including some pre-defined schema for formatting data and some pre-built predictive models, both broadly applicable across that industry rather than specific to a single user. </p>
<p>There will also be different interfaces that allow different types of users (e.g., data scientists, systems engineers and business users) to interact with the data in the ways they need to. </p>
<p>Time will tell if WibiData can actually accomplish its goal of turning Hadoop into a collection of somewhat specialized software packages, but someone has to. Even industry heavyweights like Cloudera see the need, but their hands are full just getting Hadoop integrated into existing environments and getting those early uses up and running. As Cloudera CEO Mike Olson <a href="http://gigaom.com/2012/03/21/cloudera-structure-data-2012/">said at Structure: Data in 2012</a> to anyone ambitious enough to tackle the Hadoop-application gap, &#8220;Call me, I’ll connect you with funding. The money is out there.&#8221; </p>
<p>If you want to hear more about the need for Hadoop applications, check out this panel from Structure: Data 2013, where I speak with WibiData&#8217;s Omer Trajman, Continuuity&#8217;s Jonathan Gray and Pivotal&#8217;s Muddu Sudhakar. <span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='604' height='370' src='http://www.youtube.com/embed/z7BhGEQX9BQ?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0'></iframe></span></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/05/23/wibidata-gets-15m-to-help-it-become-the-hadoop-application-company/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/05/wibi-founders.png?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/05/wibi-founders.png?w=150" medium="image">
			<media:title type="html">wibi founders</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/wibi-kiji.jpg?w=300" medium="image">
			<media:title type="html">The WibiData architecture</media:title>
		</media:content>
	</item>
		<item>
		<title>Concurrent is building a Hadoop assembly line in open source</title>
		<link>http://gigaom.com/2013/05/22/concurrent-is-building-a-hadoop-assembly-line-in-open-source/</link>
		<comments>http://gigaom.com/2013/05/22/concurrent-is-building-a-hadoop-assembly-line-in-open-source/#comments</comments>
		<pubDate>Wed, 22 May 2013 19:21:16 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[big data]]></category>
		<category><![CDATA[Cascading]]></category>
		<category><![CDATA[Concurrent]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lingual]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[Pattern]]></category>
		<category><![CDATA[predictive analytics]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[statistical analysis]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=648186</guid>
		<description><![CDATA[Cascading creator Concurrent has developed a new open source tool called Pattern for running machine learning models on Hadoop clusters. When combined with its SQL tool called Lingual, users can move data from one stage to another easily.]]></description>
				<content:encoded><![CDATA[<p>If you know Java, R or SAS, doing machine learning on Hadoop data just got a lot easier. <a href="http://www.concurrentinc.com/">Concurrent</a> <em>(see disclosure)</em>, the company behind the popular <a href="http://www.cascading.org/">Cascading</a> framework for writing big data jobs, has developed a new open source tool called <a href="http://www.cascading.org/pattern/">Pattern</a> that lets users export their models from statistical analysis applications and run them at scale on Hadoop data with little to no code change.</p>
<p>The reason for creating Pattern is pretty simple, according to Concurrent founder and CTO Chris Wensel: &#8220;Hadoop is never used alone.&#8221; It&#8217;s always part of a data environment that also includes databases, visualization tools, analytics software and/or statistical analysis tools that arguably do the really valuable work. Hadoop&#8217;s real value is as an integration platform that can feed data into these other systems and, ideally, put their outputs to work across much larger datasets.</p>
<p>Developers <em>can</em> use the Pattern Java API to create machine learning jobs, but they can also simply export a Predictive Model Markup Language (PMML) file from software like R, SAS and MicroStrategy that Pattern will read and run as a Cascading workflow. Models are useless unless you can run them in production, Wensel said, and Pattern lets them run across more data, stored in Hadoop, than you can use to build them with those other tools.</p>
<p>However, Wensel noted, &#8220;The real takeaway isn&#8217;t Pattern itself.&#8221;</p>
<p>From his perspective, the real story is Pattern plus Cascading plus <a href="http://www.cascading.org/lingual/">Lingual</a>, the open source SQL-to-Hadoop tool that Concurrent recently developed and released. Lingual is the tie that binds everything together, creating a sort of assembly line for data as it works its way from generation to delivering some value. For example, someone might create a Cascading job that adds structure to incoming data, and then pull some of the data into R using Lingual. Once a model is created in R and exported to the Hadoop cluster using Pattern, Lingual can feed the MapReduce output file back to R so a data scientist can test the model&#8217;s accuracy.</p>
<p><a href="http://gigaom2.files.wordpress.com/2013/05/arch-diagram.png"><img  alt="arch-diagram" src="http://gigaom2.files.wordpress.com/2013/05/arch-diagram.png?w=708"   class="aligncenter size-full wp-image-648347" /></a></p>
<p>And actually, Wensel said, Lingual could have a positive effect on companies&#8217; bottom lines. Airbnb recently replaced a departed engineer with Lingual for monthly migrations of data from Hadoop into SQL environments. Climate Corporation, <a href="http://gigaom.com/2012/05/02/how-climate-corp-is-pitting-big-data-against-mother-nature/">a massive Hadoop and Cascading user</a>, could use Lingual to let its crop-and-weather insurance customers access their data from the company&#8217;s Hadoop store.</p>
<p>Lingual and Pattern should help Concurrent finally make some money, too. Both of them, as well as the Cascading framework that underpins them, will always be open source, Wensel said, but the company plans to create &#8220;a suite of products that will make your life much better if &#8230; you standardize on Cascading.&#8221;</p>
<p>For example, the company has the ability to monitor jobs at the application level rather than the cluster level, meaning it can tell you the details of that job that&#8217;s locking up all the resources and whether you really want to kill it (it might be an important report for the CFO &#8230;). &#8220;We can do some really interesting things,&#8221; Wensel said.</p>
<p><em><strong>Disclosure</strong>: Concurrent is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, Giga Omni Media. Om Malik, the founder of Giga Omni Media, is also a venture partner at True.</em></p>
<p><em>This post was updated at 2:48pm PT to correct Chris Wensel&#8217;s title. He is CTO.</em></p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-908242p1.html">Shutterstock user PENGYOU91</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/05/22/concurrent-is-building-a-hadoop-assembly-line-in-open-source/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/05/shutterstock_98915513.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/05/shutterstock_98915513.jpg?w=150" medium="image">
			<media:title type="html">assembly line</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/arch-diagram.png" medium="image">
			<media:title type="html">arch-diagram</media:title>
		</media:content>
	</item>
		<item>
		<title>Database startup Drawn to Scale is closing down</title>
		<link>http://gigaom.com/2013/05/17/database-startup-drawn-to-scale-is-closing-down/</link>
		<comments>http://gigaom.com/2013/05/17/database-startup-drawn-to-scale-is-closing-down/#comments</comments>
		<pubDate>Fri, 17 May 2013 21:24:03 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[big data]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[Drawn to Scale]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Hbase]]></category>
		<category><![CDATA[SQL on Hadoop]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=646718</guid>
		<description><![CDATA[Database startup Drawn to Scale is closing down. The company's product, Spire, was one of the first SQL-on-Hadoop technologies.]]></description>
				<content:encoded><![CDATA[<p>Database startup Drawn to Scale, creator of the SQL-on-Hadoop technology called Spire, is closing down. Co-founder and CEO Bradford Stephens officially <a href="http://www.roadtofailure.com/?p=11">announced the closure in a blog post</a> on Friday.</p>
<p><a href="http://gigaom2.files.wordpress.com/2013/05/spirearchitecture-015-e1361407038325.png"><img  alt="spirearchitecture-015-e1361407038325" src="http://gigaom2.files.wordpress.com/2013/05/spirearchitecture-015-e1361407038325.png?w=300&#038;h=185" width="300" height="185" class="alignleft size-medium wp-image-646740" /></a>The company&#8217;s product, Spire, which provided full SQL support on top of the HBase NoSQL database, was one of the first products to <a href="http://gigaom.com/2012/07/24/how-one-startup-wants-to-inject-hadoop-into-your-sql/">try to blend Hadoop&#8217;s scalability with the robustness and familiarity of SQL</a>. That&#8217;s now <a href="http://gigaom.com/2013/03/05/the-hadoop-ecosystem-the-welcome-elephant-in-the-room-infographic/">an increasingly crowded space</a> (and one that has grown since the linked graphic was created). In March, Drawn to Scale also <a href="http://gigaom.com/2013/03/19/drawn-to-scale-wants-to-solve-your-mongodb-scalability-problems/">expanded its support to MongoDB</a>.</p>
<p>I wasn&#8217;t shocked when Stephens told me the news (questions about the four-year-old company&#8217;s financial health had been swirling for a while), but hearing of its financial woes was still a bit surprising. His account in the post pretty much echoes what I had heard from others:</p>
<blockquote id="quote-it-seemed-we-had-eve"><p>&#8220;It seemed we had everything going for us — paid customers such as American Express, Orange Telecom, Flurry, and 4 others. Our technology worked brilliantly, we had a big hiring pipeline, and we had great media presence against our competitors who raised 10-100x more cash.&#8221;</p></blockquote>
<p>He added:</p>
<blockquote id="quote-yet-five-days-before2"><p>&#8220;Yet five days before we signed term sheets for a big A round or sold the company, we started getting hit by a series of black swans — and we just didn’t have what we needed to recover. I’ll leave the public detail at that level, but I will say that paying employees’ health insurance out of your meager savings is a powerful incentive to change course.&#8221;</p></blockquote>
<p>Up to this point, the company <a href="http://gigaom.com/2012/03/08/drawn-to-scale-raises-money-to-make-sql-big-data-ready/">had raised $925,000</a> from RTP Ventures, IA Ventures and SK Ventures. There&#8217;s no word yet on what will become of the company&#8217;s intellectual property.</p>
<p>As Stephens &#8212; who&#8217;s now doing an entrepreneur-in-residence gig at Ping Identity and helping out other startups (including popular wardrobe app <a href="http://www.clothapp.com/">Cloth</a>) &#8212; succinctly put it during a phone discussion, &#8220;We just don&#8217;t have the horsepower to keep running the company.&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/05/17/database-startup-drawn-to-scale-is-closing-down/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/05/dtsdragon.png?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/05/dtsdragon.png?w=150" medium="image">
			<media:title type="html">dtsdragon</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/spirearchitecture-015-e1361407038325.png?w=300" medium="image">
			<media:title type="html">spirearchitecture-015-e1361407038325</media:title>
		</media:content>
	</item>
	</channel>
</rss>