<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>GigaOM &#187; data science</title>
	<atom:link href="http://gigaom.com/tag/data-science/feed/" rel="self" type="application/rss+xml" />
	<link>http://gigaom.com</link>
	<description></description>
	<lastBuildDate>Sun, 19 May 2013 01:17:15 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='gigaom.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://0.gravatar.com/blavatar/0db8f6557d022075dbbf010c54d46d93?s=96&#038;d=http%3A%2F%2Fs2.wp.com%2Fi%2Fbuttonw-com.png</url>
		<title>GigaOM &#187; data science</title>
		<link>http://gigaom.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://gigaom.com/osd.xml" title="GigaOM" />
	<atom:link rel='hub' href='http://gigaom.com/?pushpress=hub'/>
		<item>
		<title>Black box software: a problem for science that extends to big data</title>
		<link>http://gigaom.com/2013/05/16/black-box-software-a-problem-for-science-that-extends-to-big-data-2/</link>
		<comments>http://gigaom.com/2013/05/16/black-box-software-a-problem-for-science-that-extends-to-big-data-2/#comments</comments>
		<pubDate>Thu, 16 May 2013 18:00:44 +0000</pubDate>
		<dc:creator>Amanda Alvarez</dc:creator>
				<category><![CDATA[big data analytics]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[ecology]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[scientific computing]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=646192</guid>
		<description><![CDATA[Blind trust in black box, or click-and-run, software is a growing problem in science, and the concern extends to big data and high performance computing.]]></description>
				<content:encoded><![CDATA[<p>You probably don’t need to know how a calculator makes two plus two equal four, or how your favorite smartphone app works, but the way the background software is implemented can make a big difference to the output. Slight rounding errors or slow load times in these cases might be annoying, but when you scale up to big data modeling, for instance, you might want to take a closer look at the software running your calculations before you click go.</p>
<p>Blind trust in black box, or click-and-run, software is a growing problem in science, according to a <a href="http://www.sciencemag.org/lookup/doi/10.1126/science.1231535">commentary published Thursday in the journal <i>Science</i></a>, and the concern extends beyond formal research to other domains that use high performance computing.</p>
<p>The researchers who addressed the “troubling trend in scientific software use” were motivated by a growing unease that the abundance of powerful software is letting scientists derive answers without a thorough understanding of what the software is doing. Software snafus have been responsible for some high-profile <a href="http://www.ligo-wa.caltech.edu/~michael.landry/calibration/S5/getsignright.pdf">data misinterpretations and retractions</a>.</p>
<p>This wouldn’t normally cause a blip on the average citizen’s radar, but now a lot of these scientific conclusions have real-world implications, from climate modeling and weather forecasting to high volume financial trading. In any domain using big data, misplaced trust in the power of software can be problematic, particularly when the decision makers don’t know what the software they are using is doing, said lead author Lucas Joppa of Microsoft Research.</p>
<p>So what does ecology have to do with any of this? Joppa is an ecologist by training, and works on computational techniques in that field that may also have applications for big data more broadly. He and his colleagues surveyed scientists in a sub-field of ecology &#8212; species distribution modeling (SDM) &#8212; to find out how they choose software and how well they understand its inner workings.</p>
<p>“Lots of SDM techniques are only available as computational methods, but there is a lot of discourse going on in the literature about whether the methods themselves are correct,” said Joppa. Scientists use SDM to forecast where plants and animals will be in the future given current numbers, known habitats, and climate change. It’s a niche area of research, but the disquieting survey results should be noted in any domain where forecasting is done by plugging data into software.</p>
<p>Only 8 percent of the more than 400 scientists who responded had validated their modeling software against other methods. “The number speaks for itself,” said Joppa. “The real crux of the problem is the results from software being published in a peer-reviewed journal, versus the software itself having been peer-reviewed,” which is rare. Software packages, whether proprietary or not, are often black box systems that can’t be opened and inspected. Even if you can get under the proverbial hood, as with open source software, said Joppa, most people will still have no idea what they are looking at, or how to judge its quality.</p>
<p><img  alt="catch 22" src="http://gigaom2.files.wordpress.com/2013/05/91201888.jpg?w=347&#038;h=231" width="347" height="231" class="alignleft" /></p>
<p>To top it all off, having confidence in what your software is doing results in a massive computational catch-22: how do you know the software is giving you the right answer, if you can’t get the answer without running the software? The level of confusion over what algorithms are doing in the SDM field is illustrated by a debate over <a href="http://methodsblog.wordpress.com/2013/02/20/some-big-news-about-maxent/">which of two statistical techniques is superior</a>. It turns out, Joppa explained, that the two techniques were mathematically equivalent, but the ways they were implemented in software resulted in big predictive differences.</p>
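<p>The gap between "mathematically equivalent" and "computationally equivalent" is easy to reproduce on a small scale. The sketch below is not drawn from the SDM debate itself; it is an illustrative assumption that uses two textbook formulas for population variance, which are identical on paper but behave differently in floating-point arithmetic. The function names and sample data are hypothetical, and Python's standard <code>statistics.pvariance</code> serves as the independent cross-check.</p>

```python
import statistics


def variance_one_pass(xs):
    """Shortcut formula: mean of squares minus square of the mean."""
    n = len(xs)
    mean = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    return mean_sq - mean * mean


def variance_two_pass(xs):
    """Mean of squared deviations from the mean (same quantity on paper)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


# Hypothetical data: a small spread sitting on top of a large offset.
data = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]

print("one-pass :", variance_one_pass(data))
print("two-pass :", variance_two_pass(data))
print("reference:", statistics.pvariance(data))
```

<p>The two-pass version and the library reference agree on this data, while the one-pass shortcut loses the answer to rounding because it subtracts two nearly equal numbers of roughly 1e18. Comparing a result against a second, independent method is the kind of validation the survey found to be rare.</p>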
<p>This sort of mix-up isn’t surprising given the messy nature of software development (if you can even call it that) in research environments. Joppa lauded efforts like Software Carpentry that teach scientists basic software fundamentals for better programming, and said the days of getting a doctorate by merely pushing a button are over.</p>
<p>“Scientists themselves can learn a bare minimum of software engineering,” said Joppa. On the flip side, he said computer science students should have more exposure to scientific methods. “People with traditional software engineering training become uncomfortable with the way scientists want to work with software, where the design and specs are constantly changing. The way that scientific software is built is fundamentally different from consumer apps.”</p>
<p>Developers of scientific software, like MathWorks or SAS, may want to watch this space. If Joppa’s suggestions are implemented, journals may start requiring that even proprietary software be opened up for inspection and peer review. Nearly half of the surveyed ecologists report using the free statistical language R as their primary software, so maybe there is hope yet, both for open, inspectable code, and for computational science becoming more accessible while yielding trustworthy, high-impact results.</p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/05/16/black-box-software-a-problem-for-science-that-extends-to-big-data-2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/05/146799217.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/05/146799217.jpg?w=150" medium="image">
			<media:title type="html">black box</media:title>
		</media:content>

		<media:content url="http://2.gravatar.com/avatar/e37323b74d1f383817d82c9f906b7bcf?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">neuroamanda</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/91201888.jpg?w=708" medium="image">
			<media:title type="html">catch 22</media:title>
		</media:content>
	</item>
		<item>
		<title>This is why big data is the sweet spot for SaaS</title>
		<link>http://gigaom.com/2013/05/14/this-is-why-big-data-is-the-sweet-spot-for-saas/</link>
		<comments>http://gigaom.com/2013/05/14/this-is-why-big-data-is-the-sweet-spot-for-saas/#comments</comments>
		<pubDate>Wed, 15 May 2013 01:10:22 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[analytics]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[BloomReach]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[marketing]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[saas]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=645189</guid>
		<description><![CDATA[When it comes to using big data technology effectively, there's a lot to like about SaaS. When companies like BloomReach create and analyze massive web-wide data sets, they automate insights that almost no individual company could discover on its own.]]></description>
				<content:encoded><![CDATA[<p>People often ask me where the smart money is in big data. I often tell them that’s a foolish question, because I’m not an investor — but if I were, I’d look to software as a service.</p>
<p>There are two primary reasons why, the first of which is obvious: Companies are tired of managing applications and infrastructure, so something that optimizes a common task using techniques they don’t know on servers they don’t have to manage is probably compelling. It’s called cloud computing.</p>
<p>The other reason is that <a href="http://gigaom.com/2013/04/29/google-research-director-and-ai-expert-peter-norvig-elected-into-aaas/">the <em>big </em>part of big data really is important</a> if you want to get a really clear picture of what’s happening in any given space. While no single end-user company can (or likely would) address search-engine optimization, for example, by building a massive store comprised of data from hundreds or thousands of companies as well as the entire web, a cloud service dedicated to that specific task can.</p>
<p>From <a href="http://gigaom.com/2012/11/28/log-data-startup-sumo-logic-raises-30m/">web security</a> to <a href="http://gigaom.com/2012/06/21/how-collective-intelligence-is-reshaping-systems-management/">systems management</a>, we’re already seeing how centralized data stores provide SaaS companies a broad view into what’s happening that can then be filtered down to serve each individual customer’s specific situation. <a href="http://www.bloomreach.com/">BloomReach</a>, a SaaS startup that helps companies optimize web-page content, is another good example of this principle in action.</p>
<h2 id="how-do-you-say-cotton-maxi-dre">How do <em>you</em> say, “cotton maxi dress”</h2>
<p>Ideally, BloomReach Head of Marketing Joelle Kaufman told me, the company wants to help customers ensure they get found in web searches by making sure they’re not invisible (buried deep down), irrelevant (not saying anything meaningful on their sites) or incompatible (not speaking their consumers’ language). On Tuesday, the company <a href="http://www.bloomreach.com/buzz/media-center-pr/continuous-quality-management/">announced a new feature called Continuous Quality Management</a>, which lets customers continuously monitor their pages to ensure they’re still featuring the right products and the right terminology. It’s the latest addition to a seemingly useful service that’s built atop a big data foundation few — if any — of its customers would ever attempt to build themselves.</p>
<p>BloomReach is able to help companies optimize their sites because it’s constantly crawling the web in order to figure out how everyone else is describing their content, laying out their pages and structuring their links. Running on the Amazon Web Services cloud, BloomReach runs more than 1,000 Hadoop jobs a day that process about 5 terabytes of data and a billion data points about users’ site behavior. With the latter, co-founder and CTO Ashutosh Garg explained, the company is trying to figure out who’s visiting sites, what they’re doing, how long they’re spending there and how they’re related in terms of behavior.</p>
<p>“You need to have the right amount of data and from the right places before we can do anything with it,” he said. “… It’s a massive machine learning problem.”</p>
<p><a href="http://gigaom2.files.wordpress.com/2013/05/br-stack.png"><img alt="BR stack" src="http://gigaom2.files.wordpress.com/2013/05/br-stack.png?w=708&#038;h=531" width="708" height="531" class="aligncenter size-large wp-image-645359"></a></p>
<p>When you consider all the possible ways something could be described or formatted, the scale of the problem becomes more evident. Simple semantic analysis like associating “desk” and “table” is easy, Garg explained, but what if someone wants a lightweight camera and you only have its exact weight listed without any indication of how it compares to other options? What if people searching for “smartphones” really mean “Android phones,” but you’re top-loading your results with BlackBerry phones and Windows phones?</p>
<p>Another of Garg’s hypotheticals has to do with consumers’ presentation biases. If, for example, they’re looking at a lot of websites that look the same or focus on the same things (e.g., megapixels for digital cameras), they’ll expect to see the same things from every site.</p>
<h2 id="10-nonillion-possibilities-cho">10 nonillion possibilities: Choose 1.</h2>
<p>From a sheer numbers perspective, things get even hairier when you’re trying to determine the relationship between any two pages in order to figure out the best path for links to take. Garg said this is what computer scientists call an <a href="http://en.wikipedia.org/wiki/NP-complete">NP-complete problem</a>, which means no known algorithm solves it efficiently: with the known exact methods, processing time can grow exponentially with the amount of content you’re analyzing. So, for example, analyzing 40 pages doesn’t take 10 times as long as analyzing 4 pages, but more like 100 times longer, and the gap keeps widening as pages are added.</p>
<p>Actually, BloomReach CEO Raj De Datta gave me another example of this problem <a href="http://gigaom.com/2012/02/22/bloomreach-wants-to-save-your-site-with-big-data/">when we spoke in early 2012</a>. Here’s how I described it then:</p>
<blockquote id="quote-if-a-company-wants-t"><p>[I]f a company wants to display just 1,000 products across 100 pages, De Datta explained, there are 10-to-the-28th-power (10 octillion) possibilities for how to do that. When it comes time to describe those products, there are 10-to-the-30th-power (10 nonillion) possibilities.</p></blockquote>
<p>If a website has a million pages, Garg said, “it will take you longer than the life of the universe to solve that problem.”</p>
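<p>To get a feel for why brute force runs out of road so quickly, here is a back-of-the-envelope sketch. It does not reproduce BloomReach's actual problem or De Datta's figures; as a stand-in, it assumes the task is to check every possible ordering of a set of items, and it assumes a hypothetical machine that checks one billion orderings per second. Both the function name and the rate are illustrative.</p>

```python
import math

# Roughly 13.8 billion years, expressed in seconds.
AGE_OF_UNIVERSE_SECONDS = 4.35e17


def brute_force_seconds(n_items, checks_per_second=1e9):
    """Seconds needed to enumerate all n! orderings of n_items.

    checks_per_second is an assumed, illustrative rate.
    """
    return math.factorial(n_items) / checks_per_second


for n in (5, 10, 20, 30):
    secs = brute_force_seconds(n)
    note = "longer than the age of the universe" if secs > AGE_OF_UNIVERSE_SECONDS else ""
    print(f"{n:>2} items: {secs:.3g} seconds {note}")
```

<p>Under these assumptions, ten items finish in a fraction of a second, while thirty already exceed the age of the universe, long before a site gets anywhere near a million pages.</p>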
<p>Where this type of problem arises, BloomReach turns to <a href="http://en.wikipedia.org/wiki/Monte_Carlo_method">Monte Carlo simulations</a>, a favorite technique of physicists and Wall Street quants. The method involves running lots of simulations over large data sets in order to determine approximate results in a reasonable time frame. (And if all this isn’t enough computer science and cloud infrastructure for you, I suggest attending our <a href="http://event.gigaom.com/structure/?utm_source=data&amp;utm_medium=editorial&amp;utm_campaign=intext&amp;utm_term=645189+this-is-why-big-data-is-the-sweet-spot-for-saas&amp;utm_content=dharrisstructure">Structure conference</a> in June, which features a who’s who list of speakers, including Google’s Jeff Dean, Facebook’s Jay Parikh and Netflix’s Adrian Cockcroft.)</p>
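<p>The Monte Carlo idea can be sketched on a toy problem. Everything here is an assumption for illustration rather than BloomReach's method: a made-up "relatedness" score between pairs of pages, a hypothetical objective (<code>path_score</code>) that sums relatedness along an ordering, and a sampler that scores random orderings instead of all of them. The problem is kept small enough that the exact answer can still be found by brute force for comparison.</p>

```python
import itertools
import random


def path_score(order, related):
    """Sum of relatedness between consecutive pages in an ordering."""
    return sum(related[a][b] for a, b in zip(order, order[1:]))


def exact_best(n, related):
    """Brute force: score every one of the n! orderings."""
    return max(itertools.permutations(range(n)),
               key=lambda order: path_score(order, related))


def monte_carlo_best(n, related, samples, rng):
    """Score only `samples` random orderings and keep the best one seen."""
    best, best_score = None, float("-inf")
    pages = list(range(n))
    for _ in range(samples):
        rng.shuffle(pages)
        score = path_score(pages, related)
        if score > best_score:
            best, best_score = tuple(pages), score
    return best, best_score


rng = random.Random(42)  # fixed seed so runs are repeatable
n = 7
# Hypothetical relatedness scores between every pair of pages.
related = [[rng.random() for _ in range(n)] for _ in range(n)]

exact = exact_best(n, related)
approx, approx_score = monte_carlo_best(n, related, samples=500, rng=rng)
print("exact best score  :", round(path_score(exact, related), 3))
print("sampled best score:", round(approx_score, 3))
```

<p>The sampled answer can never beat the exact one, but it examines 500 orderings rather than 5,040. The trade is accuracy for time, and it becomes the only practical option once the exact search is out of reach.</p>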
<h2 id="different-queries-different-pa">Different queries, different pages</h2>
<p>Things get even trickier when you’re trying to change the content of web pages in real time as people are searching for things. This isn’t the best method for organic search, where pages need to stay pretty consistent with the indexed versions, but it can be ideal in situations such as paid search and mobile. There are millions of ways to segment buyers, Garg explained, and how accurately you assess their intent and display your content can make all the difference. Whether someone is a new or repeat visitor often matters, as does whether someone is price-conscious (e.g., the query included “cheap”) or perhaps searching for a particular brand.</p>
<div id="attachment_645358" class="wp-caption aligncenter" style="width: 718px"><a href="http://gigaom2.files.wordpress.com/2013/05/llbean.png"><img alt="Source: BloomReach" src="http://gigaom2.files.wordpress.com/2013/05/llbean.png?w=708&#038;h=531" width="708" height="531" class="size-large wp-image-645358"></a><p class="wp-caption-text">Source: BloomReach</p></div>
<p>Around the holidays, the company actually realized something interesting: The bounce rate on queries for things like “gifts for dad” or “gifts for co-workers” was pretty high, but so was the conversion rate. The time to conversion was relatively fast, as well. It turns out, Garg explained, that people don’t like to overthink certain gifts too much, so if something is presented in a visually appealing manner and is within their price range, they’ll buy.</p>
<p>But creating these types of models involves more than meets the eye. For all the talk about machine learning (and machines do a majority of the work for BloomReach), people also play a critical role. A person might know better than a machine whether something was likely purchased as a gift, Garg explained, or they might spot the offensive content on the T-shirt the machine decided was ideal.</p>
<p>“Humans are really good at creativity, thinking through stuff,” he said.</p>
<p>Smart humans are also good at knowing when they’re overmatched, which is why SaaS is so valuable in the big data era. CMOs could try doing what BloomReach or <a href="http://gigaom.com/2012/04/24/datapop-scores-7m-for-custom-built-ads/">similar companies such as DataPop</a> are doing, or they could pay someone to do it much better. Guess which route the smart ones will take.</p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-54269p1.html">Shutterstock user Andrea Danti</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/05/14/this-is-why-big-data-is-the-sweet-spot-for-saas/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/05/shutterstock_119782672.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/05/shutterstock_119782672.jpg?w=150" medium="image">
			<media:title type="html">collective intelligence</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/br-stack.png?w=708" medium="image">
			<media:title type="html">BR stack</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/llbean.png?w=708" medium="image">
			<media:title type="html">Source: BloomReach</media:title>
		</media:content>
	</item>
		<item>
		<title>We&#8217;re witnessing the rise of the graph in big data</title>
		<link>http://gigaom.com/2013/05/14/were-witnessing-the-rise-of-the-graph-in-big-data/</link>
		<comments>http://gigaom.com/2013/05/14/were-witnessing-the-rise-of-the-graph-in-big-data/#comments</comments>
		<pubDate>Tue, 14 May 2013 14:33:33 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[big data]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[graph analysis]]></category>
		<category><![CDATA[graph database]]></category>
		<category><![CDATA[GraphLab]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[open source]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=645059</guid>
		<description><![CDATA[Graph databases and graph-processing applications have been popping up all over the place lately, and now they're starting to go commercial. On Tuesday, popular open source project GraphLab joined the ranks of graph startups.]]></description>
				<content:encoded><![CDATA[<p>GraphLab, a popular <a href="http://graphlab.org/">open source project</a> dedicated to graph analysis and machine learning, is trying to capitalize on the excitement around graphs by spinning off a commercial entity, <a href="http://graphlab.com/">GraphLab Inc.</a> GraphLab creator &#8212; and University of Washington machine learning professor &#8212; Carlos Guestrin will lead the new Seattle-based company, which has raised $6.75 million from Madrona Venture Group and NEA.</p>
<p>Graph analysis is among the hottest techniques around for making sense of large datasets, primarily by determining how tightly different data points are related or how similar they are. The term &#8220;graph&#8221; came into the broader lexicon along with social networks, which built social graphs to <a href="http://gigaom.com/2013/03/14/facebook-tweaks-its-algorithms-to-improve-graph-search-comment-search-coming/">assess the relationships among their millions of users</a>, but the technique has much broader uses.</p>
<div id="attachment_645089" class="wp-caption aligncenter" style="width: 677px"><a href="http://gigaom2.files.wordpress.com/2013/05/lnkdmap-1.jpg"><img  alt="My LinkedIn social graph" src="http://gigaom2.files.wordpress.com/2013/05/lnkdmap-1.jpg?w=708"   class="size-full wp-image-645089" /></a><p class="wp-caption-text">My LinkedIn social graph</p></div>
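<p>As a rough illustration of what "how similar are two data points" can mean in a graph, the sketch below computes Jaccard similarity between the neighbor sets of two people in a tiny, made-up social graph. This is one common measure among many, not a description of GraphLab's algorithms; the names and connections are hypothetical.</p>

```python
def jaccard(graph, a, b):
    """Share of neighbors two nodes have in common (0 = none, 1 = identical)."""
    neighbors_a, neighbors_b = graph[a], graph[b]
    union = neighbors_a | neighbors_b
    return len(neighbors_a & neighbors_b) / len(union) if union else 0.0


# Hypothetical undirected social graph: person -> set of connections.
graph = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"ann", "cat"},
    "eve": set(),
}

print(jaccard(graph, "bob", "dan"))  # both know exactly ann and cat
print(jaccard(graph, "ann", "cat"))  # share bob and dan out of four people
print(jaccard(graph, "ann", "eve"))  # eve has no connections
```

<p>With five people the comparison is trivial. With millions of users and billions of connections, computing this for every pair is where dedicated graph systems earn their keep.</p>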
<p>Guestrin said GraphLab&#8217;s algorithms are used in a lot of recommender systems, but he also cites fraud detection in banking networks and intrusion detection in computer networks as potential applications. We&#8217;ve covered graphs as the analytical model of choice for everything <a href="http://gigaom.com/2013/04/22/how-hbase-converted-myspaces-mysql-champion-and-is-driving-hadoop-mainstream/">from content recommendation</a> to <a href="http://gigaom.com/2013/01/22/biotech-startup-syapse-wants-to-be-salesforce-com-for-our-genomes/">tracking lab work in genomics</a>. Really, though &#8212; especially when combined with machine learning &#8212; graph analysis <a href="http://gigaom.com/2013/01/16/has-ayasdi-turned-machine-learning-into-a-magic-bullet/">can be applied to anything</a> where there&#8217;s too much data for a person to possibly analyze the relationships between every point.</p>
<div id="attachment_601469" class="wp-caption aligncenter" style="width: 718px"><a href="http://gigaom2.files.wordpress.com/2013/01/ayasdi-product-image-2-e1358295341371.jpg"><img  alt="One of Ayasdi's graph-like data maps" src="http://gigaom2.files.wordpress.com/2013/01/ayasdi-product-image-2-e1358295341371.jpg?w=708&#038;h=472" width="708" height="472" class="size-large wp-image-601469" /></a><p class="wp-caption-text">One of Ayasdi&#8217;s graph-like data maps</p></div>
<p>Google also famously uses <a href="http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html">a graph-processing system called Pregel</a> as part of PageRank. Although a number of graph databases and other projects have popped up in the past few years, Guestrin said GraphLab is actually a contemporary of Pregel. He and some colleagues at Carnegie Mellon built a small system for their lab about five years ago, then released it into the open-source world with few expectations that it would catch on. Now, he added, Pandora and WalmartLabs are among the project&#8217;s user base.</p>
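<p>For readers who have not seen PageRank spelled out, here is a minimal sketch of the textbook power-iteration version on a four-page link graph. It is not Pregel and not Google's production implementation; the damping factor of 0.85 is the commonly cited default, and the link graph and function name are made up for illustration.</p>

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank.

    links maps each page to the list of pages it links to.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, outgoing in links.items():
            if outgoing:
                share = damping * rank[node] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

<p>Page "c" comes out on top because three of the four pages link to it, while "d", which nothing links to, keeps only the baseline share. Each round touches every link once, which is why systems in the Pregel family are designed to spread that work across many machines.</p>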
<p>Among those other projects are graph databases such as <a href="http://giraph.apache.org/">Giraph</a> (an open source, Hadoop-based Pregel clone developed at Facebook) and <a href="http://www.neo4j.org/">Neo4j</a> (which also has a commercial arm, <a href="http://gigaom.com/2012/11/02/graph-startup-neo-raises-11m-as-specialized-databases-take-hold/">called Neo Technology</a>), as well as <a href="http://engineering.twitter.com/2012/03/cassovary-big-graph-processing-library.html">Twitter&#8217;s Cassovary</a> and fellow University of Washington project <a href="http://www.cs.washington.edu/node/4217/">Grappa</a>. Guestrin said GraphLab can work with most of them, particularly if they&#8217;re not designed to do machine learning at scale like GraphLab is. Some efforts, he noted, are focused on simply storing data in graph form (e.g., databases) or in providing simple graph analysis.</p>
<p>As for when we&#8217;ll actually see the results of the effort to commercialize GraphLab, Guestrin said it will be a while. Right now, he&#8217;s focused on the next open source release of GraphLab in July. However, the company will begin engaging with commercial users over the next several months to determine what types of features they would expect in commercial graph-analysis software.</p>
<p>The bigger question to come out of all this graph activity, though, is how big a market we&#8217;ll ultimately see for graph-analysis or any other specific technique. As companies get more comfortable with big data from a technical standpoint, they&#8217;re getting more interested in the different types of analysis it allows for too. This is evidenced by the <a href="http://gigaom.com/2013/03/07/5-reasons-why-the-future-of-hadoop-is-real-time-relatively-speaking/">quest to make Hadoop support myriad processing frameworks</a> aside from MapReduce.</p>
<p>We already have a handful of commercial graph products on the market &#8212; including an industrial grade one called <a href="http://www.yarcdata.com/">YarcData</a> from supercomputer maker Cray &#8212; but how many will there eventually be? And if graph analysis is all the rage right now, what comes next?</p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/05/14/were-witnessing-the-rise-of-the-graph-in-big-data/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/05/graphics2-3_final_cartoon.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/05/graphics2-3_final_cartoon.jpg?w=150" medium="image">
			<media:title type="html">graphics2-3_final_cartoon</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/lnkdmap-1.jpg" medium="image">
			<media:title type="html">My LinkedIn social graph</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/01/ayasdi-product-image-2-e1358295341371.jpg?w=708" medium="image">
			<media:title type="html">One of Ayasdi&#039;s graph-like data maps</media:title>
		</media:content>
	</item>
		<item>
		<title>Why 3 celebrity data scientists are willing to work for free &#8212; for you</title>
		<link>http://gigaom.com/2013/05/08/why-3-celebrity-data-scientists-are-willing-to-work-for-free-for-you/</link>
		<comments>http://gigaom.com/2013/05/08/why-3-celebrity-data-scientists-are-willing-to-work-for-free-for-you/#comments</comments>
		<pubDate>Wed, 08 May 2013 16:58:30 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[big data]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Hilary Mason]]></category>
		<category><![CDATA[Mortar Data]]></category>
		<category><![CDATA[recommendation engines]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=643353</guid>
		<description><![CDATA[Hadoop startup Mortar Data is offering to build recommendation systems for 10 companies, with help from Hilary Mason, Drew Conway and Max Shron. It's part of a bigger plan to democratize the science behind online recommendations.]]></description>
				<content:encoded><![CDATA[<p>Hadoop-in-the-cloud startup Mortar Data is on a mission to bring recommendation engines to the masses, and it has recruited three well-known data scientists to aid its cause. On Wednesday, the company will start accepting applications <a href="http://mortardata.com/">on its website</a> from companies that would like to have Mortar Data &#8212; as well as Bit.ly&#8217;s <a href="http://www.hilarymason.com/">Hilary Mason</a>, IA Ventures Scientist-in-Residence <a href="http://drewconway.com/">Drew Conway</a> and freelancer (and former OKCupid data scientist) <a href="http://shron.net/about">Max Shron</a> &#8212; build a custom recommendation system for them.</p>
<p>The way it works, said Mortar Co-founder and CEO K Young, is that his company will choose eight companies (in addition to the two it has been working with already) to implement custom systems based on their specific needs and businesses. Mason, Conway and Shron will split their time among the 10 total companies, but will be much more than advisers &#8212; they&#8217;ll actually dig into the data and work hands-on to ensure the right techniques and algorithms are applied in the right places.</p>
<p>The applicant companies will keep any custom code, but the ultimate goal from Mortar&#8217;s perspective is to learn some best practices and create reusable building blocks that will let anyone create recommendation engines without pre-existing data science knowledge. Recommendation engines <a href="http://gigaom.com/2013/01/29/you-might-also-like-to-know-how-online-recommendations-work/">are commonplace on large web sites</a> (Netflix, Spotify, iTunes, Google, Amazon, <a href="http://gigaom.com/2013/03/03/how-and-why-linkedin-is-becoming-an-engineering-powerhouse/">LinkedIn</a>, Eventbrite and the list goes on) but smaller companies can sometimes struggle to do them, or to do them well. Young hopes Mortar can establish an open source reference architecture of sorts that makes it easy to implement everything from building data pipelines to the actual algorithms that power recommendations.</p>
<p>&#8220;They&#8217;re really common and they&#8217;re really useful, but they&#8217;re really hard,&#8221; he said. &#8220;That&#8217;s why [a reference implementation] hasn&#8217;t been done before.&#8221;</p>
<div id="attachment_643436" class="wp-caption aligncenter" style="width: 718px"><a href="http://gigaom2.files.wordpress.com/2013/05/gernres-support-1.jpg"><img  alt="They can get pretty complex, as evidenced by this Netflix example." src="http://gigaom2.files.wordpress.com/2013/05/gernres-support-1.jpg?w=708&#038;h=358" width="708" height="358" class="size-large wp-image-643436" /></a><p class="wp-caption-text">They can get pretty complex, as evidenced by this Netflix example.</p></div>
<p>Presently, Young explained, anyone wanting to build a recommendation system probably knows some of the algorithms to begin with and then gets to work researching how to implement them with specific processing frameworks (e.g., MapReduce) and on their specific data. Alternatively, they might have to hire a consultant to help them build the recommendation engine. Either way, he noted, they&#8217;re probably not open sourcing it at the end because it&#8217;s presumed too valuable a competitive edge.</p>
<p>Mortar Data&#8217;s recommendation framework will be based on Pig, Python and Java, <a href="http://gigaom.com/2012/11/28/mortar-data-wants-to-become-a-hadoop-developers-best-friend/">just like the company&#8217;s flagship platform</a> for creating Hadoop jobs. Those languages will make the implementation more accessible and customizable by more people, Young said.</p>
<p>Really, he added, any web site or service that has multiple customers and deals with multiple entities &#8212; be they restaurants, songs, dating profiles, artisan necklaces, what have you &#8212; should have some sort of recommendation engine to help provide a more-intelligent customer experience. &#8220;It should become so ubiquitous that any service you go to knows enough about you to put forward the things you actually want to see,&#8221; Young said.</p>
<p>There is, however, one catch to Mortar&#8217;s plans as they stand: Because the service is hosted on Amazon Web Services, anyone interested in having Mason, Conway, Shron and Mortar work on their systems must have their data in AWS or be able to move it there. The initial reference implementation will likely be AWS-centric, too, but Young hopes contributors will use it and share methods for running it atop other platforms.</p>
<p><em>Feature image of Hilary Mason at Structure: Data 2011 courtesy of Pinar Ozger (www.pinarozger.com).</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/05/08/why-3-celebrity-data-scientists-are-willing-to-work-for-free-for-you/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2012/05/hilarymason.jpeg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2012/05/hilarymason.jpeg?w=150" medium="image">
			<media:title type="html">hilarymason</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/gernres-support-1.jpg?w=708" medium="image">
			<media:title type="html">They can get pretty complex, as evidenced by this Netflix example.</media:title>
		</media:content>
	</item>
		<item>
		<title>Four ways data scientists are using digital art to humanize data</title>
		<link>http://gigaom.com/2013/05/01/four-ways-data-scientists-are-using-digital-art-to-humanize-data/</link>
		<comments>http://gigaom.com/2013/05/01/four-ways-data-scientists-are-using-digital-art-to-humanize-data/#comments</comments>
		<pubDate>Wed, 01 May 2013 16:10:06 +0000</pubDate>
		<dc:creator>Amanda Alvarez</dc:creator>
				<category><![CDATA[big data]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[data visualization]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=641094</guid>
		<description><![CDATA[The growing pains of big data were apparent at the Data 2.0 Summit on Tuesday in San Francisco. Here is a selection of visualization tools that came up at the meeting.]]></description>
				<content:encoded><![CDATA[<p>The growing pains of big data were apparent at the <a href="http://data2summit.com/">Data 2.0 Summit</a> on Tuesday in San Francisco.</p>
<p>During one panel, participants debated the assertion that data science is dead. Alongside the habitual tension between the requirements of end users, both businesses and consumers, and the “elitist” ideas of data scientists and engineers, other themes included making data more accessible, changing behaviors and encouraging better decision-making with data. Everyone from sales and marketing people to fitness enthusiasts, it turns out, can be motivated by pretty pictures.</p>
<p>As IBM’s Alan Keahey put it during a panel, “there is a hunger for friendly data,” and visualization can help to humanize those threatening terabytes. Here is a selection of new, and new-to-us, visualization tools that came up at the meeting.</p>
<p><strong>Bringing climate change home: <a href="http://databasin.org/">Databasin.org</a></strong></p>
<p style="text-align:left;">A mapping and analytics platform from the Conservation Biology Institute that has 10,000 datasets on everything you need to understand how extreme weather will impact natural resources, renewable energy, and endangered species. Here is <a href="http://databasin.org/datasets/638a938ba0f84e238b342337f7616ecd">one projection</a> of maximum temperatures in 2080.</p>
<p style="text-align:left;"><img  alt="world-map-climate-change-databasin" src="http://gigaom2.files.wordpress.com/2013/04/world-map-climate-change-databasin.png?w=367&#038;h=258" width="367" height="258" class="aligncenter  wp-image-641097" /></p>
<p><strong><a href="http://www.sparkvis.com/">Sparkvis</a> by Chloe Fan</strong></p>
<p>This app is for the quantified self junkie who loves to interpret their burned calories as abstract art. The research behind the colorful display of Fitbit (see disclosure) data is explained <a href="http://www.chloefan.com/static/files/2012-UbiComp-Fan.pdf">here</a>. <i>Image via <a href="http://quantifiedself.com/2012/05/spark-visualizing-physical-activity-using-abstract-ambient-art/">QuantifiedSelf.com</a></i></p>
<p><img  alt="sparkvis-fitbit-visualization" src="http://gigaom2.files.wordpress.com/2013/04/sparkvis-fitbit-visualization.png?w=421&#038;h=242" width="421" height="242" class="aligncenter size-medium wp-image-641098" /></p>
<p><strong><a href="http://disqus.com/gravity/">Disqus Gravity</a></strong></p>
<p>The commenting platform’s diverse content is brought together in an interactive and live visualization. Pulling from about 500 sites that use Disqus, Gravity brings together the “small” data of individual comments within the context of 11 content categories. Another visualization, <a href="http://map.labs.disqus.com/">Orbital</a>, shows realtime comments geolocated on a spinning globe.</p>
<p><img  alt="disqus-gravity-visualization" src="http://gigaom2.files.wordpress.com/2013/04/disqus-gravity-visualization.png?w=396&#038;h=231" width="396" height="231" class="aligncenter size-medium wp-image-641103" /></p>
<p><strong><a href="http://www-958.ibm.com/software/analytics/manyeyes/">IBM Many Eyes</a></strong></p>
<p>Originally conceived by visualization guru Martin Wattenberg and colleagues in 2007, Many Eyes lets you plug in any dataset and generate nifty figures. Here, for example, is the <a href="http://www-958.ibm.com/software/analytics/manyeyes/visualizations/distribution-of-us-foreign-aid-ove-3">distribution of U.S. foreign aid</a> over a 60-year period.</p>
<p><img  alt="many-eyes-visualization-foreign-aid" src="http://gigaom2.files.wordpress.com/2013/04/many-eyes-visualization-foreign-aid.png?w=444&#038;h=297" width="444" height="297" class="aligncenter size-medium wp-image-641104" /></p>
<p><em>Disclosure: Fitbit is backed by True Ventures, a venture capital firm that is an investor in the parent company of GigaOM. Om Malik, founder of GigaOM, is also a venture partner at True.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/05/01/four-ways-data-scientists-are-using-digital-art-to-humanize-data/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/04/visualization-examples.png?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/04/visualization-examples.png?w=150" medium="image">
			<media:title type="html">visualization-examples</media:title>
		</media:content>

		<media:content url="http://2.gravatar.com/avatar/e37323b74d1f383817d82c9f906b7bcf?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">neuroamanda</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/04/world-map-climate-change-databasin.png?w=300" medium="image">
			<media:title type="html">world-map-climate-change-databasin</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/04/sparkvis-fitbit-visualization.png?w=300" medium="image">
			<media:title type="html">sparkvis-fitbit-visualization</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/04/disqus-gravity-visualization.png?w=300" medium="image">
			<media:title type="html">disqus-gravity-visualization</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/04/many-eyes-visualization-foreign-aid.png?w=300" medium="image">
			<media:title type="html">many-eyes-visualization-foreign-aid</media:title>
		</media:content>
	</item>
		<item>
		<title>USVP, UPS and Scott McNealy pump $18M into machine-learning startup Skytree</title>
		<link>http://gigaom.com/2013/04/30/usvp-ups-and-scott-mcnealy-pump-18m-into-machine-learning-startup-skytree/</link>
		<comments>http://gigaom.com/2013/04/30/usvp-ups-and-scott-mcnealy-pump-18m-into-machine-learning-startup-skytree/#comments</comments>
		<pubDate>Tue, 30 Apr 2013 16:30:15 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[analytics]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[Scott McNealy]]></category>
		<category><![CDATA[Skytree]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=640909</guid>
		<description><![CDATA[Machine learning startup Skytree has raised $18 million for its software that makes short work of pattern recognition across massive datasets.]]></description>
				<content:encoded><![CDATA[<p>Machine learning is everywhere these days as companies and organizations find themselves trying to make sense of data sets far too large and complex for the human brain alone. On Tuesday, <a href="http://www.skytree.net/">Skytree</a> cashed in on the hype with an $18 million Series A round led by U.S. Venture Partners along with delivery giant UPS and Sun Microsystems co-founder and former CEO Scott McNealy. Skytree <a href="http://gigaom.com/2012/02/23/skytree-intros-machine-learning-for-the-masses/">launched in February 2012</a> with $1.5 million in seed funding.</p>
<p>Machine learning is such a hot topic right now because data volumes are becoming so large and complex that humans alone can&#8217;t query their way through them fast enough or intelligently enough to spot latent patterns among the mess of data. It&#8217;s the algorithmic engine that <a href="http://gigaom.com/2012/06/25/how-google-is-teaching-computers-to-see/">powers a bunch of Google services</a> and <a href="http://gigaom.com/2012/06/14/netflix-analyzes-a-lot-of-data-about-your-viewing-habits/">your Netflix recommendations</a>, as well as <a href="http://gigaom.com/2012/12/05/prismatic-gets-15m-to-build-a-recommendation-engine-for-the-world/">web content-curation service Prismatic</a> and <a href="http://gigaom.com/2012/11/19/where-machine-learning-and-human-artistry-meet-your-wallet/">alternative-underwriting platform ZestFinance</a>. As we <a href="http://gigaom.com/2013/03/22/5-ways-big-data-is-going-to-blow-your-mind-and-change-your-world/">covered in some detail at this year&#8217;s Structure: Data conference</a>, machine learning is particularly powerful when its ability to correlate tens of thousands of variables is paired with human judgment about what really matters.</p>
<div id="attachment_640923" class="wp-caption alignleft" style="width: 310px"><a href="http://gigaom2.files.wordpress.com/2013/04/ml-2012.jpg"><img  alt="Skytree co-founder Alexander Gray (second from left) at Structure: Data 2012. (c) Pinar Ozger" src="http://gigaom2.files.wordpress.com/2013/04/ml-2012.jpg?w=300&#038;h=200" width="300" height="200" class="size-medium wp-image-640923" /></a><p class="wp-caption-text">Skytree co-founder Alexander Gray (second from left) at Structure: Data 2012. (c) Pinar Ozger</p></div>
<p>Skytree, for its part, sells a product called Skytree Server that lets users run a wide variety of machine learning algorithms across whatever data they have. It might be an oversimplification, but Skytree is essentially a souped-up version of statistical-analysis packages like SPSS or SAS that&#8217;s designed to run fast &#8212; and, more importantly, without sampling &#8212; across a scale-out server architecture. In March, the company also rolled out the beta version of <a href="http://www.skytree.net/adviser-beta/">a new product called Adviser</a> that can run on a laptop and walks more-novice users through the analysis of their data, including what methods were used and why, and whether the findings are statistically significant.</p>
<p>I suspect we&#8217;re just seeing the opening salvo in what will be a rush to fund machine learning startups over the next couple of years. Skytree is among a number of increasingly promising startups in the space, including (but certainly not limited to) <a href="http://gigaom.com/2013/01/16/has-ayasdi-turned-machine-learning-into-a-magic-bullet/">Ayasdi</a> and <a href="http://gigaom.com/2013/03/20/data-science-is-not-enough-we-need-data-intelligence-too/">Quid</a>. As more individuals see the promise of machine learning and get skilled in applying it to their particular problems and datasets &#8212; as UPS apparently has &#8212; it could become one of the go-to analytic methods in the big data era.</p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/04/30/usvp-ups-and-scott-mcnealy-pump-18m-into-machine-learning-startup-skytree/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2012/05/machine-learning.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2012/05/machine-learning.jpg?w=150" medium="image">
			<media:title type="html">machine learning</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/04/ml-2012.jpg?w=300" medium="image">
			<media:title type="html">Skytree co-founder Alexander Gray (second from left) at Structure: Data 2012. (c) Pinar Ozger</media:title>
		</media:content>
	</item>
		<item>
		<title>How data is changing the car game for Ford</title>
		<link>http://gigaom.com/2013/04/26/how-data-is-changing-the-car-game-for-ford/</link>
		<comments>http://gigaom.com/2013/04/26/how-data-is-changing-the-car-game-for-ford/#comments</comments>
		<pubDate>Fri, 26 Apr 2013 21:49:31 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[analytics]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[Ford Motor]]></category>
		<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=633959</guid>
		<description><![CDATA[The advent of big data is affecting Ford Motor Co. in some significant ways, from how it analyzes its supply chain to the features it puts into its cars.]]></description>
				<content:encoded><![CDATA[<p>When most people think about how cars are built, they probably think about assembly lines, manufacturing robots, and batteries of safety and performance simulations on massive supercomputers. But at Ford, big data is having a significant impact on the parts and features of those cars before they&#8217;re ever part of a design file. From the cars in stock at the dealership to the performance of the engine in a rainstorm, big data is infiltrating nearly every aspect of the Ford experience and the company itself.</p>
<p>Obviously, data is nothing new to the automotive industry &#8212; companies have been trying to optimize supply chains and analyze sales numbers for decades &#8212; but the advent of big data, as well as related technologies such as sensors and smartphones, is changing how companies are thinking about data. Ford isn&#8217;t alone in its quest to take advantage of these new technologies, either. For example, General Motors <a href="http://www.accenture.com/SiteCollectionDocuments/PDF/Accenture-Outlook-How-Big-Data-can-fuel-bigger-growth-Strategy.pdf">collects data from its OnStar system</a> to help lower drivers&#8217; insurance premiums, and also collects lots of data on its Chevrolet Volt electric car that it <a href="http://gigaom.com/2013/01/20/chevy-volt-to-my-smartphone-you-complete-me/">feeds to drivers via a mobile app</a>. We recently noted how a luxury automobile company <a href="http://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/">used big data software from Aster Data Systems</a> to determine the relationships between malfunctions so it could provide a more thorough and beneficial service-department experience.</p>
<p>But in an industry notoriously unwilling to talk about information technology, Ford&#8217;s experiences might shed a lot of light on what other companies are thinking and doing, as well.</p>
<h2 id="building-a-better-experience-t">Building a better experience through data</h2>
<p>According to John Ginder, manager for systems analytics with Ford Research &amp; Innovation, the company has been doing advanced business modeling for about 20 years, but big data is something else. Today&#8217;s technologies are allowing Ford to handle larger, more-diverse datasets than ever before possible, and its efforts are already beginning to bear fruit in numerous places &#8212; including in the cars themselves.</p>
<p>The most obvious example of data influencing the driving experience might be the types of data car companies are actually giving back to drivers. Ford&#8217;s Energi line of plug-in hybrid cars generates 25 gigabytes of data per hour that&#8217;s then processed and given back to drivers <a href="http://media.ford.com/images/10031/MyFord_Mobile.pdf">via a mobile app</a>. It tells them about battery life, the nearest charging stations and other data about the vehicle&#8217;s performance.</p>
<div id="attachment_635022" class="wp-caption aligncenter" style="width: 718px"><a href="http://gigaom2.files.wordpress.com/2013/04/ford-energi.jpg"><img  alt="The MyFord mobile app architecture." src="http://gigaom2.files.wordpress.com/2013/04/ford-energi.jpg?w=708&#038;h=229" width="708" height="229" class="size-large wp-image-635022" /></a><p class="wp-caption-text">The MyFord mobile app architecture.</p></div>
<p>Ginder said all that data is the result of a &#8220;convergence of need and opportunity.&#8221; The opportunity is a way to experiment with collecting and presenting vehicle data on a group of early adopters that&#8217;s probably more interested in this type of advanced technology. The need has to do with what Ginder calls &#8220;range anxiety&#8221; &#8212; when drivers are getting used to electric vehicles, they need reassurance they&#8217;re not going to run out of juice.</p>
<p>However, Ginder said, the company is just scratching the surface of what&#8217;s possible, because there aren&#8217;t that many of the electric vehicles on the road yet. The goal is to better understand how drivers are using the vehicles and use that information to continuously improve the vehicles and the overall experience. Ford&#8217;s Super Duty line of pickup trucks also offers a <a href="http://crewchief.telogis.com/how-it-works/">&#8220;crew chief&#8221; package</a> that lets bosses monitor the fuel consumption, engine performance and other data about their fleets of vehicles.</p>
<p>Mike Cavaretta, technical leader for predictive analytics and data mining with Ford Research &amp; Innovation, added that Ford is really interested in collecting more data from more vehicles, but noted there&#8217;s also a privacy concern that could come into play. The potential of someone knowing where and how you&#8217;re driving might not appeal to the mainstream just yet (just look at all that data Tesla collects about its cars <a href="http://gigaom.com/2013/02/14/five-important-lessons-from-the-dustup-over-the-nyts-tesla-test-drive/">and can present if it really wants to</a>), but as with the Energi, data does present some opportunities to improve the customer experience.</p>
<p>The test cars in Ford&#8217;s research labs are collecting about 250 gigabytes of data per hour from high-resolution cameras and an array of sensors, Cavaretta noted, and the company is trying to find out what data is most useful and how it might be rolled into production vehicles.</p>
<h2 id="building-betters-cars-through-">Building better cars through data</h2>
<p>Of course, sometimes the best data isn&#8217;t the stuff you see, but the stuff that just makes your car better. Cavaretta said Ford analyzes a lot of social media and other external data in order to figure out, for example, what customers are saying about their vehicles compared with other makes and what problems they&#8217;re having.</p>
<div id="attachment_635027" class="wp-caption alignleft" style="width: 310px"><a href="http://gigaom2.files.wordpress.com/2013/04/esp13_feat_technology_liftgate.jpg"><img  alt="Opens with the touch of a foot. Source: Ford" src="http://gigaom2.files.wordpress.com/2013/04/esp13_feat_technology_liftgate-e1367011341438.jpg?w=300&#038;h=132" width="300" height="132" class="size-medium wp-image-635027" /></a><p class="wp-caption-text">Opens with the touch of a foot. Source: Ford</p></div>
<p>In one recent case, the product development team was curious as to whether the Ford Escape sport-utility vehicle should have a standard liftgate (i.e., it opens manually and the rear window can flip open) or a power liftgate in which the glass and the gate are one piece. In the latter option, the gate opens automatically by tapping under the rear bumper with your foot, but the window doesn&#8217;t open at all. Regular surveys hadn&#8217;t addressed the question, so Cavaretta and his team took to social media, where people were actually talking about it quite a bit and seemed to heavily favor the power liftgate in most cases. It&#8217;s now a feature.</p>
<p>Back in 2004, Ford <a href="http://www.theinquirer.net/inquirer/news/1015284/aston-martin-gets-neural-network">built a self-learning neural network system</a> for its Aston Martin luxury brand that maintains proper engine function by recognizing engine misfires and particular driving conditions and adjusting warnings and performance accordingly.</p>
<p>Ginder said his team has been improving on that technology ever since and actually expanded its use into a system, called the Smart Inventory Management System (SIMS), that lets dealers ensure they have the optimal stock of vehicles and features on their lots. Historically, he said, some dealers were very sophisticated about inventory management, while others were more reactive (&#8220;They just sold a red Mustang,&#8221; he joked, &#8220;so they think they need to go order another red Mustang&#8221;). With SIMS, all sorts of data about vehicle sales and other locally relevant data from across the country is aggregated in Ford&#8217;s big data platform, and the neural network algorithms learn the current patterns so Ford can make better recommendations &#8212; whether or not dealers choose to heed the advice.</p>
<h2 id="selling-big-data-internally">Selling big data internally</h2>
<p>Cavaretta characterizes the division in which he and Ginder work as &#8220;an Ernst &amp; Young, but just for Ford,&#8221; an internal consultancy (as opposed to Ford&#8217;s more-traditional research and development division) in charge of solving business problems via analytics. About 80 percent of those problems come directly from those lines of business, while about 20 percent are the research division&#8217;s own ideas. However, although he&#8217;s excited about how big data can help his team answer these questions in novel ways, it&#8217;s not always an easy sell with other parts of the company.</p>
<p>Mashing up data sources such as social and sales in order to find insights is a pretty easy sell, Cavaretta explained, but getting people to put sensors in everything and collect data every second or with every transaction can still be a bit challenging. In part, this is just a lingering effect of the constraints that legacy technologies imposed on the company. It wasn&#8217;t possible to store all this data, so people just got accustomed to the status quo of summarizing data hourly, for example.</p>
<div id="attachment_635020" class="wp-caption alignright" style="width: 310px"><a href="http://gigaom2.files.wordpress.com/2013/04/map_skv_9439.jpg"><img  alt="Source: Ford" src="http://gigaom2.files.wordpress.com/2013/04/map_skv_9439.jpg?w=300&#038;h=215" width="300" height="215" class="size-medium wp-image-635020" /></a><p class="wp-caption-text">Source: Ford</p></div>
<p>Now, however, he&#8217;s pushing them to &#8220;dial it down&#8221; and collect data at the lowest level possible and as often as possible. In manufacturing alone, he explained, there are between 20,000 and 25,000 parts in any given vehicle, and there&#8217;s a supply chain that spans from parts suppliers all the way up to dealerships. Getting a complete view of this process could help drive serious efficiencies and, Cavaretta said, &#8220;We don&#8217;t see anything but big data technologies that can get us there.&#8221;</p>
<p>Other areas where Ford is collecting, or wants to collect, more real-time data include websites, call centers and the company&#8217;s credit-processing arm, he added.</p>
<h2 id="building-big-data-internally">Building big data internally</h2>
<p>In order to accomplish their lofty goals, the Research &amp; Innovation analytics team relies heavily on open source technologies, most prominently Hadoop. However, Cavaretta said, they&#8217;ve been experimenting with a variety of natural-language processing tools, too, and even did a proof-of-concept with SAP&#8217;s HANA in-memory analytic database. The NLP tools were first turned on text analysis of internal surveys and dealer network documents, but now are used pretty heavily on social media and other web data.</p>
<p>Their team has some systems numbering in the dozens of nodes in its own building, but on weekends it&#8217;s able to borrow high-performance computing cycles from Ford&#8217;s Numerically Intensive Computing Center next door in order to model recommendation engines and other tasks that demand serious computing power.</p>
<p>But as a part of a specialized research division, the work that Ginder, Cavaretta and their team do on everything from Hadoop to visualization with tools like Tableau isn&#8217;t automatically ready for primetime. In fact, Cavaretta said, it looks at &#8220;what&#8217;s the art of the possible&#8221; and tries to show the value of it. It&#8217;s like a vanguard, he added, going out and seeing what&#8217;s ahead and then reporting back.</p>
<p>At that point, projects are often handed off to Ford&#8217;s central IT team that actually puts the technologies into production. A system that took the research team weeks to deploy and start deriving insights from might take IT months to make production-ready. However, Ginder added, his team can&#8217;t just throw stuff over the wall and abandon it &#8212; it has to collaborate with the IT team and individual departments throughout the project&#8217;s lifecycle.</p>
<p>An important part of this cross-company relationship &#8212; and <a href="http://gigaom.com/2013/04/16/how-to-hire-data-scientists-and-get-hired-as-one/">something many CIOs have likely heard before</a> &#8212; is having data scientists on board who can see the world through the eyes of both technologists and businesspeople, two groups that often have different concerns and goals in mind. &#8220;We look for people who can bridge those worlds,&#8221; Ginder said. &#8220;It&#8217;s hard to find these people, but they&#8217;re hugely important to organizations.&#8221;</p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-53023p1.html">Shutterstock user PhotoSmart</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/04/26/how-data-is-changing-the-car-game-for-ford/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/04/shutterstock_612924.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/04/shutterstock_612924.jpg?w=150" medium="image">
			<media:title type="html">car and disk drive</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/04/ford-energi.jpg?w=708" medium="image">
			<media:title type="html">The MyFord mobile app architecture.</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/04/esp13_feat_technology_liftgate-e1367011341438.jpg?w=300" medium="image">
			<media:title type="html">Opens with the touch of a foot. Source: Ford</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/04/map_skv_9439.jpg?w=300" medium="image">
			<media:title type="html">Source: Ford</media:title>
		</media:content>
	</item>
		<item>
		<title>How to hire data scientists and get hired as one</title>
		<link>http://gigaom.com/2013/04/16/how-to-hire-data-scientists-and-get-hired-as-one/</link>
		<comments>http://gigaom.com/2013/04/16/how-to-hire-data-scientists-and-get-hired-as-one/#comments</comments>
		<pubDate>Tue, 16 Apr 2013 23:56:37 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[big data]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[data scientists]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[Netflix]]></category>
		<category><![CDATA[Orbitz]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=631544</guid>
		<description><![CDATA[Data scientist might be the sexiest job of the 21st century, but it's hardly an easy gig to land. Here is some advice from practitioners at Netflix, Orbitz and Hortonworks on how to get hired and even do the hiring.]]></description>
				<content:encoded><![CDATA[<p>As you might have heard before if you read <a href="http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation">McKinsey reports</a>, <a href="http://www.nytimes.com/2013/04/14/education/edlife/universities-offer-courses-in-a-hot-new-field-data-science.html">the New York Times</a> or <a href="http://gigaom.com/2013/01/06/why-data-scientists-matter-data-science-is-the-future-of-everything/">just about any technology news site</a>, data scientists are in high demand. Heck, the Harvard Business Review <a href="http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/">called it the sexiest job of the 21st century</a>. But landing a gig as a data scientist isn&#8217;t easy &#8212; especially a top-notch gig at a major web or e-commerce company where merely <em>talented </em>people are a dime a dozen.</p>
<p>However, companies are starting to talk openly about what they look for in data scientists, including the skills someone should have and what they&#8217;ll need to know to survive an interview. I spent a day at the <a href="http://www.predictiveanalyticsworld.com/sanfrancisco/2013/">Predictive Analytics World</a> conference on Monday and heard both Netflix and Orbitz give their two cents. That was also the day Hortonworks <a href="http://hortonworks.com/blog/hortonworks-hadoop-data-science/">published a blog post about how to build a data science team</a>.</p>
<p>Granted that &#8220;data scientist&#8221; is a nebulous term &#8212; perhaps as much so as &#8220;big data&#8221; &#8212; these tips (a mashup of all three sources) are still broadly applicable. If you want to make the leap from guy who knows data to data scientist, I suggest paying attention.</p>
<h2 id="1-know-the-core-competencies">1. Know the core competencies.</h2>
<p>For most of us, there&#8217;s readin&#8217;, &#8216;ritin&#8217; and &#8216;rithmetic. For data scientists, there&#8217;s SQL, statistics, predictive modeling and programming (probably Python). If you don&#8217;t have at least a grounding in these skills, you&#8217;re probably not getting through the door, in part because they form a common language that lets people from different backgrounds talk to each other.</p>
<p>Hortonworks&#8217; Ofer Mendelevitch describes the ideal data scientist as occupying a place on the spectrum between a software engineer and a research scientist. In distinguishing a great engineer, mathematician or data analyst from a data scientist, programming skills are probably the biggest variable. That&#8217;s because being able to write code means you&#8217;ll have an easier time testing out your hypotheses and algorithms, hacking through certain problems and generally thinking in ways that actually relate to the products your employer is building.</p>
<div id="attachment_631679" class="wp-caption aligncenter" style="width: 718px"><a href="http://gigaom2.files.wordpress.com/2013/04/ofer1.png"><img  alt="Source: Hortonworks" src="http://gigaom2.files.wordpress.com/2013/04/ofer1.png?w=708&#038;h=77" width="708" height="77" class="size-large wp-image-631679" /></a><p class="wp-caption-text">Source: Hortonworks</p></div>
<p>Chris Pouliot, director of algorithms and analytics at Netflix, said even being able to &#8220;pseudo-code&#8221; might be good enough if someone is otherwise a strong candidate. You can pick up SQL or Python or whatever you need pretty quickly, he noted.</p>
<p>Or, hinted Orbitz VP of Advanced Analytics Sameer Chopra, you could just suck it up and learn Python now: &#8220;If you were to leave today and ask &#8216;What specific skills should I learn?&#8217;: Python.&#8221;</p>
<h2 id="2-know-a-little-more">2. Know a little more.</h2>
<p>Of course, just meeting the minimum requirements never got anybody a job (well, almost nobody). What Pouliot is <em>really </em>looking for in a candidate is: an advanced degree in a quantitative field; hands-on experience hacking data (ideally using Hive, Pig, SQL or Python); good exploratory analysis skills; the ability to work with engineering teams; and the ability to create algorithms and models rather than relying on out-of-the-box ones.</p>
<p>Chopra&#8217;s advice was to get up to speed on machine learning, especially if you want to work in Silicon Valley, <a href="http://gigaom.com/2013/03/20/its-not-skynet-yet-in-machine-learning-theres-still-a-role-for-humans/">where machine learning has exploded in popularity</a>. He&#8217;s also a big fan of honing those hacking skills because <a href="http://gigaom.com/2012/10/04/how-trifacta-wants-to-teach-humans-and-data-to-work-together/">data munging is such a valuable skill</a> when you&#8217;re dealing with so many types of data that you need to process so they work together. If you can do quality analytics across myriad data sources, Chopra said, &#8220;you can write your own ticket in this day and age.&#8221;</p>
<p>Oh, and if you&#8217;re planning to work at a startup, he added, R is almost a must-know for anyone whose job will entail statistical analysis.</p>
<h2 id="3-embrace-online-learning">3. Embrace online learning.</h2>
<p>If it all sounds a little daunting, don&#8217;t be too worried, Chopra advised. That&#8217;s because there are plenty of opportunities to learn these new skills online via both massive open online courses (he&#8217;s particularly keen on Udacity&#8217;s Computer Science 101 and <a href="http://gigaom.com/2012/10/14/why-becoming-a-data-scientist-might-be-easier-than-you-think/">Andrew Ng&#8217;s machine learning course on Coursera</a>) and universities&#8217; own online curricula. Chopra also suggested joining professional groups on LinkedIn, <a href="http://gigaom.com/2012/09/21/forget-your-fancy-data-science-try-overkill-analytics/">participating in Kaggle competitions</a> and maybe even getting out of the house by going to meetups.</p>
<p>Whatever you&#8217;re curious about, though &#8212; text mining, natural language processing, deep learning &#8212; you can probably find someone willing to teach you for free or nearly free, and any additional skills will help set you apart from the crowd.</p>
<h2 id="4-learn-to-tell-a-story">4. Learn to tell a story.</h2>
<p>Last month at Structure: Data, DJ Patil told me that one of the biggest skill shortcomings in data science <a href="http://gigaom.com/2013/03/20/big-data-is-still-hard-but-it-gets-better/">is the ability to tell a story with data</a> beyond just pointing to the numbers. Chopra agreed, noting that today&#8217;s new visualization tools make it easier to display data in formats that non-scientists might be able to (or at least want to) consume. A corollary of storytelling is good, old-fashioned communication: All the charts in the world won&#8217;t make a difference if you can&#8217;t communicate to product managers or executives why your findings matter.</p>
<p>Pouliot is a little less sold on communication skills, though &#8212; at least sometimes. If you&#8217;re an engineer primarily talking to other engineers, he told the room, you probably can speak all the jargon you want. It&#8217;s only when someone has a business-facing role that communication really becomes important.</p>
<h2 id="5-prepare-to-be-tested-aka-you">5. Prepare to be tested (aka &#8220;Your pedigree means nothing&#8221;).</h2>
<p>After you&#8217;ve learned all these skills, added them to your résumé and talked to a hiring manager about how good you are at them, it&#8217;s likely testing time. Prospective Netflix data scientists go through a battery of exercises, Pouliot says, including explaining projects they&#8217;ve worked on and questions to determine the depth of their knowledge. They&#8217;ll also be asked to devise a framework that solves a problem of the interviewer&#8217;s choice.</p>
<div id="attachment_631680" class="wp-caption alignright" style="width: 310px"><a href="http://gigaom2.files.wordpress.com/2013/04/20130415_150900.jpg"><img  alt="Chris Pouliot" src="http://gigaom2.files.wordpress.com/2013/04/20130415_150900.jpg?w=300&#038;h=225" width="300" height="225" class="size-medium wp-image-631680" /></a><p class="wp-caption-text">Chris Pouliot</p></div>
<p>One thing Pouliot warned about is an over-reliance on what&#8217;s on your résumé. Right off the bat, for example, he&#8217;ll test the heck out of the skills or knowledge that someone claims, to ensure they really have it.</p>
<p>Having a Stanford degree and work experience at Google don&#8217;t necessarily make someone a shoo-in, either. Pouliot acknowledged during a quick chat after his presentation that he&#8217;s been seduced by the perfect résumé before &#8212; even going so far as to cut a few corners to get someone in for an interview &#8212; only to be disappointed in the end. Everyone has to pass the tests, he said, and some of the best applicants on paper crashed and burned very early in the process.</p>
<h2 id="6-exercise-creativity">6. Exercise creativity.</h2>
<p>It&#8217;s during the testing phase at places like Netflix that all those personal skills and experience can come into play. There&#8217;s often no right answer when it comes to answering the hypotheticals an interviewer like Pouliot might ask, and he gives bonus points for solutions he&#8217;s never seen before. &#8220;Creativity is one of the biggest things to look for when hiring data scientists,&#8221; he said. Later, he added, &#8220;Creativity is king, I think, for a great data scientist.&#8221;</p>
<h2 id="bonus-tips-for-anyone-hiring-a">Bonus tips for anyone hiring and managing data scientists</h2>
<p>Technically, Pouliot&#8217;s talk at Predictive Analytics World was about hiring data scientists, but many of the insights were probably more valuable to aspiring data scientists. Some of them, though, were definitely for management, possibly at the C-level. A few points to consider:</p>
<ul>
<li>Netflix has a standalone data science team that works closely with other departments but ultimately answers to itself. This helps the data scientists collaborate with one another, gives them upward mobility (i.e., they might never become director of marketing, but they could become director of data science) and makes it easier to manage them because everyone speaks the same language so an employee knows his boss knows his stuff.</li>
</ul>
<p style="padding-left:30px;">However, he noted, the alternative approach of embedding data scientists within other departments does bring its own benefits. That type of setup can result in a better alignment of research efforts and business needs, and it can help products get built faster because everyone is on the same page. Pouliot suggests one compromise might be to keep a centralized data science team but locate it physically near the other teams it will be interacting with most often, and another is just to ensure you have representatives from every stakeholder department present for meetings and problem-solving exercises.</p>
<ul>
<li>Actually, if you just cannot hire data scientists with all the skills you want them to have, Mendelevitch from Hortonworks suggests a similar tactic. It can be difficult to teach applied math to software engineers and vice versa, so, he writes, &#8220;[S]imply build a Hadoop data science team that combines data engineers and applied scientists, working in tandem to build your data products. Back when I was at Yahoo!, that’s exactly the structure we had: applied scientists working together with data engineers to build large-scale computational advertising systems.&#8221;</li>
</ul>
<ul>
<li>If you want to retain your good data scientists once you&#8217;ve hired them &#8212; especially in Silicon Valley where they can walk out the door and get five offers &#8212; paying them the market rate is a good start. Additionally, Pouliot said, letting them work on challenging products will keep them happy. Micro-managing them will not.</li>
</ul>
<p><em>This post was corrected 4/23 with the correct spelling of Hortonworks&#8217; Ofer Mendelevitch.</em></p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-461077p1.html">Shutterstock user Sergey Nivens</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/04/16/how-to-hire-data-scientists-and-get-hired-as-one/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/04/shutterstock_115502362.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/04/shutterstock_115502362.jpg?w=150" medium="image">
			<media:title type="html">data scientist at board</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/04/ofer1.png?w=708" medium="image">
			<media:title type="html">Source: Hortonworks</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/04/20130415_150900.jpg?w=300" medium="image">
			<media:title type="html">Chris Pouliot</media:title>
		</media:content>
	</item>
		<item>
		<title>Gracenote co-founder on &#8216;iPod day&#8217; and better music through data</title>
		<link>http://gigaom.com/2013/04/15/gracenote-co-founder-on-ipod-day-and-better-music-through-data/</link>
		<comments>http://gigaom.com/2013/04/15/gracenote-co-founder-on-ipod-day-and-better-music-through-data/#comments</comments>
		<pubDate>Tue, 16 Apr 2013 03:00:37 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[Apple]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[Gracenote]]></category>
		<category><![CDATA[iPod]]></category>
		<category><![CDATA[itunes]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[second screen]]></category>
		<category><![CDATA[spotify]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=630346</guid>
		<description><![CDATA[More than a decade ago, Gracenote received some cryptic advice from Apple to buy more servers. What followed -- the launch of iTunes and iPod -- blew up Gracenote's database to epic proportions and laid the groundwork for a metadata empire.]]></description>
				<content:encoded><![CDATA[<p>It was April 2000 when the team at Gracenote got a call from Apple that would change its business forever. Apple wouldn&#8217;t give Gracenote any specifics, but it did offer up some prescient advice: &#8220;You need to buy more servers.&#8221;</p>
<p>A few years into Steve Jobs&#8217;s second stint as Apple&#8217;s CEO, the company hadn&#8217;t yet reinvented itself as one of the world&#8217;s most-important technology companies, but it was a big-enough distribution channel for the two-year-old Gracenote. At that point, Gracenote had built a respectable business collecting and providing metadata for the compact discs that people were ripping onto their computers, and it relied on software partners to get in front of the music consumers doing the uploading. One of those partners was a popular Mac jukebox application called SoundJam MP.</p>
<div id="attachment_631331" class="wp-caption alignright" style="width: 180px"><a href="http://gigaom2.files.wordpress.com/2013/04/mgmt-ty-roberts.jpg"><img  alt="Ty Roberts Source: Gracenote" src="http://gigaom2.files.wordpress.com/2013/04/mgmt-ty-roberts.jpg?w=708"   class="size-full wp-image-631331" /></a><p class="wp-caption-text">Ty Roberts Source: Gracenote</p></div>
<p>So, Gracenote Co-founder and CTO Ty Roberts told me during a recent interview, his company heeded Apple&#8217;s warning and bought more servers. At some point around that time (details on the date of the acquisition are sketchy), <a href="http://www.maclife.com/article/feature/complete_itunes_history_soundjam_mp_itunes_9">Apple bought SoundJam MP</a>. Then, at MacWorld in January 2001, Apple released the first version of iTunes (based on the SoundJam technology) and grew Gracenote&#8217;s footprint by putting it on more machines. In October 2001, Apple released the iPod and changed Gracenote&#8217;s life forever.</p>
<p>The holiday season &#8212; particularly Christmas morning &#8212; provides a clear example of how stark the change was. &#8220;We used to call it iPod day,&#8221; Roberts explained, because the company&#8217;s servers would go crazy as people opened up their new iPods and immediately began ripping CDs onto their computers. The company&#8217;s chief scientist would stay up 20 hours a day for 5 days straight to make sure the database didn&#8217;t crash under the load.</p>
<p>From that point on, Roberts explained, a graph showing the rate at which people were uploading music to Gracenote would go from a steady incline into a vertical line. At one point the company was getting metadata from &#8212; by Roberts&#8217;s estimate &#8212; literally every CD being ripped onto personal computers. There was so much database traffic &#8212; both writing and reading &#8212; because Apple didn&#8217;t release the first version of the iTunes Store until April 2003; if users wanted to use their iPods, they had to upload music first.</p>
<h2 id="scaling-like-the-big-boys">Scaling like the big boys</h2>
<p>Today, of course, Gracenote (which <a href="http://paidcontent.org/2008/04/23/419-sony-buys-media-metadata-firm-gracenote-for-260-million/">Sony acquired for $260 million in 2008</a>) is pretty much ubiquitous, at least when it comes to metadata. It has metadata for about 130 million songs &#8212; and growing &#8212; from all over the world and provides metadata to everything from iTunes to Path to your car&#8217;s entertainment console. Even if they&#8217;re not available for sale as MP3, if someone somewhere at some point ripped a CD and entered its information, Gracenote has data on those artists and songs.</p>
<p>Its database now gets 15 billion queries a month, or 500 million a day (&#8220;We&#8217;re probably bigger than Bing,&#8221; Roberts joked), and the company&#8217;s infrastructure has scaled a few times to meet this demand. What began as a small web database running on a few servers grew into an Oracle environment that provided better performance. And when Oracle became cost-prohibitive because of Gracenote&#8217;s expanding scale, it shifted again into a highly optimized system that spans thousands of cores in four global data centers.</p>
<p>Now, GM and VP of Automatic Content Recognition Michael Jeffrey noted, almost everything from the chip level up is optimized specifically for Gracenote.</p>
<h2 id="theres-no-world-music-when-you">There&#8217;s no &#8220;world music&#8221; when you&#8217;re in the &#8220;world&#8221;</h2>
<p>And this setup lets Gracenote do a lot more than just recognize music listeners&#8217; files and give them the album art. For one, Roberts explained, it lets Gracenote be a global company. &#8220;We want to have all the music in the world,&#8221; Roberts said, &#8220;&#8230; because our customers ship their products globally.&#8221; In fact, part of the reason it&#8217;s now part of Sony is that Sony was distributing Gracenote so widely as part of the music player in its Vaio line of laptops.</p>
<p>In order to ensure that everyone has a natural experience wherever they&#8217;re accessing Gracenote, part of the job of the company&#8217;s 100-person editorial team is to categorize music hierarchically <em>by locality</em>. So, when a user in Japan uploads a CD and Gracenote returns the metadata, it&#8217;s categorized as &#8220;rock and roll,&#8221; for example, rather than a catch-all category like &#8220;world music&#8221; that a U.S. user might see.</p>
<p>&#8220;We want music to feel like a person in your country actually organized it,&#8221; Roberts said, &#8220;not some dude from California.&#8221;</p>
<h2 id="better-music-and-television-th">Better music and television through data science</h2>
<p>All that data also makes Gracenote a natural fit for recommending new music, although right now the company prefers to let partners handle the algorithms because recommendations tend to be highly product-specific. For example, the iTunes Genius feature <a href="http://gigaom.com/2010/06/04/heres-how-the-web-reads-your-mind/">is a pretty run-of-the-mill recommendation engine</a>, but, Roberts explained, Apple places a premium on accuracy because its recommendations cost users 99 cents (or more) a shot. With a subscription service like Spotify, though, trying new music is risk-free, so it can play a little faster and looser with its algorithms.</p>
<p>Because Gracenote is present in so many cars &#8212; about 35 million &#8212; the company has put a lot of thought into how to optimally deliver services there, too. Until drivers can bring their interest graphs and music libraries with them to their cars, Roberts explained, any sort of in-car recommendation engine has to be pretty simple and non-distracting &#8212; perhaps a thumbs-up or thumbs-down button on the display that will eventually be able to recognize someone&#8217;s tastes.</p>
<p>The company has even developed what Roberts calls &#8220;machine listening,&#8221; which is the ability of an algorithm to recognize the mood, tempo and other audio attributes of music. This is comparable to what Pandora offers, but Gracenote has data on pretty much any song someone could possibly have, which means it can make even your personal music library that much smarter. One idea the company is tinkering with is something Roberts describes as &#8220;audio coffee.&#8221; Depending on any variety of factors &#8212; time of day, location, driving conditions or behavior &#8212; the stereo system <a href="http://gigaom.com/2013/02/19/how-gracenote-is-building-a-car-stereo-that-senses-your-driving-mood/">could pick music that either picks up a driver&#8217;s pulse or maybe relaxes him</a>.</p>
<p>For Gracenote&#8217;s next chapter, the company is banking on tablets to deliver a kick like the iPod did last decade. Gracenote is already <a href="http://gigaom.com/2012/12/14/gracenote-ad-targeting/">working with television partners on real-time ad-swapping</a> and intelligent content recommendations, and now it wants to dive deep into the second-screen world. Its new product called <a href="http://www.gracenote.com/video/recognition/">Entourage</a> uses a tablet&#8217;s internal sensors to hear the television show or music playing in a room and then surface related content, perhaps from the web &#8212; like <a href="http://www.engadget.com/2013/01/03/zeebox-gracenote-entourage/">what Entourage user Zeebox provides</a> &#8212; or perhaps produced, interactive material like the SyFy channel delivers via its Sync app.</p>
<div id="attachment_631329" class="wp-caption aligncenter" style="width: 718px"><a href="http://gigaom2.files.wordpress.com/2013/04/20130409_113220.jpg"><img  alt="Two ads for two different viewers." src="http://gigaom2.files.wordpress.com/2013/04/20130409_113220.jpg?w=708&#038;h=531" width="708" height="531" class="size-large wp-image-631329" /></a><p class="wp-caption-text">Two ads for two different viewers.</p></div>
<p>Later this year, GM and VP Jeffrey said, Gracenote will be doing pilots with some large sports broadcasters around a &#8220;cheer and jeer&#8221; feature that measures how hard people in a room are cheering for or booing their favorite sports teams. If you&#8217;re elated, you might see an ad for season tickets. If you&#8217;re sad, maybe it&#8217;s an ad for beer.</p>
<p>Even Roberts is impressed, especially considering that the company&#8217;s first use of audio recognition was to make sure users got the right data for their exact version of a song: &#8220;I never thought the recognition would break open these kind of new fields.&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/04/15/gracenote-co-founder-on-ipod-day-and-better-music-through-data/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/04/ipod1.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/04/ipod1.jpg?w=150" medium="image">
			<media:title type="html">ipod</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/04/mgmt-ty-roberts.jpg" medium="image">
			<media:title type="html">Ty Roberts Source: Gracenote</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/04/20130409_113220.jpg?w=708" medium="image">
			<media:title type="html">Two ads for two different viewers.</media:title>
		</media:content>
	</item>
		<item>
		<title>We need a data democracy, not a data dictatorship</title>
		<link>http://gigaom.com/2013/04/07/we-need-a-data-democracy-not-a-benevolent-data-dictatorship/</link>
		<comments>http://gigaom.com/2013/04/07/we-need-a-data-democracy-not-a-benevolent-data-dictatorship/#comments</comments>
		<pubDate>Sun, 07 Apr 2013 20:28:46 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[analytics]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[consumer software]]></category>
		<category><![CDATA[consumerization]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[tableau]]></category>
		<category><![CDATA[Visualization]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=628038</guid>
		<description><![CDATA[A data democracy built to last needs tools that empower everyone to work with data rather than relying on apps and data scientists. Tableau helped ignite the data revolution, and its IPO could help it keep going.]]></description>
				<content:encoded><![CDATA[<p>The <a href="http://gigaom.com/2012/06/21/kundra-democratizing-data-means-a-fundamental-shift-in-power/">democratization of data</a> is a real phenomenon, but building a sustainable data democracy means truly giving power to the people. The alternative is just a shift of power from traditional data analysts within IT departments to a new generation of data scientists and app developers. And this seems a lot more like a dictatorship than a democracy &#8212; a benevolent dictatorship, but a dictatorship nonetheless.</p>
<p>These individuals and companies aren&#8217;t entirely bad, of course, and they&#8217;re actually necessary. Apps that help <a href="http://gigaom.com/2012/10/02/prismatics-bradford-cross-first-we-understand-media-then-the-world/">predict what we want to read</a>, <a href="http://gigaom.com/2012/07/30/siri-is-so-aloof-saga-wants-to-get-to-know-you/">where we&#8217;ll want to go next</a> or <a href="http://gigaom.com/2012/12/06/pandora-founder-says-spotify-not-fundamentally-competitive-with-radio-listening/">what songs we&#8217;ll like</a> are certainly cool and even beneficial in their ability to automate and optimize certain aspects of our lives and jobs. In the corporate world, there will always be data experts who are smarter and trained in advanced techniques and <a href="http://gigaom.com/2013/01/06/why-data-scientists-matter-data-science-is-the-future-of-everything/">who should be called upon to answer the toughest questions</a> or tackle the thorniest problems.</p>
<p>Last week, for example, Salesforce.com <a href="http://www.zdnet.com/salesforce-com-taps-into-graph-craze-with-latest-chatter-update-7000013482/">introduced a new feature of its Chatter intra-company social network</a> that categorizes a variety of data sources so employees can easily find the people, documents and other information relevant to topics they&#8217;re interested in. As with similarly devised services &#8212; <a href="http://gigaom.com/2013/03/03/how-and-why-linkedin-is-becoming-an-engineering-powerhouse/">LinkedIn&#8217;s People You May Know</a>, the <a href="http://gigaom.com/2013/02/07/the-future-of-search-is-gravitational-content-will-come-to-you/">gravitational search movement</a>, or any type of service <a href="http://gigaom.com/2012/03/15/the-personalized-web-is-just-an-interest-graph-away/">using an interest graph</a> &#8212; the new feature&#8217;s beauty and utility lie in its abstraction of the underlying semantic algorithms and data processing.</p>
<p>The problem, however, comes when we&#8217;re forced to rely on these people, features and applications to decide how data can affect our lives or jobs, or what questions we can answer using the troves of data now available to us. In a true data democracy, citizens <a href="http://gigaom.com/2012/12/22/we-dont-need-more-data-scientists-just-simpler-ways-to-use-big-data/">must be empowered to make use of their own data as they see fit</a> and they must only <em>have</em> to rely on apps and experts by choice or when the task really requires an expert hand. At any rate, citizens must be informed enough to have a meaningful voice in bigger decisions about data.</p>
<h2 id="the-democratic-revolution-is-u">The democratic revolution is underway</h2>
<p>The good news is that there&#8217;s a whole new breed of startups trying to empower the data citizenry, whatever their role. Companies such as <a href="http://gigaom.com/2012/08/14/how-0xdata-wants-to-help-everyone-become-data-scientists/">0xdata</a>, <a href="http://gigaom.com/2012/09/27/startup-precog-says-big-data-doesnt-need-to-be-so-complex/">Precog</a> and <a href="http://gigaom.com/2013/01/25/how-to-succeed-on-kickstarter-find-35-people-and-ask-for-less-than-9000/">BigML</a> are trying to make data science more accessible to everyday business users. There are next-generation business intelligence startups such as <a href="http://gigaom.com/2013/04/03/forget-in-memory-sisense-raises-10m-for-in-chip-analytics/">SiSense</a>, <a href="http://gigaom.com/2013/03/26/white-hot-bi-on-hadoop-startup-platfora-now-ga/">Platfora</a> and <a href="http://gigaom.com/2012/12/05/clearstory-data-raises-9m-and-might-actually-make-data-your-friend/">ClearStory</a> rethinking how business analytics are done in an era of HTML5 and big data. And then there are companies such as <a href="http://gigaom.com/2013/03/17/statwing-wants-to-make-your-data-and-armchair-quarterback-dreams-come-true/">Statwing</a>, <a href="http://gigaom.com/2012/10/16/infogr-am-tries-to-keep-the-ugly-out-of-graphic-design/">Infogram</a> and <a href="http://gigaom.com/2012/05/31/data-hero-aims-to-turn-us-all-into-analytics-stars/">Datahero</a> (which will be in beta mode soon, by the way) trying to bring data analysis to the unwashed non-data-savvy masses.</p>
<p>Combined with a growing number of publicly available data sets and data marketplaces, and more ways of collecting every possible kind of data &#8212;  personal fitness, web analytics, energy consumption, you name it &#8212; these self-service tools can provide an invaluable service. In January, I <a href="http://gigaom.com/2013/01/31/data-for-dummies-5-data-analysis-tools-anyone-can-use/">highlighted how a number of them can work</a> by using my own dietary and activity data, as well as publicly available gun-ownership data and even web-page text. But as I explained then, they&#8217;re still not always easy for laypeople to use, much less perfect.</p>
<div id="attachment_628507" class="wp-caption aligncenter" style="width: 718px"><a href="http://gigaom2.files.wordpress.com/2013/04/statwing.jpg"><img  alt="Statwing spells out statistics for laypeople." src="http://gigaom2.files.wordpress.com/2013/04/statwing.jpg?w=708&#038;h=310" width="708" height="310" class="size-full wp-image-628507" /></a><p class="wp-caption-text">Statwing spells out statistics for laypeople.</p></div>
<h2 id="can-tableau-be-datas-george-wa">Can Tableau be data&#8217;s George Washington?</h2>
<p>This is why I&#8217;m so excited about <a href="http://gigaom.com/2013/04/03/a-tableau-ipo-could-validate-the-big-data-visualization-push-or-not/">Tableau&#8217;s forthcoming IPO</a>. There are few companies that helped spur the democratization of data over the past few years more than Tableau. It has become the face of the next-generation business intelligence software thanks to its ease of use and <a href="http://gigaom.com/2012/02/23/thanks-to-consumerization-its-ipo-season-in-analytics/">focus on appealing visualization</a>, and its free public software has found avid users even among relative data novices like myself. Tableau&#8217;s success and vision no doubt inspired a number of the companies I&#8217;ve already referenced.</p>
<p>Assuming it begins its publicly traded life flush with capital, Tableau will not just be financially sound &#8212; it will also be in a position to help the burgeoning data democracy evolve into something that can last. More money means being able to develop more features that Tableau can use to bolster sales (and further empower business users with data analysis), which should mean the company can afford to also continually improve its free service and perhaps <a href="http://www.tableausoftware.com/academic/students">put premium versions in the hands</a> of more types of non-corporate professionals for free.</p>
<div id="attachment_628508" class="wp-caption aligncenter" style="width: 718px"><a href="http://gigaom2.files.wordpress.com/2013/04/goog-trans.jpg"><img  alt="Tableau is already easy -- but not easy enough." src="http://gigaom2.files.wordpress.com/2013/04/goog-trans.jpg?w=708&#038;h=360" width="708" height="360" class="size-full wp-image-628508" /></a><p class="wp-caption-text">Tableau is already easy (I made this) &#8212; but not easy enough.</p></div>
<p>The bottom-up approach has already proven very effective in the worlds of cloud computing, <a href="http://gigaom.com/2011/02/15/startup-strategies-how-lew-cirne-made-new-relic-a-saas-success/">software as a service</a> and open-source software, and I have to assume it&#8217;s a win-win situation in analytics, too. Today&#8217;s free users will be tomorrow&#8217;s paying users once they get skilled enough to want to move on to bigger data sets and better features. But the base products have to be easy enough and useful enough to get started with, or companies will only have a lot of registrations and downloads but very few avid users.</p>
<p>And if Tableau steps up its game around data democratization, I have to assume it will up the ante for the company&#8217;s fellow large analytics vendors and even startups. A race to empower the lower classes on the data ladder would certainly be in stark contrast to the historical strategy of building ever-bigger, ever-more-advanced products targeting only the already-powerful data elite. That&#8217;s the kind of revolution I think we all can get behind.</p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-80873p1.html">Shutterstock user Tiago Jorge da Silva Estima</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/04/07/we-need-a-data-democracy-not-a-benevolent-data-dictatorship/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/04/shutterstock_18363502.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/04/shutterstock_18363502.jpg?w=150" medium="image">
			<media:title type="html">democracy</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/04/statwing.jpg" medium="image">
			<media:title type="html">Statwing spells out statistics for laypeople.</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/04/goog-trans.jpg" medium="image">
			<media:title type="html">Tableau is already easy -- but not easy enough.</media:title>
		</media:content>
	</item>
	</channel>
</rss>
