<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>GigaOM &#187; natural language processing</title>
	<atom:link href="http://gigaom.com/tag/natural-language-processing/feed/" rel="self" type="application/rss+xml" />
	<link>http://gigaom.com</link>
	<description></description>
	<lastBuildDate>Tue, 21 May 2013 08:38:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='gigaom.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://0.gravatar.com/blavatar/0db8f6557d022075dbbf010c54d46d93?s=96&#038;d=http%3A%2F%2Fs2.wp.com%2Fi%2Fbuttonw-com.png</url>
		<title>GigaOM &#187; natural language processing</title>
		<link>http://gigaom.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://gigaom.com/osd.xml" title="GigaOM" />
	<atom:link rel='hub' href='http://gigaom.com/?pushpress=hub'/>
		<item>
		<title>Parakweet uses natural language processing to find value in your tweets</title>
		<link>http://gigaom.com/2013/05/17/parakweet-uses-natural-language-processing-to-find-value-in-your-tweets/</link>
		<comments>http://gigaom.com/2013/05/17/parakweet-uses-natural-language-processing-to-find-value-in-your-tweets/#comments</comments>
		<pubDate>Fri, 17 May 2013 16:00:05 +0000</pubDate>
		<dc:creator>Eliza Kern</dc:creator>
				<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[social media dashboard]]></category>
		<category><![CDATA[social network]]></category>
		<category><![CDATA[social recommendation tools]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=646402</guid>
		<description><![CDATA[Looking for a book suggestion? Culling information from your Twitter feed and turning that into accurate recommendations is harder than it looks, but Parakweet is looking to use natural language processing to do just that.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gigaom.com&#038;blog=14960843&#038;post=646402&#038;subd=gigaom2&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Millions of people access Twitter every month, and the sheer volume of tweets flowing through the company&#8217;s platform is remarkable. Different companies have tried to harness the value of those tweets and derive information from the 140 character blips. But it would seem that making suggestions to users about the best book to read or movie to watch based on tweets isn&#8217;t an easy challenge.</p>
<p><a href="http://gigaom.com/?attachment_id=646422" rel="attachment wp-att-646422"><img  alt="twitter book suggestions" src="http://gigaom2.files.wordpress.com/2013/05/screen-shot-2013-05-16-at-4-52-04-pm.png?w=287&#038;h=300" width="287" height="300" class="alignleft size-medium wp-image-646422" /></a>Parakweet is a company that&#8217;s working to use natural language processing to sift through your tweets and make smart, targeted suggestions based on the data. On Friday, the company plans to announce the launch of two products. One is <a href="http://www.bookvi.be/" target="_blank">Bookvi.be</a>, a consumer-oriented book recommendation engine; the other is TrendFinder For Movies, a social media dashboard primarily for entertainment companies to monitor conversations around movies. The latter is a paid product that provides the company with revenue, and the former is free for consumers.</p>
<p>&#8220;It&#8217;s a very hard problem we&#8217;ve tackled, which is accurately identifying sentiments,&#8221; CEO Ramesh Haridas said. &#8220;With 400 million tweets a day, there are 700,000 a day discussing movies, and if you tried text-matching techniques you&#8217;d come back with 40 million results. Many movies and books have very common titles, so you&#8217;d just drown in data.&#8221;</p>
<p>Both products use natural language processing to figure out not only how common a title is on Twitter, but also how a consumer is tweeting about a particular product, and they make recommendations based on those tweets. For instance, if I tweeted that a particular book is terrible and no one should ever read it, it would look ridiculous for a book recommendation engine to suggest that book to people. So Bookvi.be is structured to recognize the words I&#8217;m using in my tweet and know not to recommend that book. Users can choose to have a weekly email sent to them with book suggestions, and they can type in their Twitter username to get book suggestions based on the people they follow.</p>
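<p>To make that filtering step concrete, here is a minimal sketch. It is not Parakweet&#8217;s method: the word lists, the <code>score_tweet</code> and <code>recommendable_titles</code> functions and the sample tweets are all hypothetical, and a simple keyword lookup stands in for real natural language processing.</p>

```python
# Toy illustration only: a keyword lookup stands in for real NLP.
# NEGATIVE_WORDS, POSITIVE_WORDS and the sample tweets are hypothetical.
import re

NEGATIVE_WORDS = {"terrible", "awful", "boring", "worst"}
POSITIVE_WORDS = {"loved", "great", "brilliant", "recommend"}


def score_tweet(text):
    """Return positive-word count minus negative-word count for one tweet."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    return positives - negatives


def recommendable_titles(tweets_by_title):
    """Keep only the titles whose tweets add up to a net positive score."""
    recommended = []
    for title, tweets in tweets_by_title.items():
        total = sum(score_tweet(tweet) for tweet in tweets)
        if total > 0:
            recommended.append(title)
    return sorted(recommended)


sample = {
    "Book A": ["Loved Book A, would recommend it to anyone"],
    "Book B": ["Book B is terrible and no one should ever read it"],
}
print(recommendable_titles(sample))
```

<p>A word list like this misreads negation (&#8220;would not recommend&#8221;) and sarcasm, and it does nothing to separate a common phrase from a book title, which is the kind of ambiguity Haridas describes.</p>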
<p>&#8220;The bar on accuracy is very high,&#8221; Haridas said. &#8220;Especially if it&#8217;s sent via email, the precision needs to be intact.&#8221;</p>
<p>I&#8217;ve looked at a good number of social recommendation tools, and this one definitely stood out. For one, it was incredibly accurate: all the books it suggested were books I would actually read. But most importantly, it didn&#8217;t require me to create a new social network, or depend on friends for reviews, so I could get a lot of value from it right away. This is the obvious benefit of using someone else&#8217;s social graph, but Twitter seems perfectly suited to making content recommendations for things like books because, unlike my Facebook friends, the people I follow on Twitter tend to accurately reflect my intellectual interests.</p>
<p>Of course, there are the obvious potential pitfalls of building a product around someone else&#8217;s platform, although Haridas said they support Facebook and are adding other platforms. But there&#8217;s a good deal of money to be made in accurately processing and understanding the words people are tweeting, as <a href="http://gigaom.com/2013/05/13/with-lucky-sort-creators-on-board-twitter-is-officially-a-data-company/" target="_blank">evidenced by Twitter&#8217;s acquisition of Lucky Sort this week</a>, a similar company that also tries to figure out what people are talking about on social media. <a href="http://gigaom.com/2013/04/17/with-new-twitter-ads-product-you-are-what-you-tweet-to-advertisers-anyway/" target="_blank">As I&#8217;ve written before, as Twitter ramps up its advertising products it&#8217;s more important than ever for the company to be able</a> to provide brands with more accurate ad targeting, which hinges on the words people are tweeting and searching.</p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/05/17/parakweet-uses-natural-language-processing-to-find-value-in-your-tweets/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/04/twitter-newspaper.png?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/04/twitter-newspaper.png?w=150" medium="image">
			<media:title type="html">twitter-NEWSPAPER</media:title>
		</media:content>

		<media:content url="http://2.gravatar.com/avatar/bd7905cba2440e49d86bd328573730f7?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">elizakern</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/screen-shot-2013-05-16-at-4-52-04-pm.png?w=287" medium="image">
			<media:title type="html">twitter book suggestions</media:title>
		</media:content>
	</item>
		<item>
		<title>This is why big data is the sweet spot for SaaS</title>
		<link>http://gigaom.com/2013/05/14/this-is-why-big-data-is-the-sweet-spot-for-saas/</link>
		<comments>http://gigaom.com/2013/05/14/this-is-why-big-data-is-the-sweet-spot-for-saas/#comments</comments>
		<pubDate>Wed, 15 May 2013 01:10:22 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[analytics]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[BloomReach]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[marketing]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[saas]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=645189</guid>
		<description><![CDATA[When it comes to using big data technology effectively, there's a lot to like about SaaS. When companies like BloomReach create and analyze massive web-wide data sets, they automate insights that almost no individual company could discover on its own.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gigaom.com&#038;blog=14960843&#038;post=645189&#038;subd=gigaom2&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>People often ask me where the smart money is in big data. I often tell them that’s a foolish question, because I’m not an investor — but if I were, I’d look to software as a service.</p>
<p>There are two primary reasons why, the first of which is obvious: Companies are tired of managing applications and infrastructure, so something that optimizes a common task using techniques they don’t know on servers they don’t have to manage is probably compelling. It’s called cloud computing.</p>
<p>The other reason is that <a href="http://gigaom.com/2013/04/29/google-research-director-and-ai-expert-peter-norvig-elected-into-aaas/">the <em>big </em>part of big data really is important</a> if you want to get a really clear picture of what’s happening in any given space. While no single end-user company can (or likely would) address search-engine optimization, for example, by building a massive store comprised of data from hundreds or thousands of companies as well as the entire web, a cloud service dedicated to that specific task can.</p>
<p>From <a href="http://gigaom.com/2012/11/28/log-data-startup-sumo-logic-raises-30m/">web security</a> to <a href="http://gigaom.com/2012/06/21/how-collective-intelligence-is-reshaping-systems-management/">systems management</a>, we’re already seeing how centralized data stores provide SaaS companies a broad view into what’s happening that can then be filtered down to serve each individual customer’s specific situation. <a href="http://www.bloomreach.com/">BloomReach</a>, a SaaS startup that helps companies optimize web-page content, is another good example of this principle in action.</p>
<h2 id="how-do-you-say-cotton-maxi-dre">How do <em>you</em> say, “cotton maxi dress”</h2>
<p>Ideally, BloomReach Head of Marketing Joelle Kaufman told me, the company wants to help customers ensure they get found in web searches by making sure they’re not invisible (buried deep down), irrelevant (not saying anything meaningful on their sites) or incompatible (not speaking their consumers’ language). On Tuesday, the company <a href="http://www.bloomreach.com/buzz/media-center-pr/continuous-quality-management/">announced a new feature called Continuous Quality Management</a>, which lets customers continuously monitor their pages to ensure they’re still featuring the right products and the right terminology. It’s the latest addition to a seemingly useful service that’s built atop a big data foundation few — if any — of its customers would ever attempt to build themselves.</p>
<p>BloomReach is able to help companies optimize their sites because it’s constantly crawling the web in order to figure out how everyone else is describing their content, laying out their pages and structuring their links. Running on the Amazon Web Services cloud, BloomReach runs more than 1,000 Hadoop jobs a day that process about 5 terabytes of data and a billion data points about users’ site behavior. With the latter, co-founder and CTO Ashutosh Garg explained, the company is trying to figure out who’s visiting sites, what they’re doing, how long they’re spending there and how they’re related in terms of behavior.</p>
<p>“You need to have the right amount of data and from the right places before we can do anything with it,” he said. “… It’s a massive machine learning problem.”</p>
<p><a href="http://gigaom2.files.wordpress.com/2013/05/br-stack.png"><img alt="BR stack" src="http://gigaom2.files.wordpress.com/2013/05/br-stack.png?w=708&#038;h=531" width="708" height="531" class="aligncenter size-large wp-image-645359"></a></p>
<p>When you consider all the possible ways something could be described or formatted, the scale of the problem becomes more evident. Simple semantic analysis like associating “desk” and “table” is easy, Garg explained, but what if someone wants a lightweight camera and you only have its exact weight listed without any indication of how it compares to other options? What if people searching for “smartphones” really mean “Android phones,” but you’re top-loading your results with BlackBerry phones and Windows phones?</p>
<p>Another of Garg’s hypotheticals has to do with consumers’ presentation biases. If, for example, they’re looking at a lot of websites that look the same or focus on the same things (e.g., megapixels for digital cameras), they’ll expect to see the same things from every site.</p>
<h2 id="10-nonillion-possibilities-cho">10 nonillion possibilities: Choose 1.</h2>
<p>From a sheer numbers perspective, things get even hairier when you’re trying to determine the relationship between any two pages in order to figure out the best path for links to take. Garg said this is what computer scientists call an <a href="http://en.wikipedia.org/wiki/NP-complete">NP-complete problem</a>, which, loosely speaking, means no known algorithm solves it efficiently and the time needed for an exact answer can grow exponentially with the amount of content you’re analyzing. So, for example, analyzing 40 pages doesn’t take 10 times as long as analyzing 4 pages, but far longer than that.</p>
<p>Actually, BloomReach CEO Raj De Datta gave me another example of this problem <a href="http://gigaom.com/2012/02/22/bloomreach-wants-to-save-your-site-with-big-data/">when we spoke in early 2012</a>. Here’s how I described it then:</p>
<blockquote id="quote-if-a-company-wants-t"><p>[I]f a company wants to display just 1,000 products across 100 pages, De Datta explained, there are 10-to-the-28th-power (10 octillion) possibilities for how to do that. When it comes time to describe those products, there are 10-to-the-30th-power (10 nonillion) possibilities.</p></blockquote>
<p>If a website has a million pages, Garg said, “it will take you longer than the life of the universe to solve that problem.”</p>
<p>Where this type of problem arises, BloomReach turns to <a href="http://en.wikipedia.org/wiki/Monte_Carlo_method">Monte Carlo simulations</a>, a favorite technique of physicists and Wall Street quants. The method involves running lots of simulations over large data sets in order to determine approximate results in a reasonable time frame. (And if all this isn’t enough computer science and cloud infrastructure for you, I suggest attending our <a href="http://event.gigaom.com/structure/?utm_source=data&amp;utm_medium=editorial&amp;utm_campaign=intext&amp;utm_term=645189+this-is-why-big-data-is-the-sweet-spot-for-saas&amp;utm_content=dharrisstructure">Structure conference</a> in June, which features a who’s who list of speakers, including Google’s Jeff Dean, Facebook’s Jay Parikh and Netflix’s Adrian Cockcroft.)</p>
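<p>For readers who want to see the general idea, here is a minimal sketch of Monte Carlo sampling applied to a toy layout problem. It is an illustration under assumptions, not BloomReach’s system: the <code>relevance</code> table, the <code>layout_score</code> function and the product and page counts are all hypothetical. Instead of enumerating every way to place products on pages, it scores a fixed number of random placements and keeps the best one found.</p>

```python
# Monte Carlo sketch: sample random layouts instead of enumerating all of them.
# The relevance scores and the layout model are hypothetical.
import random


def layout_score(layout, relevance):
    """Sum the relevance of each product to the page it was placed on."""
    return sum(relevance[product][page] for product, page in enumerate(layout))


def monte_carlo_layout(relevance, num_pages, num_samples, seed=0):
    """Return the best of num_samples random product-to-page assignments."""
    rng = random.Random(seed)
    num_products = len(relevance)
    best_layout, best_score = None, float("-inf")
    for _ in range(num_samples):
        layout = [rng.randrange(num_pages) for _ in range(num_products)]
        score = layout_score(layout, relevance)
        if score > best_score:
            best_layout, best_score = layout, score
    return best_layout, best_score


# 20 products and 5 pages already allow 5**20 (about 95 trillion) layouts.
data_rng = random.Random(42)
relevance = [[data_rng.random() for _ in range(5)] for _ in range(20)]
layout, score = monte_carlo_layout(relevance, num_pages=5, num_samples=10_000)
upper_bound = sum(max(row) for row in relevance)
print(f"best sampled score: {score:.2f}, best possible: {upper_bound:.2f}")
```

<p>The toy score is deliberately simple, so the true optimum is known and the sampled answer can be compared against it. In a realistic version, where placements interact with one another, no such shortcut exists, and an approximate answer in a reasonable time frame is the point of sampling.</p>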
<h2 id="different-queries-different-pa">Different queries, different pages</h2>
<p>Things get even trickier when you’re trying to change the content of web pages in real time as people are searching for things. This isn’t the best method for organic search, where pages need to stay pretty consistent with the indexed versions, but it can be ideal in situations such as paid search and mobile. There are millions of ways to segment buyers, Garg explained, and how accurately you assess their intent and display your content can make all the difference. Whether someone is a new or repeat visitor often matters, as does whether someone is price-conscious (e.g., the query included “cheap”) or perhaps searching for a particular brand.</p>
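<p>As a rough illustration of the signals mentioned above, the sketch below tags a query with three of them: new versus repeat visitor, price-consciousness and brand. Everything in it is an assumption made for the example, including the <code>segment_query</code> function, the cue words and the brand names; a production system would presumably learn such segments from behavior data rather than hard-code them.</p>

```python
# Hypothetical rule-based segmenter; the cue words and brand list are made up.
PRICE_CUES = {"cheap", "discount", "sale", "clearance"}
KNOWN_BRANDS = {"acme", "globex"}


def segment_query(query, is_repeat_visitor):
    """Return a dict of simple intent signals for one search query."""
    words = set(query.lower().split())
    brands = sorted(words.intersection(KNOWN_BRANDS))
    return {
        "visitor": "repeat" if is_repeat_visitor else "new",
        "price_conscious": bool(words.intersection(PRICE_CUES)),
        "brand": brands[0] if brands else None,
    }


print(segment_query("cheap cotton maxi dress", is_repeat_visitor=False))
```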
<div id="attachment_645358" class="wp-caption aligncenter" style="width: 718px"><a href="http://gigaom2.files.wordpress.com/2013/05/llbean.png"><img alt="Source: BloomReach" src="http://gigaom2.files.wordpress.com/2013/05/llbean.png?w=708&#038;h=531" width="708" height="531" class="size-large wp-image-645358"></a><p class="wp-caption-text">Source: BloomReach</p></div>
<p>Around the holidays, the company actually realized something interesting: The bounce rate on queries for things like “gifts for dad” or “gifts for co-workers” was pretty high, but so was the conversion rate. The time to conversion was relatively fast, as well. It turns out, Garg explained, that people don’t like to overthink certain gifts too much, so if something is presented in a visually appealing manner and is within their price range, they’ll buy.</p>
<p>But creating these types of models involves more than meets the eye. For all the talk about machine learning, and machines do a majority of the work for BloomReach, people also play a critical role. A person might know better than a machine whether something was likely purchased as a gift, Garg explained, or they might spot the offensive content on the T-shirt the machine decided was ideal.</p>
<p>“Humans are really good at creativity, thinking through stuff,” he said.</p>
<p>Smart humans are also good at knowing when they’re overmatched, which is why SaaS is so valuable in the big data era. CMOs could try doing what BloomReach or <a href="http://gigaom.com/2012/04/24/datapop-scores-7m-for-custom-built-ads/">similar companies such as DataPop</a> are doing, or they could pay someone to do it much better. Guess which route the smart ones will take.</p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-54269p1.html">Shutterstock user Andrea Danti</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/05/14/this-is-why-big-data-is-the-sweet-spot-for-saas/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/05/shutterstock_119782672.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/05/shutterstock_119782672.jpg?w=150" medium="image">
			<media:title type="html">collective intelligence</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/br-stack.png?w=708" medium="image">
			<media:title type="html">BR stack</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/llbean.png?w=708" medium="image">
			<media:title type="html">Source: BloomReach</media:title>
		</media:content>
	</item>
		<item>
		<title>With Lucky Sort creators on board, Twitter is officially a data company</title>
		<link>http://gigaom.com/2013/05/13/with-lucky-sort-creators-on-board-twitter-is-officially-a-data-company/</link>
		<comments>http://gigaom.com/2013/05/13/with-lucky-sort-creators-on-board-twitter-is-officially-a-data-company/#comments</comments>
		<pubDate>Mon, 13 May 2013 23:09:57 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[big data]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[lucky-sort]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[real-time data]]></category>
		<category><![CDATA[social media]]></category>
		<category><![CDATA[social-data]]></category>
		<category><![CDATA[Twitter]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=644866</guid>
		<description><![CDATA[With its acquisition of Lucky Sort, Twitter seems to be acknowledging that it's a data company after all. The plan appears to be building a service that would be Twitter's equivalent of services such as Google Trends and Google Analytics.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gigaom.com&#038;blog=14960843&#038;post=644866&#038;subd=gigaom2&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>We all kind of knew that Twitter’s path to making money was paved with data, and the announcement on Monday that it’s buying analytics startup Lucky Sort makes it official. Unless I’m totally misreading the writing on the wall, this move is all about giving advertisers — and anyone, in theory — the tools to learn about what people are talking about.</p>
<p>Word that Lucky Sort is shutting down and that <a href="http://luckysort.com/">several of its team are joining Twitter’s revenue engineering department</a> suggests this is exactly what the acquisition aims to accomplish.</p>
<p>As it stands, companies use Twitter as a way to track how people are talking about them and maybe, if they’re really advanced, do some sentiment analysis. If they’re willing to pay a third party, Datasift and Gnip are more than happy to broaden marketers’ views to <a href="http://gigaom.com/2012/11/13/how-to-handle-a-firehose-an-interview-with-datasifts-ceo/">encompass the entirety of Twitter’s data, both real-time and historical</a>. What companies really can’t do, though, is run their own advanced analytics about topics straight from the Twitter platform.</p>
<div id="attachment_644884" class="wp-caption aligncenter" style="width: 718px"><a href="http://gigaom2.files.wordpress.com/2013/05/big-data.png"><img alt="big-data" src="http://gigaom2.files.wordpress.com/2013/05/big-data.png?w=708&#038;h=375" width="708" height="375" class="size-large wp-image-644884"></a><p class="wp-caption-text">One view of the Lucky Sort dashboard</p></div>
<p>The value proposition from such a product should be obvious at this point. Facebook, Google and Yahoo all collect a lot of data about how people are using their platforms and what topics are trending, and they all <a href="http://gigaom.com/2013/03/20/google-trends-youtube-data/">offer it up via a variety of products</a> targeting marketing types and the public at large. If Twitter wants to be taken seriously as a venue for advertising budgets and a platform for <a href="http://gigaom.com/2012/10/02/why-the-trick-to-twitter-as-a-data-source-is-more-data/">measuring the pulse of the nation</a>, people need to be able to ask questions of its data without relying on an intermediary or the occasional Twitter blog post.</p>
<p>As a journalist, I’d love to have access to this type of tool to track trending topics in real time and spot possible stories as they’re happening. The appeal to marketers should be obvious. As IBM’s Erick Brethenoux <a href="http://gigaom.com/2013/04/22/how-a-star-trek-convention-explains-the-secret-to-selling-more-stuff/">told me recently</a>, “[Marketers] talk a good game about social data. Very few actually leverage it effectively today.”</p>
<p>At Twitter, though, data is a slightly different beast than at other web companies. Twitter’s value lies largely in real-time data: topics can peak, crest and all but vanish within a 48-hour window. This situation has <a href="http://gigaom.com/2012/06/04/twitter-shows-when-we-tweet-and-explains-why-its-search-sucks/">hampered some of Twitter’s efforts</a> to surface optimal search results, and it has spurred the decision to buy companies such as Backtype (for its <a href="http://gigaom.com/2011/08/04/twitter-to-open-source-hadoop-like-tool/">stream-processing Storm technology</a>) and <a href="http://previously.ubalo.com">parallel-processing startup Ubalo</a>.</p>
<p>The latter move, <a href="https://ubalo.com/">which happened last week</a>, should help Twitter’s development team create new features without worrying about the intricacies of making them run — and run fast — across a cluster of machines. (You can learn a lot more about how companies such as Google, Facebook and Box are rethinking infrastructure to handle their unique data needs at our <a href="http://event.gigaom.com/structure/schedule/?utm_source=data&amp;utm_medium=editorial&amp;utm_campaign=intext&amp;utm_term=644866+with-lucky-sort-creators-on-board-twitter-is-officially-a-data-company&amp;utm_content=dharrisstructure">Structure conference</a> next month in San Francisco.)</p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/05/13/with-lucky-sort-creators-on-board-twitter-is-officially-a-data-company/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/05/big-data.png?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/05/big-data.png?w=150" medium="image">
			<media:title type="html">big-data</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/big-data.png?w=708" medium="image">
			<media:title type="html">big-data</media:title>
		</media:content>
	</item>
		<item>
		<title>How MailChimp learned to treat data like orange juice and rethink email in the process</title>
		<link>http://gigaom.com/2013/05/05/how-mailchimp-learned-to-treat-data-like-orange-juice-and-rethink-email-in-the-process/</link>
		<comments>http://gigaom.com/2013/05/05/how-mailchimp-learned-to-treat-data-like-orange-juice-and-rethink-email-in-the-process/#comments</comments>
		<pubDate>Sun, 05 May 2013 23:09:53 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[analytics]]></category>
		<category><![CDATA[artificial intelligence]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[email marketing]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[MailChimp]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[predictive models]]></category>
		<category><![CDATA[semantic analysis]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=642316</guid>
		<description><![CDATA[MailChimp wasn't always a big data company, but 12 years into its existence the company is using its mountains of email data to do everything from modeling spam to connecting subscribers.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gigaom.com&#038;blog=14960843&#038;post=642316&#038;subd=gigaom2&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>MailChimp Chief Data Scientist John Foreman likes to talk about orange juice. On the surface, it&#8217;s a strange way to start a discussion about data, but it all starts to make sense when you peel back the rind. It&#8217;s a way of thinking that&#8217;s letting MailChimp &#8212; which sends about 35 billion emails a year on behalf of roughly 3 million users &#8212; transform itself into a data-driven business 12 years into its existence.</p>
<p>When you&#8217;re in Atlanta, as I was during a recent trip, the obvious place to start talking about orange juice and data is with Coca-Cola. Foreman can tell you all about how the beverage giant &#8212; whose headquarters tower over the city just a mile away from MailChimp&#8217;s office &#8212; <a href="http://www.businessweek.com/articles/2013-01-31/coke-engineers-its-orange-juice-with-an-algorithm">uses advanced algorithms and giant vats of different juices</a> to ensure the proper flavor of its Simply Orange line of orange juice. However, it&#8217;s something else Coca-Cola is doing that inspired the way Foreman thinks about data and that&#8217;s helping MailChimp re-imagine what it means to engage with fans, readers and customers through their inboxes.</p>
<p>Anyone familiar with how large web companies <a href="http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/">came to pioneer the practice of what we now call &#8220;big data&#8221;</a> should appreciate the analogy. Coca-Cola, which also owns Minute Maid, produces a lot of excess pulp when it makes orange juice. For decades, presumably, it had just been throwing that pulp away, but in 2006 it decided to make use of it by launching a new product called Minute Maid Pulpy. Sold primarily in Asian countries, Pulpy <a href="http://www.ajc.com/news/business/coca-colas-minute-maid-pulpy-reaches-1-billion-in-/nQqFM/">has become a billion-dollar business</a> for Coca-Cola.</p>
<p>Once MailChimp is done with its primary business of sending emails, it has a lot of pulp of its own in the form of data. And rather than just ignoring it or writing up some cute blog posts (<a href="http://blog.mailchimp.com/author/jforeman/">which he also does</a>), Foreman and his bosses want to turn that data into revenue.</p>
<h2 id="first-things-first-making-bett">First things first: Making better orange juice</h2>
<div id="attachment_642357" class="wp-caption alignleft" style="width: 310px"><a href="http://gigaom2.files.wordpress.com/2013/05/20130424_121443.jpg"><img  alt="Neil Bainton" src="http://gigaom2.files.wordpress.com/2013/05/20130424_121443-e1367793432461.jpg?w=300&#038;h=200" width="300" height="200" class="size-medium wp-image-642357" /></a><p class="wp-caption-text">Neil Bainton</p></div>
<p>Actually, though, MailChimp first brought in Foreman in 2011 to help the company improve its core business of letting users build and send their emails. MailChimp&#8217;s culture was built around many things, COO Neil Bainton told me, but data wasn&#8217;t one of them. It had &#8220;various fits and starts&#8221; through the years trying to work data into its business model, and each step just added more complexity.</p>
<p>The challenges were technological as well as cultural, but Foreman had a plan, of which focus was a key aspect. Keeping a tight focus meant Foreman and his lone-developer sidekick could build what they needed to in a short timeframe. It also meant the company didn&#8217;t have to worry about some massive overnight transformation into a data-obsessed company like Google.</p>
<div id="attachment_642358" class="wp-caption alignright" style="width: 310px"><a href="http://gigaom2.files.wordpress.com/2013/05/20130424_121423.jpg"><img  alt="John Foreman" src="http://gigaom2.files.wordpress.com/2013/05/20130424_121423-e1367793376856.jpg?w=300&#038;h=200" width="300" height="200" class="size-medium wp-image-642358" /></a><p class="wp-caption-text">John Foreman</p></div>
<p>&#8220;[They] don&#8217;t need to be afraid the entire culture is gonna fall down if we bring in this weird math guy,&#8221; he joked.</p>
<p>Foreman&#8217;s first project &#8212; deploying artificial intelligence models that would <a href="http://blog.mailchimp.com/project-omnivore-three-years-of-gorging-on-data/">automatically detect spammy email lists from MailChimp&#8217;s users</a> &#8212; is actually critical to the way MailChimp operates, though. It was up and running in production within a year, after a technologically challenging effort of merging separate database instances for each customer into a single environment that would let MailChimp run complex analyses across its customer base.</p>
<p>It&#8217;s such an important project, Foreman explained, because internet service and email providers keep reputation scores on the IP addresses that send email through their systems. Because MailChimp serves as the email engine for its millions of users, sending too many messages that get flagged as spam and lower MailChimp&#8217;s reputation will have a negative impact on everyone. The company used to deal with spam manually, and only after recipients began complaining about the messages they received.</p>
<p>&#8220;It used to be before we had that AI model in place that everyone had a crappier experience,&#8221; Foreman said.</p>
<h2 id="say-goodbye-to-those-90s-fans-">Say goodbye to those &#8217;90s fans, Pearl Jam</h2>
<div id="attachment_642362" class="wp-caption alignleft" style="width: 310px"><a href="http://gigaom2.files.wordpress.com/2013/05/bcdf-1024x864.png"><img  alt="Source: MailChimp" src="http://gigaom2.files.wordpress.com/2013/05/bcdf-1024x864.png?w=300&#038;h=253" width="300" height="253" class="size-medium wp-image-642362" /></a><p class="wp-caption-text">Source: MailChimp</p></div>
<p>Now, however, MailChimp knows some of the telltale signs of spam for which it should be on the lookout. If too high a percentage of email addresses on a given list also appear on <a href="http://blog.mailchimp.com/aol-and-hotmail-users-spend-more-than-gmail-users-and-other-research-finds/">publicly available lists</a> or those you can buy on sketchy corners of the internet, it&#8217;s probably spam. Too many old and far-more-likely-to-be-dead Earthlink or Compuserve addresses, or letters within one keystroke of each other as if someone just mashed the keyboard? Probably spam.</p>
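<p>As a rough illustration of two of those warning signs, the sketch below computes the share of a list that sits on legacy domains and the share whose local part looks like keyboard mashing. It is not MailChimp&#8217;s model: the domain set, the simplified QWERTY grid (which ignores row stagger), the 0.8 cutoff and every function name are assumptions made for the example.</p>

```python
# Illustrative sketch only: the domains, keyboard grid, and thresholds below
# are assumptions for this example, not MailChimp's actual model.
LEGACY_DOMAINS = {"earthlink.net", "compuserve.com"}  # assumed examples
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]  # simplified, unstaggered grid

# Map each letter to its (row, column) position on the simplified grid.
KEY_POS = {
    ch: (r, c)
    for r, row in enumerate(QWERTY_ROWS)
    for c, ch in enumerate(row)
}


def adjacent_keys(a, b):
    """True if two letters are the same key or sit within one key of each other."""
    if a not in KEY_POS or b not in KEY_POS:
        return False
    (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
    return abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1


def mash_ratio(local_part):
    """Fraction of consecutive character pairs that are keyboard neighbours."""
    s = local_part.lower()
    pairs = list(zip(s, s[1:]))
    if not pairs:
        return 0.0
    return sum(adjacent_keys(a, b) for a, b in pairs) / len(pairs)


def list_signals(addresses, mash_cutoff=0.8, min_length=4):
    """Return the share of legacy-domain and mashed-looking addresses on a list."""
    n = len(addresses)
    if n == 0:
        return {"legacy_share": 0.0, "mash_share": 0.0}
    legacy = 0
    mashed = 0
    for addr in addresses:
        local, _, domain = addr.partition("@")
        if domain.lower() in LEGACY_DOMAINS:
            legacy += 1
        if len(local) >= min_length and mash_ratio(local) >= mash_cutoff:
            mashed += 1
    return {"legacy_share": legacy / n, "mash_share": mashed / n}


signals = list_signals([
    "asdfgh@example.com",
    "jane.doe@example.com",
    "bob@earthlink.net",
    "mary@compuserve.com",
])
print(signals)
```

<p>A production system would presumably feed signals like these into a trained model alongside many others rather than apply fixed cutoffs.</p>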
<p>Thankfully, though, about 98 percent of the spam that MailChimp identifies is what Foreman calls &#8220;ignorant&#8221; &#8212; that is, people or companies that just don&#8217;t know the laws or best practices around sending emails. But ignorance doesn&#8217;t mean MailChimp relaxes its rules. Recently, it even flagged Pearl Jam for spammy practices because the band was trying to reconnect with old fans whose email addresses read like a who&#8217;s who list of 1990s email providers.</p>
<p>Having such a high percentage of ignorant spam actually has a positive effect on the company&#8217;s overall goal of monetizing its vast data repositories. Because the AI model automates what used to be a manual process, and because most innocent spammers will fall in line quickly once they&#8217;re notified (as opposed to nefarious spammers who constantly try to outsmart the system), MailChimp can pretty much set the model loose, forget about it and get to work on new efforts, Foreman said.</p>
<h2 id="now-about-that-pulp">Now, about that pulp</h2>
<p>Spam under control, MailChimp can focus its efforts on actually building new products with data, just like Coca-Cola did with that extra pulp. One of its first orders of business is figuring out how to help customers get to know better the people to whom they&#8217;re sending their newsletters.</p>
<p>With this in mind, the company built a service called <a href="http://wavelength.mailchimpapp.com/">Wavelength</a> that shows customers other newsletters that are similar to theirs. But the system that powers Wavelength also stores pretty much every interaction that every email address in the company&#8217;s database has with the newsletters they&#8217;re sent. That means what emails they open and when they open them, what links they click and when they click them, and what other newsletters they&#8217;re subscribed to. MailChimp also has a feature called <a href="http://kb.mailchimp.com/article/what-is-ecommerce360-and-how-does-it-work-with-mailchimp">Ecommerce360</a> that lets customers track clicks right through to conversions (marketing speak for someone actually buying something).</p>
<p>The company has been <a href="http://blog.mailchimp.com/digging-deeper-into-wavelength-and-egp-data-finding-interest-clusters-in-mailchimps-network/">playing around with this data to identify clusters of users</a> based on their behaviors and their interests &#8212; some of which Foreman has detailed on the company&#8217;s blog &#8212; and now it wants to roll it out to customers via a product MailChimp is calling ChimpQuery. Built atop <a href="http://gigaom.com/2013/03/14/google-bigquery-is-now-even-bigger/">Google&#8217;s BigQuery analytics service</a>, ChimpQuery will let customers start doing this type of clustering and segmentation on their own, while saving MailChimp the trouble of hosting that infrastructure itself. (You can play with a monstrous, interactive graph of the entire MailChimp subscriber list <a href="http://zoom.it/HD3t#full">here</a>.)</p>
<p>If you sell knitting supplies and you find out there&#8217;s a big cluster of people on your mailing list who also are interested in wedding planning and custom jewelry, there might be an opportunity to create your content with these interests in mind or even to partner with companies in those spaces.</p>
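<p>A minimal sketch of that kind of overlap analysis might simply count which other interests show up among the subscribers of one topic. The subscriber data and the <code>co_interests</code> helper are hypothetical, and real clustering over a full subscriber graph would be considerably more involved.</p>

```python
from collections import Counter

# Hypothetical data: subscriber -> interest categories of the newsletters they receive.
SUBSCRIPTIONS = {
    "a@example.com": {"knitting", "wedding planning"},
    "b@example.com": {"knitting", "custom jewelry", "wedding planning"},
    "c@example.com": {"knitting"},
    "d@example.com": {"pizza coupons"},
}


def co_interests(subscriptions, topic):
    """Count the other topics followed by subscribers of `topic`, most common first."""
    counts = Counter()
    for topics in subscriptions.values():
        if topic in topics:
            # Tally every interest this subscriber has besides the topic itself.
            counts.update(topics - {topic})
    return counts.most_common()


print(co_interests(SUBSCRIPTIONS, "knitting"))
```

<p>On this toy data the knitting list overlaps most with wedding planning, then custom jewelry, which is the sort of signal the knitting example describes.</p>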
<div id="attachment_642360" class="wp-caption aligncenter" style="width: 718px"><a href="http://gigaom2.files.wordpress.com/2013/05/marriedknit-tiff.jpg"><img  alt="A sample cluster of subscribers." src="http://gigaom2.files.wordpress.com/2013/05/marriedknit-tiff.jpg?w=708&#038;h=427" width="708" height="427" class="size-large wp-image-642360" /></a><p class="wp-caption-text">A sample cluster of subscribers.</p></div>
<p>Another topic that has been on Foreman&#8217;s mind lately is what he calls &#8220;frequency elasticity of engagement.&#8221; <a href="http://blog.mailchimp.com/sending-frequency-more-is-not-always-better/">He&#8217;s done research</a> suggesting that blasting the heck out of your email list might actually have detrimental effects in the long term (regardless of <a href="http://gigaom.com/2012/12/08/how-obamas-data-scientists-built-a-volunteer-army-on-facebook/">how the Obama campaign successfully exploited this strategy</a>) but noted that engagement also has a lot to do with content and a particular company&#8217;s given user list. MailChimp&#8217;s data could help customers figure out the ideal schedule for emailing their subscribers.</p>
<p>For example, Birchbox has really high engagement because people love the service and have to open their emails to find out what goodies they&#8217;re receiving. Emails from a company like Papa John&#8217;s, on the other hand, might sit in someone&#8217;s inbox essentially as spam until they want to order a pizza and go searching for a coupon. Everyone has to figure out what pace and engagement metrics work for them.</p>
<h2 id="reining-expectations-back-in">Reining expectations back in</h2>
<p>However, now that management is fully sold on the power of data, Foreman sometimes finds himself managing expectations rather than just pitching his ideas. COO Bainton, for example, is adamant that MailChimp start aiding its publishing-industry customers by using techniques such as natural-language processing and semantic analysis to help them personalize emails based on readers&#8217; stated and unstated interests (that is, what boxes they check when they sign up and what stuff they actually click on).</p>
<p>Foreman, well, he&#8217;s pretty sure that&#8217;s too big a challenge for MailChimp to tackle considering how many publishing customers it has. MailChimp would have to understand all those customers&#8217; industries to some degree (<a href="http://www.opencalais.com/about">open source tools</a> tend to highlight technically but not situationally relevant relationships, he said, and don&#8217;t always understand things like sarcasm) and probably the different languages they publish in, as well. Rather than understand content, he&#8217;d rather focus personalization efforts around how users are connected.</p>
<p>The company also needs to balance its ambitions with what&#8217;s legally and socially acceptable. The creep factor might be more important than what&#8217;s legal when it comes to email marketing. MailChimp determines the legality of everything it does before rolling it out, Foreman explained, but in an era of &#8220;post-modern spam&#8221; where legitimacy is in the eye of the recipient and where some people use their &#8220;spam&#8221; button as a proxy for unsubscribing, companies must be careful not to offend.</p>
<p>&#8220;The more we can tell you about that list without getting creepy is really useful,&#8221; Bainton said. However, he added, &#8220;I think expectation is more important than law.&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/05/05/how-mailchimp-learned-to-treat-data-like-orange-juice-and-rethink-email-in-the-process/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/05/joyusgray-e1367794217987.png?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/05/joyusgray-e1367794217987.png?w=150" medium="image">
			<media:title type="html">JoyusGray</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/20130424_121443-e1367793432461.jpg?w=300" medium="image">
			<media:title type="html">Neil Bainton</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/20130424_121423-e1367793376856.jpg?w=300" medium="image">
			<media:title type="html">John Foreman</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/bcdf-1024x864.png?w=300" medium="image">
			<media:title type="html">Source: MailChimp</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/05/marriedknit-tiff.jpg?w=708" medium="image">
			<media:title type="html">A sample cluster of subscribers.</media:title>
		</media:content>
	</item>
		<item>
		<title>Facebook relies on natural-language processing to power Graph Search</title>
		<link>http://gigaom.com/2013/04/29/facebook-relies-on-natural-language-processing-to-power-graph-search/</link>
		<comments>http://gigaom.com/2013/04/29/facebook-relies-on-natural-language-processing-to-power-graph-search/#comments</comments>
		<pubDate>Mon, 29 Apr 2013 17:00:40 +0000</pubDate>
		<dc:creator>Jordan Novet</dc:creator>
				<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Graph Search]]></category>
		<category><![CDATA[natural language processing]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=640422</guid>
		<description><![CDATA[In a post on Facebook's engineering blog, engineers discuss the ways in which natural-language processing helps interpret what users plug in to suggest the best possible queries. <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gigaom.com&#038;blog=14960843&#038;post=640422&#038;subd=gigaom2&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Since Facebook <a href="http://gigaom.com/2013/01/15/facebook-debuts-personalized-version-of-search-with-graph-search/">debuted</a> its Graph Search function in January, the social network has given access to only a small percentage of users: millions of people, out of the 1.06 billion monthly active users Facebook had at the end of 2012. The feature aggregates people, places and things based on user input and quickly provides interesting and sometimes surprising content. That only happens thanks to nifty natural-language processing work that goes on behind the scenes. And it only works with English &#8212; for now. Engineers are trying to figure out how to make the product available in other languages.</p>
<p>In an <a href="https://www.facebook.com/notes/facebook-engineering/under-the-hood-the-natural-language-interface-of-graph-search/10151432733048920">article</a> posted to Facebook&#8217;s engineering blog on Monday, research scientist Maxime Boucher and Xiao Li, engineering manager on the natural-language team in Graph Search, provide detailed information on the ways in which Graph Search calls on natural-language processing to guess what users want.</p>
<ul>
<li>Graph Search breaks down search strings into multiple components that serve as commands with which the system can query the database. For instance, &#8220;my friends who live in San Francisco&#8221; would be run like this: pulling up the user, grabbing that person&#8217;s list of friends, calling on the filter for people who currently live in a place, and keeping only those friends who have San Francisco in that field. Graph Search considers that search query &#8220;intersect(friends(me), residents(12345)).&#8221; And if that&#8217;s exactly what the user had in mind, that query gets converted into language for the <a href="http://gigaom.com/2013/03/06/how-facebook-uses-numbers-to-show-people-places-and-things-with-graph-search/">Unicorn search engine</a> to chew on.</li>
<li>Search terms sometimes include words Graph Search has no use for. At other times, words for guiding queries are missing. And users might plug in terms in the wrong order. Say a user types in &#8220;friends San Francisco.&#8221; Graph Search might offer &#8220;my friends who live in San Francisco&#8221; as a good option. If it sees &#8220;San Francisco friends,&#8221; it could respond with &#8220;my friends who live in San Francisco,&#8221; which is more in accord with the correct sequence for a query.</li>
<li>Graph Search analyzes words users enter to look for possible entities that users are referring to in the database, across more than 20 entity categories, such as cities, employers and schools. Using statistics for the entity categories, the tool identifies sequences of words that could be more applicable for certain entities than others. If &#8220;san&#8221; precedes &#8220;francisco,&#8221; the user likely is referring to a city, not a person.</li>
<li>The system recognizes slang, nicknames for places, misspellings, the many ways of expressing particular types of data and other peculiarities that users type into the search box and swaps out each of those for terms that actually exist in the database. That means, for example, that subject-verb agreement isn&#8217;t necessary for the system to serve up query options that might lead to what users want to see. And words such as &#8220;besties&#8221; get interpreted as &#8220;friends.&#8221;</li>
</ul>
<p>Graph Search is visible on one of the most popular social networks in the world and therefore needs to be satisfying for its users. As Boucher and Li write, &#8220;The challenge for the team was to make sure that any reasonable user input produces plausible suggestions using Graph Search. To achieve that goal, the team leveraged a number of linguistic resources for conducting lexical analysis on an input query before matching it against terminal rules in the grammar.&#8221;</p>
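<p>The query shape described above can be mimicked with a toy example. Only the form <code>intersect(friends(me), residents(12345))</code> comes from the article. The in-memory data, the synonym table, the city lookup and the helper names are all hypothetical, and this is in no way Facebook&#8217;s grammar or the Unicorn engine.</p>

```python
# Toy sketch: only the query shape intersect(friends(me), residents(12345))
# comes from the article. All data and helper names here are hypothetical.
FRIENDS = {"me": {"ana", "ben", "cy"}}
RESIDENTS = {12345: {"ben", "cy", "dee"}}  # 12345 stands in for San Francisco
SYNONYMS = {"besties": "friends", "buddies": "friends"}  # assumed lexicon
CITY_IDS = {("san", "francisco"): 12345}  # assumed entity table


def friends(user):
    return FRIENDS.get(user, set())


def residents(place_id):
    return RESIDENTS.get(place_id, set())


def intersect(a, b):
    return a.intersection(b)


def normalize(text):
    """Lowercase the input and swap slang for terms the toy database knows."""
    return [SYNONYMS.get(tok, tok) for tok in text.lower().split()]


def find_city(tokens):
    """Return the id of the first two-word city name found, regardless of position."""
    for i in range(len(tokens) - 1):
        place_id = CITY_IDS.get((tokens[i], tokens[i + 1]))
        if place_id is not None:
            return place_id
    return None


def suggest(text):
    """Map loosely ordered input to a canonical query string, or None."""
    tokens = normalize(text)
    place_id = find_city(tokens)
    if "friends" in tokens and place_id is not None:
        return f"intersect(friends(me), residents({place_id}))"
    return None


def run(text):
    """Evaluate the suggested query against the toy data."""
    tokens = normalize(text)
    place_id = find_city(tokens)
    if "friends" not in tokens or place_id is None:
        return set()
    return intersect(friends("me"), residents(place_id))


print(suggest("San Francisco besties"))
print(sorted(run("my friends who live in San Francisco")))
```

<p>Both &#8220;friends San Francisco&#8221; and &#8220;San Francisco besties&#8221; resolve to the same canonical query here, which loosely mirrors the word-order and slang handling in the list above.</p>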
<p>Graph Search still has a long <a href="http://gigaom.com/2013/02/21/facebooks-long-graph-search-to-do-list/">to-do list</a> for engineers to address. One of the biggest challenges is to construct and deploy a language-agnostic Graph Search system, so Facebook users all over the world will be able to do what English speakers can do with the tool. It will be difficult to produce a tool that can adjust for unusual spellings, handle incorrect grammar and otherwise optimize search strings entered in any language. &#8220;In Russian, there&#8217;s so many inflections around words and a lot of language-specific things we haven&#8217;t encountered in English,&#8221; Li told me in an interview on Friday. Engineers are now looking at different ways to make the tool available for other languages, Li said. One option? A whole lot of drop-down menus.</p>
<p>While there is still work to do in letting more people try Graph Search, it&#8217;s clear that the simple interface for navigating hundreds of millions of objects required engineers to produce a bunch of systems and models. It&#8217;s no <a href="http://gigaom.com/2012/08/08/for-google-keeping-search-relevant-means-baking-big-data-into-everything/">Google</a>, <a href="http://gigaom.com/2012/11/03/5-trends-that-are-changing-how-we-do-big-data/">Siri</a> or <a href="http://gigaom.com/2012/04/24/datapop-scores-7m-for-custom-built-ads/">DataPop</a>, but, because it contains elements tailored to the data set at hand and common use cases, and because it&#8217;s <a href="http://gigaom.com/2013/03/14/facebook-tweaks-its-algorithms-to-improve-graph-search-comment-search-coming/">getting better over time</a>, Graph Search is worth keeping an eye on.</p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/04/29/facebook-relies-on-natural-language-processing-to-power-graph-search/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/04/graph-search-screen-shot.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/04/graph-search-screen-shot.jpg?w=150" medium="image">
			<media:title type="html">Graph Search screen shot</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/c00ab753df107b639e76ed4c3ab07ba7?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">gigajordan</media:title>
		</media:content>
	</item>
		<item>
		<title>Spain’s Siri-challenger Sherpa learns English, arrives in the U.S.</title>
		<link>http://gigaom.com/2013/04/17/spains-siri-challenger-sherpa-learns-english-arrives-in-the-u-s/</link>
		<comments>http://gigaom.com/2013/04/17/spains-siri-challenger-sherpa-learns-english-arrives-in-the-u-s/#comments</comments>
		<pubDate>Wed, 17 Apr 2013 14:41:45 +0000</pubDate>
		<dc:creator>Kevin Fitchard</dc:creator>
				<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[natural language understanding]]></category>
		<category><![CDATA[semantics]]></category>
		<category><![CDATA[speech command]]></category>
		<category><![CDATA[Virtual Assistant]]></category>
		<category><![CDATA[Xabi Uribe-Etxebarria]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=631809</guid>
		<description><![CDATA[Sherpa is trying to build a more flexible virtual assistant technology that can easily be adapted for new tasks. To that end it has developed its own conceptual meta-language which it uses to process all voice commands.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gigaom.com&#038;blog=14960843&#038;post=631809&#038;subd=gigaom2&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>A new voice digital assistant is on the scene in the U.S., but unlike other Siri-challengers Sherpa comes with some overseas work experience. Sherpa launched its Spanish-language Android app in October and has since risen up the Google Play charts in Spain and Latin America. Sherpa has now learned English, and on Wednesday it <a href="https://play.google.com/store/apps/details?id=com.sherpa.asistentesherpa&amp;feature=search_result#?t=W251bGwsMSwxLDEsImNvbS5zaGVycGEuYXNpc3RlbnRlc2hlcnBhIl0.">launched in the U.S. in the Play store</a>.</p>
<p>Most virtual assistants powered by natural language processing are taught to do specific tasks very well but tend to come up short when given unfamiliar assignments. For instance, Siri excels at jobs like making calendar appointments and dictating text messages but can be confounded by more general requests for information, usually resorting to simple web searches.</p>
<p><a href="http://gigaom.com/2013/04/17/spains-siri-challenger-sherpa-learns-english-arrives-in-the-u-s/screen-shot-2013-04-17-at-9-38-02-am/" rel="attachment wp-att-631820"><img  alt="Sherpa Screenshot 2" src="http://gigaom2.files.wordpress.com/2013/04/screen-shot-2013-04-17-at-9-38-02-am.png?w=708"   class="alignleft size-full wp-image-631820" /></a>Sherpa CEO Xabi Uribe-Etxebarria said he set out to create a natural language platform that had a much greater scope of understanding, which could easily be applied to new tasks without “training” the app to perform them. He also wanted to create a language-independent platform, one that understood meaning and intent independent of a language’s vocabulary or syntax.</p>
<p>To that end, Uribe-Etxebarria and his machine-learning team developed a sort of meta-language, encompassing 250,000 semantic concepts accompanied by 5,000 rules used to order those concepts. Sherpa uses off-the-shelf speech recognition services (right now it uses <a href="http://gigaom.com/2012/10/31/google-explains-how-more-data-means-better-speech-recognition/">Google’s speech API</a>) to translate commands into its meta-language, and then it parses meaning and intent from the resulting string of concepts.</p>
<p>The result is a flexible virtual assistant that can easily be applied to new tasks, Uribe-Etxebarria said. Sherpa’s repertoire is constantly growing as it hooks into new apps and information sources. For instance, Sherpa has struck a deal with PayPal, allowing the app to make payments via voice command. It taps into Twitter’s API, letting users navigate their Twitter feeds &#8212; toggling between mentions, direct messages and home stream views &#8212; through voice prompts. For general information requests, Sherpa has developed a nifty information card format, which aggregates information from a variety of sources ranging from LinkedIn profiles to Wikipedia entries.</p>
<p>“We’ve gone beyond Siri in many cases,” Uribe-Etxebarria said. And given the flexibility of its technology, he added, Sherpa can continue to add new services and functions at a much quicker pace than its competitors.</p>
<p>Still, Sherpa is entering an increasingly crowded space. New virtual assistants are popping up left and right, some very focused on specific tasks like <a href="http://gigaom.com/2013/04/11/personal-assistant-ios-app-donna-puts-your-phone-to-work-for-you/">Incredible Labs’ Donna</a>, while some like Nuance’s Dragon technologies are spanning devices, trying to <a href="http://gigaom.com/2013/01/07/nuance-to-create-a-universal-voice-assistant-bridging-phones-tvs-and-cars/">create a single virtual assistant for all things</a>. And of course, Google and Apple are building their speech technologies directly into their phone operating systems – it’s hard to argue with the convenience of that big fat Siri button.</p>
<p>Sherpa got off the ground in Bilbao, Spain, but it now has offices in Redwood City, Calif. It has <a href="http://www.businesswire.com/news/home/20130322005288/en/Sherpa-Secures-1.6-Million-Funding-Transform-Virtual">raised $1.6 million in angel funding</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/04/17/spains-siri-challenger-sherpa-learns-english-arrives-in-the-u-s/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/04/sherpa_02-450x250-e1366209061657.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/04/sherpa_02-450x250-e1366209061657.jpg?w=150" medium="image">
			<media:title type="html">Sherpa logo</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/0544c4b228f8fa80e31bb952501cd7a4?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">kfitchard</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/04/screen-shot-2013-04-17-at-9-38-02-am.png" medium="image">
			<media:title type="html">Sherpa Screenshot 2</media:title>
		</media:content>
	</item>
		<item>
		<title>Peter Thiel&#8217;s latest investments: better search and cellular nanotechnology</title>
		<link>http://gigaom.com/2013/04/17/peter-thiels-latest-investments-better-search-and-cellular-nanotechnology/</link>
		<comments>http://gigaom.com/2013/04/17/peter-thiels-latest-investments-better-search-and-cellular-nanotechnology/#comments</comments>
		<pubDate>Wed, 17 Apr 2013 12:00:47 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[algorithms]]></category>
		<category><![CDATA[breakout labs]]></category>
		<category><![CDATA[nanotechnology]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[Peter Thiel]]></category>
		<category><![CDATA[semantic search]]></category>
		<category><![CDATA[SkyPhrase]]></category>
		<category><![CDATA[Stealth Biosciences]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=631740</guid>
		<description><![CDATA[Thiel Foundation subsidiary Breakout Labs has funded two new startups called SkyPhrase and Stealth Biosciences that, respectively, are trying to reinvent natural language processing and improve our ability to interact with individual cells.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gigaom.com&#038;blog=14960843&#038;post=631740&#038;subd=gigaom2&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Breakout Labs, an offshoot of PayPal co-founder Peter Thiel&#8217;s eponymous Thiel Foundation, has funded its first two startups of the year: SkyPhrase and Stealth Biosciences. The former is trying to improve data analysis and interaction via better natural language processing, while the latter is trying to improve our health by literally sticking straws into our cells.</p>
<p><a href="https://skyphrase.com/">SkyPhrase</a> is a very early-phase company that, according to its web site, has &#8220;made breakthroughs in algorithms that enable computers to understand more complex language with greater precision than has ever been possible.&#8221; The goal is to improve search functionality but also to give developers a new, easy way to incorporate natural language processing into their apps. The company was founded by Rensselaer Polytechnic Institute Professor Nick Cassimatis.</p>
<p>In January, MIT Technology Review reporter Rachel Metz <a href="http://www.technologyreview.com/news/510056/startup-brings-better-understanding-of-tricky-questions-to-the-web/">covered the company and actually reviewed an early version</a> of the technology as applied to searching through tweets and emails. It wasn&#8217;t yet trained to do what she wanted with tweets but, she wrote, did a &#8220;decent&#8221; job searching through emails. Part of what makes it work appears to be its ability to understand conjunctions, even if it doesn&#8217;t yet have semantic capabilities: &#8220;I could search for, say, &#8216;e-mails from Bob Loblaw in December and January about recipes with a PDF,&#8217; or &#8216;e-mails from Bob Loblaw or Tobias Funke about cookies in December,&#8217;&#8221; Metz explained.</p>
<div id="attachment_631746" class="wp-caption alignright" style="width: 310px"><a href="http://gigaom2.files.wordpress.com/2013/04/nanostraws_sem.png"><img  alt="Nanostraws in a cell" src="http://gigaom2.files.wordpress.com/2013/04/nanostraws_sem.png?w=300&#038;h=200" width="300" height="200" class="size-medium wp-image-631746" /></a><p class="wp-caption-text">Nanostraws in a cell</p></div>
<p>Breakout Labs&#8217; other new investment, <a href="http://stealthbiosciences.com/">Stealth Biosciences</a>, is a team of Stanford professors, executives and entrepreneurs that has invented a way to get materials into and out of individual cells and to monitor their activity via electric probe. Called Nanostraws and Stealth Electrodes, respectively, the company&#8217;s two techniques do just what they sound like they do: Nanostraws let doctors inject material into or extract it from cells with the aim of advancing research and delivering personalized medicine, while the electrodes &#8220;automate long-term intracellular electrical recordings of neurons and heart cells.&#8221;</p>
<p>Stealth Biosciences, in particular, seems like a heady endeavor, but that&#8217;s exactly what Breakout Labs is all about. <a href="http://gigaom.com/2011/10/25/peter-thiel-breakout-labs/">Launched in 2011</a>, the organization aims to fund projects too early in their lives to attract traditional venture capital. Those funded aren&#8217;t giving up large equity stakes in their companies, but are expected to return a &#8220;modest portion&#8221; of their revenues to the program to fund the next generation of Breakout Labs investments. Other investments thus far include Modern Meadow &#8212; a company <a href="http://gigaom.com/2012/08/16/cue-the-protein-printer-peter-thiel-invests-in-artificial-meat/">trying to create artificial meat using 3-D printers</a> &#8212; and AVEtec, a Canadian startup <a href="http://gigaom.com/2012/12/16/peter-thiel-funds-tornado-power-seriously/">trying to harness the power of tornadoes for good</a>.</p>
<br />  <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gigaom.com&#038;blog=14960843&#038;post=631740&#038;subd=gigaom2&#038;ref=&#038;feed=1" width="1" height="1" /><p><a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/1008864/GigaOM_RSS_300x250&#038;sz=300x250&#038;c=636796"><img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/1008864/GigaOM_RSS_300x250&#038;sz=300x250&#038;c=636796" /></a></p><p><strong>Related research and analysis from GigaOM Pro:</strong><br />Subscriber content. <a href="http://pro.gigaom.com/?utm_source=tech&utm_medium=editorial&utm_campaign=auto3&utm_term=631740+peter-thiels-latest-investments-better-search-and-cellular-nanotechnology&utm_content=dharrisstructure">Sign up for a free trial</a>.</p><ul><li><a href="http://pro.gigaom.com/2012/08/how-emerging-technologies-are-influencing-collaboration/?utm_source=tech&utm_medium=editorial&utm_campaign=auto3&utm_term=631740+peter-thiels-latest-investments-better-search-and-cellular-nanotechnology&utm_content=dharrisstructure">How emerging technologies will influence collaboration</a></li><li><a href="http://pro.gigaom.com/2012/01/newnet-q4-platform-mania-and-social-commerce-shakeout/?utm_source=tech&utm_medium=editorial&utm_campaign=auto3&utm_term=631740+peter-thiels-latest-investments-better-search-and-cellular-nanotechnology&utm_content=dharrisstructure">NewNet Q4: Platform mania and social commerce shakeout</a></li><li><a href="http://pro.gigaom.com/2012/01/newnet-q4-platform-mania-and-social-commerce-shakeout/?utm_source=tech&utm_medium=editorial&utm_campaign=auto3&utm_term=631740+peter-thiels-latest-investments-better-search-and-cellular-nanotechnology&utm_content=dharrisstructure">NewNet Q4: Platform mania and social commerce shakeout</a></li></ul>]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/04/17/peter-thiels-latest-investments-better-search-and-cellular-nanotechnology/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/04/electrode-band-schematic.png?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/04/electrode-band-schematic.png?w=150" medium="image">
			<media:title type="html">electrode band schematic</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/04/nanostraws_sem.png?w=300" medium="image">
			<media:title type="html">Nanostraws in a cell</media:title>
		</media:content>
	</item>
		<item>
		<title>Idibon secures $1.4M as it builds a tool to mine the world&#8217;s languages</title>
		<link>http://gigaom.com/2013/04/16/idibon-secures-1-4m-as-it-builds-a-tool-to-mine-the-worlds-languages/</link>
		<comments>http://gigaom.com/2013/04/16/idibon-secures-1-4m-as-it-builds-a-tool-to-mine-the-worlds-languages/#comments</comments>
		<pubDate>Tue, 16 Apr 2013 16:00:16 +0000</pubDate>
		<dc:creator>Jordan Novet</dc:creator>
				<category><![CDATA[Idibon]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[natural language processing]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=631289</guid>
		<description><![CDATA[A San Francisco company has raised $1.4 million in seed funding to bring to market a tool for processing text in any language in the world.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gigaom.com&#038;blog=14960843&#038;post=631289&#038;subd=gigaom2&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.idibon.com/">Idibon</a>, an ambitious stealth-mode startup, has closed on $1.4 million in seed funding from Khosla Ventures to keep building out natural-language-processing software. The software helps enterprises get insight into sentiments expressed in text on the internet in any language you can think of &#8212; with a small role reserved for human beings.</p>
<p>The San Francisco company doesn&#8217;t want to reveal how everything works yet. But previous work from Idibon CEO Rob Munro provides hints of what&#8217;s possible. In his 2012 Stanford Ph.D. <a href="https://stacks.stanford.edu/file/druid:cg721hb0673/thesis-augmented.pdf">dissertation</a>, titled &#8220;Processing Short Message Communications in Low-Resource Languages,&#8221; Munro explained that it was possible to build natural-language-processing systems that could classify text messages and tweets in Chichewa, Haitian Kreyòl and Urdu despite many variations in word spelling, even when the systems had little time to train and no previous familiarity with the languages. In the case of the texts in Haitian Kreyòl that were sent following the January 2010 earthquake in Haiti, prioritizing messages helped quickly sift out the genuine emergencies. The question is whether a tool could be developed to pick up patterns in text in any language. Such a system, if combined with a powerful translation tool, could be deployed for a wide variety of applications, from sentiment analysis to intelligence gathering.</p>
<p>Rather than leave machines to bear the burden of figuring out what people mean when they communicate in obscure languages, Idibon wants humans to play a role, such as verifying that data is correct. That sort of work could be crowdsourced. &#8220;Machines are never going to be 100 percent accurate,&#8221; Munro said. The idea of bringing together humans and algorithms to solve problems has come up in other applications, and several <a href="http://gigaom.com/2013/03/22/structuredata-2013-recap/">came up in on-stage conversation</a> at GigaOM&#8217;s Structure:Data 2013 conference in New York last month.</p>
<p>How could enterprises use Idibon? Half a dozen customers are using the beta version of the software in different ways. One is relying on Idibon to run a medical question-and-answer system that can spit out an answer or possible answers. And &#8220;a sales organization&#8221; is using Idibon to rifle through news articles, blogs and other documents to document relationships among people and organizations and point to past acquisitions, Munro said. It&#8217;s also possible for Idibon to process information from multiple languages to serve up data for business-intelligence applications.</p>
<p>For now, Idibon is &#8220;just a simple API service,&#8221; Munro said. Some direct integration of the Idibon data is happening, too. The software takes in unstructured data &#8212; from tweets, instant messages, emails and so on &#8212; processes it and responds with structured data, he said. Ultimately, though, &#8220;we want to become the leading organization for scalable cloud-based natural-language processing,&#8221; Munro said.</p>
<p>English comprises a small fraction of all communication &#8212; roughly <a href="http://www.britishcouncil.org/learning-faq-the-english-language.htm">375 million people</a> call English their first language, out of more than 7 billion people in the world &#8212; and that&#8217;s why a tool with more universal linguistic powers sounds so appealing. While not many enterprises might be looking to capture data in little-known languages now, it could become essential in the coming years. If Idibon can come out with a product soon, it could be the beneficiary of a sort of international arms race for truly global understanding.</p>
<br />  <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gigaom.com&#038;blog=14960843&#038;post=631289&#038;subd=gigaom2&#038;ref=&#038;feed=1" width="1" height="1" /><p><a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/1008864/GigaOM_RSS_300x250&#038;sz=300x250&#038;c=620131"><img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/1008864/GigaOM_RSS_300x250&#038;sz=300x250&#038;c=620131" /></a></p><p><strong>Related research and analysis from GigaOM Pro:</strong><br />Subscriber content. <a href="http://pro.gigaom.com/?utm_source=data&utm_medium=editorial&utm_campaign=auto3&utm_term=631289+idibon-secures-1-4m-as-it-builds-a-tool-to-mine-the-worlds-languages&utm_content=gigajordan">Sign up for a free trial</a>.</p><ul><li><a href="http://pro.gigaom.com/2012/01/why-the-next-front-in-big-data-might-be-psychological/?utm_source=data&utm_medium=editorial&utm_campaign=auto3&utm_term=631289+idibon-secures-1-4m-as-it-builds-a-tool-to-mine-the-worlds-languages&utm_content=gigajordan">Why the next front in big data might be psychological</a></li><li><a href="http://pro.gigaom.com/2011/12/whats-driving-the-next-phase-of-the-e-commerce-evolution/?utm_source=data&utm_medium=editorial&utm_campaign=auto3&utm_term=631289+idibon-secures-1-4m-as-it-builds-a-tool-to-mine-the-worlds-languages&utm_content=gigajordan">What&#8217;s driving the next phase of the e-commerce evolution</a></li><li><a href="http://pro.gigaom.com/2011/10/siri-say-hello-to-the-coming-invisible-interface/?utm_source=data&utm_medium=editorial&utm_campaign=auto3&utm_term=631289+idibon-secures-1-4m-as-it-builds-a-tool-to-mine-the-worlds-languages&utm_content=gigajordan">Siri: Say hello to the coming &#8220;invisible interface&#8221;</a></li></ul>]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/04/16/idibon-secures-1-4m-as-it-builds-a-tool-to-mine-the-worlds-languages/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/04/idibon.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/04/idibon.jpg?w=150" medium="image">
			<media:title type="html">Idibon</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/c00ab753df107b639e76ed4c3ab07ba7?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">gigajordan</media:title>
		</media:content>
	</item>
		<item>
		<title>Stanford researchers show how doctors&#8217; notes can spot problem drugs</title>
		<link>http://gigaom.com/2013/04/10/stanford-team-shows-how-doctors-notes-can-spot-problem-drugs/</link>
		<comments>http://gigaom.com/2013/04/10/stanford-team-shows-how-doctors-notes-can-spot-problem-drugs/#comments</comments>
		<pubDate>Thu, 11 Apr 2013 01:28:23 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[apixio]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[health care]]></category>
		<category><![CDATA[medical research]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[semantic analysis]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=629691</guid>
		<description><![CDATA[A team of Stanford researchers has developed a method for mining the text of doctors' notes to identify adverse reactions from prescription drugs. The technique could spot problems years before the current FDA-reporting process can.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gigaom.com&#038;blog=14960843&#038;post=629691&#038;subd=gigaom2&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>When it comes to identifying potentially adverse reactions to prescription drugs, you might think doctors would be on the front lines. After all, they see a lot of patients for a lot of conditions and prescribe a lot of drugs, so who better to notice when certain prescriptions keep leading to the same side effects? And you&#8217;d be right &#8212; and wrong.</p>
<p>As individuals, doctors probably don&#8217;t see enough of any given adverse reaction to notice patterns emerging. But as a collection, their notes on patients&#8217; medical records can provide valuable insights, as <a href="http://www.nature.com/clpt/journal/vaop/ncurrent/full/clpt201347a.html">a group of Stanford researchers recently discovered</a>. Using &#8220;18 years of patient data from 1.8 million patients [consisting of] 19 million encounters, 35 million coded ICD-9 diagnoses, and &gt;11 million unstructured clinical notes,&#8221; the team was able to accurately identify interactions by analyzing the free-form text that doctors had entered about patients&#8217; symptoms, conditions and prescription regimens.</p>
<p>A key aspect of being able to predict adverse interactions is understanding the relationships among the different sets of terminologies used in different medical fields. It&#8217;s a lot easier to spot patterns across hospitals or even an individual patient&#8217;s records when you know that a radiologist writing <em>X</em> is the same as, or related to, an oncologist writing <em>Y</em>. We <a href="http://gigaom.com/2011/04/01/apixio-is-bringing-big-data-to-medical-records-in-the-cloud/">covered an earlier collaboration</a> between the study&#8217;s leader, Nigam Shah, and medical-data startup Apixio around this very topic in 2011.</p>
<div id="attachment_630004" class="wp-caption aligncenter" style="width: 718px"><a href="http://gigaom2.files.wordpress.com/2013/04/patient-feature.jpg"><img  alt="How Shah's team developed its patient-feature matrix" src="http://gigaom2.files.wordpress.com/2013/04/patient-feature.jpg?w=708&#038;h=312" width="708" height="312" class="size-large wp-image-630004" /></a><p class="wp-caption-text">How Shah&#8217;s team developed its patient-feature matrix</p></div>
<p>Shah and his team hope their work can complement the current process for tracking drug reactions, the FDA’s Adverse Event Reporting System. Whereas that system requires doctors and patients to manually alert the FDA of potential adverse side effects, their method could highlight potential problems that no one noticed or took the time to report. I&#8217;d consider this similar to some early research by social medical sites such as <a href="http://www.patientslikeme.com/">PatientsLikeMe</a>, whose users are producing lots of data about their conditions, drugs, dosages and side effects that could produce correlations ripe for controlled experiments.</p>
<p>A press release announcing the study&#8217;s publication highlights some of its future promise and current limitations:</p>
<blockquote id="quote-the-research-team-is"><p>&#8220;[T]he research team is working on refinements that will cull even more useful information from clinical notes, such as reports of reactions caused by drug combinations, the use of medications typically prescribed for one condition but found effective for treatment of a different health problem, or finding medical profiles of patients that fit a certain scenario. &#8230;</p>
<p>One downside is that most electronic health record systems are set up for patient care, not patient research, Goodman noted. In this study, the researchers mined a data system created for this kind of research, which isn’t widely available. The researchers used the Stanford Translational Research Integrated Database Environment, known as STRIDE.&#8221;</p></blockquote>
<p>This is just one of many ways in which researchers are <a href="http://gigaom.com/2012/07/15/better-medicine-brought-to-you-by-big-data/">experimenting with big data concepts</a> to help medical professionals make sense of more data than they could possibly analyze on their own. Other examples we&#8217;ve covered recently include <a href="http://gigaom.com/2013/02/11/researchers-say-ai-prescribes-better-treatment-than-doctors/">an artificial intelligence model</a> for prescribing safe, cost-effective treatments, the <a href="http://gigaom.com/2013/03/26/how-researchers-are-fighting-lung-cancer-using-pagerank/">application of Google PageRank-like algorithms</a> to map the spread of cancer cells throughout the body, and <a href="http://gigaom.com/2013/01/22/biotech-startup-syapse-wants-to-be-salesforce-com-for-our-genomes/">the use of graph data structures</a> to organize highly complex sequencing data.</p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-220975p1.html">Shutterstock user Maksym Dykha</a>.</em></p>
<br />  <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gigaom.com&#038;blog=14960843&#038;post=629691&#038;subd=gigaom2&#038;ref=&#038;feed=1" width="1" height="1" /><p><a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/1008864/GigaOM_RSS_300x250&#038;sz=300x250&#038;c=963512"><img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/1008864/GigaOM_RSS_300x250&#038;sz=300x250&#038;c=963512" /></a></p><p><strong>Related research and analysis from GigaOM Pro:</strong><br />Subscriber content. <a href="http://pro.gigaom.com/?utm_source=data&utm_medium=editorial&utm_campaign=auto3&utm_term=629691+stanford-team-shows-how-doctors-notes-can-spot-problem-drugs&utm_content=dharrisstructure">Sign up for a free trial</a>.</p><ul><li><a href="http://pro.gigaom.com/2012/03/a-near-term-outlook-for-big-data/?utm_source=data&utm_medium=editorial&utm_campaign=auto3&utm_term=629691+stanford-team-shows-how-doctors-notes-can-spot-problem-drugs&utm_content=dharrisstructure">A near-term outlook for big data</a></li><li><a href="http://pro.gigaom.com/2012/01/why-the-next-front-in-big-data-might-be-psychological/?utm_source=data&utm_medium=editorial&utm_campaign=auto3&utm_term=629691+stanford-team-shows-how-doctors-notes-can-spot-problem-drugs&utm_content=dharrisstructure">Why the next front in big data might be psychological</a></li><li><a href="http://pro.gigaom.com/2011/11/connected-world-the-consumer-technology-revolution/?utm_source=data&utm_medium=editorial&utm_campaign=auto3&utm_term=629691+stanford-team-shows-how-doctors-notes-can-spot-problem-drugs&utm_content=dharrisstructure">Connected world: the consumer technology revolution</a></li></ul>]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/04/10/stanford-team-shows-how-doctors-notes-can-spot-problem-drugs/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/04/shutterstock_125607485.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/04/shutterstock_125607485.jpg?w=150" medium="image">
			<media:title type="html">medical record</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2013/04/patient-feature.jpg?w=708" medium="image">
			<media:title type="html">How Shah&#039;s team developed its patient-feature matrix</media:title>
		</media:content>
	</item>
		<item>
		<title>DataRPM scores $250K, introduces Google-like big data searching</title>
		<link>http://gigaom.com/2013/04/02/datarpm-scores-250k-introduces-google-like-big-data-searching/</link>
		<comments>http://gigaom.com/2013/04/02/datarpm-scores-250k-introduces-google-like-big-data-searching/#comments</comments>
		<pubDate>Tue, 02 Apr 2013 18:40:24 +0000</pubDate>
		<dc:creator>Jordan Novet</dc:creator>
				<category><![CDATA[Business Intelligence]]></category>
		<category><![CDATA[DataRPM]]></category>
		<category><![CDATA[natural language processing]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=626609</guid>
		<description><![CDATA[Natural-language processing powers the Instant Answers feature from business intelligence startup DataRPM, which could help more people easily get insights from their big sets of data.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gigaom.com&#038;blog=14960843&#038;post=626609&#038;subd=gigaom2&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>More companies are realizing that analyzing their big data can lead to insights that increase revenue and produce other business breakthroughs. But getting good answers isn&#8217;t always easy, often requiring IT administrators to take charge and leaving only a handful of business executives equipped to use the software. One startup has a nice and simple idea for big data analytics: Google-like search.</p>
<p>Fairfax, Va.-based <a href="http://datarpm.com/">DataRPM</a> on Tuesday announced that it has raised $250,000 from angel investors and rolled out a new feature for its business-intelligence Software as a Service (SaaS) called Instant Answers. The feature uses natural-language processing to figure out what users want to see, based on typed or spoken queries, and displays the visualization that the software thinks is the best fit. Users can filter and comment on the results.</p>
<p>While methods and purposes vary, the idea of making software or a site respond to limited user input isn&#8217;t new. The approach reminds me of Facebook&#8217;s <a href="http://gigaom.com/2013/03/14/facebook-tweaks-its-algorithms-to-improve-graph-search-comment-search-coming/">Graph Search</a>, which rapidly delivers several options for search results based on likes, friends and other user information. Software from BeyondCore also comes to mind, as it quickly displays graphs and reads its findings aloud to show the biggest drivers of, say, revenue. BeyondCore CEO Arijit Sengupta took a few minutes of stage time at GigaOM&#8217;s Structure:Data conference in New York last month to <a href="http://gigaom.com/2013/03/20/six-ideas-from-entrepreneurs-for-solving-your-big-data-problems/">show off the software</a>.</p>
<p>More natural-language processing and machine-learning technology could make DataRPM&#8217;s Instant Answers tool a better choice in a crowded market. Perhaps the SaaS could keep tabs on which data users call up and how users might modify their searches if they don&#8217;t get the data or visualizations they want the first time around. Later, it could predict what users want. The original Instant Answers is nevertheless a good start.</p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-923519p1.html">Shutterstock user anaken2012</a>.</em></p>
<br />  <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gigaom.com&#038;blog=14960843&#038;post=626609&#038;subd=gigaom2&#038;ref=&#038;feed=1" width="1" height="1" /><p><a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/1008864/GigaOM_RSS_300x250&#038;sz=300x250&#038;c=584291"><img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/1008864/GigaOM_RSS_300x250&#038;sz=300x250&#038;c=584291" /></a></p><p><strong>Related research and analysis from GigaOM Pro:</strong><br />Subscriber content. <a href="http://pro.gigaom.com/?utm_source=data&utm_medium=editorial&utm_campaign=auto3&utm_term=626609+datarpm-scores-250k-introduces-google-like-big-data-searching&utm_content=gigajordan">Sign up for a free trial</a>.</p><ul><li><a href="http://pro.gigaom.com/2013/01/how-hr-can-make-the-case-for-workforce-analytics/?utm_source=data&utm_medium=editorial&utm_campaign=auto3&utm_term=626609+datarpm-scores-250k-introduces-google-like-big-data-searching&utm_content=gigajordan">How HR can make the case for workforce analytics</a></li><li><a href="http://pro.gigaom.com/2013/01/cloud-and-data-fourth-quarter-2012-analysis/?utm_source=data&utm_medium=editorial&utm_campaign=auto3&utm_term=626609+datarpm-scores-250k-introduces-google-like-big-data-searching&utm_content=gigajordan">The fourth quarter of 2012 in cloud</a></li><li><a href="http://pro.gigaom.com/2012/12/big-data-2013-key-trends-and-companies-to-watch/?utm_source=data&utm_medium=editorial&utm_campaign=auto3&utm_term=626609+datarpm-scores-250k-introduces-google-like-big-data-searching&utm_content=gigajordan">Big data 2013: key trends and companies to watch</a></li></ul>]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2013/04/02/datarpm-scores-250k-introduces-google-like-big-data-searching/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2013/04/shutterstock_95595436.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2013/04/shutterstock_95595436.jpg?w=150" medium="image">
			<media:title type="html">shutterstock_95595436</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/c00ab753df107b639e76ed4c3ab07ba7?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">gigajordan</media:title>
		</media:content>
	</item>
	</channel>
</rss>
