Mining the Tar Sands of Big Data


The tar sands of Alberta, Canada contain the largest reserves of oil on the planet. However, they remain largely untouched, and for one reason: economics. It costs as much as $40 to extract a barrel of oil from tar sand, and until recently, petroleum companies could not profitably mine these reserves.

In a similar vein, much of the world’s most valuable information is trapped in digital sand, siloed in servers scattered around the globe. These vast expanses of data — streaming from our smart phones, DVRs, and GPS-enabled cars — require mining and distillation before they can be useful.

Oil from sand and information from data share another parallel: in recent years, technology has catalyzed dramatic drops in the cost of extracting each.

Unlike oil reserves, data is an abundant resource on our wired planet. Though much of it is noise, at scale and with the right mining algorithms, this data can yield information that can predict traffic jams, entertainment trends, even flu outbreaks.

These are hints of the promise of big data, which will mature in the coming decade, driven by advances in three principal areas: sensor networks, cloud computing, and machine learning.

The first, sensor networks, historically included devices ranging from NASA satellites and traffic monitors to grocery scanners and Nielsen rating boxes. Expensive to deploy and maintain, these were the exclusive province of governments and industry. But another, wider sensor network has emerged in the last decade: smart phones and web-connected consumer devices. These sensors — and the tweets, check-ins, and digital pings they generate — form the tendrils of a global digital nervous system, pulsing with petabytes.

Just as these devices have multiplied, so have the data centers that they communicate with. Housed in climate-controlled warehouses, they consume an estimated 2 percent — and represent the fastest growing segment — of the United States energy budget. These data centers are at the heart of cloud computing, the second driver of big data.

Cloud computing reframes compute power as a utility, like electricity or water. It offers large-scale computing to even the smallest start-ups: With a few keystrokes, one can lease 100 virtual machines from Amazon’s Elastic Compute Cloud for less than $10 per hour.

Yet this computing brawn is only valuable when combined with intelligence. Enter machine learning, the third principal component driving value in the industrial age of data.

Machine learning is a discipline that blends statistics with computer science to classify and predict patterns in data. Its algorithms lie at the heart of spam filters, self-driving cars, and movie recommendation systems, including the one to which Netflix awarded its million-dollar prize in 2009. While data storage and distributed computing technologies are being commoditized, machine learning is increasingly a source of competitive advantage among data-driven firms.

Together, these three technology advances lead us to make several predictions for the coming decade:

1. A spike in demand for “data scientists.” Fueled by the oversupply of data, more firms will need individuals who are facile with manipulating and extracting meaning from large data sets. Until universities adapt their curricula to match these market realities, the battle for these scarce human resources will be intense.

2. A reassertion of control by data producers. Firms such as retailers, banks, and online publishers are recognizing that they have been giving away their most precious asset, customer data, to transaction processors and other third parties. We expect firms to spend more effort protecting, structuring, and monetizing their data assets.

3. The end of privacy as we know it. With devices tracking our every point and click, acceptable practice for personal data will shift from preventing disclosures towards policing uses. It’s not what our databases know that matters — for soon they will know everything — it’s how this data is used in advertising, consumer finance, and health care.

4. The rise of data start-ups. A class of companies is emerging whose supply chains consist of nothing but data. Their inputs are collected through partnerships or from publicly available sources, processed, and transformed into traffic predictions, news aggregations, or real estate valuations. Data start-ups are the wildcatters of the information age, searching for opportunities across a vast and virgin data landscape.

The consequence of sensor networks, cloud computing, and machine learning is that the data landscape is broadening: data is abundant, cheap, and more valuable than ever. It’s a rich, renewable resource that will shape how we live in the decades ahead, long after the last barrel has been squeezed from the tar sands of Athabasca. For more insights from the big data landscape, come to GigaOM’s Structure: Big Data conference on March 23 in New York City.

Michael Driscoll is the co-founder and CTO of Metamarkets. Roger Ehrenberg is the founder and managing partner of IA Ventures. Metamarkets is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, Giga Omni Media. Om Malik, founder of Giga Omni Media, is also a venture partner at True.

Image courtesy of Flickr user sbamueller


Brad Connell

Another industry that will be vital in the coming age of data is graphic design / data visualization. It is going to become very important to be able to make all that data not only understandable and digestible, but also beautiful and well designed. Things like infographics, charts and graphs can make data a pain or a pleasure to look at and comprehend, depending on how well they are designed. Have a look at the work of Nicholas Felton, who meticulously documents seemingly mundane statistics of his everyday life and then designs an Annual Report from his last year’s worth of data. Imagine having an iPhone app that tracks and presents your personal data in a similar fashion. We’re already starting to see this with things like Nike+ and Klout.


The missing semtech component is not so surprising in that semantic and linguistic data mining, and applicable commercial value generation (ROI), is still bound up by our lack of alignment/agreement within a complex of cognitive, meta-cognitive neuropsych and neuro-linguistic models. Until we can agree on the direct role of the human brain and mind in data modeling, aggregation, generation and derived meaning, and on how that role relates to making money, the semantic space will languish in relative degree behind sensors, ML and collective-intelligence movements.

I think the predictive analytics and attempts to represent/visualize 1) our (a) unique (human) “Self” in real-time; and 2) fine vs. current coarse-grain human-connectivity (applying such visualizations) in the form of efficacious and value-rich (read: Self-Evolving) prescriptions of “Hu-synergies and Hu-synchronicities” (applied to socnet data clouds) will be one of the best proving grounds for semtech as it adds direct value to AI-ML and sensor efforts.

Big data was created for a reason. It exists, not via the blind and sheer momentum of technology use behavior; but for a reason we have not yet grasped. Human and human-systems “Autopoiesis” will become a valuable data reality for the consuming marketplace within the next 25 years (imho) and begin to be seen as such in this generation. I write to this a bit here:


P.S. I am an ex-neuropsych grad stu out of Karl H. Pribram’s lab at Stanford – Mr. “Holographic Hypothesis of Memory and Perception” (… there at the exciting time of – David Bohm + Karl Pribram… and all that)

Michael Driscoll

@Steve – I confess I’m one of the skeptics when it comes to semantic technologies, and the ontologies and mark-ups they require (I’ve written about this here: ).

@Cole – Speaking of semantics… ‘tar sands’ is the popular term favored by the likes of The Economist and even the Wikipedia entry you cite. But point taken.

@David – How important ML is as a competitive advantage depends on how proprietary the data is. For the widely available data in the financial markets, ML is the differentiating factor. For other spaces, like traffic prediction from mobile phones, if you can get access to proprietary cell tower data, for example, then, to use a phrase, more data beats better algorithms.

Steve Ardire

Hi Michael – it’s perfectly fine and healthy to be skeptical about semtech because it needs to be ‘interleaved’ into existing information management solutions ( web, enterprise, real time streams ) and sold much better than what most vendors do these days ;)

My main point was that the continuing advances already address and overlap what you discuss here in the article, i.e. advances in sensor networks, cloud computing, and machine learning.

Finally I’ve seen the 2011 Semtech Conf program and it’s quite good !

Cole Cooper

An interesting commentary on the parallels in bitumen refining and data refining. However, I must take exception to your characterization of the Bitumen as “Tar Sands”. They are not. It is more accurate to call them “Oil Sands”.
Tar is a product of pitch which is derived from pine trees. Calling the Oil Sands Tar sands is a mis-identification.

David Famolari

Great piece and totally agree with the rise of data startups.

I see ML algorithms as less of a source of competitive advantage than the exploration of new and/or alternative data sources.

In my view, what ultimately determines the wildcatter’s success is not building better drills but staking smarter claims.

Steve Ardire

OK article and sorry to be critical but your Tar Sands of Big Data analogy is EXACTLY what semantic web, semantic enterprise, semantic sensor networks: ( like W3C SSN-XG ontology and how to semantically enable real time sensor feeds ) and next gen cloud computing is addressing.

This trumps what you say in this article i.e. “These are hints of the promise of big data, which will mature in the coming decade, driven by advances in three principal areas: sensor networks, cloud computing, and machine learning”.

GigaOM’s Structure: Big Data conference on March 23 in New York City is good stuff, but if you want to know more per the above then come to the 2011 Semtech Conf (the program will be posted soon)


PS – I’m also an ex geologist ;)
