
The Data Mining Renaissance

We are in the midst of a data mining renaissance.

Traditionally, data warehousing implementations were large, complex and expensive, meaning only the top-ranking companies could afford them. Teradata pioneered the initial market for corporate data warehousing solutions and still maintains a segment lead, something HP’s CEO Mark Hurd knows all too well. More recent entrants into the data warehousing and intelligence market, like Netezza, have emerged with cost-effective, appliance-based approaches. Others in this arena include Greenplum, recent Microsoft acquisition DATAllegro and, of course, IBM, Oracle and SAP.

But the web changed the way we radiate and consume information and, in doing so, created a new opportunity to measure and monetize it. Faced with more user data, logging information, and web content than anyone thought one system could handle, the major web companies developed highly scaled data warehousing solutions themselves. Armed with these tools, they improved customer resonance by building better recommendation engines, more targeted advertising networks and more intricate campaigns.

The preferred architectural model for this web-derived data warehouse — a combination of low-cost server hardware, distributed systems and open-source software — set off an innovation path that outpaced the commercial market. What once was a very expensive proposition — to amass the computer power, storage capacity and network bandwidth to run a high-end data warehouse and analytics engine — is now readily available on-site or in the cloud.

The current software favorite for large-scale data processing and analytics is Hadoop, an open-source implementation of MapReduce. MapReduce is a technique popularized by Google that distributes complex problems across many nodes and, as such, is useful for processing information from large data sets. Hadoop helps major web companies make sense of the mouse clicks and content they track across hundreds of millions of users. Amazon began offering it as a pre-configured option on EC2 in early April. Firms like Cloudera help companies apply Hadoop to their data-processing initiatives; Cloudera recently secured a $5 million financing round to beef up its distribution of Hadoop. Meanwhile, Yahoo has been singing the Hadoop tune for the last couple of years, following the hire of Hadoop’s creator, Doug Cutting, in 2006.
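To make the model concrete, here is a minimal, single-process Python sketch of the MapReduce pattern (a toy imitation of the distributed flow, not Hadoop's actual API): map tasks emit key/value pairs, a shuffle phase groups the pairs by key, and reduce tasks combine each group.

```python
from collections import defaultdict
from itertools import chain

def map_word_count(line):
    # Map phase: each "node" turns its chunk of input into (key, value) pairs.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle phase: group intermediate pairs by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_word_count(key, values):
    # Reduce phase: combine the values collected for each key.
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(map_word_count(l) for l in lines)
counts = dict(reduce_word_count(k, v) for k, v in shuffle(mapped).items())

assert counts == {"the": 3, "quick": 1, "brown": 1, "fox": 2, "lazy": 1, "dog": 1}
```

In a real Hadoop deployment, the map and reduce functions run on different machines and the shuffle moves data across the network; the programming model, however, is exactly this small.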

A Hadoop-spawned ecosystem includes a range of complementary projects. Hive, for example, is an open-source data warehouse infrastructure built on top of Hadoop. And other uses of MapReduce are gaining traction. Recently, Aster Data Systems raised $12 million in new funding for its high-performance analytical database. Aster’s software features in-database MapReduce, and is now available as a cloud-based solution.

Not too long ago, Ian Ayres wrote a book called “Super Crunchers,” in which he details the resurgence of data mining and its impact on society at large. As he notes, this trend has been embraced not only by the web giants (think Amazon using filters to track tastes and purchasing history), but also by insurance companies and government agencies, organizations tasked with making decisions that affect your everyday life. If his observations ring true, we have not seen the last of how these capabilities will affect our day-to-day activities.

Business intelligence and data warehousing are big enough markets to sustain activity at all levels. The industry giants will lumber on delivering conventional solutions, newer technology startups will put innovative products in place, and creative developments from the open-source community will set the industry aflutter. What follows is the ability for more mainstream companies to efficiently deploy unparalleled data mining solutions. The improved data warehousing and analytics capabilities are ready for use. Let’s watch and see who rises to the occasion.

43 Responses to “The Data Mining Renaissance”

  1. A recent paper by Mike Stonebraker and others compared relational and columnar databases in a parallel configuration with MapReduce. The paper concludes that MapReduce is an easy-to-configure, easy-to-use option, whereas the other data stores, relational and columnar databases, pay the upfront price of organizing the data but outperform MapReduce in runtime performance. The study highlights that the chosen option does not necessarily dictate or limit scale, as long as other attributes, such as an effective parallelism algorithm, B-tree indices, main-memory computation and compression, can help achieve the desired scale.
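    The upfront-cost trade-off the paper describes can be seen in miniature with a toy Python sketch (illustrative only, not either system's code): organizing the data into an index costs a full pass at load time but makes each query cheap, while an unorganized scan skips the load phase and pays the full read cost on every query.

```python
# Toy contrast between "organize first" (DBMS-style) and "scan everything"
# (MapReduce-over-raw-files-style) query strategies.

# Upfront cost: one organizing pass over the data, like a DBMS load phase.
records = [(i, f"user{i}") for i in range(100_000)]
index = {key: name for key, name in records}

def lookup_indexed(key):
    # Constant-time probe, analogous to a B-tree index lookup.
    return index.get(key)

def lookup_scan(key):
    # No load phase, but every query reads all the records.
    for k, name in records:
        if k == key:
            return name
    return None

assert lookup_indexed(99_999) == lookup_scan(99_999) == "user99999"
```

    Which strategy wins depends on how many queries amortize the load cost, which is essentially the paper's runtime-versus-setup finding in two functions.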

    The real issue, which is not being addressed, is that even if the chosen approach does not limit scale, it still significantly impacts the design-time decisions that developers and architects have to make. These upfront decisions limit the functionality of the applications built on these data stores and reduce the overall design agility of the system.

    I have a detailed post on this on my blog at:

  2. We at ShareThis are using the best of the technologies most of you are talking about: Hadoop/Cascading, MapReduce, SQL/MR using Aster Data and, of course, MicroStrategy. All of this 100% on the cloud (Amazon). There are a lot of technologies competing in this space; time will tell who has the best product for this ever-growing data market.

  3. Great post! I’ve read the Ayres book, and really enjoyed it. I’d really like to see more (from where, I’m not sure) on the range of skillsets needed and employment opportunities in such an emerging space.

  4. You may want to look at Vertica as well. I attended a webinar from them a few days ago on the future of DBMSs in this world of cloud computing, and the presenter, MIT professor Michael Stonebraker, did a great job of explaining what Vertica brings to the table.
    And, no, I don’t work for Vertica. ;-)

  5. > Business intelligence and data warehousing are big enough markets to sustain activity at all levels.

    I definitely agree. I recently wrote about how Amazon’s Elastic MapReduce actually makes the cloud computing proposition (and the massively parallel data mining proposition) more compelling, especially for the enterprise and web startups. You can find the article here:

    All in all, I think this is only the tip of the iceberg. As more and more people gain access to more and more data that suddenly seems worth mining, we’re going to start seeing more interesting uses of it. It’s not only business intelligence and data warehousing: recommendation engines, viral growth mechanisms and generally “social” applications of data-driven computation will start coming out sooner rather than later.

  6. I teach a Data Mining class at Stanford, in the CS department. Enrollment in the class has steadily increased over the past 3 years. There was a big jump this year from 30 to over 50 enrolled students. What we’re witnessing is that the power of data is now being better understood, and data mining is moving from a niche subject to a mainstream interest at least for CS majors.

    BTW, students used Aster Data’s system, hosted on the Amazon EC2 cloud, to do significant data mining as part of the course. The ready availability of interesting public data sets (Netflix, NY Times, and many others) makes such projects interesting and not merely academic curiosities.

  7. Re: What follows is the ability for more mainstream companies to efficiently deploy unparalleled data mining solutions.

    Aster Data’s vision is to bring the power of MapReduce to a whole new class of developers and mission-critical enterprise systems by providing tight integration with industry-standard SQL in a relational database. This provides an “enterprise-class” MapReduce. When would you use Aster’s In-Database MapReduce vs. a system like Hadoop?

    • There are many far-reaching applications for this technology, not just advertising; in fact, advertising is relatively low on the list of applications with an immediate and strong economic rationale, at least for the near term. This is somewhat counter-intuitive, but it reflects the realities, profit centers and cost structures of the advertising business. That said, its application to advertising and marketing will change the nature of those activities, making them simultaneously more subtle and more effective, which can be viewed as both good and bad depending on your perspective.

      One advertising area where the structural economics will make it difficult to run a business without significant improvements in contextual targeting and user preference modeling is mobile advertising; Internet advertising models translate poorly there, but that is a large topic unto itself. I have spent quite a bit of time discussing the relationship of data mining to advertising with executives in related industries for my own purposes, and it is a bit more complicated than most people (including myself) would assume at first glance. At some point I will get around to blogging a series on this topic…

  9. I would make the important distinction that what we are witnessing is really the commoditization of existing data mining technologies. The revolution is economic rather than technological, as there have been few true advances in the field for many years. Things like MapReduce are “new” in the same way that XML was “new”, being popularizations of capabilities that have been available for decades.

    This is an important point because there are a number of gross deficiencies in the underlying technologies currently employed in data mining that will define the market over the long term. The two most limiting weaknesses are an inability to scale high-dimensionality representations and a lack of meaningful real-time analytic capability. This seriously constrains the scope and utility of the proliferating data mining tech: just when it becomes commoditized, we start to really need it to be more capable than it currently is, and every existing platform falls far short. None of the companies mentioned above has materially addressed these limitations, which have a well-known theoretical basis.
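    The high-dimensionality weakness is easy to demonstrate with a toy Python sketch (an illustration of distance concentration, not any vendor's code): as dimensionality grows, pairwise distances between random points bunch together, and the nearest-neighbor contrast that index structures rely on washes out.

```python
import math
import random

random.seed(0)

def distance_spread(dim, n_points=200):
    # Ratio of farthest to nearest distance from a reference point.
    # In low dimensions the ratio is large (distances are well separated);
    # in high dimensions it collapses toward 1, so spatial indexes lose
    # their ability to prune the search space.
    points = [[random.random() for _ in range(dim)] for _ in range(n_points)]
    origin = points[0]
    dists = [math.dist(origin, p) for p in points[1:]]
    return max(dists) / min(dists)

low = distance_spread(2)      # well above 1: neighbors are meaningfully distinct
high = distance_spread(1000)  # close to 1: every point looks roughly equally far
```

    This is one theoretical basis for the limitation: B-tree-style pruning assumes meaningful distance contrast, and high dimensionality erodes it.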

    The renaissance will really start when a company solves the computer science problems, because until then we will be chafing under increasingly limiting theoretical constraints of existing data mining technology.

    • Gary Orenstein

      Andrew, Good distinction on the commoditization of existing data mining technologies. Thanks. I also think that real-time capabilities represent the next frontier. Far too much of the processing happens off-line or after hours today and could be much more impactful in real-time.

    • Andrew –

      you beat me to it. Well said…

      And to your last point, that will be when I need to start looking for a new job. The current state-of-the-art BI implementation is as old as Jesus’ spit: a 2- or 3-tier BI tool residing on a (hopefully) multi-node DW server cluster.

      Companies like LogiXML are getting ahead of the game when it comes to rapid reporting and data-mining implementations. But companies like MicroStrategy, Teradata, Cognos, etc., still own the ability to do really complex analysis and data modeling.

    • Good call.

      But I wouldn’t wait for a “company” to solve computer science questions. Companies’ main goal is short-term financial profit, and they seldom solve any problem at all.

      That’s why we still need non-profit high-level academic research centers.

      Remember, the most successful data-mining tool used nowadays was conceived by two graduate students at Stanford while dozens of companies were struggling with the subject.