We are in the midst of a data mining renaissance.

Traditionally, data warehousing implementations were large, complex and expensive, meaning only the top-ranking companies could afford them. Teradata pioneered the initial market for corporate data warehousing solutions and still maintains a segment lead, something HP’s CEO Mark Hurd knows all too well. More recent entrants into the data warehousing and intelligence market, like Netezza, have emerged with cost-effective, appliance-based approaches. Others in this arena include Greenplum, recent Microsoft acquisition DATAllegro and, of course, IBM, Oracle and SAP.

But the web changed the way we radiate and consume information and in doing so, created a new opportunity to measure and monetize it. Faced with more user data, logging information, and web content than anyone thought one system could handle, the major web companies developed highly scaled data warehousing solutions themselves. Armed with these tools, they improved customer resonance by building better recommendation engines, more targeted advertising networks and more intricate campaigns.

The preferred architectural model for this web-derived data warehouse — a combination of low-cost server hardware, distributed systems and open-source software — set off an innovation path that outpaced the commercial market. What once was a very expensive proposition — to amass the computer power, storage capacity and network bandwidth to run a high-end data warehouse and analytics engine — is now readily available on-site or in the cloud.

The current software favorite for large-scale data processing and analytics is Hadoop, an open-source implementation of MapReduce. MapReduce is a technique popularized by Google that splits a large job into pieces and farms them out to many nodes and, as such, is useful for processing information from large data sets. Hadoop helps major web companies make sense of the mouse clicks and content they track across hundreds of millions of users. Amazon began offering it as a pre-configured option on EC2 in early April. Firms like Cloudera help companies apply Hadoop to their data-processing initiatives. Cloudera recently secured a $5 million financing round to beef up its distribution of Hadoop. Meanwhile, Yahoo has been singing the Hadoop tune for the last couple of years, following the hire of Hadoop’s creator, Doug Cutting, in 2006.
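
For readers who have not seen MapReduce up close, the idea can be sketched in a few lines of Python. This is a single-process illustration of the map, shuffle and reduce steps, not Hadoop's actual API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample click records are hypothetical, chosen only for this example. In a real cluster each input split would be mapped on a different node and the shuffle would move data across the network.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit (key, value) pairs for one log record; the key is the page clicked."""
    _user, page = record
    yield page, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result (here, a click count)."""
    return key, sum(values)


def run_job(splits):
    """Run map over every record in every split, then shuffle, then reduce."""
    mapped = chain.from_iterable(
        map_phase(record) for split in splits for record in split
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical click log, already divided into two input splits
# (in a cluster, each split would live on a different node).
splits = [
    [("alice", "/home"), ("bob", "/pricing")],
    [("carol", "/home"), ("alice", "/pricing"), ("bob", "/home")],
]

click_counts = run_job(splits)
print(click_counts)
```

On this sample data the job counts three clicks on `/home` and two on `/pricing`. Because each map call sees only one record and each reduce call sees only one key, both phases can in principle be spread across as many machines as the data requires.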

A Hadoop-spawned ecosystem includes a range of complementary projects. Hive, for example, is an open-source data warehouse infrastructure built on top of Hadoop. And other uses of MapReduce are gaining traction. Recently, Aster Data Systems raised $12 million in new funding for its high-performance analytical database. Aster’s software features in-database MapReduce, and is now available as a cloud-based solution.

Not too long ago, Ian Ayres wrote a book called “Super Crunchers,” in which he details the resurgence of data mining and its impact on society. As he notes, this trend has been embraced not only by the web giants (think Amazon using filters to track tastes and purchasing history), but also by insurance companies and government agencies, the organizations tasked with making decisions that affect your everyday life. If his observations ring true, we have not seen the last of how these capabilities will affect our day-to-day activities.

Business intelligence and data warehousing are big enough markets to sustain activity at all levels. The industry giants will lumber on delivering conventional solutions, newer technology startups will put innovative products in place, and creative developments from the open-source community will set the industry aflutter. What follows is the ability for more mainstream companies to efficiently deploy unparalleled data mining solutions. The improved data warehousing and analytics capabilities are ready for use. Let’s watch and see who rises to the occasion.

  1. I would make the important distinction that what we are witnessing is really the commoditization of existing data mining technologies. The revolution is economic rather than technological, as there have been few true advances in the field for many years. Things like MapReduce are “new” in the same way that XML was “new”, being popularizations of capabilities that have been available for decades.

    This is an important point because there are a number of gross deficiencies in the underlying technologies currently employed in data mining that will define the market over the long term. The two most limiting weaknesses are an inability to scale high-dimensionality representations and a lack of meaningful real-time analytic capability. This seriously constrains the scope and utility of proliferating data mining tech; just when it becomes commoditized, we start to really need it to be more capable than it currently is, and every existing platform falls far short. None of the companies mentioned above has materially addressed these limitations, which have a well-known theoretical basis.

    The renaissance will really start when a company solves the computer science problems, because until then we will be chafing under increasingly limiting theoretical constraints of existing data mining technology.

    1. Andrew –
      Well said. You really hit the nail on the head. Hopefully we will get there, as so much is possible with no constraints.

    2. Gary Orenstein Friday, April 10, 2009

      Andrew, good distinction on the commoditization of existing data mining technologies. Thanks. I also think that real-time capabilities represent the next frontier. Far too much of the processing happens offline or after hours today and could be much more impactful in real time.

    3. Andrew –

      You beat me to it. Well said…

      And to your last point, that will be when I need to start looking for a new job. The current state-of-the-art BI implementation is old as Jesus spit: a 2- or 3-tier BI tool residing on a (hopefully) multi-node DW server cluster.

      Companies like LogiXML are getting ahead of the game when it comes to rapid reporting and data-mining implementations. But companies like MicroStrategy, Teradata, Cognos, etc. still own the ability to do really complex analysis and data modeling.

    4. Good call.

      But I wouldn’t wait for a “company” to solve computer science questions. Companies’ main goal is short-term financial profit, and they seldom solve any problem at all.

      That’s why we still need non-profit high-level academic research centers.

      Remember, the most successful data-mining tool in use today was conceived by two graduate students at Stanford while tens of companies were struggling with the subject.

  2. Hope all this awesomeness doesn’t serve only advertising, etc. Much appreciated post and first comment above though.

    1. There are many far-reaching applications for this technology beyond advertising. In fact, advertising is relatively low on the list of applications with an immediate and strong economic rationale, at least for the near term. This is somewhat counter-intuitive, but it reflects the realities, profit centers, and cost structures of the advertising business. That said, its effect on advertising and marketing is to change the nature of those activities, making them simultaneously more subtle and more effective, which can be viewed as both good and bad depending on your perspective.

      One advertising area where the structural economics will make it difficult to run a business without significant improvements in contextual targeting and user preference modeling is the mobile advertising space. Internet advertising models translate poorly there, but that is a large topic unto itself. I have spent quite a bit of time discussing the relationship of data mining to advertising with executives in related industries for my own purposes, and it is a bit more complicated than most people (myself included) would assume at first glance. At some point I will get around to blogging a series on this topic…

  4. >>What follows is the ability for more mainstream companies to efficiently deploy unparalleled data mining solutions.

    Aster Data’s vision is to bring the power of MapReduce to a whole new class of developers and mission-critical enterprise systems by providing tight integration with industry-standard SQL in a relational database. This provides an “enterprise-class” MapReduce. When would you use Aster’s In-Database MapReduce vs. a system like Hadoop?
    http://www.asterdata.com/blog/index.php/2009/04/02/enterprise-class-mapreduce/

  5. Marshall — heated agreement that “this awesomeness” needs (and will) serve more than just advertising. There was a good webcast a couple weeks ago talking about other use-cases in Government, Telco, and consumer retail/finance: http://www.asterdata.com/blog/index.php/2009/03/27/more-in-database-mapreduce-applications/

  7. I teach a Data Mining class at Stanford, in the CS department. Enrollment in the class has steadily increased over the past 3 years. There was a big jump this year from 30 to over 50 enrolled students. What we’re witnessing is that the power of data is now being better understood, and data mining is moving from a niche subject to a mainstream interest at least for CS majors.

    BTW, students used Aster Data’s system, hosted on the Amazon EC2 cloud, to do significant data mining as part of the course. The ready availability of interesting public data sets (Netflix, NY Times, and many others) makes such projects interesting and not merely academic curiosities.

    1. @anand – That’s a great data point regarding a growing interest in data mining at the university level. I thought readers might appreciate a link to your class:

      http://www.stanford.edu/class/cs345a/

      Also, it’s worth noting that in addition to teaching at Stanford, you are an investor and board member at Aster Data, so you clearly have a view of both worlds.

  8. > Business intelligence and data warehousing are big enough markets to sustain activity at all levels.

    I definitely agree. I recently wrote about how Amazon’s Elastic MapReduce actually makes the cloud computing proposition (and the massively-parallel data mining proposition) a more compelling one especially for the enterprise and web startups. You can find the article here: http://bit.ly/YRf5 .

    All in all, I think this is only the tip of the iceberg: as more and more people have access to more and more data that suddenly seems worth mining, we’re going to start seeing more interesting uses of this data. It’s not only business intelligence and data warehousing; recommendation engines, viral growth mechanisms, and generally “social” applications of data-driven computation will start coming out sooner rather than later.

  9. You may want to look at “Vertica” as well. I attended a webinar a few days ago from them on the future of DBMSs in this world of Cloud Computing and the presenter, Michael Stonebraker, MIT Professor, did a great job of explaining what Vertica brings to the table.
    http://www.vertica.com/
    And no, I don’t work for Vertica. ;-)

  10. Great post! I’ve read the Ayres book, and really enjoyed it. I’d really like to see more (from where, I’m not sure) on the range of skillsets needed and employment opportunities in such an emerging space.

  11. Gary Orenstein Friday, April 10, 2009

    @Anand. Thanks. I mentioned Vertica briefly in an earlier post on Cloud Computing’s Three Horse Race, http://gigaom.com/2009/03/08/cloud-computings-three-horse-race/

  12. For the record, my Teradata baseball cap I got at TDWI several years ago is still my favorite hat. I’m sure there’s a reason for it.

    1. … and we’re working pretty hard at Teradata to ensure that it remains your favorite. :)

      Check out TD 13.0 coming soon. Performance-wise, no one else will be even close :)

  14. Cazoodle would like to be at the forefront of this data mining and use it for applications beyond mere advertising, becoming a useful tool for people who want a better-quality search application. http://www.cazoodle.com

  16. We at ShareThis are using the best of all the technologies most of you are talking about: Hadoop/Cascading, MapReduce, SQL/MR using Aster Data and, of course, MicroStrategy. All of this runs 100% in the cloud (Amazon). There are a lot of technologies out there competing in this space; time will tell who has the best product for this ever-growing data market.

  19. A recent paper by Mike Stonebraker and others compared relational and columnar databases in a parallel configuration with MapReduce. The paper concludes that MapReduce is an easy-to-configure and easy-to-use option, whereas the other data stores, relational and columnar databases, pay the upfront price of organizing the data but outperform MapReduce at runtime. The study highlights that the chosen option does not necessarily dictate or limit scale, as long as other attributes such as an effective parallelism algorithm, B-tree indices, main-memory computation and compression can help achieve the desired scale.

    The real issue, which is not being addressed, is that even if the chosen approach does not limit scale, it still significantly affects the design-time decisions that developers and architects have to make. These upfront decisions limit the functionality of the applications built on these data stores and reduce the overall design agility of the system.

    I have a detailed post on this on my blog at:

    http://cloudcomputing.blogspot.com/2009/04/database-continuum-on-cloud-from.html

  29. Cloud computing and SaaS are expected to be a $160 billion industry by 2015.
