We are in the midst of a data mining renaissance.
Traditionally, data warehousing implementations were large, complex and expensive, meaning only the top-ranking companies could afford them. Teradata pioneered the initial market for corporate data warehousing solutions and still maintains a segment lead, something HP’s CEO Mark Hurd knows all too well. More recent entrants into the data warehousing and intelligence market, like Netezza, have emerged with cost-effective, appliance-based approaches. Others in this arena include Greenplum, recent Microsoft acquisition DATAllegro and, of course, IBM, Oracle and SAP.
But the web changed the way we share and consume information and, in doing so, created a new opportunity to measure and monetize it. Faced with more user data, logging information, and web content than anyone thought one system could handle, the major web companies developed highly scaled data warehousing solutions themselves. Armed with these tools, they improved customer resonance by building better recommendation engines, more targeted advertising networks and more intricate campaigns.
The preferred architectural model for this web-derived data warehouse — a combination of low-cost server hardware, distributed systems and open-source software — set off an innovation path that outpaced the commercial market. What once was a very expensive proposition — to amass the computer power, storage capacity and network bandwidth to run a high-end data warehouse and analytics engine — is now readily available on-site or in the cloud.
The current software favorite for large-scale data processing and analytics is Hadoop, an open-source implementation of MapReduce. MapReduce is a technique popularized by Google that splits a large problem into smaller pieces, processes them in parallel across many nodes and then combines the partial results, which makes it useful for processing information from large data sets. Hadoop helps major web companies make sense of the mouse clicks and content they track across hundreds of millions of users. Amazon began offering it as a pre-configured option on EC2 in early April. Firms like Cloudera help companies apply Hadoop to their data-processing initiatives. Cloudera recently secured a $5 million financing round to beef up its distribution of Hadoop. Meanwhile, Yahoo has been singing the Hadoop tune for the last couple of years, following the hire of Hadoop’s creator, Doug Cutting, in 2006.
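To make the MapReduce idea concrete, here is a minimal single-machine sketch in Python that counts page clicks. It is not Hadoop's API and not how any of the companies mentioned actually run their jobs. The function names (`map_clicks`, `shuffle`, `reduce_counts`, `run_job`), the "user page" log format and the sample data are all assumptions made up for illustration. In a real cluster, each chunk of the log would sit on a different node and the map and reduce steps would run in parallel.

```python
from collections import defaultdict


def map_clicks(log_line):
    """Map step: turn one click record into a (page, 1) pair.

    The "user page" line format is a hypothetical example.
    """
    _user, page = log_line.split()
    return [(page, 1)]


def shuffle(pairs):
    """Shuffle step: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(page, counts):
    """Reduce step: collapse all values for one key into a total."""
    return page, sum(counts)


def run_job(chunks):
    """Run map, shuffle and reduce over chunks of a click log.

    Each chunk stands in for the slice of data one node would hold;
    here the chunks are simply processed one after another.
    """
    mapped = []
    for chunk in chunks:
        for line in chunk:
            mapped.extend(map_clicks(line))
    grouped = shuffle(mapped)
    return dict(reduce_counts(page, counts) for page, counts in grouped.items())


# Two hypothetical chunks of a click log, as if stored on two nodes.
chunks = [
    ["alice /home", "bob /home"],
    ["alice /cart", "carol /home"],
]
totals = run_job(chunks)
print(totals)  # {'/home': 3, '/cart': 1}
```

The map step never needs to see more than one record and the reduce step never needs to see more than one key, which is what allows a framework to spread both across many inexpensive servers.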
A Hadoop-spawned ecosystem includes a range of complementary projects. Hive, for example, is an open-source data warehouse infrastructure built on top of Hadoop. And other uses of MapReduce are gaining traction. Recently, Aster Data Systems raised $12 million in new funding for its high-performance analytical database. Aster’s software features in-database MapReduce, and is now available as a cloud-based solution.
Not too long ago, Ian Ayres wrote a book called “Super Crunchers,” in which he details the resurgence of data mining and its impact on society as a whole. As he notes, this trend has been embraced not only by the web giants (think Amazon using filters to track tastes and purchasing history), but also by insurance companies and government agencies, those tasked with making decisions that affect your everyday life. If his observations ring true, we have not seen the last of how these capabilities will affect our day-to-day activities.
Business intelligence and data warehousing are big enough markets to sustain activity at all levels. The industry giants will lumber on delivering conventional solutions, newer technology startups will put innovative products in place, and creative developments from the open-source community will set the industry aflutter. What follows is the ability for more mainstream companies to efficiently deploy unparalleled data mining solutions. The improved data warehousing and analytics capabilities are ready for use. Let’s watch and see who rises to the occasion.