Enterprise search, until recently a solved problem, is rearing its head again as IT shops consider how to search big data repositories such as Hadoop using existing tools and processes. To that end, Cloudera announced today that it has integrated the open source Apache Lucene/Solr search functionality into its Hadoop distribution.
It’s a smart move for multiple reasons.
As the enterprise search market evolves to include big data, legacy players in the market have seen big changes over the last couple of years. Many of the pioneers of this space have been acquired. Those companies, including Verity, Autonomy, FAST, Endeca, Exalead and ISYS, face a major challenge from the rise of open source search, particularly Lucene and Solr.
Companies including Twitter, Netflix, LinkedIn, Facebook, Groupon, Shopzilla, AT&T, Sears, Ford and Verizon use this technology today. By using Lucene, for example, Twitter is able to index 400 million tweets, each within 50 milliseconds of being posted, and to serve results for more than a billion queries every day. The technology also powers search on Microsoft Azure, which is interesting given that Microsoft owns FAST.
Cloudera is not the first company with a next-generation data store to recognize the need for search capabilities in its platform. DataStax added support for Apache Lucene/Solr to its NoSQL database a year ago. For more on how the rest of the industry is going after this market, see the GigaOM Research report: Big data’s potential with search (subscription required).
Adding search to Hadoop opens up the platform to regular business users. It has been exceptionally challenging for business users to perform analysis on Hadoop data, as very few can write MapReduce jobs in Java. Cloudera added support for SQL, opening up the market to analysts who can write SQL statements, but this still didn’t address the bulk of business users out there. Search is the ideal model to broaden the range of users who can get value out of data in the Hadoop platform.
Cloudera’s grand vision is to turn its Hadoop distribution into a unified data management platform. At its base is an elastic pool of storage (HDFS), with different ways of accessing that data layered on top: batch processing (Apache MapReduce), interactive SQL (Impala), machine learning, and now search (Apache Lucene/Solr).
Cloudera’s premise is that using its platform is simpler than maintaining multiple data stores and the separate sets of expertise needed to manage disparate systems. The company also promises that its approach, which scales out on commodity hardware, is less expensive than buying single-task proprietary systems such as data warehouses from Teradata, massively parallel processing (MPP) systems from the likes of HP Vertica, EMC Greenplum and Oracle Exadata, NoSQL databases from the likes of DataStax, and individual search and machine learning tools.
It is no doubt simpler to have one box that does everything. The big question is whether Cloudera’s Swiss Army knife approach is good enough at each of these individual tasks, and has enough of the right features, that users do not need purpose-built systems.
My suspicion is that the high end of the market will always shell out big bucks for the feature-rich, purpose-built boxes from IBM, Oracle, HP and EMC, among others. But the more cost-conscious mid-market will go for Cloudera’s approach of a unified platform. And ultimately, as these mid-market companies grow and the platform expands with them, they will relegate the purpose-built boxes further and further into a niche at the very high end.