Enterprise data is predicted to grow 650 percent by 2014. That data — 80 percent of which is unstructured — comes in the form of office documents, customer interactions, wikis, community posts, bulletin boards and emails, and contains valuable information. The question is how to manage this data and make it searchable.
Google is one option, as the search giant has made substantial inroads into the enterprise market with its Search Appliance offering. The model is “good enough” rather than feature-rich, it’s browser-based and can cost as little as $1,000 in a market used to six-figure enterprise applications. Enterprise search market leader Autonomy, meanwhile, is still growing its market share. But these proprietary models often mean an increase in cost, either with the number of seats on a search software license or with the volume of searches in a given year, and some enterprises are seeking alternatives.
Since the merger of the Lucene search engine and Solr search server projects in early 2010, some have looked to open source search models for organizing unstructured data. Why? The open source community believes it has advantages in scalability, flexibility and speed over enterprise-specific behemoths like Autonomy, Microsoft (owners of FAST, the ex-Alltheweb search engine) and even over Google’s Search Appliance. These factors, along with the superior functionality, cost structure and support services of open source, make it better adapted to changing enterprise requirements than its proprietary counterparts, and it could prove to be a viable alternative for widespread enterprise use in the future.
Who Is Leading the Open Source Charge?
The open source search option is organized in the Lucene/Solr projects at apache.org. Lucene is the search engine, Solr the server. The strong open source community around Solr and its scalability have given many companies the confidence to integrate Lucene into their mission-critical tasks.
Otis Gospodnetic, co-author of the Lucene implementation primer “Lucene in Action,” says, “You would be surprised to know how many services and applications use Lucene. For example, did you know that Apple uses Lucene in Safari and iTunes?” Lucene, he adds, powers iTunes’ search engine Spotlight and browser history in Safari. CNET, Netflix, Cisco, the Guardian and MTV use it, as do very high-end web applications like LinkedIn’s People You May Know function.
Gospodnetic’s company, Sematext International, is one of the commercial companies offering Solr implementation services. Performance characteristics offered by Sematext include 10,000 queries per second and 1 billion documents as part of a Lucene/Solr implementation. Although this is a rather high-end specification, the effect of providing better access to more unstructured data is to increase internal search activity and the need for scale.
VC-backed Lucid Imagination, host of the Lucene Revolution conference, is another major supplier of open source search support and implementation. Lucid’s management and tech staff include veterans of Northern Light, Amazon A9 and Excite. The Lucid approach provides information, implementation and support services based around Solr/Lucene, and it seems to be working. “We’ve been running positive revenues for seven quarters,” says CEO Eric Gries. “We have never discounted our prices in this period and our prices are going up.” Lucid’s take on the market is that plenty of enterprise clients want to transition to open source but need a supplier base that can support them through mission-critical installations.
Providing that much-needed base, in addition to Lucid Imagination and Sematext, are Findwise, Open Source Connections, JTeam, Doculibre, GetSet and Mindheap.
Market Drivers for Open Source Search
The effects of Enterprise 2.0 and the socialization of knowledge continue to proliferate unstructured data and create demand for knowledge as content. Regulatory compliance concerns are also a driver, as the legal value of any information transaction continues to become more significant. And while open source search applies to the larger web, in enterprises, other factors driving the move towards open source include:
Enterprise 2.0. Enterprises face a growing array of data types. On top of relational databases, companies now generate data through wikis, legacy bulletin boards, internal online communities, email, customer interaction points, PDF, video, images and in Microsoft Office. Estimates of these volumes vary, but a ratio of 80 to 20 unstructured-to-structured data in the enterprise seems to be sticking.
An inability to organize this unstructured data and simultaneously access different structured data files is a problem for business intelligence, adaptability, responsiveness and compliance. At the same time, employees expect good and easy data access, as Google and the web have primed them for instant gratification. Likewise, consumer-grade tools and platforms are increasingly trickling into the enterprise setting. Lucene/Solr is essentially a text index and search server, like Google without the web-crawling functionality.
People and video. Cisco’s Pulse project, which uses Lucene/Solr as a platform, is a good example of an enterprise search application made possible by open source. It allows employees to seek out one another based on their current interests, even down to their latest corporate video appearance. Pulse tags content and media across enterprise networks and makes them searchable in real time.
The benefit of Lucene/Solr as a platform for the project, according to Cisco, is its scalability and speed. Cisco specified a 100 million-object index built from near-real-time tagging, searchable at sub-second speeds. With open source, Cisco could take the platform and tune it to the specific needs of real-time expert tagging and discovery. An example of this would be enterprise users with thin personal profiles who modestly hide away their skills and achievements. Because Pulse tags people and their talents in everything from email to video to office documents, hidden abilities can be picked up on the fly, giving you up-to-date information on talent across the business.
The Cloud. Cloud computing and SaaS will present new challenges for data discovery and management as enterprises migrate data into cloud environments and expect immediate access to their assets through search. NASA’s Nebula project, a low-cost cloud-computing platform designed to broaden participation in the U.S. space program, uses Lucene/Solr for its search facility. Acquia, the Drupal support services company, is incorporating Solr, which means Acquia-supported Drupal installations will be able to run faceted search. WordPress (see disclosure) is gearing up its cloud-based Solr search via powcloud (currently in private beta); this will give users access to on-site navigation through Solr’s faceted search functionality.
Web niches. While Eric Gries of Lucid Imagination has 100 enterprise clients and has been profitable the past seven quarters, Daniel Levitt is the sole proprietor of recipe site What Could I Cook, built in cooperation with the Guardian newspaper. The Guardian had a problem with a relatively small data set: Ten years of recipes from its own pages and those of the Observer and Observer Food Monthly don’t amount to an insurmountable heap of data. However, readers still found it hard to identify recipes by name and instead wanted to input a few ingredients in the search box and receive specific recipes back. Levitt built his site with virtually no coding experience, using a PHP client and Lucene’s index. What Could I Cook is an example of a growing niche market for search applications that, while it extends beyond enterprise use, includes yellow pages search, classifieds search and other search applications that rely on open source to achieve their ultimate goal.
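For readers curious about what ingredient-based lookup involves, here is a minimal sketch in Python of the underlying idea, an inverted index that maps each ingredient to the recipes containing it. It is not Levitt’s implementation and does not use Lucene or Solr; the recipe data and the build_index and search functions are hypothetical, made up purely for illustration.

```python
from collections import defaultdict

# Hypothetical sample data; not taken from the Guardian's recipe archive.
RECIPES = {
    "Tomato soup": ["tomato", "onion", "garlic"],
    "Garlic bread": ["bread", "garlic", "butter"],
    "Onion tart": ["onion", "butter", "flour"],
}


def build_index(recipes):
    """Map each ingredient to the set of recipe names that contain it."""
    index = defaultdict(set)
    for name, ingredients in recipes.items():
        for ingredient in ingredients:
            index[ingredient.lower()].add(name)
    return index


def search(index, query):
    """Return recipes containing every ingredient in a space-separated query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Intersect the recipe sets for each term; unknown terms match nothing.
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


index = build_index(RECIPES)
print(search(index, "garlic onion"))  # prints ['Tomato soup']
```

A production search server adds much more on top of this, such as text analysis, relevance ranking and faceting, but the lookup from term to matching documents is the core that makes a query like "garlic onion" fast even over a large archive.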
There are, however, barriers to the adoption of open source. Many enterprises have poor search and discovery platforms that don’t index and return data well, often because of poor index or topology management. And many have platforms with price plans that were designed when data access rights belonged to small teams, as opposed to the larger open source community.
As open source search becomes more trusted, and as the supplier ecosystem grows, open source looks like a disruptive force in the market. The expansion of web-like features to the enterprise is in part a testament to Google’s pervasive influence on the search market. However, the ability for customers to open up the search box and modify it for their own purposes, and to include advanced features like filtering at no cost, should see open source search emerge as a market force in the enterprise, the cloud, SaaS and in ecommerce. Ultimately, the biggest driver may prove to be demand on the part of employees to make the enterprise environment look and work like the consumer-driven web.
Disclosure: WordPress is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, Giga Omni Media. Om Malik, founder of Giga Omni Media, is also a venture partner at True.