Editor’s Note: This is the first in a multi-part series of posts exploring the use cases for NoSQL deployments in the real world.
With all the excitement surrounding the relatively recent wave of non-relational – otherwise known as “NoSQL” – databases, it can be hard to separate the hype from the reality. There’s a lot of talk, but how much NoSQL action is there in the real world? In this series, we’ll take a look at some real-world NoSQL deployments.
Trend Micro provides corporate computer security products, and maintains web reputation databases that allow intelligent detection of spam, phishing, or suspicious web sites. Maintenance of these databases requires processing massive amounts of log data from DNS and other Internet servers, accumulating at the rate of about four petabytes per year. New product offerings for Trend Micro required real-time analysis of exponentially growing data volumes. After evaluating a number of database alternatives – including Hypertable and Cassandra – Trend Micro settled on Apache HBase as the core database of its new elastic infrastructure.
Several years ago, Trend Micro adopted Hadoop – the increasingly ubiquitous, open-source implementation of Google’s MapReduce framework – to store and perform bulk processing of these data sets. However, Hadoop on its own doesn’t provide a data store that can support individually updatable data items; you can’t change a single data item within a raw Hadoop dataset without having to reprocess the entire set.
HBase, a NoSQL database modeled on Google’s Bigtable system, offers the row-level access required by Trend Micro and – since it is part of the Hadoop ecosystem already established at Trend Micro – may seem a natural choice. However, Trend Micro evaluated other non-relational databases, including Cassandra and Hypertable. HBase was eventually chosen because it demonstrated an ability to handle the transaction rates required and because of its active development community. Trend Micro’s HBase solution is scheduled to go live in the first half of 2011.
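To make the idea of row-level access concrete, here is a minimal sketch in Python of a Bigtable-style table as a map from row keys to sparse sets of cells. It is a toy in-memory model, not the HBase client API; the class name, row keys, and column names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class ToyWideColumnTable:
    """Toy in-memory stand-in for a Bigtable-style table (NOT the HBase API)."""

    def __init__(self):
        # row key -> {"family:qualifier" -> value}; each row holds only its own cells
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell; only the addressed row is touched."""
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Return a copy of one row's cells (empty dict if the row is absent)."""
        return dict(self._rows.get(row_key, {}))


# Hypothetical reputation records keyed by reversed domain name
table = ToyWideColumnTable()
table.put("com.example.www", "rep:score", 72)
table.put("org.example.mail", "rep:score", 15)

# Change a single item without reprocessing the rest of the data set
table.put("org.example.mail", "rep:score", 40)

print(table.get("org.example.mail"))
print(table.get("com.example.www"))
```

The point of the sketch is the access pattern: an update addresses one row by key and leaves every other row alone, which is what a raw Hadoop dataset on its own does not offer.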
As well as the index of web sites that forms the core of the reputation database, the system accepts event traces and activity logs from customer desktops, accumulating at the rate of 5 billion items every day. According to Andy Purtell, senior architect at Trend Micro, a traditional relational system would have been hundreds of times more expensive than HBase – if it could have handled the load at all. Beyond its ability to sustain that massive insert rate, HBase’s flexible schema, which allows new attributes to be added without reorganizing the database, and its tight Hadoop integration were compelling.
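For a sense of scale, the figures quoted in this article can be turned into average rates with a little arithmetic. The short Python sketch below does that; it assumes decimal petabytes, a 365-day year, and a load spread evenly over time, none of which is stated by Trend Micro, so real peak rates would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Desktop event traces and activity logs: 5 billion items per day
items_per_day = 5_000_000_000
items_per_second = items_per_day / SECONDS_PER_DAY

# Server log data: about four petabytes per year (assumed decimal: 10**15 bytes)
log_bytes_per_year = 4 * 10**15
log_tb_per_day = log_bytes_per_year / 365 / 10**12

print(f"{items_per_second:,.0f} items per second on average")
print(f"{log_tb_per_day:.1f} TB of log data per day on average")
```

Under those assumptions the desktop feed averages roughly 58,000 inserts per second and the log data about 11 TB per day.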
Because HBase is part of the Hadoop ecosystem, Trend Micro programmers have a variety of tools available that are tightly integrated with HBase. Currently, developers at Trend Micro write data access routines directly in Java, but they are considering Pig – a scripting language for Hadoop – and the SQL-like Hive system for more ad-hoc access.
“I’m not that interested in the NoSQL vs. RDBMS debate,” Purtell says. “I’m more interested in finding the best tools to build a solution. For our application, HBase is faster and cheaper than the relational alternative.”
Guy Harrison is a director of research and development at Quest Software, and has over 20 years of experience in database design, development, administration, and optimization. He can be found on the internet at www.guyharrison.net, by e-mail at firstname.lastname@example.org, and on Twitter as @guyharrison.