Summary:

Trend Micro maintains web reputation databases that allow intelligent detection of spam, phishing, or suspicious web sites. It processes data accumulating at the rate of about four petabytes per year. Here’s why Trend Micro settled on Apache HBase as the core database of its new elastic infrastructure.

Edit Note: This is the first in a multi-part series of posts exploring the use cases for NoSQL deployments in the real world.

With all the excitement surrounding the relatively recent wave of non-relational – otherwise known as “NoSQL” – databases, it can be hard to separate the hype from the reality. There’s a lot of talk, but how much NoSQL action is there in the real world? In this series, we’ll take a look at some real-world NoSQL deployments.

Trend Micro provides corporate computer security products, and maintains web reputation databases that allow intelligent detection of spam, phishing, or suspicious web sites. Maintenance of these databases requires processing massive amounts of log data from DNS and other Internet servers, accumulating at the rate of about four petabytes per year. New product offerings for Trend Micro required real-time analysis of these rapidly growing data volumes. After evaluating a number of database alternatives – including Hypertable and Cassandra – Trend Micro settled on Apache HBase as the core database of its new elastic infrastructure.
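To put the four-petabyte figure in perspective, here is a rough back-of-the-envelope conversion to an average ingest rate. It assumes decimal units (1 PB = 10^15 bytes), a 365-day year, and perfectly even arrival, none of which is stated by Trend Micro; real traffic would be burstier than this average.

```python
# Average ingest rate implied by roughly 4 PB of log data per year.
# Assumptions (not from Trend Micro): decimal units, 365-day year, even arrival.
PETABYTE = 10**15
SECONDS_PER_DAY = 86_400

bytes_per_year = 4 * PETABYTE
bytes_per_day = bytes_per_year / 365
bytes_per_second = bytes_per_day / SECONDS_PER_DAY

print(f"{bytes_per_day / 10**12:.1f} TB per day")       # about 11 TB per day
print(f"{bytes_per_second / 10**6:.0f} MB per second")  # about 127 MB per second
```

Under those assumptions the system has to absorb roughly 11 terabytes a day, or a sustained 127 megabytes every second, before any analysis happens.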

Several years ago, Trend Micro adopted Hadoop – the increasingly ubiquitous, open-source implementation of Google’s MapReduce framework – to store and perform bulk processing of these data sets. However, Hadoop on its own doesn’t provide a data store that can support individually updatable data items; you can’t change a single data item within a raw Hadoop dataset without having to reprocess the entire set.

HBase, a NoSQL database modeled on Google’s BigTable system, offers the row-level access required by Trend Micro and – since it is part of the Hadoop ecosystem already established at Trend Micro – might seem a natural choice. However, Trend Micro evaluated other non-relational databases, including Cassandra and Hypertable. HBase was eventually chosen because it demonstrated an ability to handle the transaction rates required and because of its active development community. Trend Micro’s HBase solution is scheduled to go live in the first half of 2011.
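The difference between reprocessing a whole dataset and updating a single row can be illustrated with a toy, in-memory sketch. This is not the HBase client API and says nothing about how HBase stores data on disk; `update_flat_dataset`, `ToyRowStore`, the site names and the `reputation:rating` column are all hypothetical names invented for the illustration.

```python
import bisect


def update_flat_dataset(records, key, new_value):
    """Batch style: every record is read and rewritten to change just one."""
    output = []
    rewritten = 0
    for k, v in records:
        output.append((k, new_value if k == key else v))
        rewritten += 1
    return output, rewritten


class ToyRowStore:
    """Rows addressed by key and kept in sorted key order (toy model only)."""

    def __init__(self):
        self._keys = []   # sorted row keys, to support range scans
        self._rows = {}   # row key -> {column name -> value}

    def put(self, row_key, column, value):
        """Write one cell; no other row is touched."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Return a copy of one row's columns (empty dict if absent)."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self.get(k)) for k in self._keys[lo:hi]]


records = [(f"site{i:03d}.example", "unknown") for i in range(1000)]

# Changing one rating in the flat dataset walks all 1,000 records.
_, touched = update_flat_dataset(records, "site042.example", "phishing")

# The row store changes the same rating with a single keyed write.
store = ToyRowStore()
for key, rating in records:
    store.put(key, "reputation:rating", rating)
store.put("site042.example", "reputation:rating", "phishing")

print(touched)                       # 1000
print(store.get("site042.example"))  # {'reputation:rating': 'phishing'}
```

The point of the sketch is only the access pattern: a keyed store lets an application correct one site's reputation without a pass over everything else.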

As well as the index of web sites that forms the core of the reputation database, the system accepts event traces and activity logs from customer desktops accumulating at the rate of 5 billion items every day. According to Andy Purtell, senior architect at Trend Micro, a traditional relational system would have been hundreds of times more expensive than HBase – if it could have handled the load at all. As well as the massive insert rate, HBase’s flexible schema, which allows new attributes to be added without reorganizing the database, and its tight Hadoop integration were compelling.
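The flexible-schema point can be sketched with plain Python dictionaries. This is a toy model rather than HBase code, and the row keys and `info:` column names are invented for the example. In HBase itself, column families are declared when a table is created, while individual columns within a family can be added row by row.

```python
from collections import defaultdict

# Toy sparse table: row key -> {column name -> value}.
# Each row stores only the columns it actually has.
table = defaultdict(dict)

table["example.com"]["info:category"] = "news"
table["example.org"]["info:category"] = "shopping"

# Later, a new attribute shows up for one row only. Nothing is reorganized
# and existing rows are left exactly as they were.
table["example.org"]["info:first_seen"] = "2011-01-15"

columns_seen = sorted({col for row in table.values() for col in row})
print(columns_seen)                                 # ['info:category', 'info:first_seen']
print(table["example.com"].get("info:first_seen"))  # None
```

In a relational table the equivalent change would typically mean altering the table definition for every row; here the new attribute costs nothing for rows that never use it.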

Because HBase is part of the Hadoop ecosystem, Trend Micro programmers have a variety of tools available that are tightly integrated with HBase. Currently, developers at Trend Micro write data access routines directly in Java, but they are considering Pig – a scripting language for Hadoop – and the SQL-like Hive system for more ad hoc access.

“I’m not that interested in the NoSQL vs. RDBMS debate,” Purtell says. “I’m more interested in finding the best tools to build a solution. For our application, HBase is faster and cheaper than the relational alternative.”

Guy Harrison is a director of research and development at Quest Software, and has over 20 years of experience in database design, development, administration, and optimization. He can be found on the internet at www.guyharrison.net, by e-mail at guy.harrison@quest.com, and on Twitter as @guyharrison.


By Guy Harrison


  1. It makes sense to use a nodal database for spam-related problems, especially when spam attacks come in large spikes and you suddenly have a large number of writes to process quickly in order to calculate reputation. Any delay in processing results in more spam in users’ inboxes. Proofpoint, a leading email solution provider, has been using Hadoop, HBase and Cassandra for similar reasons for many years now.

  2. Real World NoSQL: MongoDB at Shutterfly : Cloud Computing News « Saturday, January 29, 2011

  3. Real World NoSQL: Cassandra at Openwave : Cloud Computing News « Saturday, January 29, 2011

    [...] Edit Note: This is the third of a multi-part series of posts exploring the use cases for NoSQL deployments in the real world. So far, the series has covered case studies on MongoDB and Hbase. [...]

  4. Real World NoSQL: Amazon SimpleDB at Netflix : Cloud Computing News « Friday, February 4, 2011

    [...] Edit Note: This is the fourth of a multi-part series of posts exploring the use cases for NoSQL deployments in the real world. So far, the series has covered case studies on MongoDB, Cassandra and Hbase. [...]

  5. Real World NoSQL: Membase at Tribal Crossing : Cloud Computing News « Saturday, February 5, 2011

    [...] Edit Note: This is the fifth and final article of a multi-part series of posts exploring the use cases for NoSQL deployments in the real world. So far, the series has covered case studies on MongoDB, Cassandra, Amazon’s Simple DB and Hbase. [...]
