MongoDB: Now with Hive compatibility

MongoDB proprietor 10gen announced an updated version of its Hadoop connector on Tuesday that includes some pretty significant new features. Among them are support for Hive (the SQL-like framework and query language for Hadoop), the ability to store native MongoDB files in Hadoop and the ability to run incremental MapReduce jobs on the same collection of MongoDB data.

The MongoDB connector for Hadoop has been available for a while, and it is used pretty widely, 10gen Director of Product Marketing Kelly Stirman said. Tuesday’s improvements are the first major updates since its general release in April 2012.

If you haven’t noticed already, MongoDB and Hadoop have become very popular in the past few years. MongoDB serves as the operational database for many web and mobile applications because of its support for the types of JSON files they produce, while Hadoop is the de facto big data processing and analytics platform at many companies. Especially in large web companies and Fortune 500 companies, Stirman told me, the two are often deployed side by side.

This diagram could now include Hive on the right.

The MongoDB connector is already pretty popular, he added, because it actually lets users process MongoDB data inside the database rather than sending it to Hadoop for processing. Adding Hive support on top of existing support for MapReduce and Pig should only make it more popular, as Hive, with its SQL-like nature, is a common way for companies to interact with their Hadoop data. Database startup Drawn to Scale had added a similar capability (SQL queries on MongoDB data) shortly before closing down earlier this summer.

Support for MongoDB’s native BSON files in the Hadoop Distributed File System means users can back up their database files to Hadoop, and they can also process them there to avoid putting undue load on production MongoDB clusters.

Stirman called the ability to run incremental MapReduce updates on MongoDB collections akin to an “enrichment process.” Whereas users previously could only run MapReduce jobs that wrote to entirely new collections inside the database, the new feature, called MongoUpdateWritable, lets users run jobs that update existing collections. It’s a faster and simpler way of capturing changes on a day-to-day basis, rather than having to compare different outputs to spot trends or query a new collection every time that MapReduce job runs.
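
To make the difference concrete, here is a toy in-memory sketch in Python. It does not use the Hadoop connector or any MongoDB API; the function names, the "visits" field and the sample events are all hypothetical, and a plain dictionary stands in for a collection. It only illustrates the idea of folding each run's output into existing documents instead of writing a fresh collection per run.

```python
from collections import Counter


def map_reduce_counts(events):
    """Toy MapReduce job: count events per user.

    The "map" step emits (user, 1) and the "reduce" step sums per key.
    """
    counts = Counter()
    for event in events:
        counts[event["user"]] += 1
    return dict(counts)


def write_new_collection(results):
    """Older style (as described above): each run yields a standalone collection."""
    return {user: {"_id": user, "visits": n} for user, n in results.items()}


def apply_incremental_update(collection, results):
    """Incremental style: upsert each result into the existing collection,
    adding to the stored total rather than replacing the document."""
    for user, n in results.items():
        doc = collection.setdefault(user, {"_id": user, "visits": 0})
        doc["visits"] += n
    return collection


# Hypothetical sample data: two days of events.
day1 = [{"user": "ann"}, {"user": "bob"}, {"user": "ann"}]
day2 = [{"user": "ann"}, {"user": "cat"}]

# Separate collections per run: trends require comparing outputs by hand.
run1 = write_new_collection(map_reduce_counts(day1))
run2 = write_new_collection(map_reduce_counts(day2))

# One enriched collection: each run updates the same documents in place.
totals = {}
apply_incremental_update(totals, map_reduce_counts(day1))
apply_incremental_update(totals, map_reduce_counts(day2))
```

With separate outputs, answering "how many visits does ann have so far?" means reading both run1 and run2; with the incremental version, the running total lives in a single document in totals.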

Database-industry watchers might question whether these features merely improve the functionality of existing MongoDB-Hadoop environments, or whether they’ll actually affect market share somehow. Stirman seems to think the latter, at least with regard to companies already using Hadoop. 10gen sometimes runs into Cassandra and HBase as competitive options in the sales cycle, he noted, but now, “Essentially, there’s parity across the three with respect to Hadoop.”

Parity? Maybe, at least to the extent you’re willing to let Hadoop’s scale compensate for less scalability on the database side. There certainly are still plenty of reasons to choose other NoSQL databases over MongoDB depending on the application. In fact, our Structure: Europe conference next month in London features speakers from LinkedIn, Netflix, Facebook and even a former National Security Agency analyst (now with Sqrrl), all of whom use Hadoop and all of whom, for reasons they might well discuss, have opted for NoSQL options other than MongoDB.