For fast, interactive Hadoop queries, Drill may be the answer

In the era of big data, there is increasing demand for ever-faster ways to analyze information sitting in Hadoop, preferably interactively. Now the Apache Software Foundation is backing an open-source version of Dremel, the tool Google uses for these jobs, as a way to bring that speedy analysis to the masses. The proposed tool is called Drill, and the foundation's documents describe it as “a distributed system for interactive analysis of large-scale datasets.”

As it stands now, MapReduce is typically used to perform batch analysis on Hadoop data, which is fine unless you want fast results or want to redefine your queries on the fly. If you want to do either or both of those things, you need another tool. Drill might be one solution to this problem. As GigaOM reported last year, Google invented Dremel (subsequently exposed as BigQuery) as a purpose-built tool to allow the scanning of petabytes of data in seconds to answer ad hoc queries and, presumably, power compelling visualizations.

“Drill is complementary to MapReduce. [Just as] there are thousands of engineers at Google who use Dremel with MapReduce there will be lots of people who will use Drill with MapReduce,” said Tomer Shiran, director of product management at MapR, one of the driving forces behind this effort.

The initial goal of the Apache project is to establish common APIs and devise an architecture that will accommodate many data sources, data formats and query languages. Early committers to the project include several MapR Technologies employees, among them Shiran, Jason Frantz, Ted Dunning, MC Srivas, Keys Botzum and Gera Shegalov. Also aboard are Ryan Rawson of Drawn to Scale and Chris Wensel, CEO of Concurrent and author of the Cascading data workflow API.

Shiran said he expects more contributors to come aboard soon — including representatives from e-commerce and web 2.0 companies as well as MapR competitors. “People see what Google does with Dremel and they want to do it as well,” he said.

Backers say Drill will work with Hive, Pig and Cascading, systems that compile higher-level language requests into MapReduce jobs, and will make them faster as well. “Wensel who wrote Cascading is on and he’s really excited about Drill,” Shiran said. There’s more on the project here.

Making Hadoop data queries faster is a big theme these days. In some cases users want to analyze massive amounts of streaming data, a task suited for products like Storm or Nodeable’s StreamReduce (see disclosure). But for companies with existing data in HBase or Hadoop who want to query it really fast, something like Drill is the answer, Shiran said.

Disclosure: Nodeable is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, Giga Omni Media. Om Malik, the founder of Giga Omni Media, is also a venture partner at True.

Feature photo courtesy of Shutterstock user ARTSILENSEcom
