Perform Data Queries Faster With Drill
Written by Kay Ewbank   
Friday, 24 August 2012

Drill, a new distributed system for interactive analysis of large-scale datasets inspired by Google's Dremel, has been accepted into the Apache Incubator.

The main attraction of Dremel, the query system used for Google’s BigQuery analytics, is the ability to store and search trillion-row datasets without the need to use Hadoop.

While Hadoop is very efficient when using the MapReduce framework to perform batch analysis, the batch nature of the work makes Hadoop unsuitable for analysing transactional data.

Drill, by comparison, can perform data queries at a much faster rate. The team behind Drill say that it is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to process nested data efficiently, and its design goals are to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds.

According to its Apache Incubator proposal, which is being championed by Ted Dunning, Drill, like Dremel, supports a nested data model, with data encoded in formats such as JSON, Avro or Protocol Buffers.

The proposal points out that in many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data.
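
To make the point about normalization concrete, here is a minimal Python sketch. It is not taken from the Drill proposal: the record layout and the field names (order_id, customer, items) are invented for illustration. It computes the same total twice, once by walking a nested record directly and once after splitting the record into flat tables that have to be joined back together.

```python
import json

# Hypothetical nested record, as it might arrive in a JSON log.
# All field names here are invented for illustration.
raw = """
{"order_id": 1,
 "customer": {"name": "Ada", "country": "UK"},
 "items": [{"sku": "A1", "qty": 2, "price": 5.0},
           {"sku": "B7", "qty": 1, "price": 12.5}]}
"""
order = json.loads(raw)


def nested_total(order):
    """Compute the order total by walking the nested structure directly."""
    return sum(item["qty"] * item["price"] for item in order["items"])


def normalize(order):
    """Split the nested record into two flat, relational-style tables."""
    orders = [{
        "order_id": order["order_id"],
        "customer_name": order["customer"]["name"],
        "customer_country": order["customer"]["country"],
    }]
    # Each item row carries the order_id as a foreign key.
    items = [{"order_id": order["order_id"], **item} for item in order["items"]]
    return orders, items


def normalized_total(items, order_id):
    """After normalization, the same total needs a lookup on order_id."""
    return sum(i["qty"] * i["price"] for i in items if i["order_id"] == order_id)


orders_table, items_table = normalize(order)
total_nested = nested_total(order)
total_flat = normalized_total(items_table, 1)
print(total_nested, total_flat)
```

Both approaches give the same answer, but the nested version skips the conversion step and the join, which is the saving the proposal describes.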

The Drill architecture consists of four key components or layers:

The query languages layer is responsible for parsing the user’s query and constructing an execution plan. The initial goal is to support DrQL, the SQL-like language used by Dremel and Google BigQuery. Drill is also designed to support other languages and programming models, such as the Mongo Query Language, Cascading and Plume.

The low-latency distributed execution engine is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill’s execution engine is based on research in distributed execution engines such as Dremel, Dryad, Hyracks, CIEL and Stratosphere, and on columnar storage.

The nested data formats layer is responsible for supporting various data formats, with the initial goal of supporting the column-based format used by Dremel.

A distinguishing feature of Drill is that its execution engine is flexible enough to support column-based processing as well as row-based processing. This is important because column-based processing can be much more efficient when the data is stored in a column-based format, but many large data assets are stored in a row-based format that would otherwise require conversion before use.

The scalable data sources layer is responsible for supporting various data sources, starting with Hadoop.
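
The difference between row-based and column-based processing mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dataset and its field names (user, country, bytes) are invented, and the helper functions are hypothetical rather than part of Drill. The sketch shows that an aggregate over one field touches every full record in a row-based layout, but only a single list in a column-based layout, and that moving from one layout to the other requires an explicit conversion pass.

```python
# Hypothetical row-based dataset; field names are invented for illustration.
rows = [
    {"user": "a", "country": "UK", "bytes": 120},
    {"user": "b", "country": "US", "bytes": 300},
    {"user": "c", "country": "UK", "bytes": 80},
]


def to_columns(rows):
    """Convert row-based records into a column-based layout.

    Assumes every row has the same fields; produces one list per field.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_based(rows, field):
    """Row-based scan: each whole record is visited to read one field."""
    return sum(row[field] for row in rows)


def sum_column_based(columns, field):
    """Column-based scan: only the one column that is needed is read."""
    return sum(columns[field])


columns = to_columns(rows)
row_sum = sum_row_based(rows, "bytes")
col_sum = sum_column_based(columns, "bytes")
print(row_sum, col_sum)
```

The to_columns step stands in for the conversion cost the article refers to: an engine that can only work on columns has to pay it up front for row-based data, whereas one that handles both layouts can skip it.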

 

More Information

Drill Proposal

Related Articles

Real-time Hadoop Analysis

New MinuteSort Record Set by Microsoft Research

SQL Server 2012 and Second Preview for Hadoop for Azure

 


Last Updated ( Friday, 24 August 2012 )