Apache Druid Improves Compaction

Written by Kay Ewbank

Tuesday, 04 February 2020

Apache Druid, a high-performance real-time analytics database designed for workflows where fast queries and ingestion really matter, has been updated with improvements including better compaction and batch ingestion.

Druid, which recently graduated from the Apache Incubator to become a top-level Apache project, is:

designed to excel at instant data visibility, ad-hoc queries, operational analytics, and handling high concurrency, and provides an open source alternative to data warehouses.

It was originally developed at a startup called Metamarkets to power an all-in-one analytics solution for programmatic digital advertising. Ad-tech is an area that generates hundreds of billions or even trillions of new records per day, and Druid was developed to cope with this volume of data. It has since been extended to situations that aren’t adequately addressed by classic analytics stacks. Druid is used in application areas including network flow analytics, product analytics, and user behavior. Its users include major companies such as NTT, WalkMe, Pinterest, Netflix, Airbnb, Lyft, and Walmart.

Druid can natively stream data from message buses such as Kafka and Amazon Kinesis, and batch load files from data lakes such as HDFS and Amazon S3. Along with support for column-oriented storage, Druid also incorporates designs from search systems and timeseries databases.

The developers say Druid is better than traditional data warehouses because it has much lower latency for OLAP-style queries and for data ingest (both streaming and batch). Its support for time-based partitioning means time-based queries can be run efficiently, and its fast search and filtering make slicing and dicing data quick. This makes it a good fit for real-time analytics and for situations where the end user (technical or not) wants to run numerous queries in rapid succession to explore or better understand data trends.
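
To see why time-based partitioning helps time-based queries, here is a minimal Python sketch. It is not Druid code: the per-day chunking and the names `day_bucket`, `build_chunks` and `query_sum` are assumptions made up for illustration. The idea shown is simply that a query with a time interval can skip every chunk that doesn't overlap that interval.

```python
from datetime import datetime, timedelta


def day_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its day (hypothetical chunk key)."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def build_chunks(rows):
    """Group (timestamp, value) rows into per-day chunks."""
    chunks = {}
    for ts, value in rows:
        chunks.setdefault(day_bucket(ts), []).append((ts, value))
    return chunks


def query_sum(chunks, start, end):
    """Sum values with start <= ts < end, reading only overlapping chunks.

    Returns (total, number_of_chunks_scanned).
    """
    total = 0
    scanned = 0
    for chunk_start, chunk_rows in chunks.items():
        chunk_end = chunk_start + timedelta(days=1)
        if chunk_end <= start or chunk_start >= end:
            continue  # chunk lies wholly outside the interval, so skip it
        scanned += 1
        total += sum(v for ts, v in chunk_rows if start <= ts < end)
    return total, scanned


# Illustrative data only: five events spread over four days.
rows = [
    (datetime(2020, 2, 1, 9), 10),
    (datetime(2020, 2, 1, 17), 5),
    (datetime(2020, 2, 2, 8), 7),
    (datetime(2020, 2, 3, 12), 3),
    (datetime(2020, 2, 4, 23), 4),
]
chunks = build_chunks(rows)
total, scanned = query_sum(chunks, datetime(2020, 2, 2), datetime(2020, 2, 4))
print(total, scanned)
```

In this toy run only two of the four daily chunks are read, so the work scales with the queried interval rather than with the whole data set.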

The latest release includes an update to the native batch ingestion system. The internal framework now supports non-text binary formats, with initial support for ORC and Parquet. Single dimension range partitioning for parallel native batch ingestion has also been added, meaning it is now possible to carry out range-based partitioning on a single dimension.
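
As a rough illustration of what range-based partitioning on a single dimension means, the following Python sketch picks boundary values for one dimension and assigns each row to the range its value falls into. It is a simplified stand-in rather than Druid's implementation, and the function names and the `country` dimension are hypothetical.

```python
from bisect import bisect_right


def range_boundaries(values, num_partitions):
    """Pick split points so each range holds roughly the same number of rows."""
    ordered = sorted(values)
    if not ordered:
        return []
    step = len(ordered) / num_partitions
    boundaries = []
    for i in range(1, num_partitions):
        candidate = ordered[int(i * step)]
        # Skip repeated boundaries so heavily duplicated values don't
        # create empty ranges.
        if not boundaries or candidate > boundaries[-1]:
            boundaries.append(candidate)
    return boundaries


def partition_for(value, boundaries):
    """Return the index of the range that contains value."""
    return bisect_right(boundaries, value)


def partition_rows(rows, dimension, num_partitions):
    """Split rows into contiguous ranges of a single dimension."""
    boundaries = range_boundaries([r[dimension] for r in rows], num_partitions)
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        partitions[partition_for(row[dimension], boundaries)].append(row)
    return boundaries, partitions


# Illustrative rows only; "country" is a made-up dimension name.
rows = [{"country": c} for c in ["US", "DE", "BR", "JP", "FR", "CA"]]
boundaries, partitions = partition_rows(rows, "country", 3)
print(boundaries)
for p in partitions:
    print(sorted(r["country"] for r in p))
```

Because each partition covers a contiguous range of the chosen dimension, a filter on that dimension only needs to touch the partitions whose range could contain matching values.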

Compaction improvements start with support for parallel index task split hints, meaning operators can provide hints to control the amount of data that each first phase subtask reads. Parallel and stateful auto-compaction support has been added, and the Druid broker can now opportunistically merge query results in parallel using multiple threads.
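
The effect of a split hint can be pictured as a byte budget applied when input is divided among subtasks. The Python sketch below assumes a simple greedy packing strategy; the function `plan_splits`, its `max_split_bytes` parameter and the file names are hypothetical and are not taken from Druid's code or configuration.

```python
def plan_splits(files, max_split_bytes):
    """Greedily pack (name, size_bytes) inputs into splits of limited size.

    An input larger than the budget still gets a split of its own.
    """
    splits = []
    current, current_bytes = [], 0
    for name, size in files:
        # Close the current split if adding this input would exceed the budget.
        if current and current_bytes + size > max_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


# Illustrative inputs only: names and sizes are invented.
files = [("seg-a", 300), ("seg-b", 250), ("seg-c", 500), ("seg-d", 900), ("seg-e", 100)]
splits = plan_splits(files, max_split_bytes=600)
print(splits)
```

A smaller budget yields more, lighter subtasks that can run in parallel, while a larger one yields fewer subtasks that each read more data.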

Last Updated ( Tuesday, 04 February 2020 )
