Apache Hudi 1.0 Released

Written by Kay Ewbank

Tuesday, 21 January 2025

Apache has released Hudi 1.0, described as a landmark achievement that defines what the next generation of data lakehouses should achieve. Hudi pioneered transactional data lakes in 2017.

The open source data lake technology for stream processing runs on top of Apache Hadoop and is supported as part of Amazon EMR by Amazon Web Services.

The name Hudi stands for Hadoop Upserts Deletes and Incrementals, describing what the data lake technology can do. Upserts are operations that insert rows into a database table if they do not already exist, or update them if they do. Hudi enables stream processing on top of Apache Hadoop-compatible cloud stores and distributed file systems. The project was originally developed at Uber in 2016 and was made open source, and was then submitted to the Apache Incubator in January 2019.
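To make the upsert idea concrete, here is a minimal sketch in plain Python. It is not Hudi code: the `upsert` helper, the `id` key and the sample rows are all made up for illustration, and the "table" is simply a dictionary keyed by record key.

```python
def upsert(table, records, key="id"):
    """Insert each record if its key is new, otherwise overwrite the existing row.

    `table` is a plain dict standing in for a keyed table (an assumption for
    this sketch, not how Hudi stores data).
    """
    inserted, updated = 0, 0
    for record in records:
        record_key = record[key]
        if record_key in table:
            updated += 1   # key already present: treat as an update
        else:
            inserted += 1  # key not seen before: treat as an insert
        table[record_key] = record
    return inserted, updated


# One existing row, then a batch containing one update and one new row.
table = {1: {"id": 1, "city": "Leeds"}}
counts = upsert(table, [{"id": 1, "city": "York"}, {"id": 2, "city": "Bath"}])
```

The caller does not need to know in advance which records already exist; the key lookup decides between insert and update.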


Apache Hudi can be used to manage petabyte-scale data lakes. Hudi data lakes provide fresh data while being an order of magnitude more efficient than traditional batch processing.

Hudi provides upsert and delete support with fast, pluggable indexing, along with transactionally compliant commit and rollback. It supports Apache Hive, Apache Spark, Apache Impala and Presto query engines, and has a built-in data ingestion tool that supports Apache Kafka, Apache Sqoop and other common data sources. Users can optimize query performance by managing file sizes and storage layout.

This 1.0 release adds more capabilities that are directly comparable with those of a DBMS. In previous versions, Hudi's indexing mechanisms delivered fast update performance, and the developers wanted to generalize such features across writers and queries. They also wanted to introduce new capabilities such as fast metastores for query planning, support for unstructured/multimodal data, and caching mechanisms that can be deeply integrated into open-source query engines. These features have been added to Hudi's storage engine layer.

The new version also provides more of a database-like experience. Hudi was originally designed as a software library that can be embedded into different query and processing engines for reading, writing and managing tables. The new version adds a way to easily install and explore all of Hudi's functionality.

The team plans to deliver the benefits of the storage engine to other table formats via interop standards defined in projects like Apache XTable.

This release also introduces Non-Blocking Concurrency Control (NBCC). This is a general-purpose concurrency model aimed at stream processing and at high-contention or write-heavy scenarios. In contrast to Optimistic Concurrency Control, where writers abort the transaction if there is a hint of contention, this innovation allows multiple streaming writes to the same Hudi table without any overhead of conflict resolution.
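The difference between the two models can be illustrated with a toy simulation. This is a conceptual sketch only, not Hudi's implementation: the `Write` class, the `fg-1` file group name and the rule of ordering accepted writes by completion time are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Write:
    writer: str
    file_group: str
    start: int   # logical time the write began
    end: int     # logical time the write finished
    value: str


def overlaps(a, b):
    """Two writes contend if they touch the same file group at overlapping times."""
    return a.file_group == b.file_group and a.start < b.end and b.start < a.end


def commit_optimistic(writes):
    """Optimistic style: a write aborts if it overlaps an already committed write."""
    committed, aborted = [], []
    for write in sorted(writes, key=lambda w: w.end):
        if any(overlaps(write, done) for done in committed):
            aborted.append(write)
        else:
            committed.append(write)
    return committed, aborted


def commit_non_blocking(writes):
    """Non-blocking style: every write is accepted and ordered by completion time."""
    return sorted(writes, key=lambda w: w.end)


# Two streaming writers touch the same (hypothetical) file group concurrently.
writes = [
    Write("stream_a", "fg-1", start=1, end=5, value="a"),
    Write("stream_b", "fg-1", start=2, end=6, value="b"),
]
occ_committed, occ_aborted = commit_optimistic(writes)
nbcc_log = commit_non_blocking(writes)
```

In the optimistic run the second writer has to abort and retry, while in the non-blocking run both writes are kept, which is the property that matters for continuous streaming ingestion.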

Hudi 1.0 also introduces new indexes that are designed to improve query performance through partition pruning and further data skipping. A new secondary index allows users to create indexes on columns that are not part of the record key columns in Hudi tables. It can be used to speed up queries with predicates on columns other than record key columns. There's also a new partition stats index that aggregates statistics at the partition level for the columns for which it is enabled. This helps in efficient partition pruning even for non-partition fields. Finally, an expression index enables efficient queries on columns derived from expressions. It can collect stats on columns derived from expressions without materializing them, and can be used to speed up queries with filters containing such expressions.
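The ideas behind the secondary index and the partition stats index can be sketched in a few lines of plain Python. This is an illustration of the general techniques, not Hudi's API or storage format: the sample rows, the column names and the helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows; "key" plays the role of the record key.
rows = [
    {"key": "r1", "partition": "2025-01-01", "city": "Leeds", "fare": 12.0},
    {"key": "r2", "partition": "2025-01-01", "city": "York", "fare": 18.5},
    {"key": "r3", "partition": "2025-01-02", "city": "Leeds", "fare": 40.0},
    {"key": "r4", "partition": "2025-01-02", "city": "Bath", "fare": 55.0},
]


def build_secondary_index(rows, column):
    """Map each value of a non-key column to the record keys that hold it."""
    index = defaultdict(set)
    for row in rows:
        index[row[column]].add(row["key"])
    return index


def build_partition_stats(rows, column):
    """Collect (min, max) of a column for every partition."""
    stats = {}
    for row in rows:
        value = row[column]
        low, high = stats.get(row["partition"], (value, value))
        stats[row["partition"]] = (min(low, value), max(high, value))
    return stats


def prune_partitions(stats, lower_bound):
    """Keep only partitions whose max value can satisfy `column > lower_bound`."""
    return sorted(p for p, (_, high) in stats.items() if high > lower_bound)


city_index = build_secondary_index(rows, "city")      # lookup by non-key column
fare_stats = build_partition_stats(rows, "fare")      # per-partition min/max
candidates = prune_partitions(fare_stats, 30.0)       # partitions for fare > 30
```

A query filtering on `city` can go straight to the matching record keys, and a query filtering on `fare > 30` can skip the first partition entirely even though `fare` is not a partition field.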

Hudi 1.0 is available now.


More Information

Hudi Website

Apache Hudi Achieves Top Level Status
