Apache Hudi 1.0 Released
Written by Kay Ewbank   
Tuesday, 21 January 2025

Apache has released Hudi 1.0, described as a landmark achievement that defines what the next generation of data lakehouses should achieve. Hudi pioneered transactional data lakes in 2017. 

The open source data lake technology for stream processing runs on top of Apache Hadoop and is supported as part of Amazon EMR by Amazon Web Services.

The name Hudi stands for Hadoop Upserts Deletes and Incrementals, describing what the data lake technology can do. Upserts are operations that insert rows into a database table if they do not already exist, or update them if they do. Hudi enables stream processing on top of Apache Hadoop-compatible cloud stores and distributed file systems. The project was originally developed at Uber in 2016; it was later made open source and was submitted to the Apache Incubator in January 2019.
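To make the upsert semantics described above concrete, here is a minimal sketch in plain Python. It is not Hudi's API: the `upsert` function, the dictionary-backed table and the `id` key field are all hypothetical names chosen for illustration.

```python
def upsert(table, records, key="id"):
    """Insert each record, or update the existing row that has the same key.

    `table` is a plain dict keyed by record key; this is a toy stand-in
    for a real table, not how Hudi stores data.
    """
    for record in records:
        existing = table.get(record[key], {})
        # Fields in the incoming record overwrite fields in the existing row.
        table[record[key]] = {**existing, **record}
    return table


table = {1: {"id": 1, "city": "Austin", "rides": 3}}
incoming = [
    {"id": 1, "city": "Boston"},                 # key exists: update
    {"id": 2, "city": "Chicago", "rides": 1},    # key is new: insert
]
upsert(table, incoming)
```

After the call, row 1 has its city updated while keeping its other fields, and row 2 has been inserted.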

Apache Hudi can be used to manage petabyte-scale data lakes. Hudi data lakes provide fresh data while being an order of magnitude more efficient than traditional batch processing.

Hudi provides upsert and delete support with fast, pluggable indexing, along with transactionally compliant commit and rollback. It supports Apache Hive, Apache Spark, Apache Impala and Presto query engines, and has a built-in data ingestion tool that supports Apache Kafka, Apache Sqoop and other common data sources. Users can optimize query performance by managing file sizes and storage layout.

This 1.0 release adds more capabilities that are directly comparable with those of a DBMS. In previous versions, Hudi's indexing mechanisms delivered fast update performance, and the developers wanted to generalize such features across writers and queries. They also wanted to introduce new capabilities like fast metastores for query planning, support for unstructured/multimodal data and caching mechanisms that can be deeply integrated into open-source query engines. These features have been added to Hudi's storage engine layer.

The new version also provides more of a database-like experience. Hudi was originally designed as a software library that can be embedded into different query and processing engines for reading, writing and managing tables. The new version adds a way to easily install and explore all of Hudi's functionality. 

The team plans to deliver the benefits of the storage engine to other table formats via interop standards defined in projects like Apache XTable.

This release also introduces Non-Blocking Concurrency Control (NBCC). This is a general-purpose concurrency model aimed at stream processing and at scenarios with high contention or frequent writes. In contrast to Optimistic Concurrency Control, where writers abort the transaction if there is a hint of contention, this innovation allows multiple streaming writes to the same Hudi table without any overhead of conflict resolution.

Hudi 1.0 also introduces new indexes that are designed to improve query performance through partition pruning and further data skipping. A new secondary index allows users to create indexes on columns that are not part of the record key columns in Hudi tables. It can be used to speed up queries with predicates on columns other than record key columns. There's also a new partition stats index that aggregates statistics at the partition level for the columns for which it is enabled. This helps with efficient partition pruning even for non-partition fields. Finally, an expression index enables efficient queries on columns derived from expressions. It can collect stats on columns derived from expressions without materializing them, and can be used to speed up queries with filters containing such expressions.
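The idea behind pruning with partition-level statistics can be sketched in a few lines of Python. This is a simplified illustration and not Hudi's implementation: the `PartitionStats` class, the `prune_partitions` function and the sample values are assumptions, and the sketch supposes the index keeps only a minimum and maximum per partition for one column.

```python
from dataclasses import dataclass


@dataclass
class PartitionStats:
    """Hypothetical per-partition min/max statistics for a single column."""
    partition: str
    min_value: int
    max_value: int


def prune_partitions(stats, lower, upper):
    """Keep only partitions whose [min, max] range overlaps [lower, upper].

    Partitions that cannot contain a matching value are skipped without
    reading any of their data files.
    """
    return [
        s.partition
        for s in stats
        if s.max_value >= lower and s.min_value <= upper
    ]


stats = [
    PartitionStats("2025-01-19", 10, 40),
    PartitionStats("2025-01-20", 35, 90),
    PartitionStats("2025-01-21", 95, 120),
]

# Query predicate: column value BETWEEN 50 AND 100
candidates = prune_partitions(stats, 50, 100)
```

Here the first partition is skipped because its maximum value is below the lower bound of the filter, so only the other two need to be scanned, even though the filtered column is not the partition field.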

Hudi 1 is available now.

More Information

Hudi Website

Related Articles

Apache Hudi Achieves Top Level Status

 
