Apache Hudi 1.0 Released
Written by Kay Ewbank
Tuesday, 21 January 2025
Apache has released Hudi 1.0, described as a landmark achievement that defines what the next generation of data lakehouses should achieve. Hudi pioneered transactional data lakes in 2017.

The open source data lake technology for stream processing runs on top of Apache Hadoop and is supported as part of Amazon EMR by Amazon Web Services. The name Hudi stands for Hadoop Upserts Deletes and Incrementals, describing what the data lake technology can do. Upserts are operations that insert rows into a database table if they do not already exist, or update them if they do. Hudi enables stream processing on top of Apache Hadoop-compatible cloud stores and distributed file systems.

The project was originally developed at Uber in 2016 and was made open source, then submitted to the Apache Incubator in January 2019.

Apache Hudi can be used to manage petabyte-scale data lakes. Hudi data lakes provide fresh data while being an order of magnitude more efficient than traditional batch processing. Hudi provides upsert and delete support with fast, pluggable indexing, along with transactionally compliant commit and rollback. It supports the Apache Hive, Apache Spark, Apache Impala and Presto query engines, and has a built-in data ingestion tool that supports Apache Kafka, Apache Sqoop and other common data sources. Users can optimize query performance by managing file sizes and storage layout.

This 1.0 release adds more capabilities that are directly comparable with those of a DBMS. In previous versions, Hudi's indexing mechanisms delivered fast update performance, and the developers wanted to generalize such features across writers and queries. They also wanted to introduce new capabilities such as fast metastores for query planning, support for unstructured/multimodal data, and caching mechanisms that can be deeply integrated into open-source query engines. These features have been added to Hudi's storage engine layer. The new version also provides more of a database-like experience.
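To make the upsert semantics described above concrete, here is a minimal sketch in plain Python. It is not Hudi's API: the `upsert` function, the in-memory `table` dictionary and the `id` record key are all hypothetical names chosen for illustration, and a real table would be stored as files rather than held in memory.

```python
def upsert(table, records, key="id"):
    """Insert each record if its key is absent; otherwise overwrite the existing row.

    `table` is a plain dict mapping record key -> row (an illustrative stand-in
    for a table, not how Hudi stores data).
    Returns a tuple (inserted_count, updated_count).
    """
    inserted, updated = 0, 0
    for rec in records:
        k = rec[key]
        if k in table:
            updated += 1   # key already present: this is an update
        else:
            inserted += 1  # key not present: this is an insert
        table[k] = rec
    return inserted, updated


# One existing row, then a batch holding one update and one new row.
table = {1: {"id": 1, "city": "Oslo"}}
counts = upsert(table, [{"id": 1, "city": "Bergen"}, {"id": 2, "city": "Lima"}])
```

After the call, the row with key 1 has been overwritten and the row with key 2 has been added, so a single batch can carry both updates and inserts.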
Hudi was originally designed as a software library that can be embedded into different query and processing engines for reading, writing and managing tables. The new version adds a way to easily install and explore all of Hudi's functionality. The team plans to deliver the benefits of the storage engine to other table formats via interop standards defined in projects like Apache XTable.

This release also introduces Non-Blocking Concurrency Control (NBCC). This is a general-purpose concurrency model aimed at stream processing and at high-contention or frequent-writing scenarios. In contrast to Optimistic Concurrency Control, where writers abort the transaction if there is a hint of contention, NBCC allows multiple streaming writers to write to the same Hudi table without any overhead of conflict resolution.

Hudi 1 also introduces new indices that are designed to improve query performance through partition pruning and further data skipping. A new secondary index allows users to create indexes on columns that are not part of the record key columns in Hudi tables. It can be used to speed up queries with predicates on columns other than the record key columns.

There's also a new partition stats index that aggregates statistics at the partition level for the columns for which it is enabled. This helps in efficient partition pruning even for non-partition fields.

Finally, an expression index enables efficient queries on columns derived from expressions. It can collect stats on columns derived from expressions without materializing them, and can be used to speed up queries with filters containing such expressions.

Hudi 1 is available now.
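The idea behind pruning with partition-level statistics can be sketched in a few lines of Python. This is an illustration of the general technique only, under the assumption that the index keeps a minimum and maximum value per partition for an indexed column; `ColumnStats`, `prune_partitions` and the sample partitions are hypothetical and do not reflect Hudi's actual implementation or API.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Assumed per-partition statistics for one column (illustrative only)."""
    min_value: float
    max_value: float


def prune_partitions(stats, lo, hi):
    """Return the partitions whose [min, max] range can overlap lo <= value <= hi.

    Partitions whose range lies entirely outside the filter cannot contain
    matching rows, so a query can skip them without reading their files.
    """
    return [
        name
        for name, s in stats.items()
        if s.max_value >= lo and s.min_value <= hi
    ]


# Hypothetical stats for a non-partition column, keyed by partition name.
stats = {
    "2025-01-19": ColumnStats(5.0, 40.0),
    "2025-01-20": ColumnStats(55.0, 90.0),
    "2025-01-21": ColumnStats(30.0, 70.0),
}
kept = prune_partitions(stats, 50.0, 60.0)
```

With a filter of 50 to 60, the first partition is skipped because its maximum is 40, while the other two are kept because their ranges overlap the filter. The column used in the filter does not need to be the partition column, which matches the article's point about pruning on non-partition fields.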