Apache Hudi Achieves Top Level Status
Written by Kay Ewbank   
Monday, 29 June 2020

Apache Hudi has graduated from the Apache Incubator to become a top-level Apache project. The open source data lake technology for stream processing on top of Apache Hadoop is already being used at organizations including Alibaba, Tencent, and Uber, and is supported as part of Amazon EMR by Amazon Web Services.

The name Hudi stands for Hadoop Upserts Deletes and Incrementals, describing what the data lake technology can do. An upsert is an operation that inserts a row into a database table if it does not already exist, or updates it if it does. Hudi enables stream processing on top of Apache Hadoop-compatible cloud stores and distributed file systems. The project was originally developed at Uber in 2016; it was later made open source and was submitted to the Apache Incubator in January 2019.
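To make the upsert and delete semantics concrete, here is a minimal sketch in plain Python. It is a toy model, not Hudi's API: the `upsert` and `delete` helpers, the `id` key field and the sample rows are all made up for illustration.

```python
# Toy model of upsert/delete semantics on a keyed table. Not Hudi's API.

def upsert(table, records, key="id"):
    """Insert each record if its key is new, otherwise overwrite the existing row."""
    for record in records:
        table[record[key]] = record
    return table


def delete(table, keys):
    """Remove rows by key, ignoring keys that are not present."""
    for k in keys:
        table.pop(k, None)
    return table


table = {}
upsert(table, [{"id": 1, "city": "Paris"}, {"id": 2, "city": "Oslo"}])
# Key 2 already exists, so it is updated; key 3 is new, so it is inserted.
upsert(table, [{"id": 2, "city": "Bergen"}, {"id": 3, "city": "Lima"}])
delete(table, [1])
```

After these calls the table holds keys 2 and 3, with key 2 carrying the updated value. A real data lake has to do the same thing across many files, which is where Hudi's indexing comes in.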


Apache Hudi can be used to manage petabyte-scale data lakes. Hudi data lakes provide fresh data while being an order of magnitude more efficient than traditional batch processing.

Hudi provides upsert and delete support with fast, pluggable indexing, along with transactionally compliant commit and rollback. It supports Apache Hive, Apache Spark, Apache Impala and Presto query engines, and has a built-in data ingestion tool that supports Apache Kafka, Apache Sqoop and other common data sources. Users can optimize query performance by managing file sizes and storage layout.

Hudi supports three types of queries: snapshot, incremental and read optimized. Hudi snapshot queries give a view of real-time data using a combination of columnar and row-based storage such as Parquet and Avro. Its incremental queries provide a change stream with records inserted or updated after a point in time, while the read optimized queries are essentially snapshot queries offering faster performance on purely columnar storage such as Parquet.
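The difference between snapshot and read optimized queries can be sketched with a toy model. This is not Hudi code: it assumes that recent updates land first in row-based log records and are only later folded into the columnar base data, and the names `base_file`, `log_records`, `snapshot_query` and `read_optimized_query` are hypothetical.

```python
# Toy model only: plain dicts stand in for columnar base files and row-based logs.

base_file = {  # compacted columnar data (think Parquet)
    "a": {"key": "a", "value": 1},
    "b": {"key": "b", "value": 2},
}
log_records = [  # recent row-based updates not yet compacted (think Avro)
    {"key": "b", "value": 20},
    {"key": "c", "value": 3},
]


def read_optimized_query(base):
    """Read only the columnar base data: cheap, but blind to pending log records."""
    return dict(base)


def snapshot_query(base, log):
    """Merge pending log records over the base data to get the latest view."""
    merged = dict(base)
    for record in log:
        merged[record["key"]] = record
    return merged


latest = snapshot_query(base_file, log_records)
fast = read_optimized_query(base_file)
```

Here `latest` sees the updated value for key `b` and the new key `c`, while `fast` returns only what is already in the columnar data. That is the trade-off between freshness and scan speed in this simplified picture.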

According to Uber, Hudi is conceptually divided into three main components: the raw data that needs to be stored, the data indexes that are used to provide upsert capability, and the metadata used to manage the dataset. Hudi maintains a timeline of all actions performed on the table at different points in time, referred to as instants in Hudi. This means users can get instantaneous views of the table, while also efficiently supporting retrieval of data in the order of arrival. Hudi guarantees that the actions performed on the timeline are atomic and consistent based on the time at which the change was made in the database. With this information, Hudi provides different views of the same Hudi table, including a read-optimized view for fast columnar performance, a real-time view for fast data ingestion, and an incremental view to read Hudi tables as a stream of changelogs.
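The timeline idea can be illustrated with a small sketch, again as a toy model rather than Hudi's implementation. The `Timeline` class, its method names and the timestamp-style instant strings are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Timeline:
    """Toy timeline: an ordered list of (instant, changes) commits. Not Hudi's API."""
    commits: list = field(default_factory=list)

    def commit(self, instant, changes):
        # Each commit is appended as a whole, so readers see all of it or none of it.
        if self.commits and instant <= self.commits[-1][0]:
            raise ValueError("instants must be strictly increasing")
        self.commits.append((instant, dict(changes)))

    def snapshot(self, as_of=None):
        """Table state after replaying every commit up to and including as_of."""
        state = {}
        for instant, changes in self.commits:
            if as_of is not None and instant > as_of:
                break
            state.update(changes)
        return state

    def incremental(self, after):
        """Changes committed strictly after the given instant, in arrival order."""
        return [(i, c) for i, c in self.commits if i > after]


timeline = Timeline()
# Fixed-width timestamp strings compare correctly as plain strings.
timeline.commit("20200629090000", {"trip-1": "requested"})
timeline.commit("20200629090500", {"trip-1": "completed", "trip-2": "requested"})

current = timeline.snapshot()
earlier = timeline.snapshot(as_of="20200629090000")
changes = timeline.incremental(after="20200629090000")
```

In this sketch `current` reflects both commits, `earlier` shows the table as it stood at the first instant, and `changes` contains only the second commit. A downstream job that remembers the last instant it processed could use that last result as a changelog.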



More Information

Hudi Website

Related Articles

SQL Server 2019 Includes Hadoop And Spark

Hortonworks Plans To Take Hadoop Cloud Native

Hadoop 3 Adds HDFS Erasure Coding

Hadoop 2.9 Adds Resource Estimator

Hadoop Adds In-Memory Caching

Hadoop SQL Query Engine Launched

Hadoop 2 Introduces YARN 
