Databricks Open Sources All Of Delta Lake
Written by Kay Ewbank   
Thursday, 07 July 2022

Databricks has now made all of Delta Lake open source, including all the APIs. The storage layer of the product was made open source in 2019. Delta Lake can be used to build data lakehouses, which enable data warehousing and machine learning directly on the data lake.

Delta Lake handles the stage where data is brought into an organization's data lake. It stores data in Apache Parquet format, and is designed for use in data lakes that are built on HDFS and cloud storage.

Databricks was founded by the original developers of Apache Spark and specializes in commercial technologies that make use of Spark. Delta Lake is a storage layer and associated table format built on top of Apache Spark, and until it was made open source it was only available as part of Databricks Delta, the company's proprietary stack.

Since the storage layer was made open source, the project has attracted over 190 contributors across more than 70 organizations. Nearly two-thirds of those contributors are from outside Databricks, at companies including Apple, IBM, Microsoft, Disney, Amazon, and eBay.

Delta Lake comes with standalone readers/writers that let any Python, Ruby, or Rust client write data directly to Delta Lake without requiring a big data engine such as Apache Spark, along with open-source connectors for engines including Apache Flink, Presto, and Trino. The open source announcement opens up capabilities that until now were only available in Databricks.
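A standalone client is feasible because a Delta table is, at heart, a directory of Parquet data files plus a `_delta_log` folder of JSON commit files that record which data files were added or removed. The following is a minimal sketch of that idea using only the Python standard library. It is not the API of any of the standalone libraries: the helpers `write_commit` and `active_files` are hypothetical names invented for illustration, the actions are stripped down to just a `path` field, no real Parquet files are written, and checkpoints and protocol checks are ignored.

```python
import json
import tempfile
from pathlib import Path


def write_commit(table_path, version, actions):
    """Hypothetical helper: write one simplified commit file to _delta_log."""
    log_dir = Path(table_path) / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    # Commit files are named by a zero-padded, 20-digit version number.
    commit = log_dir / f"{version:020d}.json"
    commit.write_text("\n".join(json.dumps(a) for a in actions) + "\n")


def active_files(table_path):
    """Hypothetical helper: replay the commits and return the live data files."""
    live = set()
    log_dir = Path(table_path) / "_delta_log"
    # Zero-padded names mean that name order is also commit order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)  # one JSON action per line
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)


with tempfile.TemporaryDirectory() as tmp:
    # Version 0 adds two files; version 1 replaces the first with a new one.
    write_commit(tmp, 0, [
        {"add": {"path": "part-0000.parquet"}},
        {"add": {"path": "part-0001.parquet"}},
    ])
    write_commit(tmp, 1, [
        {"remove": {"path": "part-0000.parquet"}},
        {"add": {"path": "part-0002.parquet"}},
    ])
    live_files = active_files(tmp)

print(live_files)
```

A real reader would then open only the files in `live_files` with a Parquet library, which is why no Spark cluster is needed just to read or append to a table.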

Delta Lake 2.0, the latest release of Delta Lake, adds support for Z-Ordering, Change Data Feed, dynamic partition overwrites, and dropping columns. Z-Ordering is a technique to colocate related information in the same set of files. This co-locality is used by Delta Lake in data-skipping algorithms, and the developers say it dramatically reduces the amount of data that Delta Lake on Apache Spark needs to read.
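To see why colocating related values helps data skipping, here is a small, self-contained sketch of the general Z-order (Morton code) idea. It is a toy model under stated assumptions, not Delta Lake's implementation: the "files" are just Python lists of `(x, y)` rows with per-column min/max statistics, and the names `interleave_bits`, `split_into_files`, and `files_to_read` are invented for illustration.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) value of two non-negative integers."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return z


def split_into_files(rows, rows_per_file):
    """Chunk already-sorted rows into 'files' with min/max stats per column."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        files.append({
            "rows": chunk,
            "min": tuple(min(r[c] for r in chunk) for c in range(2)),
            "max": tuple(max(r[c] for r in chunk) for c in range(2)),
        })
    return files


def files_to_read(files, col, value):
    """Count files whose min/max range for column `col` could hold `value`."""
    return sum(1 for f in files if f["min"][col] <= value <= f["max"][col])


# An 8 x 8 grid of (x, y) rows, stored 8 rows per file.
rows = [(x, y) for x in range(8) for y in range(8)]

# Layout 1: plain sort by x, then y.
linear = split_into_files(sorted(rows), 8)
# Layout 2: sort by the interleaved Z-order value of (x, y).
zorder = split_into_files(sorted(rows, key=lambda r: interleave_bits(*r)), 8)

linear_x = files_to_read(linear, 0, 3)  # filter x == 3
linear_y = files_to_read(linear, 1, 3)  # filter y == 3
z_x = files_to_read(zorder, 0, 3)
z_y = files_to_read(zorder, 1, 3)

print("linear sort:", linear_x, "files for x == 3,", linear_y, "files for y == 3")
print("z-order    :", z_x, "files for x == 3,", z_y, "files for y == 3")
```

In this toy layout the plain sort is ideal for filters on `x` (1 file) but useless for filters on `y` (all 8 files), while the Z-ordered layout needs 4 and 2 files respectively, so neither column forces a full scan.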

Delta Lake 2 is available now.

More Information

Databricks Website

Delta Website
