Databricks Delta Lake Now Open Source
Written by Kay Ewbank   
Friday, 26 April 2019

At the Spark + AI Summit taking place this week in San Francisco, Databricks announced that it has open sourced its Delta Lake storage layer, which handles the stage where data is brought into an organization's data lake.

Databricks was founded by the original developers of Apache Spark and specializes in commercial technologies that make use of Spark. Until now, Delta Lake has been part of Databricks Delta, the company's proprietary stack. Databricks Delta is a unified analytics engine and associated table format built on top of Apache Spark.

 


Delta Lake is a storage layer that stores data in Apache Parquet format. It is designed for use in data lakes that are built on HDFS and cloud storage.

Data lakes are used to store both structured and unstructured data, but the data can be unreliable because of problems such as schema mismatches and a lack of consistency enforcement. Data can be missing from some columns, and inconsistencies can creep in when schemas are changed in some parts of a pipeline but not in others.

Databricks Delta keeps closer control over the schemas in different parts of the data lake, validating that schema changes are replicated throughout the pipeline. Missing columns of data are correctly set to null, and data definition language (DDL) is used to add new columns and update schemas.
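To make the idea of schema enforcement concrete, here is a minimal sketch in plain Python. It is a toy model and not Delta Lake's API: the names enforce_schema, add_column and SchemaMismatchError are hypothetical, and the schema is assumed to be a simple mapping from column name to Python type. It shows missing columns being set to null (None), mismatched records being rejected, and new columns being added only through an explicit schema change.

```python
from typing import Any, Dict

# Hypothetical schema representation: column name -> expected Python type.
Schema = Dict[str, type]


class SchemaMismatchError(ValueError):
    """Raised when a record does not fit the table schema."""


def enforce_schema(record: Dict[str, Any], schema: Schema) -> Dict[str, Any]:
    """Return a copy of record that conforms to schema, or raise."""
    unknown = set(record) - set(schema)
    if unknown:
        raise SchemaMismatchError(f"unexpected columns: {sorted(unknown)}")
    conformed = {}
    for column, expected_type in schema.items():
        value = record.get(column)  # a missing column becomes None (null)
        if value is not None and not isinstance(value, expected_type):
            raise SchemaMismatchError(
                f"column {column!r} expects {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
        conformed[column] = value
    return conformed


def add_column(schema: Schema, name: str, column_type: type) -> Schema:
    """Explicit schema change, loosely analogous to a DDL ADD COLUMN."""
    if name in schema:
        raise SchemaMismatchError(f"column {name!r} already exists")
    return {**schema, name: column_type}
```

With a schema of {"id": int, "name": str}, a record that only supplies "id" comes back with "name" set to None, while a record carrying an undeclared "email" column is rejected until add_column has been used to extend the schema.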

These features, together with optimistic concurrency control between writes and snapshot isolation for consistent reads during writes, mean that Delta Lake offers ACID transaction support. Delta Lake also uses snapshots to provide data versioning, which allows rollbacks and lets earlier reports be reproduced. Schema enforcement is available as an option, and all data in Delta Lake is stored in Apache Parquet, a popular format for storing and working with large datasets.
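The combination of optimistic concurrency control and versioned snapshots can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not how Delta Lake is implemented: ToyTable and CommitConflict are hypothetical names, the table is assumed to be append-only, and any intervening commit is treated as a conflict, which is stricter than a real system needs to be.

```python
from typing import List, Optional


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyTable:
    """In-memory stand-in for a table backed by an ordered commit log."""

    def __init__(self) -> None:
        # Each log entry is one atomic commit: rows that were appended together.
        self._log: List[List[dict]] = []

    @property
    def version(self) -> int:
        """Number of commits so far; version 0 is the empty table."""
        return len(self._log)

    def commit(self, rows: List[dict], read_version: int) -> int:
        """Append rows only if nobody has committed since read_version."""
        if read_version != self.version:
            raise CommitConflict(
                f"table is at version {self.version}, writer read {read_version}"
            )
        self._log.append(list(rows))
        return self.version

    def snapshot(self, version: Optional[int] = None) -> List[dict]:
        """Return the rows visible at a version (the latest by default)."""
        if version is None:
            version = self.version
        if not 0 <= version <= self.version:
            raise ValueError(f"no such version: {version}")
        return [row for commit in self._log[:version] for row in commit]


table = ToyTable()
seen = table.version                        # two writers both read version 0
table.commit([{"id": 1}], read_version=seen)        # the first writer succeeds
try:
    table.commit([{"id": 2}], read_version=seen)    # the second writer is stale
except CommitConflict:
    # Optimistic approach: re-read the latest version and try again.
    table.commit([{"id": 2}], read_version=table.version)
```

Writers take no locks; a stale writer simply fails and retries. Because commits are never modified once logged, table.snapshot(1) keeps returning the same rows however many commits follow, which is the property that makes rollbacks and reproducible reports possible in this model.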
 

Another advantage Delta Lake offers is that you can develop and debug data pipelines locally on your desktop or laptop machine. Delta Lake uses the Spark engine to handle the metadata of the data lake, and is compatible with the Apache Spark APIs.

Databricks says Delta is 10 to 100 times faster than Apache Spark on Parquet. It has been designed for both batch and stream processing, and can be used for pipeline development, data management, and query serving.

Now that Delta Lake is open source, Databricks is open to contributions from outside the company. 
 
 
 

More Information

Databricks Website

Delta Website


 


Last Updated ( Friday, 26 April 2019 )