Apache Arrow 2 Improves C++ and Rust Support
Written by Kay Ewbank
Thursday, 29 October 2020
There's a new release of Apache Arrow with improvements to C++ and Rust support, particularly for Parquet.

Apache Arrow is a development platform for in-memory analytics. Its technologies enable big data systems to process and move data fast. It is language independent, can be used for flat and hierarchical data, and its data store is organized for efficient analytic operations. It also provides computational libraries. Languages currently supported are C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.

The major components of the project include the columnar in-memory format and an IPC format that provides a serialization of the Arrow format and associated metadata for communication between processes and heterogeneous environments. There's also the Arrow Flight RPC protocol, which provides a building block for remote services exchanging Arrow data with applications.

Improvements to the C++ support in this release start with Parquet handling. Nested data in Parquet is handled better, and you can read and write arbitrarily nested data, including extension types with a nested storage type. This work had the side effect of fixing several bugs in writing nested data and FixedSizeList. Parquet datasets can now also be written with partitions, including control over the accumulation of statistics for individual columns.

Other C++ improvements include new compute kernels for standard deviation, variance, and mode, and improvements to S3 support, including automatic region detection.

C# support has also been improved, with the addition of full support for Struct types and of synchronous write APIs for ArrowStreamWriter and ArrowFileWriter.

R support has been enhanced with the ability to write multi-file datasets with partitioning to Parquet or Feather. You can also now read and write directly to AWS S3.
The developers say that while the Java and C/C++ (used by Python and R) Arrow implementations will probably remain the most feature-rich, the Rust implementation is closing the feature gap quickly, and the 2.0 release includes many improvements to it. The Rust Arrow compute kernels have been improved, with new kernels added for string operations, including substring, min, max, concat, and length. Many kernels now support dictionary-encoded arrays and have been optimized for arrays without nulls, making them significantly faster in that case. Work on a Rust Parquet writer for Arrow data didn't make it into this release and is now planned for the 3.0.0 release.

The Rust component has also seen work on DataFusion, the in-memory query engine with DataFrame and SQL APIs built on top of the base Arrow support. DataFusion now has a richer DataFrame API, and queries scale better thanks to a switch to async/await with the tokio threaded runtime rather than launching dedicated threads. DataFusion also has improved scalar function support in both the SQL and DataFrame APIs, including string length, COUNT(DISTINCT column), IsNotNull, Min/Max for strings, arrays of columns, and string concatenation.
Last Updated ( Thursday, 29 October 2020 )