Apache Arrow 6 Improves Support For R and Rust
Monday, 22 November 2021

Apache Arrow 6 has been released with improvements to support for R and Rust as well as Arrow Flight. There's also new support for DataFusion.

Apache Arrow is a development platform for in-memory analytics. It has technologies that enable big data systems to process and move data fast..It is language independent, can be used for flat and hierarchical data, and the data store is organized for efficient analytic operations. It also provides computational libraries. Languages currently supported are C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.

arrow

The improvements to the new release start with the addition of bindings for Flight in GLib and Ruby. The team says that while SQL support for Flight hasn't made it into this release, work is ongoing. Arrow Flight SQL defines a protocol for clients to communicate with SQL databases using Arrow Flight.

In Arrow's compute layer, a basic in-memory query engine has been implemented and is accessible from the R bindings. The query engine supports operations including filter, project, sort, equality joins, and various aggregations. A wide range of functions have also been added in this version, and type support has been improved for most of the compute functions.

The support for R has been enhanced with a number of major new features in this version, some of which the team has been building up to for several years. In practical terms, there's more dplyr support, including the ability to carry out grouped aggregation. You can now summarise() on Arrow data, both with or without group_by(). These are supported both with in-memory Arrow tables as well as across partitioned datasets. Most common aggregation functions are supported. In addition to aggregation, Arrow now also supports all of dplyr’s mutating joins (inner, left, right, and full) and filtering joins (semi and anti).

The R team has also added support for DuckDB as a way to query Arrow Datasets. This means you can use duckdb’s dbplyr methods, as well as its SQL interface, to aggregate data.

Alongside the R improvements, there's new support for DataFusion. This is an embedded query engine that uses Rust and Apache Arrow to provide a system that the developers say is high performance, easy to connect, easy to embed, and high quality. This release includes a runtime operator metrics collection framework, and object store abstraction for unified access to local or remote storage. The framework includes Hive-style table partitioning support for Parquet, CSV, Avro and Json files, and DataFrame API support for: except, intersect, show, limit and window functions. It also has extensive SQL support, and now passes TPC-H queries 8, 13 and 21.

Apache Arrow 6 is available for download.

arrow 

More Information

Apache Arrow Website

Arrow On GitHub

Related Articles

Apache Arrow 5 Improves Asynchronous Scanner

Apache Arrow 4 Adds New C++ Compute Functions

Apache Arrow Improves C++ Support

Apache Arrow 2 Improves C++ and Rust Support

Apache Arrow Reaches 1.0

Apache Arrow Flight Released

Apache Arrow Adds DataFusion Rust-Native Engine

Apache Arrow Adds Streaming Binary Format

 

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.

Banner


Online Code Club For Kids
16/11/2021

Today the Raspberry Pi Foundation has launched Code Club  World, a free online platform where young people aged 9 to 13 can "learn to make stuff with code".



AWS BugBust Challenge Underway In World Guinness Record Attempt
30/11/2021

As part of its annual re:Invent conference, Amazon Web Services (AWS) is running a BugBust Challenge.  Java and Python developers of all skill levels, can compete to fix as many software bugs a [ ... ]


More News

square

 



 

Comments




or email your comment to: comments@i-programmer.info

Last Updated ( Monday, 22 November 2021 )