Apache Beam Moves To Java 8
Written by Kay Ewbank   
Wednesday, 28 February 2018

Apache Beam, the open source programming SDK for defining batch and streaming data-parallel processing pipelines, is now available in a new version that moves to Java 8 and Spark 2.x.

 

Apache Beam provides a number of SDKs that you can use to build a program that defines a pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Beam began life at Google as the programming model behind the Google Cloud Dataflow (GCD) service, and it uses the same API as GCD.

The latest version makes Java 8 its supported Java version, and the code and examples in Beam have been reworked to take advantage of the improvements in Java 8, such as lambdas, streams, and improved type inference.

Beam's Spark runner has also been updated to the Spark 2.x development line to improve performance and for future compatibility with the Structured Streaming APIs. The Beam Pipeline Runners translate the data processing pipeline you define with your Beam program into the API compatible with the distributed processing back-end of your choice.
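The idea of one pipeline definition being handed to different runners can be illustrated with a toy sketch in plain Python. This is not the Beam API: the `Pipeline` type and the functions `run_eagerly` and `run_lazily` are hypothetical names invented for illustration, and real runners translate to distributed engines rather than to local loops.

```python
from typing import Callable, Iterable, List

# Hypothetical pipeline description: an ordered list of per-element functions.
# This stands in for a pipeline definition; it is not Beam's Pipeline class.
Pipeline = List[Callable[[int], int]]


def run_eagerly(pipeline: Pipeline, data: Iterable[int]) -> List[int]:
    """Toy runner 1: materialize the full result of each step before the next."""
    items = list(data)
    for step in pipeline:
        items = [step(x) for x in items]
    return items


def run_lazily(pipeline: Pipeline, data: Iterable[int]) -> List[int]:
    """Toy runner 2: chain iterators so each element flows through all steps."""
    stream = iter(data)
    for step in pipeline:
        stream = map(step, stream)
    return list(stream)


# The same definition is executed by two different "back-ends".
pipeline = [lambda x: x + 1, lambda x: x * 10]
eager_result = run_eagerly(pipeline, [1, 2, 3])
lazy_result = run_lazily(pipeline, [1, 2, 3])
```

Both toy runners should give the same answer for the same definition, which is the property that lets a Beam program stay unchanged when the back-end is swapped.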

Support for AWS S3 has also been improved. In previous versions, AWS S3 was supported via the HadoopFileSystem, but the new release adds native support for S3, which improves performance.

The final improvement of note is the addition of the Splittable DoFn API for the Python SDK, and Splittable DoFn support for the Python streaming DirectRunner.

Splittable DoFn Example

 

DoFn is a Beam SDK class that defines a distributed processing function. The DoFn object contains the processing logic that gets applied to the elements in the input collection. It processes one element at a time. Splittable DoFn is a generalization of DoFn that can be used to develop more powerful IO connectors than before, with shorter, simpler, more reusable code.
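To make the distinction more concrete, here is a minimal plain-Python sketch that deliberately avoids the Beam SDK. The names `OffsetRange`, `process_element`, and `read_range` are hypothetical and chosen only for illustration. The sketch assumes the work for one input (a list of records) can be described by an offset range, and that splitting the range yields pieces that can be processed independently.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical restriction: a half-open range [start, stop) of work."""
    start: int
    stop: int

    def split(self, at: int) -> Tuple["OffsetRange", "OffsetRange"]:
        # Divide the work into two independent pieces at the given offset.
        if not self.start <= at <= self.stop:
            raise ValueError("split point outside range")
        return OffsetRange(self.start, at), OffsetRange(at, self.stop)


def process_element(word: str) -> Iterator[str]:
    """DoFn-style logic: called once per element, may emit zero or more outputs."""
    if word:
        yield word.upper()


def read_range(records: List[str], restriction: OffsetRange) -> Iterator[str]:
    """Splittable-style logic: handle only the part of the input that the
    restriction covers, so different ranges can be processed separately."""
    for i in range(restriction.start, restriction.stop):
        yield from process_element(records[i])


records = ["beam", "", "spark", "flink"]
whole = OffsetRange(0, len(records))
left, right = whole.split(2)

# Processing the whole range, or the two halves separately, gives the same output.
unsplit = list(read_range(records, whole))
combined = list(read_range(records, left)) + list(read_range(records, right))
```

In this toy model the two halves could be handed to different workers. That is the property that makes the splittable form attractive for IO connectors reading large sources, although the real API involves more machinery than is shown here.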


More Information

Beam Website

Related Articles

Apache Beam Moves To Top Level

Apache Spark 2.0 Released

Flink Gets Event-time Streaming

Google Announces Big Data the Cloud Way

 


Last Updated ( Wednesday, 28 February 2018 )