Apache Beam Moves To Java 8
Written by Kay Ewbank   
Wednesday, 28 February 2018

Apache Beam, the open source programming SDK for defining batch and streaming data-parallel processing pipelines, is now available in a new version that moves to Java 8 and Spark 2.x.

 

Apache Beam has a number of SDKs that you can use to build a program that defines a pipeline. The pipeline is then executed by one of Beam's supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Beam began life at Google, where it is used in the Google Cloud Dataflow (GCD) service, and it uses the same API as GCD.

The latest version now uses Java 8 as its supported Java version, and the code and examples in Beam have been reworked to take advantage of Java 8 features such as lambdas, streams, and improved type inference.
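To give a feel for the kind of simplification Java 8 makes possible, here is a minimal sketch in plain Java. It does not use the Beam SDK, and the class and method names are hypothetical, chosen only for illustration. The same per-element transformation is written first with an anonymous class and an explicit loop, then with a lambda in a stream pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration only: plain Java, not the Beam API.
class Java8Sketch {

    // Verbose style: an anonymous class plus an explicit loop.
    static List<Integer> lengthsOldStyle(List<String> words) {
        Function<String, Integer> length = new Function<String, Integer>() {
            @Override
            public Integer apply(String word) {
                return word.length();
            }
        };
        List<Integer> result = new ArrayList<>();
        for (String word : words) {
            result.add(length.apply(word));
        }
        return result;
    }

    // Java 8 style: a lambda inside a stream pipeline, with the
    // element types inferred by the compiler.
    static List<Integer> lengthsWithLambda(List<String> words) {
        return words.stream()
                .map(word -> word.length())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("beam", "spark", "flink");
        System.out.println(lengthsOldStyle(words));   // prints [4, 5, 5]
        System.out.println(lengthsWithLambda(words)); // prints [4, 5, 5]
    }
}
```

Both methods return the same result. The second simply states the transformation without the surrounding boilerplate.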

Beam's Spark runner has also been updated to the Spark 2.x development line, both to improve performance and to prepare for future compatibility with the Structured Streaming APIs. The Beam Pipeline Runners translate the data processing pipeline you define with your Beam program into an API compatible with the distributed processing back-end of your choice.

Support for AWS S3 has also been improved. In previous versions, AWS S3 was supported via the HadoopFileSystem, but the new release adds native support for S3, which improves performance.

The final improvement of note is the addition of the Splittable DoFn API to the Python SDK, along with Splittable DoFn support in the Python streaming DirectRunner.

DoFn is a Beam SDK class that defines a distributed processing function. The DoFn object contains the processing logic that gets applied to the elements in the input collection. It processes one element at a time. Splittable DoFn is a generalization of DoFn that can be used to develop more powerful IO connectors than before, with shorter, simpler, more reusable code.
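The idea behind splitting can be sketched in plain Java. This is a conceptual model only, assuming that the work for a single element can be described by a range of positions that is divided into independent sub-ranges. The types and methods below (OffsetRange, split, processRange, processSplit) are hypothetical and are not the Beam API.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch only: these types are hypothetical, NOT the Beam API.
class SplittableSketch {

    // A half-open range [from, to) of positions within a single element.
    static final class OffsetRange {
        final long from;
        final long to;

        OffsetRange(long from, long to) {
            this.from = from;
            this.to = to;
        }

        // Split into at most `parts` contiguous, non-overlapping sub-ranges.
        List<OffsetRange> split(int parts) {
            if (parts <= 0) {
                throw new IllegalArgumentException("parts must be positive");
            }
            List<OffsetRange> out = new ArrayList<>();
            long size = to - from;
            long chunk = Math.max(1, (size + parts - 1) / parts);
            for (long start = from; start < to; start += chunk) {
                out.add(new OffsetRange(start, Math.min(start + chunk, to)));
            }
            return out;
        }
    }

    // Process one sub-range of the element; here, sum the values it covers.
    static long processRange(long[] element, OffsetRange range) {
        long sum = 0;
        for (long i = range.from; i < range.to; i++) {
            sum += element[(int) i];
        }
        return sum;
    }

    // Split the work for one element, process the pieces in parallel,
    // and combine the partial results.
    static long processSplit(long[] element, int parts) {
        return new OffsetRange(0, element.length).split(parts)
                .parallelStream()
                .mapToLong(range -> processRange(element, range))
                .sum();
    }

    public static void main(String[] args) {
        long[] element = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        // Processing the element as a whole and in three pieces
        // gives the same answer.
        System.out.println(processRange(element, new OffsetRange(0, element.length))); // prints 55
        System.out.println(processSplit(element, 3)); // prints 55
    }
}
```

The point of the sketch is that a large element, such as a big file, no longer has to be handled as one indivisible unit of work. A real runner would decide how and when to split rather than using a fixed number of parts.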

More Information

Beam Website

Related Articles

Apache Beam Moves To Top Level

Apache Spark 2.0 Released

Flink Gets Event-time Streaming

Google Announces Big Data the Cloud Way

 

Last Updated ( Wednesday, 28 February 2018 )