Apache Spark 2.0 Technical Preview
Written by Kay Ewbank   
Tuesday, 07 June 2016

A new version of Apache Spark is now available in technical preview.

Apache Spark is an open source data processing engine, and the final Spark 2.0 release is a few weeks away, according to the developers.

One feature added in the new version is standard SQL support, with a new ANSI SQL parser and support for subqueries. Spark 2.0 can run all 99 TPC-DS queries, which between them require many of the SQL:2003 features. This will make it much easier to port SQL-based apps to Spark.
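As a minimal sketch of the kind of query this enables (assuming a SparkSession named spark, Spark 2.0's new entry point described below, and a temporary view called sales with region and amount columns; these names are illustrative, not from the release):

# Scalar subqueries such as the AVG below now parse and run directly
above_average = spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    HAVING SUM(amount) > (SELECT AVG(amount) FROM sales)
""")
above_average.show()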


The APIs in Spark 2.0 have been streamlined to make Spark easier to use. DataFrames and Datasets have been unified for Scala and Java, with DataFrame now simply a type alias for Dataset of Row. The unification doesn't apply to Python and R, as compile-time type safety isn't a feature of those languages.

There is also a new entry point, SparkSession, which replaces SQLContext and HiveContext, though the old SQLContext and HiveContext have been kept for backward compatibility. The Accumulator API has also been redesigned, with a simpler type hierarchy and support for specialization on primitive types.
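A minimal sketch of the new entry point in Python (the JSON file and its columns are placeholders):

from pyspark.sql import SparkSession

# SparkSession replaces SQLContext/HiveContext as the single entry point
spark = SparkSession.builder \
    .appName("Spark2Preview") \
    .getOrCreate()

# Reading data and running SQL now hang off the session itself
df = spark.read.json("people.json")   # placeholder path
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 20").show()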

Machine learning support has been improved, with the spark.ml package and its "pipeline" APIs now put forward as the primary machine learning API. While the original spark.mllib package is preserved, future development will focus on the DataFrame-based API. Pipeline persistence has been added, so users can save and load machine learning pipelines and models across all the programming languages supported by Spark.

MLlib is Spark's scalable machine learning library. It fits into Spark's APIs and interoperates with NumPy in Python (starting in Spark 0.9), and it can use any Hadoop data source (e.g. HDFS, HBase, or local files), making it easy to plug into Hadoop workflows. The sample code below shows Python code for prediction with logistic regression using MLlib.

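A minimal sketch along these lines, using the RDD-based spark.mllib API (the data file, its format, and the parsing helper here are assumptions for illustration):

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="LogisticRegressionSample")

# Each line is assumed to hold a label followed by numeric features
def parse_point(line):
    values = [float(x) for x in line.split()]
    return LabeledPoint(values[0], values[1:])

points = sc.textFile("data/sample_data.txt").map(parse_point)

# Train the model, then predict the label of the first point
model = LogisticRegressionWithLBFGS.train(points)
first = points.first()
print("prediction: %d, actual: %d" % (model.predict(first.features), first.label))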

 

R support is another area that has been improved, with support for Generalized Linear Models (GLM), Naive Bayes, Survival Regression, and K-Means in R.

Another improvement is the new Structured Streaming API, which takes a novel approach to streaming. This initial version is an extension of the DataFrame/Dataset API. Key features will include support for event-time based processing, out-of-order/delayed data, sessionization, and tight integration with non-streaming data sources and sinks.
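A minimal sketch of the idea in Python, counting words from a socket source with the same DataFrame operations used on static data (host and port are placeholders; the API is still experimental in the preview):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

# Treat a stream of text lines as an unbounded DataFrame
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# Ordinary DataFrame operations define the continuous query
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the running counts until stopped
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()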

The preview is available either from Apache Spark or on Databricks, whose team has been contributing to Spark over recent months.


More Information

Databricks Blog Post

Apache Spark

Related Articles

Spark Announcements

Apache Releases Spark 1.6

Spark 1.4 Released

MOOC On Apache Spark 

Learning Spark (book review) 

 


Last Updated ( Tuesday, 07 June 2016 )