Apache Samza Adds SQL

Written by Kay Ewbank

Monday, 08 January 2018

There's a new version of Apache Samza that adds Samza SQL and both Azure EventHubs and AWS Kinesis. Samza is an open source framework originally developed alongside Kafka by LinkedIn before being made open source and taken over by the Apache Software Foundation.

The idea behind Samza is to provide a simple way to develop and run stream processing jobs that can be used by non-programmers as well as developers. Samza uses Apache Kafka for messaging, and Apache Hadoop YARN for fault tolerance, processor isolation, security, and resource management. It has support for local state via a RocksDB store that allows a stateful application to scale up to 1.1 Million events/sec on a single machine with SSD.

Samza has a simple callback-based “process message” API comparable to MapReduce. It supports managed state via snapshotting and restoration of a stream processor’s state. When a processor is restarted, Samza restores its state to a consistent snapshot. It also provides fault tolerance by working with YARN to transparently migrate tasks to another machine if the active machine in the cluster fails. Kafka is used to process messages in the order they were written to a partition, so that no messages are ever lost.

Samza is partitioned and distributed at every level. It has a pluggable API that means it can be run with other messaging systems and in other execution environments, though it is designed to work out of the box with Kafka and YARN. Samza is written in Scala and Java.

The new release of Samza adds three main new features. The first is Samza SQL. This is a high level API that is designed to expand the target audience for stream processing to make it accessible to anyone who can write SQL. The developers say Samza SQL can be used to obtain quick real time insights, and to quickly create stream processing applications.

Samza SQL is based on Apache Calcite, an open source SQL language framework used by several Apache projects. The way Samza SQL works is that you write a normal SQL query, and the API deals with creating, configuring, and managing the pipeline.

The second improvement is an Azure EventHubs producer, consumer and checkpoint provider. An AWS Kinesis consumer has also been added. Other improvements include durable state in high-level API, Zookeeper-based deployment stability, and multi-stage batch processing.

apache

More Information

Samza Site

Apache Bigtop Adds OpenJDK 8 Support

Apache Fluo Improves Spark Integration

Kafka 1 Becomes More Tolerant

Comparing Kafka To RabbitMQ

Apache Kafka Adds New Streams API

GoKa Stream Processing For Kafka

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.

FSF Turns 40 and Auctions Off Original GNU
21/02/2025

The Free Software Foundation (FSF) turns 40 this year and, as part of the celebrations, is holding a virtual memorabilia auction that will include the original drawing of the iconic GNU head.

+ Full Story

MongoDB Acquires Voyage AI To Add Embedding Models
10/03/2025

MongoDB is to acquire Voyage AI with the intention of using Voyage AI's facilities within MongoDB so developers can build apps that use AI more reliably. Voyage AI produces embedding and reranking mod [ ... ]

+ Full Story

More News

Comments

or email your comment to: comments@i-programmer.info

More Information

Related Articles

Comments