Data Analytics With Spark Using Python

Author: Jeffrey Aven
Publisher: Addison-Wesley
Pages: 320
ISBN: 978-0134846019
Print: 013484601X
Kindle: B07D3BP8C8
Audience: Python developers wanting to learn Spark
Rating: 4.4
Reviewer: Kay Ewbank

Spark is increasingly popular in the world of big data, and this book sets out how to work with it and its related technologies using Python.

The author is clearly a Python enthusiast, saying that Python experience is useful but not strictly necessary, as Python is quite intuitive for anyone with any programming experience whatsoever. Whether you agree with this depends on how much you like Python! The examples are all in Python and largely use PySpark, the Python API for Spark. Topics range from core Spark programming to Spark SQL, streaming, and machine learning. The broader ecosystem is also covered, including Hadoop, Kafka, and Cassandra.

The book starts off with an introduction to big data, Hadoop and Spark, followed by chapters on deploying Spark, understanding the Spark cluster architecture, and learning Spark programming basics. All of this is probably necessary, especially the chapter on Spark programming basics, which explains the concept of RDDs (Resilient Distributed Datasets), how to get data into them, and how to work on the data once it's there.

Part Two of the book looks in more detail at each of the elements of programming Spark, starting with the Spark Core API. Topics include partitioning data, data sampling, and optimizing Spark. There's a good chapter on SQL and NoSQL programming with Spark that looks at Hive and using DataFrames for SQL, then goes on to look at using Spark with HBase, Cassandra, and DynamoDB. Each of these gets a couple of pages, so you're not going to get really deep into how to use Spark with them. Typically, there's an introduction to the database, a listing and explanation of creating a table and inserting some data, scanning a table, updating a cell, and advice on what other packages to use with that combination. You then get exercises to see whether you can do the tasks covered so far.

A chapter on stream processing and messaging using Spark covers the use of DStreams, structured streaming, and using Spark with Kafka and Amazon Kinesis Streams. The book ends with a chapter on data science and machine learning using Spark that looks at the use of Spark with R, machine learning using Spark MLlib, and using Jupyter and Apache Zeppelin notebooks.

Overall, this seems a good introduction to using Spark via Python. Each topic is introduced fairly briefly, so you see how to do the basics rather than more advanced techniques, but you should know enough to get started. I'd have preferred a bit more on each topic, but given the very wide ecosystem of Spark, that might have resulted in a massive book, and what is covered works well.

For recommendations of books on big data, see Reading Your Way Into Big Data. For recommendations of Python books see Books for Pythonistas and Python Books For Beginners in our Programmer's Bookshelf section.

Last Updated (Tuesday, 23 October 2018)
