Big Data Analytics with Spark
Author: Mohammed Guller

Chapter 6 Spark Streaming

Spark Streaming involves processing stream data in near real time (e.g. to identify card fraud). The chapter opens with a high-level architectural view: Spark Streaming splits real-time data into batches, stored as RDDs, for processing via Spark core. Data sources and destinations are briefly discussed. The chapter continues with a deeper look at the Spark Streaming API. StreamingContext is the main entry point into the streaming library, and code is provided to create a StreamingContext, and to start, checkpoint, and stop a stream computation. An outline Spark Streaming application is created, and populated as the section discusses: creating a DStream (a sequence of RDDs split over time), processing the DStream (e.g. map, filter, reduce, transform), output operations (e.g. saveAsTextFiles, print), and window operations (which apply computations across overlapping batches of RDDs). The chapter ends with a complete Spark Streaming application that shows trending Twitter hashtags, with each section of the annotated code explained. This chapter provides a very useful introduction to Spark Streaming. I especially liked the step-by-step explanation of the code – clear and concise.
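To give a flavour of the API the chapter walks through, here is a minimal Spark Streaming word count sketch in Scala. It is not the book's code: the socket source on localhost:9999, the 10-second batch interval, and the checkpoint directory are assumptions made purely for illustration.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSketch")
    // StreamingContext is the entry point; incoming data is split into
    // 10-second batches, each stored as an RDD
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("checkpoint-dir")   // hypothetical checkpoint directory

    // A DStream is a sequence of RDDs split over time; this one reads
    // lines of text from a socket (assumes a server on localhost:9999)
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()   // an output operation

    // A window operation: totals over the last 60 seconds, every 10 seconds
    val windowed = counts.reduceByKeyAndWindow((a: Int, b: Int) => a + b,
                                               Seconds(60), Seconds(10))
    windowed.print()

    ssc.start()             // start the stream computation
    ssc.awaitTermination()  // run until stopped
  }
}

The book's closing example uses a Twitter stream rather than a socket, but the overall shape (create a StreamingContext, build and transform a DStream, attach output operations, start the computation) is the same.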
Chapter 7 Spark SQL

SQL is very popular for data analysis, since there are many more people with SQL skills than MapReduce skills. Spark SQL runs on top of Spark, is able to process structured data in various formats (e.g. database, Avro, CSV), and integrates seamlessly with the other Spark libraries. The chapter continues with a look at some of the performance advantages of Spark SQL, including: reduced disk I/O, partitioning, in-memory caching, predicate pushdown, and query optimization. Next, the major classes and methods of the Spark SQL library are discussed (e.g. SQLContext, DataFrame); in each case simple example code is provided. The section continues with a look at DataFrames: a DataFrame is a collection of rows and columns with a schema (i.e. columns are typed and named). Various ways of creating, processing, and saving DataFrames are discussed with examples. The chapter ends with an example of using Spark SQL for interactive analysis, followed by a similar example using SQL/HiveQL. In each case, code is provided and discussed. This chapter provides a helpful overview of Spark SQL functionality, with example code discussed in a stepwise fashion.
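As a sketch of the kind of DataFrame code the chapter steps through, here is a minimal Spark 1.x-style example, written as it might be typed into spark-shell (where the SparkContext sc already exists). The people.json file and its name and age columns are hypothetical.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // the Spark SQL entry point

// Read a JSON file into a DataFrame; the schema (typed, named
// columns) is inferred from the data
val people = sqlContext.read.json("people.json")   // hypothetical file
people.printSchema()

// Process the DataFrame through its own API...
people.filter(people("age") > 21).select("name", "age").show()

// ...or register it as a temporary table and use plain SQL
people.registerTempTable("people")
sqlContext.sql("SELECT name, age FROM people WHERE age > 21").show()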
Chapter 8 Machine Learning with Spark

Machine learning involves training a system to learn from data, and then performing processing based on that learning. Instead of being explicitly programmed, the system learns from the data, and is able to make predictions about new data. Examples include Netflix's recommendation system and driverless cars. The chapter first provides an overview of some general terms used in machine learning, including: features (independent variables), labels (dependent variables), models (used for predictions), training data, and test data. This is followed by a short discussion of some of the types of application that use machine learning, including: classification (e.g. spam detection), regression (e.g. forecasting), and recommendation (e.g. Amazon books). Next, the different types of machine learning algorithm are briefly discussed, namely: supervised (train using known results), unsupervised (train using unknown results), and classification (predict the category of data). Spark has two machine learning libraries: the more mature MLlib, and Spark ML. First, MLlib is discussed; it contains functions for both machine learning and statistical analysis. Its various data types are briefly discussed (Vector, LabeledPoint, and Rating). Some algorithms and models are then explained, together with example code, namely: regression, classification, clustering, and recommendation. The section ends with a short example MLlib application (using the traditional ‘iris’ dataset), with the code discussed in a line-by-line manner. The chapter continues with a look at the other machine learning library, Spark ML. It is used to create machine learning workflows/pipelines, linking together the various steps in a machine learning task. Various abstractions used by Spark ML are briefly discussed, including: ML Dataset, Transformer, Estimator, Pipeline, and CrossValidator. The section ends with a short example Spark ML application (using a ‘sentiment’ dataset), again with the code discussed in a line-by-line manner. This is a long chapter and, like others in this book, explains concepts from the beginning, gives context, and provides useful example code to support the assertions made. It serves as a very useful introduction to machine learning.
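To illustrate the MLlib data types and model-building style the chapter describes, here is a small classification sketch (again spark-shell style, with sc available). The four-point training set and its feature values are invented for illustration; the book's own example uses the iris dataset.

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A LabeledPoint pairs a label (the dependent variable) with a feature
// Vector (the independent variables); this toy training set is invented
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.5)),
  LabeledPoint(0.0, Vectors.dense(1.5, 1.0)),
  LabeledPoint(1.0, Vectors.dense(3.0, 2.5)),
  LabeledPoint(1.0, Vectors.dense(3.5, 3.0))))

// Train a binary classifier, then predict the label of an unseen point
val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
val prediction = model.predict(Vectors.dense(3.2, 2.8))   // expect 1.0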
Chapter 9 Graph Processing with Spark

A graph is a data structure that has vertices (nodes) and edges (connections between vertices). Graphs are a natural structure for certain types of data (e.g. links between people). The chapter opens by explaining some graph terms, including: undirected, directed, and property graphs. Next, Spark's graph library, GraphX, is introduced: a distributed graph analytics framework. As with the other Spark application libraries, it integrates seamlessly with both Spark core and the other Spark libraries. The chapter continues with a look at the data types for graphs (e.g. VertexRDD, Edge), and the operators (e.g. transformations, join, aggregation) that act on graph data. It should be noted that this API is still under development. An annotated example is provided of how to create a graph, together with code showing how to extract useful information from it. This chapter provides a useful introduction to graphs, and to processing graphs using Spark's GraphX library. Helpful annotated code is provided throughout.
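As a sketch of the graph construction the chapter annotates, here is a tiny property graph in GraphX (spark-shell style; the three-person ‘follows’ graph is invented for illustration).

import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (id, property) pairs; each Edge has a source id,
// a destination id, and its own property
val vertices = sc.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)

// Extract some simple information from the graph
println(graph.numVertices)                  // 3
println(graph.numEdges)                     // 2
graph.inDegrees.collect().foreach(println)  // e.g. (2,1), (3,1)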
Chapter 10 Cluster Management

A cluster manager manages the resources (e.g. CPU, storage) on a cluster of servers. Spark has its own standalone cluster manager, but it can also integrate with other cluster managers, specifically Apache Mesos and Hadoop YARN. The chapter first examines the standalone cluster manager from a high-level architectural view, discussing the master (which coordinates resources) and the various worker (which do the work) components. Next, setting up a standalone cluster is discussed, together with annotated scripts. The section ends with a look at running a Spark application on the standalone cluster, in both client mode (the application driver runs in the submitting process) and cluster mode (the driver runs on a worker node). The chapter ends by briefly looking at using both Apache Mesos and YARN as cluster managers. Each is examined from a high-level architectural view, covering setting up the cluster and running a Spark application on it. The chapter is useful for placing Spark applications in the wider context of the various cluster managers that Spark can use.

Chapter 11 Monitoring

Monitoring is vital for ensuring an application is working correctly, and for helping troubleshoot problems. Spark has various inbuilt monitoring features (e.g. a web-based monitoring application), and also integrates with various third-party monitoring tools (e.g. Ganglia). The chapter first takes a look at monitoring a standalone cluster, using the inbuilt web-based monitoring tool. The main web pages are described for monitoring both the master and worker nodes. Next, monitoring a Spark application is discussed from different aspects, including: jobs launched, job stages, task stages, RDD storage, and environment. In each case, the relevant web page is discussed, highlighting salient features (e.g. killing a job). The chapter ends with a look at web pages that are useful for monitoring Spark Streaming and Spark SQL applications. This chapter will undoubtedly prove useful for system health checks and troubleshooting.

Conclusion

This book aims to provide a “...concise and easy-to-understand tutorial for big data and Spark”, and clearly succeeds. The book is exceptionally well written. Helpful explanations, diagrams, practical step-by-step walkthroughs, annotated code, inter-chapter links, and website links abound throughout. The book is aimed at developers who are new to Spark, and explains concepts from the beginning. If you work through the book you should become competent in the use of Spark. There is much more to learn, of course, but this book gives a solid foundation in both core Spark and its major specialized libraries: Streaming, Machine Learning, SQL, and Graphing. The book is based on workshops given by the author, and clearly the feedback from these has been useful in creating this book, since it seems to have answered all the questions I had. This book provides everything you need to know to get started with Spark, explained in an easy-to-follow manner. If you want to learn Spark, buy this book. Highly recommended.

Related Articles

Reading Your Way Into Big Data - Ian Stirk recommends the reading required to take you from novice to competent in areas relating to Big Data, Hadoop, and Spark

Mastering Apache Spark - review by Ian Stirk

Learning Spark - review by Kay Ewbank

To keep up with our coverage of books for programmers, follow @bookwatchiprog on Twitter or subscribe to I Programmer's Books RSS feed for each day's new addition to Book Watch and for new reviews.
Last Updated ( Tuesday, 16 February 2016 )