Mastering Apache Spark
Author: Mike Frampton

Chapter 5 Apache Spark GraphX

This chapter opens with an overview of graph terminology: a graph is a data structure with nodes (vertices) and connections (edges). Graphs have many uses, including fraud detection and social modelling. GraphX is Spark’s graph technology. The chapter continues with a look at GraphX coding, using a family tree in the examples to illustrate the concepts and processing. First the environment is discussed, specifically the directory structure, the SBT environment, and the compilation of source code into JAR files. Generic Scala GraphX code is then described, and this is reused in all the subsequent examples.
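To give a flavour of what this generic graph-building code involves, here is a minimal Scala sketch; the family members, relationships and application name are hypothetical illustrations, not the book’s actual listing.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object FamilyGraph {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-family-tree"))

    // Vertices: (id, name) pairs for the family members (hypothetical data)
    val people: RDD[(VertexId, String)] = sc.parallelize(Seq(
      (1L, "Mike"), (2L, "Sarah"), (3L, "John"), (4L, "Kate")))

    // Edges: (source id, destination id, relationship)
    val relations: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(1L, 2L, "Husband"), Edge(2L, 1L, "Wife"),
      Edge(1L, 3L, "Father"), Edge(2L, 4L, "Mother")))

    // Build the property graph and confirm its size
    val graph = Graph(people, relations)
    println(s"Vertices: ${graph.vertices.count()}  Edges: ${graph.edges.count()}")

    sc.stop()
  }
}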
The chapter continues with various graph examples built on this generic code; the kinds of operations involved are sketched below.
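The book’s examples are not reproduced in this review, but standard GraphX operations on a graph like the one above (PageRank, connected components, triangle counting) look roughly like this sketch:

// Assumes the 'graph' value built in the previous sketch
val ranks = graph.pageRank(0.0001).vertices             // relative importance of each vertex
val components = graph.connectedComponents().vertices   // label each vertex with its component id
val triangles = graph.triangleCount().vertices          // triangles passing through each vertex

// Join the PageRank scores back to the names and print them
graph.vertices.join(ranks)
  .map { case (id, (name, rank)) => s"$name -> $rank" }
  .collect()
  .foreach(println)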
Since Spark doesn’t have its own storage, it can’t do in-place processing. An example is provided of using the Neo4j graph database to provide in-place processing. The example uses Mazerunner, a GraphX/Neo4j processing prototype, and shows how a graph-based database can be used for graph storage, with Spark used for graph processing. This chapter provides a useful introduction to graph terminology, architecture and processing. Useful practical example code is provided, and the generic GraphX code template should prove a useful base for your own graph processing code. The section on Mazerunner was useful, illustrating the potential of combining graph storage and graph processing.

Chapter 6 Graph-based Storage

This chapter opens with a discussion of Spark not providing its own storage: graph data needs to be sourced from somewhere and, after processing, needs to be stored somewhere. The chapter primarily discusses the use of the Titan graph database for storage. It proceeds with details on how to download, install, and configure Titan, an interesting but not yet mature product. The various storage options are discussed, including the HBase and Cassandra NoSQL databases. The chapter shows how the associated Gremlin shell can be used interactively to develop graph scripts and Bash shell scripts. The chapter continues with a look at using Titan with HBase. Details are given on how to install HBase from the CDH distribution, and Gremlin HBase scripts are provided to define the storage backend, ZooKeeper servers and ports, and HBase tables. Useful example code shows the processing and storage of the HBase graph data, and a similar exercise is undertaken for the Cassandra database. The chapter ends with a look at accessing Titan with Spark, showing the use of Spark to create and access Titan-based graphs. This chapter provides a helpful overview of some of the newer and more experimental graph storage technologies.

Chapter 7 Extending Spark with H2O

The book now switches back to Machine Learning. While Spark MLlib provides lots of useful functionality, more options become available when integrating Spark with the Sparkling Water component of H2O. Details are provided on how to download, install and use H2O. This is followed by environment details, including the directory structure, SBT config file content, and the use of Bash scripts to execute the H2O examples. The chapter continues with a look at the H2O architecture, including the web interface through which the data can be examined. Data is shared between Spark and H2O via the H2O RDD, and this is shown in an example in which H2O does the processing and the results are passed back to Spark (a minimal sketch of this hand-off appears below). H2O’s Deep Learning algorithms are examined next: Deep Learning is feature rich, with extra hidden layers giving a greater ability to extract data features, and examples are provided. The chapter ends with a look at H2O Flow, a web-based interface for H2O and Sparkling Water used for monitoring, manipulating data, and training models; example code is provided. This chapter shows how Spark MLlib can be extended using the H2O libraries. The general architecture of H2O is examined, together with download, installation and configuration details, and various extra data analysis and modelling features are shown.
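The Spark-to-H2O hand-off described above might look something like the following sketch. It is not the book’s listing: it assumes a more recent Spark and Sparkling Water API than the book’s examples, the file path and column name are hypothetical, and the exact class and method names (H2OContext.getOrCreate, asH2OFrame, the DeepLearningParameters fields) differ between H2O/Sparkling Water versions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o.H2OContext
import _root_.hex.deeplearning.DeepLearning
import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters

val spark = SparkSession.builder.appName("sparkling-water-sketch").getOrCreate()
val h2oContext = H2OContext.getOrCreate(spark)   // older releases used new H2OContext(sc).start()
implicit val sqlContext = spark.sqlContext

// Hand a Spark DataFrame over to H2O as an H2OFrame
val trainingDF = spark.read.option("header", "true").csv("/tmp/training.csv")   // hypothetical input
val trainingFrame = h2oContext.asH2OFrame(trainingDF)

// Train an H2O deep learning model on the shared frame
val params = new DeepLearningParameters()
params._train = trainingFrame._key
params._response_column = "label"        // hypothetical response column
params._hidden = Array(200, 200)         // two hidden layers
val model = new DeepLearning(params).trainModel.get

// Frames held in H2O can be passed back to Spark as DataFrames for further processing
val backInSpark = h2oContext.asDataFrame(trainingFrame)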
Chapter 8 Spark Databricks

Creating a big data analytics cluster, together with importing data and ETL, can be difficult and expensive; Databricks aims to make this task easier. Databricks is a cloud-based service that provides similar functionality to an in-house Spark cluster. Currently only Amazon Web Services (AWS) is supported, but there are plans to cover other cloud platforms. The chapter opens with an overview of Databricks: a Databricks cluster is similar to a Spark cluster, with a master, slaves, and executors; configuration and server size are defined, and monitoring and security are built in. With the cloud platform, you only pay for what you use. Work is organised in terms of folders and notebooks, which hold code and scripts, and it is also possible to create jobs. The chapter continues with a look at how to install Databricks, noting that AWS offers 1 year of free access, 5GB of storage, and 750 hours of EC2 usage – which all means low-cost access. It then walks through the various steps needed to get up and running (account id, access account id, and secret access key – used by Databricks to access your AWS storage), and AWS billing is briefly discussed. Various administration features are covered, including the Databricks menus (to perform actions on folders) and account management (adding accounts, changing passwords etc). Cluster management is briefly discussed, with a step-by-step walkthrough of creating and configuring a new cluster. Examples are provided on how to create and use notebooks and folders, and running various jobs and libraries is briefly discussed. The chapter ends with a look at Databricks tables, again mostly via the admin website. Using menu options it is possible to create a table via an import; it is also possible to create external tables, and to create tables programmatically. Sample table data can be previewed, and SQL commands run against the tables. Finally, the DBUtils package is examined and some of its methods discussed. This chapter provides useful information about the current status of Spark in the cloud, with useful walkthroughs for setting up a cluster to use in the cloud. Databricks know a lot about Spark, having designed it!

Chapter 9 Databricks Visualization

The previous chapter laid the foundation for Spark in the cloud; this chapter extends that to data visualization. Databricks provides dashboards for data visualization, based on the tabular data that SQL produces, with various menu options allowing the data to be presented in different formats. The chapter continues with a step-by-step walkthrough of the creation of a simple dashboard, which is then published so it can be accessed by an external client. This is followed by the creation of an RDD-based report and a stream-based report. Next, the REST interface is discussed; this allows integration of your Databricks cloud instance with external applications, and code for this is given and discussed. Various methods of moving data in and out of Databricks are then described, with examples. The chapter ends with a brief mention of some resources from which you can obtain further information and help about Databricks. This chapter provides a useful overview of Databricks visualization; the use of menus and the step-by-step walkthroughs make it particularly easy to understand. The author believes the natural progression of big data processing is: Hadoop → Spark → Databricks. Time will tell.
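To give a sense of the notebook cells these Databricks chapters describe, here is a rough sketch of a Scala notebook cell. The paths, file and table names are hypothetical, display() and dbutils are helpers supplied by the Databricks notebook environment rather than by open source Spark, and the sketch uses a newer Spark API than the book’s examples.

// List files in the Databricks file system using the DBUtils package
display(dbutils.fs.ls("/FileStore/tables"))          // hypothetical path

// Load imported data and register it as a table
val sales = spark.read.option("header", "true").csv("/FileStore/tables/sales.csv")   // hypothetical file
sales.createOrReplaceTempView("sales")

// The tabular result of a SQL query can be turned into a dashboard chart via the plot menu
display(spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region"))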
Conclusion

This book has well-written discussions, helpful examples, diagrams, website links, inter-chapter links, and useful chapter summaries. It contains plenty of step-by-step code walkthroughs to help you understand the subject matter. The book describes Spark’s major components (i.e. Machine Learning, Streaming, SQL, and Graph processing), each with practical code examples. Some of the template code could form the basis of your own application code. Several of the core Spark components are extended using less well-known components, many of which are still works in progress. I’m not sure how many readers will find these chapters/sections useful, since they often involve workarounds, and the components may no longer exist, or may be superseded, later on – they can also distract from the book’s core. That said, if you enjoy working at the bleeding edge of technology, you’ll enjoy what these extensions add. Although the book assumes some knowledge of Spark, for completeness it might have been useful to have some introduction to it (e.g. explaining RDDs, introducing the spark-shell etc). Developers coming from a Windows environment might initially struggle with Linux, SBT, JARs etc. Despite these concerns, I enjoyed this book; it contains plenty of useful detail. Spark is a rapidly changing technology, so check http://spark.apache.org/ for the latest changes. The book is highly recommended.