Hadoop: The Definitive Guide (4th ed)
Chapter 19 Spark
Unlike most of the technologies in this book, Spark does not use MapReduce; instead, it has its own distributed runtime. Spark can keep large datasets in memory between jobs, which tends to give excellent performance for iterative jobs and interactive analysis. The chapter opens with a look at where to download Spark, and continues with installation details. Next, a simple Spark example is provided, which uses the Spark shell to load a text file. The Spark shell is good for initial testing, but for serious work an application needs to be created; this creates a SparkContext object that provides an entry point into the Spark environment. Code examples are provided in Scala, Python, and Java (in order of increasing amounts of code required!). Resilient Distributed Datasets (RDDs) are the major data structure in Spark; transformations can be applied to them, but nothing runs until an action is applied (i.e. transformations are lazy) – a minimal sketch of this behaviour is given after these chapter summaries. The chapter continues with a look at the anatomy of a Spark job, with a step-by-step walkthrough. This chapter provides a useful overview of Hadoop’s capability for interactive processing via Spark. I suspect we’ll see more systems moving away from MapReduce and towards Spark due to its superior performance. DataFrames (RDDs with named columns), a newer feature, are not mentioned.
Chapter 20 HBase
HBase is Hadoop’s NoSQL database: a distributed, column-oriented key-value database built on top of HDFS. It should be used when you want real-time read/write random access to very large datasets. Unlike traditional relational databases, it was created with massive scalability in mind. The chapter opens with a look at various HBase concepts. There’s a brief tour of the data model (column families – groupings of columns, and regions – subsets of table rows). Next, the implementation of HBase is discussed, with reference to the HBase master orchestrating a cluster of regionserver workers. ZooKeeper is used as the authority on cluster state. The chapter continues with a look at where to download HBase, and how to install it. Various code examples are given, namely: Java – basic table and admin access; MapReduce – counting the number of rows in an HBase table (a sketch of basic table access is also given below). REST and Thrift interfaces are provided for non-Java access. The next section looks at building an online query application. Unlike HDFS and MapReduce, HBase is very good at reading/writing individual records. The examples continue with schema design (two tables), loading the data, and online queries using the HBase Java API. The chapter then looks at how HBase compares with traditional relational databases. HBase is designed to scale massively: its tables can contain millions of columns and billions of rows, which can be partitioned across nodes automatically. This chapter provides a helpful overview of what HBase is, and how it is used. I particularly liked the sentence “With HBase, the software is free, the hardware is cheap, and the distribution is intrinsic.”
Chapter 21 ZooKeeper
ZooKeeper is Hadoop’s distributed co-ordination service; it supports the creation of distributed applications, which are difficult to build from scratch. The chapter opens with a look at where to download ZooKeeper, and continues with installation details. Configuration is controlled by the zoo.cfg file. Various useful ZooKeeper commands are given. The chapter continues with a ZooKeeper example, which maintains a membership list on a central server.
Details and code are provided for: group membership, creating a group, joining a group, listing members of a group, and deleting a group (the znode-based approach is sketched below). The next section takes a look at the ZooKeeper service, in terms of maintaining high availability and performance. Subsections discussed include: the data model, operations, and watch triggers. Next, ZooKeeper is used to write some useful applications, including: a configuration service, a resilient ZooKeeper application, and a lock service. The chapter ends with some recommendations for running ZooKeeper in production.
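To make the laziness of RDD transformations concrete, here is a minimal sketch using Spark’s Java API; it is not code from the book, and the input file name and local master setting are illustrative assumptions:

```java
// A minimal sketch of RDD laziness via Spark's Java API. The file name
// input.txt and the "local" master are hypothetical, for illustration only.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyTransformations {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LazyTransformations").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf); // the entry point into Spark

        // Transformations: these only build a lineage graph; no data is read yet.
        JavaRDD<String> lines = sc.textFile("input.txt");
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));

        // Action: only now does Spark read the file and run the filter.
        long count = errors.count();
        System.out.println(count + " lines contain ERROR");

        sc.stop();
    }
}
```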
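Similarly, basic table access through the HBase Java API looks roughly like the following sketch. It assumes a running HBase instance; the table and column family names are hypothetical, not taken from the chapter:

```java
// A minimal sketch of basic HBase table access. Assumes a running HBase
// instance and an existing table "mytable" with column family "data"
// (both names are hypothetical).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class BasicTableAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("mytable"))) {

            // Write one cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
            table.put(put);

            // Read it back by row key.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("col1"));
            System.out.println("row1/data:col1 = " + Bytes.toString(value));
        }
    }
}
```

Note how every read and write is addressed by row key, column family, and column qualifier – the key-value model the chapter’s data-model tour describes.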
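And in the spirit of the chapter’s group-membership example, a minimal sketch: a group is modelled as a persistent znode and each member as an ephemeral child znode, so members disappear automatically when their sessions end. The connection string and the group and member names are hypothetical:

```java
// A minimal sketch of znode-based group membership. The connection string,
// group name, and member name are hypothetical. A group is a persistent
// znode; each member is an ephemeral child, deleted by ZooKeeper when the
// member's session ends.
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class GroupMembership {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Create the group (throws NodeExistsException if it already exists).
        zk.create("/mygroup", null, Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Join the group: an ephemeral znode vanishes with the client session.
        zk.create("/mygroup/member-1", null, Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // List the current members of the group.
        List<String> members = zk.getChildren("/mygroup", false);
        System.out.println("Group members: " + members);

        zk.close(); // the ephemeral member znode is removed here
    }
}
```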
Part V Case Studies
This part comprises three case studies, illustrating how Hadoop is currently being used to help solve real-world problems:
Chapter 22 Composable Data at Cerner
Chapter 23 Biological Data Science: Saving Lives with Software
Chapter 24 Cascading
The book ends with four appendices.
Conclusion
This book covers a broad range of Hadoop topics, including the Hadoop core (HDFS, MapReduce, and YARN) and many related components; installation details and case studies are also included. The book is well written, providing good explanations, examples, walkthroughs, and helpful diagrams. Useful links are given between chapters and to websites. Most chapters have footnotes and a “further reading” section so you can obtain more information. You probably need an understanding of Java or a similar language to get the most out of the book. It should take your general level of understanding from level 3 to level 8. Since the book covers internals, administration, and development, I’m not sure who will read the entire book. Some sections seemed dry on first reading. Some of the books that are referenced are getting old. Not all components are covered (e.g. Storm), but many popular ones are. I did wonder if there was too much emphasis on MapReduce, since there seems to be movement away from MapReduce batch processing towards interactive processing, as shown by the growing popularity of Spark. Despite these minor criticisms, if you want to gain a good understanding of the current state of Hadoop and its components, I can highly recommend this book.
For reviews of other titles on Hadoop, see the Data Science category.