Author: Michael Frampton
Chapter 7: Monitoring Data
This chapter opens with a look at the Hue browser, a web-based user interface that sits on top of Hadoop and provides interfaces to many tools, including Oozie, Pig, Impala, HBase, Hive, and Sqoop. Additionally, Hue has a helpful HDFS file browser. The chapter describes how to download, install, and configure Hue. It is important to ensure the underlying components (e.g. HDFS, YARN, HBase, Oozie, and Sqoop2) are installed correctly, and some detail is provided on how to set up these tools. Hue logs can be checked for any errors, and there is a list of potential errors, their causes, and solutions. The section ends with a helpful overview of the different sections of the Hue user interface.
The chapter continues with a look at Ganglia, an open source monitoring system designed for distributed high-performance systems. Details are given on how to install, configure, and use Ganglia. Next, common errors are discussed, together with their causes and solutions. The various components of the Ganglia interface (e.g. CPU, memory, and network) are discussed with reference to its various graphs and tables.
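For flavour, a Ganglia node is typically configured via gmond.conf; a minimal fragment (illustrative only, not from the book - the cluster name and collector host are hypothetical) might look like:

```
/* gmond.conf fragment: name the cluster and point metrics at a collector */
cluster {
  name = "hadoop-cluster"           /* hypothetical cluster name */
  owner = "ops"
}
udp_send_channel {
  host = "monitor01.example.com"    /* hypothetical collector host */
  port = 8649
}
udp_recv_channel {
  port = 8649
}
```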
The chapter ends with a look at Nagios, a tool that extends monitoring and can create alerts based on problem criteria. Details are given on how to download, install, and configure the tool, together with details of common errors, their causes, and solutions. A brief example is given that sets up an alert.
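To give a sense of what such an alert involves (my sketch, not the book's example), a Nagios service definition watching disk usage on a hypothetical DataNode might look like this - the host name and check command are assumptions:

```
# Illustrative Nagios service definition: alert on DataNode disk usage
define service {
    use                  generic-service       ; inherit template defaults
    host_name            datanode01            ; hypothetical host
    service_description  HDFS Disk Usage
    check_command        check_nrpe!check_disk ; assumes NRPE with a check_disk command
    notification_options w,c,r                 ; notify on warning, critical, recovery
}
```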
This chapter is particularly useful if you come from a Windows environment, where the Linux command line can prove troublesome. Hue provides a centralized place to access the various Hadoop tools, and its monitoring can be extended with Ganglia (graphs of what's happening) and Nagios (which provides alerts).
Chapter 8: Cluster Management
Cluster managers are used to install and configure related Hadoop tools, and to manage updates on a Hadoop cluster. This is much easier and quicker than installing the various individual tools or changing the configuration manually. Additionally, they provide a centralized web-based interface to the underlying Hadoop tools and configuration, together with automatic monitoring. It should be noted that there are license fees; however, these should be offset by the amount of time the managers save and the ease of work they provide.
The first cluster manager discussed is Ambari, which is used by the Hortonworks Hadoop stack. Details are provided for step-by-step installation, configuration of hosts, and monitoring, together with Ganglia and Nagios integration.
Next, the Cloudera cluster manager is examined. Details are provided for step-by-step installation, configuration of hosts, and monitoring. The section ends with a look at using SQL-like queries to build dynamic dashboards for monitoring.
The chapter ends with a look at Apache’s Bigtop. Although it’s not specifically a cluster manager, it aims to simplify installation and integration, and is used to smoke test the integrated Hadoop stack to ensure the tools work together. Cloudera bases its own releases on Bigtop’s test functionality. Details are provided of its installation, configuration, and the running of smoke tests.
The overriding message from this chapter is that using a cluster manager is much easier and quicker than installing, configuring, testing, and upgrading the individual tools and configurations manually. Additionally, having a single place to check the health of your Hadoop cluster is invaluable.
Chapter 9: Analytics with Hadoop
This chapter is about using tools that can help find meaning in data. The chapter opens with a look at Cloudera's Impala, a massively parallel processing (MPP) SQL query engine for Hadoop. Details are provided on how to download, install, and configure Impala. Its use is illustrated via both the Hue interface and the shell tool. The section shows how to use Impala to create a database, create an external table (i.e. one at an explicit location), and load a table from a file.
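As a taste of the kind of statements involved (my sketch, not the book's listing - database, table, and path names are all hypothetical), the three steps can be assembled as SQL strings and then run via impala-shell or Hue:

```python
# Sketch: build the three Impala SQL statements described above. All names
# and paths are hypothetical; in practice these run via impala-shell or Hue.

def impala_statements(db, table, hdfs_dir, data_file):
    """Return SQL to create a database, an external table at an explicit
    HDFS location, and to load a data file into the table."""
    return [
        f"CREATE DATABASE IF NOT EXISTS {db}",
        (f"CREATE EXTERNAL TABLE {db}.{table} (id INT, name STRING) "
         f"ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
         f"LOCATION '{hdfs_dir}'"),
        f"LOAD DATA INPATH '{data_file}' INTO TABLE {db}.{table}",
    ]

stmts = impala_statements("sales", "orders", "/user/hue/sales", "/tmp/orders.csv")
```

The EXTERNAL keyword plus the explicit LOCATION clause is what the book means by an external table: the data stays where it is in HDFS rather than being moved into the warehouse directory.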
Next, Hive is examined. Hive uses HiveQL to manipulate Hive-based tables, and User Defined Functions (UDFs) can extend its capabilities. Although Hue can be used, the Beeswax Hive interface is often used to create scripts. The section shows how to use Hive to create a database, create a table, and load a table from a file.
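Beyond Java UDFs, Hive can also stream rows through an external script via its TRANSFORM clause; as an illustration (my sketch, not from the book - the column layout is hypothetical), such a script just rewrites tab-separated rows on stdin:

```python
# Sketch of a script usable with Hive's TRANSFORM clause, which streams rows
# through an external program (distinct from a Java UDF). Assumed layout:
# tab-separated rows whose second field should be upper-cased.

def transform(lines):
    """Upper-case the second tab-separated field of each input row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            fields[1] = fields[1].upper()
        yield "\t".join(fields)

# In a real deployment the rows arrive on stdin, e.g.:
#   import sys
#   for row in transform(sys.stdin): print(row)
rows = list(transform(["1\tsmith\n", "2\tjones\n"]))
```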
The chapter ends with a look at Spark. Spark is used for very fast in-memory distributed processing. Applications can be developed in Java, Python, or Scala, or you can use the built-in scripting shell for ad-hoc queries. Spark can scale to a large number of servers (around 2,000 nodes), and can run in local mode or under a cluster manager (e.g. YARN or Spark's own standalone manager). Details are provided on how to download, install, configure, and run Spark.
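Spark's classic first example is a word count; a minimal local-mode sketch (mine, not the book's - it assumes pyspark is installed and uses a hypothetical input file) shows the style:

```python
# Sketch of Spark's classic word count in local mode. The app name and
# input path are hypothetical; the tokenizer is factored out as plain Python.

def tokenize(line):
    """Split a line into lowercase words."""
    return line.lower().split()

try:
    from pyspark import SparkContext  # requires a Spark installation
except ImportError:
    SparkContext = None               # pyspark not installed: sketch only

if SparkContext is not None:
    sc = SparkContext("local[2]", "WordCount")   # local mode, two threads
    counts = (sc.textFile("/tmp/input.txt")      # hypothetical input file
                .flatMap(tokenize)
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))
    print(counts.take(10))
    sc.stop()
```

Swapping "local[2]" for a YARN or standalone master URL is all it takes to move the same job onto a cluster.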
This chapter provides a useful overview of the functionality in Impala, Hive, and Spark. It really is only an overview - each tool has much more functionality - but helpfully, links are provided for further information. Combining these tools with Sqoop and Flume allows you to build ETL solutions. These tools work with versions of SQL, which is familiar to many developers, allowing a less painful transition.
Chapter 10: ETL with Hadoop
This chapter describes two popular tools (Pentaho's Data Integration and Talend's Open Studio) that can be used to transform Hadoop- and Spark-based data visually. Both integrate with MapReduce and the various Hadoop tools (e.g. Pig and Sqoop), and can be used to create and schedule ETL-based work. The tools allow drag and drop of components to create big data ETL processing. These tools should save you significant time and money, and provide a quick entry point into MapReduce processing (though you may still need low-level MapReduce for complex problems). For both tools, details are provided on how to download, install, configure, and use them.
This was an interesting chapter, detailing tools that provide a high-level entry point to the creation of MapReduce tasks, saving you time and resources. It might be argued this should be the default starting point, with low-level tools used only when required.
Chapter 11: Reporting with Hadoop
Hadoop's Hive can hold much more data than traditional relational databases, but as data volumes increase, so can data quality issues; various tools can report on the quality of this data.
The first tool examined is Hunk, Splunk's offering for Hadoop, used to create reports and dashboards showing the state of the data on the cluster. Details are given on how to download, install, configure, and run Hunk. The section continues with step-by-step report creation and dashboard examples. Additionally, there's a useful overview of common errors and their solutions.
The chapter continues with a look at Talend's reports, which build on the previous chapter's Talend ETL section. Details are given on how to download, install, configure, and run the reports. The section continues with step-by-step examples of how to create reports, followed by a useful overview of common errors and their solutions. Talend can run data quality checks against Hive data to identify data that needs attention.
This book provides a broad and practical introduction to big data, using Hadoop and its many tools. It gives comprehensive step-by-step instructions on how to download, install, configure, and run the various tools; additionally, common errors are explained and solutions proposed.
The book is easy to read, with helpful explanations, screenshots, listings, outputs, and a logical flow between the sections and chapters. There are good links between the chapters, and to websites containing further information. It also steps back and puts what’s being discussed into the larger context of big data. The book will certainly give you more confidence in the topic.
It should be noted that there is much more information available on all the tools discussed; however, this book is a great starting point, and it does an excellent job of introducing the many tools in an easily understandable manner. This is perhaps the book to read before the more specific or advanced books (e.g. "Hadoop: The Definitive Guide", the new edition of which is out in April 2015 - look out for its review soon).
If I have one concern, it relates to who will use the whole book. It contains both admin and development sections; however, large companies typically separate their admin and development teams.
If you want a useful working introduction and overview of the current state of big data, Hadoop and its associated tools, I can highly recommend this very instructive book.
Also see Reading Your Way Into Big Data, an article on Programmer's Bookshelf in which Ian Stirk provides a roadmap of the reading required to take you from novice to competent in areas relating to data science.