Hadoop Essentials
Author: Shiva Achari
ISBN: 978-1784396688
Audience: Developers new to Hadoop
Chapter 4 Data Access Components – Hive and Pig

The section continues with a look at HiveQL, Hive’s SQL-like language. Data Definition Language (DDL) operations are given with brief examples, including creating a database and a table. This is followed by Data Manipulation Language (DML) operations, again with brief examples, including SELECT, joins, aggregations, built-in functions, and the use of custom user-defined functions (UDFs). This is a useful chapter, explaining the need for abstract languages to query big data. The architecture of both Pig and Hive is examined, and example code for querying the data is provided. I noted that installation details were given for Hive, but not for Pig. The chapter states that Hive can’t handle transactions; however, this feature has been available since version 0.14.0, released in November 2014.
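To give a flavour of the kind of DDL and DML the chapter covers, here is a minimal sketch of running HiveQL from Java over JDBC. It assumes a local HiveServer2 instance; the database, table, and connection details are hypothetical examples, not taken from the book.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQlSketch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {

                // DDL: create a database and a table (names are hypothetical).
                stmt.execute("CREATE DATABASE IF NOT EXISTS sales");
                stmt.execute("CREATE TABLE IF NOT EXISTS sales.orders "
                        + "(id INT, product STRING, amount DOUBLE) "
                        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

                // DML: a SELECT with an aggregation and a built-in function.
                ResultSet rs = stmt.executeQuery(
                        "SELECT product, ROUND(SUM(amount), 2) FROM sales.orders GROUP BY product");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
                }
            }
        }
    }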
Next, the HBase architecture is discussed. This includes the HMaster, the RegionServers (with their Regions), ZooKeeper for coordination, and HDFS as the underlying storage layer.
The chapter continues with a look at the HBase data model, including the logical components of the data model, ACID properties, and the CAP theorem. A helpful section on compaction and splitting is given; both are important for data management, since ideally data should be evenly distributed across the Regions and RegionServers. The chapter ends with a look at various performance tuning optimizations.
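As a flavour of the data model the chapter describes (sorted row keys, column families, and qualified cells), here is a minimal sketch using the HBase 1.x Java client API. The "users" table and "info" column family are hypothetical, not the book's own example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseDataModelSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Cells are addressed by row key, column family, and qualifier;
                // rows are stored sorted by row key across the Regions.
                Put put = new Put(Bytes.toBytes("user-001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
                table.put(put);

                Get get = new Get(Bytes.toBytes("user-001"));
                Result result = table.get(get);
                byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println("name = " + Bytes.toString(name));
            }
        }
    }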
This is another useful chapter, showing why HBase is needed, what it is, and its features, architecture, and components. Useful diagrams are given showing the interaction of HBase, Hadoop, ZooKeeper, RegionServer, HMaster, HDFS, and MapReduce. The performance tuning section says HBase is the most popular NoSQL technology; however, the database ranking site http://db-engines.com/en/ranking (for June 2015) shows that HBase is much less popular than either the MongoDB or Cassandra NoSQL databases. Also, some of the chapter’s explanations seemed muddled.
Sqoop is a popular tool for transferring data between relational databases and Hadoop. Sqoop creates map-only jobs to transfer the data in parallel, using multiple mappers. The earlier version of Sqoop had a simple architecture and various limitations, including: only a command-line interface was supported, monitoring and debugging were difficult, security required root access, and only JDBC-based connectors were supported. These limitations were overcome with Sqoop 2; additionally, Sqoop 2 integrates with Hive, HBase, and Oozie. The section ends with some very useful example import and export Sqoop commands; a small sketch along those lines is given below.

The chapter continues with a look at Flume, a popular tool for transferring data between various sources and destinations. Flume is distributed, reliable, scalable, and customizable. Flume, like many other Hadoop tools, follows a master/slave processing pattern. This chapter provides a useful overview of two popular tools used to transfer data between various sources and Hadoop. The example Sqoop commands should prove useful, as should the Flume examples. It should be remembered that there is much more to learn about these tools.
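As an illustration of the map-only parallel transfer described above, here is a minimal sketch that drives a Sqoop 1 import from Java via Sqoop.runTool (assuming Sqoop 1.4.x on the classpath). The connection string, table, and mapper count are hypothetical; in practice the equivalent command-line invocation is more common.

    import org.apache.sqoop.Sqoop;

    public class SqoopImportSketch {
        public static void main(String[] args) {
            // Hypothetical source database and HDFS target directory.
            String[] importArgs = {
                "import",
                "--connect", "jdbc:mysql://localhost/sales",
                "--username", "etl",
                "--table", "orders",
                "--target-dir", "/data/sales/orders",
                // Sqoop runs a map-only job; this sets the number of parallel mappers.
                "--num-mappers", "4"
            };
            System.exit(Sqoop.runTool(importArgs));
        }
    }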
The chapter opens with a look at Storm, which processes stream data quickly and is highly scalable, fault tolerant, and reliable. The physical architecture of Storm is examined; again, a master/slave pattern is followed. Storm’s physical components are: Nimbus (the master, which assigns and tracks work), the Supervisors (the slaves, which run the worker processes), and ZooKeeper (which coordinates the cluster).
Next, the data architecture of Storm is examined, the components being: spouts (the sources of the data stream), bolts (the units of processing), and topologies (the graphs of spouts and bolts that define a job); a minimal sketch of how these fit together is given below.
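This sketch mirrors the chapter's example as described next (a number-generating spout feeding a prime-checking bolt). It uses the Storm 0.9.x (backtype.storm) API; all names and structure here are illustrative, not the book's actual code.

    import java.util.Map;
    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class PrimeTopology {
        // Spout: emits an unbounded stream of increasing integers.
        public static class NumberSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private int current = 1;
            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
            }
            public void nextTuple() {
                collector.emit(new Values(current++));
            }
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("number"));
            }
        }

        // Bolt: decides whether each incoming number is prime.
        public static class PrimeBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) {
                int n = input.getIntegerByField("number");
                boolean prime = n > 1;
                for (int i = 2; i * i <= n && prime; i++) {
                    if (n % i == 0) prime = false;
                }
                if (prime) System.out.println(n + " is prime");
            }
            public void declareOutputFields(OutputFieldsDeclarer declarer) { }
        }

        // Topology: wires the spout to the bolt and submits the job locally.
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("numbers", new NumberSpout());
            builder.setBolt("primes", new PrimeBolt()).shuffleGrouping("numbers");
            new LocalCluster().submitTopology("prime-topology", new Config(), builder.createTopology());
        }
    }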
The section ends with helpful example Java code along these lines: a Spout (generating numbers), a Bolt (deciding whether each number is prime), and a Topology (configuring the spouts and bolts and submitting the job). The chapter then takes a look at Spark, a popular in-memory distributed processing framework. Spark is very popular for streaming and analytics, and is used for fast interactive queries. Processing with Spark is much faster than comparable MapReduce jobs (often cited as up to 100 times faster). Additionally, Spark handles iterative-type jobs well, something MapReduce is poorly suited to. The section continues with a look at the various Spark libraries that take advantage of the Spark core, including: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing).
Spark’s architecture is discussed next, being based on Resilient Distributed Datasets (RDDs). Spark computations are performed lazily, allowing the Directed Acyclic Graph (DAG) engine to eliminate and optimize some steps. The section then looks at the two kinds of operations in Spark, namely: transformations (lazy operations, such as map and filter, that define a new RDD) and actions (operations, such as reduce and count, that trigger the actual computation); a short sketch follows.
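Here is a minimal sketch of lazy transformations followed by an action, using Spark’s Java API (assuming Spark 1.x with Java 8 and a local master; this is illustrative, not the book's code).

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class RddOperations {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("rdd-ops").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

            // Transformations are lazy: nothing executes here, which lets the
            // DAG engine combine and optimize the steps before running them.
            JavaRDD<Integer> squaredEvens = numbers
                    .filter(n -> n % 2 == 0)
                    .map(n -> n * n);

            // reduce() is an action: it triggers execution of the whole lineage.
            int sum = squaredEvens.reduce((a, b) -> a + b);
            System.out.println("Sum of squared evens: " + sum); // 4 + 16 + 36 = 56

            sc.stop();
        }
    }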
The chapter ends with some VERY brief Spark example code, in both Scala and Java. This chapter provides a useful overview of stream and real-time processing using Storm and Spark. In both cases, the architecture, components, and advantages of the technologies were discussed.
The book should prove useful to developers wanting to know more about Hadoop and its major associated technologies. The book provides a helpful overview of Hadoop, HDFS, MapReduce, YARN, Hive, Pig, HBase, Sqoop, Flume, Storm, and Spark. While not comprehensive (e.g. Impala and Hue are not discussed), it does cover many of the popular components. The English grammar in some sections is substandard, making the book awkward to read; an editor with a good understanding of English would improve the book’s readability. Some sentences are illogical, e.g. “Hadoop is primarily designed for batch processing and for Lambda Architecture systems.” But the Lambda Architecture itself includes both batch and stream processing! Additionally, some sections seem muddled, probably amplified by the bad grammar and illogical thought. Overall, if you can get past these problems, this is a useful book, wide in scope and quite detailed for a short book.
Last Updated: Wednesday, 10 June 2015