Author: Nishant Garg
Publisher: Packt Publishing
Audience: HBase programmer novices
Reviewer: Ian Stirk
This book aims to get you started with programming HBase; how does it fare?
Hadoop is the most popular platform for processing big data, and HBase is a NoSQL database that runs on top of Hadoop. The book is aimed at software developers who have no previous experience of HBase and who want a hands-on approach.
Below is a chapter-by-chapter exploration of the topics covered.
Chapter 1 Introducing HBase
This chapter opens with the observation that relational database management systems (RDBMS) are unable to scale to process huge amounts of data in a timely manner. Instead, this is the realm of big data databases such as HBase.
The chapter defines big data in terms of huge volumes, high velocity, and varied types of data. Currently, 20 terabytes of data is created every second, and this figure is increasing. The foundations of big data processing were established by Google in its papers on the Google File System (GFS) storage layer and the MapReduce processing model. Google also published details of Bigtable, a scalable database able to handle millions of columns and billions of rows; HBase is based on Bigtable.
The chapter continues with a look at how to install HBase together with its prerequisites. The various modes of running HBase are discussed: local, pseudo-distributed, and fully distributed.
The various HBase cluster components are briefly described, namely:
HBase Master – coordinates HBase cluster/admin operations
ZooKeeper – a centralized service for distributed synchronization
Region Servers – store horizontal rows as regions
The chapter ends with a look at the HBase shell, which provides an environment for running interactive commands or batches; it's a useful tool for experimenting with HBase. Various commands are introduced, including: status, create, list, get, put, scan, delete, describe, and drop.
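As a hedged illustration of the kind of shell session the chapter walks through (the table and column names here are invented for the example; note that a table must be disabled before it can be dropped):

```
hbase> status
hbase> create 'employee', 'personal'
hbase> list
hbase> put 'employee', 'row1', 'personal:name', 'Alice'
hbase> get 'employee', 'row1'
hbase> scan 'employee'
hbase> describe 'employee'
hbase> delete 'employee', 'row1', 'personal:name'
hbase> disable 'employee'
hbase> drop 'employee'
```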
This chapter provides a useful introduction to HBase in terms of the history of big data, the need to process huge amounts of data, and an outline of the HBase components. Useful installation instructions are provided, together with an HBase shell tutorial for trying out HBase.
This chapter provides well-written discussions that are generally easy to read, with helpful diagrams, outputs, scripts, and brief practical walkthroughs. There are useful links to other chapters. Sometimes, terms are used before they are defined (e.g. HMaster and ZooKeeper). These traits apply to the whole of the book.
Chapter 2 Defining the Schema
HBase ties related columns together via column-families; this chapter discusses the concepts of column-family databases.
In HBase, each row can store a different number of columns, with different datatypes. Each row is identified by a unique rowkey. Columns are organised into column-families, which are physically stored in an HFile as key-value pairs. Contiguous rows of data are stored in regions on the Region Servers. The write-ahead log (WAL) and the MemStore are briefly discussed.
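The key-value layout described above can be sketched as follows. This is a conceptual model only, not the actual HFile on-disk format; the table, family, and qualifier names are invented for the example.

```python
# Conceptual sketch: an HBase cell is keyed by
# (rowkey, column-family, qualifier, timestamp) and stores a value.
# HFiles keep these key-value pairs sorted by that key, so all columns
# of one column-family for a given row sit together on disk.

table = {}  # sorted at read time; real HFiles are written pre-sorted

def put(rowkey, family, qualifier, timestamp, value):
    table[(rowkey, family, qualifier, timestamp)] = value

put("row1", "personal", "name", 1, b"Alice")
put("row1", "personal", "city", 1, b"Oslo")
put("row2", "personal", "name", 1, b"Bob")

# Reading a whole row = scanning the contiguous key range for that rowkey
row1 = {k: v for k, v in sorted(table.items()) if k[0] == "row1"}
print(sorted(k[2] for k in row1))  # -> ['city', 'name']
```

Because the sort key starts with the rowkey and column-family, rows with varying column counts cost nothing extra: absent columns simply have no key-value pair.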
The chapter continues with a brief look at table design considerations, including: rowkey structure, number of historic values to keep, and the number of column-families. Next, factors that affect performance are discussed, including: columns in a column-family are stored together on disk and columns with different access paths should be in different column-families.
The chapter ends with a look at accessing HBase data via various clients (e.g. shell, REST, Thrift, Java API). Code is provided to create a connection to HBase via the Java API. This is followed with HBase API code for the various CRUD (Create, Read, Update, and Delete) operations.
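The CRUD semantics the chapter covers can be sketched with a small in-memory model. This is not the real Java client API (the book uses classes such as Put, Get, and Delete); it is a hedged illustration of the behaviour: a put adds a new timestamped version, a get returns the newest version, and a delete removes the cell.

```python
# Minimal in-memory model of HBase CRUD semantics (illustrative only).
class MiniTable:
    def __init__(self):
        # (rowkey, "family:qualifier") -> {timestamp: value}
        self.cells = {}

    def put(self, rowkey, column, value, ts):
        # Create and Update are the same operation: add a new version
        self.cells.setdefault((rowkey, column), {})[ts] = value

    def get(self, rowkey, column):
        # Read: by default the newest version wins
        versions = self.cells.get((rowkey, column), {})
        return versions[max(versions)] if versions else None

    def delete(self, rowkey, column):
        # Delete: remove the cell (real HBase writes a tombstone marker)
        self.cells.pop((rowkey, column), None)

t = MiniTable()
t.put("row1", "cf:name", "Alice", ts=1)
t.put("row1", "cf:name", "Alicia", ts=2)  # update = newer version
print(t.get("row1", "cf:name"))  # -> Alicia
t.delete("row1", "cf:name")
print(t.get("row1", "cf:name"))  # -> None
```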
This chapter provides a useful overview of the structure of an HBase table, i.e. rows, column-families, rowkey, and cells. There's a discussion about the factors to consider when creating HBase tables.
While the discussions are helpful, much more could have been said and illustrated with practical examples. The diagram relating to the MemStore and Block Cache is incorrectly annotated (the items sales, customer, and orders should read Customer, Address, and Orders).
Chapter 3 Advanced Data Modeling
This chapter extends the previous one. It opens with a look at keys, with the rowkey giving unique access to the entire row. Although the data can be viewed logically as a table, it is physically implemented as a linear set of cells, where each cell holds the rowkey, column, timestamp, and data. Since sequential rowkeys concentrate accesses on a single region, leading to performance hotspots, suggestions are given to overcome this (e.g. salting and hashing the rowkey).
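The salting idea can be sketched as follows, assuming (illustratively) timestamp-style rowkeys and a fixed bucket count; the key format and bucket count are invented for the example, not taken from the book.

```python
import hashlib

# Hedged sketch of rowkey salting: sequential keys (e.g. timestamps) would
# all land in one region; prefixing a small hash-derived "salt" spreads
# writes across N buckets, and hence across regions.
N_BUCKETS = 4

def salted_key(rowkey: str) -> str:
    # Derive the salt deterministically from the key itself, so reads
    # can recompute the same prefix.
    salt = int(hashlib.md5(rowkey.encode()).hexdigest(), 16) % N_BUCKETS
    return f"{salt:02d}-{rowkey}"

keys = [f"2014-10-01T00:00:{i:02d}" for i in range(8)]
salted = [salted_key(k) for k in keys]
buckets = {s.split("-", 1)[0] for s in salted}
print(sorted(buckets))  # sequential keys now map to several bucket prefixes
```

The trade-off, as the chapter's suggestions imply, is that range scans over the original key order now require one scan per salt bucket.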
The chapter next discusses HBase table scans, which are useful for accessing a complete set of records matching specific criteria by applying filters. The Scan class and its various overloads are briefly discussed. It is possible to restrict the results by column-family, time range, version, and start/stop rows.
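The scan behaviour described above can be sketched in a few lines; the row data and filter predicate here are invented for the example, and this models only the semantics, not the Scan API itself.

```python
# Hedged sketch of an HBase scan: rows are kept sorted by rowkey, a scan
# walks the range [start_row, stop_row), and an optional filter drops
# non-matching rows.
rows = {
    "user-001": {"cf:city": "Oslo"},
    "user-002": {"cf:city": "Paris"},
    "user-003": {"cf:city": "Oslo"},
    "user-004": {"cf:city": "Rome"},
}

def scan(table, start_row, stop_row, row_filter=None):
    for key in sorted(table):            # HBase stores rows in rowkey order
        if start_row <= key < stop_row:  # stop row is exclusive, as in Scan
            if row_filter is None or row_filter(table[key]):
                yield key

hits = list(scan(rows, "user-001", "user-004",
                 row_filter=lambda r: r["cf:city"] == "Oslo"))
print(hits)  # -> ['user-001', 'user-003']
```

Because rows are sorted, the start/stop bounds let the scan skip everything outside the range rather than examining the whole table.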
The chapter ends with a look at implementing filters, and useful walkthroughs are provided for utility, comparison and custom filters.
This chapter discusses table access considerations. A deeper understanding of the rowkey is given, together with table access via scans. Filtering of the results is discussed, with code provided.