HBase Essentials
Author: Nishant Garg
Chapter 4 The HBase Architecture

This chapter opens with a look at data storage: a table is composed of regions (groups of related rows) that are distributed across the cluster and stored on Region Servers. The HBase Master process controls the distribution of regions across the Region Servers. Details are provided on how the client connects to ZooKeeper to access metadata, discovers the regions containing the rows of interest, and then accesses the appropriate Region Server. A useful diagram of this is given. Next, data replication is examined. Holding data on different clusters is useful for both disaster recovery (DR) and High Availability (HA). Securing HBase is briefly examined: there is no security by default, but there is optional support for authentication (via Kerberos) and authorization, and configuration details are given to enable these. The chapter ends with a look at using HBase with MapReduce (Hadoop's distributed processing model). There's a helpful discussion and diagram showing the integration of HBase and MapReduce, together with code examples.

This chapter provided a look at some of the more peripheral aspects of HBase development, especially in regard to administration. The examples showing the integration of HBase with MapReduce should prove useful to the hands-on developer.

Chapter 5 The HBase Advanced API

This chapter opens with a look at counters, which can record clicks, page views, and the like. HBase's built-in counters are easy to use and allow values to be viewed in real time. Code examples are provided for both single and multiple counters. The chapter next discusses coprocessors, which provide an easy way to run custom code on the Region Servers; custom features such as complex filters and secondary indexes can be built with them. An example is provided that simulates an RDBMS trigger. The chapter ends with a look at the administrative API. The various functions of the data definition API are outlined, covering tables and column families.
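To give a flavour of what that data definition and admin code looks like, here is a minimal sketch of my own (not taken from the book) using the HBaseAdmin API of that era; the table name "clicks" and column family "info" are placeholders, and it assumes a reachable cluster configured via hbase-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class AdminSketch {
    public static void main(String[] args) throws Exception {
        // Picks up quorum and port settings from hbase-site.xml on the classpath
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            if (admin.isMasterRunning()) {
                // Define a table with a single column family
                HTableDescriptor table =
                        new HTableDescriptor(TableName.valueOf("clicks"));
                table.addFamily(new HColumnDescriptor("info"));
                admin.createTable(table);

                // Tables must be disabled before they can be deleted
                admin.disableTable("clicks");
                admin.deleteTable("clicks");
            }
        } finally {
            admin.close();
        }
    }
}
```

Note that in HBase releases from 1.0 onwards HBaseAdmin is deprecated in favour of obtaining an Admin instance from a Connection, but the older form matches the API the book covers.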
Finally, the HBaseAdmin API is discussed. It provides various admin functions, including isMasterRunning, getConnection, and deleteTable.

This chapter provides helpful code for extending HBase processing via counters and coprocessors. The data definition and HBaseAdmin APIs are useful entry points for manipulation and admin tasks.

Chapter 6 HBase Clients

The chapter opens with a look at the HBase shell, the easiest way to access HBase, providing a command-line interface. Command types discussed include: data definition commands (e.g. create, alter), data manipulation commands (e.g. put, scan), and data-handling commands (e.g. balancer). Next, Kundera is discussed. This is a popular object mapper, and an easy way to use HBase within Java applications. Code examples are provided for CRUD operations, querying HBase, and using filters with a query. The chapter continues with a look at REST clients. REST allows clients to interact with objects over the Web. Code is given to start the REST service on the HBase master, followed by retrieving data in various formats (e.g. plain text, XML, JSON). Finally, sample code is given for a Java REST client. The chapter ends with a look at how the Hadoop components interact with HBase, in particular MapReduce batch processing. Hive is used as the example Hadoop component (others are similar), and code is provided to create an HBase-backed database.

This chapter provides a helpful overview of the various types of clients that can make use of HBase, with some useful sample code.

Chapter 7 HBase Administration

This chapter opens with a look at tasks associated with cluster management. These include: stopping and starting the HBase cluster, adding and removing nodes, upgrading the cluster, and importing/exporting HBase data. Next, cluster monitoring is examined. Hadoop provides the underlying metrics framework for many components, including HBase.
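As an aside of my own (not from the book): those metrics are exposed over each daemon's JMX JSON servlet, so a quick look at them, assuming a local cluster and the default info ports of HBase 1.x, might be:

```shell
# Region Server metrics (default info port 16030; 60030 on older releases)
curl -s 'http://localhost:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Server'

# Master metrics (default info port 16010; 60010 on older releases)
curl -s 'http://localhost:16010/jmx?qry=Hadoop:service=HBase,name=Master,sub=Server'
```

The qry parameter filters the output down to one metrics bean, which keeps the JSON manageable for ad hoc inspection or scripting.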
Various HBase components and their metrics are briefly examined, including the Master and the Region Servers. Metrics from several related tools are also briefly covered, including JVM metrics, Info metrics, Ganglia, and Nagios. The chapter continues with a brief look at performance tuning. Performance is typically measured in terms of response time, and is affected by many factors, including hardware, network, OS, JVM, and HDFS. Among the factors briefly discussed are compression, load balancing, and merging regions. The chapter ends with a brief look at troubleshooting. A useful list of tools is given, including: jps (shows which Java processes are running for the user), jmap (views a Java heap summary), and jstat (monitors the JVM). Finally, some common errors and their solutions are listed (e.g. DataStreamer Exception).

This chapter provided some interesting insights into various HBase admin functions. There's a useful section on performance recommendations for the different patterns of data access (e.g. heavy writes, random reads). Despite the chapter being useful, much more could have been discussed.

Conclusion

The book has well-written discussions which are generally easy to read, helpful diagrams, outputs, scripts, and brief practical walkthroughs. There are useful links between chapters. This book aims to get you started in programming with HBase; however, it deviates from this, containing as much administration detail as programming. Sometimes terms are used before they are defined (e.g. HMaster and ZooKeeper), suggesting you need some prior knowledge of Hadoop. It would have been helpful to list where to go next to extend your HBase knowledge.

This book will help you get up and running with HBase, show you how to use HBase from various clients, and give you an understanding of its internal structure. I can recommend it as a starter book.
Last Updated ( Saturday, 03 October 2015 )