Hadoop Essentials
Author: Shiva Achari
ISBN: 978-1784396688
Audience: Developers new to Hadoop
Chapter 4 Data Access Components – Hive and Pig

The section continues with a look at HiveQL, Hive’s SQL-like language. Data Definition Language (DDL) operations are given with brief examples, including creating a database and a table. This is followed by Data Manipulation Language (DML) operations, again with brief examples, including SELECT, joins, aggregations, built-in functions, and the use of custom user-defined functions (UDFs). This is a useful chapter, explaining the need for abstract languages to query big data. The architecture of both Pig and Hive is examined, and example code for querying the data is provided. I noted that installation details were given for Hive, but not for Pig. The chapter states that Hive can’t handle transactions; however, this feature has been available since version 0.14.0, released in November 2014.
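To give a flavour of the kind of DDL and DML the chapter covers, here is a minimal sketch of running HiveQL from Java over JDBC. It assumes a local HiveServer2 instance; the database, table, and connection details are hypothetical examples, not taken from the book.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQlSketch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {

                // DDL: create a database and a table (names are hypothetical).
                stmt.execute("CREATE DATABASE IF NOT EXISTS sales");
                stmt.execute("CREATE TABLE IF NOT EXISTS sales.orders "
                        + "(id INT, product STRING, amount DOUBLE) "
                        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

                // DML: a SELECT with an aggregation and a built-in function.
                ResultSet rs = stmt.executeQuery(
                        "SELECT product, ROUND(SUM(amount), 2) FROM sales.orders GROUP BY product");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
                }
            }
        }
    }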
Next, the HBase architecture is discussed. This includes the HMaster, the RegionServers (with their Regions), ZooKeeper for coordination, and HDFS as the underlying storage layer.
The chapter continues with a look at the HBase data model, including the logical components of the data model, ACID properties, and the CAP theorem. A helpful section on compaction and splitting is given; both are important for data management, since ideally data should be evenly distributed across the Regions and RegionServers. The chapter ends with a look at various performance tuning optimizations.
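As a flavour of the data model the chapter describes (sorted row keys, column families, and qualified cells), here is a minimal sketch using the HBase 1.x Java client API. The "users" table and "info" column family are hypothetical, not the book's own example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseDataModelSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Cells are addressed by row key, column family, and qualifier;
                // rows are stored sorted by row key across the Regions.
                Put put = new Put(Bytes.toBytes("user-001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
                table.put(put);

                Get get = new Get(Bytes.toBytes("user-001"));
                Result result = table.get(get);
                byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println("name = " + Bytes.toString(name));
            }
        }
    }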
This is another useful chapter, showing why HBase is needed, what it is, and its features, architecture, and components. Useful diagrams are given showing the interaction of HBase, Hadoop, ZooKeeper, RegionServer, HMaster, HDFS, and MapReduce. The performance tuning section says HBase is the most popular NoSQL technology; however, the database ranking site http://db-engines.com/en/ranking (for June 2015) shows that HBase is much less popular than either the MongoDB or Cassandra NoSQL databases. Also, some of the chapter’s explanations seemed muddled.
Sqoop is a popular tool for transferring data between relational databases and Hadoop. Sqoop creates map-only jobs to transfer the data in parallel, using multiple mappers. The earlier version of Sqoop had a simple architecture and various limitations, including: only a command-line interface was supported, monitoring and debugging were difficult, security required root access, and only JDBC-based connectors were supported. These limitations were overcome with Sqoop 2; additionally, Sqoop 2 integrates with Hive, HBase, and Oozie. The section ends with some very useful example import and export Sqoop commands; a small sketch along those lines is given below.

The chapter continues with a look at Flume, a popular tool for transferring data between various sources and destinations. Flume is distributed, reliable, scalable, and customizable. Flume, like many other Hadoop tools, follows a master/slave processing pattern. This chapter provides a useful overview of two popular tools used to transfer data between various sources and Hadoop. The example Sqoop commands should prove useful, as should the Flume examples. It should be remembered that there is much more to learn about these tools.
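As an illustration of the map-only parallel transfer described above, here is a minimal sketch that drives a Sqoop 1 import from Java via Sqoop.runTool (assuming Sqoop 1.4.x on the classpath). The connection string, table, and mapper count are hypothetical; in practice the equivalent command-line invocation is more common.

    import org.apache.sqoop.Sqoop;

    public class SqoopImportSketch {
        public static void main(String[] args) {
            // Hypothetical source database and HDFS target directory.
            String[] importArgs = {
                "import",
                "--connect", "jdbc:mysql://localhost/sales",
                "--username", "etl",
                "--table", "orders",
                "--target-dir", "/data/sales/orders",
                // Sqoop runs a map-only job; this sets the number of parallel mappers.
                "--num-mappers", "4"
            };
            System.exit(Sqoop.runTool(importArgs));
        }
    }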
The chapter opens with a look at Storm, which processes stream data quickly and is highly scalable, fault tolerant, and reliable. The physical architecture of Storm is examined; again, a master/slave pattern is followed. Storm’s physical components are: Nimbus (the master, which assigns and tracks work), the Supervisors (the slaves, which run the worker processes), and ZooKeeper (which coordinates the cluster).
Next, the data architecture of Storm is examined, the components being: spouts (the sources of the data stream), bolts (the units of processing), and topologies (the graphs of spouts and bolts that define a job); a minimal sketch of how these fit together is given below.
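This sketch mirrors the chapter's example as described next (a number-generating spout feeding a prime-checking bolt). It uses the Storm 0.9.x (backtype.storm) API; all names and structure here are illustrative, not the book's actual code.

    import java.util.Map;
    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class PrimeTopology {
        // Spout: emits an unbounded stream of increasing integers.
        public static class NumberSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private int current = 1;
            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
            }
            public void nextTuple() {
                collector.emit(new Values(current++));
            }
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("number"));
            }
        }

        // Bolt: decides whether each incoming number is prime.
        public static class PrimeBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) {
                int n = input.getIntegerByField("number");
                boolean prime = n > 1;
                for (int i = 2; i * i <= n && prime; i++) {
                    if (n % i == 0) prime = false;
                }
                if (prime) System.out.println(n + " is prime");
            }
            public void declareOutputFields(OutputFieldsDeclarer declarer) { }
        }

        // Topology: wires the spout to the bolt and submits the job locally.
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("numbers", new NumberSpout());
            builder.setBolt("primes", new PrimeBolt()).shuffleGrouping("numbers");
            new LocalCluster().submitTopology("prime-topology", new Config(), builder.createTopology());
        }
    }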
The section ends with helpful example Java code along these lines: a Spout (generating numbers), a Bolt (deciding whether each number is prime), and a Topology (configuring the spouts and bolts and submitting the job). The chapter then takes a look at Spark, a popular in-memory distributed processing framework. Spark is very popular for streaming and analytics, and is used for fast interactive queries. Processing with Spark is much faster than comparable MapReduce jobs (often cited as up to 100 times faster). Additionally, Spark handles iterative-type jobs well, something MapReduce is poorly suited to. The section continues with a look at the various Spark libraries that take advantage of the Spark core, including: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing).
Spark’s architecture is discussed next, being based on Resilient Distributed Datasets (RDDs). Spark computations are performed lazily, allowing the Directed Acyclic Graph (DAG) engine to eliminate and optimize some steps. The section then looks at the two kinds of operations in Spark, namely: transformations (lazy operations, such as map and filter, that define a new RDD) and actions (operations, such as reduce and count, that trigger the actual computation); a short sketch follows.
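Here is a minimal sketch of lazy transformations followed by an action, using Spark’s Java API (assuming Spark 1.x with Java 8 and a local master; this is illustrative, not the book's code).

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class RddOperations {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("rdd-ops").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

            // Transformations are lazy: nothing executes here, which lets the
            // DAG engine combine and optimize the steps before running them.
            JavaRDD<Integer> squaredEvens = numbers
                    .filter(n -> n % 2 == 0)
                    .map(n -> n * n);

            // reduce() is an action: it triggers execution of the whole lineage.
            int sum = squaredEvens.reduce((a, b) -> a + b);
            System.out.println("Sum of squared evens: " + sum); // 4 + 16 + 36 = 56

            sc.stop();
        }
    }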
The chapter ends with some VERY brief Spark example code, in both Scala and Java. This chapter provides a useful overview of stream and real-time processing using Storm and Spark. In both cases, the architecture, components, and advantages of the technologies were discussed.
The book should prove useful to developers wanting to know more about Hadoop and its major associated technologies. The book provides a helpful overview of Hadoop, HDFS, MapReduce, YARN, Hive, Pig, HBase, Sqoop, Flume, Storm, and Spark. While not comprehensive (e.g. Impala and Hue are not discussed), it does cover many of the popular components. The English grammar in some sections is substandard, making the book awkward to read; an editor with a good understanding of English would improve the book’s readability. Some sentences are illogical, e.g. “Hadoop is primarily designed for batch processing and for Lambda Architecture systems.” But the Lambda Architecture itself includes both batch and stream processing! Additionally, some sections seem muddled, probably amplified by the bad grammar and illogical thought. Overall, if you can get past these problems, this is a useful book, wide in scope and quite detailed for a short book.
Last Updated: Wednesday, 10 June 2015