Apache Hive Essentials

Chapter 6 Data Aggregation and Sampling

The chapter opens with a look at basic aggregation, using functions such as MIN, MAX, and AVG. It continues with the GROUP BY clause, providing example code and output.

The chapter then dives deeper into grouping sets, which implement advanced GROUP BY operations and produce the same results as multiple GROUP BY queries combined with UNION ALL. This is followed by ROLLUP (which creates aggregates for each level of the GROUP BY column hierarchy) and CUBE (which creates aggregates for all combinations of the GROUP BY columns). The HAVING clause filters the results of the GROUP BY. Throughout, helpful code and output are given.
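To illustrate the relationship between grouping sets, ROLLUP and CUBE, here is a minimal Python sketch (not taken from the book). It lists the grouping sets that ROLLUP and CUBE expand to, then sums a value once per grouping set, much like a UNION ALL of separate GROUP BY queries. The column names (dept, year, salary) and the sample rows are made up for illustration.

```python
from itertools import combinations

# Hypothetical sample data, invented for this illustration.
rows = [
    {"dept": "IT", "year": 2014, "salary": 100},
    {"dept": "IT", "year": 2015, "salary": 150},
    {"dept": "HR", "year": 2015, "salary": 80},
]


def rollup_sets(columns):
    """Grouping sets for ROLLUP: every prefix of the column list, down to the grand total."""
    return [tuple(columns[:i]) for i in range(len(columns), -1, -1)]


def cube_sets(columns):
    """Grouping sets for CUBE: every subset of the column list."""
    return [combo
            for size in range(len(columns), -1, -1)
            for combo in combinations(columns, size)]


def aggregate(rows, all_columns, grouping_sets, value_key):
    """Sum value_key once per grouping set, like a UNION ALL of GROUP BY queries."""
    results = []
    for group_cols in grouping_sets:
        totals = {}
        for row in rows:
            key = tuple(row[c] for c in group_cols)
            totals[key] = totals.get(key, 0) + row[value_key]
        for key, total in sorted(totals.items()):
            # Columns outside the current grouping set are reported as None
            # (they would appear as NULL in a Hive result).
            record = dict.fromkeys(all_columns)
            record.update(zip(group_cols, key))
            record["total"] = total
            results.append(record)
    return results


columns = ["dept", "year"]
for record in aggregate(rows, columns, rollup_sets(columns), "salary"):
    print(record)
```

For two columns, ROLLUP expands to three grouping sets (dept and year, dept alone, and the grand total), while CUBE adds the fourth (year alone).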

The chapter continues with a look at analytical functions, which are applied to a window of rows. Here the window is a subset of the data, defined using the OVER clause. Various functions are examined (without examples), including: RANK, ROW_NUMBER, NTILE, LEAD, and FIRST_VALUE.
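Since the book gives no examples here, the following Python sketch (my own, not the book's) emulates what ROW_NUMBER, RANK and LEAD return over a window partitioned by department and ordered by salary, highest first. The employee data and the function name are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample data, invented for this illustration.
employees = [
    {"name": "Ann", "dept": "IT", "salary": 300},
    {"name": "Bob", "dept": "IT", "salary": 300},
    {"name": "Cat", "dept": "IT", "salary": 200},
    {"name": "Dan", "dept": "HR", "salary": 150},
]


def window_over_dept(rows):
    """Emulate ROW_NUMBER, RANK and LEAD over (PARTITION BY dept ORDER BY salary DESC)."""
    output = []
    by_dept = sorted(rows, key=itemgetter("dept"))
    for _, group in groupby(by_dept, key=itemgetter("dept")):
        # Each partition is one window, ordered by salary, highest first.
        window = sorted(group, key=itemgetter("salary"), reverse=True)
        rank = 0
        for position, row in enumerate(window, start=1):
            # RANK repeats for ties and then skips ahead; ROW_NUMBER never repeats.
            if position == 1 or row["salary"] != window[position - 2]["salary"]:
                rank = position
            # LEAD reads the next row in the same window, or None (NULL) at the end.
            lead = window[position]["salary"] if position < len(window) else None
            output.append({**row, "row_number": position,
                           "rank": rank, "lead_salary": lead})
    return output


for row in window_over_dept(employees):
    print(row)
```

The two IT employees tied on salary share rank 1 but get row numbers 1 and 2, and the next IT employee gets rank 3, which is the difference between RANK and ROW_NUMBER.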

Since big data processing typically involves huge volumes, it is often useful to sample the data, giving a representative subset that shows trends and patterns. Various sampling methods are detailed, including the RAND function and the TABLESAMPLE clause.
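The two ideas behind these sampling methods can be sketched in Python (again, my own illustration rather than the book's code). The first function keeps each row with a fixed probability, in the spirit of filtering on RAND. The second keeps the rows whose hashed key falls into one of several buckets, in the spirit of TABLESAMPLE(BUCKET x OUT OF y ON col). The crc32 hash is only a stand-in, since Hive uses its own hash function, and the function names are hypothetical.

```python
import random
import zlib


def random_sample(rows, fraction, seed=42):
    """Keep each row with probability `fraction` (seeded so the run is repeatable)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def bucket_sample(rows, key, bucket, num_buckets):
    """Keep rows whose hashed key lands in the given bucket (numbered from 1)."""
    def bucket_of(value):
        # crc32 is an assumption made for this sketch, not Hive's hash function.
        return zlib.crc32(str(value).encode("utf-8")) % num_buckets + 1
    return [row for row in rows if bucket_of(row[key]) == bucket]


# Hypothetical table of 1,000 rows.
rows = [{"id": i} for i in range(1000)]

print(len(random_sample(rows, 0.1)))         # roughly a tenth of the rows
print(len(bucket_sample(rows, "id", 1, 4)))  # roughly a quarter of the rows
```

Bucket sampling is repeatable without a seed, because a given key always hashes to the same bucket, whereas random sampling generally returns different rows on each run unless it is seeded.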

This chapter provides a useful overview of data aggregation. However, some parts lack detail, and example usage of the analytical functions should have been included. The sampling section was interesting, but brief. If you’re familiar with relational databases (e.g. SQL Server), then you should be familiar with many of the analytical functions given here.

Chapter 7 Performance Considerations

Although Hive is concerned with processing massive amounts of data, performance is still important. The chapter opens with a look at the EXPLAIN statement, which provides details of how the query will be executed, without running it. Examining this output will show if a given index is being used or a given partition is being accessed, both of which can improve performance under certain conditions. But don’t be dismayed if you see table scans: with big data systems, table scans are often the preferred access mechanism. An EXPLAIN example is discussed.

Another helpful performance utility is the ANALYZE statement. This gathers statistics, which are used by the optimizer to help determine how the query should be executed. An ANALYZE example is discussed.

The chapter then discusses design considerations that should help improve performance, including the use of partitioning, buckets, and indexes. Next, data file optimizations are discussed. These include the use of various file formats, compression (which gives more data per read, so less IO is required), and replicating ‘hot’ data (i.e. often used data).

The chapter ends with a look at job and query optimizations. These include running Hadoop in local mode when the data to process is small, JVM reuse (again useful for small data volumes), parallel execution (where queries are non-dependent), and various join optimizations.

This was an interesting chapter. As you get deeper into Hive you see it has utilities and optimization techniques similar to those of relational databases (e.g. Hive’s EXPLAIN is similar to Oracle’s EXPLAIN PLAN or SQL Server’s SHOWPLAN). The optimizations discussed, and the code provided, should prove helpful in improving the performance of your Hive queries.

Chapter 8 Extensibility Considerations

This chapter discusses various ways of extending Hive’s functionality. The chapter opens with a section on creating functions (typically in Java) that can be used with HQL. A step-by-step example, with sample code, creates a simple Java function that converts a string to uppercase. This function is then used as part of a Hive query.

The chapter continues with a look at streaming, which provides another method of transforming data. It is possible to use the TRANSFORM clause to change the data in the stream. An example is provided, written in Python, which again converts a string to uppercase.
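As a rough idea of what such a streaming script looks like, here is a minimal Python sketch (not the book's code). With TRANSFORM, Hive passes each row to the script on standard input as a line of tab-separated fields, and reads the transformed rows back from standard output in the same form. The function names here are hypothetical.

```python
import sys


def upper_rows(lines):
    """Yield each tab-separated input row with every field converted to uppercase."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(field.upper() for field in fields) + "\n"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Stream rows from stdin to stdout, as a script run by Hive would."""
    stdout.writelines(upper_rows(stdin))


# A real script would end by calling main(); here the function is
# applied to two in-memory rows instead, so the sketch runs standalone.
print(list(upper_rows(["ann\tit\n", "bob\thr\n"])))
```

Saved under a hypothetical name such as upper.py and registered with ADD FILE, a script like this could be invoked with a query along the lines of SELECT TRANSFORM (name) USING 'python upper.py' AS (name) FROM employee, where the table and column names are again made up.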

The chapter ends with a look at SerDe, the Serializer/Deserializer technology that Hive uses to map processed data to columns in Hive’s tables. Various examples of the use of SerDe for different file formats are given.

This was an interesting chapter, providing details of some useful methods of extending Hive. Perhaps more examples could have been provided.

Chapter 9 Security Considerations

This brief chapter introduces Hive security in terms of authentication (verifying who you are), authorization (what you can do) and encryption. The authentication section looks at metastore server authentication and HiveServer2 authentication. For authorization, various modes are examined, specifically: legacy (the default, not so secure), storage-based (relies on HDFS), and SQL standard-based (this uses GRANT and REVOKE, and is the preferred authorization method). The encryption section shows an example of how the new HDFS encryption can be used.

This chapter briefly discusses various security considerations: authentication, authorization, and encryption, and how they relate to Hive.

Chapter 10 Working with Other Tools

This chapter outlines some of the more common Hadoop components that interact with Hive. The tools covered are:

JDBC / ODBC connector (common way for Hive to connect with other tools)
HBase (high performance NoSQL key-value store on Hadoop)
Hue (centralized user-friendly web-interface for many of Hadoop’s components)
HCatalog (metadata management system for Hadoop’s data)
ZooKeeper (centralized service for configuration management and synchronization)
Oozie (workflow and scheduler)

The chapter ends with a look at the Hive roadmap. For each version of Hive, it shows the release date and major features. Hive was first released in 2011, and the latest release given is version 1.0.0 in February 2015. There’s a very useful list of expected forthcoming functionality. It might have been useful to show how to get the version of Hive you’re running (e.g. hive --version).

This was a very helpful chapter, putting Hive into context with Hadoop’s other common tools. I was surprised Sqoop wasn’t mentioned (Hive is likely to get data from relational databases, and Sqoop is the main tool for this). The Hive roadmap was very insightful, showing that Hive is a relatively recent product, with plenty of activity (with regular updates a few times each year).

Conclusion

This book provides up-to-date detail on Hive, a very popular platform for storing and analyzing big data on Hadoop.

Most topics are explained in a very readable manner, although a few sections could do with more detail (e.g. transactions). Throughout, there are helpful explanations, screenshots, practical code examples, and inter-chapter references. Some links to websites are provided for further information.

This book is especially suitable for developers and data analysts starting out with Hive. Since it also contains advanced and up-to-date material, it is suitable for more advanced developers/analysts too. If you have a background in SQL, the book is even easier to understand.

There are very few books dedicated to Hive, and these tend to be out of date now (especially since Hive changes regularly). If you want an up-to-date, practical, wide-ranging review of Hive’s functionality, I highly recommend this book.

Last Updated ( Friday, 01 May 2015 )