Apache Hive Essentials
Author: Dayong Du
Reviewer: Ian Stirk
This book aims to introduce you to a popular platform for storing and analyzing big data on Hadoop. How does it fare?

Increasing amounts of data are being created, and there’s a need to store and process this data to gain competitive advantage. Hive is a popular platform for this work on Hadoop, largely because it uses a SQL-like syntax that is familiar to many people. With plenty of built-in functionality, big data analysis can be done in Hive without advanced coding skills.

The book is aimed at both the beginner and the more advanced audience (data analysts, developers, and users). Some previous experience of SQL and databases is advantageous.
Chapter 1 Overview of Big Data and Hive

The chapter opens with a brief overview of the history of data processing, covering batch and online processing, relational databases, and the internet. The internet has led to a massive rise in the amount of data being created, requiring new approaches to processing. This big data can be described in terms of various attributes, including volume, velocity, and variety. Big data tends to be processed on relatively cheap commodity hardware, using distributed processing. Hadoop is a popular platform for big data processing, and the chapter discusses its major components.
Having described how we arrived at big data and Hadoop, the chapter proceeds with an overview of Hive. Hive allows you to issue queries against petabytes of data, using its Hive Query Language (HQL), which is similar to SQL. Hive gives a table structure to data held in HDFS. Using Hive allows simpler data processing, compared with similar code written in Java.

This chapter provides a helpful background on how we arrived at today’s big data and Hadoop platform. An overview of Hadoop and its components is given, together with a very helpful diagram of the Hadoop ecosystem (e.g. HDFS, HBase, Sqoop, and Impala). A useful overview of Hive is provided, highlighting its purpose and advantages.
Chapter 2 Setting Up the Hive Environment

This chapter opens with step-by-step guidance on installing Hive from the Apache website. The prerequisites are described but not detailed, followed by the installation steps. The chapter continues with a look at installing Hive from vendor packages. This is often the preferred route since it can save time, and you can be confident the various Hadoop components will work together. The example given shows the step-by-step installation of Hive from Cloudera Distributed Hadoop (CDH).

The chapter proceeds with a look at using Hive from the command line and Beeline. Useful example code is provided for both. The chapter ends with a look at the Hive integrated development environment, which can plug into the Oracle IDE. Additionally, it’s noted that Hue (Hadoop User Experience) can provide a user-friendly web interface to Hive.

It’s important the installation steps are detailed, since without them you’re not going to be able to use the product. This chapter provides very helpful step-by-step instructions for installing Hive via the common methods. Perhaps more emphasis could have been given to Hue, since it provides a centralized, user-friendly web interface to most Hadoop components.
Chapter 3 Data Definition and Description

This chapter opens with a look at the datatypes Hive supports. There’s a helpful grid showing the datatypes, descriptions, and example usage. The datatypes are divided into primitives (e.g. INT, VARCHAR) and complex types (e.g. ARRAY, MAP). Helpful example datatype usage is provided.

The chapter continues with a brief look at Hive’s Data Definition Language (DDL), which allows the creation, deletion, and alteration of schema objects (e.g. tables). Useful example code is provided, including creating a database. The use of SHOW and DESCRIBE to view various objects is discussed.

Hive internal and external tables are examined next. Hive tables are similar to tables in a relational database (e.g. Oracle), whereas external tables get their data via the LOCATION property. Useful example code is provided, including creating a table from a SELECT, creating a table with a Common Table Expression (CTE), and creating an empty table. Further examples include COUNT(*), DROP TABLE, and TRUNCATE TABLE.

The chapter then looks at the use of partitions. By default, a query will scan all of the table’s data. If partitions are used, only the relevant partitions are read, often giving improved performance. Typical columns to partition on are date, region, and department. Buckets allow a further breakdown via a hash of the relevant column (e.g. employee_id).

The chapter ends with a look at views. These contain query definitions and are not populated until the query is run. Like their relational database counterparts, Hive views allow complexity to be hidden, and allow filtering or flattening of data.

This chapter provides a very useful discussion of the Hive database, in terms of tables, partitions, buckets, datatypes, and views. Helpful example code is provided throughout. If you’ve used other databases and SQL, then you’ll understand the concepts in this chapter more easily.
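To make the partition and bucket ideas concrete, here is a minimal sketch in plain Python rather than Hive itself. The sample rows, the bucket count, and the helper names (`partition_by`, `bucket_of`, `scan`) are all hypothetical, and the modulo on `employee_id` is only a stand-in for whatever hash Hive applies. It is not taken from the book.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen for illustration

# Hypothetical sample data; not from the book.
employees = [
    {"employee_id": 101, "name": "Ann", "department": "sales"},
    {"employee_id": 102, "name": "Bob", "department": "hr"},
    {"employee_id": 106, "name": "Cat", "department": "sales"},
    {"employee_id": 108, "name": "Dan", "department": "it"},
]


def partition_by(rows, column):
    """Group rows by a column value, much as a partitioned table keeps
    each partition's data in its own directory."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def bucket_of(employee_id, num_buckets=NUM_BUCKETS):
    """Assign a row to a bucket: a hash of the column modulo the bucket
    count. Using the integer itself as the hash is an assumption."""
    return employee_id % num_buckets


def scan(partitions, department):
    """A query filtering on the partition column touches one partition only."""
    return partitions.get(department, [])


partitions = partition_by(employees, "department")
sales_rows = scan(partitions, "sales")  # the hr and it rows are never read
sales_buckets = {row["name"]: bucket_of(row["employee_id"]) for row in sales_rows}
```

The point of the sketch is the access pattern: a filter on `department` reads a single group of rows instead of the whole table, and the bucket number narrows things further within that group.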
Chapter 4 Data Selection and Scope

This chapter is primarily concerned with selecting data. SELECT is examined in terms of DISTINCT (no duplicate data), FROM (tables to select data from), WHERE (filtering data), and LIMIT (limit the number of rows returned). Various conditions are examined that stop a MapReduce job from starting (MapReduce jobs split the work over several boxes, which at times may be inefficient). Plenty of example SELECT code is given.

The chapter continues with a look at joining tables. Hive supports INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, and CROSS JOIN. Example code and output for each join type is provided. There’s a useful grid that describes each join.

Hive supports UNION ALL (i.e. keep duplicate rows), but not UNION (no duplicates), INTERSECT (data in both sets), or MINUS (data in one set but not the other). Helpful alternate code is provided to simulate UNION, INTERSECT, and MINUS functionality.

The chapter shows the most common usage of Hive, which is to retrieve data. The various ways of limiting the data (WHERE, LIMIT) are examined, as are the various ways of joining tables. Useful workaround code is supplied for UNION, INTERSECT, and MINUS. If you’re familiar with SQL, the ideas in this chapter should be familiar.
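The book’s own workaround code is not reproduced in this review, but the usual idea can be sketched as follows. This is a minimal illustration that runs standard SQL through Python’s built-in sqlite3 module as a stand-in for Hive; the tables `team_a` and `team_b` and their contents are hypothetical, and the queries are common emulations rather than necessarily the ones the author uses.

```python
import sqlite3

# SQLite is only a stand-in here so the queries can be run locally.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE team_a (name TEXT)")
conn.execute("CREATE TABLE team_b (name TEXT)")
conn.executemany("INSERT INTO team_a VALUES (?)",
                 [("Ann",), ("Bob",), ("Bob",), ("Cat",)])
conn.executemany("INSERT INTO team_b VALUES (?)", [("Bob",), ("Dan",)])

# UNION emulated as UNION ALL followed by DISTINCT over the combined rows.
UNION_SQL = """
SELECT DISTINCT name
FROM (SELECT name FROM team_a UNION ALL SELECT name FROM team_b) u
"""

# INTERSECT emulated as an inner join: keep names present in both tables.
INTERSECT_SQL = """
SELECT DISTINCT a.name
FROM team_a a JOIN team_b b ON a.name = b.name
"""

# MINUS emulated as a left outer join that keeps only unmatched rows.
MINUS_SQL = """
SELECT DISTINCT a.name
FROM team_a a LEFT OUTER JOIN team_b b ON a.name = b.name
WHERE b.name IS NULL
"""


def run(sql):
    """Execute a query and return the single column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


union_rows = run(UNION_SQL)
intersect_rows = run(INTERSECT_SQL)
minus_rows = run(MINUS_SQL)
```

One caveat with the join-based versions: rows whose key is NULL never match in a join, so they behave differently than they would under a true set operator.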
Chapter 5 Data Manipulation

This chapter opens with a look at loading data into Hive, which is typically done with data held on HDFS. The LOAD keyword loads the data, and the OVERWRITE keyword allows existing data to be overwritten. Additionally, the INSERT command can be used to load data from an existing Hive table or partition into a new or existing table. Helpful example code is provided for both methods of loading data.

EXPORT and IMPORT can be used to migrate Hive data between systems (e.g. production to development), or for backup purposes. EXPORT writes a table’s data and metadata to HDFS, from where it can be copied to other clusters via the distributed copy (distcp) utility. The IMPORT command does the reverse of EXPORT. Again, useful example code is provided.

The chapter continues with a look at the various ways of ordering data. The commands examined include ORDER BY, SORT BY, DISTRIBUTE BY, and CLUSTER BY. In each case example code is given.

Hive includes a number of operators and functions to help with coding. To see a list of these functions you can run SHOW FUNCTIONS. To see the interface for a given function you can run DESCRIBE FUNCTION functionName. Functions can be grouped as: maths, collections, type conversion, dates, conditionals, strings, aggregates, table-generating, and customized. In all cases example code and outputs are given. Additionally, useful links for further information are provided.

The chapter ends with a brief look at transactions (and ACID support) and the ability to DELETE and UPDATE data. These welcome features exist as part of the later versions of Hive.

This was an interesting chapter. The ability to import and export data from Hive is an important feature, as is updating and deleting data. I’m not sure why the various ways of sorting data and the use of functions and operators were discussed in this chapter; I suspect they belong to the previous chapter.
Some useful code for creating partitions based on dynamic values is provided. Sqoop is the popular tool used to move data from relational databases into both HDFS and Hive directly. Since Sqoop is likely to be a major method of getting data into Hive, I would have expected some detail of its usage to have been included here.
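The four ordering commands listed for Chapter 5 are easy to confuse, so a small simulation may help. This is a sketch in plain Python, not Hive code and not from the book: the reducer count, the sample rows, and the `reducer_for` function (a simple stable stand-in for Hive’s hash partitioning) are all assumptions made for illustration.

```python
from collections import defaultdict

NUM_REDUCERS = 2  # hypothetical reducer count

# Hypothetical (department, salary) rows; not from the book.
rows = [("sales", 300), ("hr", 100), ("sales", 50), ("it", 200), ("hr", 400)]


def reducer_for(key, num_reducers=NUM_REDUCERS):
    """Stable stand-in for a hash partitioner: same key, same reducer."""
    return sum(str(key).encode()) % num_reducers


def order_by(rows, col):
    """ORDER BY: one total ordering over every row."""
    return sorted(rows, key=lambda r: r[col])


def distribute_by(rows, col):
    """DISTRIBUTE BY: route rows to reducers by key, with no sorting."""
    reducers = defaultdict(list)
    for row in rows:
        reducers[reducer_for(row[col])].append(row)
    return dict(reducers)


def sort_by(reducer_inputs, col):
    """SORT BY: each reducer sorts only its own rows, so the result is
    ordered within a reducer but not across reducers."""
    return {rid: sorted(part, key=lambda r: r[col])
            for rid, part in reducer_inputs.items()}


def cluster_by(rows, col):
    """CLUSTER BY: DISTRIBUTE BY and SORT BY on the same column."""
    return sort_by(distribute_by(rows, col), col)


globally_sorted = order_by(rows, 1)
clustered = cluster_by(rows, 0)
```

In this model `globally_sorted` is a single list in salary order, while `clustered` is one sorted list per reducer in which all rows for a given department land on the same reducer.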
Last Updated (Friday, 01 May 2015)