Getting Started with Impala
Author: John Russell
ISBN: 9781491905779
Reviewer: Ian Stirk
This book aims to get you up and running with Impala, a tool for quickly querying Hadoop's big data. How does it fare?

It is targeted at analysts, developers and users who are new to querying Hadoop's big data. Experience of SQL and databases would be useful, but is not mandatory. This is a short book, containing 110 pages split into five chapters.

Chapter 1 Why Impala?

The chapter opens with a look at Impala's place within Hadoop's ecosystem of components. Big data systems store massive amounts of data, and querying this data is typically a batch process. Impala can often query this data within seconds, or minutes at most, giving a near 'interactive' response. Additionally, compared with traditional big data approaches, such as writing Java MapReduce jobs, Impala provides a much faster development cycle.

The chapter continues with a look at how Impala readily integrates with Hadoop's components: security, metadata, storage (HDFS), and file formats. Impala can perform complex analysis using SQL queries that calculate, filter, sort and format.

Next, it's suggested you may need to change the way you work with big data. Previously, queries ran in batch mode, which often forced a context switch as you moved back and forth between other tasks while waiting for results. This changes with Impala, which often provides a near-interactive experience.

The chapter ends with a look at Impala's flexibility: it can work with raw data files in many different formats. This means there are fewer steps than in traditional processing (i.e. no need for filtering or conversion of data), so arriving at solutions is faster.

This chapter provides a useful introduction to Impala, describing what it is, what it's used for, and its advantages: quick and easy development, fast queries, and integration with existing Hadoop components.

Chapter 2 Getting Up and Running with Impala

This chapter opens with the various ways of installing Impala, giving the advantages of each. The methods are:
The chapter continues with a look at connecting to Impala. The book concentrates on connecting via the impala-shell. Examples are provided of connecting to localhost (the default) and to a remote box; additionally, the use of a non-default port is discussed.

The chapter ends with some sample SQL queries to run. The initial queries have no FROM clause, so they don't access any tables; such queries are especially useful for checking that the installation and configuration of Impala are correct. SQL is also provided to create tables and insert data into a table (a sketch of these first steps follows below).

This chapter provides practical detail on installing and connecting to Impala, via the various setup methods, and contains helpful first SQL queries to get you started. There are helpful links to Impala's discussion forums, mailing list, and community resources. I was surprised the use of Hue to run Impala queries wasn't examined, since this tool provides a centralized, user-friendly web interface to many of Hadoop's tools, a boon to all users.
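As a flavour of those first steps, here is a minimal sketch of an initial impala-shell session. This is my own illustration rather than code from the book, and the host, table and column names are hypothetical:

```sql
-- Start the shell from a terminal. With no arguments it connects to
-- localhost on the default port (21000); use -i for a remote box or a
-- non-default port, e.g.:
--   impala-shell
--   impala-shell -i host.example.com:21000
-- (the host name above is illustrative)

-- A query with no FROM clause reads no tables, so it is a quick check
-- that the installation and configuration are correct:
SELECT 1 + 1;
SELECT now();

-- Create a small table and insert a row into it:
CREATE TABLE greetings (id INT, message STRING);
INSERT INTO greetings VALUES (1, 'hello impala');
SELECT * FROM greetings;
```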
Chapter 3 Impala for the Database Developer

The chapter opens with a look at Impala's SQL language, which contains familiar topics like joins, views, and aggregations. There's a useful link to Impala's SQL Language Reference. Various data types are briefly discussed, and the EXPLAIN statement can be used to show how the SQL is executed. Various limitations in the language are highlighted (e.g. no indexes or constraints), though if you come from a data warehouse background you'll appreciate these are often not limitations. There's a link to Impala's new features documentation; it's recommended you check this regularly for updates.

The chapter proceeds with a look at big data considerations. Here, big data is taken to mean at least gigabytes of data, containing billions of rows. It's suggested that queries are first tested on a subset of data using the LIMIT clause; if the query output looks correct, the query can then be run against the whole dataset (a sketch of this workflow follows at the end of this chapter's discussion). It's noted that if you come from a traditional transactional database background, you may need to unlearn a few things: indexes are less important, there are no constraints, no foreign keys, and denormalization is good.

The physical and logical aspects of the data are discussed next. HDFS has a large block size (128MB by default); large files are broken into blocks, each stored multiple times (3 by default) across the servers in the cluster. This, together with parallel processing, ensures the data is queried quickly. Next, the execution steps of a distributed query are discussed, with Hadoop doing all the complex work.

A brief review of normalized and denormalized data is given. Denormalized data is more typical of a data warehouse, often giving better performance via fewer tables with more columns. Various file formats are examined, and the advantages of each discussed. Parquet is often the preferred file format, since its column-oriented layout is more suitable for certain queries and is amenable to compression. It is possible to switch between the various file formats.

The chapter ends with a brief look at aggregations, via the GROUP BY clause. It's also possible to aggregate smaller files into larger ones (this may improve performance).

If you come from a database background, this chapter is easier to understand, and you can probably use Impala immediately. It puts your existing database knowledge (e.g. views) into a familiar context, while noting you might need to unlearn certain things (indexes, constraints). The chapter discusses some interesting HDFS topics, including block size, replication, and reliability, which together with the size of the data tables impact performance. Impala's roadmap link is useful for discovering forthcoming features (e.g. INTERSECT, ROLLUP, and MERGE are expected in 2015).
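To make the test-with-LIMIT workflow, EXPLAIN, and the file-format discussion concrete, here is a minimal sketch. Again this is my own illustration, not the book's code, and the table and column names are hypothetical:

```sql
-- Trial the query on a small subset first; if the output looks right,
-- drop the LIMIT and run against the whole dataset:
SELECT customer_id, SUM(amount) AS total_spend
FROM sales
GROUP BY customer_id
LIMIT 10;

-- EXPLAIN shows the distributed plan Impala would use, without running
-- the query:
EXPLAIN SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id;

-- Parquet's column-oriented, compressible layout often suits analytic
-- queries; one way to switch format is to create a Parquet copy of an
-- existing table:
CREATE TABLE sales_parquet STORED AS PARQUET AS SELECT * FROM sales;
```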
Chapter 4 Common Developer Tasks for Impala

Having given a background on what Impala is and how it can be used, this chapter moves on to a set of common tasks that you're sure to hit as you proceed with Impala.

The chapter opens with a look at getting data into an Impala table. This starts with using INSERT ... SELECT, and the use of INSERT OVERWRITE to replace data. The section continues with a look at LOAD DATA to move data from HDFS into Impala. Impala can query Hive tables, but you need to issue INVALIDATE METADATA tableName to make Impala aware of the table. Sqoop can be used to move data from a database into Impala; a brief outline of Sqoop's functionality is given. (A sketch of these data-loading statements appears at the end of this chapter's discussion.)

The next task looks at porting existing SQL code to Impala. It's noted that most code should port unchanged, but the Data Definition Language (DDL) in particular will need to change. Other changes are likely needed for deletions, updates, and transactions (all are currently missing from Impala).

Using Impala from a JDBC or ODBC application is examined next. The main point seems to be that you must remember to close your connections, else you can expect a call from your administrator! It's possible to use Impala from various scripting languages, including Python, Perl, and Bash; a small example script is provided. A later section discusses writing User-Defined Functions (UDFs) in C++ (for speed) or Java; these functions can then be used in your Impala SQL queries.

Next, various Impala optimizations are examined, including:
The chapter ends with a discussion about collaborating with your administrator. During development you'll typically have the freedom to do things the way you want. However, your organization will typically have a preferred way of doing things, into which you'll need to integrate. You will save yourself some stress if you consider the following during development:
This was an instructive chapter, answering many of the questions you're sure to ask as you progress with your Impala work. The section on performance tips was particularly useful (e.g. no indexes or constraints, use partitions, use an optimal file format). Additionally, integrating with other systems via JDBC and ODBC should prove helpful for your development.
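As a flavour of the data-loading statements the chapter walks through, here is a minimal sketch. Once more this is my own illustration rather than the book's code; the table names and HDFS path are hypothetical:

```sql
-- Append rows from one table into another, then replace the contents
-- wholesale:
INSERT INTO sales_archive SELECT * FROM sales WHERE sale_year = 2014;
INSERT OVERWRITE TABLE sales_archive SELECT * FROM sales;

-- Move files already sitting in HDFS into the table's directory (the
-- files must match the table's declared format):
LOAD DATA INPATH '/user/staging/sales' INTO TABLE sales;

-- After a table is created or loaded outside Impala (e.g. via Hive or
-- Sqoop), tell Impala to pick up the new metadata:
INVALIDATE METADATA sales;
```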
Chapter 5 Tutorials and Deep Dives

This chapter looks similar to the previous chapter, discussing typical concerns a new user of Impala might face; however, it delves much deeper into Impala's functionality. Topics covered include:
This chapter provides plenty of deeper content, which at first might seem a little strange in an introductory book. Again, there are many topics covered that you'll surely want to investigate further (it seems performance is a common concern on all systems, even big data systems).

Conclusion

This short book aims to get you up and running with Impala, and succeeds commendably. Throughout, there are helpful explanations, screenshots, practical code examples, inter-chapter references, and links to websites for further information. It's packed with useful instructions, but some sections could benefit from more code examples.

This book is suitable for analysts, developers and users who are starting out with Impala. Although aimed at the beginner, several later sections contain more advanced topics (e.g. performance). If you have a background in SQL, you will have a head start, and if you know about data warehousing, the book is even easier to understand. The world of Hadoop and its components changes frequently, so be sure to check out Impala's latest functionality on the Cloudera site.

Impala is a popular tool for querying Hadoop's data quickly, much more quickly than other tools. Additionally, the development cycle for Impala queries is much shorter than for comparable approaches such as Java MapReduce processing. I would suggest Impala should be your first choice for querying data, even if the underlying data is stored in some other component (e.g. Hive).

Obviously there is much more to learn about Impala than what's given in this small book, but it is a great place to start. Highly recommended.