Hadoop Interview Guide
Author: Monika Singla and Sneha Poddar

This Kindle-only e-book aims to help you pass an interview for a junior or mid-level Hadoop developer position. How does it fare? Targeted at existing Hadoop developers, it aims to provide in-depth knowledge of Hadoop and its components; it is also a starting point for anyone wanting to venture into the Hadoop field from other areas of IT. The book is divided into 10 sections (Hadoop, HDFS, MapReduce, Flume, Sqoop, Oozie, Hive, Impala, Pig, and Java), with a total of 434 questions and 23 example tasks to program or follow along with. Below is a chapter-by-chapter exploration of the topics covered.

Chapter 1 Introduction to Hadoop

Traditionally, Hadoop was considered a combination of the Hadoop Distributed File System (HDFS) and the batch programming model MapReduce. In Hadoop 2, Yet Another Resource Negotiator (YARN) was added. Increasingly, however, Hadoop is taken to mean all these things, together with Hadoop's wider range of components. Example questions include:
This chapter provides a useful introduction to Hadoop in the wider context of big data. There's also a handy step-by-step walkthrough on how to set up a standalone Hadoop environment using Cloudera's QuickStart VM.

Chapter 2 HDFS

HDFS is the underlying file system for Hadoop. It has built-in functionality to split and distribute files over multiple nodes in the cluster, and to store multiple copies of those files, which helps with both resilience and parallel processing. Example questions include:
This chapter provides some helpful questions about one of the core components of Hadoop. It's becoming clear that you need to know which version of Hadoop and its components you're dealing with, since defaults can change between versions (e.g. the book says the HDFS default block size is 64MB, but this changed to 128MB in Hadoop 2).

Chapter 3 MapReduce

MapReduce is the batch programming model used by Hadoop, in which the work to be done is split across multiple machines (Map) and the results are then combined and aggregated (Reduce). Example questions include:
Again, a wide range of useful questions is provided. One of the example programs to create is the standard "count the number of words in an input file". Perhaps a question on the current vogue for using Spark in place of MapReduce could have been included.
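The word-count exercise is simple enough to sketch outside Hadoop itself. The following is a minimal, illustrative Python simulation of the Map and Reduce phases (this is my own sketch, not the book's Java solution, and the function names are hypothetical):

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def reducer(pairs):
    # Shuffle/Reduce phase: group the pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog"]
pairs = [pair for line in lines for pair in mapper(line)]
print(reducer(pairs))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In a real Hadoop job, the mapper and reducer run on different machines and the framework handles the shuffle between them; the structure of the logic, however, is exactly this.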
Chapter 4 Flume

Flume is a well-known tool for moving unstructured data from various sources (e.g. log files) to various destinations (e.g. HDFS for subsequent processing). Example questions include:
This chapter asks questions about both theoretical and practical aspects of Flume's data transfer functionality. Sections include: Basics, Configuration, Channels, Sinks, and Interceptors. The example programs illustrate the movement of data from various sources (e.g. Twitter, Netcat) to various destinations (e.g. console logger, HDFS).

Chapter 5 Sqoop

Sqoop is a well-known tool for moving structured data (e.g. from relational databases) into and out of Hadoop. Example questions include:
The chapter takes an in-depth look at Sqoop's data transfer capabilities. I particularly liked the tables explaining the meaning of the various import and export features of Sqoop. Sections include: Basics, Import, and Export.

Chapter 6 Oozie

Oozie is a workflow and scheduler system for Hadoop jobs. Example questions include:
The chapter provides an in-depth look at Oozie's workflow and scheduling capabilities. The useful example programs illustrate the integration of Sqoop with Oozie workflows.

Chapter 7 Hive

Hive is Hadoop's data warehouse, allowing queries to be processed in batch using MapReduce. Example questions include:
This wide-ranging chapter contains more than 30% of the book's questions. Sections include: Basics, Hive Query Language (DDL, DML), Partitioning and Bucketing, Views, Query Optimization, Compression, Functions and Transformations, SerDe, and Advanced Hive.

Chapter 8 Impala

Impala allows queries to be processed interactively, often against Hive tables. Example questions include:
This short chapter contains just 3% of the book's questions. I often wonder why anyone would want to use Hive queries when Impala is available, since it can query the Hive tables much faster.

Chapter 9 Pig

Pig provides workflow and scripting functionality at a higher level than Java and MapReduce programming. Example questions include:
Java can be a difficult language to learn; Pig provides an easier way of programming MapReduce. Perhaps in the future, higher-level tools (e.g. Pig and the various query languages) will be used for most processing, with Java reserved for only the low-level, complex work. Sections include: Basics, Datatypes, Pig Latin, and Joins.

Chapter 10 Java Refresher for Hadoop

Java is a general-purpose language, often used as the default language for the various Hadoop components. Example questions include:
This section really is just a brief refresher; it contains a list of Java questions that you might be asked when going for a junior Java developer role.

Conclusion

This book contains a wide range of questions about Hadoop and its components, and the answers generally provide accurate explanations with sufficient detail. The tasks to program at the end of each chapter should prove useful in demonstrating your practical understanding of the topics.

Many other common Hadoop components could have been included, e.g. HBase (NoSQL database), Spark (fast MapReduce replacement), ZooKeeper (configuration), and Mahout (machine learning). Perhaps the book can be expanded in the future to include these topics; that said, it does cover many of the core Hadoop technologies. I don't think you can learn the subjects directly from this book, but you can use it as a benchmark to measure how much you have learned elsewhere.

Generally the book is well written; however, some of the questions have substandard English grammar. Some of the longer example code is difficult to read because it is not formatted adequately. The references at the end of the book are useful, but details of publication date, edition, and publisher are missing (one has only the author's first name!). There is a large list of websites at the end of the book, but none of the sites is annotated.

It should be remembered that Hadoop and its components are changing rapidly, so it's important to view the answers in the context of the version used. For example, CDH 5.3 ships with Hive 0.13.1, which does not support data modifications (as answered in the book), whereas Hive 0.14.0 does.

This book contains a wide range of useful questions about Hadoop and many of its components. Overall, a helpful book for the interview!
Last Updated: Friday, 03 July 2015