Field Guide to Hadoop
Authors: Kevin Sitto & Marshall Presser
This slim book sets out to provide an up-to-date overview of Hadoop, the most common platform for storing and analysing big data, and of its various components, which seems a worthwhile aim. The authors compare it to a field guide for birds or trees, so it is broad in scope and shallow in depth. Each chapter briefly covers an area of Hadoop technology and outlines the major players. The book is not a tutorial but a high-level overview, consisting of 132 pages in 8 chapters. For each component, details are listed for:
Below is a chapter-by-chapter exploration of the topics covered.
Chapter 1 Core Technologies
The chapter opens with a bit of history. The origins of Hadoop can be traced back to a project called Nutch, which stored large amounts of data, together with 2 seminal papers from Google: one relating to the Google File System, and the other about a distributed programming model called MapReduce. The ideas in the papers were incorporated into the Nutch project, and Hadoop was born. Yahoo! began using Hadoop for its search engine, and now Hadoop is the premier platform for processing big data. Hadoop consists of 3 primary resources:
This was an interesting chapter, laying the groundwork for the rest of the book, identifying what Hadoop is, its major components, and how they work. Helpful links to tutorial information are provided, together with outline code examples (as they are throughout the book). Perhaps some emphasis could have been given to describing the attributes of big data (i.e. volume, velocity and variety) that require a system like Hadoop to process it. I’m not sure why Spark was included in this core section.
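To make the MapReduce programming model mentioned in Chapter 1 a little more concrete, here is a minimal single-process sketch in Python. It is not taken from the book and does not use Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are invented for illustration, and a real Hadoop job would distribute these phases across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data everywhere"]))
```

The point of the split is that each map call and each reduce call is independent of the others, which is what allows a framework to run them in parallel on different nodes.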
Chapter 2 Database and Data Management
With so much data being stored, there is a need for some kind of database. The chapter opens with a look at the various types of NoSQL databases that exist (e.g. column store, document store, key-value). The chapter continues with a brief overview of the major databases, including:
It should be noted that although the book says Hive does not support delete and update statements, they are supported in later versions (from version 0.14.0 onwards, released November 2014). This chapter provides a useful, up-to-date view of the various types of data stores that can be used with Hadoop. Occasionally, helpful comparisons between the databases are made. The chapter notes that although MongoDB and Cassandra are currently the most popular databases, HBase is increasingly popular and may soon be the leader.
Chapter 3 Serialization
This chapter looks at the format of stored data. Various tools are described, and the trade-off between tool flexibility and complexity is discussed. Tools discussed include:
This chapter provides a useful, up-to-date view of the various types of tool that can be used to serialize/deserialize data. Helpful example code is provided.
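As a rough illustration of the kind of trade-off a serialization format involves, the sketch below encodes the same record in two ways using only Python's standard library. It does not use any of the tools covered in the book; the record, the fixed binary layout (`SCHEMA`) and the helper `decode_binary` are assumptions made up for this example.

```python
import json
import struct

# A hypothetical record to serialize.
record = {"user_id": 42, "score": 3.5}

# Self-describing text: field names travel with every record, so any
# reader can decode it, at the cost of extra bytes.
as_json = json.dumps(record).encode("utf-8")

# Schema-based binary: an assumed fixed layout of a little-endian
# 32-bit integer followed by a 64-bit float. It is compact, but the
# reader must already know the layout.
SCHEMA = struct.Struct("<id")
as_binary = SCHEMA.pack(record["user_id"], record["score"])


def decode_binary(payload):
    """Decode a payload written with SCHEMA back into a dict."""
    user_id, score = SCHEMA.unpack(payload)
    return {"user_id": user_id, "score": score}


if __name__ == "__main__":
    print(len(as_json), "bytes as JSON")
    print(len(as_binary), "bytes as fixed binary")
    print(decode_binary(as_binary))
```

The binary form is smaller but breaks if writer and reader disagree on the layout, which is the sort of problem that schema-aware serialization tools are designed to manage.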
Chapter 4 Management and Monitoring
With a diverse collection of tools, and a large collection of machines, it’s important to be able to monitor and manage the system. Various tools are described here: some are concerned with node configuration management, and others provide a system health overview. Tools examined include:
I enjoyed this chapter. Anything that eases the monitoring and management of a large number of tools across a large number of machines should prove very helpful. Tools like Ambari in particular are known to save hours of work and frustration when installing Hadoop and its tools.
Chapter 5 Analytic Helpers
This chapter is concerned with both cleansing/transforming data, and using machine-learning algorithms to categorize and discover things about data. The chapter first looks at MapReduce interfaces, which make MapReduce programming easier. It then looks at analytic libraries, which make data easier to analyze. The tools examined include:
This chapter provides a very helpful overview of the current tools that make MapReduce programming easier, and tools that make data easier to analyze. Perhaps this chapter should have been split into two chapters, relating to the two discrete areas covered here.
Last Updated (Wednesday, 22 April 2015)