Big Data Analytics With Microsoft HDInsight In 24 Hours

Author: Manpreet Singh and Arshad Ali
Publisher: Sams
Pages: 592
ISBN: 978-0672337277
Print: 0672337274
Kindle: B017WRCKEM
Audience: Data scientists and developers
Rating: 4
Reviewer: Kay Ewbank

This is a good introduction to HDInsight and the wide range of other products needed to make use of it.

HDInsight is Microsoft's Hadoop distribution for Azure, and this book provides an overview and introduction to HDInsight, and to the abundance of other products and services that you need to use with it. It doesn't go into depth, but at least shows you where they fit into the Hadoop/HDInsight ecosystem.

The book opens with an introduction to big data and NoSQL. As this is a Sams Teach Yourself in 24 Hours book, each chapter is designed to take around an hour to work through. Hour 2 introduces Hadoop, its architecture and ecosystem, and the Microsoft offerings around it. Alongside the main players such as Hive, Pig and HBase, the chapter covers less well-known members including HCatalog, Sqoop, Mahout, Flume, Oozie, Pegasus and RHadoop. The Hadoop Distributed File System (HDFS) is introduced next, with useful sections on HDFS Rack Awareness and the WebHDFS protocol. The final two chapters in this part of the book introduce MapReduce and YARN.

Part two of the book moves on from 'vanilla' Hadoop to the Microsoft cloud implementation, HDInsight, starting with a chapter on setting up and provisioning your own HDInsight cluster. There's a chapter on the typical components of an HDFS cluster - the name node, the secondary name node and the data nodes. HDInsight separates the data from the cluster and uses Azure blob storage rather than HDFS as the default file system for storing data, and this chapter explains what this means, and how Hadoop 2 has made high availability standard.

The next chapter goes into more detail on storing data in Azure Storage Blob, what the benefits are - safe decommissioning, easier sharing, geo-replication support and built-in scaling - and the tools you use to explore the storage.

Microsoft provides an HDInsight emulator that you can use to test MapReduce jobs, and the next chapter looks at getting started with the emulator and setting it up for storage.

Part three of the book covers programming MapReduce and HDInsight Script Action, with a chapter on each topic. This isn't a book aiming to teach you programming in full, but the chapters are enough to give you an idea of the techniques involved. Script Action scripts are PowerShell scripts that are used to automate the installation and configuration of the Hadoop ecosystem.

The authors next move on to Hive, HCatalog, and Tez, starting with a chapter introducing what they do and how they can be used from Visual Studio and the HDInsight .NET SDK. The following chapter looks in more detail at programming with Hive, and in particular the HiveQL query language that's similar to SQL. The chapter goes gently as far as creating table views and writing Select queries, then jumps to more real-world examples of data analysis with table partitioning. There was a definite complexity jump at this point; on one page the authors are going in stages through a Where clause that any database developer will know already, and the next thing you know they're in full flow on dynamic partition inserts and Clustered By clauses, and taking it at a distinctly faster pace.

Chapters on accessing HDInsight data using Microsoft Power BI, Excel, PowerPivot, SQL Server and Power Query come next. These seem to pull back to a gentler pace and level, essentially showing you what commands are available at the 'hello world' equivalent of analysis. A chapter on integrating HDInsight with SQL Server Integration Services is equally gentle, introducing SSIS, showing how to provision an HDInsight cluster from it using a PowerShell script, and then how to use another script to execute a Hive query and load the data into a SQL Azure table.

An introduction to the use of Pig for data processing comes next, introducing Pig Latin, and showing how to use HCatalog in a Pig Latin script. There's a short chapter on Sqoop covering its import and export commands, and using it with PowerShell. A similarly short chapter on Oozie workflows and job orchestration goes a little further, showing how to submit an Oozie workflow with the HDInsight .NET SDK, but it's still an introduction.

A chapter on performing statistical computing with R describes what it's used for, how to integrate it with HDInsight, and how to enable it. Spark gets similar treatment; the authors briefly discuss how fast it is at in-memory computation, then give a nice introduction to the Spark programming model. The Azure Machine Learning platform and how you can use it to create and run predictive models is also covered well, if relatively briefly. A chapter on Storm real-time analytics and the SCP.NET library includes an interesting code sample showing how to analyze stream data, then use the results to create a SQL Azure table. The book ends with a chapter on creating an HDInsight cluster with HBase.

Overall, the topics are covered well, but briefly. The problem the authors faced was the sheer size and scale of the topic. Having read the book, you'll know how everything fits together, and where you should be focussing your attention to gain more expertise. I'm not sure you'd be ready to go off and do your own HDInsight analyses purely on the basis of this book, though - there just isn't the space to go into enough detail on the topics.

Related Reviews

Doing Data Science (Rated 5/5)

Big Data Analytics With Spark (Rated 5/5)

Mastering Apache Spark

HBase Essentials

Hadoop Application Architectures

Hadoop: The Definitive Guide (4th ed)
