Hadoop Application Architectures

 

Chapter 5 Graph Processing on Hadoop

The chapter opens with a review of what graphs are. A graph models relationships between items (e.g. friends of friends). Much graph processing is iterative, which MapReduce struggles with.

The chapter continues with an overview of Giraph, discussing its various steps, and providing a helpful example. Perhaps the most useful section is the part that details when to use Giraph – it is very powerful, and mature, but relatively complex, and is not included in all Hadoop distributions.
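To give a feel for the iterative, step-by-step style of processing described above, here is a minimal sketch in plain Python. It is not Giraph code and is not taken from the book; it assumes a Pregel-style model in which each vertex reads its incoming messages, updates its value, and sends messages to its neighbours in repeated supersteps. The function name `propagate_max` and the input shapes are hypothetical.

```python
from collections import defaultdict


def propagate_max(edges, values):
    """Spread the largest vertex value through the graph in supersteps.

    edges:  dict mapping vertex -> list of neighbour vertices (assumed shape)
    values: dict mapping vertex -> initial numeric value
    Returns (final values, number of supersteps executed).
    """
    values = dict(values)

    # First superstep: every vertex sends its value to its neighbours.
    inbox = defaultdict(list)
    for vertex, neighbours in edges.items():
        for neighbour in neighbours:
            inbox[neighbour].append(values[vertex])
    supersteps = 1

    # Later supersteps: a vertex only sends messages if its value changed,
    # so the loop stops once no messages are in flight.
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                for neighbour in edges.get(vertex, []):
                    outbox[neighbour].append(best)
        inbox = outbox
        supersteps += 1

    return values, supersteps


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
final, steps = propagate_max(graph, {"a": 3, "b": 1, "c": 7})
print(final, steps)
```

Each pass over the inbox corresponds to one iteration; in MapReduce every such iteration would typically be a separate job, which is part of why iterative graph work is awkward there.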

Next, GraphX is discussed. It is part of the Spark project and is not as mature as Giraph. For developers who are familiar with Spark, GraphX may be preferred. A code example is provided and discussed.

The chapter ends with a discussion of when to use each tool. If the problem relates only to graphs, then the mature Giraph tool is probably best, but if graph processing is only part of the problem, then GraphX may be preferable.

This chapter provides an overview of the two main graph processing technologies for Hadoop. The section on how to choose the correct tool should prove useful. I do wonder if this chapter is needed, since graph processing is not a widespread area of Hadoop processing.

Chapter 6 Orchestration

This chapter opens with a look at the need for workflow and scheduling tools. Application solutions often involve several steps, and these can be tied together using orchestration.

The chapter first examines the use of scripts for creating workflows. While this is useful for simple problems, as the problem becomes larger a more robust solution is required. Next, enterprise job schedulers (e.g. Autosys) are briefly examined; these have the advantage of being familiar.

The chapter continues with orchestration frameworks in the Hadoop ecosystem. These are typically tightly integrated with Hadoop. While the chapter focuses on Oozie, the other workflow tools are similar.

After discussing terminology (workflow, DAG, coordinator, bundle), the chapter discusses Oozie workflows, which are defined in an XML file (workflow.xml). Common workflow patterns are examined (Point-to-Point, Fan-Out, Capture-and-Decide), with the aid of configuration file entries and diagrams.
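The idea of a workflow as a DAG can be illustrated with a short plain-Python sketch. This is not Oozie configuration and is not from the book; the action names are hypothetical, and the example simply shows a fan-out (two cleaning steps after one ingest step) followed by a step that waits for both. It uses the standard library `graphlib` module, which needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical action names; each key maps to the set of actions it depends on.
workflow = {
    "ingest": set(),
    "clean_a": {"ingest"},             # fan-out: both cleaning steps
    "clean_b": {"ingest"},             # depend only on ingest
    "join": {"clean_a", "clean_b"},    # waits for both branches
}


def run_order(dag):
    """Return the actions grouped into waves that could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # actions whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)                 # mark the whole wave as finished
    return waves


waves = run_order(workflow)
print(waves)
```

A real orchestration framework adds what this sketch leaves out, such as retries, error handling and decisions based on the output of earlier actions.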

Next, scheduling is examined. Coordinators, defined in an XML file (coordinator.xml), are used to schedule workflows. Jobs can be scheduled to run repeatedly at given time intervals, or in response to time or data triggers.

The chapter ends with a look at how to execute workflows. It’s important to ensure all the relevant configuration files, JARs and any dependencies are included.

This chapter provides a useful overview of the factors to consider for workflow in Hadoop.

Chapter 7 Near-Real-Time Processing with Hadoop

The world of Hadoop is changing. Previously the preferred processing model was MapReduce. Now there are tools that can process data much more quickly, in near-real-time, using streams. This chapter concentrates on discussing two such tools, Storm and Spark Streaming.

Stream processing involves processing data that arrives continuously (e.g. social media feeds). The chapter looks at various aspects of stream processing including: aggregations, windowing averages, data enrichment, persistence, and lambda architecture.

Next, the chapter looks at Storm, whose strengths include simplicity of development and deployment, scalability, fault tolerance, and broad programming language support. The chapter continues with a look at the architecture of Storm. Storm topologies are examined in terms of spouts and bolts. The integration of Storm with HDFS and HBase is briefly examined. There’s a helpful Storm example that shows the calculation of a moving average. The section ends with a discussion on when to use Storm.
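For readers unfamiliar with the idea, a moving average over a stream can be sketched in a few lines of plain Python. This is not the book's Storm example and uses no Storm API; the class name `MovingAverage` and the window size are illustrative assumptions. It keeps only the most recent values and updates the average as each new value arrives.

```python
from collections import deque


class MovingAverage:
    """Average of the most recent `window_size` values seen so far."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)
        self.total = 0.0

    def update(self, value):
        # If the window is full, the oldest value is about to be evicted,
        # so remove it from the running total first.
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)


averager = MovingAverage(window_size=3)
averages = [averager.update(x) for x in [10, 20, 30, 40]]
print(averages)
```

In a streaming system the same per-event update would live inside a processing component that holds state between events, rather than in a list comprehension.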

Next, Spark Streaming is examined. It is particularly suitable if you’re familiar with Spark, and you can handle a small delay in processing (2-5 seconds). It has the advantage that you can re-use code between streaming and batch processing. There’s a useful discussion and diagram that compares Spark Streaming with watching videos (the video is really a stream of static pictures). Various examples of Spark Streaming are provided, including: simple count, multiple inputs, maintaining state, and windowing. The section ends with a discussion on when to use Spark Streaming.
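The video analogy can be made concrete with a small plain-Python sketch. It is not Spark code and is not from the book; it assumes events arrive as (timestamp in seconds, word) pairs, and the function names are hypothetical. The stream is cut into fixed-length batches (the "static pictures"), and a simple count is then computed for each batch.

```python
from collections import Counter


def micro_batches(events, batch_seconds):
    """Group (timestamp, word) events into fixed-length batches, in time order.

    Intervals that contain no events are skipped in this sketch.
    """
    batches = {}
    for timestamp, word in events:
        index = int(timestamp // batch_seconds)  # which interval the event falls in
        batches.setdefault(index, []).append(word)
    return [batches[i] for i in sorted(batches)]


def count_per_batch(batches):
    """Count the words in each batch independently (a simple count)."""
    return [Counter(batch) for batch in batches]


events = [(0.5, "a"), (1.2, "b"), (2.1, "a"), (3.9, "a"), (4.0, "b")]
batches = micro_batches(events, batch_seconds=2)
counts = count_per_batch(batches)
print(batches)
print(counts)
```

Because each batch is an ordinary collection, the per-batch logic looks the same as batch-processing code, which is the re-use advantage mentioned above. The batch length also sets a floor on how quickly a result can appear.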

Some other stream processing tools are discussed, including: Trident (a wrapper for Storm that overcomes some of its problems), and Flume interceptors (which allow events to be processed as soon as they’re collected). It’s noted that some SQL tools are also useful but are not discussed here (e.g. Impala).

This chapter provides a useful overview of some common streaming technologies together with details of when to use them. 

 


Section II. Case Studies

The last 3 chapters of the book provide case studies for: 

  • Clickstream analysis – analyzing events (click data) from users browsing websites

  • Fraud detection – identifying patterns that indicate fraud

  • Data Warehouse – how Hadoop can complement existing data warehouse processing

Each case study is a complete end-to-end solution, containing aspects of: defining the use-cases, design overview, storage, ingestion, processing, analysis, and orchestration. Discussions, diagrams and code are provided. These chapters apply what has been learned in the previous chapters to the real-world applications discussed here.

Conclusion

This book aims to provide Hadoop current best practices, example architectures and complete implementations – and succeeds in each area.

The book is well written, providing good explanations, examples, walkthroughs, and diagrams. Useful links are given between chapters, and there’s a valuable conclusion at the end of each chapter. The order of the chapters is helpful in understanding the flow of topics. This is not a book for beginners, but does contain useful references to books to get you up to speed.

In many ways, this book follows on naturally from “Hadoop: The Definitive Guide”, which I recently reviewed. It provides practical discussions of the many factors to consider when presented with common Hadoop architectural concerns (e.g. whether to use HDFS or HBase?). The book offers recommendations, and provides supporting information that backs these up.

The book doesn’t cover all Hadoop technologies (e.g. it omits Machine Learning), but it does cover many popular ones. Some of the books referenced are getting old and some chapters have footnotes at the end, which would be better placed on the pages where they are referenced.

Hadoop is changing rapidly. This book suggests the near future will see a decline in MapReduce processing, and a rise in processing using Spark. Similarly, at a higher level of abstraction, SQL in its various flavours also appears to be in the ascendancy.

If you want to know the current state of Hadoop and its components, want a practical discussion of the pros and cons of using various tools, and want solutions to common problems, I can highly recommend this book.


For more on Hadoop see Ian Stirk's review of Hadoop: The Definitive Guide (4th ed) 

 



 



Last Updated ( Tuesday, 08 September 2015 )