Storm Applied
Authors: Sean T. Allen, Matthew Jankowski and Peter Pathirana
Described as a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams, how does it stack up?
The book opens with a chapter introducing big data, where Storm fits into the picture, and why you'd want to use it. If you're not sure why you would be interested in Storm, it's Apache's distributed, real-time computational framework for processing unbounded streams of data. The developers say Storm will do for real-time processing what Hadoop did for batch processing. Typical uses of Storm include real-time analytics, online machine learning, continuous computation, distributed RPC, and ETL. Storm is also fast: a benchmark clocked it at over a million tuples processed per second per node. The first chapter has a nice comparison of various big data tools including Storm, Spark, Spark Streaming, and Samza.
Chapter 2 then moves on to the core Storm concepts. The authors acknowledge that Storm is initially tricky to get to grips with because of the number of concepts you need to learn; they make sense when used for real, but are intimidating if looked at as a whole. Instead, the authors introduce them as required. The chapter introduces the first project used to demonstrate how to develop using Storm. This is a GitHub commit count dashboard, and the chapter gives the basic code for starting this project, built from Storm's two primary components: spouts (stream sources) and bolts (tuple acceptors and transformers).
The next chapter goes further into why you need to think of all projects in terms of Storm constructs. The authors also look at how to work with unreliable data sources, how to integrate with external services, and how to understand parallelism in terms of Storm topology. The project used in this chapter is a social heat map showing which bars in a town are most popular, judged by how often they are mentioned on social networks.
The next chapter covers the creation of robust topologies for use in projects where you need to ensure fault tolerance, and to be able to guarantee that data is processed.
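To make the spout and bolt roles mentioned for the commit count dashboard a little more concrete, here is a minimal sketch in plain Python. It does not use Storm's API and is not the book's code; the sample data, the "commit id, then email" line format, and every function name are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Hypothetical sample feed: "<commit id> <author email>" per line.
SAMPLE_COMMITS = [
    "b20ea50 nathan@example.com",
    "064874b andy@example.com",
    "28e4f8e andy@example.com",
    "9a3e07f jackson@example.com",
    "cbb9cd1 nathan@example.com",
    "0f663d2 jackson@example.com",
    "0a4b984 nathan@example.com",
]


def commit_feed_spout(lines: Iterable[str]) -> Iterator[Tuple[str]]:
    """Spout role: the stream source. Emits one single-field tuple per line."""
    for line in lines:
        yield (line,)


def email_extractor_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt role: accepts a tuple, transforms it, and emits a new tuple."""
    for (commit,) in stream:
        _commit_id, email = commit.split(" ")
        yield (email,)


def email_counter_bolt(stream: Iterable[Tuple[str]]) -> Counter:
    """Terminal bolt role: keeps a running commit count per email."""
    counts: Counter = Counter()
    for (email,) in stream:
        counts[email] += 1
    return counts


def run_topology(lines: Iterable[str]) -> Counter:
    """Chains spout -> bolt -> bolt, standing in for a wired-up topology."""
    return email_counter_bolt(email_extractor_bolt(commit_feed_spout(lines)))


counts = run_topology(SAMPLE_COMMITS)
for email, total in sorted(counts.items()):
    print(email, total)
```

The sketch runs everything in one process over a finite list. Storm itself runs many instances of each component in parallel across a cluster, over a stream that never ends.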
The authors then look at how to move from Storm running locally to a remote or production Storm cluster. The Storm UI is also introduced in this chapter, with a good description of how to use it for diagnostics.
Tuning in Storm is the next topic to be covered. There's a good description of the process of tuning, using an example of a Storm system that's running too slowly, and how you can tune it using the Storm UI. Latency and the reasons why you might experience it are also explained well, along with the use of metrics to work out which bits are running slowly, and how bad things are.
Resource contention is next on the agenda, and in particular how you need to deal with sharing the various resources that Storm provides – the number of worker processes; the amount of memory allocated to them; and network, socket and I/O contention.
The next chapter looks at Storm internals, covering topics such as how an executor works under the covers, how tuples are passed between executors, internal buffering, routing and tasks. The idea is that if you know how Storm works under the covers, you'll be more likely to be able to make your projects work if they encounter problems.
The final chapter covers Trident, which is a high-level abstraction that sits on top of Storm's primitives. You can use it to express a topology in terms of the 'what' as opposed to the 'how'. In other words, Trident involves less procedural programming and more declarative definitions, along with more batching of the streams as sets of records.
This is a good book if you've not worked with Storm before. If nothing else, you'll come away understanding what concepts such as spouts and bolts are, because there are some great diagrams showing how these and other features of Storm fit together. The authors have a style that's easy to read, and the examples are well thought out and clearly explained. For more experienced Storm users, the chapters on tuning are very good. If you're wondering whether Storm is what you need for stream processing, this book is highly recommended.