Scalable Big Data Architecture
Author: Bahaaldine Azarmi

This book aims to be “A practitioner’s guide to choosing relevant big data architecture”. How does it fare?

With ever-increasing amounts of data to be processed, big data systems are in vogue. There are many competing technologies for each area of big data processing, and this book aims to help you decide on the relevant architecture and technologies.

The book is aimed at: “...developers, data architects, and data scientists looking for a better understanding of how to choose the most relevant pattern for a Big Data project and which tools to integrate into that pattern”.

While the skills needed to read the book are not stated, having read it I would say that in some areas you can be new to the subject (e.g. machine learning), but in others a background in big data is assumed (e.g. “infinity converging model”, “funnel conversion”, “Spark Direct Acyclic Graph”).

Below is a chapter-by-chapter exploration of the topics covered.

Chapter 1 The Big (Data) Problem

The opening chapter says that if you have problems with large data volumes, unstructured data, costly hardware, and certain use cases (e.g. the need for consumer and sentiment analysis), you probably need to implement a big data solution.

There are a great many technologies to choose from, and the author selects some that he thinks you are likely to use in your solutions. Specifically, technologies are briefly discussed for: Hadoop Distribution (Cloudera CDH, Hortonworks HDP, HDFS), Data Ingestion (Flume, Sqoop), Processing Language (YARN, Hive, Spark, Kafka), Machine Learning (Spark MLlib), and NoSQL stores (Couchbase, ElasticSearch).

The chapter outlines the specific scenario the solution is aimed at, namely: the user searches for products, the web log is ingested and analysed, a learning application produces recommendations based on the user’s selected product, and a search engine extracts analytics for the processed data.
Each of these steps is briefly discussed, with the aid of a high-level diagram. In essence, in this chapter (and the book), the author gives a scenario and has preselected the big data technologies to fulfil it.

I have various problems with the chapter, including: it is awkward to read, made worse by problems with the grammar; the author doesn’t give sufficient reasons why specific technologies have been selected; and some assertions have no evidence to support them.

The blurb on Amazon says: “This book shows you how to choose a relevant combination of big data technologies available within the Hadoop ecosystem.” This is misleading in that the technologies are preselected and there is insufficient discussion as to why they were chosen.

The chapter has several nonsense sentences, e.g. “Apache is a distributed publish-subscribe messaging application written by LinkedIn in Scale.” This should read “Apache Kafka is a distributed publish-subscribe messaging application written by LinkedIn in Scala.” When discussing Impala, a reference is made to Base; this should read HBase.

The chapter mentions features before they are described, and some are not defined at all, e.g. ZooKeeper, “infinity converging model”, “funnel conversion”, “Spark Direct Acyclic Graph”. I wonder what level of knowledge the book is assuming?!

I note that the author works for Elastic, and its products form part of the proposed ingestion solution, despite the author saying earlier: “When you are looking to produce ingesting logs, I would highly recommend that you use Apache Flume...” Perhaps the book’s title should be something more specific (e.g. “An example big data architecture using preselected components, based around Elastic’s software”).

Luckily, the first chapter has the most problems. The other chapters are generally more readable, and some discussion of other technologies is included (but the technologies are already prescribed!).
Chapter 2 Early Big Data with NoSQL

This chapter aims to provide an overview of big data storage technologies, focusing on the previously preselected technologies Couchbase and ElasticSearch. It opens with a look at the various types of NoSQL databases, giving brief descriptions of key/value, column, document, and graph databases. It goes on to take a high-level look at the document-oriented NoSQL technology used in the proposed solution.

The chapter continues with a deeper look at Couchbase, specifically its architecture, cluster management, and managing documents. Next, ElasticSearch is examined, looking at its architecture, how to monitor it, and how it is used to search. Useful diagrams, discussions, and code are given.

The chapter ends with a look at the use of a NoSQL database as a cache in a SQL-based architecture. Some technologies not used in the solution are briefly discussed. This might prove useful to someone wanting to know what other technologies are available, but it should have been given earlier, and much more detail is needed.

Chapter 3 Defining the Processing Topology

This chapter aims to explain the processing approaches used in the proposed solution. It opens with a little history of how data processing occurs in IT departments (OLTP and ETL), noting that these established methods are often slow to give results and don’t scale easily. Many organizations are moving towards a Hadoop-based scalable architecture to get results in a timely manner.

The chapter continues with a look at data sources (with Twitter in the example given), and how to process the data. A Hive Tweet structure is overlaid on the tweet data, since it has a tabular structure, and then queried. Code is provided for the tweet structure and Hive query. The Hive query is relatively slow, since it translates to a batch MapReduce job. Faster, near real-time processing is possible with Spark.
The standard batch MapReduce WordCount Java program is provided, and compared with the significantly smaller Spark code (which also runs much faster).

The chapter next looks at splitting the architecture, which involves providing views on top of batch-processed historical data, together with incoming new stream data (the stream data is eventually merged into the historical data). All of this really describes the Lambda architecture made popular in the Manning Press book Big Data, reviewed for I Programmer by Kay Ewbank.

It might have been useful to emphasise here that companies are moving away from the slower batch MapReduce processing toward the faster and easier-to-understand in-memory Spark processing. The author could also have mentioned the use of Impala for near real-time querying of Hive data.
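
To make the split between batch and stream views concrete, here is a minimal sketch in plain Python, using word counting as the workload. It is not taken from the book: the names `count_words` and `LambdaWordCount` are hypothetical, and the in-place `merge` step is a simplification (a real Lambda setup would typically recompute the batch view from the full historical dataset, using something like MapReduce or Spark).

```python
from collections import Counter


def count_words(lines):
    """Count whitespace-separated, lower-cased words in an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


class LambdaWordCount:
    """Toy Lambda-style store (hypothetical): a batch view plus a real-time view."""

    def __init__(self, historical_lines):
        # Batch layer: a view precomputed over all historical data.
        self.batch_view = count_words(historical_lines)
        # Speed layer: counts for data that arrived since the last batch run.
        self.realtime_view = Counter()

    def ingest(self, line):
        """Speed layer: update the real-time view as each new line arrives."""
        self.realtime_view.update(count_words([line]))

    def query(self, word):
        """Serving layer: answer by combining both views."""
        word = word.lower()
        return self.batch_view[word] + self.realtime_view[word]

    def merge(self):
        """Fold the stream data into the historical view, then reset it."""
        self.batch_view.update(self.realtime_view)
        self.realtime_view = Counter()


store = LambdaWordCount(["big data is big", "spark is fast"])
store.ingest("Spark beats batch")
before_merge = store.query("spark")  # combines batch and real-time counts
store.merge()
after_merge = store.query("spark")   # now served from the batch view alone
```

The point of the design is that a query gives the same answer before and after the merge; only the layer that supplies the counts changes.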
Last Updated (Friday, 04 March 2016)