Hadoop for Finance Essentials
Author: Rajiv Tiwari
Chapter 5 Getting Started
The chapter opens with a look at why regulatory reporting is important: in the past, many companies went bust due to poor risk evaluation. A risk problem is outlined (duplicate, disparate data), together with a potential solution (a simpler, centralized data store in Hadoop). Standard naming conventions for configuration files, directories, etc. are given. Next, details of how to implement the required system are provided: code is given to ingest data using Oozie (a workflow scheduler), and there is a discussion of how the same solution can be achieved with popular ETL tools. Details then follow on how to transform the stored data, with code examples in Hive, Pig and Java MapReduce. There's a helpful, if simple, flowchart on how to choose between Hive, Pig and Java MapReduce. The chapter ends with a look at data analysis, specifically the integration of BI tools with Hadoop. An example shows how to set up the Hortonworks Hive ODBC Driver and use it to access Hadoop data via a Qlikview client.
This chapter provides a useful overview of how to implement a larger Hadoop project. The suggestions for standards are particularly useful. The chapter title seems inappropriate, since several Hadoop projects have already been implemented in previous chapters.

Chapter 6 Getting Experienced
The chapter opens with a look at what real-time big data is. Definitions of real time vary, but it's typically taken to mean a few seconds or less; some real-time processing is actually done in micro-batches (e.g. Spark). Tools for real-time processing are briefly examined, including Spark, Storm, and Kafka. The chapter continues with an overview of the project, which identifies fraudulent transactions. The proposed solution involves identifying transactions that are outliers: historic transactions are used as input to a Markov chain model (processed as MapReduce batch jobs), and current transactions (held on a queue in Kafka) are compared against the model to identify outliers (a rough sketch of the idea follows below). Code is provided for the MapReduce jobs and the Kafka queues. The Storm and Spark real-time architectures are briefly outlined, and code is provided for both to implement data collection and transformations.
This chapter provides a practical implementation of a real-time fraud detection use case. The chapter's title is incorrect; it should read "Real-time processing in Hadoop".
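The book supplies its own MapReduce and Kafka code for this use case; purely as an illustration of the idea (my sketch, not the book's code), the real-time side might look something like the following Java snippet: a Kafka consumer reads current transactions and scores each state sequence against transition probabilities precomputed by the batch Markov chain job. The topic name, message format, and probability threshold here are assumptions.

import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OutlierCheck {

    // Hypothetical model: state-transition probabilities produced by the
    // batch MapReduce job over historic transactions (the Markov chain model).
    private final Map<String, Double> transitionProbs;
    private final double threshold;

    public OutlierCheck(Map<String, Double> transitionProbs, double threshold) {
        this.transitionProbs = transitionProbs;
        this.threshold = threshold;
    }

    // A transaction sequence is flagged if its probability under the
    // Markov chain model falls below the threshold.
    boolean isOutlier(String[] states) {
        double logProb = 0.0;
        for (int i = 1; i < states.length; i++) {
            String key = states[i - 1] + "->" + states[i];
            // Unseen transitions get a tiny probability rather than zero.
            double p = transitionProbs.getOrDefault(key, 1e-6);
            logProb += Math.log(p);
        }
        return logProb < Math.log(threshold);
    }

    public void run() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "fraud-check");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // "transactions" is an assumed topic name for current transactions.
            consumer.subscribe(Collections.singletonList("transactions"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The message value is assumed to be a comma-separated
                    // sequence of transaction states.
                    String[] states = record.value().split(",");
                    if (isOutlier(states)) {
                        System.out.println("Possible fraud: " + record.key());
                    }
                }
            }
        }
    }
}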
The chapter suggests getting business users involved in projects early. Various project considerations are examined briefly, including: projects with clear benefits, starting with small projects, data lakes, the lambda architecture, and security/privacy. These are examined in more detail later. The chapter looks at some more big data use cases; each outlines a problem and provides a brief solution, and they include customer complaints, algorithmic trading, and social media based trading. Next, data lakes are examined; their purpose is to prevent Hadoop silos by combining relational databases with Hadoop. The relational database processes low-volume but high-value data, while Hadoop processes high-volume data and new types of data. Some analysis tools are listed (e.g. SAP). The chapter then looks at the lambda architecture, which combines batch and real-time processing on one platform. Aggregated views are maintained over the historical data, and these can be combined with newly received data; periodically the new data is merged into the historic data and the views are recalculated. It ends with a look at security. This involves authentication (Kerberos and file system security), authorization (mapping Kerberos principals to usernames), and encryption.
This chapter provides a useful list of topics to consider when you want to scale up your Hadoop systems. There's a helpful list of finance use cases, which should give you ideas for your own systems. Integration of Hadoop and relational databases via data lakes was explained, as was the integration of historic data and new data via the lambda architecture (a rough sketch of this pattern follows below).
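To make the lambda idea a little more concrete, here is a small illustrative Java sketch (mine, not the book's) of a serving layer that merges a periodically recomputed batch view with a real-time view of events received since the last batch run. The class and method names are invented for illustration.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative lambda-architecture serving layer (simplified):
// queries merge a precomputed batch view with increments from the speed layer.
public class LambdaServingLayer {

    // Batch view: aggregates recomputed periodically from all historic data.
    private volatile Map<String, Long> batchView = new HashMap<>();

    // Real-time view: increments for data received since the last batch run.
    private final Map<String, Long> realtimeView = new ConcurrentHashMap<>();

    // Speed layer: apply each new event as it arrives.
    public void onNewEvent(String key, long count) {
        realtimeView.merge(key, count, Long::sum);
    }

    // Query: combine the historic aggregate with the recent increments.
    public long query(String key) {
        return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
    }

    // Batch layer: after the new data has been merged into the historic store
    // and the views recomputed, swap in the new batch view and reset the
    // real-time view (a real system would coordinate this more carefully).
    public void onBatchRecompute(Map<String, Long> recomputedView) {
        batchView = recomputedView;
        realtimeView.clear();
    }
}

In a real deployment the batch view would be rebuilt by Hadoop jobs over the full historic dataset, while the real-time view would be fed by a stream processor such as Storm or Spark.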
The chapter ends with a look at new trends, including: Hive and Pig getting faster with each release, the increasing use of in-memory processing (e.g. Spark), and the growth of Machine Learning and R. This chapter provides some helpful discussion of the Hadoop distribution upgrade cycle and how it integrates (or not) with other finance-related software upgrades. There are some useful best practices and standards suggested.
Conclusion
The book is generally easy to read, with good explanations, useful diagrams, and links to websites for further information. Assertions are backed by supporting evidence. There are plenty of finance use cases for you to consider, and a good section on recommended skills. Sometimes the examples are unnecessarily complex (e.g. online archiving); for an introductory book, the examples should be simple. The book's examples relate largely to investment banking rather than finance as a whole. Most sections are brief, and not deeply technical. Perhaps the next book to read is Big Data Made Easy, which I gave a rating of 4.5 in my review; I found it to be a useful working introduction and overview of the current state of big data, Hadoop and its associated tools. This book is a useful, if brief, introduction to Hadoop and its major components, using examples from investment banking.
Last Updated (Friday, 28 August 2015)