Author: Guy Harrison
Date: December 30, 2015
Audience: Architects, DBAs, and Devs
Reviewer: Ian Stirk
Part II: The Gory Details
Chapter 8 Distributed Database Patterns
Traditionally, relational databases had no need for distribution: when scalability became a problem, a bigger machine was purchased. Distribution was later achieved using replication, although there was still a central master database. Real distribution came with running queries in parallel across multiple servers using a “shared nothing” architecture (e.g. Teradata). Here again there are limitations with transactions, and node workloads can become skewed. The skew may be reduced with a shared-disk architecture, but this raises co-ordination problems.
The chapter continues with a brief look at non-relational distributed databases. Since these systems don’t maintain ACID compliance, distributed processing is ‘easier’. Instead, the focus is on balancing availability and consistency; often consistency is sacrificed in favour of availability (if there is a problem, e.g. two people booking the same hotel room, it can be fixed later by the business). Other factors discussed are the use of cheaper commodity hardware and built-in resilience.
The chapter ends with a discussion of 3 non-relational databases, each having a different architectural model, namely:
MongoDB – sharding model, data distributed across nodes based on a shard key
HBase – omniscient master, which determines where data should be loaded in the cluster
Cassandra – consistent hashing, data distributed across nodes based on a hash of the key
In each case, the underlying model, its implementation, and components are discussed and illustrated with helpful diagrams.
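To make the consistent hashing idea concrete, here is a minimal Python sketch. It is not code from the book and not how Cassandra is implemented: the `HashRing` class, the node names, the virtual node count, and the use of SHA-256 are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def ring_point(text: str) -> int:
    """Map a string to a position on the ring (SHA-256 is an illustrative choice)."""
    return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; all names here are hypothetical."""

    def __init__(self, nodes, vnodes=64):
        # Each node is placed on the ring at several points (virtual nodes)
        # so that keys spread more evenly across nodes.
        ring = []
        for node in nodes:
            for i in range(vnodes):
                ring.append((ring_point(f"{node}#{i}"), node))
        ring.sort()
        self._points = [point for point, _ in ring]
        self._owners = [node for _, node in ring]

    def node_for(self, key: str) -> str:
        # A key belongs to the first ring point clockwise from its hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._points, ring_point(key)) % len(self._points)
        return self._owners[idx]


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("user:42"))
```

The attraction of this scheme is that adding a node only moves the keys that now fall to the new node; the rest stay where they were.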
This chapter provides a useful overview of how distribution is achieved in relational databases (with their consistency requirements), and in 3 different ways in non-relational databases.
Chapter 9 Consistency Models
This chapter discusses some different approaches to maintaining data consistency for both NewSQL (relational) and NoSQL databases. The chapter opens with a look at consistency models, with relational databases offering ACID consistency, and non-relational offering eventual consistency. Aspects of consistency discussed include:
ACID and MVCC – optimistic v pessimistic concurrency
Global Transaction Sequence Numbers – used to identify transactions
Two-phase Commit – ACID transactions across databases
Other Levels of Consistency – typically non-ACID transactions (e.g. strict, causal, weak)
The chapter continues with a discussion of how consistency is achieved in the same 3 non-relational databases covered in the previous chapter (MongoDB, HBase, Cassandra). In each case, the underlying architecture and locking are discussed.
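For readers unfamiliar with two-phase commit, the following Python sketch shows the basic shape of the protocol. It is my own simplified illustration, not taken from the book: the `Participant` class and `two_phase_commit` function are hypothetical, and real implementations also need logging, timeouts, and recovery, which are omitted here.

```python
class Participant:
    """Hypothetical resource manager (e.g. one database in the transaction)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self) -> bool:
        # Phase 1: vote yes (and promise to commit if asked) or vote no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants) -> bool:
    """Return True if the transaction committed everywhere, else False."""
    # Phase 1 (voting): ask every participant whether it can commit.
    votes = [p.prepare() for p in participants]

    # Phase 2 (completion): commit only if the vote was unanimous.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


if __name__ == "__main__":
    orders = Participant("orders")
    payments = Participant("payments", can_commit=False)
    print(two_phase_commit([orders, payments]), orders.state, payments.state)
```

A single "no" vote aborts the transaction on every participant, which is what gives the all-or-nothing behaviour across databases.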
This chapter provides a useful overview of consistency models in both relational and non-relational databases. While interesting, I felt it was less well organized than previous chapters, missing a sentence or two about how the sections are linked together.
Chapter 10 Data Models and Storage
This chapter extends the previous chapters on architecture and consistency models, discussing how they are supported by data models and storage. The chapter opens with a quick review of the relational data model, with its normalized tables, joins, and strict schemas. The various non-relational models aim for flexibility and avoid performance-degrading joins. Each of the main types of NoSQL database (key-value, BigTable, Document, Graph) is examined at the structural level.
The chapter ends with a look at storage, where the logical structure is typically abstracted away from the physical implementation. This section looks at the movement away from the relational model’s B-tree storage (optimized for random access) to the log-structured merge tree (optimized for sequential writes). In each case the tree structure is described. The section ends with a look at secondary indexes, in both relational and various NoSQL databases.
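The log-structured merge tree idea can be sketched in a few lines of Python. This is an assumption-laden toy of my own, not the book's code: the `TinyLSM` class and its `memtable_limit` parameter are hypothetical, the "on-disk" tables are kept in memory, and compaction (merging older tables) is left out.

```python
import bisect


class TinyLSM:
    """Toy log-structured merge store; all names are hypothetical."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}          # recent writes, held in memory
        self.sstables = []          # immutable sorted tables, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch the in-memory table, so they are cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # A full memtable is written out once, in key order (a sequential
        # write), and is never modified afterwards.
        items = sorted(self.memtable.items())
        keys = [k for k, _ in items]
        values = [v for _, v in items]
        self.sstables.append((keys, values))
        self.memtable = {}

    def get(self, key, default=None):
        # Reads check the newest data first: memtable, then tables from
        # newest to oldest, using binary search within each sorted table.
        if key in self.memtable:
            return self.memtable[key]
        for keys, values in reversed(self.sstables):
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return values[i]
        return default


if __name__ == "__main__":
    db = TinyLSM()
    db.put("a", 1)
    db.put("b", 2)   # triggers a flush
    db.put("a", 3)   # newer value shadows the flushed one
    print(db.get("a"), db.get("b"))
```

The trade-off is visible even in this toy: writes are fast and sequential, while a read may have to consult several tables before it finds (or fails to find) a key.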
After reading this chapter, I got a better understanding of how the previous 2 chapters fit together. It might have been worthwhile indicating somewhere, near the start, that these 3 chapters are in fact related, and should be read as a whole!
Chapter 11 Languages and Programming Interfaces
The chapter opens with the observation that the acceptance and domination of relational databases was due to a friendly querying language, SQL. Increasingly SQL is also being used with NoSQL databases.
The chapter continues with a look at accessing NoSQL databases, which were typically developed by programmers for programmers, and often have low-level API access. Example access code is provided and discussed for most of the following: Riak, HBase, MongoDB, Cassandra Query Language (CQL), MapReduce, Pig, Directed Acyclic Graphs, Cascading, and Spark.
The chapter ends with a look at the return of SQL. The various NoSQL databases each have their own proprietary access mechanisms, but SQL is increasingly returning as a generic access mechanism. The technologies looked at, with example code, include: Hive, Spark SQL, and Drill.
This chapter provides a helpful overview of the APIs for the different NoSQL databases. It also has a useful roundup of the higher-level SQL tools used to access NoSQL databases.
Chapter 12 Databases of the Future
Here the author gives his personal take on database systems in the near future. He argues that while the database revolution is still occurring, eventually the various disparate technologies will converge.
The chapter next looks at the inadequacies of the latest databases, including: coupling of logical and physical structures, potential data inconsistencies (from not using ACID), a focus on programmers rather than business users, and having many compromises. It’s acknowledged that relational databases will still be the choice for many applications, but there are niches where NoSQL databases are a better fit.
The author makes an interesting case for having relational and non-relational features within a single database. He envisages a tuneable configuration for data consistency (for example). He argues it is preferable for the database industry to provide a single comprehensive solution, rather than the disparate state that exists currently. Examples of how to combine the different technologies into a single architecture are discussed.
The chapter continues with a look at Oracle, which is providing a converging database, integrating NoSQL features into its relational offering. Features discussed include: JSON support, use of REST, Graph, and Sharding. Other vendors are pursuing similar routes.
The chapter ends with a brief look at some disruptive database technologies (storage, blockchain, quantum computing). This section is much more speculative, but it is interesting nonetheless.
This is both an interesting and original chapter. I’m not sure how correct it will prove in its predictions, but it’s a good starting point for discussions.
Appendix A: Database Survey
This is a short overview of the 16 major database systems in this book (e.g. MongoDB). Each entry lists: licensing, Wikipedia description, vendor’s description, author’s take, data model, transactional model, clustering, and APIs.
This book aims to help you choose the correct database technology, in the era of Big Data, NoSQL, and NewSQL, and succeeds. The book is generally easy to read, with useful explanations, considered discussions, helpful diagrams, inter-chapter references, and website links.
The main types of NoSQL database (key-value, graph, document, columnar) are described, as are the newer relational database features (NewSQL). It certainly helps explain the recent database changes in the context of web innovation.
It might have been useful to have a matrix containing details of what type of database to choose for specific scenarios. The book has useful links for further information at the end of each chapter, but these would be better annotated and placed on the page where they are referenced. The book has no introduction (except the back cover), and although the book is split into 2 sections, they are not discussed, so a roadmap would be useful.
This book will prove useful to anyone who wants to know how to choose an appropriate database solution in these changing times, and how we arrived at the current mixture of disparate databases. Highly recommended.