Apache Cassandra Essentials
Author: Nitin Padalia

This book aims to explain the core concepts of Cassandra. How does it fare?
The growth of Big Data has highlighted the scalability limits of relational databases. Several NoSQL databases have arisen to fill this niche, and one of the more popular is Cassandra.

The book is aimed at developers working with Cassandra who want a more in-depth understanding. To get the most out of it, a basic understanding of Cassandra is required, together with an appreciation of Java code and database systems. The book is relatively short, containing 146 working pages spread over seven chapters.

Below is a chapter-by-chapter exploration of the topics covered.

Chapter 1 Getting Your Cassandra Cluster Ready

The book opens with a comprehensive guide to installing Cassandra, covering the prerequisites (e.g. memory requirements) before downloading the Cassandra source code, compiling it, and installing it. Instructions are also provided on installing a precompiled binary. The content of the various directories is outlined. The chapter next looks at the various configuration files, including cluster, data partitioning, storage, client and security. The major content of each is briefly discussed. The chapter ends with details on running a Cassandra server, both on a single node and on a cluster of nodes. The Cassandra nodetool utility is used to check and monitor the cluster.

This chapter provides a useful introduction to getting your Cassandra cluster up and running. It is generally easy to read, with plenty of hands-on detail, useful tables and diagrams, and reusable scripts. There are occasional grammar problems. Some terms are used without being defined (e.g. NoSQL). These traits apply to the whole of the book.

Chapter 2 An Architectural Overview

Cassandra has its origins in Google’s Bigtable and Amazon’s Dynamo. The chapter explains that normalization is not a concern. How Cassandra relates to Brewer’s CAP theorem is briefly discussed.
Here, as elsewhere, the theorem is stated suboptimally. In essence, it should mean this: when you have a partitioned system, you need to choose between availability and consistency. The chapter is on firmer ground when it discusses the cluster topology, with each node being a peer, providing linear scalability. There’s a useful explanation of the Gossip Protocol, which allows the various nodes to get information about each other. Detecting node failure is briefly discussed. Data distribution via automatic sharding is discussed in plenty of detail, with helpful supporting diagrams. Similarly, there’s a section on replication, which aids performance and fault tolerance. The chapter ends with a look at adding nodes to the cluster, before looking at creating a column family (used to group columns).

This chapter provides a helpful overview of Cassandra’s architecture. Some terms are used before being defined (e.g. keyspace) or are not defined at all (e.g. normalization).

Chapter 3 Creating Database and Schema

Here the basic database concepts are discussed, including keyspaces (which hold tables), column families (tables), and primary keys (unique row identifiers). Various CREATE TABLE options are discussed, as are static rows (fixed number of columns) and wide rows. Next, the chapter looks at data types, namely native, collections (set, list, map), tuple, user-defined types (UDT), and custom. Some example usage is provided.

What follows next is a collection of seemingly unrelated topics. The use of secondary indexes to improve performance is briefly discussed with examples. The use of ‘time to live’ (TTL) operations to remove data after a given time period is explained. Conditional querying using the WHERE clause is explained with examples. The use of lightweight transactions and batch statements (transactions) is similarly explained.

This chapter explains Cassandra’s underlying database features.
Unfortunately, there is no linkage between the sections. The topics are related, but no attempt is made to connect them, which makes it more difficult to put the content into context.
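
For readers unfamiliar with the ‘time to live’ idea mentioned above, the following is a minimal sketch of the semantics in plain Python. It is not Cassandra code and does not reflect how Cassandra implements expiry internally; the class name `TTLStore` and its methods are hypothetical, chosen only for illustration.

```python
import time


class TTLStore:
    """Toy key-value store whose entries can expire after a time to live (seconds).

    Illustrative only: this is an assumption-laden sketch of TTL semantics,
    not Cassandra's implementation.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable clock, so expiry can be exercised without sleeping
        self._data = {}      # key -> (value, expiry time or None)

    def put(self, key, value, ttl=None):
        # ttl=None means the entry never expires
        expires_at = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily drop the expired entry on read
            return None
        return value
```

An entry written with `put("session", "abc", ttl=30)` would be readable for 30 seconds and return `None` afterwards, while an entry written without `ttl` stays until overwritten.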

Chapter 4 Read and Write – Behind the Scenes

Here we examine the internals of Cassandra. There’s a detailed section on what happens when writes occur, in terms of the various nodes and the CommitLog (details of the change), Memtable (in-memory structure), and SSTable (structures on disk). Similarly, there’s a detailed section on what happens when a read occurs and how this affects the cache. There’s a brief section on how deletes work. The effect of reads and writes on consistency levels is explained in some detail. The chapter ends with a useful discussion of Cassandra’s ability to trace queries, which helps in debugging poorly performing queries.

This chapter provides quite a lot of detail on the internals of read and write operations. In traditional relational databases there is a marked separation between developer and DBA work, whereas in this book the two seem to overlap for Cassandra.

Chapter 5 Writing Your Cassandra Client

This chapter concentrates on creating a Cassandra client using the DataStax Java driver. Java code is provided to connect to the Cassandra cluster. A useful table shows the compatibility of various DataStax drivers with versions of Cassandra. Cassandra driver policies for load balancing, retry mechanisms, and reconnection are examined. The chapter next looks in detail at reading and writing to the Cassandra cluster, with useful annotated Java code supplied. The Mapping API, which maps query results to Java classes, is discussed with example code. The chapter ends with a look at tracing Cassandra queries using the Java driver, which is useful for monitoring and debugging.

This chapter provides useful detail on using the DataStax driver, together with Java code, to interact with Cassandra.

Chapter 6 Monitoring and Tuning a Cassandra Cluster

It’s important to check the health of your Cassandra cluster, and monitoring and tuning are especially important when problems occur.
The chapter opens with a look at some of the tools used for monitoring, namely logging (native and third-party), command-line tools (e.g. nodetool cfstats), JConsole, and third-party tools (a website link is provided for these). In each case, example usage is given and the output described. The chapter next looks at tuning, discussing cache configuration (accessing data in caches is much faster than disk access) and bloom filter tuning (which reduces disk seeks). Lastly, the common practice of tuning Java itself is described.

This chapter provides some very useful options both for monitoring Cassandra operations and for tuning to improve performance.

Chapter 7 Backup and Restore

Although Cassandra’s data is replicated for fault tolerance and performance, this does not mean backups are not needed (e.g. the centre hosting the cluster may have a problem). As with all databases, backup and restore are critical operations. The chapter opens with an example script to back up a Cassandra cluster. This is followed naturally by an annotated script to restore data. The chapter ends with sections on adding, removing, and replacing nodes on a Cassandra cluster.

This chapter provides critical information on backing up and restoring a Cassandra cluster.

Conclusion

This book aims to explain the core concepts of Cassandra, and generally succeeds. The book is easy to read, with plenty of hands-on detail, useful tables and diagrams, and helpful scripts.

This is not a book for the complete Cassandra novice. Additionally, to get the most from the book, an understanding of Java code and database systems is needed. Some knowledge of related terms is assumed (e.g. NoSQL, normalization), since these are not defined.

Occasionally, there is bad grammar, but nothing too onerous. Sometimes the topics in a chapter are not introduced or linked together, which makes the chapter difficult to make sense of.
The book contains a mixture of both developer and administrator tasks. This seems to be the norm with NoSQL databases, whereas these areas are distinct in relational database environments. The book would benefit from a section on where to go next to get further information (e.g. websites, newsletters, forums etc).

Overall, this is a useful overview of Cassandra, its internals and functionality.
Last Updated ( Friday, 27 May 2016 )