Authors: Jordan Tigani and Siddartha Naidu
Publisher: Wiley, 2014
Aimed at: Analysts and developers wanting to learn BigQuery
Reviewed by: Kay Ewbank
Will this book help you make the most of BigQuery?
When Google released BigQuery, it sounded interesting. A service that lets you analyze terabytes of data in seconds using SQL-like queries and the processing power of Google's infrastructure – obviously a winner. Knowing whether or not it would make your own particular analysis task easier and faster is a different matter. The authors of Google BigQuery Analytics are well placed to give the inside story of BigQuery; Siddartha Naidu was one of the two engineers who built the original prototype, and Jordan Tigani is also part of the internal Google BigQuery team. The book is written to take you from first using BigQuery all the way through to advanced use and to using BigQuery from applications and other systems.
The first four chapters of the book cover the BigQuery fundamentals, the Google view of Big Data, and the BigQuery Object Model. The first chapter on ‘the story of Big Data at Google’ is a very good overview of what big data is and how to deal with it, though obviously with a Google spin on the topic, and the next three chapters give an excellent grounding in BigQuery.
Part II looks at ‘Basic BigQuery’. There’s a chapter on talking to the BigQuery API that starts from the raw underlying HTTP format of the REST API, then goes on to look at authentication. There’s a nice section on ‘RESTful web services for the SOAP-less masses’ that explains REST in a really clear way. The rest of the chapter concentrates on the REST collections in the API. The techniques for loading data are tackled next, followed by running queries. This takes you from how to send the API requests through to how to construct valid queries in BigQuery’s SQL-like language. The final chapter in this section puts together what you’ve been shown so far by building an application that consists of an Android client and an AppEngine application. It uses BigQuery for logging, creates and manages a dashboard with graphs and reports, and also lets you run ad-hoc queries. The sample code (in Python) is all available online for download, and is well explained.
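To give a flavour of what ‘raw HTTP’ means here, the following is a minimal sketch of how a query request to the REST API might be assembled in Python. It is not taken from the book. The endpoint URL follows Google's public documentation for the jobs.query method as best I know it, and the project ID, access token and the helper name build_query_request are hypothetical placeholders. The request is only built, never sent.

```python
import json
import urllib.request

# Hypothetical placeholders: substitute a real project ID and OAuth 2.0 access token.
PROJECT_ID = "my-project"
ACCESS_TOKEN = "placeholder-token"


def build_query_request(project_id, access_token, sql):
    """Build (but do not send) an HTTP POST for a BigQuery query.

    The URL pattern is an assumption based on the public REST reference
    for jobs.query, not something quoted from the book.
    """
    url = f"https://bigquery.googleapis.com/bigquery/v2/projects/{project_id}/queries"
    body = json.dumps({"query": sql}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {access_token}",  # OAuth 2.0 bearer token
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_query_request(PROJECT_ID, ACCESS_TOKEN, "SELECT 17")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen with a valid token would send it; in practice most people would use a client library rather than hand-built requests, which is also where the book ends up.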
The third section covers the more advanced aspects of BigQuery. There’s a good chapter on understanding query execution that describes the systems that BigQuery is based on, and how to use them to write good queries. The authors point out that because the BigQuery language looks very like SQL, it’s tempting to use the techniques you’ve developed to make SQL queries run efficiently. In fact, under the covers BigQuery is very different and your SQL habits can result in very slow queries, so you need to override your gut feel for what makes a good query and look at what’s really needed. The next chapter goes even further into queries, showing some examples that you probably wouldn’t have thought of as a SQL query writer. The BigQuery extensions to SQL such as ‘EACH’ (which tells the query engine to perform a shuffle operation to sort data so it can be processed in parallel) are explained, along with examples of how to use them. Errors specific to BigQuery such as ‘Result too large’ and ‘Resources exceeded’ are also explained, and the chapter ends with a set of ‘recipes’ showing analysis such as pivot tables, cohort analysis, parallel lists, how to find concurrency, and trailing averages. This will probably be one part of the book you return to if you’re using BigQuery for real. This part of the book ends with a useful chapter on managing data stored in BigQuery that gives best practices for partitioning data effectively, and for reducing the cost of running BigQuery.
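The idea behind the shuffle that ‘EACH’ requests can be pictured with a small Python sketch. This is an illustration of general hash partitioning only, not BigQuery's actual implementation and not code from the book. The function name shuffle_by_key and the sample rows are made up for the example.

```python
import zlib
from collections import defaultdict


def shuffle_by_key(rows, key, num_partitions):
    """Assign each row to a partition chosen by a stable hash of its key."""
    partitions = defaultdict(list)
    for row in rows:
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[bucket].append(row)
    return dict(partitions)


# Made-up sample data.
rows = [
    {"user": "alice", "clicks": 3},
    {"user": "bob", "clicks": 5},
    {"user": "alice", "clicks": 2},
    {"user": "carol", "clicks": 7},
]
parts = shuffle_by_key(rows, "user", num_partitions=2)

# Every row for a given user lands in the same partition, so each partition
# could be aggregated by a separate worker with no cross-talk.
totals = {}
for bucket_rows in parts.values():
    for row in bucket_rows:
        totals[row["user"]] = totals.get(row["user"], 0) + row["clicks"]
print(totals)
```

The point of the sketch is the invariant rather than the arithmetic: once rows are grouped by key, no worker needs to see another worker's data, which is what makes the parallel processing possible.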
While BigQuery has a lot of interesting features, chances are it will be only part of most solutions, and the final part of the book considers using it with other systems. The first chapter on external data processing shows how to get your data out of BigQuery, then how to use MapReduce to transform BigQuery tables and how to run Hadoop over your BigQuery data. In something of a change of scale, the chapter also looks at querying BigQuery from a spreadsheet, with techniques for both Google Spreadsheets and Excel.
The next chapter covers third-party tools for data visualization (Tableau and BIME are discussed), client-side encryption, R, and BigQuery via ODBC, with walkthroughs in each case showing how to connect the different elements to BigQuery. The final chapter looks at querying Google data sources – AdSense, Google Analytics and DoubleClick.
If you’re at all interested in Google BigQuery, this is an excellent book. The descriptions and sample code are clear and easy to understand, and the fact the authors are so involved with the project means they include insights into why things were designed in that particular way. There are useful descriptions of the differences between BigQuery and other tools such as MapReduce, and overall you’ll come out with a much clearer view of the big data scene right now, and how everything fits together.