Pandas for Everyone: Python Data Analysis

Author: Daniel Chen
Publisher: Addison-Wesley
Pages: 416
ISBN: 978-0134546933
Print: 0134546938
Kindle: B0789WKTKJ
Audience: Python developers
Rating: 3
Reviewer: Mike James

Python a general purpose language for data analysis? With Pandas it might be possible.

This is a book about using the Python package Pandas - a data manipulation and analysis library. You could say that Python plus Pandas is the equal of, say, R or SAS, but with more flexibility.

So you want a book on Pandas but what do you expect from such a book?

The problem is that this is not a deep theoretical subject where there are great concepts to learn and skills to be mastered. Using a package such as this is mostly a matter of finding out what the implementors decided to call some function or exactly how they hide the feature you are looking for. This sounds like you probably need a cookbook - this book is not a cookbook exactly, but it is a collection of task-oriented descriptions.

The bottom line is that this is not really a book that you would sit down and read cover to cover. It is more what you might turn to to solve a problem or get you into a topic.

The book is divided into five parts, the first of which is an Introduction. Chapter 1 starts off this section with a look at the DataFrame. You are expected to know Python and, for the most part, how to get the programming environment set up. Chapter 2 moves on to consider more general data structures and data. The topics include how to import data. Chapter 3 moves outside of Pandas to use Matplotlib and Seaborn to create charts.

Part II covers Data Manipulation and this has an interesting approach to the subject, focusing on how the data's characteristics and origins affect how it is represented in a Pandas DataFrame. It introduces the idea of "tidy data", almost an informal normal form for statistical data - it's a nice idea. Chapter 4 deals with merging datasets. Chapter 5 is about missing data and what missing data actually means. Chapter 6 is dedicated to the idea of tidy data and explains "columns contain variables not values". This introduces the idea of melting or pivoting, or whatever you want to call it, but without being clear about what is happening to the data. You are expected to see what the transformation is from the examples, i.e. by looking at extracts from the data. I don't think that this is the best way of explaining the operation - it's an algorithm and you can tell the reader what the algorithm is. Unless you get the idea of how the different columns work together to create a variable, this is a difficult chapter.
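To make the point concrete, melting can be stated as a short algorithm: every non-identifier column of every row becomes a row of its own, carrying the column name in one new field and the cell value in another. What follows is a minimal sketch in plain Python, not the book's code and not how Pandas implements it. The function name `melt_rows`, the column names and the data are all invented for illustration.

```python
def melt_rows(rows, id_vars, var_name="variable", value_name="value"):
    """Reshape wide rows (dicts) into long rows, one per non-identifier cell."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column in id_vars:
                continue  # identifier columns are copied, not melted
            # Keep the identifiers, then record which column the value came from.
            new_row = {key: row[key] for key in id_vars}
            new_row[var_name] = column
            new_row[value_name] = value
            long_rows.append(new_row)
    return long_rows


# Hypothetical wide data: the year is a value hiding in the column headers.
wide = [
    {"country": "A", "1990": 10, "2000": 12},
    {"country": "B", "1990": 20, "2000": 25},
]

long_form = melt_rows(wide, id_vars=["country"], var_name="year", value_name="count")
for row in long_form:
    print(row)
```

The Pandas `melt` function performs the same reshaping on a DataFrame, although the order in which it emits the rows may differ from this sketch.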

Part III is on Data Munging, which is another way of saying getting your data into shape for the analysis you plan to use. Chapters 7 and 8 cover some surprisingly basic topics - data types, strings, and categorical data. They go into Python and general programming topics such as how to format strings, using regular expressions and so on. Again this is mostly learning by showing rather than by explanation. Chapter 9 is on the Apply method and this really should be fairly obvious material to any reasonable Python programmer. Chapter 10 is on Groupby and Split type operations, inevitable in the sense that you often have to do them. The section closes, in Chapter 11, with a look at the problems of using date and time data - never as easy as you might expect.
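For readers who have not met these operations, here is a small sketch of what apply and groupby look like in practice. It assumes Pandas is installed; the DataFrame, its column names and the values are invented for illustration and are not taken from the book.

```python
import pandas as pd

# Hypothetical data, made up for this example.
df = pd.DataFrame({
    "species": ["cat", "cat", "dog", "dog"],
    "weight_kg": [4.0, 5.0, 20.0, 30.0],
})

# apply: run an ordinary Python function on every value in a column.
df["weight_g"] = df["weight_kg"].apply(lambda kg: kg * 1000)

# groupby: split the rows by species, average each group, combine the results.
mean_weight = df.groupby("species")["weight_kg"].mean()

print(df)
print(mean_weight)
```

As the review suggests, a Python programmer who is comfortable passing functions around should find little here that is surprising.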

 

Part IV is about Data Modelling and in most cases this is the section that books on using statistical software should leave out unless they are prepared to write a full textbook on the subject. To pretend that you can understand even something as simple as linear regression from a page or two isn't realistic. Go read a statistics book. The section starts in Chapters 12 and 13 with the fairly simple models - regression through generalized linear models, but not in the ANOVA sense. Chapter 14 covers diagnostics. Chapter 15 is on regularization including LASSO and ridge regression. Chapter 16 goes over the basic methods of clustering and brings the section to a close. Being such a short section there is a lot that isn't covered - contingency tables and categorical analysis, not to mention the whole world that ANOVA-style analysis is. There is also nothing on factor analysis, principal components, discriminant analysis etc. This isn't a huge problem as, even if they were covered, you would need another book to do them justice.

The final section, Conclusion, is a look at some fairly off-topic subjects. Chapter 17 is about the wider Python community and Chapter 18 is about how to be a self-directed learner - go to meetings, conferences etc. The space could have been better spent on more Pandas.

The book closes with some appendixes on installing Pandas.

Conclusion

This is a book that is strong on showing you how to do things rather than explaining how to do things. There isn't much deep principle in a package like Pandas, but there are missed opportunities to point out the generalities of data preparation, model proposal and testing. There are places where it reads more like a set of lecture notes than a complete narrative account of using Pandas. In addition, the presentation often makes it harder to see what is being demonstrated, with tables split across pages where it would have been easy to adjust the layout to keep the lines together. Overall I found the book more difficult to read than it needed to be.

If you are a fairly good Python programmer there are also places in the book where you are told some very basic things about strings, functions and so on.

This book will suit you if you are prepared to actively investigate the examples you are being shown and think about what is happening. In many cases you will need to study the data to see how the commands are changing it.



Last Updated ( Wednesday, 05 September 2018 )