Foundations of Data Science with Python (CRC Press)
Author: John M. Shea

I can remember the days when data science was called statistics, but data science is clearly a more attractive option. This book takes a fairly traditional approach to statistics with the help of Python. It isn't about the exciting things going on in AI and it won't help you with database problems - database isn't data science, it seems.

This is a very pretty book with color used throughout, but it isn't a beginner's book. I would say that if you are not happy with the level of mathematics used to explain the concepts then you would be better off with a book that operates at a higher level. You need to be mathematically competent to read this book.

John Shea starts off with a look at using Python and Jupyter notebooks, but this isn't enough to get you started if you don't already know the language. The book really gets going in Chapter 2, where he contemplates the problem of testing whether a coin is fair or not. You are clearly expected to know something about probability theory, but not much. The whole idea is explored using a simulation and it leads up to the idea of a statistical test, though not in a rigorous way. In this introduction we are only concerned with the probability of getting what seems like an unlikely result from a fair coin. Much of the discussion is on elementary topics such as histograms and scatter plots and how to create a Jupyter notebook that performs the simulation.

Chapter 3 moves on to using Pandas. It covers scatter plots, histograms and summary statistics. Here the mathematical level zooms up, with set theory notation, sigmas and calculus. For me, much of the theory gets lost in the minutiae of implementing things in a programming language. There are some subtle ideas here, but they are presented without much discussion because there is a lot to say about implementation.

Chapter 4 is an introduction to probability theory. It starts in a fairly standard and chatty way, but it then moves on to axiomatic probability theory and we have more equations that might put you off. There is no discussion of what probability actually is, apart from an appeal to relative frequencies to make it all seem reasonable. Of course, this gets more complicated if you want to use Bayesian methods, which often need to go beyond relative frequencies.

Chapter 5 tackles the idea of null hypothesis tests. This is a clear exposition, but nothing special. It doesn't mention any of the classical tests, but majors on resampling approaches. Is this the key difference between classical statistics and the modern data science approach? Classical statistics eventually caught on to resampling, but is this a reason to abandon the classical statistical tests? Are t-tests no longer relevant? Only if the central limit theorem has been overturned. This is a huge blind spot in data science. The t-test is eventually introduced in Chapter 9, but almost as a sideshow rather than a main player.
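To give a flavour of the simulation-based approach described in Chapters 2 and 5, here is a minimal sketch, not taken from the book, of testing a coin for fairness purely by simulation; the number of flips, the observed count of heads and the number of simulated experiments are made-up values chosen for illustration:

import numpy as np

rng = np.random.default_rng(0)

n_flips = 100          # hypothetical experiment: 100 flips
observed_heads = 62    # hypothetical observed count of heads
n_trials = 100_000     # number of simulated fair-coin experiments

# Simulate many experiments with a genuinely fair coin
heads = rng.binomial(n_flips, 0.5, size=n_trials)

# Fraction of fair-coin experiments at least as far from 50 heads as the observation
extreme = np.abs(heads - n_flips / 2) >= abs(observed_heads - n_flips / 2)
print(f"Estimated probability under a fair coin: {extreme.mean():.4f}")

The estimated fraction plays the same role as the p-value in a classical test, which is exactly the contrast with the t-test drawn above.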
We return to probability theory in Chapter 6 with conditional probability, in preparation for Bayesian methods in Chapter 7. There is nothing wrong with Bayes' theorem when applied to physical probabilities that can, in principle, be verified as long-term frequencies. The problems start when you apply it to probabilities that have no basis in physical reality - then things become philosophically difficult. No real hint of these difficulties is given here. It isn't even mentioned that what you consider to be an uninformative prior is a relative thing, depending on any transformations you may apply to the quantity of interest. That is, to assume that x is uniform means that x² isn't uniform, and an uninformative prior for x is very informative for x² (a short numerical illustration of this point follows the review). But, of course, with resampling being the main practical statistical tool of data science, a Bayesian formulation becomes very attractive.

Chapter 8 looks at random variables, which is a very subtle idea. Here, at long last, we meet the classical distributions, the normal being the most important of them. This allows Chapter 9 to deal with the classical tests for differences in the mean. Chapter 10 is about decision-making, both non-Bayesian and Bayesian. Chapter 11 moves into categorical data and contingency tables, complete with chi-squared testing. Chapter 12 is about linear regression and Chapter 13 covers principal components to decorrelate data.

At the end of the book I felt like I'd been subjected to a 1000mph tour of a city. Bits flashed by and it was hard to work out any structure. This approach to data science has no driving principles or ideas. Are we frequentists, Bayesians or decision theorists? Are we doing resampling and/or Monte Carlo methods, or are we making use of the central limit theorem and doing classical statistics? The book's main thrust is resampling, but somehow this fades away as it progresses and classical alternatives are introduced. I also found the way detailed examples were used to illustrate ideas more difficult than having the idea simply explained. I'm also not very clear as to why Python figures so prominently. Is it there to provide simulations to help you understand the ideas, or to enable you to learn to analyse data? The simulations and demonstrations included in the book could better be provided on a website, and there isn't enough deep analysis performed to demonstrate data cleaning and preparation. When the book gets to powerful techniques like linear regression, but then fails to cover multiple regression, you realize how much more there is to learn. I'm afraid data science is much too big a topic to be covered in a book of this size. What it covers, it does well, but it lacks focus on a single approach.
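As a footnote to the point about uninformative priors, here is a minimal numerical sketch, not taken from the book, of how a flat prior stops being flat under a change of variable: if x is uniform on (0, 1), the implied prior on x² concentrates near zero (analytically, the density of y = x² is 1/(2*sqrt(y))).

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)   # "uninformative" flat prior on x
y = x ** 2                                   # the same prior, seen through y = x**2

# Compare how much prior mass each variable puts below 0.25
print(f"P(x < 0.25) = {np.mean(x < 0.25):.3f}")   # ~0.25, flat
print(f"P(y < 0.25) = {np.mean(y < 0.25):.3f}")   # ~0.50, strongly skewed towards zero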
To keep up with our coverage of books for programmers, follow @bookwatchiprog on Twitter or subscribe to I Programmer's Books RSS feed for each day's new addition to Book Watch and for new reviews.