Data Analysis with Open Source Tools
Author: Philipp K. Janert

This is a very strange book about data analysis - because it is not written by a statistician but by a physicist. This could be good - I was a physicist before I studied statistics and AI, and I know from experience that the subject requires intellectual rigour. In this case, however, the result reads very much like an outsider's point of view, and in its attempts to be "new wave" it encourages you to view the problem of data analysis in ways that are potentially dangerous.
For example, in Chapter 10 the basic ideas of classical statistics are explained against the background idea that they were developed in a time without computers and are not really that relevant to today's situation. The argument is put forward that today we don't need to worry about statistical tests because we have so much data and can draw charts and graphs. This would be laughable if it were not so dangerous and misleading. So, for example, after explaining the idea of significance, an example is given where a chart clearly shows that two groups of data don't overlap, and so no significance test was actually ever necessary. The author even suggests that the original statisticians were a little misguided to have bothered to do so much work when the evidence is clear for all to see. This would be fine, but the sample size is small, and judging the separation of small data sets is not something to be done by eye. Small data sets are heavily influenced by the random component of the signal. Data that appear to be different might differ only because of random fluctuations, and a significance test takes this into account by giving you the probability that the difference is due to just such a fluctuation. In short, you do need significance tests even if you have a computer that can draw charts (a short simulation sketch below makes the point concrete). Needless to say, the whole idea that big data makes graphical comparison more reasonable is dangerous unless you know what you are doing. It is good to encourage people to look at the data and to use as much data visualization as possible, but not to use that as an excuse or encouragement not to use statistical testing. The point is that you cannot rely on visual inspection to tell you all you need to know about the data.

Later in the same chapter, trendy Bayesian statistics is promoted as being so much better than significance testing. What the author fails to mention, however, is that the Bayesian approach is still controversial, with many unsolved theoretical problems. In many cases, whatever it is that the Bayesians are computing, it isn't probability but some sort of measure of belief, and the theory of belief measures is much more sophisticated and complicated than this account suggests. Of course a Bayesian statistician would take me to task for being so simplistic, but at least we would then have a discussion of the difficulties.

The book also has huge blind spots. There is no mention of many modern statistical techniques that are core to the new science of "big data". The topics selected are indeed the sort of thing a physicist getting into statistics would pick out, based on what extends the sort of physical modelling found in that subject.

Part I of the book is all about graphics, and here there is little chance of the author misleading the innocent reader. After all, what can go wrong with graphics? The chapters do indeed form a good foundation and show how to use NumPy, matplotlib, gnuplot and some others. But why not R, which is arguably the simplest and most flexible way of creating statistical charts?

Part II is more technical and deals with analytics - in essence, data modelling - and this is where the book is most dangerous. It starts very strangely with a chapter on guesstimation, i.e. the art of getting a ballpark figure. This is an art that most physicists are taught, and it's nice to see it get a wider audience, but it is hardly mainstream stats.
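To make the earlier point about significance concrete, here is a minimal simulation sketch - my own Python, not an example taken from the book, and the sample size of eight and the population mean and spread are numbers I have simply assumed for the demonstration. It repeatedly draws two small samples from the same population, counts how often they show an apparently clear gap by chance alone, and then applies a t-test to one such pair:

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
trials = 10_000
gaps = 0
for _ in range(trials):
    a = rng.normal(loc=100.0, scale=15.0, size=8)
    b = rng.normal(loc=100.0, scale=15.0, size=8)   # same population as a
    if abs(a.mean() - b.mean()) > 10:               # looks like a clear gap on a chart
        gaps += 1
print(f"'Obvious' gaps produced by pure noise: {gaps / trials:.1%}")

t_stat, p_value = ttest_ind(a, b)                   # test applied to the last pair drawn
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

Roughly one pair in five shows a gap of that size through nothing but random fluctuation - which is precisely the sort of probability a significance test reports, and precisely what eyeballing a chart cannot tell you.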
The same criticism applies to the next chapter, which deals with working out plausible models from scaling arguments. This includes dimensional analysis, another of the physicist's favourite tools but one that is rarely of use in other subject areas. For example, you can work out that the period of a pendulum has to be proportional to the square root of its length just by knowing that the units have to work out to be pure time (a worked version of this little argument is sketched below). Try the same exercise with IQ and household income and you will find that dimensions don't really help. The same is true for scaling arguments. Even so, it's a nice idea and certainly one worth knowing about - but mainstream statistics it isn't, and it needs careful application to any real-world topic.

Then we have Chapter 10, which is the most dangerous and gives a potted overview of statistics that suggests it was all made up because they didn't have a computer back then. If you now think that "classical" statistics has been made redundant because there are computers - think again.

Finally we have a chapter on a mixture of topics, including parameter estimation by least squares. Why stop at this method alone when you are in the middle of a discussion of the how and why of estimation? Why not at least mention the principle of maximum likelihood - i.e. that the best estimate of a set of parameters is the one that maximises the probability of the observed data given the model? It isn't that difficult to explain; there is a short sketch below showing just how little it takes. Even though this section is on modelling, there is no mention of linear models beyond simple regression - certainly no mention of generalised linear models or anything even slightly sophisticated.

Part III is all about Data Mining, but in practice it is just more modelling. Chapter 12 is on simulation and includes Monte Carlo simulation and resampling. Then we have a chapter on cluster analysis and one on Principal Components Analysis with some AI thrown in.

The final part is a collection of applications, but these are not case studies. In fact the remaining chapters just fill in some missing techniques, mostly using trivial data sets to explain more modelling techniques. The book is very light on real-world examples of any kind, and there are certainly no "big data" relevant examples.

The main problem with this book is that it seems to be written from an outsider's viewpoint. It includes many topics which are usually left out and leaves out many topics which are usually included. There is little coverage of categorical data - no contingency tables, no chi-squared, no classical factor analysis; time series analysis is hardly touched on; and discriminant analysis is introduced as a form of PCA (which it is when used in feature detection). Although the book often mentions non-parametric methods, these don't really make an appearance. Where novel techniques such as AI methods are introduced, the selection is similarly biased - Kohonen maps but not neural networks or genetic algorithms. I could go on... and on...

You could argue that one book isn't sufficient for all of these methods, but other, less important methods are covered, and just mentioning that these exist and complement and extend the methods discussed would take little extra space. The real problem is that this book is like giving the key of the car to a non-driver after telling them where the accelerator is - and that's all. This is a deeply flawed book and a dangerous one in the hands of a novice.
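As an aside, the pendulum argument mentioned above really is just a few lines of bookkeeping. This is my own sketch rather than anything from the book, using sympy purely to solve for the exponents: assume the period behaves like L**a * g**b, where L is a length in metres and g, the acceleration due to gravity, is in metres per second squared, and require the result to come out as a pure time:

import sympy as sp

a, b = sp.symbols('a b')
# metres contribute exponent a + b, seconds contribute exponent -2*b;
# for a pure time the metres must cancel and the seconds must appear
# to the first power
solution = sp.solve([sp.Eq(a + b, 0), sp.Eq(-2*b, 1)], [a, b])
print(solution)   # {a: 1/2, b: -1/2}

The solution a = 1/2, b = -1/2 says the period is proportional to sqrt(L/g), which is the familiar result up to the constant 2*pi - and, as noted above, there is no comparable trick for IQ and household income.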
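The principle of maximum likelihood is no harder to demonstrate. The following is a minimal sketch - again my own Python with made-up straight-line data, not anything drawn from the book: maximising the probability of the observed data under a Gaussian-noise model recovers essentially the same slope and intercept as ordinary least squares, which is precisely why least squares works and why the more general principle deserves at least a mention:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)   # made-up noisy straight line

def neg_log_likelihood(params):
    slope, intercept, sigma = params
    if sigma <= 0:
        return np.inf
    return -np.sum(norm.logpdf(y, loc=slope * x + intercept, scale=sigma))

mle = minimize(neg_log_likelihood, x0=[1.0, 0.0, 1.0], method="Nelder-Mead")
least_squares = np.polyfit(x, y, 1)
print("Maximum likelihood:", mle.x[:2])        # slope, intercept
print("Least squares     :", least_squares)    # essentially the same numbers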
As you can guess, I can't recommend it, and if you do choose to read it, make sure you read a good book on statistics before committing yourself to any conclusion drawn from real data. And whatever you do, don't believe that statistical tests are now redundant because we have computers...
Last Updated (Thursday, 16 December 2010)