Data Cleaning Pocket Primer

Author: Oswald Campesato
Publisher: Mercury Learning
Date: Jan 2018
Pages: 188
ISBN: 978-1683922179
Print: 1683922174
Kindle: B0797MX7PC
Audience: Data scientists committed to Linux or Mac OS.
Rating: 4.5
Reviewer: Alex Armstrong
Data cleaning - break out the command line tools!

If you have worked with any data then you will know that the time it takes to get meaningful results is usually dominated by the time it takes to get the data into a form where it can be analyzed. Data cleaning is a major task and there are lots of books on the topic, but mostly they assume that you are using a programming language that you are also going to use for the analysis.

This particular book doesn't do that. Instead it looks at what you can achieve using just the command line tools available in Linux and Mac OS. This is an interesting idea, but you have to want to work this way for the book to be of much use to you. If you want to use R, say, then you need a book that does data cleaning in R.

 


Chapter 1 is an introduction to Bash and its basic commands, e.g. cat, head and tail, along with path variables and a basic intro to shell scripts. This is the sort of information you could find in any introductory book on Linux/Unix.

Chapter 2 is called Useful Commands and it is just that, with a section on each of the most useful commands - join, fold, split, sort, how to zip files and so on.

The real data cleaning material comes in Chapter 3 on grep. No, it's not a noise that frogs make - well, it is, but it is also the fundamental regular expression tool on the command line. If you already know about regular expressions from somewhere else then you will know most of this, but there are many grep-specific things to learn.
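
To give a flavour of the sort of job grep does in data cleaning, here is a minimal sketch. It is not taken from the book: the sample records and the deliberately loose email pattern are invented for illustration, and only standard grep options (-E, -v, -c) are assumed.

```shell
# Hypothetical sample data: an id and an email address per row.
data=$(printf 'id,email\n1,alice@example.com\n2,not-an-email\n3,bob@example.org\n')

# Illustrative pattern only: digits, a comma, then something shaped like name@domain.tld.
pattern='^[0-9]+,[^@,]+@[^@,]+\.[a-z]+$'

# -E selects extended regular expressions; keep the rows that match.
valid=$(printf '%s\n' "$data" | grep -E "$pattern")

# -v inverts the match and -c counts, giving the number of rejected rows
# (the header row is counted as well, since it does not match).
invalid_count=$(printf '%s\n' "$data" | grep -Evc "$pattern")

printf '%s\n' "$valid"
echo "rejected rows: $invalid_count"
```

With this sample the two well-formed rows are kept and two rows, the header and the malformed record, are rejected.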

You might well have used grep as part of your general use of the Linux command line, but the topic of Chapter 4 is much less well known - sed. This is a stream editor. It reads a file and performs pattern matching and replacement as the file passes through. It is fast and can be used to process large files.
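
As a rough illustration of what "pattern matching and replacement as the file passes through" means in practice, the following sketch tidies some untidy delimited text. The sample lines are made up and the example is not from the book. It sticks to POSIX character classes so that it should behave the same with the sed supplied on Linux and on Mac OS.

```shell
# Hypothetical messy input: stray whitespace and semicolons as separators.
raw=$(printf '  Alice ; 42\nBob;\t17  \n')

# Each -e expression is applied, in order, to every line as it streams through:
#   1. strip leading whitespace
#   2. strip trailing whitespace
#   3. turn each semicolon, plus any whitespace around it, into a comma
cleaned=$(printf '%s\n' "$raw" | sed \
  -e 's/^[[:space:]]*//' \
  -e 's/[[:space:]]*$//' \
  -e 's/[[:space:]]*;[[:space:]]*/,/g')

printf '%s\n' "$cleaned"
```

The result is the two lines Alice,42 and Bob,17. Nothing is held in memory beyond the current line, which is why this approach scales to large files.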

If you can't do the job with sed then you need awk, which is the subject of the final chapter. Awk is best described as a domain specific language for processing text files. Essentially you write programs in awk to find text and change it. It is a full programming language with loops and conditionals so it isn't something you take on lightly. The chapter does a good job of introducing it, but if you are going to learn awk why bother with sed?
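
To show what writing a program in awk looks like, here is a small sketch, again with invented data and not drawn from the book. It assumes a comma-separated file with a header row and a second column that ought to be a whole number. It counts clean and dirty rows and totals the clean ones.

```shell
# Hypothetical input: the "n/a" amount is the dirty value we want to detect.
rows=$(printf 'name,amount\nalice,10\nbob,n/a\ncarol,5\n')

summary=$(printf '%s\n' "$rows" | awk -F',' '
  NR == 1           { next }                  # skip the header row
  $2 ~ /^[0-9]+$/   { total += $2; ok++ }     # whole-number amount: add it to the total
  $2 !~ /^[0-9]+$/  { bad++ }                 # anything else is counted as dirty
  END { printf "ok=%d bad=%d total=%d\n", ok, bad, total }
')

echo "$summary"
```

For this sample the summary line reads ok=2 bad=1 total=15. The pattern-then-action layout is what sets awk apart from sed: each rule is a condition plus a block of code, with variables and arithmetic available throughout.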

This is a good book if you are looking for something about Bash, grep, sed and awk. I'm not sure it really fills the role of a book on data cleaning, however. This is more because I think that if you are serious about data you probably need to learn R, Python or similar. Working from the command line is restrictive and if you put the effort in to learn awk why not learn R? However, despite my reservations, if you do want to go the Linux/Mac OS command line route then this is a good pocket book to have.





Last Updated ( Saturday, 12 October 2019 )