Taming Text

Author: Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris
Publisher: Manning
Pages: 320
ISBN: 978-1933988382
Audience: Java programmers interested in processing text
Rating: 4.5
Reviewer: Alex Armstrong

What do you think a book called "Taming Text" is all about? 

It could be about Unicode or advanced regular expressions or ...

Note that these core text technologies are not what this book is about. What it is about is the task of working with text in a semi-intelligent way.

It is about searching and organizing text in a way that makes sense to a human. This is a big task, and it is not confined to explaining how text is represented in a given programming language. It heads in the direction of Artificial Intelligence (AI) but without needing the complete understanding that such text processing might seem to need. It more or less fits into the category of Natural Language Processing (NLP). In general the methods used in current NLP are statistical and not based on any understanding of what the text means.

 


Chapter 1 starts off by setting the scene - why you might need this sort of text processing. If you are already an NLP enthusiast you probably don't need to read it, but it gets you started nice and easily.

Chapter 2 is where things really get into gear. It explains the workings of language, well the English language, by working its way through the useful levels of looking at text and providing labels for the different parts of speech. Rather than just being a theory lesson, it also points you in the direction of resources that you can use to identify parts of speech, for example. It also discusses the problem of actually reading in the text from files in different formats, using the first of the many open source programs discussed in the book, Apache Tika.

 


Chapter 3 deals with the problems of intelligent search using Apache Solr. It is a basic introduction to Solr: how to get it set up and how to customize and optimize it. Chapter 4 moves on to the problems of fuzzy string matching, starting with some of the measures of similarity that you can work out. The ideas are implemented with reference to Solr in particular.
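To give a flavor of what a "measure of similarity" between strings looks like, here is a minimal sketch of edit (Levenshtein) distance, one common such measure. It is written in Python for brevity and is not taken from the book, which works in Java and Solr; the function names `levenshtein` and `similarity` are made up for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character inserts, deletes and substitutions
    needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the first i-1 chars of a and first j of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance into 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(similarity("taming", "timing"))
```

A fuzzy matcher would typically accept a candidate when the similarity is above some threshold; where to put that threshold is an application decision.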

Chapter 5 is called "Identifying people, places and things" and it discusses the named entity recognition problem. This is our first introduction to OpenNLP. Chapter 6 then covers clustering text using a range of methods and tools, including Carrot2 and Mahout, to implement k-means. Chapter 7 extends this to classification using Lucene.
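For readers who have not met k-means before, the idea can be sketched in a few lines. The following is a plain Python illustration of the basic algorithm only, not Mahout's API and not code from the book. The function name `kmeans`, the caller-supplied starting centroids and the toy term-count vectors are all assumptions made for this example.

```python
def kmeans(points, centroids, iterations=10):
    """Basic k-means: repeatedly assign each point to its nearest centroid,
    then move each centroid to the mean of the points assigned to it."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid (squared Euclidean).
        assignments = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2
                                  for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return centroids, assignments


if __name__ == "__main__":
    # Hypothetical documents as (count of "java", count of "python") vectors.
    docs = [(5.0, 0.0), (4.0, 1.0), (0.0, 6.0), (1.0, 5.0)]
    centers, labels = kmeans(docs, centroids=[(5.0, 0.0), (0.0, 6.0)])
    print(labels)
```

Real text clustering works on much larger, sparse vectors and usually picks starting centroids randomly or with a seeding heuristic, which is where tools such as Mahout earn their keep.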

In Chapter 8 we discover what the object of the entire exercise has been in that it details the implementation of an example question answering system. To find out much about it you are going to have to run the code provided at the book's website.

The final chapter considers the future of the technology, including a quick look at working with other languages, sentiment analysis and the long-term goal of semantic analysis.

This is not a textbook, nor is it a research monograph. It is aimed at programmers who need to understand enough about NLP to build an intelligent question answering system or similar. You will learn the theory as you go along, but it is all explained in fairly plain language and via programming examples. You will need to program in Java, and the tools are in the main Java-oriented. If you are not a Java programmer you can understand the ideas presented, but you will probably struggle to get the examples working. The book is also based on open source tools that are part of the Java ecosystem - for example Solr, Lucene, Tika, Mahout and so on. If you plan to use other tools or other languages then the book will be of less use.

Don't expect the book to show you how to implement complete text understanding, or to show you how to build a system like IBM's Watson question-answering machine. It gives you a very good and very practical overview of what you can achieve fairly easily and with moderate resources.

It is a good Java-oriented introduction to NLP and as such recommended. 

 




Last Updated ( Wednesday, 04 December 2013 )
 
 

   
Copyright © 2014 i-programmer.info. All Rights Reserved.