Rule-Based Matching In Natural Language Processing
Written by Jannes Klaas   
Monday, 20 May 2019

spaCy is an open-source software library for advanced Natural Language Processing, written in Python and Cython. Here it is used to build a rule-based matcher that always classifies the word "iPhone" as a product entity.


This is an excerpt from the book Machine Learning for Finance written by Jannes Klaas. This book introduces the study of machine learning and deep learning algorithms for financial practitioners.

Before deep learning and statistical modeling took over, natural language processing was all about rules. That's not to say that rule-based systems are dead! They are often easy to set up and perform very well at doing simple tasks.

Imagine you wanted to find all mentions of Google in a text. Would you really train a neural network based named entity recognizer? You would have to run all the text through the neural network and then look for Google in the entity texts. Or would you rather just search for text that exactly matches Google with a classic search algorithm? spaCy comes with an easy-to-use rule-based matcher that allows us to do just that. 
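To preview where this is going, such an exact-match rule boils down to a single-token pattern; the machinery behind it is explained below, and google_pattern is just an illustrative name:

# A one-token pattern that matches the exact text "Google"
google_pattern = [{'ORTH': 'Google'}]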

spaCy is compatible with 64-bit CPython 2.7 / 3.5+ and runs on Unix/Linux, macOS/OS X and Windows. The latest spaCy releases are available over pip and conda.

Creating a Matcher

Before we start this section, we first need to make sure that we load the English language model and import the matcher. This is a simple task that can be done by running the following code:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')

 

The matcher searches for patterns that we encode as a list of dictionaries. It operates token by token, that is, word by word, except for punctuation and numbers, where a single symbol can be a token.
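To see this token-by-token behavior, we can print the tokens of a short phrase with the language model we loaded above; note how the comma and the exclamation mark each become their own token:

print([token.text for token in nlp(u'Hello, world!')])

This prints ['Hello', ',', 'world', '!'].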

As a starting example, let’s search for the phrase "Hello, world." We will define a pattern as follows:

pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]

This pattern is fulfilled if the lower case of the first token is "hello." That means that if the actual token text is "Hello" or "HELLO," it would also fulfill the requirement. The second token has to be punctuation, so the phrases "hello. world" and "hello! world" would both work, but not "hello world."

The lower case of the third token has to be "world," so "WoRlD" would also be fine. 

The possible attributes for a token can be the following (a short combined example follows the list):

  • ORTH: The token text has to match exactly.

  • LOWER: The lower case of the token has to match.

  • LENGTH: The length of the token text has to match.

  • IS_ALPHA, IS_ASCII, IS_DIGIT: The token text has to consist of alphabetic characters, ASCII characters, or digits.

  • IS_LOWER, IS_UPPER, IS_TITLE: The token text has to be lower case, upper case or title case.

  • IS_PUNCT, IS_SPACE, IS_STOP: The token has to be punctuation, white space, or a stop word.

  • LIKE_NUM, LIKE_URL, LIKE_EMAIL: The token has to resemble a number, URL, or email address.

  • POS, TAG, DEP, LEMMA, SHAPE: The token's part-of-speech tag, fine-grained tag, dependency label, lemma, or shape has to match.

  • ENT_TYPE: The token's entity type from NER has to match.
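As an example of combining these attributes, the following sketch defines a pattern that would match phrases such as "100 USD": a token that looks like a number, followed by the exact text "USD" (price_pattern is just an illustrative name):

# A number-like token followed by the exact text 'USD', e.g. "100 USD"
price_pattern = [{'LIKE_NUM': True}, {'ORTH': 'USD'}]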

spaCy's lemmatization is extremely useful. A lemma is the base form of a word. For example, "was" is a form of "be," so "be" is the lemma for "was" as well as for "is." spaCy can lemmatize words in context, meaning it uses the surrounding words to determine what the actual base form of a word is.
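We can check this with the model we loaded above; both "was" and "is" come back with the lemma "be":

# Print each token together with its context-aware lemma
for token in nlp(u'She was here. He is here.'):
    print(token.text, token.lemma_)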

To create a matcher, we have to pass in the vocabulary the matcher works on. In this case, we can just pass the vocabulary of our English language model by running:

matcher = Matcher(nlp.vocab) 

To add the pattern to our matcher, we call:

matcher.add('HelloWorld', None, pattern)

 

The add function expects three arguments, which are: 

  • A name for the pattern, in this case HelloWorld, so that we can keep track of the patterns we added.

  • A function that can process matches once they are found. We pass None here, meaning no function will be applied, but we will use this tool later; a sketch of such a callback follows this list.

  • Finally, we need to pass the list of token attributes we want to search for. 
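As a sketch of what such a processing function looks like, spaCy calls it once per match with the matcher, the document, the index of the current match, and the full list of matches; print_match is just an illustrative name:

# A simple callback that prints the text of each match found
def print_match(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    print('Matched:', doc[start:end].text)

# We could register it in place of None:
# matcher.add('HelloWorld', print_match, pattern)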

To use our matcher, we can simply call matcher(doc), which gives us back all the matches the matcher found. We can do this by running:

doc = nlp(u'Hello, world! Hello world!')
matches = matcher(doc)

If we print out the matches, we can see the structure:

matches

[(15578876784678163569, 0, 3)]

 

The first element of a match is the hash of the pattern name; it just identifies internally which pattern was found, and we won't use it here. The next two numbers indicate the range in which the matcher found something; here, tokens 0 to 3 make up the range.

We can get the text back by indexing the original document:

doc[0:3]

Hello, world
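More generally, we can loop over all matches and turn the hash back into the pattern name using the model's string store:

# Print the pattern name and the matched text for every match
for match_id, start, end in matches:
    print(nlp.vocab.strings[match_id], doc[start:end].text)

For our document, this prints HelloWorld Hello, world.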


