Rule-Based Matching In Natural Language Processing
Written by Jannes Klaas   
Monday, 20 May 2019

Add custom functions to matchers

Let's move on to a more complex case. We know that iPhone is a product. However, the neural network-based matcher often classifies it as an organization. This happens because the word "iPhone" is often used in contexts similar to those of organizations, as in "The iPhone offers..." or "The iPhone sold..."

Let's build a rule-based matcher that always classifies the word "iPhone" as a product entity. 

First, we have to get the hash of the word PRODUCT. Words in spaCy can be uniquely identified by their hash. Entity types also get identified by their hash. To set an entity of the product type, we have to be able to provide the hash for the entity name. 

We can get the hash from the language model's vocabulary by running:

PRODUCT = nlp.vocab.strings['PRODUCT']
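As a quick check of how the string store behaves, here is a minimal sketch. It assumes only that spaCy is installed and uses a blank English pipeline (`spacy.blank("en")`) as a stand-in for the article's trained `nlp` object, so no model download is needed. Adding a string returns the integer ID spaCy uses for it, and looking that ID up returns the string again.

```python
import spacy

# Assumption: a blank English pipeline stands in for the article's `nlp` object.
nlp = spacy.blank("en")

# add() makes sure the string is known to the store and returns its integer ID.
PRODUCT = nlp.vocab.strings.add("PRODUCT")

# Looking up by string gives the same ID; looking up by ID gives the string back.
same_id = nlp.vocab.strings["PRODUCT"]
round_trip = nlp.vocab.strings[PRODUCT]

print(PRODUCT == same_id, round_trip)
```

The exact integer is an implementation detail, so code should pass the value around rather than hard-code it.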

Next, we need to define an on_match rule. This function will be called every time the matcher finds a match. on_match rules are passed four arguments:

  1. matcher: The matcher that made the match

  2. doc: The document the match was made in

  3. i: The index of a match. The first match in a document would have index zero; the second would have index one and so on

  4. matches: A list of all matches made 

There are two things happening in our on_match rule:

def add_product_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i] #1
    doc.ents += ((PRODUCT, start, end),) #2


Let's break down what they are:

  1. We index the list of matches to find our match at index i. A match is a tuple of a match ID, the start of the match, and the end of the match.

  2. We add a new entity to the document’s named entities. An entity is a tuple of the hash of the type of entity (the hash of the word PRODUCT here), the start of the entity, and the end of the entity. To append an entity, we have to nest it in another tuple. Tuples that contain only one value need to include a comma at the end. It is important not to overwrite doc.ents, as we otherwise would remove all the entities we already found. 
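The tuple handling in step 2 can be tried out in plain Python, without spaCy. In this sketch the `ents` tuple and its labels are stand-in values chosen for illustration, not output from a real document.

```python
# Stand-in for doc.ents: a tuple of (label, start, end) entries (illustrative values).
ents = (("ORG", 0, 1),)

new_ent = ("PRODUCT", 4, 5)

# (new_ent,) is a one-element tuple; the trailing comma is what makes it a tuple.
# += concatenates, so the entries that were already there are kept.
ents += (new_ent,)

# Without the trailing comma, the parentheses are only grouping.
not_nested = (new_ent)

print(ents)
print(not_nested is new_ent)
```

Writing `ents = (new_ent,)` instead of `ents += (new_ent,)` would discard the existing entry, which is the overwrite the text warns against.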

Now that we have an on_match rule we can define our matcher.  

Note that a matcher accepts multiple patterns, so we can add one pattern for just the word "iPhone" and another for the word "iPhone" followed by a version number, as in "iPhone 5":

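from spacy.matcher import Matcher

# LOWER is compared against the lowercased token, so its value must be lowercase
pattern1 = [{'LOWER': 'iphone'}] #1
pattern2 = [{'ORTH': 'iPhone'}, {'IS_DIGIT': True}] #2

matcher = Matcher(nlp.vocab) #3
# spaCy 2.x signature: add(key, on_match, *patterns)
matcher.add('iPhone', add_product_ent, pattern1, pattern2) #4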


So, what makes these commands work?

  1. We define the first pattern.

  2. We define the second pattern.

  3. We create a new empty matcher.

  4. We add the patterns to the matcher. Both will fall under the rule called "iPhone" and both will call our on_match rule called add_product_ent. 
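The steps above can be sketched end to end as follows. This is a hedged variant rather than the article's exact listing: it assumes a recent spaCy release, uses a blank English pipeline instead of the trained model, passes the patterns as a list with `on_match` as a keyword argument (the form newer releases expect), and builds `Span` objects instead of raw tuples. Because "iPhone 5" matches both patterns, the callback uses `spacy.util.filter_spans` to keep only the longest of any overlapping spans; the example sentence is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumption: a blank English pipeline replaces the article's trained model.
nlp = spacy.blank("en")

def add_product_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = Span(doc, start, end, label="PRODUCT")
    # Keep existing entities, add the new one, and drop the shorter span
    # wherever two spans overlap, since doc.ents must not overlap.
    doc.ents = filter_spans(list(doc.ents) + [span])

pattern1 = [{"LOWER": "iphone"}]                      # any casing of "iphone"
pattern2 = [{"ORTH": "iPhone"}, {"IS_DIGIT": True}]   # "iPhone" plus a number

matcher = Matcher(nlp.vocab)
matcher.add("iPhone", [pattern1, pattern2], on_match=add_product_ent)

doc = nlp("The iPhone 5 sold well, and the iphone is popular.")
matches = matcher(doc)
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

Resolving overlaps inside the callback is one possible design; another is to collect all matches first and filter them once after the matcher has run.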

We will now pass one of the news articles through the matcher:

doc = nlp(df.content.iloc[14]) #1
matches = matcher(doc) #2


This code is relatively simple, with only two steps:

  1. We run the text through the pipeline to create an annotated document.

  2. We run the document through the matcher. This modifies the document created in the previous step. We care less about the returned matches than about how the on_match method adds the matches as entities to our document.

Now that the matcher is set up, we need to add it to the pipeline so that spaCy uses it automatically.

Adding the matcher to the pipeline

Calling the matcher separately is somewhat cumbersome. To add it to the pipeline, we have to wrap it into a function, which we can achieve by running:

def matcher_component(doc):
    matches = matcher(doc)
    return doc

The spaCy pipeline calls the components of the pipeline as functions and always expects the annotated document to be returned. Returning anything else could break the pipeline. 

We can then add the matcher to the main pipeline, as can be seen in the following code:
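A minimal sketch of this registration step, assuming spaCy 3.x, where a function component is registered under a name with the `@Language.component` decorator and then added by that name. In the spaCy 2.x releases current when this article was written, the function itself was typically passed instead, as in `nlp.add_pipe(matcher_component, last=True)`. The component name `iphone_matcher`, the blank pipeline, and the example sentence are assumptions made for this sketch.

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumption: a blank English pipeline stands in for the article's pipeline.
nlp = spacy.blank("en")

def add_product_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = Span(doc, start, end, label="PRODUCT")
    doc.ents = filter_spans(list(doc.ents) + [span])

matcher = Matcher(nlp.vocab)
matcher.add("iPhone", [[{"LOWER": "iphone"}]], on_match=add_product_ent)

# "iphone_matcher" is a hypothetical name chosen for this sketch.
@Language.component("iphone_matcher")
def matcher_component(doc):
    matcher(doc)  # the on_match callback writes the entities
    return doc    # a component must hand the Doc back to the pipeline

nlp.add_pipe("iphone_matcher", last=True)

doc = nlp("She bought an iPhone yesterday.")
labels = [(ent.text, ent.label_) for ent in doc.ents]
print(nlp.pipe_names, labels)
```

With the component in place, calling `nlp(text)` is enough; the matcher no longer has to be invoked separately.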



The matcher is now the last component of the pipeline, so iPhones are tagged according to the matcher's rules. And boom! All mentions of the word "iPhone" (regardless of case) are now tagged as named entities of the type PRODUCT. You can validate this by displaying the entities with displacy; the screenshot captioned below shows the result.
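For readers who want to reproduce the check, here is a hedged sketch of rendering entities with displacy. It assumes a recent spaCy release and sets a PRODUCT entity by hand on a made-up sentence, so it does not depend on the matcher or on a trained model. Outside a notebook, `displacy.render` returns the HTML markup as a string; in a Jupyter notebook, `jupyter=True` displays it inline instead.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Assumption: a blank English pipeline and a hand-set entity, for illustration.
nlp = spacy.blank("en")
doc = nlp("The iPhone sold well.")
doc.ents = [Span(doc, 1, 2, label="PRODUCT")]  # token 1 is "iPhone"

# style="ent" selects the named-entity visualizer; jupyter=False returns HTML.
html = displacy.render(doc, style="ent", jupyter=False)
print(html)
```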



spaCy now finds the iPhone as a product



In this article, we learned about rule-based matching in natural language processing. We created a matcher and then added a custom function to it. We worked out a rule-based matcher that classifies the word "iPhone" as a product. In the last sub-section, we added the matcher to the pipeline so that spaCy can use it to find the iPhone as a product.

Explore NLP to automatically process language with Jannes Klaas’ debut book Machine Learning for Finance.

  • Jannes Klaas is a quantitative researcher with a background in economics and finance. Currently a graduate student at Oxford University, he previously led two machine learning bootcamps and worked with several financial companies on data driven applications and trading strategies. His active research interests include systematic risk as well as large-scale automated knowledge discovery.

This article is an excerpt from Machine Learning for Finance by Jannes Klaas and published by Packt Publishing, a book which introduces the study of machine learning and deep learning algorithms for financial practitioners.




Last Updated ( Monday, 20 May 2019 )