Programmer's Python Data - Simple Regular Expressions
Written by Mike James   
Monday, 09 December 2024

Pattern Matching

If this were all there was to regular expressions they wouldn’t be very interesting. The reason they are so useful is that you can specify patterns that spell out the regularities in a type of data. For example, following “ISBN:” we expect to find a digit – any digit. This can be expressed as “ISBN:\d” where \d is the character class indicator which means “a digit”. If you try this out you will discover that you don’t get a match with the example string because there is a space following the colon. However, “ISBN:\s\d” does match, as \s means “any white-space character” and:

import re
ex1 = re.compile(r"ISBN:\s\d")
print(ex1.search(r"ISBN: 978-1871962406")[0])

displays “ISBN: 9”.

You can look up the available character set indicators in the documentation. The most useful are:

  • . (i.e. a single dot) matches any character

  • \s white-space

  • \d digit

  • \w any “word” character including digits

All of the character sets include Unicode characters that fit the description. That is, \d matches any Unicode digit and not just the usual 0 to 9.
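
A quick way to see this for yourself is sketched below. The sample characters are chosen purely for illustration, and re.ASCII is the standard flag that restricts the character classes to their ASCII meaning:

```python
import re

# Illustrative samples: ASCII "7" and ARABIC-INDIC DIGIT THREE (U+0663).
samples = ["7", "\u0663"]

unicode_digit = re.compile(r"\d")           # default for str patterns: Unicode-aware
ascii_digit = re.compile(r"\d", re.ASCII)   # \d restricted to 0 to 9

unicode_hits = [bool(unicode_digit.fullmatch(s)) for s in samples]
ascii_hits = [bool(ascii_digit.fullmatch(s)) for s in samples]

print(unicode_hits)  # [True, True]
print(ascii_hits)    # [True, False]
```

If your data should only ever contain 0 to 9, the flag or an explicit [0-9] is probably the safer choice.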

There is also the convention that capital letters match the inverse set of characters:

  • \S any non-white space

  • \D any non-digit

  • \W any non-word character

The inverse sets can behave unexpectedly unless you are very clear about what they mean. For example, \D also matches white space and hence r"ISBN:\D\d" matches ISBN: 9.
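
As a minimal check of this point, using the same example string as before:

```python
import re

text = "ISBN: 978-1871962406"

# \D means "anything that is not a digit", so the space after the colon qualifies.
m = re.search(r"ISBN:\D\d", text)
print(m[0])  # ISBN: 9

# \S excludes white space, so the same position no longer matches.
no_match = re.search(r"ISBN:\S\d", text)
print(no_match)  # None
```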

 

You can also make up your own character set by listing the characters between square brackets. So, for example, [0-9] matches the ASCII digits 0 to 9, which is what \d matches if you ignore the other Unicode digits. Negating a character set is also possible: [^0-9] matches anything but the digits 0 to 9 and is the ASCII equivalent of \D. Most special characters lose their usual meaning between square brackets, and a ^ placed first stands for negation.
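
The following sketch tries out a few bracketed sets. The particular sets are my own choices for illustration rather than anything the ISBN format requires:

```python
import re

# [0-9-] : a digit or a literal hyphen (a hyphen placed last is not a range).
isbn_chars = re.compile(r"[0-9-]+")
run = isbn_chars.search("ISBN: 978-1871962406")[0]
print(run)  # 978-1871962406

# [^0-9] : any single character that is not one of the digits 0 to 9.
non_digit = re.compile(r"[^0-9]")
first = non_digit.search("978-1871962406")[0]
print(first)  # -

# Inside brackets the dot is a literal dot, not "any character".
dot_only = re.compile(r"[.]")
dot_match = dot_only.search("abc")
print(dot_match)  # None
```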

Anchors

As well as characters and character sets, you can also use location matches or anchors. For example, the ^ (caret) only matches the start of the string, so r"^ISBN:" will match only if the string starts with ISBN: and doesn’t match if the same substring occurs anywhere else. The most useful anchors are:

  • ^ start of string

  • $ end of string

  • \b word boundary, i.e. between a \w and a \W, or between a \w and the start or end of the string

  • \B anywhere but a word boundary
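
To see the anchors listed above in action, here is a small sketch. The sample sentence is invented for the purpose:

```python
import re

text = "The ISBN is listed; see ISBNs elsewhere"

# Without anchors the pattern also matches inside the longer word "ISBNs".
loose = re.findall(r"ISBN", text)
# With \b on both sides only the standalone word "ISBN" qualifies.
whole = re.findall(r"\bISBN\b", text)
print(len(loose), len(whole))  # 2 1

# ^ ties the match to the start of the string.
at_start = bool(re.search(r"^ISBN:", "ISBN: 978-1871962406"))
not_at_start = bool(re.search(r"^ISBN:", "Ref ISBN: 978-1871962406"))
print(at_start, not_at_start)  # True False
```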

One subtle point only emerges when you consider strings with line breaks. In this case, by default, the ^ and the $ match only the very start and end of the string. If you want them to match line beginnings and endings you have to specify the MULTILINE flag in the call to the compile function. For example:

ex1 = re.compile(r"^ISBN:\s\d", flags=re.MULTILINE)
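
The effect of the flag only shows up when the pattern contains an anchor and the string contains a line break. A minimal sketch, assuming a made-up two-line string with the ISBN on the second line:

```python
import re

# Hypothetical two-line input; the ISBN is not at the start of the string.
text = "Title: Example\nISBN: 978-1871962406"

default = re.compile(r"^ISBN:\s\d")
multi = re.compile(r"^ISBN:\s\d", flags=re.MULTILINE)

default_match = default.search(text)
multi_match = multi.search(text)

print(default_match)    # None, as ^ only matches at the very start
print(multi_match[0])   # ISBN: 9, as ^ now also matches after the line break
```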

Greedy and Lazy Quantifiers

We now have the problem that it isn’t unreasonable for an ISBN to be written as ISBN:9 or ISBN: 9 with perhaps even more than one space after the colon. We clearly need a way to specify the number of repeats that are allowed in a matching string. To do this we make use of “quantifiers” following the specification to be repeated. The most commonly used quantifiers are:

  • * zero or more

  • + one or more

  • ? zero or one

  • {n} exactly n times

  • {n,} n or more times

  • {n,m} at least n and at most m times

In many ways this is the point at which regular expression use starts to become interesting and inevitably more complicated. You could even say that the use of * and + is what makes a regular expression into a regular grammar in the wider technical sense. Simple examples are not hard to find. For example:

r"ISBN:\s*\d" 

matches ISBN: followed by any number of white-space characters including none at all followed by a digit. Similarly,

r"ISBN:?\s*\d" 

matches ISBN followed by an optional colon, any number of white-space characters including none, followed by a digit.
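
You can check which variations this pattern accepts with a short test. The sample strings are invented for illustration:

```python
import re

ex = re.compile(r"ISBN:?\s*\d")

# Variations in the colon and the white space, plus one that should fail.
samples = ["ISBN:9", "ISBN: 9", "ISBN:   9", "ISBN 9", "ISBN9", "ISBN:: 9"]
results = {s: bool(ex.search(s)) for s in samples}

for s, ok in results.items():
    print(repr(s), ok)
```

Only the last sample fails, because :? allows at most one colon before the optional white space and the digit.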

So, for example, r"^\d+$" specifies a string consisting of nothing but digits. Compare this to r"^\d*$" which would also accept an empty string, i.e. no digits. The difference between “one or more” and “zero or more” is important.
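
The difference is easy to demonstrate. The sketch below also tries the {n} form, using 13 only because that is the number of digits in the example ISBN once the hyphen is removed:

```python
import re

one_or_more = re.compile(r"^\d+$")
zero_or_more = re.compile(r"^\d*$")
exactly_13 = re.compile(r"^\d{13}$")   # {n}: exactly n repeats

empty_plus = bool(one_or_more.search(""))
empty_star = bool(zero_or_more.search(""))
print(empty_plus, empty_star)  # False True

print(bool(one_or_more.search("2024")))    # True
print(bool(one_or_more.search("20x24")))   # False

thirteen = bool(exactly_13.search("9781871962406"))
twelve = bool(exactly_13.search("978187196240"))
print(thirteen, twelve)  # True False
```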

Quantifiers are easy, but there is a subtlety that often goes unnoticed. By default, quantifiers are “greedy”, that is they match as many entities as they can, even when a shorter match would satisfy the regular expression. To illustrate this with the simplest example, suppose you need a regular expression to parse some HTML tags:

<div>hello</div>

If you want to match just a pair of opening and closing tags you might well try the following regular expression:

ex2 = re.compile(r"<div>.*</div>")

which seems to say “the string starts with <div> then any number including zero of other characters followed by </div>”. If you try this out on the example given above you will find that it matches:

print(ex2.search(r"<div>hello</div>")[0])

However, if you now try it out on the string:

<div>hello</div><div>world</div> 

as in:

print(ex2.search(r"<div>hello</div><div>world</div>")[0])

you will discover that the match is to the entire string. That is, the final </div> in the regular expression is matched to the final </div> in the string, even though there is an earlier occurrence of the same substring. This is because the quantifiers are greedy and attempt to find the longest possible match. In this case the * matches everything including the first </div>. So why doesn’t the * also match the final </div>? The reason is that if it did the entire regular expression would fail to match anything because there would be no closing </div>. What happens is that the quantifiers continue to match until the regular expression fails, then the regular expression engine backtracks in an effort to find a match. Notice that all of the standard quantifiers are greedy and will match more than you might expect based on what follows in the regular expression.
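
The behavior described above can be confirmed with the sketch below. For contrast it also uses the lazy form *?, which is standard syntax in Python's re module and matches as few characters as possible:

```python
import re

text = "<div>hello</div><div>world</div>"

greedy = re.compile(r"<div>.*</div>")
lazy = re.compile(r"<div>.*?</div>")   # *? is the lazy form of *

greedy_match = greedy.search(text)[0]
lazy_match = lazy.search(text)[0]
lazy_all = lazy.findall(text)

print(greedy_match)  # <div>hello</div><div>world</div>
print(lazy_match)    # <div>hello</div>
print(lazy_all)      # ['<div>hello</div>', '<div>world</div>']
```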



Last Updated ( Monday, 09 December 2024 )