Programmer's Python Data - Simple Regular Expressions

Written by Mike James

Monday, 09 December 2024


Regular expressions solve problems, but they also cause problems. We use them not just because they are useful, but because they are also fun. Find out what they are and what they do in this extract from Programmer's Python: Everything is Data.



If you think regular expressions are trivial and boring, you've not seen the whole picture. Here we reveal that in Python they are amazingly powerful and not to be missed. Regular expressions are addictive. Playing with these compressed and cryptic patterns is better than solving a Sudoku. If you are wondering what this is all about because, obviously, regular expressions are just the use of “*” and “?”, then read on. The truth is a lot more subtle and the result is a lot more powerful than you might suspect. If you already know the basics of regular expressions then jump on to find some deeper explanations of less common features.

As Python strings are Unicode, the regular expression patterns have been extended to include Unicode characters that fit the description. For example, if you are searching for a digit then by default you are not just searching for the usual 0-9, but for any character that Unicode classifies as a decimal digit. If you want to restrict your attention to ASCII characters then use a bytes object or a bytearray, which also support regular expressions and match character classes only in ASCII, see Chapter 12. There is also a regular expression flag that restricts the character set to ASCII, see later.
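As a small illustration of the difference, the following sketch searches for a digit in three ways. The \d pattern for "a digit" and the re.ASCII flag are both covered later; the sample character, ARABIC-INDIC DIGIT THREE, is just one convenient non-ASCII digit chosen for the demonstration:

```python
import re

arabic_three = "\u0663"   # ARABIC-INDIC DIGIT THREE, a Unicode decimal digit

# A str pattern: \d matches any character Unicode classifies as a decimal digit
unicode_match = re.search(r"\d", arabic_three)

# The same pattern with the ASCII flag: \d is restricted to 0-9
ascii_match = re.search(r"\d", arabic_three, re.ASCII)

# A bytes pattern searching a bytes object: \d is always 0-9
bytes_match = re.search(rb"\d", b"ISBN:978")

print(unicode_match is not None)  # True
print(ascii_match is not None)    # False
print(bytes_match[0])             # b'9'
```

Notice that a bytes pattern can only be used to search a bytes-like object and a str pattern can only be used to search a str.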

Python’s regular expression facilities are all provided by the re module, which has to be imported before you can make any of the following examples work.

Regular Fundamentals

The idea of a regular expression starts with the idea of specifying a grammar for a particular set of strings. All you have to do is find a pattern that matches all of the strings you are interested in and use the pattern. The simplest sort of pattern is the string literal that matches itself.

So, for example, if you want to process ISBN numbers you might well want to match the string “ISBN:” which is its own regular expression in the sense that the pattern “ISBN:” will match exactly one string that is exactly “ISBN:”.

To actually use this you have to first create a regular expression object with the regular expression compiled into it:

import re
ex1 = re.compile(r"ISBN:")

The use of the r at the start of the string to make a raw string is optional, but it does make things easier as it often avoids having to double up \, the escape character. Recall that strings starting with r are represented “as is”, without Python processing any escape sequences.
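A quick way to see what the raw prefix does is to compare a raw string with its escaped equivalent. This is only a sketch and the variable names are made up for the demonstration:

```python
raw = r"\d+"      # raw string: the backslash is kept as written
plain = "\\d+"    # ordinary string: the backslash has to be doubled

print(raw == plain)           # True
print(len(r"\n"), len("\n"))  # 2 1
```

The second line shows that r"\n" is two characters, a backslash followed by n, whereas "\n" is the single newline character.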

You don’t have to compile the regular expression before you use it, but it is more efficient to do so if the expression is used more than once. The re module also keeps a cache of recently compiled expressions, which saves both time and memory. Compiling an expression also gives you the opportunity to specify flags that determine how the expression will be used. The returned regular expression object also has methods that are slightly more advanced than the equivalent functions used for non-compiled expressions.
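Flags are described later, but as a minimal sketch of what specifying one at compile time looks like, here is the ISBN pattern compiled with the re.IGNORECASE flag. The lower-case pattern and the variable names are chosen purely for illustration:

```python
import re

# The flag is supplied as the second argument to compile
ex_any_case = re.compile(r"isbn:", re.IGNORECASE)

m = ex_any_case.search("ISBN:978-1871962406")
print(m is not None)  # True
```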

To evaluate the regular expression we need one of the methods of the regular expression object or its equivalent re function. The most obvious to start with is the search method, which applies the expression to a specified string and returns a match object if the expression matched anywhere in the string and None if it didn’t. Notice that only the first match is returned.

A match object always tests as True, so it is easy to test if a match occurred:

import re
ex1 = re.compile(r"ISBN:")
if(ex1.search(r"ISBN:978-1871962406")):
    print("matched")
else:
    print("no match")

Notice that if there is no match then None is returned and this tests as False.

You could have written the call to search as:

if(re.search(r"ISBN:",r"ISBN:978-1871962406")):

that is, without first compiling the regular expression. This works, but isn’t as efficient and it lacks the extra parameters of the regular expression object. For example, the search method has optional pos and endpos parameters that can be used to specify the portion of the string to be searched while the re.search function doesn’t.
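As a sketch of how restricting the search works, the following uses a made-up string containing the pattern twice and passes the start and end positions to the compiled object's search method:

```python
import re

ex1 = re.compile(r"ISBN:")
text = "ISBN:978 ISBN:979"   # illustrative text, not real ISBNs

first = ex1.search(text)          # search the whole string
later = ex1.search(text, 1)       # start searching at index 1
clipped = ex1.search(text, 0, 4)  # search only the first four characters

print(first.span())   # (0, 5)
print(later.span())   # (9, 14)
print(clipped)        # None
```

Starting at index 1 skips the first occurrence, and stopping at index 4 leaves only ISBN without its colon, so there is nothing to match.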

In short you always have two ways to use a regular expression:

Compile it and call methods on the regular expression object that is returned.
Don’t compile it and use the more limited functions that the re module supplies.

In the rest of this chapter we will use the regular expression compiler approach where possible.

The match object that is returned has some useful attributes and methods that provide information about the match.

The span method gives the position of the match in the search string as a tuple (start,end):

import re
ex1 = re.compile(r"ISBN:")
print(ex1.search(r"ISBN:978-1871962406").span())

which in this case returns (0, 5) to indicate that the match runs from character 0 up to, but not including, character 5. You can also use the start and end methods to get the location of the match. If you want the characters that matched as a string then the simplest solution is to use indexing to retrieve the first element of the match. What the other elements are is explained when we look at groups:

import re
ex1 = re.compile(r"ISBN:")
m = ex1.search(r"ISBN:978-1871962406")
print(m[0])

which just prints ISBN:, i.e. the pattern including its colon. Discovering the string that matches becomes much more interesting when the regular expression defines multiple strings that could match.
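The position methods and indexing can be checked against each other. In this sketch the positions reported by start and end are used to slice the search string, which gives the same result as indexing the match object:

```python
import re

ex1 = re.compile(r"ISBN:")
text = "ISBN:978-1871962406"
m = ex1.search(text)

print(m.start(), m.end())                # 0 5
print(text[m.start():m.end()])           # ISBN:
print(text[m.start():m.end()] == m[0])   # True
```

The slice works without adjustment because the end position, like the end of a slice, is one past the last matched character.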

There are two variations on the search method which do the same task and return a match object:

match – only matches at the beginning of the string

fullmatch – only matches if the entire string matches

There is also:

finditer

which returns an iterator that yields a match object for each of the non-overlapping matches.
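The differences between these methods can be seen in a short sketch. The test strings here are made up by repeating the example ISBN used earlier:

```python
import re

ex1 = re.compile(r"ISBN:")
text = "ISBN:978-1871962406 ISBN:978-1871962406"

# match succeeds only if the pattern is found at the very start
print(ex1.match(text) is not None)         # True
print(ex1.match("My ISBN:978") is None)    # True

# fullmatch succeeds only if the pattern accounts for the whole string
print(ex1.fullmatch("ISBN:") is not None)  # True
print(ex1.fullmatch(text) is None)         # True

# finditer yields one match object per non-overlapping match
spans = [m.span() for m in ex1.finditer(text)]
print(spans)                               # [(0, 5), (20, 25)]
```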

There are other methods that perform string manipulation, more on these later.
