Programmer's Python Data - Simple Regular Expressions
Written by Mike James   
Monday, 09 December 2024

If you don’t want greedy quantifiers, the solution is to use “lazy” quantifiers, which are formed by following any of the standard quantifiers with a question mark, ?. To see this in action, change the previous regular expression to read:

ex2 = re.compile(r"<div>.*?</div>")

With this change in place, the result of matching to:

r"<div>hello</div>world</div>"

is just the first pair of <div> tags, that is <div>hello</div>.
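The difference can be checked with a short sketch. It assumes that the "previous regular expression" was the greedy form <div>.*</div>, which is not shown in this extract, and the variable names are only for illustration:

```python
import re

text = r"<div>hello</div>world</div>"

# Greedy: .* takes as much as it can, so the match runs to the last </div>.
greedy = re.compile(r"<div>.*</div>")
# Lazy: .*? takes as little as it can, so the match stops at the first </div>.
lazy = re.compile(r"<div>.*?</div>")

greedy_result = greedy.search(text).group()
lazy_result = lazy.search(text).group()

print(greedy_result)  # <div>hello</div>world</div>
print(lazy_result)    # <div>hello</div>
```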

All of the quantifiers, including ?, have a lazy version and you can write ?? to mean a lazy “zero or one” occurrence.
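As a small illustration of ??, the following sketch uses a made-up test string, "colour", which is not part of the original text:

```python
import re

text = "colour"

# Greedy ?: the optional "u" is taken when it is available.
greedy_opt = re.search(r"colou?", text).group()
# Lazy ??: the optional "u" is skipped because the pattern matches without it.
lazy_opt = re.search(r"colou??", text).group()
# If something must follow, the lazy version is forced to take the "u".
lazy_forced = re.search(r"colou??r", text).group()

print(greedy_opt)   # colou
print(lazy_opt)     # colo
print(lazy_forced)  # colour
```

The last line shows that lazy does not mean "never match", only "match as little as the rest of the pattern allows".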

The distinction between greedy and lazy quantifiers is perhaps the biggest reason for a reasonably well-tested regular expression to go wrong when used against a wider range of example strings. Always remember that a standard greedy quantifier will match as many times as possible while still allowing the regular expression to match, and its lazy version will match as few times as possible while still allowing the regular expression to match.
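A sketch of how this can go wrong in practice: a greedy pattern that looks correct on a simple test string gives a different result on a string with two tags. The <b> tag strings here are invented for the example:

```python
import re

greedy_tag = re.compile(r"<b>.*</b>")
lazy_tag = re.compile(r"<b>.*?</b>")

one_tag = "<b>bold</b>"
two_tags = "<b>one</b> and <b>two</b>"

# With a single tag, greedy and lazy give the same answer.
single_greedy = greedy_tag.findall(one_tag)
single_lazy = lazy_tag.findall(one_tag)

# With two tags the greedy version swallows everything between the
# first <b> and the last </b>, while the lazy version finds each tag.
double_greedy = greedy_tag.findall(two_tags)
double_lazy = lazy_tag.findall(two_tags)

print(single_greedy)  # ['<b>bold</b>']
print(double_greedy)  # ['<b>one</b> and <b>two</b>']
print(double_lazy)    # ['<b>one</b>', '<b>two</b>']
```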

Grouping and Alternation

Strings that follow a regular format often have alternative forms. For example, the ISBN prefix could be simply ISBN: or it could be ISBN-13: or any of many other reasonable variations. You can specify an either/or situation using the vertical bar |, the alternation operator, as in x|y which will match an x or a y.

For example, r"ISBN:|ISBN-13:" matches either ISBN: or ISBN-13:.
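A minimal sketch of this alternation in use. The sample strings, including the dummy digits, are invented for the example:

```python
import re

prefix = re.compile(r"ISBN:|ISBN-13:")

# Hypothetical sample strings; the numbers are placeholders, not real ISBNs.
samples = ["ISBN: 0000000000", "ISBN-13: 0000000000000", "ISSN: 0000"]

results = []
for s in samples:
    m = prefix.match(s)  # match() only looks at the start of the string
    results.append(m.group() if m else None)

print(results)  # ['ISBN:', 'ISBN-13:', None]
```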

This is easy enough but what about:

r"ISBN:|ISBN-13:\s*\d"

At first glance this seems to match either ISBN: or ISBN-13: followed by any number of white space characters and a single digit, but it doesn’t. The | operator has the lowest priority and the alternative matches are everything to the left and everything to the right, i.e. either ISBN: or ISBN-13:\s*\d. To match the white space and digit in both forms of the ISBN prefix we would have to write:

r"ISBN:\s*\d|ISBN-13:\s*\d"
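The effect of the low priority of | can be seen by comparing the two expressions on the same inputs. This is only a sketch and the test strings are made up:

```python
import re

# The alternatives are "ISBN:" and "ISBN-13:\s*\d".
ungrouped = re.compile(r"ISBN:|ISBN-13:\s*\d")
# The common tail is repeated on both sides of the |.
repeated = re.compile(r"ISBN:\s*\d|ISBN-13:\s*\d")

ungrouped_short = ungrouped.match("ISBN: 9").group()
ungrouped_long = ungrouped.match("ISBN-13: 9").group()
repeated_short = repeated.match("ISBN: 9").group()
repeated_long = repeated.match("ISBN-13: 9").group()

print(repr(ungrouped_short))  # 'ISBN:'  (no white space or digit)
print(repr(ungrouped_long))   # 'ISBN-13: 9'
print(repr(repeated_short))   # 'ISBN: 9'
print(repr(repeated_long))    # 'ISBN-13: 9'
```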

Clearly having to repeat everything that is in common on either side of the alternation operator is going to make things difficult and this is where grouping comes in. Anything grouped between parentheses is treated as a single unit, a subexpression, and grouping has a higher priority than the alternation operator. So, for example:

r"(ISBN:|ISBN-13:)\s*\d"

matches either form of the ISBN prefix followed by any number of white space characters and a single digit because the parentheses limit the range of the alternation operator to the substrings to the left and right within the parentheses.
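A short sketch of the grouped expression, again using invented test strings:

```python
import re

grouped = re.compile(r"(ISBN:|ISBN-13:)\s*\d")

# Both prefixes now require the optional white space and the digit.
grouped_short = grouped.match("ISBN:   1").group()
grouped_long = grouped.match("ISBN-13:9").group()
# No digit after the prefix, so there is no match at all.
grouped_none = grouped.match("ISBN: x")

print(repr(grouped_short))  # 'ISBN:   1'
print(repr(grouped_long))   # 'ISBN-13:9'
print(grouped_none)         # None
```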

A similar question of which match is chosen also applies to the alternation operator. The alternatives are tried from left to right and the first one that allows the overall expression to match is used. For example, suppose you try to match the previous ungrouped expression, but without the colon, r"ISBN|ISBN-13". In this case the first pattern, ISBN, will match even if the string is ISBN-13. It doesn’t matter that the second expression is a “better” match. No amount of grouping will help with this problem because the shorter match will be tried and succeed first. The solution is to swap the order of the subexpressions so that the longer comes first, or to include something that always marks the end of the target string. In this case, for example, if we add the colon then the ISBN: subexpression cannot possibly match the ISBN-13: string.
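The ordering problem and both suggested fixes can be sketched as follows. The variable names are only for illustration:

```python
import re

short_first = re.compile(r"ISBN|ISBN-13")
long_first = re.compile(r"ISBN-13|ISBN")
with_colon = re.compile(r"ISBN:|ISBN-13:")

# The first alternative succeeds, so the longer one is never tried.
short_first_result = short_first.match("ISBN-13").group()
# Putting the longer alternative first fixes this...
long_first_result = long_first.match("ISBN-13").group()
# ...and the shorter alternative is still available when needed.
long_first_fallback = long_first.match("ISBN").group()
# With the colon, "ISBN:" cannot match the start of "ISBN-13:".
with_colon_result = with_colon.match("ISBN-13:").group()

print(short_first_result)   # ISBN
print(long_first_result)    # ISBN-13
print(long_first_fallback)  # ISBN
print(with_colon_result)    # ISBN-13:
```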

Groups can also be repeated. For example (ab)* matches any number of repeats of ab.
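A small sketch of a repeated group, compared with the ungrouped ab*, where the quantifier applies only to the b. It uses re.fullmatch, which only succeeds if the whole string matches:

```python
import re

repeat = re.compile(r"(ab)*")

whole = repeat.fullmatch("ababab") is not None   # three repeats of ab
empty = repeat.fullmatch("") is not None         # zero repeats is allowed
partial = repeat.fullmatch("aba") is not None    # trailing "a" is not a full ab
prefix_match = repeat.match("ababa").group()     # matches as many ab as it can

# Without the parentheses, * applies only to the b.
ungrouped_bs = re.fullmatch(r"ab*", "abbb") is not None
ungrouped_abab = re.fullmatch(r"ab*", "abab") is not None

print(whole, empty, partial)         # True True False
print(prefix_match)                  # abab
print(ungrouped_bs, ungrouped_abab)  # True False
```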

In chapter but not in this extract

  • Capture Groups
  • Backward References
  • Advanced Capture
  • String Manipulation
  • Using Regular Expressions

Summary

  • Python’s regular expressions are best compiled for efficiency and this returns a regular expression object which has methods that use the expression.

  • If you don’t want to compile the expression you can use the alternative regular expression functions, but these aren’t as capable as the equivalent methods.

  • The methods and functions return a match object, or None if there is no match, which has methods that allow you to find out about the nature of the match.

  • Regular expressions only become useful when you start to use pattern matching.

  • You can also use anchors to specify where a match is allowed to happen.

  • To make patterns easier to write you can use a quantifier symbol to specify the allowable number of repeats.

  • By default quantifiers are greedy and will always attempt to find the longest match.

  • You can make a quantifier lazy by adding a ?.

  • The alternation operator, |, can specify a match to one of two possible patterns.

  • Grouping can be used to override the precedence of the regular expression operators.

  • Grouping also leads to the idea of a capture group in which a group matches part of the string. If you don’t want a group to be a capture group you can start it with (?: instead of (.

  • You can refer to a capture group by number or you can assign and use a name.

  • Capture groups are useful for determining which parts of an expression matched and for backward references.

  • A backward reference lets you match against the results of previous matches.

  • Assertions are advanced expressions which modify what is captured.

  • You can also use regular expressions to modify strings and to split strings.
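Several of these points can be seen together in one short sketch. The pattern, the group name first_digit and the test strings are hypothetical, chosen only to illustrate the calls:

```python
import re

# (?:...) groups without capturing; (?P<name>...) is a named capture group.
isbn = re.compile(r"(?:ISBN:|ISBN-13:)\s*(?P<first_digit>\d)")

m = isbn.search("See ISBN-13: 9 for details")
whole_match = m.group()            # the full text that matched
by_name = m.group("first_digit")   # capture group referred to by name
by_number = m.group(1)             # the same group referred to by number
start = m.start()                  # where the match begins in the string

# A failed search returns None rather than a match object.
no_match = isbn.search("no prefix here")

# The module-level function works without compiling first.
quick = re.search(r"ISBN:", "ISBN: 1").group()

print(whole_match, by_name, by_number, start)  # ISBN-13: 9 9 9 4
print(no_match)                                # None
print(quick)                                   # ISBN:
```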

Programmer's Python
Everything is Data

Is now available as a print book: Amazon



Last Updated ( Monday, 09 December 2024 )