Programmer's Python Data - Simple Regular Expressions
Written by Mike James
Monday, 09 December 2024
Page 3 of 3
If you don’t want greedy quantifiers, the solution is to use “lazy” quantifiers, which are formed by following any of the standard quantifiers with a question mark, ?. To see this in action, change the previous regular expression to read:

ex2 = re.compile(r"<div>.*?</div>")

With this change in place, the result of matching against:

r"<div>hello</div>world</div>"

is just the first pair of <div> tags, that is <div>hello</div>.

All of the quantifiers, including ?, have a lazy version, and you can write ?? to mean a lazy “zero or one” occurrence.

The distinction between greedy and lazy quantifiers is perhaps the most common reason for a reasonably well-tested regular expression to go wrong when used against a wider range of example strings. Always remember that a standard greedy quantifier will match as many times as possible while still allowing the regular expression to match, and its lazy version will match as few times as possible while still allowing the regular expression to match.

Grouping and Alternation

The strings you want to match often have alternative forms. For example, the ISBN prefix could be simply ISBN: or it could be ISBN-13: or any of many other reasonable variations. You can specify an either/or situation using the vertical bar |, the alternation operator, as in x|y, which will match an x or a y. For example:

r"ISBN:|ISBN-13:"

matches either ISBN: or ISBN-13:. This is easy enough, but what about:

r"ISBN:|ISBN-13:\s*\d"

At first glance this seems to match either ISBN: or ISBN-13: followed by any number of white space characters and a single digit, but it doesn’t. The | operator has the lowest priority, so the alternatives are everything to its left and everything to its right, i.e. either ISBN: or ISBN-13:\s*\d.
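The two behaviors described so far, greedy versus lazy matching and the low priority of |, can be checked with a short sketch. The greedy pattern r"<div>.*</div>" is assumed here to be the "previous regular expression" the text refers to, and the sample ISBN strings are made up for illustration.

```python
import re

# Assumed earlier (greedy) pattern: .* takes as much as it can
# while still leaving a final </div> to match.
greedy = re.compile(r"<div>.*</div>")
# Lazy version from the text: .*? takes as little as it can.
lazy = re.compile(r"<div>.*?</div>")

text = "<div>hello</div>world</div>"
greedy_match = greedy.search(text).group()
lazy_match = lazy.search(text).group()
print(greedy_match)  # <div>hello</div>world</div>
print(lazy_match)    # <div>hello</div>

# | has the lowest priority, so the alternatives are
# "ISBN:" and "ISBN-13:\s*\d", not two prefixes sharing one tail.
ungrouped = re.compile(r"ISBN:|ISBN-13:\s*\d")
short_form = ungrouped.search("ISBN: 9").group()
long_form = ungrouped.search("ISBN-13: 9").group()
print(repr(short_form))  # 'ISBN:'  (white space and digit not included)
print(repr(long_form))   # 'ISBN-13: 9'
```

Note that the short form stops at the colon: the \s*\d part belongs only to the right-hand alternative.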
To match the white space and digit in both forms of the ISBN prefix we would have to write:

r"ISBN:\s*\d|ISBN-13:\s*\d"

Clearly, having to repeat everything that is in common on either side of the alternation operator is going to make things difficult, and this is where grouping comes in. Anything grouped between parentheses is treated as a single unit, a subexpression, and grouping has a higher priority than the alternation operator. So, for example:

r"(ISBN:|ISBN-13:)\s*\d"

matches either form of the ISBN prefix followed by any number of white space characters and a single digit, because the parentheses limit the range of the alternation operator to the subexpressions to its left and right within the parentheses.

A problem similar to the greedy/lazy one also applies to the alternation operator, because the alternatives are tried from left to right and the first one that succeeds is used. For example, suppose you try to match the previous ungrouped expression, but without the colon:

r"ISBN|ISBN-13"

In this case the first pattern, ISBN, will match even if the string is ISBN-13. It doesn’t matter that the second expression is a “better” match. No amount of grouping will help with this problem because the shorter match will be tried and succeed first. The solution is to swap the order of the subexpressions so that the longer comes first, or to include something that always marks the end of the target string. In this case, for example, if we add the colon then the ISBN: subexpression cannot possibly match the ISBN-13: string.

Groups can also be repeated. For example, (ab)* matches any number of repeats of ab.
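As a rough illustration of grouping, alternation order and repeated groups, the following sketch runs the patterns from the text against a few made-up sample strings. The variable names are hypothetical and chosen only for this example.

```python
import re

# Parentheses limit the reach of |, so \s*\d applies to both prefixes.
grouped = re.compile(r"(ISBN:|ISBN-13:)\s*\d")
grouped_short = grouped.search("ISBN: 9").group()
grouped_long = grouped.search("ISBN-13: 9").group()
print(repr(grouped_short))  # 'ISBN: 9'
print(repr(grouped_long))   # 'ISBN-13: 9'

# Alternatives are tried left to right; the first that succeeds wins,
# even when a later alternative would match more of the string.
short_first = re.compile(r"ISBN|ISBN-13")
long_first = re.compile(r"ISBN-13|ISBN")
short_first_match = short_first.search("ISBN-13").group()
long_first_match = long_first.search("ISBN-13").group()
print(short_first_match)  # ISBN
print(long_first_match)   # ISBN-13

# A quantifier after a group repeats the whole group.
repeated = re.compile(r"(ab)*")
whole = repeated.fullmatch("ababab") is not None
prefix = repeated.match("ababx").group()
print(whole)   # True
print(prefix)  # abab
```

Putting the longer alternative first is the simplest fix when the alternatives share a common start; adding a terminating character such as the colon achieves the same result without depending on order.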
Last Updated ( Monday, 09 December 2024 )