Master JavaScript Regular Expressions

Written by Ian Elliot

Thursday, 13 July 2017


Quantify

Of course we now have the problem that it isn’t unreasonable for an ISBN to be written as ISBN: 9 or ISBN:9 with perhaps even more than one space after the colon.

We clearly need a way to specify the number of repeats that are allowed in a matching string.

To do this we make use of “quantifiers” following the specification to be repeated.

The most commonly used quantifiers are:

* zero or more
+ one or more
? zero or one
{n} exactly n times
{n,} n or more times
{n,m} at least n at most m times

In many ways this is the point at which regular expression use starts to become interesting and, inevitably, more complicated. However, the basic idea is fairly simple: how many repeats are allowed for a match.

Perhaps the key concept is that the * means something is optional, but + means it must occur. In both cases, whatever it is can occur multiple times. Contrast this to ? which means optional but only once.
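As a quick check of the quantifier table, the following sketch tests each quantifier against a few short strings. The array name quantifierChecks is just an illustrative choice, and the ^ and $ anchors (described later in this article) are used only so that the quantifier has to account for the whole string.

```javascript
// Each entry is [pattern, test string, expected result of pattern.test()].
const quantifierChecks = [
  [/^a*$/, "", true],          // * accepts zero occurrences
  [/^a*$/, "aaa", true],       // ... and many
  [/^a+$/, "", false],         // + needs at least one
  [/^a+$/, "aaa", true],
  [/^a?$/, "a", true],         // ? accepts zero or one
  [/^a?$/, "aa", false],       // ... but not two
  [/^a{3}$/, "aaa", true],     // {n} exactly n times
  [/^a{2,}$/, "a", false],     // {n,} n or more times
  [/^a{2,3}$/, "aaaa", false]  // {n,m} at least n, at most m times
];

for (const [pattern, text, expected] of quantifierChecks) {
  // Logs the pattern, the string and whether the result was as expected.
  console.log(pattern.source, JSON.stringify(text), pattern.test(text) === expected);
}
```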

For example:

/ISBN:\s*\d/

matches “ISBN:” followed by any number of white-space characters including none at all followed by a digit. Similarly:

/ISBN:?\s*\d/

matches “ISBN” followed by an optional colon (not multiple colons), any number of white-space characters including none followed by a digit.
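The two expressions can be tried out with the test method. This is a minimal sketch; the variable names and the sample strings are made up for illustration.

```javascript
const isbnWithColon = /ISBN:\s*\d/;
const isbnOptionalColon = /ISBN:?\s*\d/;

console.log(isbnWithColon.test("ISBN:9"));       // true
console.log(isbnWithColon.test("ISBN:   9"));    // true, several spaces are fine
console.log(isbnWithColon.test("ISBN 9"));       // false, the colon is required
console.log(isbnOptionalColon.test("ISBN 9"));   // true, the colon is optional
console.log(isbnOptionalColon.test("ISBN::9"));  // false, only one colon is allowed
```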


Greedy!

Quantifiers are easy but there is a subtlety that often goes unnoticed.

Quantifiers, by default, are “greedy”.

That is, they match as many characters as they can, even when you might think that the regular expression provides a better match a little further on. The easiest way to follow this is with a simple example.

Suppose you need a regular expression to parse some HTML tags:

<div>hello</div>

If you want to match just a pair of opening and closing tags you might well try the following regular expression:

ex2= /<div>.*<\/div>/

which seems to say “match <div>, then any number, including zero, of other characters, followed by </div>”. If you try this out on the example given above you will find that it matches.

However if you now try it out against the string:

<div>hello</div><div>world</div>

as in:

var ex2 = /<div>.*<\/div>/;
var a = ex2.exec("<div>hello</div><div>world</div>");

you will discover that the match is to the entire string.

That is the final </div> in the regular expression is matched to the final </div> in the string even though there is an earlier occurrence of the same substring.

This is because the quantifiers are greedy by default and attempt to find the longest possible match.

In this case the .* matches everything including the first </div>.

So why doesn’t it also match the final </div>?

The reason is that if it did the entire regular expression would fail to match anything because there would be no closing </div>.

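What happens is that the quantifier first matches as much as it can, which makes the rest of the regular expression fail. The regular expression engine then backtracks, giving back one character at a time, until the rest of the expression matches. The result is the longest possible match.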

Notice that all of the standard quantifiers are greedy and will match more than you might expect based on what follows in the regular expression.
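The backtracking is easier to see if the string has some extra text after the final closing tag. In the sketch below (the variable names and the trailing " and more" are illustrative additions) the .* initially runs to the end of the string and then has to give characters back until </div> can match.

```javascript
const greedyDiv = /<div>.*<\/div>/;
const greedyText = "<div>hello</div><div>world</div> and more";

// exec returns an array whose first element is the matched substring.
const greedyMatch = greedyDiv.exec(greedyText);

// The match runs up to the LAST </div>, not the first one,
// and the trailing " and more" is given back by backtracking.
console.log(greedyMatch[0]); // <div>hello</div><div>world</div>
```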

If you don’t want greedy quantifiers the solution is to use “lazy” quantifiers which are formed by following the standard quantifiers by a question mark.

To see this in action, change the previous regular expression to read:

var ex2 = /<div>.*?<\/div>/;

With this change in place the result of matching to

"<div>hello</div><div>world</div>"

is just the first pair of <div> tags, that is <div>hello</div>.

Notice that all of the quantifiers, including ?, have a lazy version and yes you can write ?? to mean a lazy “zero or one” occurrence.

The distinction between greedy and lazy quantifiers is perhaps the biggest reason for a reasonably well-tested regular expression to go wrong when used against a wider range of example strings.

Always remember that a standard greedy quantifier will match as many times as possible while still allowing the regular expression to match, and its lazy version will match as few times as possible while still allowing the regular expression to match.
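The difference can be seen side by side in a short sketch. The variable names are illustrative, and the use of the g flag with the string match method to collect every lazy match is an extra that the text above doesn't cover.

```javascript
const lazyText = "<div>hello</div><div>world</div>";

// Greedy: .* takes as much as it can.
const greedyWhole = lazyText.match(/<div>.*<\/div>/)[0];
console.log(greedyWhole); // <div>hello</div><div>world</div>

// Lazy: .*? takes as little as it can.
const lazyFirst = lazyText.match(/<div>.*?<\/div>/)[0];
console.log(lazyFirst); // <div>hello</div>

// With the g flag, match returns every non-overlapping lazy match.
const lazyAll = lazyText.match(/<div>.*?<\/div>/g);
console.log(lazyAll); // [ '<div>hello</div>', '<div>world</div>' ]

// ? is greedy and takes one "a"; ?? is lazy and takes none.
const greedyOptional = "aaa".match(/a?/)[0];
const lazyOptional = "aaa".match(/a??/)[0];
console.log(JSON.stringify(greedyOptional), JSON.stringify(lazyOptional)); // "a" ""
```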

Anchors

As well as characters, character sets and quantifiers you can also use location matches or anchors.

For example, the ^ (caret) matches only at the start of the string, so /^ISBN:/ will only match if the string starts with ISBN: and doesn’t match if the same substring occurs anywhere else.

The most useful anchors are:

^ start of string
$ end of string
\b word boundary – i.e. between a \w and \W
\B anywhere but a word boundary

So for example:

 /^\d+$/

specifies a string consisting of nothing but digits. Recall that the + symbol means match 1 or more times.

Compare this to

 /^\d*$/

which would also accept an empty string.
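A short sketch of the anchors in use follows. The variable names and the sample strings, including the words used for \b and \B, are made up for illustration.

```javascript
const digitsOnly = /^\d+$/;
const digitsOrEmpty = /^\d*$/;

console.log(digitsOnly.test("12345"));   // true
console.log(digitsOnly.test("123a5"));   // false, a non-digit is present
console.log(digitsOnly.test(""));        // false, + needs at least one digit
console.log(digitsOrEmpty.test(""));     // true, * accepts no digits at all

// \b matches at a word boundary, \B anywhere else.
const wholeWordCat = /\bcat\b/;
const innerCat = /\Bcat\B/;

console.log(wholeWordCat.test("a cat sat"));    // true, "cat" stands alone
console.log(wholeWordCat.test("concatenate"));  // false, "cat" is inside a word
console.log(innerCat.test("concatenate"));      // true
```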

One subtle point only emerges when you consider strings with line breaks.

In this case by default the ^ and $ match only the very start and end of the string.

If you want them to match line beginnings and endings in a multiline string you have to specify the m (multiline) flag, as in /^\d+$/m.
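The effect of the flag is easy to demonstrate with a three-line string. This is a sketch only; the sample string and variable names are assumptions, and the final example also adds the g flag so that match returns every line.

```javascript
const threeLines = "abc\n123\nxyz";

// Without m, ^ and $ refer to the start and end of the whole string.
const wholeStringDigits = /^\d+$/.test(threeLines);
console.log(wholeStringDigits); // false

// With m, ^ and $ also match at the start and end of each line.
const anyLineDigits = /^\d+$/m.test(threeLines);
console.log(anyLineDigits); // true, the middle line is all digits

// Combining g and m picks out every complete line of word characters.
const wordLines = threeLines.match(/^\w+$/gm);
console.log(wordLines); // [ 'abc', '123', 'xyz' ]
```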

Grouping and alternatives

The strings you need to match often have alternative forms. For example the ISBN designator could be simply ISBN: or it could be ISBN-13: or any of many other reasonable variations.

You can specify an either/or situation using the vertical bar |, the alternation operator as in x|y which will match an x or a y.

For example:

/ISBN:|ISBN-13:/

matches either ISBN: or ISBN-13:. This is easy enough but what about:

/ISBN:|ISBN-13:\s*\d/

At first glance this seems to match either ISBN: or ISBN-13: followed by any number of white space characters and a single digit – but it doesn’t.

The | operator has the lowest priority and the alternative matches are everything to the left and everything to the right, i.e. either ISBN: or ISBN-13:\s*\d.

To match the white space and digit in both forms of the ISBN prefix we would have to write:

/ISBN:\s*\d|ISBN-13:\s*\d/

Clearly having to repeat everything that is in common on either side of the alternation operator is going to make things difficult and this is where grouping comes in.

Anything grouped between parentheses is treated as a single unit – and grouping has a higher priority than the alternation operator.

So for example:

/(ISBN:|ISBN-13:)\s*\d/

matches either form of the ISBN prefix followed by any number of white space characters and a single digit because the parentheses limit the range of the alternation operator to the substrings to the left and right within the parentheses.
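The difference between the ungrouped and grouped expressions shows up as soon as you test a string that has the designator but no digit. The sketch below uses made-up sample strings and variable names.

```javascript
const ungroupedIsbn = /ISBN:|ISBN-13:\s*\d/;
const groupedIsbn = /(ISBN:|ISBN-13:)\s*\d/;

// The left-hand alternative of the ungrouped version is just "ISBN:",
// so it matches even though no digit follows.
console.log(ungroupedIsbn.test("ISBN: no digits here")); // true
console.log(groupedIsbn.test("ISBN: no digits here"));   // false

// The grouped version requires the white space and digit for both forms.
console.log(groupedIsbn.test("ISBN: 9"));       // true
console.log(groupedIsbn.test("ISBN-13:   9"));  // true

// The matched text includes the designator, the white space and one digit.
const groupedMatch = groupedIsbn.exec("ISBN-13: 978");
console.log(groupedMatch[0]); // ISBN-13: 9
```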

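The alternation operator has a similar subtlety: the alternatives are tried from left to right and the first one that succeeds is used, even if a later alternative would match more. For example, suppose you try to match the previous un-grouped expression but without the colon: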

/ISBN|ISBN-13/

In this case the first pattern, i.e. “ISBN”, will match even if the string is “ISBN-13”. It doesn’t matter that the second expression is a “better” match.

No amount of grouping will help with this problem because the shorter match will be tried and succeed first. In this case the solution is to either swap the order of the sub-expressions so that the longer comes first or include something that always marks the end of the target string.

For example, in this case if we add the colon then the ISBN: subexpression cannot possibly match the ISBN-13: string.
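Both fixes can be checked with the string match method, which returns the matched text as the first element of its result. The variable names here are illustrative only.

```javascript
// The first alternative that succeeds wins, even though it is shorter.
const shortFirst = "ISBN-13".match(/ISBN|ISBN-13/)[0];
console.log(shortFirst); // ISBN

// Fix 1: put the longer alternative first.
const longFirst = "ISBN-13".match(/ISBN-13|ISBN/)[0];
console.log(longFirst); // ISBN-13

// Fix 2: include the colon that marks the end of the designator,
// so "ISBN:" fails on "ISBN-13:" and the second alternative is tried.
const withColon = "ISBN-13:".match(/ISBN:|ISBN-13:/)[0];
console.log(withColon); // ISBN-13:
```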


Last Updated ( Thursday, 13 July 2017 )