Master JavaScript Regular Expressions
Written by Ian Elliot
Thursday, 13 July 2017
Page 3 of 3
Capture and backreference

Now that we have explored grouping it is time to introduce the most sophisticated and useful aspect of regular expressions – the idea of “capture”. You may think that brackets are just about grouping together items that should be matched as a group, but there is more. A subexpression, i.e. something between brackets, is said to be “captured” if it matches, and captured expressions are remembered by the engine during the match. Notice that a capture can occur before the entire expression has finished matching – indeed a capture can occur even if the entire expression eventually fails to match at all. You can see the captures as part of the array returned by the RegExp exec method or the String match method. The first element of the array is the full match and the subsequent array elements are the captures. Each capture group, i.e. each sub-expression surrounded by brackets, can match more than once if it is repeated, but JavaScript remembers only the last string it captured. To be clear, the expression (<div>)(</div>)
has two capture groups, which by default are numbered from left to right, with capture group 1 being the (<div>) and capture group 2 being the (</div>). The entire expression can be regarded as capture group 0 and each result is returned in the corresponding element of the result array. If we try out this expression on a suitable string, we get back an array of results holding the capture matches as well as the full match.
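A minimal sketch of this experiment, assuming a test string of three <div></div> pairs (the string and the variable names are illustrative and not taken from the original article):

```javascript
// Assumed test string: three <div></div> pairs in a row.
const text = "<div></div><div></div><div></div>";

// Two capture groups; the / in </div> has to be escaped inside a regex literal.
const pairPattern = /(<div>)(<\/div>)/;

// exec returns an array (or null if there is no match):
// element 0 is the full match, elements 1 and 2 are the captures.
const result = pairPattern.exec(text);

console.log(result[0]);    // "<div></div>"
console.log(result[1]);    // "<div>"
console.log(result[2]);    // "</div>"
console.log(result.index); // 0, the match starts at the beginning of the string
```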
Then, in this case, we have three capture groups returned as part of the result array – the entire expression returned as result[0], the first bracket, i.e. capture group 1, returned as result[1] and the final bracket, i.e. capture group 2, as result[2]. The first group, i.e. the entire expression, is reported as matching only once at the start of the test string – after all we only asked for the first match. Now consider the same argument over again but this time with the expression ((<div>)(</div>))*.
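As a rough sketch, assuming a test string made of three repeats (an illustrative choice, as are the variable names), the result array for the repeated expression can be inspected as follows. Note that a repeated group in JavaScript reports only its last capture:

```javascript
// Assumed test string: three <div></div> pairs in a row.
const repeatedText = "<div></div><div></div><div></div>";

// The outer bracket is group 1, the inner brackets are groups 2 and 3.
const repeatedPattern = /((<div>)(<\/div>))*/;

const repeated = repeatedPattern.exec(repeatedText);

console.log(repeated.length); // 4: the full match plus three capture groups
console.log(repeated[0]);     // the whole string of three repeats
console.log(repeated[1]);     // "<div></div>", the last repeat only
console.log(repeated[2]);     // "<div>"
console.log(repeated[3]);     // "</div>"
```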
In this case there are four capture groups including the entire expression. Capture group 0 is the entire expression ((<div>)(</div>))* and this is captured once, matching the entire string of three repeats. The next capture group is the first, i.e. outer, bracket ((<div>)(</div>)). It matches on each of the three repeats but JavaScript reports only its last capture, so it appears once in the result array as result[1], and the remaining two capture groups, (<div>) and (</div>), follow as result[2] and result[3].

Back references

So far so good, but what can you use captures for? The answer is two-fold: more sophisticated regular expressions and replacements. Let’s start with their use in building more sophisticated regular expressions. Using the default numbering system described above you can refer to a previous capture in the regular expression. That is, if you write \n, where n is the number of a capture group, the expression will match the text most recently captured by that group – confused? It’s easy once you have seen it in action. Consider the task of checking that HTML tags occur in the correct opening and closing pairs. That is, if you find a <div> tag the next closing tag to the right should be a </div>. You can already write a regular expression to detect this condition but captures and back references make it much easier.

Note: You cannot parse HTML using regular expressions as it requires a more powerful grammar than a regular expression can describe, but you can parse small subsets of HTML.

If you start the regular expression with a sub expression that captures the string within the brackets then you can check that the same word occurs within the closing bracket using a back reference to the capture group, as in <(div)></\1>.
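A minimal sketch of such a check, assuming the opening and closing tags sit directly next to each other (the test strings and variable names are illustrative):

```javascript
// Group 1 captures the tag name; \1 demands the same text in the closing tag.
// The / of the closing tag is escaped because this is a regex literal.
const divPair = /<(div)><\/\1>/;

console.log(divPair.test("<div></div>")); // true
console.log(divPair.test("<div></pr>"));  // false: "pr" is not what group 1 captured

const pairMatch = divPair.exec("<div></div>");
console.log(pairMatch[1]); // "div", the text the back reference had to repeat
```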
Notice the \1 in the final part of the expression tells the regular expression engine to retrieve the last match of the first capture group, which should be div in this case. If you try this out you will find that it matches <div></div> but not <div></pr>, say. You could have done the same thing without using a back reference but it's easy to extend the expression to cope with additional tags. For example, <(div|pr|span|script)></\1>
matches correctly closed div, pr, span and script tags. If you are still not convinced of the power of capture and back reference, try to write a regular expression that detects repeated words without using a back reference to a capture. The solution using a back reference is almost trivial: \b(\w+)\s+\1\b
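Both back-reference patterns described above can be tried out in a short sketch. The test strings and variable names are illustrative assumptions:

```javascript
// Alternation inside the capture group: \1 must repeat whichever name matched.
const tagPair = /<(div|pr|span|script)><\/\1>/;
console.log(tagPair.test("<span></span>")); // true
console.log(tagPair.test("<span></div>"));  // false

// Repeated word: a word, some white space, then the same word again,
// with a word boundary at both ends.
const repeatedWord = /\b(\w+)\s+\1\b/;
const wordMatch = repeatedWord.exec("this is is a test");
console.log(wordMatch[0]); // "is is"
console.log(wordMatch[1]); // "is"

// The same pattern without the final \b, for comparison.
const noEndBoundary = /\b(\w+)\s+\1/;
console.log(repeatedWord.test("the theory"));  // false
console.log(noEndBoundary.test("the theory")); // true: "the" repeats inside "theory"
```

The last two lines show why the closing word boundary matters: without it the back reference is satisfied by the first three letters of a longer word.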
The first part of the expression simply matches a word by the following process – start at a word boundary, capture as many word characters as you can, then allow one or more white space characters. Finally check to see if the next word is the same as the capture. The only tricky bit is remembering to put the word boundary at the end. Without it you will match a word that is repeated as the start of the next word, as in “the theory”. If you need to group items together but don’t want to make use of a capture you can use (?:regex).
This works exactly as it would without the ?: but the bracket is left out of the list of capture groups. This can improve the efficiency of a regular expression but this usually isn’t an issue.
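To see the effect on the result array, here is a small sketch that compares a capturing and a non-capturing version of the earlier expression. The test string and variable names are illustrative:

```javascript
// Both brackets capture: the result has three elements.
const capturing = /(<div>)(<\/div>)/.exec("<div></div>");

// The first bracket only groups, so the second bracket becomes group 1.
const nonCapturing = /(?:<div>)(<\/div>)/.exec("<div></div>");

console.log(capturing.length);    // 3
console.log(nonCapturing.length); // 2
console.log(nonCapturing[0]);     // "<div></div>", the full match is unchanged
console.log(nonCapturing[1]);     // "</div>"
```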
Lookahead Capture

There are two lookahead assertions.

Zero-width positive lookahead assertion

(?=regex)
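This continues the match only if the regex matches on the immediate right of the current position, but the text it matches is not captured or included in the result, and the engine does not backtrack into the assertion once it has been evaluated. For example, a word pattern followed by a lookahead for a digit, such as \b\w+(?=\d),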
only matches a word ending in a digit but the digit is not included in the match. That is, it matches Paris9 but returns Paris as capture 0, i.e. the full match. In other words, you can use it to assert a pattern that must follow a matched subexpression.

Zero-width negative lookahead assertion

(?!regex)
This works like the positive lookahead assertion but the regex has to fail to match on the immediate right. For example, a suitable word pattern combined with the negative lookahead (?!\d)
only matches a word that doesn’t have a trailing digit. That is, it matches Paris but not Paris9.

Replacements

So far we have created regular expressions with the idea that we can use them to test that a string meets a specification or to extract a substring. These are the two conventional uses of regular expressions. However, you can also use them to perform some very complicated string editing and rearrangements. The whole key to this idea is that you can use the captures as part of the specified replacement string. The only slight problem is that the substitution strings use a slightly different syntax to a regular expression. The replace method is a String function and it accepts a RegExp object to specify the match. A call of the general form string.replace(regex, substitution)
simply takes a match of the associated regular expression and performs the substitution specified. Notice that it performs the substitution on the first match only, unless the regular expression has the g (global) flag, in which case every match is replaced, and the result returned is the entire string with the substitution made. For example, if we define a regular expression that matches the ISBN suffix and apply a replacement that uses the substitution string ISBN-13, then the ISBN suffix will be replaced by ISBN-13. Notice that an ISBN-13 suffix will also be replaced by ISBN-13, so making all ISBN strings consistent. This is easy enough to follow and works well as long as you have defined your regular expression precisely enough. More sophisticated is the use of capture groups within the substitution string. You can use $n or, for two-digit group numbers, $nn to refer to capture group n. There is a range of other substitution strings, such as $& for the whole match, $` and $' for the text before and after the match and $$ for a literal dollar sign, but these are fairly obvious in use.
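The following sketch shows replace in action. The article's own ISBN expression is not reproduced here, so the pattern, the input strings and the variable names below are assumptions made purely for illustration:

```javascript
// Hypothetical pattern: "ISBN" optionally followed by "-13".
// The g flag asks for every match to be replaced, not just the first.
const isbnLabel = /ISBN(?:-13)?/g;

// Made-up input with one label of each kind.
const catalogue = "ISBN 1234567890123, ISBN-13 9876543210987";
const normalized = catalogue.replace(isbnLabel, "ISBN-13");
console.log(normalized); // "ISBN-13 1234567890123, ISBN-13 9876543210987"

// Without the g flag only the first match is replaced.
const firstOnly = "ISBN 1, ISBN 2".replace(/ISBN(?:-13)?/, "ISBN-13");
console.log(firstOnly); // "ISBN-13 1, ISBN 2"

// $$ inserts a literal dollar sign and $& inserts the whole match.
const priced = "cost 42".replace(/\d+/, "$$$&");
console.log(priced); // "cost $42"

// $` is the text before the match and $' is the text after it.
const context = "a-b".replace(/-/, "[$`|$']");
console.log(context); // "a[a|b]b"
```

In each case replace returns a new string and the original string is left unchanged.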
As an example of how this all works consider the problem of converting a US format date to a European format date, i.e. to change mm/dd/yyyy to dd/mm/yyyy. First we need a regular expression to match the mm/dd/yyyy format: (\d{1,2})/(\d{1,2})/(\d{4})
This isn’t a particularly sophisticated regular expression but we have allowed one or two digits for the month and day numbers but insisted on four for the year number. You can write a more interesting and flexible regular expression for use with real data. Notice that we have three capture groups corresponding to month, day and year. To create a European style date all we have to do is assemble the capture groups in the correct order in a substitution string: $2/$1/$3
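Put together, the date conversion can be sketched as below. The sample dates and variable names are illustrative, and the pattern makes no attempt to validate the month or day ranges:

```javascript
// Groups: 1 = month, 2 = day, 3 = year. The / separators are escaped
// because this is a regex literal.
const usDate = /(\d{1,2})\/(\d{1,2})\/(\d{4})/;

// Reassemble the captures as day/month/year.
const european = "7/13/2017".replace(usDate, "$2/$1/$3");
console.log(european); // "13/7/2017"

// With the g flag every date in a longer string is converted.
const usDateAll = /(\d{1,2})\/(\d{1,2})\/(\d{4})/g;
const converted = "Due 7/13/2017 and 12/25/2017".replace(usDateAll, "$2/$1/$3");
console.log(converted); // "Due 13/7/2017 and 25/12/2017"
```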
This substitutes the day, month and year capture groups in place of the entire matched string, i.e. the original date.

Avoid overuse

Regular expressions are addictive in a way that can ultimately be unproductive. It isn’t worth spending days crafting a single regular expression that matches all variations on a string when building one or two simpler alternatives and using a wider range of string operations would do the same job as well, if not as neatly. Resist the temptation to write regular expressions that you only just understand and always make sure you test them with strings that go well outside of the range of inputs that you consider correct – greedy matching and backtracking often result in the acceptance of a wider range of strings than was originally intended. If you take care, however, regular expressions are a very powerful way of processing and transforming text without the need to move to a complete syntax analysis package.