.NET Regular Expressions In Depth

Written by Mike James

Thursday, 16 July 2020


If you think regular expressions are trivial and boring, you've not seen the whole picture. Here we reveal that in .NET they are amazingly powerful and not to be missed.

Regular expressions are addictive.

Playing with these compressed but powerful patterns is better than solving a Sudoku.

If you think that regular expressions amount to nothing more than the use of “*” and “?”, then read on, because the truth is a lot more subtle and the result is a lot more powerful than you might suspect.

Regular expressions are also something that you will find in more than just C#. They are useful in JavaScript, Perl, Java, Ruby and even in applications such as word processors.

If you know the basics of regular expressions then jump to the end of the article, where you will find some deeper explanations of less-used features.

Regular fundamentals

It all starts with the idea of specifying a grammar for a particular set of strings. All you have to do is find a pattern that matches all of the strings you are interested in and use the pattern.

The simplest sort of pattern is the string literal that matches itself. So, for example, if you want to process ISBN numbers you might well want to match the string “ISBN:” which is its own regular expression in the sense that the pattern “ISBN:” will match exactly one string of the form “ISBN:”.

To actually use this you have to first create a Regex object with the regular expression built into it:

Regex ex1 = new Regex(@"ISBN:");

The use of the “@” at the start of the string is optional, but it does make things easier when we start to use the “\” escape character.

Recall that strings starting with “@” are verbatim strings: C# takes them “as is”, without processing backslash escape sequences.
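As a small illustration of the difference, the following sketch compares a pattern written as a regular string with the same pattern written as a verbatim string. Console output is used here in place of MessageBox purely as an assumption, to keep the example self-contained.

```csharp
using System;
using System.Text.RegularExpressions;

// The same pattern written as a regular string and as a verbatim string.
string regular = "ISBN:\\s\\d";   // each backslash has to be doubled
string verbatim = @"ISBN:\s\d";   // backslashes are taken as is

// The two literals produce identical strings.
bool samePattern = regular == verbatim;
Console.WriteLine(samePattern);

// So either form builds the same regular expression.
bool matches = new Regex(verbatim).IsMatch("ISBN: 978-1871962406");
Console.WriteLine(matches);
```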

To actually use the regular expression we need one of the methods offered by the Regex object.

The Match method applies the expression to a specified string and returns a Match object.

The Match object contains a range of useful properties and methods that let you track the operation of applying the regular expression to the string.

For example, if there was a match the Success property is set to true as in:

MessageBox.Show(ex1.Match(@"ISBN:978-1871962406").Success.ToString());

The Index property gives the position of the match in the search string:

MessageBox.Show(ex1.Match( @"ISBN: 978-1871962406").Index.ToString());

which in this case returns zero to indicate that the match is at the start of the string.

To return the actual match in the target string you can use the ToString method. Of course in this case the result is going to be identical to the regular expression but in general this isn’t the case.

Notice that the Match method returns only the first match of the regular expression. To find the next one, call the NextMatch method on the Match object, which returns another Match object.
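The way Match, Success, Index and NextMatch work together can be sketched as a simple loop. The sample text below is made up for illustration, and Console output is assumed in place of MessageBox so that the example runs as a console program.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample text containing the literal pattern twice.
string text = "ISBN: 978-1871962406 and ISBN: 978-1871962714";
Regex ex1 = new Regex(@"ISBN:");

int count = 0;
int lastIndex = -1;

// Match returns the first match; NextMatch resumes the search
// from the point where the previous match ended.
Match m = ex1.Match(text);
while (m.Success)
{
    Console.WriteLine("Found \"" + m.ToString() + "\" at index " + m.Index);
    lastIndex = m.Index;
    count++;
    m = m.NextMatch();
}
```

When there are no more matches, NextMatch returns a Match object whose Success property is false, which is what ends the loop.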

Pattern matching

If this is all there was to regular expressions they wouldn’t be very interesting.

The reason they are so useful is that you can specify patterns that spell out the regularities in a type of data.

For example, following the ISBN: we expect to find a digit – any digit.

This can be expressed as “ISBN:\d” where \d is a character class indicator which means “a digit”.

If you try this out you will discover that you don’t get a match with the example string because there is a space following the colon. However “ISBN:\s\d” does match as \s means “any white-space character” and:

Regex ex1 = new Regex(@"ISBN:\s\d");
MessageBox.Show(ex1.Match(@"ISBN: 978-1871962406").ToString());

displays “ISBN: 9”.

There’s a range of useful character classes and you can look them up in the documentation. The most useful are:

. (i.e. a single dot) matches any character except, by default, a newline
\d digit
\s white-space
\w any “word” character including digits

There is also the convention that capital letters match the inverse set of characters:

\D any non-digit
\S any non-white space
\W any non-word character

Notice that the inverse sets can behave unexpectedly unless you are very clear about what they mean.

For example, \D also matches white space and hence

@"ISBN:\D\d"

matches ISBN: 9.
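To make the behavior of the classes and their inverses concrete, here is a minimal sketch that tries several of them against the example string. It uses the static Regex.IsMatch method and Console output, neither of which appears above; both are choices made for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string sample = "ISBN: 978-1871962406";

// The character after the colon is a space, so the class used
// in that position decides whether the pattern matches.
bool digitOnly = Regex.IsMatch(sample, @"ISBN:\d");      // space is not a digit
bool whiteSpace = Regex.IsMatch(sample, @"ISBN:\s\d");   // space is white space
bool nonDigit = Regex.IsMatch(sample, @"ISBN:\D\d");     // space is a non-digit
bool nonWord = Regex.IsMatch(sample, @"ISBN:\W\d");      // space is a non-word character
bool nonSpace = Regex.IsMatch(sample, @"ISBN:\S\d");     // space is not non-white space

Console.WriteLine(digitOnly + " " + whiteSpace + " " + nonDigit
    + " " + nonWord + " " + nonSpace);
```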

You can also make up your own character group by listing the set of characters between square brackets.

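So, for example, [0-9] matches the same ASCII digits as \d, although by default \d in .NET also matches decimal digits from other Unicode scripts. Negating a character set is also possible: [^0-9] matches anything but the digits 0 to 9 and plays the same role as \D.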
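A short sketch of a character group and its negation in use follows. It relies on Regex.Replace, which is not covered above and is used here only as a convenient way to show what a negated set matches; Console output is likewise an assumption.

```csharp
using System;
using System.Text.RegularExpressions;

string sample = "ISBN: 978-1871962406";

// [0-9] matches a single digit, so the first match is the first digit found.
string firstDigit = Regex.Match(sample, @"[0-9]").ToString();

// [^0-9] matches every character that is not a digit;
// replacing each of them with nothing leaves only the digits.
string digitsOnly = Regex.Replace(sample, @"[^0-9]", "");

Console.WriteLine(firstDigit);
Console.WriteLine(digitsOnly);
```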

There are also character sets that refer to Unicode but these are obvious enough in use not to need additional explanation.

Anchors

As well as characters and character sets you can also use location matches or anchors.

For example, the ^ (caret) only matches the start of the string, so

@"^ISBN:"

will only match if the string starts with ISBN: and doesn’t match if the same substring occurs anywhere else. The most useful anchors are:

^ start of string
$ end of string
\b word boundary – i.e. between a \w and \W
\B anywhere but a word boundary

So for example:

@"^\d+$"

specifies a string consisting of nothing but digits. Compare this to

@"^\d*$"

which would also accept an empty string.
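The difference between these anchored patterns can be checked with a few test strings. This is a minimal sketch using the IsMatch method and made-up inputs, with Console output assumed in place of MessageBox.

```csharp
using System;
using System.Text.RegularExpressions;

Regex oneOrMore = new Regex(@"^\d+$");   // at least one digit, nothing else
Regex zeroOrMore = new Regex(@"^\d*$");  // digits only, possibly none at all

bool digits = oneOrMore.IsMatch("12345");
bool mixed = oneOrMore.IsMatch("12a45");
bool emptyPlus = oneOrMore.IsMatch("");
bool emptyStar = zeroOrMore.IsMatch("");

// ^ ties the literal to the start of the string.
bool atStart = Regex.IsMatch("ISBN: 1", @"^ISBN:");
bool notAtStart = Regex.IsMatch("My ISBN: 1", @"^ISBN:");

// \b only matches "cat" as a whole word, not inside "concatenate".
int wordIndex = Regex.Match("concatenate cat", @"\bcat\b").Index;

Console.WriteLine(digits + " " + mixed + " " + emptyPlus + " " + emptyStar);
Console.WriteLine(atStart + " " + notAtStart + " " + wordIndex);
```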

One subtle point only emerges when you consider strings with line breaks.

In this case by default the ^ and $ match only the very start and end of the string, although $ also matches just before a final newline.

If you want them to match line beginnings and endings you have to specify the multiline option, either by passing RegexOptions.Multiline or by using the inline form (?m). It’s also worth knowing about the \G anchor, which only matches at the point where the previous match ended. It is only useful when used with the NextMatch method, but then it makes all matches contiguous.
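The effect of the multiline option and of the \G anchor can be seen in the following sketch. The input strings are invented for the example, and Regex.Matches is used as a shortcut for counting the matches of the first two patterns.

```csharp
using System;
using System.Text.RegularExpressions;

// Three lines, two of which consist only of digits.
string lines = "123\nabc\n456";

// Without the option, ^ and $ refer to the whole string, which is not all digits.
int defaultCount = Regex.Matches(lines, @"^\d+$").Count;

// With RegexOptions.Multiline, ^ and $ match at the start and end of each line.
int multilineCount = Regex.Matches(lines, @"^\d+$", RegexOptions.Multiline).Count;

// \G forces each match to begin exactly where the previous one ended,
// so matching stops at the first character that is not a digit.
int contiguous = 0;
Match g = new Regex(@"\G\d").Match("123a45");
while (g.Success)
{
    contiguous++;
    g = g.NextMatch();
}

Console.WriteLine(defaultCount + " " + multilineCount + " " + contiguous);
```

Without the \G, the pattern \d would skip over the letter and go on to find the remaining digits as well.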

Last Updated ( Thursday, 16 July 2020 )