.NET Regular Expressions In Depth
Written by Mike James   
Thursday, 16 July 2020

If you think regular expressions are trivial and boring, you've not seen the whole picture. Here we reveal that in .NET they are amazingly powerful and not to be missed.

Regular expressions are addictive.

Playing with these compressed but powerful patterns is better than solving a Sudoku.

If you are wondering what the fuss is about because, obviously, regular expressions are just a matter of using “*” and “?”, then read on. The truth is a lot more subtle, and the result a lot more powerful, than you might suspect.

Equally, regular expressions are something you will find in more than just C#. They are useful in JavaScript, Perl, Java, Ruby and even in applications such as word processors.

If you already know the basics of regular expressions, jump to the end of the article, where you will find deeper explanations of some less-used features.

 

Regular fundamentals

It all starts with the idea of specifying a grammar for a particular set of strings. All you have to do is find a pattern that matches all of the strings you are interested in and use the pattern.

The simplest sort of pattern is the string literal that matches itself. So, for example, if you want to process ISBN numbers you might well want to match the string “ISBN:” which is its own regular expression in the sense that the pattern “ISBN:” will match exactly one string of the form “ISBN:”.

To actually use this you have to first create a Regex object with the regular expression built into it:

Regex ex1 = new Regex(@"ISBN:");

The “@” at the start of the string is optional, but it makes life easier when we start to use the “\” escape character.

Recall that strings starting with “@” are represented “as is” without any additional processing or conversion by C#.
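As a minimal sketch of the difference, the following compares a regular string literal with a verbatim one. It assumes a console program using top-level statements (C# 9 or later) rather than the MessageBox calls used elsewhere in this article, and it borrows the \d digit class that is introduced later on. The variable names are purely illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// In a regular string literal the backslash is C#'s own escape character,
// so each backslash meant for the regex engine has to be doubled.
string regular = "ISBN:\\d";

// In a verbatim (@) string the backslash is passed through untouched.
string verbatim = @"ISBN:\d";

// Both literals contain exactly the same seven characters.
bool sameText = regular == verbatim;
Console.WriteLine(sameText);         // True
Console.WriteLine(verbatim.Length);  // 7

// Either one can be handed to the Regex constructor.
Regex ex = new Regex(verbatim);
bool found = ex.IsMatch("ISBN:978-1871962406");
Console.WriteLine(found);            // True
```

The verbatim form is usually easier to read because what you type is what the regular expression engine sees.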

To actually use the regular expression we need one of the methods offered by the Regex object.

The Match method applies the expression to a specified string and returns a Match object.

The Match object contains a range of useful properties and methods that let you track the operation of applying the regular expression to the string.

For example, if there was a match the Success property is set to true as in:

MessageBox.Show(
        ex1.Match(@"ISBN:978-1871962406").
                         Success.ToString());

The Index property gives the position of the match in the search string:

MessageBox.Show(ex1.Match(
    @"ISBN: 978-1871962406").Index.ToString());

which in this case returns zero to indicate that the match is at the start of the string.

To return the actual match in the target string you can use the ToString method. Of course in this case the result is going to be identical to the regular expression but in general this isn’t the case.

Notice that the Match method returns only the first match of the regular expression. To find later ones, call the NextMatch method of the Match object, which returns another Match object for the next match.
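Putting Success, Index, ToString and NextMatch together, a loop over every match might look something like the sketch below. It assumes a console program with top-level statements (C# 9 or later), and the input string, which simply repeats the example ISBN, is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

Regex ex1 = new Regex(@"ISBN:");

// Hypothetical input containing the pattern twice.
string input = "ISBN: 978-1871962406, ISBN: 978-1871962406";

List<int> positions = new List<int>();

// Match returns the first match; NextMatch continues from where it ended.
Match m = ex1.Match(input);
while (m.Success)
{
    // m.ToString() is the matched text, m.Index is where it starts.
    Console.WriteLine($"Found \"{m}\" at index {m.Index}");
    positions.Add(m.Index);
    m = m.NextMatch();
}
```

When there are no more matches NextMatch returns a Match object whose Success property is false, which is what ends the loop.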

Pattern matching

If this was all there was to regular expressions they wouldn’t be very interesting.

The reason they are so useful is that you can specify patterns that spell out the regularities in a type of data.

For example, following the “ISBN:” we expect to find a digit, any digit.

This can be expressed as “ISBN:\d” where \d is a character class that means “a digit”.

If you try this out you will discover that you don’t get a match with the example string because there is a space following the colon. However “ISBN:\s\d” does match as \s means “any white-space character” and:

Regex ex1 = new Regex(@"ISBN:\s\d");
MessageBox.Show(ex1.Match(
           @"ISBN: 978-1871962406").ToString());

displays “ISBN: 9”.

There’s a range of useful character classes and you can look them up in the documentation. The most useful are:

  • .          (i.e. a single dot) matches any character except a newline
  • \d         digit
  • \s         white-space
  • \w        any “word” character including digits

There is also the convention that capital letters match the inverse set of characters:

  • \D       any non-digit
  • \S       any non-white space
  • \W      any non-word character

Notice that the inverse sets can behave unexpectedly unless you are very clear about what they mean.

For example, \D also matches white space and hence

@"ISBN:\D\d"

matches ISBN: 9. 
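The behavior of the shorthand classes and their inverses can be checked with a short sketch like the one below. It assumes a console program with top-level statements (C# 9 or later) and uses the static Regex.Match method for brevity; the variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "ISBN: 978-1871962406";

// \d straight after the colon fails because a space comes first.
// A failed match has an empty Value.
string digitOnly = Regex.Match(input, @"ISBN:\d").Value;

// \s consumes the space, then \d matches the 9.
string spaceDigit = Regex.Match(input, @"ISBN:\s\d").Value;

// \D means "anything that is not a digit", and a space qualifies.
string nonDigit = Regex.Match(input, @"ISBN:\D\d").Value;

// \W means "anything that is not a word character": here the colon.
string nonWord = Regex.Match(input, @"\W").Value;

Console.WriteLine($"[{digitOnly}] [{spaceDigit}] [{nonDigit}] [{nonWord}]");
// [] [ISBN: 9] [ISBN: 9] [:]
```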

You can also make up your own character group by listing the set of characters between square brackets.

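So for example, [0-9] does the same job as \d for ordinary ASCII text, although in .NET \d by default also matches decimal digits from other Unicode scripts. Negating a character set is also possible: [^0-9] matches anything but the digits 0 to 9 and is, with the same proviso, the equivalent of \D.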
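One difference worth knowing about in .NET is that \d by default matches any Unicode decimal digit, whereas [0-9] is restricted to the ten ASCII digits. The sketch below illustrates this with an Arabic-Indic digit. It assumes a console program with top-level statements (C# 9 or later), and the test character is just one convenient example.

```csharp
using System;
using System.Text.RegularExpressions;

// U+0663 is ARABIC-INDIC DIGIT THREE, a Unicode decimal digit.
string arabicThree = "\u0663";

// By default \d accepts any Unicode decimal digit.
bool shorthand = Regex.IsMatch(arabicThree, @"\d");

// The explicit set lists only the ASCII digits 0 to 9.
bool asciiSet = Regex.IsMatch(arabicThree, @"[0-9]");

// RegexOptions.ECMAScript narrows \d to the ASCII digits.
bool ecma = Regex.IsMatch(arabicThree, @"\d", RegexOptions.ECMAScript);

// A negated set matches any single character not listed.
bool negated = Regex.IsMatch("x", @"[^0-9]");

Console.WriteLine($"{shorthand} {asciiSet} {ecma} {negated}");
// True False False True
```

If you only ever want 0 to 9, spelling out [0-9] is the safer choice.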

There are also character sets that refer to Unicode but these are obvious enough in use not to need additional explanation. 

Anchors

As well as characters and character sets you can also use location matches or anchors.

For example, the ^ (caret) matches only at the start of the string, so the pattern @"^ISBN:"

will only match if the string starts with ISBN: and doesn’t match if the same substring occurs anywhere else. The most useful anchors are:

  • ^          start of string
  • $          end of string
  • \b         word boundary – i.e. between a \w and \W
  • \B        anywhere but a word boundary

So for example: 

@"^\d+$"

specifies a string consisting of nothing but digits. Compare this to

@"^\d*$"

which would also accept an empty string, because * means “zero or more” whereas + means “one or more”.
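The anchors listed above can be tried out with a sketch along the following lines. It assumes a console program with top-level statements (C# 9 or later) and uses the static Regex.IsMatch method; the test strings are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// ^ ties the match to the start of the string.
bool atStart = Regex.IsMatch("ISBN: 978-1871962406", @"^ISBN:");      // True
bool notAtStart = Regex.IsMatch("See ISBN: 978-1871962406", @"^ISBN:"); // False

// ^ and $ together force the whole string to fit the pattern.
bool allDigits = Regex.IsMatch("1871962406", @"^\d+$");     // True
bool withHyphen = Regex.IsMatch("978-1871962406", @"^\d+$"); // False

// + needs at least one digit, * is satisfied by none at all.
bool emptyPlus = Regex.IsMatch("", @"^\d+$");  // False
bool emptyStar = Regex.IsMatch("", @"^\d*$");  // True

// \b demands a word boundary, \B demands the absence of one.
bool wholeWord = Regex.IsMatch("the cat sat", @"\bcat\b");   // True
bool insideWord = Regex.IsMatch("concatenate", @"\bcat\b");  // False
bool nonBoundary = Regex.IsMatch("concatenate", @"\Bcat\B"); // True

Console.WriteLine($"{atStart} {notAtStart} {allDigits} {withHyphen}");
Console.WriteLine($"{emptyPlus} {emptyStar} {wholeWord} {insideWord} {nonBoundary}");
```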

One subtle point only emerges when you consider strings with line breaks.

In this case by default the ^ and $ match only the very start and end of the string.

If you want them to match line beginnings and endings you have to specify the RegexOptions.Multiline option, or the equivalent inline (?m) at the start of the pattern. It’s also worth knowing about the \G anchor, which only matches at the point where the previous match ended. It is only useful when used with the NextMatch method, but then it makes all matches contiguous.
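Both points can be illustrated with a short sketch, again assuming a console program with top-level statements (C# 9 or later). The input strings and the CountWithNextMatch helper are hypothetical and exist only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Three lines separated by \n; the first and last are all digits.
string lines = "123\nabc\n456";

// By default ^ and $ refer to the whole string, so nothing matches.
int defaultCount = Regex.Matches(lines, @"^\d+$").Count;

// With Multiline they also match at each line start and end.
int multilineCount =
    Regex.Matches(lines, @"^\d+$", RegexOptions.Multiline).Count;

// The inline (?m) option has the same effect.
int inlineCount = Regex.Matches(lines, @"(?m)^\d+$").Count;

// Without \G the scan skips over "ab" and finds 12, 34 and 56.
int freeCount = CountWithNextMatch("1234ab56", @"\d\d");

// With \G each match must start where the previous one ended,
// so the run stops after 12 and 34.
int contiguousCount = CountWithNextMatch("1234ab56", @"\G\d\d");

Console.WriteLine($"{defaultCount} {multilineCount} {inlineCount}");  // 0 2 2
Console.WriteLine($"{freeCount} {contiguousCount}");                  // 3 2

// Hypothetical helper: counts matches by walking NextMatch.
static int CountWithNextMatch(string input, string pattern)
{
    int count = 0;
    for (Match m = Regex.Match(input, pattern); m.Success; m = m.NextMatch())
        count++;
    return count;
}
```

Note that the example uses \n line endings; text with \r\n endings needs a little more care because $ matches just before the \n, leaving the \r in the way.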

Last Updated ( Thursday, 16 July 2020 )