Advanced Perl Regular Expressions - Extended Constructs
Written by Nikos Vaggalis   
Monday, 11 January 2016

Perl is still the leader, almost the standard, when it comes to regular expressions. However it also possesses some lesser known features that deserve special attention, like the embedded code constructs for runtime code evaluation. Let's find out what Perl regular expressions are really like with Perl inside.

How does Perl let you mix code and regular expressions?

It provides the (?{ code }) construct, which embeds Perl code inside the regular expression; the code is executed every time the matching engine reaches that point in the pattern.
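As a minimal sketch of the construct in isolation, the following records each character as the engine passes the embedded block. The helper name letters_seen is hypothetical, and the sketch assumes Perl 5.18 or later, where code blocks reliably see the lexical variables of the enclosing sub.

```perl
use strict;
use warnings;

# Hypothetical helper: record every word character the engine steps over.
sub letters_seen {
    my ($text) = @_;
    my @seen;
    # The block runs each time the engine reaches it, here after every (\w).
    $text =~ /^(?:(\w)(?{ push @seen, $1 }))+$/;
    return @seen;
}

print join(',', letters_seen('abc')), "\n";   # prints: a,b,c
```

Note that side effects such as the push are not undone if the engine later backtracks, so a block like this is best kept to patterns that match in a single pass.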

For example, let's say we have a file that is about to be distributed to multiple platforms and operating systems, so its file name needs to be portable and compatible with the various OSs' file systems. What is the best way to achieve this?

By renaming the file so that it contains only characters from the universally recognized ASCII character set, which means we have to strip it of all non-ASCII characters.

How are we going to do that?

Well, there is the [[:ascii:]] POSIX class and the Unicode \p{InBasicLatin} block, both of which match all ASCII characters; by negation, [^[:ascii:]] or \P{InBasicLatin}, we get all the non-ASCII ones. As with everything in Perl, TMTOWTDI (there's more than one way to do it), and this example can be the basis for more elaborate use cases later on.
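In the spirit of TMTOWTDI, here is a small sketch comparing a few equivalent ways of writing "any non-ASCII character". The helper name strip_variants and the sample string are assumptions for illustration; \P{ASCII} and the explicit range are alternatives not mentioned above, and /r needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every non-ASCII character with 'x',
# once for each of several equivalent character classes.
sub strip_variants {
    my ($name) = @_;
    return (
        $name =~ s/[^[:ascii:]]/x/gr,   # negated POSIX class
        $name =~ s/[[:^ascii:]]/x/gr,   # POSIX negation inside the class
        $name =~ s/\P{ASCII}/x/gr,      # negated Unicode property
        $name =~ s/[^\x00-\x7F]/x/gr,   # explicit code point range
    );
}

# e-acute and DEVANAGARI LETTER DHA are written as escapes to keep the source ASCII.
print "$_\n" for strip_variants("caf\x{e9}_\x{0927}.png");   # prints cafx_x.png four times
```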

What do we actually mean by ASCII?

We mean characters with ordinal values below 128 (in other words, US English only), so we need to deal with those above 127. This gives us the condition 'remove all characters whose ordinal value is > 127' to use in constructing the regex. Besides ASCII characters, our file name also contains Hindi Devanagari characters, intermixed:

 'myimageऄwithधDevanagariमcharsफ.png'

 

So to get the desired outcome, we will use code like:
(code available from the Codebin and online at http://ideone.com/T3Bd0B)

#!/usr/bin/perl
use utf8;
my $file='myimageऄwithधDevanagariमcharsफ.png';
# (?{ ... }) leaves its result in $^R, which is used as the replacement
my $filenew=$file=~ s/(.)(?{ if (ord $1 >127 ){'x'} else {$1} })/$^R/gr;
print "before: ",$file,"\n";
print "after: ",$filenew,"\n";

before: myimageऄwithधDevanagariमcharsफ.png
after: myimagexwithxDevanagarixcharsx.png

Let's break the expression apart: 

(.) matches any character except a newline, and the enclosing parentheses capture that character into the special variable $1. We need $1 to check whether the character's ordinal value is beyond the allowed limit of 127, i.e. ord $1 >127.

The /g modifier makes the engine repeat the match from where the previous one ended, so if we had the string 'abc', the first match of (.) would capture 'a', the second 'b' and the third 'c'; thus we get one character after another.
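The stepping behaviour of /g can be seen on its own in the sketch below. The helper names chars_of and match_ends are hypothetical; the first uses list context to collect every capture, the second uses scalar context and pos() to show where each match ended.

```perl
use strict;
use warnings;

# Hypothetical helper: in list context, /g returns all captures of all matches.
sub chars_of {
    my ($string) = @_;
    return $string =~ /(.)/g;
}

# Hypothetical helper: in scalar context, each evaluation resumes
# where the previous match ended.
sub match_ends {
    my ($string) = @_;
    my @ends;
    while ($string =~ /(.)/g) {
        push @ends, pos($string);   # offset just past the current match
    }
    return @ends;
}

print join(' ', chars_of('abc')), "\n";     # prints: a b c
print join(' ', match_ends('abc')), "\n";   # prints: 1 2 3
```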

Then, for each captured character, the (?{ if (ord $1 >127 ){'x'} else {$1} }) code block is evaluated. It checks whether the condition is true and, if so, yields a simple 'x' in place of the captured character.

Thus 'ऄ' would be replaced by 'x', 'ध' too, and so on, while the rest of the characters are returned unchanged through else {$1}. In both cases the value the block yields is stored in the special variable $^R, which holds the result of the last embedded code block; the substitution operator then uses $^R as the replacement text.
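For comparison, the same replacement can be written without an embedded code block by using the more common /e modifier, which evaluates the replacement side as Perl code. This is a different technique from the one discussed above, shown only as a sketch; the helper name ascii_only is hypothetical and the Devanagari characters are written as escapes.

```perl
use strict;
use warnings;

# Hypothetical helper: same result as the embedded code version, using /e.
sub ascii_only {
    my ($name) = @_;
    # /e evaluates the replacement as Perl code; /r returns the new string.
    return $name =~ s/(.)/ord($1) > 127 ? 'x' : $1/ger;
}

print ascii_only("my\x{0904}image\x{0927}.png"), "\n";   # prints: myximagex.png
```

The embedded code construct earns its place when the decision has to be made in the middle of a pattern rather than once the whole match is known.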

A handy non-destructive version of the substitution operator is also available through the /r flag, introduced in Perl 5.14. With it the target string is not modified; the processed string is returned instead. Hence you can replace code like the following:

use 5.010000;
my $name="a1b2c3d";
my $temp;
$temp = $name =~ s/\d/x/g;
say $temp;
#prints: 3 which is
#the number of substituted characters
say $name;
#prints: axbxcxd

or this which keeps the target string ($name) intact while storing the resulted string in $temp:

use 5.010000;
my $name="a1b2c3d";
my $temp;
( $temp = $name ) =~ s/\d/x/g;
say $temp;
#prints: axbxcxd
say $name;
#prints: a1b2c3d

with code using the /r modifier:

use 5.014;
my $name="a1b2c3d";
my $temp=$name =~ s/\d/x/gr;
say $temp;
#prints: axbxcxd
say $name;
#prints: a1b2c3d
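Because s///r returns its result, it can also be used directly inside expressions such as map. A short sketch, with the hypothetical helper name mask_digits:

```perl
use strict;
use warnings;

# Hypothetical helper: mask digits in a list of names, leaving the input intact.
sub mask_digits {
    my @names = @_;
    # s///r returns the modified copy, so it can be used directly inside map.
    return map { s/\d/x/gr } @names;
}

my @original = ('a1b2', 'c3');
my @masked   = mask_digits(@original);
print "@masked\n";     # prints: axbx cx
print "@original\n";   # prints: a1b2 c3
```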

 

The 'use utf8' directive is essential, since it notifies Perl that our actual source code contains embedded Unicode characters, such as:

$file='myimageऄwithधDevanagariमcharsफ.png'

If it wasn't for this directive the outcome would be quite different:

before: myimageऄwithधDevanagariमcharsफ.png
after: myimagexxxwithxxxDevanagarixxxcharsxxx.png

(code available online at http://ideone.com/SlqgCG)

Which leads to the reasonable question: 'why three substitutions for each Unicode character?' Because without use utf8 Perl is not told about the embedded Unicode characters, so it treats the source as bytes, and each Hindi character is broken down into the 3 bytes that make up the UTF-8 encoding of that character.

Looking at the chart:

Unicode code point   Character   UTF-8 (hex.)   Name
U+0904               ऄ           e0 a4 84       DEVANAGARI LETTER SHORT A
U+092E               म           e0 a4 ae       DEVANAGARI LETTER MA
U+092B               फ           e0 a4 ab       DEVANAGARI LETTER PHA

(source UTF-8 chartable)

confirms that indeed a 3-byte combination is needed for each given character.
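The byte sequences in the chart can be reproduced with the core Encode module. The following is a sketch only; the helper name utf8_hex is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the UTF-8 bytes of one character, as hex pairs.
sub utf8_hex {
    my ($char) = @_;
    my $bytes = encode('UTF-8', $char);
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

for my $cp (0x0904, 0x092E, 0x092B) {
    printf "U+%04X  %s\n", $cp, utf8_hex(chr $cp);
}
# prints:
# U+0904  e0 a4 84
# U+092E  e0 a4 ae
# U+092B  e0 a4 ab
```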

Now to a more complex and practical example. We receive a serialized stream from a web service that has all special characters encoded as hexadecimal HTML entities. Obviously, we need to convert them back to the actual characters, so that '&#x00A3;' becomes the pound symbol £.

Additionally we need to extract the integer (not decimal,for keeping the example simple) money values next to those entities, add 100 to them and remove everything else.  

Can this happen in one go?

So, given the requirements, the string '&#x00A3;50cost&#x20AC;200cost' should become '£150€300'.

Unfortunately the sample code can't be run on the ideone pad because it depends on HTML::Entities and ideone cannot import modules. Therefore we have to rely on our offline development environment, which requires Perl 5.14 or above with the HTML::Entities module installed.

For the example to work we need a Unicode-enabled text editor such as Notepad++, and we must save the example scripts as UTF-8 files without a BOM.

Because it is quite likely that our shell, cmd on Windows for example, does not handle Unicode, we avoid printing to STDOUT and instead print to a file which, when subsequently opened in the text editor, reveals the correct Unicode result.

That is, file 'out.txt' would contain £150€300.

Now let's examine the code: 

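use HTML::Entities qw(decode_entities);
my $stream='&#x00A3;50cost&#x20AC;200cost';
$stream=~ s/( (&\#x....;) | (\d+) | (.) )
    (?{ if (defined $2) {decode_entities "$2"}
        elsif (defined $3) {$3+100} else {''}
    })/$^R/xg;
# write UTF-8 to a file rather than to a console that may not handle Unicode
open(my $out,'>:encoding(UTF-8)','out.txt') or die "Cannot open out.txt: $!";
print {$out} $stream;
close $out;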
 

The outermost enclosing parentheses ( (&\#x....;)|(\d+)|(.) ) play a significant role, since they make the code block run whichever alternative matched. Because alternation has the lowest precedence, reshuffling to /(&\#x....;)|(\d+)|(.)(?{})/ would attach the code block to the last alternative only, so it would execute only when (.) matched and not for the other patterns.
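The effect of the grouping can be checked by counting how often a code block runs with and without the enclosing group. This is a simplified sketch: the helper names are hypothetical, a non-capturing (?: ) group stands in for the outer parentheses, [a-z] stands in for (.), and Perl 5.18 or later is assumed so that the block sees the lexical counter.

```perl
use strict;
use warnings;

# Hypothetical helper: the code block follows a group, so it runs for every match.
sub runs_grouped {
    my ($text) = @_;
    my $runs = 0;
    () = $text =~ /(?:(\d+)|([a-z]))(?{ $runs++ })/g;
    return $runs;
}

# Hypothetical helper: no group, so the block belongs to the last alternative only.
sub runs_ungrouped {
    my ($text) = @_;
    my $runs = 0;
    () = $text =~ /(\d+)|([a-z])(?{ $runs++ })/g;
    return $runs;
}

print runs_grouped('a1b22'), ' ', runs_ungrouped('a1b22'), "\n";   # prints: 4 2
```

The string 'a1b22' yields four matches (a, 1, b, 22); without the group the block runs only for the two letters.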

Let's break the expression down:
( (&\#x....;)|(\d+)|(.) ) fills $1, (&\#x....;) fills $2, (\d+) fills $3 and (.) fills $4.

When we match a hex HTML entity, and hence there is a value inside $2 (the if (defined $2) branch), we replace the captured value with the result of evaluating decode_entities "$2", which translates that value, i.e. &#x00A3; or &#x20AC;, into the pound £ or euro € character respectively.

Then, when we have matched a sequence of one or more digits (stored in $3), i.e. 50 and 200, we increment the captured value by 100.

After that, upon matching any other character, stored in $4, we replace that captured character with nothing (as represented by the 'else' clause); in this case we thus remove both instances of 'cost'.

The order in which the sub-patterns are placed is significant. The order of (&\#x....;) and (\d+), placed first and second, does not matter and can be freely swapped, but (.) must come last: placed first, it would match any character at every position and the engine would never try the alternatives (&\#x....;) and (\d+).

Finally, the /x modifier tells the engine that white space in the pattern is not significant, which allows extra spaces, line breaks and comments to annotate and beautify our regular expression. This is also why the # in the pattern is escaped as \#: an unescaped # would start a comment.
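To show what /x makes possible, here is an annotated, dependency-free sketch of the same transformation. It is not the article's code: it decodes the entities with chr(hex ...) instead of HTML::Entities, so it only handles four-digit hex entities, and the helper name transform_stream is hypothetical. The comments inside the pattern deliberately avoid the characters $, @ and /, which would otherwise be interpolated or end the pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: decode hex entities, add 100 to integers, drop the rest.
sub transform_stream {
    my ($stream) = @_;
    return $stream =~ s/
        (?:
            &\#x ([0-9A-Fa-f]{4}) ;   # group 1: four hex digits of an entity
          | (\d+)                     # group 2: an integer amount
          | .                         # anything else is dropped
        )
        (?{ defined $1 ? chr(hex $1)
          : defined $2 ? $2 + 100
          :              '' })
    /$^R/xgr;
}

binmode STDOUT, ':encoding(UTF-8)';   # assumes a UTF-8 capable terminal
print transform_stream('&#x00A3;50cost&#x20AC;200cost'), "\n";   # prints: £150€300
```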

That was one construct but there's more.

The use cases are limited only by imagination, going beyond the traditional regular expression model and in fact extending it. As Programming Perl puts it:

Unmatched power for text processing and scripting

 

More Information

Mastering Perl global matching
Perl regex Tester
perlre
Ideone


Last Updated ( Monday, 25 July 2016 )