Unicode issues in Perl
Written by Nikos Vaggalis
Monday, 14 February 2011
Page 4 of 5
The UTF8 flag

Let's return to the earlier example where we fed our program with the file called 'Δ.rar':

    #Example1.pl

This produces:

    C:\unicode>perl example1.pl

Here the conversion was successful, as demonstrated by the fact that we get the correct value in the ANSI code page, but our Perl program still does not know that the string is encoded in Greek/cp1253 and treats the byte sequence as a Latin1 string. This poses issues such as not being able to use the Unicode regex facilities, for example:

    $file=~s/\N{GREEK CAPITAL LETTER DELTA}/!/;

because the string holds cp1253 bytes, not Unicode characters. Hence we must upgrade the bytes into characters:

    #Example6.pl

which produces:

    C:\unicode>perl example6.pl
    FLAGS = (PADMY,POK,pPOK)
    C:\unicode>

After upgrading the string to Unicode the UTF8 flag is on, which denotes that we are working with Unicode, and now the regex works too.

If you do not specify an input code page, as in $unicode_file=decode('cp1253',$file), which explicitly instructs Perl to decode the bytes into characters using the cp1253 code page, Perl treats the string as Latin1. To prove this point we will use the following example:

    #Example7.pl
    while ( my $file = readdir($MYFILE) ) {
    Dump($file),"\n";
    $file=~s/\N{GREEK CAPITAL LETTER
    $file=~s/\N{LATIN CAPITAL LETTER

which produces:

    E:\unicode> chcp 1253
    E:\unicode>example7.pl
    Bytes implicitly upgraded into wide
    Bytes implicitly upgraded into wide
    SV = PV(0x243ba4) at 0x18207b4
    greek? Δ.rar

GREEK CAPITAL LETTER DELTA in cp1253 and LATIN CAPITAL LETTER A WITH DIAERESIS in Latin1 share the same ordinal value \304. For the Unicode-enabled regex to work, the string was treated as bytes with Latin1 encoding and was implicitly upgraded into Latin1's equivalent UTF8. That is why our regex search for GREEK CAPITAL LETTER DELTA failed while the search for LATIN CAPITAL LETTER A WITH DIAERESIS succeeded.
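As a minimal sketch of the difference described above, the following script compares an explicitly decoded string with an undecoded one. It is not one of the original listings: the hard-coded byte string is a hypothetical stand-in for a name returned by readdir() on a cp1253 system, and all variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ':full';    # lets \N{...} names work on Perls older than 5.16

# Hypothetical stand-in for a cp1253 file name: byte 0xC4 (\304) plus ".rar".
my $file = "\xC4.rar";

# Explicit decoding: byte 0xC4 is mapped through cp1253 to U+0394 (DELTA).
my $decoded = decode('cp1253', $file);

# No decoding: Perl has no encoding information, so byte 0xC4 is read as
# the Latin1 character U+00C4 (A WITH DIAERESIS).
my $raw = $file;

my $decoded_flag    = utf8::is_utf8($decoded) ? 1 : 0;
my $decoded_ord     = ord substr($decoded, 0, 1);
my $decoded_delta   = $decoded =~ /\N{GREEK CAPITAL LETTER DELTA}/ ? 1 : 0;
my $raw_delta       = $raw =~ /\N{GREEK CAPITAL LETTER DELTA}/ ? 1 : 0;
my $raw_a_diaeresis = $raw =~ /\N{LATIN CAPITAL LETTER A WITH DIAERESIS}/ ? 1 : 0;

printf "decoded: UTF8 flag=%d, first char=U+%04X, matches DELTA=%d\n",
    $decoded_flag, $decoded_ord, $decoded_delta;
printf "raw:     matches DELTA=%d, matches A WITH DIAERESIS=%d\n",
    $raw_delta, $raw_a_diaeresis;
```

Only the decoded string matches the Greek letter. The raw string matches the Latin1 character instead, because the same byte value is interpreted under a different code page.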
As perlunicode states: "By default, there is a fundamental asymmetry in Perl's Unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 code points in Unicode happens to agree with Latin-1."

If you need to explicitly upgrade the bytes to UTF8, you can use utf8::upgrade(), which upgrades a string in native format (Latin1) to Unicode. Note that it does not reinterpret the bytes under another code page, so \304 is still read as LATIN CAPITAL LETTER A WITH DIAERESIS:

    E:\unicode>example7a.pl
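The following is a minimal sketch of what utf8::upgrade() does and is not the original example7a.pl. It assumes the same hypothetical hard-coded byte string in place of a real file name, and the variable names are illustrative.

```perl
use strict;
use warnings;
use charnames ':full';    # lets \N{...} names work on Perls older than 5.16

# Hypothetical stand-in for a cp1253 file name: byte 0xC4 (\304) plus ".rar".
my $file = "\xC4.rar";

my $flag_before = utf8::is_utf8($file) ? 1 : 0;

# In-place upgrade: only the internal representation changes. The bytes are
# taken as native (Latin1) characters, not as cp1253.
utf8::upgrade($file);

my $flag_after = utf8::is_utf8($file) ? 1 : 0;
my $first_ord  = ord substr($file, 0, 1);

my $matches_delta       = $file =~ /\N{GREEK CAPITAL LETTER DELTA}/ ? 1 : 0;
my $matches_a_diaeresis = $file =~ /\N{LATIN CAPITAL LETTER A WITH DIAERESIS}/ ? 1 : 0;

printf "UTF8 flag before=%d, after=%d\n", $flag_before, $flag_after;
printf "first char=U+%04X, matches DELTA=%d, matches A WITH DIAERESIS=%d\n",
    $first_ord, $matches_delta, $matches_a_diaeresis;
```

The flag is switched on, but the first character remains U+00C4. To obtain the Greek letter, the bytes would have to be decoded with the cp1253 code page, as in the earlier decode() call.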
Last Updated ( Monday, 04 April 2011 )