Unicode issues in Perl
Written by Nikos Vaggalis
Monday, 14 February 2011
Page 2 of 5
Readdir in action

We have a directory with three files, and we use the readdir function to get the filenames into our script:
[Figure: Three test files with "interesting" names]

#Example1.pl

By using the module Devel::Peek we can look at the internals of what our Perl program is fed after the conversion takes effect:

C:\unicode>perl example1.pl

"f.rar" and "Δ.rar" arrived intact, but "à.rar" did not (readdir returned "C33E~1.RAR" instead). The ANSI representation of the "Δ" in "Δ.rar" is the single byte with octal value \304 (0xC4). Our Perl program receives the correct value, but because it gets raw bytes with no encoding information attached, it does not know that the byte should be treated as Greek. We will return to this issue in detail later. Let's add a few directories and observe again:
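As an aside on the raw bytes just described, the following is a minimal sketch of the difference between the byte string Perl receives and a decoded character string. The variable names are hypothetical, and the string "\304.rar" is a hand-written stand-in for what readdir would return on a system whose ANSI code page is 1253; the choice of cp1253 is an assumption the program has to supply, since the bytes themselves do not carry it.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for what readdir returns for the Greek-named file
# on a system whose ANSI code page is 1253: a plain byte string.
my $raw = "\304.rar";

# Perl sees only bytes here: the UTF8 flag is off and the first
# "character" is simply the number 0xC4.
my $raw_is_flagged = utf8::is_utf8($raw) ? 1 : 0;
my $raw_first      = ord substr($raw, 0, 1);

# Decoding tells Perl which code page the bytes came from.
# cp1253 is an assumption of this sketch, not something readdir reports.
my $name       = decode('cp1253', $raw);
my $name_first = ord substr($name, 0, 1);

printf "raw:     flag=%d first=0x%02X length=%d\n",
    $raw_is_flagged, $raw_first, length $raw;
printf "decoded: first=U+%04X length=%d\n",
    $name_first, length $name;
```

Only after the explicit decode does the first character become the code point for the Greek capital delta; before that it is just a byte that happens to mean "Δ" in one particular code page.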
[Figure: Some more test directories]

C:\unicode>perl example1.pl
C:\unicode>

Directory "f" and file "f.rar" have been converted correctly, and so have directory "Δ" and file "Δ.rar", but file "à.rar" produced "C33E~1.RAR" and directory "à" produced "0E00~1", both of which are erroneous. About the only thing you can derive from the garbage is that 0xC3 0xA0 is the UTF-8 byte representation of the character à in hex, while 0E00 appears to stem from its UTF-16 code unit, 0x00E0.

Console Input and Output

Console input and output are controlled with the command "chcp". For example, chcp 1253 changes the input and output code page of the console to code page 1253 (Greek). This means that the output of our Perl program will be treated as Greek by the console. Let's reuse our example, but this time, instead of dumping the internals, we will print the filenames to STDOUT:
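A minimal sketch of such a script is given here, assuming the test files live in the current directory. It is an illustration only, not the article's own Example2.pl, and the helper name list_names is hypothetical.

```perl
use strict;
use warnings;

# Return the entries of $dir (minus "." and "..") exactly as readdir
# delivers them, i.e. as undecoded bytes.
sub list_names {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @names = sort grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return @names;
}

# No encoding layer is applied to STDOUT, so the bytes go out unchanged
# and the console decides how to display them.
print "$_\n" for list_names('.');
```

Because the script does no decoding or encoding of its own, what appears on screen depends entirely on the code page the console is using at that moment.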
#Example2.pl

First we set the console code page to 1251 and then to 1253:

C:\unicode>chcp 1251
C:\unicode>perl example2.pl
C:\unicode>chcp 1253
C:\unicode>perl example2.pl

Our Perl program emits bytes, which the console intercepts and, depending on the active code page, translates into the corresponding ANSI characters. Thus in the first case the file named with the Greek letter "Δ", which has the byte value \304, is displayed as the Cyrillic character "Д", because the code page is set to 1251 (Cyrillic) and \304 maps to that character there.

[Figure: Microsoft Windows Code Page 1253]
[Figure: Microsoft Windows Code Page 1251]
Source: http://www.columbia.edu/kermit/cp1251.html

By setting the code page to the correct one, 1253, with chcp, we get the bytes interpreted correctly. However, we can also change the console output code page programmatically by using:

Win32::Console::OutputCP( 1253 );

which overrides any chcp setting. In the following example we set both the input and the output code page of the console to 1251, but we override the output code page from within our Perl program, hence we still get the correct output:

#Example3.pl

This produces:

C:\unicode>chcp 1251
C:\unicode>perl example3.pl
C:\unicode>
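The code-page effect described above can be reproduced without a console. The following is a minimal sketch that decodes the same byte under both code pages with the core Encode module; it mimics the lookup the console performs rather than calling any console API, and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The single byte our script emits for the Greek-named file.
my $byte = "\304";

# Mimic the console: look the byte up in each candidate code page.
# Building the Encode name as "cp" plus the chcp number is an
# assumption of this sketch that holds for 1251 and 1253.
my %shown_as;
for my $cp (1251, 1253) {
    my $char = decode("cp$cp", $byte);
    $shown_as{$cp} = sprintf 'U+%04X', ord $char;
}

# 1251 maps the byte to Cyrillic "Д", 1253 maps it to Greek "Δ".
print "cp$_ => $shown_as{$_}\n" for sort keys %shown_as;
```

The byte never changes; only the table used to interpret it does, which is why switching the console code page (with chcp or with Win32::Console::OutputCP) is enough to turn "Д" back into "Δ".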
Last Updated (Monday, 04 April 2011)