Unicode issues in Perl

Written by Nikos Vaggalis

Monday, 14 February 2011

Opendir, readdir and byte semantics

The issue is that the Perl functions opendir and readdir employ byte semantics: when going through the aforementioned conversion process they return bytes instead of characters. This poses a problem in a variety of cases when dealing with Unicode, for example when traversing directories or manipulating files with Unicode names.
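To make the byte semantics concrete, here is a minimal sketch that does not touch the file system. It assumes a byte string of the kind readdir might hand back on a system using an ANSI code page; the value "\xC4.rar" and the variable names are illustrative assumptions, not output captured from a real directory. The same byte decodes to a different character depending on which code page is assumed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical bytes for a file name, as a byte-oriented call might return them
my $bytes = "\xC4.rar";

# The same bytes interpreted under two different ANSI code pages
my $greek   = decode('cp1253', $bytes);   # Greek code page
my $western = decode('cp1252', $bytes);   # Western European code page

printf "raw first byte:      0x%02X\n",  ord($bytes);
printf "as cp1253 character: U+%04X\n",  ord($greek);
printf "as cp1252 character: U+%04X\n",  ord($western);
# prints:
# raw first byte:      0xC4
# as cp1253 character: U+0394
# as cp1252 character: U+00C4
```

Under cp1253 the byte is the Greek capital delta, under cp1252 it is an A with diaeresis. The bytes alone do not say which one was meant, which is why the result depends on the system's code page setting.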

One solution is to feed our program Unicode directly, eliminating the multilingual problems once and for all. After all, this is what Unicode was supposed to do.

Unfortunately opendir, readdir and some other operators are still not Unicode enabled. The usual switches do not help here: the open pragma, the encoding pragma and the -C switch affect standard I/O but not functions like opendir and readdir. We could consider using obscure Win32 API functions as a workaround, or wait for the operators to become Unicode enabled in some later Perl version (see 'Unicode in Filenames' in the PerlTodo section on Perl.org), or we could use the COM facilities provided by Windows right now.

Scripting.FileSystemObject

It is much easier and more straightforward to use the Windows COM facilities, which are all Unicode enabled.

Windows exposes hundreds of Automation objects. One of them is Scripting.FileSystemObject, which provides a much higher level of abstraction than the Win32 APIs. We can use it from within Perl through the Win32::OLE module to get directory names and filenames directly as Unicode characters rather than bytes.


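#Example4.pl
use strict;
use warnings;
use Win32::Console;
use Win32::OLE qw(in);

# Switch the console to the UTF-8 code page and emit UTF-8 on STDOUT
Win32::Console::OutputCP( 65001 );
binmode(STDOUT, ":utf8");

# Have Win32::OLE exchange strings with COM as UTF-8
Win32::OLE->Option(CP => Win32::OLE::CP_UTF8());

my $obj = Win32::OLE->new('Scripting.FileSystemObject');
my $folder = $obj->GetFolder(".");
my $collection = $folder->{Files};

# Print the name of every file in the current directory
foreach my $value (in $collection) {
    my $filename = $value->{Name};
    print $filename, "\n";
}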

And the result is:

C:\unicode>perl example4.pl
f.rar
à.rar
Δ.rar
C:\unicode>

The filenames were resolved correctly, and with a bit of tweaking it works with subdirectories as well.

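#Example5.pl
use strict;
use warnings;
use Win32::Console;
use Win32::OLE qw(in);

# Switch the console to the UTF-8 code page and emit UTF-8 on STDOUT
Win32::Console::OutputCP( 65001 );
binmode(STDOUT, ":utf8");

# Have Win32::OLE exchange strings with COM as UTF-8
Win32::OLE->Option(CP => Win32::OLE::CP_UTF8());

my $obj = Win32::OLE->new('Scripting.FileSystemObject');
my $folder = $obj->GetFolder(".");
my $collection_folders = $folder->{SubFolders};

# Print the name of every subdirectory of the current directory
foreach my $value (in $collection_folders) {
    my $filename = $value->{Name};
    print $filename, "\n";
}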

This results in:

C:\unicode>perl example5.pl
f
à
Å
Æ
Δ
C:\unicode>

Since we told Perl to output UTF8, we have to set the console to the matching code page as well by calling Win32::Console::OutputCP( 65001 ), and enable Unicode support by switching Win32::OLE to the UTF8 code page (CP => Win32::OLE::CP_UTF8()).

We thus avoided having to go through the conversion from ANSI to Unicode and from Unicode to ANSI, encoding strings, and choosing the correct code pages for the conversion to be successful. Most importantly, we can work with files in any language simultaneously without having to set the "Language for non-Unicode programs" every time a new language is needed.
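The two listings can be combined into a single recursive traversal. The following is a sketch only: it runs on Windows only, walk() is a hypothetical helper name rather than anything provided by Win32::OLE, and it relies solely on the Files, SubFolders and Name properties already used above.

```perl
use strict;
use warnings;
use Win32::Console;
use Win32::OLE qw(in);

# Same console and OLE code page setup as in the listings above
Win32::Console::OutputCP(65001);
binmode(STDOUT, ":utf8");
Win32::OLE->Option(CP => Win32::OLE::CP_UTF8());

# walk() is a hypothetical helper: it prints the files of a folder,
# then descends into each subfolder, indenting by depth.
sub walk {
    my ($folder, $depth) = @_;
    my $indent = '  ' x $depth;

    # Files directly inside this folder
    foreach my $file (in($folder->{Files})) {
        print $indent, $file->{Name}, "\n";
    }

    # Subfolders, marked with a trailing backslash, then their contents
    foreach my $sub (in($folder->{SubFolders})) {
        print $indent, $sub->{Name}, "\\\n";
        walk($sub, $depth + 1);
    }
}

my $fso = Win32::OLE->new('Scripting.FileSystemObject')
    or die "Cannot create Scripting.FileSystemObject: ", Win32::OLE->LastError();

walk($fso->GetFolder("."), 0);
```

Because every name comes back from COM as characters, no per-language code page has to be chosen at any level of the tree.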


Last Updated ( Monday, 04 April 2011 )