Unicode issues in Perl |
Written by Nikos Vaggalis | ||||||
Monday, 14 February 2011 | ||||||
Page 3 of 5
Opendir,readdir and byte semanticsThe issue is that Perl functions opendir and readdir employ byte semantics and when going through the aforementioned conversion process return bytes instead of characters. This poses a problem in a variety of cases when having to deal with Unicode; for example traversing directories or manipulating Unicode files. One solution is to feed our program Unicode directly hence eradicating the multilingual problematic issues once and for all. After all this is what Unicode was supposed to do. Unfortunately opendir and readdir and other operators are still not Unicode enabled. We could consider using obscure win32 API functions as possible workarounds (use open pragma, use encoding, -C switch which has an effect on the standard I/O but not on functions like opendir,readdir) or wait for the operators to become Unicode in some later Perl version (see 'Unicode in Filenames' in the PerlTodo section on Perl.org),or we could use the COM facilities provided by Windows right now. Scripting.FileSystemObjectIt is much easier and straightforward to use the Windows COM facilities which are all Unicode enabled. Inside Windows there are hundreds of Automation objects and one of them is Scripting.FileSystemObject which provides a much higher level of abstraction than the Win32 APIs and we can use it from within Perl through the Win32::OLE module in order to get directories and filenames directly in Unicode/characters and not bytes.
#Example4.pl And the result is: C:\unicode>perl example4.pl The filenames were resolved correctly, and with a bit of tweaking it works with subdirectories as well. #Example5.pl This results in:
C:\unicode>perl example5.pl Since we told Perl to output UFT8 we have to set the console to the correct codepage as well by using Win32::Console::OutputCP( 65001 ) and enable Unicode support by switching Win32::OLE to the UTF8 codepage (CP => Win32::OLE::CP_UTF8()). We thus avoided having to go through conversion from ANSI to Unicode, Unicode to ANSI, encoding strings, choosing the correct code pages for the conversion to be successful and most importantly we can work with any file of any language simultaneously without having to set the "Language for non-Unicode programs" every time a new language is needed. <ASIN:0596102429> |
||||||
Last Updated ( Monday, 04 April 2011 ) |