Perl Unicode Forensics
Written by Nikos Vaggalis
Monday, 28 October 2019
Page 2 of 2
But why the upgrade to UTF?

That solved part of the puzzle. What's left is figuring out why the string was ending up as UTF in the first place. After all, we're using Greek and English in an ISO server locale setting. Let's call Devel::Peek to the rescue, the default tool for debugging encoding cases, to dump the internal representation of the strings and figure out what Perl makes of them.

On Server A, all the arguments consumed by test.cgi, except Greek "A", have the UTF8 flag on:
On Server B, by contrast, all the arguments consumed by test.cgi, including Greek "A", have the UTF8 flag off:
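Both flag states are easy to reproduce outside the CGI environment. The following is a minimal sketch, not the original test.cgi: the variable names are hypothetical, and utf8::upgrade is used only as a convenient way to obtain a flagged copy of a string. Dump comes from the core Devel::Peek module and writes to STDERR; a string with the flag on lists UTF8 in its FLAGS line.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);

# Hypothetical stand-ins for two of the arguments received by the CGI script.
my $date  = "16/09/2019";   # plain ASCII
my $greek = "\xC1";         # Greek capital alpha as a single ISO-8859-7 byte

# Force the flag on for a copy of the ASCII argument, as observed on Server A.
my $flagged_date = $date;
utf8::upgrade($flagged_date);

Dump($flagged_date);   # FLAGS line includes UTF8
Dump($greek);          # FLAGS line does not include UTF8

# The same information without having to read the dump:
printf "%-12s flag=%d\n", 'flagged_date', utf8::is_utf8($flagged_date) ? 1 : 0;
printf "%-12s flag=%d\n", 'greek',        utf8::is_utf8($greek)        ? 1 : 0;
```

The addresses in the dump differ from run to run, so only the FLAGS and PV lines are worth comparing between servers.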
So Server A sees UTF (well, as far as the flag is concerned) while Server B sees mere bytes. Still, on Server A Greek "A" has its UTF8 flag off. But somehow, before it reaches the database, it gets upgraded and its UTF8 flag is turned on too. Why?

Such an upgrade happens when you mix bytes and characters, because the bytes always get upgraded to UTF8; here it is probably due to concatenation:

$sth = $dbh->prepare("insert into testtable

"prepare" takes a string of concatenated values, so "A", 301 in octal and 193 in decimal, as dumped by Devel::Peek

SV = PV(0x12399a50) at 0x12036080

is upgraded to the UTF8 two-byte sequence C381, because it is considered latin1 and Perl thinks that it is upgrading latin1 to UTF8. If it had guessed the correct base ISO-8859-7 encoding, it would have upgraded it to CE91, GREEK CAPITAL LETTER ALPHA.

So the problem is two-fold: a wrong guess about the original encoding, as well as the subsequent upgrade based on that wrong guess.

Never mind the Greek characters. Why aren't the English ones such as "16/09/2019" and "C" ending up as multiple bytes in the database, despite having their UTF8 flag on? That is because in UTF8 the lower, ASCII range uses single bytes, and therefore single bytes end up in the database. Multiple bytes are used for the ranges after it.

Debugging the client

But let's observe how the client test.pl encodes the SOAP packet sent.
In both cases the SOAP packet sent over to the CGI script is identical:

SOAP::Transport::new: ()
SOAP::Serializer::new: ()
SOAP::Deserializer::new: ()
SOAP::Parser::new: ()
SOAP::Lite::new: ()
SOAP::Transport::HTTP::Client::new: ()
SOAP::Lite::call: ()
SOAP::Serializer::envelope: ()
SOAP::Serializer::envelope: select 16/09/2019 17:04:00 16/09/2019 13/09/2019 13:44:09 Α C 3519999
SOAP::Data::new: ()
SOAP::Data::new: ()
SOAP::Data::new: ()
SOAP::Data::new: ()
SOAP::Data::new: ()
SOAP::Transport::HTTP::Client::send_receive: HTTP::Request=HASH(0x22d5890)
SOAP::Transport::HTTP::Client::send_receive: POST http://192.168.10.205/cgi-bin/test.cgi HTTP/1.1
Accept: text/xml
Accept: multipart/*
Accept: application/soap
Content-Length: 1185
Content-Type: text/xml; charset=utf-8
SOAPAction: "http://192.168.10.205/test#select"

<?xml version="1.0" encoding="UTF-8"?><soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" soap:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"><soap:Body><select xmlns="http://192.168.10.205/test"><c-gensym3 xsi:type="xsd:string">16/09/2019 17:04:00</c-gensym3><c-gensym5 xsi:type="xsd:string">16/09/2019</c-gensym5><c-gensym7 xsi:type="xsd:string">13/09/2019 13:44:09</c-gensym7><c-gensym9 xsi:type="xsd:base64Binary">wQ==</c-gensym9><c-gensym11 xsi: xsi:type="xsd:int">3519999</c-gensym11></select></soap:Body></soap:Envelope>

The Content-Type: text/xml; charset=utf-8 is certainly a red flag, so let's try again with the correct encoding this time:
SOAP::Transport::new: ()
SOAP::Serializer::new: ()
SOAP::Deserializer::new: ()
SOAP::Parser::new: ()
SOAP::Lite::new: ()
SOAP::Transport::HTTP::Client::new: ()
SOAP::Lite::call: ()
SOAP::Serializer::envelope: ()
SOAP::Serializer::envelope: select 16/09/2019 17:04:00 16/09/2019 13/09/2019 13:44:09 Α C 3519999
SOAP::Data::new: ()
SOAP::Data::new: ()
SOAP::Data::new: ()
SOAP::Data::new: ()
SOAP::Data::new: ()
SOAP::Transport::HTTP::Client::send_receive: HTTP::Request=HASH(0x22d5890)
SOAP::Transport::HTTP::Client::send_receive: POST http://192.168.10.205/cgi-bin/test.cgi HTTP/1.1
Accept: text/xml
Accept: multipart/*
Accept: application/soap
Content-Length: 1185
Content-Type: text/xml; charset=iso-8859-7
SOAPAction: "http://192.168.10.205/test#select"

<?xml version="1.0" encoding="iso-8859-7"?><soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" soap:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"><soap:Body><select xmlns="http://192.168.10.205/test"><c-gensym3 xsi:type="xsd:string">16/09/2019 17:04:00</c-gensym3><c-gensym5 xsi:type="xsd:string">16/09/2019</c-gensym5><c-gensym7 xsi:type="xsd:string">13/09/2019 13:44:09</c-gensym7><c-gensym9 xsi:type="xsd:base64Binary">wQ==</c-gensym9><c-gensym11 xsi:xsi:type="xsd:int">3519999</c-gensym11></select></soap:Body></soap:Envelope>

This just set the header of the packet to <?xml version="1.0" encoding="iso-8859-7"?> and didn't make any other difference. The problem persisted. Changing the Apache Content-Type: text/xml; charset=utf-8 to ISO-8859-7 didn't have any effect either.
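That neither declared charset changed the outcome is consistent with the upgrade happening inside Perl, at concatenation time. The sketch below tries to reproduce the C381 versus CE91 outcome without SOAP or a database, under a few assumptions: the payload wQ== is taken from the c-gensym9 element of the trace, the variable names are hypothetical, and utf8::upgrade stands in for whatever turned the flag on for the other arguments on the server. It uses only the core modules MIME::Base64 and Encode.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(decode encode);

# The base64Binary payload from the trace decodes to a single byte.
my $greek = decode_base64('wQ==');
printf "payload byte:          %s\n", uc(unpack('H*', $greek));

# An ASCII argument with its UTF8 flag on (hypothetical stand-in).
my $date = "16/09/2019";
utf8::upgrade($date);

# Concatenating a flagged string with an unflagged one upgrades the result,
# and the lone byte is interpreted as Latin-1, i.e. as U+00C1.
my $sql_fragment = $date . $greek;
my $last_char    = substr($sql_fragment, -1);
my $latin1_path  = encode('UTF-8', $last_char);
printf "upgraded as Latin-1:   %s\n", uc(unpack('H*', $latin1_path));

# Decoding the byte explicitly as ISO-8859-7 yields U+0391 instead.
my $alpha      = decode('iso-8859-7', $greek);
my $greek_path = encode('UTF-8', $alpha);
printf "decoded as ISO-8859-7: %s\n", uc(unpack('H*', $greek_path));
```

The three lines printed should read C1, C381 and CE91, which matches the byte sequences reported above for the wrong and the right interpretation of the same byte.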
The Villain

So who's the villain then? It has to be the difference in the versions of Perl and/or SOAP::Lite between the two server environments. After all, SOAP::Lite 0.714 was released on Aug 18, 2011, while SOAP::Lite 1.27 was released on May 14, 2018. So, reasonably, a lot must have changed, even in how the module handles encodings.

Thus the moral of the story is to always strive for the same versions of your software when trying to replicate a system. Mere copying and pasting of code won't cut it. But, as in this case, this is more easily said than done. This case was constrained by being unable to use a version of Perl other than the system's own Perl, which also drags the available module versions down (or up). There was simply no option of installing another Perl, say with Perlbrew.

Server B, where everything worked fine directly out of the box, had the archaic Perl revision 5 version 8 subversion 8. Its newer counterpart, Server A, had Perl revision 5 version 10 subversion 1.

The mitigation

So with these limitations at hand, what do you do? How do you mitigate the issue? An initial trick that sprang to mind was to make use of the

Now the same query
results in:
So Perl indeed understood that my text was ISO-8859-7 and upgraded it to the correct UTF byte sequence. Still, this doesn't make up for the fact that our database is tuned to speak ISO-8859-7 and hence needs to communicate in single, not multiple, bytes.

Another workaround is to turn the UTF8 flag of the arguments off:
so that the concatenation won't produce any side effects.
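As an illustration of this second workaround, here is a minimal sketch that assumes the flag is cleared with the built-in utf8::downgrade; it is not necessarily the call used on the server, and the argument list is a hypothetical stand-in. utf8::downgrade succeeds only when every character fits in a single byte, which holds for the ASCII arguments in this case.

```perl
use strict;
use warnings;

# Hypothetical arguments: the ASCII values arrive flagged, the Greek byte does not.
my @args = ("16/09/2019 17:04:00", "16/09/2019", "C");
utf8::upgrade($_) for @args;
push @args, "\xC1";

# Clear the flag on every argument. The second parameter (FAIL_OK) makes
# utf8::downgrade return false instead of dying on characters above 0xFF.
for my $arg (@args) {
    utf8::downgrade($arg, 1)
        or warn "argument contains characters above 0xFF; left unchanged\n";
}

# With no flagged operand left, concatenation leaves the Greek byte alone.
my $values = join ',', map { "'$_'" } @args;
printf "flag=%d greek byte=%s\n",
    utf8::is_utf8($values) ? 1 : 0,
    uc(unpack('H*', substr($values, -2, 1)));
```

The design choice here is to normalise every argument to bytes before building the SQL string, so that the database keeps receiving single ISO-8859-7 bytes.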
Moral

When looking to replicate systems, always aim for the same Perl and module versions and always use placeholders, or use Docker. If you already do that, then good for you! In that case I still hope you found this deep dive into the inner workings of character encodings, and the ways that the different software parts of an application interpret them, together with the methodology employed, to be of value.
Last Updated (Monday, 28 October 2019)