Perl Unicode Forensics
Written by Nikos Vaggalis   
Monday, 28 October 2019

But why the upgrade to UTF?

That solved part of the puzzle. What's left is figuring out why the string was ending up as UTF-8 in the first place. After all, we're using Greek and English under an ISO server locale setting.

Let's call Devel::Peek to the rescue. It is the go-to tool for debugging encoding cases: it dumps the internal representation of the strings so we can figure out what Perl makes of them.
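
As a rough sketch of how dumps like the ones in this section can be produced: the argument values below are hypothetical stand-ins for what test.cgi receives, and the pairing of Devel::Peek with Data::Dumper is an assumption based on the `$VAR1 = \'...'` lines in the output.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);   # core module; Dump() writes to STDERR
use Data::Dumper;

# Hypothetical stand-ins for the arguments that test.cgi receives.
my @args = ("16/09/2019 17:04:00", "\301", "C");

for my $arg (@args) {
    Dump($arg);                   # SV flags, PV buffer, CUR and LEN
    print STDERR Dumper(\$arg);   # the $VAR1 = \'...' line
    # utf8::is_utf8 reports only the internal UTF8 flag, nothing more.
    print STDERR "UTF8 flag: ", (utf8::is_utf8($arg) ? "on" : "off"), "\n\n";
}
```

Literals written like this carry no UTF8 flag, so the sketch reproduces the Server B picture; the flag only shows up when the strings arrive already upgraded.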

On Server A, all the arguments consumed by test.cgi, except Greek "A", have the UTF8 flag on:

SV = PV(0x1f243c8) at 0x1b37880
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x1f3cef0 "16/09/2019 17:04:00"\0 UTF8 "16/09/2019 17:04:00"
CUR = 19
LEN = 24
$VAR1 = \'16/09/2019 17:04:00';


SV = PV(0x1f243c8) at 0x1b37880
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x1f3cef0 "16/09/2019"\0 UTF8 "16/09/2019"
CUR = 10
LEN = 24
$VAR1 = \'16/09/2019';


SV = PV(0x1f243c8) at 0x1b37880
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x1f3cef0 "13/09/2019 13:44:09"\0 UTF8 "13/09/2019 13:44:09"
CUR = 19
LEN = 24
$VAR1 = \'13/09/2019 13:44:09';


SV = PV(0x1f243c8) at 0x1b37880
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x1f3cef0 "\301"\0
CUR = 1
LEN = 24
$VAR1 = \'Α';


SV = PV(0x1f243c8) at 0x1b37880
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x1f3cef0 "C"\0 UTF8 "C"
CUR = 1
LEN = 24
$VAR1 = \'C';


SV = PV(0x1f243c8) at 0x1b37880
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x1f3cef0 "3519999"\0 UTF8 "3519999"
CUR = 7
LEN = 24
$VAR1 = \'3519999';

In contrast, on Server B all the arguments consumed by test.cgi, including Greek "A", have the UTF8 flag off:

SV = PV(0x12399a50) at 0x12036080
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x123a07e0 "16/09/2019 17:04:00"\0
CUR = 19
LEN = 24
$VAR1 = \'16/09/2019 17:04:00';


SV = PV(0x12399a50) at 0x12036080
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x123a07e0 "16/09/2019"\0
CUR = 10
LEN = 24
$VAR1 = \'16/09/2019';


SV = PV(0x12399a50) at 0x12036080
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x123a07e0 "13/09/2019 13:44:09"\0
CUR = 19
LEN = 24
$VAR1 = \'13/09/2019 13:44:09';


SV = PV(0x12399a50) at 0x12036080
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x123a07e0 "\301"\0
CUR = 1
LEN = 24
$VAR1 = \'Α';


SV = PV(0x12399a50) at 0x12036080
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x123a07e0 "C"\0
CUR = 1
LEN = 24
$VAR1 = \'C';


SV = PV(0x12399a50) at 0x12036080
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x123a07e0 "3519999"\0
CUR = 7
LEN = 24
$VAR1 = \'3519999';

So Server A sees UTF-8 (well, as far as the flag is concerned) while Server B sees mere bytes. Still, on Server A Greek "A" has its UTF8 flag off. But somehow, before it reaches the database, it gets upgraded and its UTF8 flag is turned on too. Why?

Such an upgrade happens when you mix byte strings and character strings: the bytes always get upgraded to UTF-8. Here the likely cause is the concatenation of the values into the SQL statement:

$sth = $dbh->prepare("insert into testtable
(date1,date2,date3,greek,english,id)
values
($date1,$date2,$date3,$greek_string,$english_string,$id)");

"prepare" takes a single string with the values concatenated into it, so "A" (301 in octal, 193 in decimal), as dumped by Devel::Peek

SV = PV(0x12399a50) at 0x12036080
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x123a07e0 "\301"\0
CUR = 1
LEN = 24
$VAR1 = \'Α';

is upgraded to the two-byte UTF-8 sequence C381, because Perl treats the byte as Latin-1 and upgrades it from Latin-1 to UTF-8. Had Perl assumed the correct source encoding, ISO-8859-7, it would have upgraded it to CE91, GREEK CAPITAL LETTER ALPHA.

So the problem is two-fold: a wrong guess about the original encoding, and the subsequent upgrade based on that wrong guess.
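
The two byte sequences can be reproduced outside the application. This is a minimal sketch, not the article's actual code path: it assumes that `utf8::upgrade` is a fair stand-in for however the arguments became flagged on Server A, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $greek = "\xC1";     # the single byte that arrives for Greek capital alpha
my $date  = "16/09/2019";
utf8::upgrade($date);   # assumption: mimic Server A, where the UTF8 flag is on

# Joining a flagged string with a byte string upgrades the bytes as Latin-1.
my $joined = $date . $greek;
my $last   = substr($joined, -1);   # now U+00C1, not U+0391
my $as_latin1 = uc unpack('H*', encode('UTF-8', $last));

# Decoding explicitly from ISO-8859-7 first yields the intended character.
my $as_greek = uc unpack('H*', encode('UTF-8', decode('iso-8859-7', $greek)));

print "implicit upgrade:  $as_latin1\n";
print "explicit decoding: $as_greek\n";
```

The first line printed corresponds to what landed in the database, the second to what a correct ISO-8859-7 interpretation would have produced.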

Never mind the Greek characters. Why aren't the English ones, such as "16/09/2019" and "C", ending up as multiple bytes in the database despite having their UTF8 flag on? That is because UTF-8 encodes the ASCII range as single bytes, identical to the ASCII bytes themselves, and therefore single bytes end up in the database. Multiple bytes are only used for characters above that range.
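
A small check of that claim, again using `utf8::upgrade` as an assumed stand-in for the way the strings became flagged:

```perl
use strict;
use warnings;

my $ascii   = "16/09/2019";
my $flagged = $ascii;
utf8::upgrade($flagged);   # UTF8 flag on, content unchanged

my $ascii_octets = $flagged;
utf8::encode($ascii_octets);   # character string to UTF-8 octets, in place

my $high = "\xC1";
utf8::upgrade($high);
my $high_octets = $high;
utf8::encode($high_octets);

printf "ASCII string: %d characters, %d bytes\n",
    length($flagged), length($ascii_octets);
printf "byte 0xC1:    %d character,  %d bytes\n",
    length($high), length($high_octets);
```

The ASCII string occupies the same number of bytes before and after, while the single byte above the ASCII range grows to two.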

Debugging the client

But let's observe how the client test.pl encodes the SOAP packet sent.

In both cases the SOAP packet sent over to the CGI script is identical:

SOAP::Transport::new: ()
SOAP::Serializer::new: ()
SOAP::Deserializer::new: ()
SOAP::Parser::new: ()
SOAP::Lite::new: ()
SOAP::Transport::HTTP::Client::new: ()
SOAP::Lite::call: ()
SOAP::Serializer::envelope: ()
SOAP::Serializer::envelope: select 16/09/2019 17:04:00 16/09/2019 13/09/2019 13:
44:09 Α C  3519999 
SOAP::Data::new: ()
SOAP::Data::new: ()
SOAP::Data::new: ()
SOAP::Data::new: ()
SOAP::Data::new: ()
SOAP::Transport::HTTP::Client::send_receive: HTTP::Request=HASH(0x22d5890)
SOAP::Transport::HTTP::Client::send_receive: POST http://192.168.10.205/cgi-bin/
test.cgi HTTP/1.1
Accept: text/xml
Accept: multipart/*
Accept: application/soap
Content-Length: 1185
Content-Type: text/xml; charset=utf-8
SOAPAction: "http://192.168.10.205/test#select"
<?xml version="1.0" encoding="UTF-8"?><soap:Envelope xmlns:xsi="http://www.w3.or
g/2001/XMLSchema-instance" xmlns:soapenc="http://schemas.xmlsoap.org/soap/encodi
ng/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" soap:encodingStyle="http://sch
emas.xmlsoap.org/soap/encoding/" xmlns:soap="http://schemas.xmlsoap.org/soap/env
elope/"><soap:Body><select xmlns="http://192.168.10.205/test"><c-gensym3 xsi
:type="xsd:string">16/09/2019 17:04:00</c-gensym3><c-gensym5 xsi:type="xsd:strin
g">16/09/2019</c-gensym5><c-gensym7 xsi:type="xsd:string">13/09/2019 13:44:09</c
-gensym7><c-gensym9 xsi:type="xsd:base64Binary">wQ==</c-gensym9><c-gensym11 xsi:
xsi:type="xsd:int">3519999</c-gensym11></select></soap:Body></soap:Envelope>

SOAP::Transport::HTTP::Client::send_receive: HTTP::Response=HASH(0x2b22ba0)
SOAP::Transport::HTTP::Client::send_receive: HTTP/1.1 200 OK
Connection: close
Date: Mon, 21 Oct 2019 10:23:14 GMT
Server: Apache/2.2.15 (Unix)
Content-Length: 916
Content-Type: text/xml; charset=utf-8
Client-Date: Mon, 21 Oct 2019 10:23:15 GMT
Client-Peer: 192.168.10.205:80
Client-Response-Num: 1
SOAPServer: SOAP::Lite/Perl/1.27
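
One detail in the request is worth a quick check: the Greek argument travels as xsd:base64Binary with the payload wQ==. A minimal sketch, using the core MIME::Base64 module, to see which bytes that payload carries:

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# Payload copied from the <c-gensym9> element of the trace.
my $payload = decode_base64("wQ==");

printf "length: %d byte(s), hex: %s\n",
    length($payload), uc unpack('H*', $payload);
```

It decodes to the single byte C1, so the client puts the Greek "A" on the wire as one ISO-8859-7 byte regardless of the charset declared in the headers.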

The Content-Type: text/xml; charset=utf-8 is certainly a red flag, so let's try again, with the correct encoding this time:

use SOAP::Lite +trace;

my $soap = SOAP::Lite
  ->uri("http://192.168.10.205/test")
  ->proxy("http://192.168.10.205/cgi-bin/test.cgi")
  ->encoding("iso-8859-7")
  ->select("16/09/2019 17:04:00","16/09/2019","13/09/2019 13:44:09","Α","C",3519999)
  ->result;

print "result is ", @{$soap};

SOAP::Transport::new: ()
SOAP::Serializer::new: ()
SOAP::Deserializer::new: ()
SOAP::Parser::new: ()
SOAP::Lite::new: ()
SOAP::Transport::HTTP::Client::new: ()
SOAP::Lite::call: ()
SOAP::Serializer::envelope: ()
SOAP::Serializer::envelope: select 16/09/2019 17:04:00 16/09/2019 13/09/2019 13:
44:09 Α C  3519999 
SOAP::Data::new: ()
SOAP::Data::new: ()
SOAP::Data::new: ()
SOAP::Data::new: ()
SOAP::Data::new: ()
SOAP::Transport::HTTP::Client::send_receive: HTTP::Request=HASH(0x22d5890)
SOAP::Transport::HTTP::Client::send_receive: POST http://192.168.10.205/cgi-bin/
test.cgi HTTP/1.1
Accept: text/xml
Accept: multipart/*
Accept: application/soap
Content-Length: 1185
Content-Type: text/xml; charset=iso-8859-7
SOAPAction: "http://192.168.10.205/test#select"
<?xml version="1.0" encoding="iso-8859-7"?><soap:Envelope xmlns:xsi="http://www.w3.or
g/2001/XMLSchema-instance" xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" soap:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"><soap:Body><select xmlns="http://192.168.10.205/test"><c-gensym3 xsi:type="xsd:string">16/09/2019 17:04:00</c-gensym3><c-gensym5 xsi:type="xsd:strin
g">16/09/2019</c-gensym5><c-gensym7 xsi:type="xsd:string">13/09/2019 13:44:09</c-gensym7><c-gensym9 xsi:type="xsd:base64Binary">wQ==</c-gensym9><c-gensym11 xsi:xsi:type="xsd:int">3519999</c-gensym11></select></soap:Body></soap:Envelope>

SOAP::Transport::HTTP::Client::send_receive: HTTP::Response=HASH(0x2b22ba0)
SOAP::Transport::HTTP::Client::send_receive: HTTP/1.1 200 OK
Connection: close
Date: Mon, 21 Oct 2019 10:23:14 GMT
Server: Apache/2.2.15 (Unix)
Content-Length: 916
Content-Type: text/xml; charset=utf-8
Client-Date: Mon, 21 Oct 2019 10:23:15 GMT
Client-Peer: 192.168.10.205:80
Client-Response-Num: 1
SOAPServer: SOAP::Lite/Perl/1.27

This just set the charset of the request's Content-Type header and the XML declaration of the packet to <?xml version="1.0" encoding="iso-8859-7"?>, and didn't make any other difference.

The problem persisted.

Changing the Apache Content-Type: text/xml; charset=utf-8 to iso-8859-7 didn't have any effect either.

 

The Villain

So who's the villain then? It has to be the difference in the versions of Perl and/or SOAP::Lite between the two server environments. After all, SOAP::Lite 0.714 was released on Aug 18, 2011, while SOAP::Lite 1.27 was released on May 14, 2018. So, reasonably, a lot must have changed, even in how the module handles encodings.

Thus the moral of the story is to always strive for the same versions of your software when trying to replicate a system. Mere copying and pasting code won't cut it.

But, as in this case, this is more easily said than done. We were constrained to the system's own Perl, which also drags the available module versions down (or up). There was simply no option of installing another Perl, say with Perlbrew.

Server B, where everything worked fine out of the box, had the archaic Perl revision 5 version 8 subversion 8. Its newer counterpart, Server A, had Perl revision 5 version 10 subversion 1.
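
When comparing two environments like these, it helps to print the versions from the same script on both servers. A minimal sketch; the helper name `module_version` and the list of modules to check are assumptions made for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns a module's version, or a marker if it
# cannot be loaded on this machine.
sub module_version {
    my ($module) = @_;
    my $version = eval "require $module; $module->VERSION";
    return defined $version ? $version : 'not installed';
}

printf "perl %vd\n", $^V;
for my $module (qw(SOAP::Lite Encode DBI)) {
    printf "%-12s %s\n", $module, module_version($module);
}
```

Diffing the output of both servers shows at a glance where the stacks diverge.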

The mitigation

So with these limitations at hand, what do you do? How do you mitigate the issue?

An initial trick that sprang to mind was to use the "use encoding" pragma to force Perl to use ISO-8859-7 instead of Latin-1 for its byte semantics. That is, just set use encoding "greek"; at the top of the CGI script.

Now the same query

select greek, hex(greek) from test where id=3519999

results in :

│Ξ\221 │CE9120202020202020│

CE 91 is GREEK CAPITAL LETTER ALPHA in UTF-8.

So Perl indeed understood that my text was ISO-8859-7 and upgraded it to the correct UTF-8 byte sequence.

Still, this doesn't make up for the fact that our database is set up to speak ISO-8859-7 and hence needs to communicate in single, not multiple, bytes.

Another workaround is to turn the UTF8 flag of the arguments to off:

Encode::_utf8_off($date1);
Encode::_utf8_off($date2);
etc

so that the concatenation won't produce any side effects.
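
A minimal sketch of the effect, with hypothetical argument values and `utf8::upgrade` standing in for whatever flagged the strings on Server A. Clearing the flag like this is only safe for ASCII-only strings such as the ones here; `utf8::downgrade` with a true second argument is shown as a more cautious alternative that converts where possible instead of merely flipping the flag.

```perl
use strict;
use warnings;
use Encode ();

my @args = ("16/09/2019 17:04:00", "C", "3519999");
utf8::upgrade($_) for @args;   # assumption: mimic the flagged arguments
my $greek = "\301";            # unflagged single byte, as in the dumps

# The approach from the text: clear the flag on every argument.
Encode::_utf8_off($_) for @args;

# With no flagged operand left, joining no longer upgrades the Greek byte.
my $sql_values = join(",", @args, $greek);
printf "flag: %s, last byte: %s\n",
    (utf8::is_utf8($sql_values) ? "on" : "off"),
    uc unpack('H*', substr($sql_values, -1));

# More cautious alternative: returns false instead of dying when the
# string holds characters that do not fit in a single byte.
my $copy = "16/09/2019";
utf8::upgrade($copy);
my $downgraded = utf8::downgrade($copy, 1);
```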

The last and best workaround is to use placeholders. No values are concatenated, so no upgrade takes place, and it also mitigates the SQL injection risk!

$sth = $dbh->prepare("insert into testtable
(date1,date2,date3,greek,english,id)
values
(?,?,?,?,?,?)");

# execute returns undef on failure; in that case roll back, disconnect and report
defined $sth->execute($date1,$date2,$date3,$greek_string,$english_string,$id)
  || $dbh->rollback() && $dbh->disconnect() && return ["ERROR_insert",1];
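
Why this helps can be illustrated without a database. The sketch below does not use DBI at all: it only contrasts building one interpolated string with handing the values over as a list, which is an assumption about what matters here. What a particular driver then does with each bind value is driver-specific and not shown.

```perl
use strict;
use warnings;

my $date = "16/09/2019";
utf8::upgrade($date);    # assumption: flagged, as on Server A
my $greek = "\301";      # unflagged single byte

# Interpolation builds one string, so the Greek byte gets upgraded with it.
my $interpolated = "values ($date,$greek)";
my $octets = $interpolated;
utf8::encode($octets);   # what the string looks like as UTF-8 octets

# Bind values stay separate scalars; nothing is joined, nothing is upgraded.
my @bind = ($date, $greek);

printf "interpolated contains C3 81: %s\n",
    (index($octets, "\xC3\x81") >= 0 ? "yes" : "no");
printf "bind value for greek: flag %s, hex %s\n",
    (utf8::is_utf8($bind[1]) ? "on" : "off"),
    uc unpack('H*', $bind[1]);
```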

 

Moral

When looking to replicate systems, always aim for the same Perl and module versions (or use Docker), and always use placeholders.

If you already do that, then good for you! In that case I still hope you found value in this deep dive into the inner workings of character encodings and the ways the different software parts of an application interpret them, together with the methodology employed.

 

More Information

Unicode issues in Perl

Related Articles

Connecting To The Outside World with Perl and Database Events

Health Level 7 (HL7) with Perl


Last Updated ( Monday, 28 October 2019 )