Perl UTF-8 Hell And How To Find Your Way Out
Character encodings are never easy to deal with. Some of the trouble comes from how your tools interpret the encoding. For instance, a Unix terminal (in my case gterm) might try to “fix” an encoding because it thinks it knows best, or Data::Dumper might dump high-order UTF-8 in its own unique way. On top of that, external data sources that vary in encoding and CPAN modules that all think they are doing the right thing make problems challenging to debug and fix.
Here’s my most recent experience:
Until today, our project notify.me could not display multi-byte UTF-8 characters (anything that was not ASCII) correctly. I set up a test blog on Blogger and added the following text to the subject and body: 人写.
When I used XML::Feed to display the subject and body, I first saw this in the logs:
2008/11/12 22:38:29 DEBUG> rss_processor.pl:413 main:: - Subject => 人写
2008/11/12 22:38:29 DEBUG> rss_processor.pl:414 main:: - Body => 人写
Great. However, in the JSON object that gets created later I see a divergence:
2008/11/12 22:38:29 DEBUG> rss_processor.pl:450 main:: - … "Guid":"00000004-0000-0055-0005-06491b5ae520",
"Subject":"äººå",
"Content":"人写"}
Hmm, what is going on here? I thought I had checked earlier and seen that the strings were the same. Well, this is where we enter the realm of encoding madness.
If I Data::Dumper both strings I get a subject as:
%foo = ( 'Subject' => '人写' );
but a content as:
%bar = ( 'Content' => "\x{4eba}\x{5199}" );
So the strings are different, no doubt. Looking at the Data::Dumper results, it would seem that the subject is correct and the content is screwed, but the opposite is true: Data::Dumper escapes wide characters as \x{...}, so the content is a proper two-character Unicode string. So what is going on with the subject?
The first thing to do is get the actual numeric values of the characters in each string. To do this we use ord (you can also use unpack).
$log->debug("subject => " . join(" ", map { ord($_) } split(//, $item->title)) );
$log->debug("content => " . join(" ", map { ord($_) } split(//, $item->content->body)) );
What does this do? We break each string apart into its characters (in this case there should be two), use ord to get the numeric value of each one as Perl sees it, and then join those values back up into one string. To our amazement, the values are different.
2008/11/12 22:38:29 DEBUG> rss_processor.pl:416 main:: - subject => 228 186 186 229 134 153
2008/11/12 22:38:29 DEBUG> rss_processor.pl:417 main:: - content => 20154 20889
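The two log lines can be reproduced without any feed at all. This is a minimal sketch using only core modules; the variable names and the ords helper are made up for illustration, and the byte string is written with escapes on the assumption that the subject arrived as raw, undecoded UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The six raw UTF-8 bytes of the two characters, written as escapes so the
# example does not depend on the encoding of this source file.
my $subject_bytes = "\xE4\xBA\xBA\xE5\x86\x99";

# The same text after decoding: a two-character string.
my $content_chars = decode('UTF-8', $subject_bytes);

# Same idea as the debug lines above: one ord() value per element.
sub ords { return join ' ', map { ord($_) } split //, $_[0] }

print "subject => ", ords($subject_bytes), "\n";   # 228 186 186 229 134 153
print "content => ", ords($content_chars), "\n";   # 20154 20889
```

Only numbers are printed, so no "Wide character" warning gets in the way.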
The content string looks good: it is two characters long, and each one is a high code point (well above 255). But what is going on with the subject? Does the text really have 6 characters? No, it does not, because it prints out right. WTF is going on?
Well, the jump you have to make is that the subject was supposed to be proper UTF-8, but no one told Perl, so it treats every byte as a separate Latin-1 character. So how do we test this theory?
We know we have two characters and 6 bytes, which means three bytes per character. That is good, because a UTF-8 character can be 3 bytes long, but the bytes must match the standard 3-byte UTF-8 pattern (shown below).
1110-xxxx 10xx-xxxx 10xx-xxxx
So if we expand 228 186 186 into binary:
1110-0100 1011-1010 1011-1010
And we can see that the bytes fit the UTF-8 pattern: the first byte starts with 1110 and the next two start with 10.
Keeping only the x bits from the pattern above (4 + 6 + 6 bits) gives a 2-byte value:
0100-1110 1011-1010
OK, so that should be the Unicode code point that the three bytes encode. In hex that is:
4 e b a
So let's take the hex and convert it to decimal:
a (10) * (16^0) = 10
b (11) * (16^1) = 176
e (14) * (16^2) = 3,584
4 (4) * (16^3) = 16,384
10 + 176 + 3,584 + 16,384 = 20154
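The same bit math can be done in a few lines of Perl. This is only a sketch of the hand calculation above: decode_3byte is a hypothetical helper name, it handles just the 3-byte pattern, and it does not reject overlong or otherwise invalid sequences the way a real decoder would.

```perl
use strict;
use warnings;

# Decode one 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) by hand.
sub decode_3byte {
    my ($b1, $b2, $b3) = @_;

    # The lead byte must start with 1110, the other two with 10.
    die "not a 3-byte UTF-8 sequence\n"
        unless ($b1 & 0xF0) == 0xE0
            && ($b2 & 0xC0) == 0x80
            && ($b3 & 0xC0) == 0x80;

    # Keep only the x bits: 4 + 6 + 6 = 16 bits.
    return (($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F);
}

printf "%08b %08b %08b\n", 228, 186, 186;   # 11100100 10111010 10111010

my $first  = decode_3byte(228, 186, 186);
my $second = decode_3byte(229, 134, 153);
printf "%d (0x%X)\n", $first, $first;       # 20154 (0x4EBA)
printf "%d (0x%X)\n", $second, $second;     # 20889 (0x5199)
```

Both results match the two values ord reported for the content string.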
And HOLY crap, that is the first number ord spat out for the content string (up above). So our theory is proven right. But what the heck happened to leave the subject string misrepresented? Well, Perl keeps a magic flag on each variable that says whether its contents should be treated as UTF-8, and it looks like this flag did not get set for $subject.
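You can look at that flag directly with utf8::is_utf8, which is built into Perl. A small sketch, again with made-up variable names and the bytes written as escapes:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xE4\xBA\xBA";            # raw UTF-8 bytes, flag off
my $chars = decode('UTF-8', $bytes);   # decoded text, flag on

printf "bytes: length=%d flag=%d\n",
    length($bytes), utf8::is_utf8($bytes) ? 1 : 0;   # length=3 flag=0
printf "chars: length=%d flag=%d\n",
    length($chars), utf8::is_utf8($chars) ? 1 : 0;   # length=1 flag=1
```

The flag is an internal detail, so this is a debugging aid rather than something to build program logic on.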
A little googling and you stumble upon this gem
http://search.cpan.org/~miyagawa/XML-Atom-0.29/lib/XML/Atom/Feed.pm#UNICODE_FLAGS
and there is some theory as to why it is this way: http://use.perl.org/~miyagawa/journal/30923
And that’s why languages should start out supporting UTF-8 from the get-go.
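You might think you can set the magic utf8 flag by doing a
utf8::upgrade($string);
Of course that did not work for me: utf8::upgrade only changes how Perl stores the string internally, so the six Latin-1 characters stay six characters (utf8::decode($string) is the call that reinterprets the bytes as UTF-8). So I went into XML::Feed::Atom and set the flag to NOT strip utf8 from strings, and I was back in business.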
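Here is a small sketch of why utf8::upgrade does not help with a string like this, next to utf8::decode, which does. Both are core functions; the variable names and the ords helper are illustrative, and this is not the change I made inside XML::Feed::Atom.

```perl
use strict;
use warnings;

my $subject = "\xE4\xBA\xBA\xE5\x86\x99";   # six bytes, no UTF-8 flag

# utf8::upgrade changes the internal storage only; still six characters.
my $upgraded = $subject;
utf8::upgrade($upgraded);

# utf8::decode reinterprets the bytes as UTF-8, in place.
my $decoded = $subject;
utf8::decode($decoded) or die "subject is not valid UTF-8\n";

sub ords { return join ' ', map { ord($_) } split //, $_[0] }

print "upgraded => ", ords($upgraded), "\n";   # 228 186 186 229 134 153
print "decoded  => ", ords($decoded), "\n";    # 20154 20889
```

Decoding after the fact only makes sense if you are sure the string really holds undecoded UTF-8 bytes; run it on text that is already decoded and you get a different kind of mess.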
You can also cheat on the bit math and use this nifty site:
http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%26%2320154%3B&mode=char
You can see that the Latin-1 representation of the UTF-8 bytes for 人 is ‘äºº’. If you scroll to the top of the post you will see that is exactly what showed up in the Subject of the JSON object.
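The same garbling can be reproduced on purpose. This sketch assumes the failure mode was the usual double encoding: UTF-8 bytes read as Latin-1 characters, then encoded to UTF-8 again on output. I did not trace the JSON code to confirm that, so treat it as an illustration rather than a diagnosis.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $utf8_bytes = "\xE4\xBA\xBA";   # the UTF-8 bytes of U+4EBA

# Misread the three bytes as three Latin-1 characters...
my $misread = decode('ISO-8859-1', $utf8_bytes);

# ...and encode those characters to UTF-8 again on the way out.
my $double = encode('UTF-8', $misread);

print join(' ', map { sprintf('U+%04X', ord($_)) } split //, $misread), "\n";
# U+00E4 U+00BA U+00BA
print join(' ', map { sprintf('%02X', ord($_)) } split //, $double), "\n";
# C3 A4 C2 BA C2 BA
```

U+00E4 is ä and U+00BA is º, which is where the three junk characters come from.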