November 12, 2008

Perl UTF-8 Hell And How To Find Your Way Out

Character encodings are never easy to deal with.  Some of the trouble has to do with how your tools interpret the encoding.  For instance, a Unix terminal (in my case gterm) might try to “fix” an encoding because it thinks it knows what is best.  Or Data::Dumper might handle dumping high-order UTF-8 in its own unique way.  On top of that, dealing with external data sources that vary in encoding, while using CPAN modules that all think they are doing the right thing, makes encoding problems challenging to debug and fix.

Here’s my most recent experience:

Until today, our project notify.me could not display multi-byte UTF-8 characters correctly (anything that was not ASCII).  I set up a test blog on Blogger and added the following text to the subject and body: 人写.

When I used XML::Feed to display the subject and body, I first saw this in the logs:

2008/11/12 22:38:29 DEBUG> rss_processor.pl:413 main:: - Subject => 人写
2008/11/12 22:38:29 DEBUG> rss_processor.pl:414 main:: - Body => 人写

Great.  However, in the JSON object that gets created later I see this divergence:

2008/11/12 22:38:29 DEBUG> rss_processor.pl:450 main:: - … "Guid":"00000004-0000-0055-0005-06491b5ae520",
"Subject":"人写",
"Content":"人写"}

Hrmm, what is going on here?  I thought I checked earlier and saw that the strings were the same.  Well, this is where we enter the realm of encoding madness.

If I Data::Dumper both strings I get a subject as:

%foo = ( 'Subject' => '人写' );

but a content as:

%bar = ( 'Content' => "\x{4eba}\x{5199}" );

So the strings are different, no doubt.  Looking at the Data::Dumper results, it would seem that the subject is correct and the content is screwed.  However, the opposite is true: Data::Dumper escapes the Unicode characters in content.  So what is going on with subject?

The first thing to do is get the actual numeric values of the characters as Perl sees them.  To do this we use ord (you can also use unpack).

$log->debug("subject => " . join(" ", map { ord($_) } split(//, $item->title)));
$log->debug("content => " . join(" ", map { ord($_) } split(//, $item->content->body)));

What does this do?  We break the string apart into individual characters (in this case there should be two) and use ord to get the numeric value of each one.  Then we join these values back up into one string.  To our amazement, the two strings give different values.

2008/11/12 22:38:29 DEBUG> rss_processor.pl:416 main:: - subject => 228 186 186 229 134 153
2008/11/12 22:38:29 DEBUG> rss_processor.pl:417 main:: - content => 20154 20889

The content string looks good: two characters, each one a high-order Unicode code point.  But what is going on with subject?  Does it really have 6 characters?  No, it does not, because it prints out right.  WTF is going on?

Well, the jump you have to make is that subject was supposed to be proper UTF-8, but no one told Perl, so it thinks it’s Latin-1.  So how do we test this theory?
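A quick way to see the theory in action is to reproduce both log lines in isolation. This is only a sketch: the byte string is hard-coded on the assumption that XML::Feed handed back the raw, undecoded UTF-8 bytes for the subject, and the `ords` helper is a made-up name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Space-separated ord() value of every character Perl sees in the string.
sub ords { return join ' ', map { ord($_) } split //, $_[0] }

# The six UTF-8 bytes of the two test characters, as a byte string that
# nobody has decoded (an assumption about what the subject contained).
my $subject = "\xE4\xBA\xBA\xE5\x86\x99";

# The same bytes after being decoded into a two-character string.
my $content = decode('UTF-8', $subject);

print 'subject => ', ords($subject), "\n";    # 228 186 186 229 134 153
print 'content => ', ords($content), "\n";    # 20154 20889
printf "%d vs %d characters\n", length($subject), length($content);    # 6 vs 2
```

Same data, two different views: Perl counts six characters in the undecoded string and two in the decoded one.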

We know we have two chars and 6 bytes, so that means there are three bytes per character.  Well, that’s good, because a UTF-8 character can be 3 bytes long, but it must match the standard 3-byte UTF-8 pattern (shown below).

1110-xxxx 10xx-xxxx 10xx-xxxx

So if we expand 228 186 186 into binary:

1110-0100 1011-1010 1011-1010

And we can see that the bytes fit the UTF-8 pattern: the first byte starts with 1110 and the second two start with 10.

So pulling out the bits in the x positions above gives a 2-byte value:

0100-1110 1011-1010

OK, so that should be the Unicode code point that the three bytes encode.  In hex that would be

4 e b a

So let’s take the hex and convert it to decimal

a (10) * (16^0) = 10

b (11) * (16^1) = 176

e (14) * (16^2) = 3,584

4 (4)  * (16^3) = 16,384

10 + 176 + 3,584 + 16,384 = 20154
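The same by-hand decoding can be written as a few bit operations. This is a minimal sketch for the 3-byte case only, with the three byte values from the log hard-coded; the variable names are mine.

```perl
use strict;
use warnings;

my @bytes = (228, 186, 186);    # first three ord() values of the subject

# Check the 3-byte pattern: 1110xxxx 10xxxxxx 10xxxxxx.
my $fits = ($bytes[0] & 0xF0) == 0xE0
        && ($bytes[1] & 0xC0) == 0x80
        && ($bytes[2] & 0xC0) == 0x80;
die "not a 3-byte UTF-8 sequence\n" unless $fits;

# Keep only the x bits (4 + 6 + 6 = 16 bits) and glue them together.
my $code_point = (($bytes[0] & 0x0F) << 12)
               | (($bytes[1] & 0x3F) << 6)
               |  ($bytes[2] & 0x3F);

printf "binary:  %016b\n", $code_point;    # 0100111010111010
printf "hex:     %04x\n",  $code_point;    # 4eba
printf "decimal: %d\n",    $code_point;    # 20154
```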

And HOLY crap, that’s the first content number that ord spit out (up above).  So our theory is proven right.  But what the heck happened to leave the subject string improperly represented?  Well, there seems to be a magic flag set on variables if the contents should be treated as UTF-8, and it looks like this flag did not get set for $subject.

A little googling and you stumble upon this gem

http://search.cpan.org/~miyagawa/XML-Atom-0.29/lib/XML/Atom/Feed.pm#UNICODE_FLAGS

and there is some theory as to why it is this way: http://use.perl.org/~miyagawa/journal/30923

And that’s why languages should start out supporting UTF-8 from the get-go.

You can set the magic utf8 flag by doing a

utf8::upgrade($string);

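Of course that did not work for me: utf8::upgrade does set the flag, but it keeps each of the six bytes as its own Latin-1 character instead of decoding them.  So I went into XML::Feed::Atom and set the flag to NOT strip utf8 from strings, and I was back in business.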
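For what it’s worth, here is a small sketch of the difference between upgrading and decoding. It is not the fix I actually shipped (that was the module change above); it assumes you are holding a string of raw UTF-8 bytes and cannot touch the module that produced it. The byte string is hard-coded for illustration.

```perl
use strict;
use warnings;

my $raw = "\xE4\xBA\xBA\xE5\x86\x99";    # UTF-8 bytes, not yet decoded

# utf8::upgrade only changes the internal storage: the flag goes on,
# but the string still holds six Latin-1 characters.
my $upgraded = $raw;
utf8::upgrade($upgraded);
printf "upgrade: flag=%d length=%d first=%d\n",
    (utf8::is_utf8($upgraded) ? 1 : 0), length($upgraded), ord($upgraded);
# upgrade: flag=1 length=6 first=228

# utf8::decode interprets the bytes as UTF-8 and yields two characters.
my $decoded = $raw;
utf8::decode($decoded) or die "not well-formed UTF-8\n";
printf "decode:  flag=%d length=%d first=%d\n",
    (utf8::is_utf8($decoded) ? 1 : 0), length($decoded), ord($decoded);
# decode:  flag=1 length=2 first=20154
```

Both strings end up with the flag set, which is why checking the flag alone tells you very little; the character count is what changes.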

You can also cheat on the math and use this nifty site

http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%26%2320154%3B&mode=char

You can see the Latin-1 representation of the UTF-8 character was ‘äºº’.  If you scroll to the top of the post you will see that’s exactly what was in the JSON object.
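Here is a sketch of how that garbling comes about when a JSON encoder is handed undecoded bytes. I am using JSON::PP (in core since Perl 5.14) with its utf8 option purely as a stand-in; it is an assumption, not necessarily the JSON module our pipeline used. The `hex_bytes` helper is a made-up name.

```perl
use strict;
use warnings;
use Encode qw(decode);
use JSON::PP;

# Hex dump of every byte in a string.
sub hex_bytes { return join ' ', map { sprintf('%02x', ord($_)) } split //, $_[0] }

my $raw     = "\xE4\xBA\xBA";           # UTF-8 bytes of the first test character
my $decoded = decode('UTF-8', $raw);    # one character, U+4EBA

my $json = JSON::PP->new->utf8;         # encode() returns UTF-8 bytes

# Undecoded input: each byte is taken for a Latin-1 character and encoded again.
print hex_bytes($json->encode([$raw])), "\n";
# 5b 22 c3 a4 c2 ba c2 ba 22 5d

# Decoded input: the character is encoded exactly once.
print hex_bytes($json->encode([$decoded])), "\n";
# 5b 22 e4 ba ba 22 5d
```

The bytes c3 a4, c2 ba, c2 ba are the UTF-8 encodings of ä, º and º, which is the garbage a UTF-8 viewer shows for the first character.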

October 28, 2008

Interview on Highscalability

Today HighScalability.com ran an interview with yours truly about our architecture. It kind of jumps the gun on the subjects that I’ve promised to cover, but I will cover them in a bit more depth. I’m still editing my next post, which will primarily cover message queuing using SimpleMQ (the production codebase is on SourceForge; I have just not gotten around to putting any kind of useful documentation there).

In the meantime, a big thanks to HighScalability for the article. I hope it gives a coherent overview of our architecture.

October 7, 2008

Great interview on concurrency directions

Channel 9 has an interview from JAOO with Anders Hejlsberg and Guy Steele, two of my favorite language gurus. Lots of interesting perspectives on concurrency issues and imperative vs. functional solutions going forward. Guy Steele’s Common Lisp reference was my bible back in grad school, and Anders’ designs in C# have made it my current favorite language, especially the 3.0 bits giving a taste of functional in the otherwise imperative.

October 4, 2008

Synchronicity kills

I have quite a backlog of technical material I want to post here, so let me just start with one of the fundamentals of the tech behind notify.me: asynchronous programming.

Behind the stories of internet service outages is almost always a bottleneck. And usually this bottleneck is due to synchronous access, i.e. some resource is requested and the requestor ties up more resources, waiting for a response. If the requested resource can’t be delivered in a timely manner, more and more requests pile up until the server can’t accept any new ones. Nobody gets what they want and you have an outage.

Now, there are plenty of ways to increase capacity by adding more horsepower, and to mitigate bottleneck resources by adding caching at various levels of the architecture. These techniques are proven and solve many problems. However, if the data is delivered asynchronously instead of synchronously, apparent capacity is drastically increased, since resources aren’t tied up waiting. Because HTTP is a synchronous pull protocol and the internet is basically plumbed by HTTP, the first step into asynchronicity is usually polling. Instead of asking for a resource and then holding the line until the demand is met, Ajax (when the client is a browser) or simply repeated calls (when the client is another service) keep asking “you got my data?” and a little bit later “how about now?” and so on. As long as it’s cheap to say no, polling reduces bottlenecks.
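To make the polling conversation concrete, here is a toy, in-memory sketch. Nothing in it is notify.me code: the function names are invented, there is no real HTTP involved, and `do_some_work` simply stands in for time passing on the server between polls.

```perl
use strict;
use warnings;

# Hypothetical in-memory service: a request is accepted immediately and
# finishes after a fixed number of work steps.
my %pending;
my $next_id = 1;

sub submit_request {
    my ($steps_needed) = @_;
    my $id = $next_id++;
    $pending{$id} = { steps_left => $steps_needed, result => undef };
    return $id;    # the caller gets a ticket, not the data
}

sub do_some_work {
    for my $job (values %pending) {
        next if $job->{steps_left} == 0;
        $job->{steps_left}--;
        $job->{result} = 'data ready' if $job->{steps_left} == 0;
    }
    return;
}

# The cheap "you got my data?" check: answers at once, never blocks.
sub poll {
    my ($id) = @_;
    return $pending{$id}{result};    # undef means "not yet"
}

my $ticket = submit_request(3);
my $polls  = 0;
my $result;
until (defined($result = poll($ticket))) {
    $polls++;          # the server said no, which cost it almost nothing
    do_some_work();
}
print "got '$result' after $polls empty polls\n";    # got 'data ready' after 3 empty polls
```

The point of the shape is that `submit_request` and `poll` both return immediately, so no caller ever sits on a resource while the work is in progress.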

Ideally, the conversation should follow a pattern of “tell me when my data is ready”. In the browser scenario this isn’t strictly possible, although techniques such as Comet are repurposing HTTP to do just that, even if it does in fact hold the line. However, holding a couple of thousand sockets open from a simple socket server is still a much better option than doing the same against the web server.

But I digress, because the inner workings of notify.me only touch web pages in small ways. While we will be looking at implementing some version of Comet on the site itself, our greater concern is our message traffic. And here we can easily use asynchronous methods to avoid having resource scarcity turn into resource overload. Breaking synchronous operations into asynchronous operations, by separating request and response into separate message passing actions, stops the resource overload. Instead of a system going down from too many parallel requests, it can work its way through a backlog of requests as fast as it can. And in most cases the request/response cycles are so fast that they appear like a linear sequence of events.
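As a rough illustration of splitting request and response into separate messages, here is a single-process sketch with a plain array standing in for the queue. It is an assumption-laden toy, not how SimpleMQ or our services are implemented, and all the names are invented.

```perl
use strict;
use warnings;

my @request_queue;    # backlog of requests waiting for the slow resource
my @delivered;        # responses, recorded in the order they arrive

# Requesting is just "drop a message on the queue": nothing blocks.
sub send_request {
    my ($id, $payload, $on_response) = @_;
    push @request_queue, { id => $id, payload => $payload, reply => $on_response };
    return;
}

# The worker drains the backlog at its own pace, at most $limit per pass,
# and sends each response back as a separate message (here, a callback).
sub work_through_backlog {
    my ($limit) = @_;
    my $handled = 0;
    while (@request_queue && $handled < $limit) {
        my $msg = shift @request_queue;
        $msg->{reply}->($msg->{id}, uc($msg->{payload}));
        $handled++;
    }
    return $handled;
}

# A burst of five requests arrives while the worker can only do two per pass.
send_request($_, "event $_", sub { push @delivered, "$_[0]:$_[1]" }) for 1 .. 5;

while (@request_queue) {
    my $handled = work_through_backlog(2);
    printf "handled %d, %d still queued\n", $handled, scalar @request_queue;
}
print join(', ', @delivered), "\n";
```

The burst is larger than what one pass can handle, yet nothing is rejected: the extra requests simply wait in the backlog and are answered in order.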

In the next couple of posts, I will go over the three different types of message passing systems we use:

  1. Store-and-forward message queueing using SimpleMQ
  2. REST calls with REST callback URI registration
  3. XMPP message bus using message stanzas as well as IQ-based RPC

I’ll discuss each in detail, including previews of our external REST and XMPP APIs, currently in internal testing.
