Internationalization and Trackbacks
The last straw was when I received a Korean trackback, encoded in euc-kr.
The Trackback Specification makes no mention of character encodings, and MovableType’s original implementation was blissfully ignorant of any such notion. The sender of a Trackback ping sent a string of bytes (which represented a string of characters in the charset of his blog) and the recipient dutifully published that string of bytes on his blog. If the recipient’s charset happened not to be the same as that of the sender, well, then, the result was gibberish.
The most recent versions of MovableType convey the sender’s charset in the HTTP headers of the Trackback. But the recipient doesn’t actually do anything with the information.
As a result, I had a slowly increasing number of gibberish Trackbacks on my blog, with no end in sight.
If you want something done right …
The first order of business is to realize that — somewhere along the line — we need to transcode the Trackback from the sender’s character encoding to the recipient’s. We can do this either before saving the Trackback to the database or after, when we go to build the actual blog pages.
It sounds tempting to do it once and for all and get it over with. But …
Perhaps the sender didn’t specify an encoding. Or perhaps he did, but specified it incorrectly (that Korean blog is supposedly utf-8, but the Trackback was euc-kr). Once we’ve transcoded and stored the result in the database, it’s pretty hard to recover. Better to store the original in the database, along with any charset information we may have received, and do the transcoding later. If we need to, we can add/correct the charset information and rebuild.
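To make the "store the original, transcode later" argument concrete, here is a minimal sketch in Python (not the Perl of the actual patch). The `StoredPing` class and `render_title` function are hypothetical stand-ins for a database row and a page-build step; they are not part of MovableType.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredPing:
    """Hypothetical stand-in for a stored Trackback: raw bytes plus the declared charset."""
    raw_title: bytes
    tb_charset: Optional[str]  # None when the sender declared nothing


def render_title(ping: StoredPing, default_charset: str = "utf-8") -> str:
    """Decode at page-build time; undecodable bytes become U+FFFD."""
    return ping.raw_title.decode(ping.tb_charset or default_charset, errors="replace")


# A Korean title sent as euc-kr but wrongly declared as utf-8.
ping = StoredPing(raw_title="안녕".encode("euc_kr"), tb_charset="utf-8")
garbled = render_title(ping)  # gibberish on the page, but nothing is lost yet

# Because the original bytes were kept, correcting the label and
# rebuilding recovers the text.
ping.tb_charset = "euc-kr"
fixed = render_title(ping)
```

Had the garbled string been written back to the database instead of the raw bytes, the replacement characters would have overwritten the original data and no later correction could bring it back.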
So the first order of business is to add a new column to the mt_tbping table:
--- lib/MT/TBPing.pm.orig  Tue Feb 15 15:25:30 2005
+++ lib/MT/TBPing.pm  Tue Feb 15 18:13:11 2005
@@ -11,7 +11,7 @@
 __PACKAGE__->install_properties({
     columns => [
         'id', 'blog_id', 'tb_id', 'title', 'excerpt', 'source_url', 'ip',
-        'blog_name',
+        'blog_name', 'tb_charset',
     ],
     indexes => {
         created_on => 1,
--- schemas/mysql.dump.orig  Wed Aug 18 19:39:33 2004
+++ schemas/mysql.dump  Tue Feb 15 18:15:25 2005
@@ -262,6 +266,7 @@
     tbping_source_url varchar(255),
     tbping_ip varchar(15) not null,
     tbping_blog_name varchar(255),
+    tbping_tb_charset varchar(255),
     tbping_created_on datetime not null,
     tbping_modified_on timestamp not null,
     tbping_created_by integer,
The next order of business is to capture the character encoding specified in the HTTP headers and store it in the database. While we’re at it, I’m not sure why MovableType decides to truncate utf-8 strings at a fixed number of bytes, rather than a fixed number of characters. That seems like a recipe for disaster, so I commented it out.
--- lib/MT/App/Trackback.pm.orig  Mon Jan 24 18:40:31 2005
+++ lib/MT/App/Trackback.pm  Wed Feb 16 03:50:07 2005
@@ -219,7 +219,7 @@
     my($title, $excerpt, $url, $blog_name) = map scalar $q->param($_),
         qw( title excerpt url blog_name);
-    no_utf8($tb_id, $title, $excerpt, $url, $blog_name);
+#    no_utf8($tb_id, $title, $excerpt, $url, $blog_name);
     return $app->_response(Error=>
         $app->translate("Need a Source URL (url).")) unless $url;
@@ -247,6 +247,9 @@
     $ping->tb_id($tb_id);
     $ping->source_url($url);
     $ping->ip($app->remote_ip || '');
+    if ($ENV{'CONTENT_TYPE'} =~ /[Cc]harset=([a-zA-Z0-9-]+)/) {
+        $ping->tb_charset($1);
+    }
     if ($excerpt) {
         if (length($excerpt) > 255) {
             $excerpt = substr($excerpt, 0, 252) . '...';
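The charset-capturing step in that patch can be sketched as follows, in Python rather than Perl and using the standard library's header parser instead of a hand-written pattern. The function name `charset_from_content_type` is made up for illustration. Unlike the pattern in the patch, this version also copes with a missing header, with quoted values, and with labels that contain an underscore (such as Shift_JIS).

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: Optional[str]) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type header, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() unquotes the parameter and lowercases it;
    # it returns None when no charset parameter is present.
    return msg.get_content_charset()


declared = charset_from_content_type(
    "application/x-www-form-urlencoded; charset=EUC-KR"
)
```

The label is returned as the sender wrote it (apart from case), with no check that it names a real encoding, in keeping with the idea of storing whatever was received and correcting it later if need be.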
Now, at this point, I could have made an enhancement. It’s very unlikely that a random string of bytes is valid utf-8. So, even if the Trackback headers do not specify an encoding, it’s possible to test whether the Trackback could be utf-8 and set the charset accordingly. E.g.:
    } elsif ( _is_utf8($title . $blog_name . $excerpt) ) {
        $ping->tb_charset('utf-8');
    }
...
sub _is_utf8 {
    local $_ = shift;   # localized, so the caller's $_ is not clobbered
    m/^(
        [\x09\x0A\x0D\x20-\x7E]              # ASCII
      | [\xC2-\xDF][\x80-\xBF]               # non-overlong 2-byte
      | \xE0[\xA0-\xBF][\x80-\xBF]           # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]           # excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}        # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}            # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
    )*$/x;
}
For the time being, I’ve decided to forgo automated charset-guessing. If I get a lot of utf-8-encoded Trackbacks without a charset declaration, I’ll reconsider that.
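For comparison, the same "could this be utf-8?" test can be done with a strict decoder instead of a hand-written pattern. This is a sketch in Python, not Perl, and the name `looks_like_utf8` is invented here. Python's strict decoder rejects overlong forms and encoded surrogates, as the pattern above does, but it accepts ASCII control characters that the pattern excludes.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if the byte string is well-formed UTF-8 (plain ASCII included)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


japanese = "日本語".encode("utf-8")
korean_euc = "안녕".encode("euc_kr")

is_utf8 = looks_like_utf8(japanese)        # well-formed UTF-8
is_not_utf8 = looks_like_utf8(korean_euc)  # euc-kr bytes are not valid UTF-8 here
```

A positive answer only says the bytes could be utf-8. Plain ASCII always passes, which is harmless, but a short string in some other encoding could in principle pass by accident.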
Finally, we need to ensure that things get transcoded when the pages are built. The magic happens in _transcode_text(). We use Text::Iconv to convert from the original encoding to utf-8 and then we use Encode (if necessary) to convert from utf-8 to the blog’s native encoding.
--- lib/MT/Template/Context.pm.orig  Tue Feb 15 21:12:33 2005
+++ lib/MT/Template/Context.pm  Wed Feb 16 01:43:52 2005
@@ -26,6 +26,9 @@
 @EXPORT = qw( FALSE );
 use vars qw( %Global_handlers %Global_filters );
+
+my $publish_charset = _hdlr_publish_charset();
+
 sub add_tag {
     my $class = shift;
     my($name, $code) = @_;
@@ -2426,7 +2432,8 @@
     sanitize_on($_[1]);
     my $ping = $_[0]->stash('ping')
         or return $_[0]->_no_ping_error('MTPingTitle');
-    defined $ping->title ? $ping->title : '';
+    my $title = defined $ping->title ? $ping->title : '';
+    return _transcode_text($ping->tb_charset, $title);
 }
 sub _hdlr_ping_url {
     sanitize_on($_[1]);
@@ -2438,7 +2445,8 @@
     sanitize_on($_[1]);
     my $ping = $_[0]->stash('ping')
         or return $_[0]->_no_ping_error('MTPingExcerpt');
-    defined $ping->excerpt ? $ping->excerpt : '';
+    my $excerpt = defined $ping->excerpt ? $ping->excerpt : '';
+    return _transcode_text($ping->tb_charset, $excerpt);
 }
 sub _hdlr_ping_ip {
     my $ping = $_[0]->stash('ping')
@@ -2449,7 +2457,20 @@
     sanitize_on($_[1]);
     my $ping = $_[0]->stash('ping')
         or return $_[0]->_no_ping_error('MTPingBlogName');
-    defined $ping->blog_name ? $ping->blog_name : '';
+    my $blog_name = defined $ping->blog_name ? $ping->blog_name : '';
+    return _transcode_text($ping->tb_charset, $blog_name);
+}
+
+sub _transcode_text {
+    my ($text_charset, $text) = @_;
+    require Text::Iconv;
+    use Encode;
+    if (defined $text_charset && $text_charset ne $publish_charset ) {
+        $text = Text::Iconv->new($text_charset,'utf-8')->convert($text) unless $text_charset eq 'utf-8';
+        $text = encode($publish_charset, decode('utf-8',$text), Encode::FB_XMLCREF) unless $publish_charset eq 'utf-8';
+    }
+    $text =~ s/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w+);)/&amp;/g;
+    return $text;
 }
 package MTPlugins::SubCategories;
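The two-step idea in _transcode_text() (sender's charset to Unicode, then Unicode to the blog's charset, with numeric character references for anything the blog's charset cannot represent) looks roughly like this in Python. This is an illustrative sketch, not a port of the patch: the function name `transcode_text` is an assumption, and where the patch leaves text with no declared charset untouched, this version decodes it as if it were already in the publishing charset.

```python
import re
from typing import Optional

# Same pattern as in the patch: an ampersand that does not already
# begin an entity or a numeric character reference.
BARE_AMP = re.compile(r"&(?!#?[xX]?(?:[0-9a-fA-F]+|\w+);)")


def transcode_text(raw: bytes, source_charset: Optional[str],
                   publish_charset: str) -> bytes:
    # Step 1: bytes in the sender's charset -> Unicode text.
    # Assumption: with no declared charset, treat the bytes as publish_charset.
    text = raw.decode(source_charset or publish_charset, errors="replace")
    # Escape bare ampersands, leaving existing references alone.
    text = BARE_AMP.sub("&amp;", text)
    # Step 2: Unicode -> the blog's charset. Characters it cannot
    # represent are written as &#NNNN; references.
    return text.encode(publish_charset, errors="xmlcharrefreplace")


# A Japanese title published on an iso-8859-1 blog comes out as references.
published = transcode_text("日本語".encode("utf-8"), "utf-8", "iso-8859-1")
```

The `xmlcharrefreplace` error handler plays the role that Encode::FB_XMLCREF plays in the patch.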
I then went back into the database and defined charsets for all the “bad” Trackbacks, rebuilt a few pages and …
So, have at it! Trackback this entry and we’ll see what breaks.
Update: Fixed an inadvertently-dropped bit of patch code for lib/MT/Template/Context.pm above.
Update (2/18/2005): For those benighted souls who think that “Use utf-8.” is the solution to all i18n issues, here’s some data to think about. 93% of the Trackbacks here are plain ASCII. It really doesn’t matter whether (or what) encoding you declare for them. The remaining ones are about equally divided between iso-8859-1, like this Icelandic Trackback, and utf-8, like this Japanese one. And, even after you’ve gotten the encoding right, there are still serious bidi issues to be resolved, as in this Urdu Trackback.
Note: I should have said that Sam Ruby has been doing pretty much the same thing for half a year now. He transcodes incoming Trackbacks on receipt (and he “auto-detects” utf-8). As you might expect, this occasionally fails.
Re: Internationalization and Trackbacks
This is quite interesting, in part because I am now making a plug-in which will automatically decode trackbacks for Russian blogs (where you generally have only two encodings, cp1251 and utf8).
As MT has a hook for that, it looks like you don’t have to modify the source. However, the expression used for matching UTF-8 does not always work for me: I send in sample strings in UTF-8 and they don’t get flagged. The data contains lots of non-Latin text, so it should not look like ASCII. I am still scratching my head over this. Do you have any ideas on how to solve it?