Internationalization
Say you want to tag some text on a web page as being in a language other than the main language of the page (English, in the case of this blog). In HTML 4, you would slap a <span lang=".."></span> around it. In XHTML 1.1, the lang attribute is gone, and you’d write
<span xml:lang="fr">ma vie en rose</span>
instead.
And therein lies a small problem. No matter how you set your Sanitize Spec in the blog preferences, MovableType will strip out the xml:lang attribute from any sanitized text like, say, the comments on your blog. The sanitizer can’t handle attribute names with colons in them.
Fortunately, the fix for this is easy.
--- lib/MT/Sanitize.pm.orig Fri Apr 23 08:40:27 2004
+++ lib/MT/Sanitize.pm Fri Apr 23 08:41:42 2004
@@ -98,7 +98,7 @@
                 (exists $tag_attr->{$name} && $tag_attr->{$name} eq '/')) {
                 if ($inside) {
                     my @attrs;
-                    while ($inside =~ m/(\w+)\s*=\s*(['"])(.*?)\2/gs) {
+                    while ($inside =~ m/([:\w]+)\s*=\s*(['"])(.*?)\2/gs) {
                         my $att = lc($1);
                         if ($ok_tags->{'*'}{$att} ||
                             (ref $ok_tags->{$name} && $ok_tags->{$name}{$att})) {
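To see what the one-character-class change buys, here is a minimal sketch that re-creates the two patterns in Python's re module rather than in Perl (the regex syntax used here reads the same in both). The function name attr_names and the sample tag contents are made up for illustration; this is not MT's code.

```python
import re

# Python re-creations of the two Perl patterns from the patch.
# re.S mirrors Perl's /s flag; finditer stands in for the /g loop.
OLD = re.compile(r"""(\w+)\s*=\s*(['"])(.*?)\2""", re.S)
NEW = re.compile(r"""([:\w]+)\s*=\s*(['"])(.*?)\2""", re.S)


def attr_names(pattern, inside):
    """Return the lowercased attribute names the pattern finds in a tag body."""
    return [m.group(1).lower() for m in pattern.finditer(inside)]


# Hypothetical tag contents, i.e. what sits between "<span" and ">".
inside = ' dir="rtl" xml:lang="he"'

print(attr_names(OLD, inside))  # ['dir', 'lang']
print(attr_names(NEW, inside))  # ['dir', 'xml:lang']
```

In this sketch the old pattern only ever sees the lang tail of the name, so a lookup for xml:lang can never succeed. The patched pattern captures the full name.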
That takes care of easy languages, like French. But say you want to comment in Hebrew. Hebrew’s a Right-to-Left language. If you want to use a phrase in Hebrew in the midst of an English paragraph, you’d paste the Hebrew text into a <bdo dir="rtl" xml:lang="he"></bdo>. “<bdo>” stands for “BiDirectional Override”, which temporarily overrides the direction of the text. If you want an entire paragraph in Hebrew, you’d paste the text into a <p dir="rtl" xml:lang="he"></p>.
[Update (5/11/2004): According to the W3C Draft on Handling Bi-Directional Text, you can mostly get away without using the <bdo> element, thanks to the Unicode Bi-Directional Algorithm and the super-secret character entities, &rlm; (Right-to-Left Mark) and &lrm; (Left-to-Right Mark), which let you control how neutral characters, like punctuation marks, are treated. E.g. compare 1705 רחוב בן יהודה. (typed straight) with “1705 רחוב בן יהודה.” (uses some astutely-placed &rlm;s). Note: Safari screws this stuff up pretty badly; there are serious bugs in WebCore’s bidi implementation. There are also useful documents on Specifying the Language of Content and the ever-popular subject of Character Encodings (via Phil). ]
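For the curious, here is a small Python sketch of what those two marks are, using only the standard html and unicodedata modules. The helper name bidi_class is made up, and the placement of the marks around the address is just one plausible guess; the update above doesn't say where its marks actually went.

```python
import html
import unicodedata

# &rlm; and &lrm; are named HTML entities for two invisible characters.
RLM = html.unescape("&rlm;")  # U+200F RIGHT-TO-LEFT MARK
LRM = html.unescape("&lrm;")  # U+200E LEFT-TO-RIGHT MARK


def bidi_class(ch):
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)


# Hebrew letters and RLM are strongly right-to-left ("R");
# Latin letters and LRM are strongly left-to-right ("L").
# The full stop is printed for comparison; it is not strongly directional.
for label, ch in [("he", "\u05d4"), ("RLM", RLM), ("a", "a"), ("LRM", LRM), (".", ".")]:
    print(label, f"U+{ord(ch):04X}", bidi_class(ch))

# The address from the example, written with escapes to keep the source readable.
address = "1705 \u05e8\u05d7\u05d5\u05d1 \u05d1\u05df \u05d9\u05d4\u05d5\u05d3\u05d4."

# Assumed placement: one mark on each side, so that the number and the
# full stop both sit between right-to-left characters.
marked = RLM + address + RLM
print(len(address), len(marked))
```

The marks add nothing visible. They only give the digits and the trailing full stop a right-to-left neighbour on each side, which is what the bidi algorithm looks at when it decides where to put them.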
All these tags and attributes are allowed in the comments on this blog. The only bad news is with respect to Charsets. This blog uses ISO-8859-1. That handles Western European languages just fine, but doesn’t know anything about non-European languages. So if you enter
<span dir="rtl" xml:lang="he">הבנתי</span>
into the Comment Form and click “PREVIEW”, your browser will convert the text to numeric entities
<span dir="rtl" xml:lang="he">&#1492;&#1489;&#1504;&#1514;&#1497;</span>
which will display correctly, but which is not exactly the easiest thing to edit.
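The conversion is easy to imitate. The sketch below uses Python's "xmlcharrefreplace" error handler as a stand-in for what the browser does on submit; that equivalence is an assumption for illustration, and the function name to_latin1_with_entities is made up.

```python
import html


def to_latin1_with_entities(text):
    """Encode text as ISO-8859-1, replacing any character the charset
    cannot represent with a decimal numeric character reference."""
    raw = text.encode("iso-8859-1", errors="xmlcharrefreplace")
    return raw.decode("iso-8859-1")


original = "\u05d4\u05d1\u05e0\u05ea\u05d9"  # the Hebrew word from the example above
escaped = to_latin1_with_entities(original)
print(escaped)  # &#1492;&#1489;&#1504;&#1514;&#1497;

# Characters that ISO-8859-1 does cover pass through untouched.
print(to_latin1_with_entities("ma vie en rose, caf\u00e9"))

# Decoding the references gives back the original text.
assert html.unescape(escaped) == original
```

Nothing is lost in the round trip, which is why the preview displays correctly. The cost is purely in editability: five letters become thirty-five characters of markup.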
If I converted to UTF-8, presumably, this problem would be solved. Unfortunately, the last time I tried it, the interaction between UTF-8 and MT’s Comment Form was such a horror story that I’m loath to try it again.
Posted by distler at April 24, 2004 1:46 AM
International keyboard input test
אני מבין עכשו.
That was input using a Hebrew keyboard layout under Mac OS X. Clicking on “PREVIEW” converted the Hebrew text to numeric entities. Before conversion, it looked like
After, it looked like