Bullet-Proofing II
For the second in my series of “How-To” articles on MovableType, I’ll continue on the theme of bullet-proofing against the inclusion of invalid content. Aside from the content you, yourself, write, there’s stuff other people write that gets included in your blog. Even if you trust yourself to produce valid markup, you can’t necessarily trust others to do the same. Hence the need for bullet-proofing.
Last week, we dealt with comments posted to your blog. In this case, the answer was pretty simple. Since you call the shots, all you have to do is run the comment through the Validator and ask the poster to correct any errors before allowing them to post the comment to your blog. Since Alexei Kosut was kind enough to wrap the W3C Validator Script as a MovableType plugin, the job of setting this up was much simplified.
Next on the list are Trackbacks and Syndicated RSS Feeds. Since these are, by definition, stuff written elsewhere, you don’t have any control over the content. If it’s invalid, you can’t ask the author to correct it; you just have to deal. Consequently, our solution will be more ham-handed.
Let’s look at the snippet of template code for listing a Trackback on my blog (before any bullet-proofing):
<div class="trackback" id="p<$MTPingID$>"> Read the post <a href="<$MTPingURL$>" target="new"><$MTPingTitle$></a><br />
<b>Weblog:</b> <$MTPingBlogName$><br />
<b>Excerpt:</b> <$MTPingExcerpt$><br />
<b>Tracked:</b> <$MTPingDate$></div>
Of the various <$MTPing*$> tags in the above code snippet, the ones supplied by the person who sent the Trackback are <$MTPingURL$>, <$MTPingTitle$>, <$MTPingBlogName$> and <$MTPingExcerpt$>.
Let’s start with the last item. What evil stuff might the <$MTPingExcerpt$> contain? You name it: invalid HTML markup, unescaped entities (e.g., &) and control characters.
“Control characters?” you say, “Who would insert control characters in their blog?” Well, if you copied and pasted the previous sentence from your browser window into the composition window of your blog and posted it, depending on what OS you are using, you probably did just that. The trouble is the way non-ASCII characters (like the curly quotes above) are handled by your OS. If you want to do it right, do a “View Source” on this page and copy from there. Needless to say, most people don’t do it right, and control characters in blogs are as common as dirt.
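To make the mechanism concrete, here is a minimal Python sketch. It assumes one particular failure path (the pasted text arrives as Windows-1252 bytes and the page is then read as ISO-8859-1); other OS and browser combinations may go wrong differently.

```python
import unicodedata

# A curly-quoted phrase, as it might be copied from a browser window.
pasted = "\u201cControl characters?\u201d"

# Assumption: the OS hands the text over as Windows-1252 bytes...
raw = pasted.encode("cp1252")

# ...and the resulting page is later read as ISO-8859-1 (Latin-1).
misread = raw.decode("latin-1")

# Collect every character whose Unicode category is Cc (control character).
controls = [f"U+{ord(ch):04X}" for ch in misread
            if unicodedata.category(ch) == "Cc"]
print(controls)  # ['U+0093', 'U+0094']
```

The two curly quotes have turned into the C1 control characters U+0093 and U+0094, which are not allowed in valid markup.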
<$MTPingURL$> could very well contain unescaped &s, and you never know what people will put in the title of their posts.
So, what to do?
MovableType provides global filters to strip HTML and encode entities, and last week I wrote a plugin to strip control characters. The mt-safe-href plugin takes care of escaping &s in URLs. You can use it to protect your own content with constructions like <$MTEntryBody safe_urls="1"$>, or, as here, to protect just a single URL.
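For readers who want to see what escaping the &s in a URL amounts to, here is a rough Python sketch. It is not the mt-safe-href plugin's actual code; the helper name `escape_bare_ampersands` and the regular expression are illustrative assumptions.

```python
import re

# Match "&" unless it already begins an entity such as &amp;, &#38; or &#x26;.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")

def escape_bare_ampersands(url: str) -> str:
    """Hypothetical helper: escape ampersands that are not already part of an entity."""
    return BARE_AMP.sub("&amp;", url)

raw_url = "http://example.com/mt.cgi?blog_id=1&entry_id=42"
already_safe = "http://example.com/mt.cgi?a=1&amp;b=2"

print(escape_bare_ampersands(raw_url))
# http://example.com/mt.cgi?blog_id=1&amp;entry_id=42
print(escape_bare_ampersands(already_safe))
# http://example.com/mt.cgi?a=1&amp;b=2
```

The negative lookahead is what keeps an already-escaped URL from being escaped a second time.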
Let’s change the above code to:
<div class="trackback" id="p<$MTPingID$>"> Read the post <a href="<$MTPingURL safe_url="1"$>" target="new"><$MTPingTitle strip_controlchars="1" remove_html="1" encode_html="1" $></a><br />
<b>Weblog:</b> <$MTPingBlogName$><br />
<b>Excerpt:</b> <$MTPingExcerpt strip_controlchars="1" remove_html="1" encode_html="1" $><br />
<b>Tracked:</b> <$MTPingDate$></div>
Voila! Bullet-proofed.
Well, … erm … I didn’t do anything to bullet-proof the Blog Name. I haven’t seen an invalid one yet. I’m kinda curious to see whether any exist. A more cautious sort would bullet-proof that one too.
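The effect of chaining the three filters on an excerpt can be sketched in Python. These functions are simplified stand-ins written for illustration, not MovableType's or the plugin's implementations; the regular expressions in particular are assumptions.

```python
import html
import re

def strip_controlchars(text: str) -> str:
    """Drop C0 and C1 control characters, keeping tab, newline and carriage return."""
    return re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]", "", text)

def remove_html(text: str) -> str:
    """Crude tag stripper: delete anything that looks like <...>."""
    return re.sub(r"<[^>]*>", "", text)

def encode_html(text: str) -> str:
    """Escape &, <, > and quotes."""
    return html.escape(text, quote=True)

def bulletproof(text: str) -> str:
    """Apply the three stand-in filters in a fixed order."""
    return encode_html(remove_html(strip_controlchars(text)))

# An excerpt with stray control characters, a bare & and mismatched tags.
excerpt = "Fish \x93&\x94 <b>chips</i> <font color=red>today"
print(bulletproof(excerpt))  # Fish &amp; chips today
```

Whatever the sender put in the excerpt, what comes out is plain text with its special characters escaped, so it cannot break the surrounding markup.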
A similar story with the Syndicated RSS Feeds in my Blogroll. The mt-rssfeed plugin provides the tags
<$MTRSSFeedItemLink$>
<$MTRSSFeedItemTitle$>
<$MTRSSFeedItemDescription$>
These need to be replaced by
<$MTRSSFeedItemLink safe_url="1"$>
<$MTRSSFeedItemTitle strip_controlchars="1" remove_html="1" encode_html="1"$>
<$MTRSSFeedItemDescription strip_controlchars="1" remove_html="1" encode_html="1"$>
Similar techniques should take care of other included content you might have.
That leaves only your own content to validate. Guess that will have to wait for another post.
Update (5/13/2003): I just installed Adam Kalsey’s Technorati plugin. This is another brilliant example of how invalid HTML on other people’s blogs — served up via the Technorati API — can mess with an XML parser (in this case, the one used by Adam’s plugin). I found the plugin practically unusable until I applied this patch, which escapes ampersands. Not a complete bullet-proofing job, but good enough.
Update (5/14/2003): The fix is in.
Update (9/13/2003): Well, it finally happened! I got a trackback with an invalid <$MTPingBlogName$>. I’m afraid, dear readers, that you need to bulletproof that one too.
Update (4/10/2004): I’ve released a new version of the MTStripControlChars plugin, with somewhat more sophisticated behavior.
Update (6/10/2004): Actually, with very recent versions of Perl, there is a problem with the technique explained above. With previous versions of Perl, the global filters would be executed in a predictable order. Not so any longer! If you have an MT tag with multiple filters applied to it, they will execute in a random order. If those filters “conflict” in some way, you will get random problems when you rebuild your pages. Sometimes it will work right, sometimes it won’t. To fix this, we need to enforce a certain order of execution of the filters. In particular, we want the strip_controlchars filter to execute before the encode_html filter. To do this, we use the MTBlock plugin. For instance, we want to write
<b>Excerpt:</b> <MTBlock encode_html="1"><$MTPingExcerpt remove_html="1" strip_controlchars="2"$></MTBlock><br />
in the third line of the above code snippet (and similarly for other occurrences of strip_controlchars and encode_html).
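Why the order matters can be shown with a small Python sketch. It does not reproduce the strip_controlchars conflict itself; as a stand-in it uses simplified versions of remove_html and encode_html, a pair whose results obviously differ depending on which runs first. All function bodies here are assumptions made for illustration.

```python
import html
import re
from itertools import permutations

def remove_html(text: str) -> str:
    """Crude tag stripper: delete anything that looks like <...>."""
    return re.sub(r"<[^>]*>", "", text)

def encode_html(text: str) -> str:
    """Escape &, < and >."""
    return html.escape(text, quote=False)

FILTERS = {"remove_html": remove_html, "encode_html": encode_html}

def apply_in_order(text: str, names) -> str:
    """Run the named filters one after another, in the order given."""
    for name in names:
        text = FILTERS[name](text)
    return text

excerpt = "<b>R&D</b> update"

# Try every possible execution order of the two filters.
results = {order: apply_in_order(excerpt, order) for order in permutations(FILTERS)}
for order, out in results.items():
    print(" -> ".join(order), "gives", out)
# remove_html -> encode_html gives R&amp;D update
# encode_html -> remove_html gives &lt;b&gt;R&amp;D&lt;/b&gt; update

# Nesting one call inside the other pins the order, much as wrapping
# the inner tag in an outer block does in the template.
nested = encode_html(remove_html(excerpt))
print(nested)  # R&amp;D update
```

If the filters are picked in an unpredictable order, the page gets one of these two results at random on each rebuild; nesting removes the choice.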
Update (2/20/2005):
With my new, “internationalized” trackback setup, bulletproofing trackbacks is a bit easier. Escaping is handled internally, so all we need to do in the templates is
<div class="trackback" id="p<$MTPingID$>"> Read the post <a href="<$MTPingURL safe_url="1"$>" target="new"><$MTPingTitle remove_html="1" strip_controlchars="2"$></a><br />
<b>Weblog:</b> <$MTPingBlogName remove_html="1" strip_controlchars="2"$><br />
<b>Excerpt:</b> <$MTPingExcerpt remove_html="1" strip_controlchars="2"$><br />
<b>Tracked:</b> <$MTPingDate$></div>