Blogspam Update
It’s about time to follow up on my previous articles on Comment Spam. No real surprises, but spambots have gotten better at what they do.
Up till now, I’ve talked about 3 basic moves for combatting robot-posted comment spam:
- Rename mt-comments.cgi.
- Make sure the new comment script doesn’t get indexed by Google.
- Ditch the comment-entry form on your individual archive page. Make people follow a link to get to the comment-entry form.
The combination of these steps makes it hard for a spambot to find your comment script. And if it can’t find it, it can’t spam you. They also make it possible for you to lead the spambot astray (more on that presently).
You’ve made it hard, but not impossible. Consider the following recent “visitor” to my blog:
proxy1.anon-online.org - - [07/Jan/2004:18:29:31 -0600] "GET /~distler/blog/archives/000080.html HTTP/1.0" 200 13442 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98) Java/1.4.1_03"
proxy1.anon-online.org - - [07/Jan/2004:18:29:38 -0600] "GET /cgi-bin/MT-2.5/sxp-comments.pl?entry_id=80;parent_id=38 HTTP/1.0" 200 10035 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98) Java/1.4.1_03"
proxy1.anon-online.org - - [07/Jan/2004:18:29:43 -0600] "POST //cgi-bin/mt-2.5/sxp-comments.pl HTTP/1.0" 200 3922 "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl?entry_id=80;parent_id=38" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
Ignore the bogus USER_AGENT string. This was a spambot. It came in to an individual archive page on my blog, searched for a hyperlink with the string “?entry_id=” in it, followed that link to my comment-entry form, and posted a comment. No pretending to be human by downloading an image or CSS file, just straight to the point. Brutal, efficient, … and (in this case) futile.
I require comment validation on my blog. Posting without validating your comment first lands you in my IP-ban list. Humans have no trouble with the procedure, but robots aren’t expecting the extra hurdle and are tripped up.
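The tell-tale pattern in the log excerpt above (a POST to the comment script from a client that never fetched a stylesheet, script or image) can also be spotted mechanically. What follows is only a rough sketch, assuming the log is in Apache's "combined" format as shown; the function name, the suffix list and the default script name are my own choices for illustration, and the absence of static fetches is a hint, not proof (a human with a warm proxy cache would look the same).

```python
import re

# Apache "combined" log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Assumed list of "page furniture" a real browser would normally request.
STATIC_SUFFIXES = (".css", ".js", ".jpg", ".png", ".gif")


def suspicious_posters(lines, script_name="sxp-comments.pl"):
    """Return hosts that POSTed to the comment script without ever
    requesting a static asset anywhere in the given log lines."""
    fetched_static = set()
    posted = set()
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the expected format
        host = match.group("host")
        # Drop the query string and compare case-insensitively: the bot
        # above requested "mt-2.5" where the real path is "MT-2.5".
        path = match.group("path").split("?", 1)[0].lower()
        if path.endswith(STATIC_SUFFIXES):
            fetched_static.add(host)
        elif match.group("method") == "POST" and path.endswith(script_name):
            posted.add(host)
    return sorted(posted - fetched_static)
```

Feeding it an access log, for example `suspicious_posters(open("access_log"))` with a hypothetical file name, gives a list of hosts worth a closer look.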
Of course, there are other things I could have done. I could have put a link to a “honeypot” script on my individual archive page:
<div style="display:none">
Clicking on the link below will get you permanently banned from posting comments to this weblog. Don't try it!<br />
<a href="/cgi-bin/MT-2.5/nomore-comments.pl?entry_id=80">Don't click here to post a comment</a>
</div>
and tried to fool the robot into following that instead. And there are enough other tricks we could play that I’m still pretty sanguine that we hold the upper hand against spambots.
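For concreteness, here is roughly what the server side of such a honeypot might do. This is a sketch in Python rather than the Perl a real MT install would use, and everything in it (the `BanList` class, `handle_request`, the status codes returned) is hypothetical; only the two script names come from the examples above.

```python
import ipaddress


class BanList:
    """In-memory IP ban list (a real one would be persisted somewhere)."""

    def __init__(self):
        self._banned = set()

    def ban(self, remote_addr):
        self._banned.add(ipaddress.ip_address(remote_addr))

    def is_banned(self, remote_addr):
        return ipaddress.ip_address(remote_addr) in self._banned


def handle_request(path, remote_addr, bans,
                   honeypot="nomore-comments.pl",
                   comment_script="sxp-comments.pl"):
    """Return an HTTP status code for a request to one of the CGI scripts."""
    # Reduce "/cgi-bin/MT-2.5/foo.pl?entry_id=80" to "foo.pl".
    script = path.split("?", 1)[0].rsplit("/", 1)[-1]
    if script == honeypot:
        # Only something that ignores display:none and the warning text
        # should ever end up here, so ban it on the spot.
        bans.ban(remote_addr)
        return 403
    if script == comment_script and bans.is_banned(remote_addr):
        return 403
    return 200
```

A client that touches the honeypot once is refused by the real comment script from then on, while everyone else is unaffected.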
Unfortunately, we’re not (just) faced with robots. Consider this visitor:
210.18.114.210.sify.net - - [06/Jan/2004:05:07:26 -0600] "GET /~distler/blog/archives/000236.html HTTP/1.1" 200 44050 "http://www.google.com/search?q=blog/archives+post&hl=en&lr=&ie=UTF-8&oe=UTF-8&start=70&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:07:27 -0600] "GET /~distler/blog/aural.css HTTP/1.1" 200 523 "http://golem.ph.utexas.edu/~distler/blog/archives/000236.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:07:33 -0600] "GET /~distler/blog/styles-site.css HTTP/1.1" 200 13326 "http://golem.ph.utexas.edu/~distler/blog/archives/000236.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:07:53 -0600] "GET /~distler/blog/print.css HTTP/1.1" 200 844 "http://golem.ph.utexas.edu/~distler/blog/archives/000236.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:07:58 -0600] "GET /~distler/blog/ie.js HTTP/1.1" 200 2248 "http://golem.ph.utexas.edu/~distler/blog/archives/000236.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:08:03 -0600] "GET /~distler/blog/images/bigthinker.jpg HTTP/1.1" 200 1443 "http://golem.ph.utexas.edu/~distler/blog/archives/000236.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:08:51 -0600] "GET /~distler/blog/archives/000237.html HTTP/1.1" 200 12389 "http://golem.ph.utexas.edu/~distler/blog/archives/000236.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:08:52 -0600] "GET /~distler/blog/aural.css HTTP/1.1" 304 - "http://golem.ph.utexas.edu/~distler/blog/archives/000237.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:08:53 -0600] "GET /~distler/blog/print.css HTTP/1.1" 304 - "http://golem.ph.utexas.edu/~distler/blog/archives/000237.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:08:53 -0600] "GET /~distler/blog/styles-site.css HTTP/1.1" 304 - "http://golem.ph.utexas.edu/~distler/blog/archives/000237.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:08:58 -0600] "GET /~distler/blog/ie.js HTTP/1.1" 304 - "http://golem.ph.utexas.edu/~distler/blog/archives/000237.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:08:58 -0600] "GET /~distler/blog/archives/000235.html HTTP/1.1" 200 8227 "http://golem.ph.utexas.edu/~distler/blog/archives/000236.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:08:59 -0600] "GET /~distler/blog/images/bigthinker.jpg HTTP/1.1" 304 - "http://golem.ph.utexas.edu/~distler/blog/archives/000237.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:08:59 -0600] "GET /~distler/blog/images/MathML.png HTTP/1.1" 200 3238 "http://golem.ph.utexas.edu/~distler/blog/archives/000237.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:09:06 -0600] "GET /~distler/blog/print.css HTTP/1.1" 304 - "http://golem.ph.utexas.edu/~distler/blog/archives/000235.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:09:13 -0600] "GET /~distler/blog/styles-site.css HTTP/1.1" 304 - "http://golem.ph.utexas.edu/~distler/blog/archives/000235.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:09:13 -0600] "GET /~distler/blog/aural.css HTTP/1.1" 304 - "http://golem.ph.utexas.edu/~distler/blog/archives/000235.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:09:36 -0600] "GET /cgi-bin/MT-2.5/sxp-comments.pl?entry_id=237 HTTP/1.1" 200 13992 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:09:41 -0600] "GET /~distler/blog/aural.css HTTP/1.1" 304 - "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl?entry_id=237" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:09:41 -0600] "GET /~distler/blog/print.css HTTP/1.1" 304 - "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl?entry_id=237" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:09:41 -0600] "GET /~distler/blog/styles-site.css HTTP/1.1" 304 - "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl?entry_id=237" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:09:43 -0600] "GET /~distler/blog/images/smallthinker.jpg HTTP/1.1" 200 554 "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl?entry_id=237" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:09:46 -0600] "GET /~distler/blog/images/MathML.png HTTP/1.1" 304 - "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl?entry_id=237" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:09:46 -0600] "GET /cgi-bin/MT-2.5/sxp-comments.pl?entry_id=235 HTTP/1.1" 200 9108 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:10:33 -0600] "POST /cgi-bin/MT-2.5/sxp-comments.pl HTTP/1.1" 200 4559 "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl?entry_id=235" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:10:58 -0600] "POST /cgi-bin/MT-2.5/sxp-comments.pl HTTP/1.1" 302 - "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:11:27 -0600] "GET /~distler/blog/archives/000235.html HTTP/1.1" 200 8880 "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:11:29 -0600] "GET /~distler/blog/ie.js HTTP/1.1" 304 - "http://golem.ph.utexas.edu/~distler/blog/archives/000235.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:11:34 -0600] "GET /~distler/blog/images/bigthinker.jpg HTTP/1.1" 304 - "http://golem.ph.utexas.edu/~distler/blog/archives/000235.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:12:02 -0600] "GET /cgi-bin/MT-2.5/sxp-comments.pl?entry_id=237 HTTP/1.1" 200 13992 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:12:09 -0600] "GET /~distler/blog/aural.css HTTP/1.1" 304 - "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl?entry_id=237" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:12:09 -0600] "GET /~distler/blog/print.css HTTP/1.1" 304 - "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl?entry_id=237" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:12:10 -0600] "GET /~distler/blog/styles-site.css HTTP/1.1" 304 - "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl?entry_id=237" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:12:10 -0600] "GET /~distler/blog/images/smallthinker.jpg HTTP/1.1" 304 - "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl?entry_id=237" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:12:11 -0600] "GET /~distler/blog/images/MathML.png HTTP/1.1" 304 - "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl?entry_id=237" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:12:50 -0600] "POST /cgi-bin/MT-2.5/sxp-comments.pl HTTP/1.1" 200 6539 "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl?entry_id=237" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:13:26 -0600] "GET /cgi-bin/MT-2.5/sxp-comments.pl?entry_id=235 HTTP/1.1" 200 8127 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:13:32 -0600] "GET /cgi-bin/MT-2.5/sxp-comments.pl?entry_id=235 HTTP/1.1" 200 9858 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:13:13 -0600] "POST /cgi-bin/MT-2.5/sxp-comments.pl HTTP/1.1" 302 - "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:14:08 -0600] "POST /cgi-bin/MT-2.5/sxp-comments.pl HTTP/1.1" 200 5321 "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl?entry_id=235" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:14:58 -0600] "POST /cgi-bin/MT-2.5/sxp-comments.pl HTTP/1.1" 302 - "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:15:28 -0600] "GET /~distler/blog/archives/000235.html HTTP/1.1" 200 9527 "http://golem.ph.utexas.edu/cgi-bin/MT-2.5/sxp-comments.pl" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
210.18.114.210.sify.net - - [06/Jan/2004:05:15:32 -0600] "GET /~distler/blog/ie.js HTTP/1.1" 304 - "http://golem.ph.utexas.edu/~distler/blog/archives/000235.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
I’m pretty convinced this was a human. He came in on a Google search, wandered about, and then posted 3 comments to my blog before heading elsewhere. His ISP is in Chennai, India. Searching my logs, the same fellow had unsuccessfully visited twice before. He used different Google searches which, unfortunately for him, landed him on my (nonexistent) mt-comments.cgi page.
This time he got lucky. And he posted the only comment spam I’ve received since I implemented the above scheme three months ago. (Well, almost; I did receive one more spam in early December. Similar MO, but from an IP address belonging to interbusiness.it in Italy.)
4 Spams in 3 months ain’t bad, but it was enough to make me take another look at MT-Blacklist. MT-Blacklist takes a different approach. It doesn’t try to distinguish between robot and human posters. It just filters on banned content, usually the URLs of web sites being hawked by spammers.
This never seemed to me to be an approach that would scale well. The initial release of MT-Blacklist had some 400 RegExps that were banned. Three months later, the list has grown to over 600. And it keeps on growing. Perl is very fast at crunching through Regular Expressions, but it seems to me that the blog owner is still the one holding the short end of the computational stick. I’ll keep looking at it, though…
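To make the scaling worry concrete, here is the general shape of content filtering, sketched in Python. This is not MT-Blacklist's actual code, and the two patterns are invented placeholders; the point is only that every incoming comment gets tested against the list until something matches, so a clean comment pays for every pattern on it.

```python
import re

# Hypothetical blacklist entries; the real list has hundreds of these.
BLACKLIST = [
    r"cheap-pills\.example",
    r"casino[\w-]*\.example\.com",
]


def compile_blacklist(patterns):
    """Compile each pattern once, up front, rather than once per comment."""
    return [re.compile(p, re.IGNORECASE) for p in patterns]


def first_banned_match(text, compiled):
    """Return the first banned string found in text, or None if it is clean.

    A clean comment is the worst case: it has to be scanned by every
    pattern in the list before it can be accepted.
    """
    for pattern in compiled:
        match = pattern.search(text)
        if match is not None:
            return match.group(0)
    return None
```

With the list compiled once, `first_banned_match(comment_body, compiled)` is then called for every submitted comment.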
The reason why I’m interested in MT-Blacklist is not just to stymie the odd person toiling away in Chennai or Bangkok, churning out Comment Spam manually from his PC. I’m interested because I’m worried about Trackback Spam.
It hasn’t happened in a big way yet, but sooner or later, spammers are going to turn to sending trackbacks, rather than posting comments. And, unlike comments, trackbacks are designed to be sent and received in a purely automated fashion. So the sort of tricks one would use to fool Comment Spam robots would not be applicable.
The only technical measure I can think of to fight Trackback Spam is to parse out the hostname of the TBPingURL, do a DNS lookup of it, and demand that it match the TBPingIP (the IP address of the host that sent the ping).
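In outline, the check might look like the following Python sketch. The function names are mine, `ping_url` and `sender_ip` stand in for the TBPingURL and TBPingIP values above, and it only handles IPv4; it is meant to illustrate the idea, not to be dropped into MT.

```python
import socket
from urllib.parse import urlsplit


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses for hostname (empty if lookup fails)."""
    try:
        _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    except (OSError, UnicodeError):
        return set()
    return set(addresses)


def ping_matches_sender(ping_url, sender_ip, resolve=resolve_ipv4):
    """True if the host named in ping_url resolves to the address that
    actually sent the trackback ping.

    `resolve` is injectable so the check can be exercised without
    touching the network.
    """
    hostname = urlsplit(ping_url).hostname
    if not hostname:
        return False  # no usable hostname in the URL: reject
    return sender_ip in resolve(hostname)
```

A ping would be accepted only when `ping_matches_sender(url, remote_addr)` is true. Comparing against the full set of addresses, rather than a single one, avoids rejecting hosts with several A records.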
On the positive side, spammers would no longer be able to send the trackback ping from anywhere on the internet. They would have to send the ping from the web site they were advertising, which would be much more easily blocked (DNSBL-style, if necessary). On the other hand, it would break 3rd-party Trackback servers, like reedmaniac.
Is that too high a price? Are there other drawbacks? Thoughts?
Update (1/12/2004): MovableType 2.66 has been released. It introduces some basic anti-spam measures: comment throttling and turning the comment-author URL link into a redirect (presumably depriving the spammer of the Google PageRank boost). Unfortunately, their redirection code fails miserably if you are serving your pages as application/xhtml+xml. Here’s a patch to fix the matter.
Update (1/16/2004): MovableType 2.661 fixes one XHTML issue, but introduces another one. I’ve updated my patch to fix what they broke.
Re: Blogspam Update
It seems that for the (as yet only potential) issue of trackback spam, the AWStatsReferrers plugin for MT could be the right type of solution, unless I am missing something. It purports (I have not used it) to check the referrer URL for an actual reference to the site. Checking that the site advertised by the trackback ping exists (as you mention), and that it does have a reference to the trackbacked page, should help screen the majority of cases. Unfortunately, there are of course situations where a trackback might legitimately be sent to a site without a direct link reference, so maybe this would not work as well as I thought… oh well. Perhaps a first line of defense, to flag a potential trackback spam for review?