December 07, 2005

Divorce, Internet Style

After weeks of slacking I finally decided I couldn't take it anymore. My internet karma was surely growing blacker by the minute. I could not allow this blog to be infested with spam.

So, I went regular expression fishing, something I have to admit I enjoy on an overcast morning. I typed words and patterns into Blacklist, trying to find the sweet spot -- the patterns which returned the largest amount of old spam with the fewest false positives.

Spam is such a new and interesting phenomenon; if there isn't an army of corpus linguists already doing research into it, there should be. For one thing, spam gives essentially random words and phrases ominous, incantatory weight. Who would have ever thought that "ringtone" would become a dirty word? Yet I have this superstitious sense of dread just typing it out.

Spam is probably the single largest incentive in human history to develop new kinds of nonsense patterns. Gigantic lists of words developed by computer scientists for experiments into natural language processing and AI have really come into their own as noise generators for spammers. And in general, spam is the only human activity I can think of where incoherence and obfuscation are the key to success.

I can't help but find something slightly chilling about the use of a channel for human communication in which the most important thing is to make sure nothing is communicated. Well, no, that would be one thing. What I find a little creepy is that spam has to use nonsense, but it also needs to be similar enough to human speech to trick machines. It shows us human language through a computer's eyes. Unlike a lot of email spam, comment spam is truly parasitic: it does not address the hapless blog owner, but a machine known as Google.

Spam also has the intriguing effect of loudly advertising what before presumably would remain in the shadows. Who would ever have guessed that someone would maintain a series of websites on every imaginable flavor of divorce? Canadian divorce, military divorce, pretty much any kind of divorce you're into, they've got it. I want to meet the person who maintains these sites.

I didn't take careful notes of the results of my fishing, but here are some patterns which worked surprisingly well:

hi
hello
cool
interesting
asd, sdf, etc. (patterns bound to appear when someone mashes the keyboard)
/ (letter)
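
For anyone who wants to try this kind of fishing outside of Blacklist, here is a minimal sketch in Python of the basic idea: count how many old spam comments a pattern catches and how many legitimate comments it catches by mistake. This is not how Blacklist itself works. The function name `score_pattern` and the sample comments are made up for illustration, and the exact regular expressions are my guesses at how the patterns above might be written.

```python
import re

def score_pattern(pattern, spam_comments, good_comments):
    """Return (spam hits, false positives) for one candidate pattern.

    Hypothetical helper for illustration; matching is case-insensitive.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    hits = sum(1 for c in spam_comments if rx.search(c))
    false_positives = sum(1 for c in good_comments if rx.search(c))
    return hits, false_positives

# Made-up sample data standing in for a real comment archive.
spam_comments = [
    "hi",
    "Cool site! asdf",
    "interesting, buy ringtones here",
    "hello",
]
good_comments = [
    "That's an amazing and rather wonderful thought.",
    "I found this morning's spam reports to be interesting indeed.",
]

# Candidate patterns, loosely based on the list above.
candidates = [
    r"^\s*(hi|hello)\b",   # greeting at the very start of the comment
    r"\bcool\b",
    r"\binteresting\b",
    r"asd|sdf",            # keyboard mashing
]

for pattern in candidates:
    hits, fps = score_pattern(pattern, spam_comments, good_comments)
    print(f"{pattern!r}: {hits} spam caught, {fps} false positives")
```

Anchoring short words matters: a bare `hi` would also match the "hi" inside "this" in a legitimate comment, which is exactly the kind of false positive the fishing is meant to avoid.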

By the way, we all use "interesting" far too often. If we would agree to stop using it, weeding out old spam would be much easier. Besides, it's such a filler word. In fact, if a word shows up consistently in spam (and it isn't related to genitalia or brand names), then it's a good bet that it's an overused filler word.

Incidentally, I also found a lot of one-off comments calling us idiots, most of which I suspect we never see because they were added long after the original post.

Posted by Alan Hogue at December 7, 2005 11:58 AM
Comments

Spam is probably the single largest incentive in human history to develop new kinds of nonsense patterns.

That's an amazing and rather wonderful thought. Reminds me of the suggestion that Woody Guthrie's degenerative medical condition, the Huntington's Chorea, stimulated his creativity by forcing his brain to keep developing new paths around the equivalent of broken nerve transmitters.

BTW can you do anything about the very large number of different spams, from different ostensible sources, that share the text phrase "May be this is BAD but is something different"?

Posted by: Martha Bridegam at December 7, 2005 04:02 PM

I found this morning's spam reports to be interesting indeed. (Sorry, Alan) It seems someone's spamming with legitimate, non-spam URLs like the one for the Star-Tribune and such, and no actual URLs to their own gambling site. Any guesses what that's about?

Posted by: Ben Brumfield at December 7, 2005 05:36 PM

BTW can you do anything about the very large number of different spams, from different ostensible sources, that share the text phrase "May be this is BAD but is something different"?

Well, you could put the whole phrase in Blacklist, but then once that caught on the spammers would come up with something else, maybe only slightly different. And anyway I haven't seen any examples of that particular style lately. Maybe it's already defunct?


It seems someone's spamming with legitimate, non-spam URLs like the one for the Star-Tribune and such, and no actual URLs to their own gambling site. Any guesses what that's about?

It is interesting (there I go) and I think a new thing. If you look closely, most but not all of them have one URL in there that is clearly spam. It appears to be a new way of thwarting inattentive bloggers. This is probably the biggest innovation in comment spam that I've seen so far in my career as an obsessed anti-spam...uh...person.

Posted by: Alan Hogue at December 7, 2005 07:44 PM

"Anti-Spam Warrior?"

Posted by: Martha Bridegam at December 7, 2005 10:58 PM

Yeah! I like that. I'll dress up like Marc Bolan and zap spam with my anti-spam electric guitar. Take that, animal sex! Pow!

This morning's crop included a site apparently designed with the animal sex enthusiast in mind, by the way. I just don't see why people'd make a website for this when all you have to do is watch PBS on a Sunday afternoon. I guess some people can't get enough.

Posted by: Alan Hogue at December 8, 2005 09:42 AM