Bogofilter Spam Filter Retraining

Over the last couple of weeks my spam filter has slowly been getting worse. I don’t know if it was corruption in the database, or the spammers getting smarter, or what, but I would wake up to about 30-40 spams that had slipped through into my inbox, and then get another one or two an hour over the day. I figured it was time to do something about it, so I read up a bit on bogofilter and discovered I was two versions behind. I upgraded, then read up on the recommended way to train it.

I tried the technique called training to exhaustion, and it worked great! As far as I can tell, you run a script (included in the bogofilter contrib directory), passing it a set of known spam and a set of known non-spam messages. It then runs each message against its wordlists and checks whether the message comes out as spam or non-spam. If the answer is correct, it moves on to the next message. If not, it re-trains on that message (or re-reads the wordlists, or something) and keeps going until every message is classified properly. I ran this from my home directory (right out of the docs): -fn .bogofilter mail/notspam mail/spam '-o 0.8,0.2'
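To make the idea concrete, here is a rough Python sketch of a train-to-exhaustion loop as I understand it. It is not the contrib script and it does not use bogofilter at all: ToyFilter is a made-up word-count classifier, and train_to_exhaustion, max_passes, and the sample messages are names and data I invented for illustration. The only part that matters is the loop: train only on the messages the filter gets wrong, and repeat until a full pass produces no mistakes.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in classifier; NOT bogofilter's actual algorithm."""

    def __init__(self):
        self.spam_words = Counter()
        self.ham_words = Counter()

    def train(self, text, is_spam):
        # Register the message's words in the matching wordlist.
        target = self.spam_words if is_spam else self.ham_words
        target.update(text.lower().split())

    def is_spam(self, text):
        # Spam if the words were seen more often in spam than in non-spam.
        words = text.lower().split()
        spam_hits = sum(self.spam_words[w] for w in words)
        ham_hits = sum(self.ham_words[w] for w in words)
        return spam_hits > ham_hits


def train_to_exhaustion(filt, corpus, max_passes=20):
    """Train only on misclassified messages until a full pass has no errors.

    corpus is a list of (text, is_spam) pairs. Returns the number of the
    first error-free pass, or None if max_passes was reached first.
    """
    for pass_no in range(1, max_passes + 1):
        errors = 0
        for text, label in corpus:
            if filt.is_spam(text) != label:
                filt.train(text, label)  # correct the filter on this message
                errors += 1
        if errors == 0:
            return pass_no
    return None


# Made-up sample messages, purely for illustration.
corpus = [
    ("cheap pills buy now", True),
    ("meeting notes for monday", False),
    ("buy cheap watches now", True),
    ("lunch on monday", False),
]

filt = ToyFilter()
passes = train_to_exhaustion(filt, corpus)
print("error-free pass:", passes)
```

The max_passes cap is my own addition so the loop cannot run forever on a corpus the toy classifier can never get fully right.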

I was lucky enough to have some 34,000 messages available to work from, and after a fair amount of time and numerous “NN false positives, NN false negatives” messages, it quit. Since then (noonish Thursday) I’ve had one message slip through, which, after 3-4 an hour, is pretty damn good.

So if you use bogofilter, I suggest checking this out, in conjunction with the .17.5 release.