SpamBayes

The instructions provided in this article or section are considered advanced.

You are expected to be knowledgeable in the UNIX shell.
Support for these instructions is not available from DreamHost tech support.
Server changes may cause this to break. Be prepared to troubleshoot this yourself if this happens.
We seriously aren't kidding about this.

Background

DreamHost's built-in anti-spam features are not always sufficient. They sometimes add 5-10 minutes of delay to mail delivery, but the bigger issue is that baseline SpamAssassin is no longer effective.

Spammers have access to SpamAssassin too, and they tune their generators to avoid its filters. The only defense against that is a customized filter trained on your personal inbox.

SpamBayes (SB) is a good choice for such a filter. SpamAssassin does have a Bayesian filter, but SB is easier to install and seems to perform better.

The following steps helped me drop from 100+ spams a day to a handful at most.


Install and train SB

Note: Since the initial training is an iterative process, you may want to run these steps locally before working in your shell account.

SpamBayes is easy to install. Just log in to your shell account and run the following commands. (Check for newer releases first...)

   wget http://sourceforge.net/projects/spambayes/files/spambayes/1.1a6/spambayes-1.1a6.tar.gz
   tar xf spambayes-1.1a6.tar.gz
   cd spambayes-1.1a6
   chmod 755 setup.py
   ./setup.py build
   ./setup.py install --user

Now you need to create a new filter database and start training it. The following uses "ham" and "spam" mbox files. Each of these should contain at least a hundred messages from your inbox, spread over a range of time. The SB FAQ has a related entry.

   $ ~/.local/bin/sb_filter.py -n
   $ ~/.local/bin/sb_filter.py -g ham -s spam > /dev/null
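As a quick sanity check before training, the size of each mbox can be estimated by counting its "From " separator lines. This is only a minimal sketch: the function names are made up for this example, and it assumes well-formed mbox files in which body lines beginning with "From " have been escaped.

```shell
# Count messages in an mbox file by counting "From " separator lines.
# Assumption: body lines starting with "From " are escaped as ">From ".
count_mbox_messages() {
    grep -c '^From ' "$1"
}

# Print the message count for each mailbox given as an argument and
# flag any mailbox with fewer than 100 messages.
check_training_set() {
    for box in "$@"; do
        # grep -c exits non-zero on zero matches or a missing file
        n=$(count_mbox_messages "$box" 2>/dev/null) || n=${n:-0}
        if [ "$n" -lt 100 ]; then
            echo "$box: only $n messages, consider adding more"
        else
            echo "$box: $n messages"
        fi
    done
}
```

Running check_training_set ham spam prints one line per mailbox.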

After completing the training, it is a good idea to test SB before setting it loose on your inbox. A good process is to run SB across a few years of mail archives and tweak the training until it stops flagging good mail and leaves only a minimal number of "unsure" messages. Note that this might find some spam in your archives...

   $ ~/.local/bin/sb_filter.py mailbox | grep X-Spambayes-Classification | sort | uniq -c
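If the header value carries a score after the class name (for example "ham; 0.00"), uniq -c reports every distinct score as a separate row. The sketch below groups by class name only. The function name is made up for this example, and it assumes the class is the first word after the header name.

```shell
# Read filtered mail on stdin and print one count per classification.
# Assumption: the header looks like "X-Spambayes-Classification: <class>[; score]".
summarize_classifications() {
    awk 'tolower($1) == "x-spambayes-classification:" {
             class = $2
             sub(/;.*/, "", class)      # drop a trailing "; score" if present
             count[class]++
         }
         END { for (c in count) print count[c], c }' | sort -k2
}
```

It slots into the pipeline in place of the grep, sort and uniq stages: ~/.local/bin/sb_filter.py mailbox | summarize_classifications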

Set up procmail in your shell account

Once SB is working correctly, it is time to set it up as a mail filter. Run the following in your shell account, after changing both occurrences of user@example.com to your actual email address. For background, see Procmail (and disregard the message about it "no longer working").

echo '"|/usr/bin/procmail -t"' > ~/.forward.postfix
mkdir -p ~/.procmail
cat <<'_EOF' > ~/.procmailrc
# see http://wiki.dreamhost.com/SpamAssassin
# PROCMAIL ENVIRONMENT
# Directory for storing procmail files
PMDIR=$HOME/.procmail

# Procmail log file
LOGFILE=$PMDIR/log

# Shell to use for recipes
SHELL=/bin/sh

# MAIL PROCESSING RECIPES

# catch delivery errors
# break loops and keep them on this shell account
:0
* ^Subject:.*Undelivered Mail Returned to Sender
$HOME/.procmail/

# run through SpamBayes, http://spambayes.sourceforge.net
# use the X-Spambayes-Classification header to prevent loopback
:0fw:spambayes-lock
| $HOME/.local/bin/sb_filter.py

:0efw:spambayes-lock2
| formail -A "X-Spambayes-Classification: unknown; error"

# remove delivery headers that cause the DH servers to think there is an uncontrolled loop
:0fw:formail-lock
| formail -I X-Original-To -I Delivered-To

# forward back to my account
# TODO: filter based on original delivery, to support multiple addresses
:0
| /usr/sbin/sendmail -oi -r user@example.com user@example.com
_EOF


Note: There is one fundamental problem with this process. DreamHost has a sender domain policy that disallows spoofing emails from non-DreamHost addresses. This blocks some messages from being re-forwarded from your shell account back to your email account. It affects mail from gmail, some banks, and various other domains...
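Since ~/.procmailrc is written through a shell heredoc, it is worth confirming that the file ended up as intended before relying on it. The shell expands variables inside a heredoc unless the delimiter is quoted, so a reference like $PMDIR can be replaced by an empty string. The following is a minimal sketch; check_procmailrc is a made-up helper name, and the checks only cover the LOGFILE line and the placeholder address used in this article.

```shell
# Inspect a generated procmailrc (default: ~/.procmailrc).
check_procmailrc() {
    rc=${1:-$HOME/.procmailrc}
    # LOGFILE should still reference PMDIR literally; an empty
    # expansion by the shell would have produced "LOGFILE=/log".
    if grep -q '^LOGFILE=\$PMDIR/log$' "$rc"; then
        echo "LOGFILE ok"
    else
        echo "LOGFILE looks wrong:"
        grep -n '^LOGFILE=' "$rc" || true
    fi
    # The example address must be replaced with a real one.
    if grep -q 'user@example\.com' "$rc"; then
        echo "placeholder address user@example.com still present"
    fi
}
```

Calling check_procmailrc with no arguments checks the file in your home directory.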

Hook SB into your email message filters

See Message_Filters and Shell-linked_E-mail

Go to Panel -> Mail -> Message Filters and add the following rules.

First, move emails with X-Spambayes-Classification: spam in the headers to spam and then stop.

...

Then, move emails with X-Spambayes-Classification in the headers to INBOX and then stop.

Finally, forward emails that match all of the following:

   * Does not contain X-Spambayes-Classification in the headers
   * Does not contain @gmail.com in the from
   * Contains sb-test in the subject

to $user on $shell_hostname and then stop.

Note: the @gmail.com line is meant to avoid the sender domain policy problem. You may want to add other addresses as well.

Note: Some !@#$ spammer may read this and start putting X-Spambayes-Classification in the headers. If that starts happening, then modify your copy of SB to use a different string...
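The order of the three rules matters, so it can help to see them as a single decision. The sketch below mimics that order for a message on stdin. It is only an approximation written for this article: filter_action is a made-up name, the matching is plain substring matching on unfolded header lines, and the real panel filters may match differently.

```shell
# Print the action the three panel rules would take for one message.
# Argument: 1 (default) keeps the temporary sb-test condition, 0 drops it.
filter_action() {
    awk -v need_test="${1:-1}" '
        /^$/ { exit }                       # headers end at the first blank line
        { line = tolower($0) }
        line ~ /^x-spambayes-classification:/ {
            tagged = 1
            if (line ~ /^x-spambayes-classification:[ \t]*spam/) spam = 1
        }
        line ~ /^from:/    && line ~ /@gmail\.com/ { gmail = 1 }
        line ~ /^subject:/ && line ~ /sb-test/     { is_test = 1 }
        END {
            if (spam)        print "move to spam"
            else if (tagged) print "move to INBOX"
            else if (!gmail && (is_test || need_test + 0 == 0))
                             print "forward to shell account"
            else             print "no rule matched"
        }'
}
```

A message that already carries the header never reaches the forward rule, which is what keeps mail coming back from the shell account from looping.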

Test and finish

Send yourself a couple of messages with "sb-test" in the subject line. Use a "ham" body for one, and a "spam" body for the other. Make sure they are delivered and end up in the proper folders.

If this works out, then remove the sb-test condition from the last mail filter rule. Then repeat the test for good measure.

For troubleshooting, look at ~/.procmail/log and ~/.procmail/new/* on your shell account.
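A small helper can show both locations at once. This is a minimal sketch with a made-up function name; it assumes the directory layout created by the ~/.procmailrc above, where the delivery-error recipe stores messages in Maildir format under ~/.procmail/.

```shell
# Show the tail of the procmail log and count messages caught by the
# delivery-error recipe. Argument: procmail directory (default ~/.procmail).
procmail_status() {
    dir=${1:-$HOME/.procmail}
    echo "== last log lines ($dir/log)"
    tail -n 20 "$dir/log" 2>/dev/null || echo "(no log file yet)"
    # Maildir delivery places new messages in the "new" subdirectory.
    count=$(ls "$dir/new" 2>/dev/null | wc -l | tr -d ' ')
    echo "== caught delivery errors"
    echo "$count message(s) in $dir/new"
}
```

Run procmail_status after each test message to see whether the recipes fired.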

When everything is good, you may want to disable the baseline Anti-spam to improve message delivery times.

Todo

  • Better handling of messages that are blocked by the sender domain policy. Maybe implement SRS.
  • Set up mail folders for easier spam and ham training.
  • Encourage DreamHost to install SpamBayes so people can use it on personal email and discussion lists... Hint. Hint.
