Bayesian Plugin for SpamPal

1. Introduction

The Bayesian Plugin is a semi-intelligent solution for recognising spam. Over time, it learns from incoming email messages and the words they contain, assigning each word a "spam" and a "clean" probability. It then filters emails based on these probabilities, producing an overall score for each email.

2. Filtering Process

Each word in the incoming email is scored based on its occurrence in previous emails.
- A word that has only appeared in emails classified as spam receives a ratio of 0.99
- A word that has only appeared in clean emails receives a ratio of 0.0
- Any unknown word is given a default ratio of 0.2
The most significant words are then selected: the Word Count words whose ratios lie furthest from 0.5, i.e. those with the largest abs(ratio - 0.5).
The ratios of these words are combined:

spamratio = ratio_1 * ratio_2 * ... * ratio_wordcount
cleanratio = (1 - ratio_1) * (1 - ratio_2) * ... * (1 - ratio_wordcount)
score = 100 * spamratio / (spamratio + cleanratio)
The email is then classified from this score: a score higher than the Spam Threshold is spam, anything else is clean.
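
The scoring steps above can be sketched in Python as follows. This is only an illustration of the formulas in this section, not the plugin's actual code: the function names, the `known_ratios` dictionary, counting each distinct word once, and the neutral fallback of 50 are all assumptions.

```python
UNKNOWN_RATIO = 0.2  # default for unknown words, as stated above


def score_email(words, known_ratios, word_count=10):
    """Return a 0-100 score from the most significant word ratios."""
    # Assumption: each distinct word is counted once per email.
    unique_words = list(dict.fromkeys(words))
    ratios = [known_ratios.get(w, UNKNOWN_RATIO) for w in unique_words]
    # Most significant first: furthest from the neutral value 0.5.
    ratios.sort(key=lambda r: abs(r - 0.5), reverse=True)

    spam_ratio = 1.0
    clean_ratio = 1.0
    for r in ratios[:word_count]:
        spam_ratio *= r
        clean_ratio *= 1.0 - r

    total = spam_ratio + clean_ratio
    if total == 0.0:
        return 50.0  # assumption: no evidence either way is treated as neutral
    return 100.0 * spam_ratio / total


def is_spam(score, spam_threshold=90):
    """A score higher than the Spam Threshold is spam."""
    return score > spam_threshold
```

For example, an email containing only "advert", "bread" and "credit" with the ratios 0.99, 0.483210 and 0.99 scores well above the default Spam Threshold of 90, because the two strongly spam-like words dominate the product.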

Every email is marked with:

X-Bayesian-Result: Spam / Clean
X-Bayesian-Words: advert 0.9900001 bread 0.483210 credit 0.9900001 ...
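
If you want to inspect these headers outside SpamPal, something like the following Python sketch may help. It uses the standard `email` module to read the headers and assumes the "X-Bayesian-Words" value is a whitespace-separated sequence of word/ratio pairs, as in the example above; the function names are illustrative only.

```python
from email import message_from_string


def parse_bayesian_words(header_value):
    """Split 'word ratio word ratio ...' into a list of (word, ratio) pairs."""
    tokens = header_value.split()
    pairs = []
    # Walk the tokens two at a time; stop at the first pair that is not numeric.
    for i in range(0, len(tokens) - 1, 2):
        try:
            pairs.append((tokens[i], float(tokens[i + 1])))
        except ValueError:
            break
    return pairs


def read_bayesian_headers(raw_message):
    """Return (result, word_pairs) from a raw email message string."""
    msg = message_from_string(raw_message)
    result = msg.get("X-Bayesian-Result")
    words = parse_bayesian_words(msg.get("X-Bayesian-Words", ""))
    return result, words
```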

3. Postprocessing

Each word in the current email is added to the database of known clean/spam words (depending on the Learning Threshold and the Spam Threshold).

if(score > learningthreshold) total_number_of_spam_words++
if(score < spamthreshold) total_number_of_clean_words++
spamratio = spam_occurrences / total_number_of_spam_words
cleanratio = clean_occurrences / total_number_of_clean_words
new_word_ratio = 100 * spamratio / (spamratio + cleanratio)

If the score falls between the Spam Threshold and the Learning Threshold (neither definitely spam nor definitely clean), the word is not added to the database.
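
The learning step above might look like this in Python. It is a sketch under several assumptions rather than the plugin's real implementation: the `db` dictionary layout and function names are invented for illustration, the totals are incremented per word as in the pseudocode, and the ratio is kept as a fraction capped at 0.99 (matching the values quoted in section 2) instead of being multiplied by 100.

```python
MAX_RATIO = 0.99  # assumption: cap taken from the 0.99 quoted for spam-only words


def learn_email(words, score, db, spam_threshold=90, learning_threshold=99):
    """Update the hypothetical `db` dict in place from one scored email.

    Returns "spam", "clean", or None when the score lies between the thresholds.
    """
    # The pseudocode uses ">" here; the option description says
    # "greater than or equal". This sketch follows the pseudocode.
    if score > learning_threshold:
        kind = "spam"
    elif score < spam_threshold:
        kind = "clean"
    else:
        return None  # neither definitely spam nor definitely clean: learn nothing

    for word in words:
        db["totals"][kind] += 1  # mirrors total_number_of_*_words++
        counts = db["words"].setdefault(word, {"spam": 0, "clean": 0})
        counts[kind] += 1
    return kind


def new_word_ratio(word, db):
    """Recompute a word's ratio as a fraction between 0 and 0.99."""
    counts = db["words"][word]
    totals = db["totals"]
    spam_ratio = counts["spam"] / totals["spam"] if totals["spam"] else 0.0
    clean_ratio = counts["clean"] / totals["clean"] if totals["clean"] else 0.0
    if spam_ratio + clean_ratio == 0.0:
        return None  # the word has no recorded occurrences
    return min(MAX_RATIO, spam_ratio / (spam_ratio + clean_ratio))
```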

4. Plugin Window

Select "Plugins" -> "Bayesian Filter" from the right-click menu of the SpamPal tray icon (the umbrella near the taskbar clock).
Each email processed by the plugin is listed here (for the current SpamPal session only) so that it can be reclassified if necessary. A red icon means the email was classified as spam; a green one means clean.
Functionality of the buttons:
- Spam
  Mark the currently selected emails as spam
- Clean
  Mark the currently selected emails as clean
- Selected
  Remove the currently selected emails
- All
  Remove all emails

5. Configuration Options

Thresholds
- Spam Threshold
  Increasing this reduces the number of false classifications; decreasing it makes the filter tag more email as spam. Any word with a ratio below this threshold is considered a clean word.
  Default value 90
- Learning Threshold
  Any word with a ratio greater than or equal to this is added to the database classed as spam.
  Default value 99
- Limit message processing
  Only process the first part of an email (see "Amount of message to process")
- Amount of message to process (kb)
  Limit the amount of an individual email that is processed. Set this to avoid timeouts when the plugin takes too long to process large emails.
Words
- Word Count
  The number of significant words examined during classification of an email.
  Checking fewer words makes the filter more "trigger-happy"; checking more words means more spam words must be present for an email to be classified as spam.
  Default value 10
- Min/Max word length
  Set the minimum and maximum size of word that is used during filtering
- Word expiry
  Every word is tagged with the time it was last encountered. This threshold ensures that words that haven't occurred recently are removed from the database.
  If a word has not appeared for X days (the word expiry), its spam and clean occurrence counts are each decremented once per day until they reach zero. When both reach zero, the word is removed from the database.
- Minimum word occurrence for filtering
  Sets the minimum number of times a word has to appear before it is used in filtering. A low setting will make the plugin more "trigger-happy", letting it mark emails based on less data.
- Incoming words are case-sensitive
  If unselected, all incoming email is converted to lower case before filtering.
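
The word expiry rule described above can be sketched as a daily maintenance pass. This is an illustration only: the `words` dictionary layout and the function name are hypothetical, and the timestamps are assumed to be Unix times in seconds.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def expire_words(words, expiry_days, now=None):
    """Apply one daily expiry pass to `words` in place.

    `words` maps word -> {"spam": int, "clean": int, "last_seen": unix_time}.
    """
    now = time.time() if now is None else now
    cutoff = now - expiry_days * SECONDS_PER_DAY
    for word in list(words):  # copy the keys so entries can be deleted while looping
        entry = words[word]
        if entry["last_seen"] >= cutoff:
            continue  # seen recently enough, leave untouched
        # Decrement both counts once, never going below zero.
        entry["spam"] = max(0, entry["spam"] - 1)
        entry["clean"] = max(0, entry["clean"] - 1)
        if entry["spam"] == 0 and entry["clean"] == 0:
            del words[word]
```

Running such a pass once per day would gradually remove words that no longer occur, while a word that reappears gets a fresh timestamp and stops decaying.
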
Options
- Create log file
  Turns on/off logging
- Learn (don't mark spam)
  The plugin does everything it normally does, except that it does not mark any email as spam. This lets the filter "learn" your email without an initial period in which it might mark a lot of email as spam before it "knows" your email.
  Don't forget to turn this option off when you think the filter has seen enough of your email ;-)
- Assume whitelisted email is clean
  Selecting this means that the plugin will score any whitelisted email as zero, i.e. perfectly clean.
- Learn from whitelisted emails
  Whether words found in whitelisted emails are added to the database (Use in conjunction with the above).
  Example: You may be subscribed to a mailing list about spam (containing words that would be scored as spam) that you have whitelisted. If whitelisted email is considered clean then the words in these emails would be added to the database as clean. This option allows you to stop that happening.
- Include headers in filtering
  Select this if you want to include all of the email's headers in the Bayesian filtering.
- Add "X-Bayesian-Words" header
  Whether to add the "X-Bayesian-Words" header, which lists the interesting words that were found (and their scores). N.B. The "X-Bayesian-Result" header is always added.
- Learn from SpamPal and other plugins
  If selected, the plugin will learn using results from SpamPal and all other enabled plugins
Ignore
Maintenance of the list of words that the plugin will ignore.
Functionality of the buttons:
- Add
  Add the word from the edit box into the list
- Remove
  Remove the selected words
- Remove all
  Empty the list
- Reset
  Revert back to the state of the list before the configuration window was opened
- Default
  Load the default ignore words
Import
These functions act as if the files were received as email. If the file(s) that are imported are not complete email messages the results are not guaranteed.
- Import directory into database as spam/clean
  Imports all files in a directory into the database
Miscellaneous
- Language
  Choose the language you wish the plugin to use.

6. Files

Configuration
The default files are held in the plugin directory (e.g. C:\Program Files\SpamPal\plugins\Bayesian) and should not be changed. There is a "user" copy in your SpamPal configuration directory.

Your default SpamPal user configuration directory is...

Windows XP: C:\Documents and Settings\%USERNAME%\Application Data\SpamPal\plugins\bayesian\
Windows 2k: C:\Documents and Settings\%USERNAME%\Application Data\SpamPal\plugins\bayesian\
Windows NT: C:\WinNT\Profiles\%USERNAME%\Application Data\SpamPal\plugins\bayesian\
Windows 98: C:\Windows\Application Data\SpamPal\plugins\bayesian\
Windows 95: C:\Program Files\Spampal\config\plugins\bayesian\

n.b. This is also where the log files are saved.

Wordlist.dat format
The format of the wordlist file is shown below:

    Spam = 947                         // number of emails received classed as spam
    Clean = 1744                       // number of emails received classed as clean
    adage = 1,0,0.99000001,1041011569  // word = spam_occurrences,clean_occurrences,spam_ratio,timestamp
    advert = 1,0,0.99000001,1041011569 
    ...
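
A file in this format could be read with a few lines of Python, for example to inspect the database outside SpamPal. This is a sketch, not a supported tool: the names `WordEntry` and `parse_wordlist` are made up here, and it assumes the `//` annotations above are explanatory only and do not appear in the real file.

```python
from typing import NamedTuple


class WordEntry(NamedTuple):
    spam_count: int
    clean_count: int
    spam_ratio: float
    last_seen: int  # timestamp as stored in the file


def parse_wordlist(lines):
    """Parse wordlist lines into (totals, words)."""
    totals = {}
    words = {}
    for line in lines:
        # Split on the last "=" so a word containing "=" is still handled.
        key, sep, value = line.rpartition("=")
        key, value = key.strip(), value.strip()
        if not sep or not key:
            continue  # blank or unrecognised line
        fields = value.split(",")
        if len(fields) == 4:
            words[key] = WordEntry(
                int(fields[0]), int(fields[1]), float(fields[2]), int(fields[3])
            )
        elif len(fields) == 1 and key in ("Spam", "Clean"):
            totals[key] = int(value)
    return totals, words
```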

Ignore.dat format
The format of the ignore file is shown below:
```
    content-type
    base64
    head
    body
    ...
```
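
As a rough illustration of how an ignore list in this format might interact with the word length and case-sensitivity options from section 5, consider the Python sketch below. The tokenising pattern, the default lengths of 3 and 20, and the function names are assumptions chosen for the example; the plugin's own word splitting is not documented here.

```python
import re


def load_ignore_list(lines):
    """One ignored word per line; blank lines are skipped."""
    return {line.strip() for line in lines if line.strip()}


def significant_words(text, ignore, min_len=3, max_len=20, case_sensitive=False):
    """Return the words of `text` that survive the length and ignore filters."""
    if not case_sensitive:
        # Compare everything in lower case when case-sensitivity is off.
        text = text.lower()
        ignore = {w.lower() for w in ignore}
    # Assumption: a "word" is a run of letters, digits, ', $ or -.
    tokens = re.findall(r"[A-Za-z0-9'$-]+", text)
    return [t for t in tokens if min_len <= len(t) <= max_len and t not in ignore]
```

With case-sensitivity switched on, "Body" and "body" are different words, so only the exact spelling in the ignore list is skipped.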

7. Recommended plugins

Bayesian plugin
HTMLModify
RegEx
URLBody

8. Other useful plugins

Good Words plugin
Bonded Sender
Whitelist Extender
Logfile