1. Introduction

The Bayesian Plugin is a semi-intelligent solution for recognising spam. Over time, it learns from incoming email messages and the words they contain, assigning each word a "spam" and a "clean" probability. It filters emails based on these probabilities, producing an overall score for the whole email.

2. Filtering Process

  • Each word in the incoming email is scored based on its occurrence in previous emails.
    • A word that has only appeared in emails that have been classified as spam would have a ratio value of 0.99
    • A word that only appears in clean email receives a ratio of 0.0
    • Any unknown word is given a default value of 0.2
  • The most significant words are selected: the Word Count words whose ratios lie furthest from the neutral value 0.5, ranked by abs(ratio - 0.5)
  • The ratios of these words are combined:
     
    spamratio = ratio_1 * ratio_2 * ... * ratio_wordcount
    cleanratio = (1 - ratio_1) * (1 - ratio_2) * ... * (1 - ratio_wordcount)
    score = 100 * spamratio / (spamratio + cleanratio)
     
  • A decision is made whether the email is spam or clean based on this score (A score higher than Spam Threshold is spam).
  • Every email is marked with:
    X-Bayesian-Result: Spam / Clean
    X-Bayesian-Words: advert 0.9900001 bread 0.483210 credit 0.9900001 ...
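The scoring steps above can be sketched in Python as follows. This is only an illustration of the formulas in this section, not the plugin's actual code: the function name `score_email`, its parameters, the use of distinct words, and the division-by-zero guard are all assumptions made for the example.

```python
def score_email(words, known_ratios, word_count=10, unknown_ratio=0.2):
    """Return a 0-100 spam score for the words of one email.

    known_ratios maps word -> spam ratio, as described in this section.
    All names here are hypothetical; they are not taken from the plugin.
    """
    # Assumption: each distinct word is considered once.
    # Unknown words receive the default ratio.
    ratios = {w: known_ratios.get(w, unknown_ratio) for w in set(words)}

    # Keep the word_count ratios that lie furthest from the neutral value 0.5.
    top = sorted(ratios.values(), key=lambda r: abs(r - 0.5), reverse=True)
    top = top[:word_count]

    # Combine the selected ratios exactly as in the formulas above.
    spam_ratio = 1.0
    clean_ratio = 1.0
    for r in top:
        spam_ratio *= r
        clean_ratio *= 1.0 - r

    total = spam_ratio + clean_ratio
    if total == 0.0:
        # Assumed guard: both products can only be zero if a ratio of 0.0
        # and a ratio of 1.0 are present together; treat that as undecided.
        return 50.0
    return 100.0 * spam_ratio / total


known = {"advert": 0.99, "credit": 0.99, "bread": 0.48321}
score = score_email(["advert", "credit", "bread"], known)
print("Spam" if score > 90 else "Clean")
```

With the three example words from the X-Bayesian-Words header, the two 0.99 ratios dominate and the score lands well above the default Spam Threshold of 90.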
    
3. Postprocessing
  • Each word in the current email is added to the database of known clean/spam words (dependent on the Learning Threshold and the Spam Threshold).
     
    if(score > learningthreshold) total_number_of_spam_words++
    if(score < spamthreshold) total_number_of_clean_words++
    spamratio = spam_occurrences / total_number_of_spam_words
    cleanratio = clean_occurrences / total_number_of_clean_words
    new_word_ratio = 100 * spamratio / (spamratio + cleanratio)
     
    If the score lies between Spam Threshold and Learning Threshold (not definitely spam and not definitely clean), the words are not added to the database.
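The learning step can be sketched in Python as follows. This is a rough illustration under several assumptions, not the plugin's implementation: the names `WordStats`, `Database` and `learn` are invented for the example, the totals are counted per email, each distinct word is counted once per email, and the ratio is kept on a 0 to 1 scale and capped at 0.99 so that it matches the word ratios used in the Filtering Process section.

```python
from dataclasses import dataclass, field


@dataclass
class WordStats:
    spam: int = 0        # times the word was learned from spam
    clean: int = 0       # times the word was learned from clean email
    ratio: float = 0.2   # default ratio for a word with no history


@dataclass
class Database:
    total_spam: int = 0
    total_clean: int = 0
    words: dict = field(default_factory=dict)


def learn(db, words, score, spam_threshold=90, learning_threshold=99):
    """Update db from one scored email; return True if it was learned."""
    is_spam = score > learning_threshold
    is_clean = score < spam_threshold
    if not (is_spam or is_clean):
        return False  # between the thresholds: nothing is added

    # Assumption: totals count emails, and each distinct word counts once.
    if is_spam:
        db.total_spam += 1
    else:
        db.total_clean += 1
    for w in set(words):
        stats = db.words.setdefault(w, WordStats())
        if is_spam:
            stats.spam += 1
        else:
            stats.clean += 1

    # Simplification: only the words of this email get a fresh ratio.
    for w in set(words):
        stats = db.words[w]
        spam_ratio = stats.spam / db.total_spam if db.total_spam else 0.0
        clean_ratio = stats.clean / db.total_clean if db.total_clean else 0.0
        # Assumed cap of 0.99, mirroring the spam-only ratio quoted earlier.
        stats.ratio = min(0.99, spam_ratio / (spam_ratio + clean_ratio))
    return True
```

For example, `learn(Database(), ["advert"], 99.5)` records "advert" as a spam-only word with a ratio of 0.99, while an email scoring 95 (between the default thresholds of 90 and 99) leaves the database untouched.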
4. Plugin Window
  • Select "Plugins" -> "Bayesian Filter" from the "right-click" menu on the SpamPal tray icon (The umbrella near the Taskbar Clock)
    Each email that is processed by the plugin is copied here (only for the current SpamPal session) so that it can be reclassified if necessary. A red icon means the email was classified as spam, a green one means clean.
    Functionality of the buttons:
    • Spam
      Mark the currently selected emails as spam
    • Clean
      Mark the currently selected emails as clean
    • Selected
      Remove the currently selected emails
    • All
      Remove all emails
5. Configuration Options
  • Thresholds
    • Spam Threshold
      Increasing this reduces the number of clean emails wrongly classified as spam; decreasing it makes the filter tag more email as spam. Any word with a ratio below this threshold is considered a clean word.
      Default value 90
    • Learning Threshold
      Any word with a ratio greater than or equal to this is added to the database classed as spam.
      Default value 99
    • Limit message processing
      Only process the first part of an email (see "Amount of message to process")
    • Amount of message to process (kb)
      Limit the amount of an individual email that is processed. Set this to avoid timeouts caused by the plugin taking too long to process large emails.
  • Words
    • Word Count
      The number of significant words examined during classification of an email.
      Checking fewer words makes the filter more "trigger-happy"; checking more words means that more spam words have to be present for an email to be classified as spam.
      Default value 10
    • Min/Max word length
      Set the minimum and maximum length of the words that are used during filtering
    • Word expiry
      Every word is tagged with the time it was last encountered. This threshold ensures that words that haven't occurred recently are removed from the database.
      If a word has not appeared for X days (the word expiry), its spam and clean occurrence counts are each decremented once per day until they reach zero. When both reach zero, the word is removed from the database.
    • Minimum word occurrence for filtering
      Sets the minimum number of times a word has to appear before it is used in filtering. A low setting will make the plugin more "trigger-happy", letting it mark emails based on less data.
    • Incoming words are case-sensitive
      If unselected, all new email will be converted to lower case before filtering
  • Options
    • Create log file
      Turns on/off logging
    • Learn (don't mark spam)
      The plugin will do everything it normally does except that it does not mark an email as spam. This lets the filter "learn" your email without an initial period in which it may mark a lot of email as spam before it "knows" your email.
      Don't forget to turn this option off when you think the filter has seen enough of your email ;-)
    • Assume whitelisted email is clean
      Selecting this means that the plugin will score any whitelisted email as zero, i.e. perfectly clean.
    • Learn from whitelisted emails
      Whether words found in whitelisted emails are added to the database (Use in conjunction with the above).
      Example: You may be subscribed to a mailing list about spam (containing words that would be scored as spam) that you have whitelisted. If whitelisted email is considered clean then the words in these emails would be added to the database as clean. This option allows you to stop that happening.
    • Include headers in filtering
      Select this if you want to include all of the email's headers in the Bayesian filtering.
    • Add "X-Bayesian-Words" header
      Controls whether the "X-Bayesian-Words" header, which lists the interesting words that were found (and their scores), is added. N.B. The "X-Bayesian-Result" header will always be added.
    • Learn from SpamPal and other plugins
      If selected, the plugin will learn using results from SpamPal and all other enabled plugins
  • Ignore
    Maintenance of the list of words that the plugin will ignore.
    Functionality of the buttons:
    • Add
      Add the word from the edit box into the list
    • Remove
      Remove the selected words
    • Remove all
      Empty the list
    • Reset
      Revert back to the state of the list before the configuration window was opened
    • Default
      Load the default ignore words
  • Import
    These functions treat the imported files as if they had been received as email. If the imported files are not complete email messages, the results are not guaranteed.
    • Import directory into database as spam/clean
      Imports all files in a directory into the database
  • Miscellaneous
    • Language
      Choose the language you wish the plugin to use.
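The Word expiry rule described under "Words" can be sketched in Python as follows. This is a sketch only: the function name `expire_words`, the in-memory layout of the entries, the assumption that the pass runs once per day, and the assumption that the last-encountered time is a Unix timestamp in seconds are all made up for the illustration.

```python
SECONDS_PER_DAY = 86400


def expire_words(words, now, expiry_days):
    """Apply one daily expiry pass to words, modifying it in place.

    words maps word -> [spam_count, clean_count, last_seen]; this layout
    is hypothetical. Assumes the function is called once per day.
    """
    cutoff = now - expiry_days * SECONDS_PER_DAY
    for word in list(words):  # copy the keys, since entries may be deleted
        entry = words[word]
        if entry[2] >= cutoff:
            continue  # encountered recently enough, leave it alone
        # Decrement both counts, but never below zero.
        entry[0] = max(0, entry[0] - 1)
        entry[1] = max(0, entry[1] - 1)
        if entry[0] == 0 and entry[1] == 0:
            del words[word]  # both counts exhausted: forget the word
```

A word with counts of 2 and 1 that has passed the expiry limit would survive the first daily pass (as 1 and 0) and be removed on the second.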
6. Files
  • Configuration
    The default files are held in the plugin directory (e.g. C:\Program Files\SpamPal\plugins\Bayesian) and should not be changed. There is a "user" copy in your SpamPal configuration directory.

    Your default SpamPal user configuration directory is...

    Windows XP: C:\Documents and Settings\%USERNAME%\Application Data\SpamPal\plugins\bayesian\
    Windows 2k: C:\Documents and Settings\%USERNAME%\Application Data\SpamPal\plugins\bayesian\
    Windows NT: C:\WinNT\Profiles\%USERNAME%\Application Data\SpamPal\plugins\bayesian\
    Windows 98: C:\Windows\Application Data\SpamPal\plugins\bayesian\
    Windows 95: C:\Program Files\Spampal\config\plugins\bayesian\

    n.b. This is also where the log files are saved.

  • Wordlist.dat format
    The format of the wordlist file is shown below:
        Spam = 947                         // number of emails received classed as spam
        Clean = 1744                       // number of emails received classed as clean
        adage = 1,0,0.99000001,1041011569  // word = spam_occurrences,clean_occurrences,spam_ratio,timestamp
        advert = 1,0,0.99000001,1041011569 
        ...
  • Ignore.dat format
    The format of the ignore file is shown below:
        content-type
        base64
        head
        body
        ...
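For anyone who wants to inspect the word list outside the plugin, the Wordlist.dat layout shown above can be read with a short Python sketch like the one below. It assumes the file looks exactly like the listing (a " = " separator and comma-separated values) and it strips the "//" annotations used in the listing, which may not be present in a real file. The names `WordEntry` and `parse_wordlist` are invented for this example.

```python
from typing import NamedTuple


class WordEntry(NamedTuple):
    spam_occurrences: int
    clean_occurrences: int
    spam_ratio: float
    timestamp: int


def parse_wordlist(text):
    """Parse Wordlist.dat-style text into (totals, words)."""
    totals = {}
    words = {}
    for line in text.splitlines():
        # Split on the first " = "; lines without it (such as "...") are skipped.
        key, sep, value = line.strip().partition(" = ")
        if not sep:
            continue
        # Drop any "//" annotation after the value, as used in the listing.
        value = value.split("//", 1)[0]
        fields = [f.strip() for f in value.split(",")]
        if len(fields) == 1 and key in ("Spam", "Clean"):
            totals[key] = int(fields[0])
        elif len(fields) == 4:
            words[key] = WordEntry(
                int(fields[0]), int(fields[1]), float(fields[2]), int(fields[3])
            )
    return totals, words


sample = """Spam = 947
Clean = 1744
adage = 1,0,0.99000001,1041011569
advert = 1,0,0.99000001,1041011569
"""
totals, words = parse_wordlist(sample)
print(totals["Spam"], totals["Clean"], words["adage"].spam_ratio)
```

The plugin's own files should still be left alone while SpamPal is running; this sketch only reads text that has already been loaded.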
7. Recommended plugins
  • Bayesian plugin
  • HTMLModify
  • RegEx
  • URLBody
8. Other useful plugins
  • Good Words plugin
  • Bonded Sender
  • Whitelist Extender
  • Logfile