1. Introduction
The Bayesian Plugin is a semi-intelligent solution for recognising spam. Over
time, it learns from incoming email messages and the words they contain,
assigning each word a "spam" and a "clean" probability. It filters emails
based on these probabilities, producing an overall score for the whole email.
2. Filtering Process
3. Postprocessing
-
Each word in the current email is added to the database of known clean/spam
words (dependent on the Learning Threshold and the Spam Threshold).
if (score > learningthreshold) total_number_of_spam_words++
if (score < spamthreshold) total_number_of_clean_words++
spamratio = spam_occurrences / total_number_of_spam_words
cleanratio = clean_occurrences / total_number_of_clean_words
new_word_ratio = 100 * spamratio / (spamratio + cleanratio)
If the word has a score between the Spam Threshold and the Learning Threshold
(not definitely spam and not definitely clean), it is not added to the
database.
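As a rough illustration of the ratio calculation above, here is a minimal Python sketch. It assumes the two ratios use the word's own spam and clean occurrence counts (the per-word counts stored in Wordlist.dat). The function name and the neutral fallback value of 50 for a word with no usable data are assumptions made for this example, not part of the plugin.

```python
def word_spam_ratio(spam_occurrences, clean_occurrences,
                    total_spam_words, total_clean_words):
    """Return a 0-100 spam ratio for one word (higher means more spam-like)."""
    # Guard against empty totals so a brand-new database does not divide by zero.
    spam_ratio = spam_occurrences / total_spam_words if total_spam_words else 0.0
    clean_ratio = clean_occurrences / total_clean_words if total_clean_words else 0.0
    if spam_ratio + clean_ratio == 0:
        return 50.0  # assumption: treat a word with no data as neutral
    return 100 * spam_ratio / (spam_ratio + clean_ratio)


# A word seen 3 times in spam and once in clean mail, with equal totals.
print(word_spam_ratio(3, 1, 4, 4))
```

A word that has only ever appeared in spam comes out at 100, and one that has only appeared in clean mail comes out at 0.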
4. Plugin Window
-
Select "Plugins" -> "Bayesian Filter" from the "right-click" menu on the
SpamPal tray icon (The umbrella near the Taskbar Clock)
Each email that is processed by the plugin is copied here (only for the current
SpamPal session) so that it can be reclassified if necessary. A red icon
means the email was classified as spam; a green one means clean.
Functionality of the buttons:
-
Spam
Mark the currently selected emails as spam
-
Clean
Mark the currently selected emails as clean
-
Selected
Remove the currently selected emails
-
All
Remove all emails
5. Configuration Options
-
Thresholds
-
Spam Threshold
Increasing this reduces the number of false classifications; decreasing it
makes the filter tag more email as spam. Any word with a ratio below this
threshold is considered a clean word.
Default value 90
-
Learning Threshold
Any word with a ratio greater than or equal to this is added to the database
classed as spam.
Default value 99
-
Limit message processing
Only process the first part of an email (see "Amount of message to
process")
-
Amount of message to process (kb)
Limit the amount of an individual email that is processed. Set this to avoid
timeouts caused by the plugin taking too long to process large emails.
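The message limit can be pictured as a simple truncation. The sketch below is only an illustration: it assumes the limit counts from the start of the raw message and that 1 kb means 1024 bytes, neither of which is stated above. The function and parameter names are hypothetical.

```python
def limit_message(raw: bytes, limit_kb: int, enabled: bool = True) -> bytes:
    """Return only the first `limit_kb` kilobytes of `raw` when limiting is on."""
    if not enabled or limit_kb <= 0:
        return raw  # "Limit message processing" is off: use the whole email
    return raw[: limit_kb * 1024]  # assumption: 1 kb = 1024 bytes


big_email = b"x" * 5000
print(len(limit_message(big_email, 2)))
```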
-
Words
-
Word Count
The number of significant words examined during classification of an email.
Checking fewer words makes the filter more "trigger-happy"; checking more
words means more spam words must be present for an email to be classified as
spam.
Default value 10
-
Min/Max word length
Set the minimum and maximum length of the words used during filtering
-
Word expiry
Every word is tagged with the time it was last encountered. This threshold
ensures that words that haven't occurred recently are removed from the
database.
If a word has not appeared for X days (the word expiry), its spam and clean
occurrence counts are each decremented once per day until they reach zero.
When both reach zero, the word is removed from the database.
-
Minimum word occurrence for filtering
Sets the minimum number of times a word has to appear before it is used in
filtering. A low setting will make the plugin more "trigger-happy", letting it
mark emails based on less data.
-
Incoming words are case-sensitive
If unselected, all new email will be converted to lower case before filtering
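The "Word expiry" rule described above can be sketched as a once-a-day maintenance pass. This is a hypothetical Python illustration, not the plugin's own code: the in-memory layout (word mapped to spam count, clean count and last-seen timestamp) and the function name are assumptions.

```python
import time

SECONDS_PER_DAY = 86400


def expire_words(words, expiry_days, now=None):
    """Run one daily expiry pass over `words`, modifying it in place.

    `words` maps word -> [spam_count, clean_count, last_seen_timestamp].
    """
    now = time.time() if now is None else now
    cutoff = now - expiry_days * SECONDS_PER_DAY
    for word in list(words):  # copy the keys because entries may be deleted
        spam, clean, last_seen = words[word]
        if last_seen >= cutoff:
            continue  # seen recently enough, leave untouched
        spam = max(spam - 1, 0)
        clean = max(clean - 1, 0)
        if spam == 0 and clean == 0:
            del words[word]  # both counts exhausted: drop the word
        else:
            words[word] = [spam, clean, last_seen]
    return words
```

Calling this once per day reproduces the gradual decay: a stale word with counts of 3 and 1 would disappear after three passes, unless it is seen again in the meantime and its timestamp is refreshed.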
-
Options
-
Create log file
Turns on/off logging
-
Learn (don't mark spam)
The plugin will do everything it normally does, except that it does not mark
any email as spam. This lets the filter "learn" your email without the initial
period in which it might mark a lot of email as spam before it "knows" your
email.
Don't forget to turn this option off when you think the filter has seen enough
of your email ;-)
-
Assume whitelisted email is clean
Selecting this means that the plugin will score any whitelisted email as zero,
i.e. perfectly clean.
-
Learn from whitelisted emails
Whether words found in whitelisted emails are added to the database (Use in
conjunction with the above).
Example: You may be subscribed to a mailing list about spam (containing words
that would be scored as spam) that you have whitelisted. If whitelisted email
is considered clean then the words in these emails would be added to the
database as clean. This option allows you to stop that happening.
-
Include headers in filtering
Select this if you want to include all of the email's headers in the Bayesian
filtering.
-
Add "X-Bayesian-Words" header
Whether to add the "X-Bayesian-Words" header that lists the interesting words
that were found (and their scores). N.B. The "X-Bayesian-Result" header will
always be added.
-
Learn from SpamPal and other plugins
If selected, the plugin will learn using results from SpamPal and all other
enabled plugins
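To show how "Learn (don't mark spam)", "Assume whitelisted email is clean" and "Learn from whitelisted emails" might interact, here is a small Python sketch. It is an interpretation of the descriptions above, not the plugin's actual logic. In particular, it assumes an email is marked as spam when its score reaches the Spam Threshold, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Options:
    learn_only: bool                # "Learn (don't mark spam)"
    assume_whitelisted_clean: bool  # "Assume whitelisted email is clean"
    learn_from_whitelisted: bool    # "Learn from whitelisted emails"


def decide(score, whitelisted, opts, spam_threshold=90):
    """Return (final_score, mark_as_spam, learn_from_email)."""
    if whitelisted and opts.assume_whitelisted_clean:
        score = 0  # scored as perfectly clean
    # Whitelisted mail only feeds the database if the user allows it.
    learn = opts.learn_from_whitelisted or not whitelisted
    # Assumption: an email counts as spam once its score reaches the threshold.
    mark = score >= spam_threshold and not opts.learn_only
    return score, mark, learn


# A whitelisted mailing list about spam: scored clean, but not learned from.
print(decide(95, True, Options(False, True, False)))
```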
-
Ignore
Maintenance of the list of words that the plugin will ignore.
Functionality of the buttons:
-
Add
Add the word from the edit box into the list
-
Remove
Remove the selected words
-
Remove all
Empty the list
-
Reset
Revert the list to its state before the configuration window was opened
-
Default
Load the default ignore words
-
Import
These functions act as if the files were received as email. If the file(s) that
are imported are not complete email messages the results are not guaranteed.
-
Import directory into database as spam/clean
Imports all files in a directory into the database
-
Miscellaneous
-
Language
Choose the language you wish the plugin to use.
6. Files
-
Configuration
The default files are held in the plugin directory (e.g.
C:\Program Files\SpamPal\plugins\Bayesian) and should not be changed. There is
a "user" copy in your SpamPal configuration directory.
Your default SpamPal user configuration directory is:
Windows XP: C:\Documents and Settings\%USERNAME%\Application Data\SpamPal\plugins\bayesian\
Windows 2k: C:\Documents and Settings\%USERNAME%\Application Data\SpamPal\plugins\bayesian\
Windows NT: C:\WinNT\Profiles\%USERNAME%\Application Data\SpamPal\plugins\bayesian\
Windows 98: C:\Windows\Application Data\SpamPal\plugins\bayesian\
Windows 95: C:\Program Files\Spampal\config\plugins\bayesian\
N.B. This is also where the log files are saved.
-
Wordlist.dat format
The format of the wordlist file is shown below:
Spam = 947 // number of emails received classed as spam
Clean = 1744 // number of emails received classed as clean
adage = 1,0,0.99000001,1041011569 // word = spam_occurrences,clean_occurrences,spam_ratio,timestamp
advert = 1,0,0.99000001,1041011569
...
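For anyone wanting to inspect the word list outside SpamPal, the layout above could be read with something like the following Python sketch. It assumes each entry is "key = value" on its own line and that any trailing " // " annotation, as in the sample, can be discarded. The function name and the returned structures are choices made for this example.

```python
def parse_wordlist(text):
    """Parse Wordlist.dat-style text into (totals, words).

    totals: {"Spam": int, "Clean": int}
    words:  {word: (spam_count, clean_count, spam_ratio, timestamp)}
    """
    totals = {}
    words = {}
    for raw_line in text.splitlines():
        # Drop any trailing annotation such as " // number of emails ..."
        line = raw_line.split(" // ", 1)[0].strip()
        key, sep, value = line.rpartition(" = ")
        if not sep:
            continue  # blank line or anything that is not "key = value"
        fields = value.split(",")
        if len(fields) == 1 and key in ("Spam", "Clean"):
            totals[key] = int(fields[0])
        elif len(fields) == 4:
            spam, clean, ratio, stamp = fields
            words[key] = (int(spam), int(clean), float(ratio), int(stamp))
    return totals, words


sample = """Spam = 947 // number of emails received classed as spam
Clean = 1744 // number of emails received classed as clean
adage = 1,0,0.99000001,1041011569
advert = 1,0,0.99000001,1041011569
"""
totals, words = parse_wordlist(sample)
print(totals, words["adage"])
```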
-
Ignore.dat format
The format of the ignore file is shown below:
content-type
base64
head
body
...
7. Recommended plugins
-
Bayesian plugin
-
HTMLModify
-
RegEx
-
URLBody
8. Other useful plugins
-
Good Words plugin
-
Bonded Sender
-
Whitelist Extender
-
Logfile