Filtering Modules - MailCleaner/MailCleaner8 GitHub Wiki

MailCleaner is largely a tool for centralized management of a handful of common Anti-Spam modules and Content Filtering modules. The coordination of these modules is handled by MailScanner, then the decision for what to do after all of the modules have been evaluated is made by our SpamHandler library.

Each email passing through the Filtering Engine will have a SpamCheck header which summarizes the results of all modules and the decision made based on them. For example:

X-MailCleaner-SpamCheck: is spam, Newsl (score=-1.0, required=5.0,
	MC_NEWS_HSBJANSWERS=-3, MC_NEWS_NIPRBL=2, position : 0,
	not decisive), Spamc (score=4.0, required=5.0, HTML_MESSAGE 0.0,
	HTML_FONT_LOW_CONTRAST 0.0, MPART_ALT_DIFF 0.7,
	MC_URI_EASYMONEY_LVL3 1.0, MC_SPF_PASS -0.0, MIME_QP_LONG_LINE 0.0,
	DKIM_VALID -0.3, MC_CHECK_PROFIL 2.0, MC_CLIENT_CERT_VALID 0.0,
	MIME_HTML_ONLY 0.0, URIBL_BLOCKED 0.0, DKIM_SIGNED 0.6,
	MIME_HTML_ONLY_MULTI 0.0, position : 6, ham not decisive, PreRBLs 
        (MCIPRBL, position : 7, spam decisive)

The Modules dedicated to filtering Spam (and Newsletters) are described here.

Enabling and marking as Decisive

Each module has two options: 'Enable module' and 'Module is decisive'.

'Enable module' determines whether the module actually runs and logs its results.

'Module is decisive' determines whether the outcome of that module is actually enforced. It has no effect unless 'Enable module' is also set, since a module that does not run produces nothing to enforce.

If you want to evaluate the hypothetical results for a module (via the logs and headers) before enforcing those results, you can enable it but not make it decisive.

The headers record whether a module was the definitive cause of the message being accepted or rejected. This is noted after the flag type, for example spam not decisive or ham decisive.

Module Order

The order in which modules trigger is important. Only one module needs to trigger for a message to be classified as spam, so a message flagged by an earlier module does not need to be filtered by the remaining ones. As a result, when the first module triggers, the rest will be skipped.

Given this behavior, it is best to put the fastest modules first and the slower ones later. Likewise, if you have a module enabled but not decisive because you would like to evaluate its results, you may wish to move it earlier in the order so that it is not skipped and you see every result it would have produced.
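The interaction between 'Enable module', 'Module is decisive' and the module order can be sketched as follows. This is only an illustration of the behavior described above, not the actual MailScanner or SpamHandler code: the Module class, the evaluate function and the log format are hypothetical, and it assumes that disabled modules still occupy a position.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Module:
    """Hypothetical stand-in for one filtering module."""
    name: str
    enabled: bool
    decisive: bool
    verdict: str                   # "spam" or "ham" when the module triggers
    check: Callable[[str], bool]   # returns True when the module triggers


def evaluate(modules: List[Module], message: str) -> Tuple[Optional[str], List[str]]:
    """Run modules in order and stop at the first decisive one that triggers."""
    log = []
    for position, module in enumerate(modules, start=1):
        if not module.enabled:
            continue  # a disabled module neither runs nor logs
        triggered = module.check(message)
        enforced = triggered and module.decisive
        flag = "decisive" if enforced else "not decisive"
        log.append(f"{module.name}: triggered={triggered}, position : {position}, {flag}")
        if enforced:
            return module.verdict, log  # remaining modules are skipped
    return None, log  # no decisive module triggered
```

A module that is enabled but not decisive still appears in the log, which is what makes it possible to review its hypothetical results before enforcing them.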

The position of each module in the order is noted in the headers, like position : 1.
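When reviewing many messages, it can help to pull the position and decisive flag out of a module header automatically. The sketch below is based only on the example headers shown on this page; the regular expression and the parse_trailer name are assumptions, and real headers may contain variations that it does not handle.

```python
import re

# Matches trailers such as "position : 1, ham decisive",
# "position : 6, not decisive" and "position 7, decisive".
TRAILER_RE = re.compile(
    r"position\s*:?\s*(?P<position>\d+)\s*,\s*"
    r"(?P<verdict>spam|ham)?\s*(?P<negated>not\s+)?decisive"
)


def parse_trailer(header_value):
    """Return position, verdict and decisive flag, or None if not found."""
    match = TRAILER_RE.search(header_value)
    if match is None:
        return None
    return {
        "position": int(match.group("position")),
        "verdict": match.group("verdict"),           # None when no flag type is given
        "decisive": match.group("negated") is None,  # "not decisive" -> False
    }
```

For example, parse_trailer("is not spam (18.72%) position : 6, not decisive") yields position 6 with decisive set to False.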

TrustedSources

This module (Configuration->Anti-spam->TrustedSources) operates in the opposite manner to the rest. If a message is flagged by this module, it will not be treated as spam.

Example header: X-TrustedSources: is ham (message authenticated by SMTP from [146.4.119.1]) position : 1, ham decisive

For best performance, it should be first in the evaluation order to prevent spending any time filtering a message which will not be quarantined.

NiceBayes

This module (Configuration->Anti-spam->NiceBayes) performs Bayesian statistical analysis on emails based on training data from numerous user reports of both confirmed spam and non-spam messages.

Example header: X-NiceBayes: is not spam (18.72%) position : 6, not decisive

A simple explanation of Bayesian analysis is that it breaks training emails into tokens (individual words and phrases) and assigns each token a confidence that it indicates spam. That confidence depends on how often the token has appeared in spam messages compared with how often it has appeared in non-spam messages.

A token like 'hello' will appear very often in both data sets, so it will have a very low confidence in either direction. A token like 'viagra' is likely to appear much more often in spam than in non-spam reports, so it will indicate a stronger confidence in that direction.

The NiceBayes module is an implementation of this algorithm which identifies the tokens with the highest levels of confidence in the message currently being scanned and generates an overall confidence score. Messages with a high enough overall confidence will be flagged.
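The token-based idea described above can be illustrated with a small, simplified classifier. This is a teaching sketch only: it is not the algorithm bogofilter uses, and the tokenizer, the smoothing, the top_n cut-off and all function names are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(messages):
    """Count how many training messages contain each token."""
    counts = Counter()
    for message in messages:
        counts.update(set(tokenize(message)))
    return counts


def token_confidence(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate how strongly a token indicates spam (0.5 means neutral)."""
    # Add-one smoothing keeps the result strictly between 0 and 1.
    p_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return p_spam / (p_spam + p_ham)


def message_confidence(message, spam_counts, ham_counts, n_spam, n_ham, top_n=15):
    """Combine the most telling tokens into one overall spam confidence."""
    confidences = [
        token_confidence(t, spam_counts, ham_counts, n_spam, n_ham)
        for t in set(tokenize(message))
    ]
    # Keep the tokens whose confidence is furthest from neutral.
    confidences.sort(key=lambda p: abs(p - 0.5), reverse=True)
    log_odds = sum(math.log(p) - math.log(1 - p) for p in confidences[:top_n])
    return 1 / (1 + math.exp(-log_odds))
```

With a tiny training set in which 'viagra' appears only in spam and 'hello' appears in both sets, 'viagra' receives a confidence well above 0.5 while 'hello' stays close to neutral, which mirrors the behavior described in the text.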

This is a simple form of machine learning, which means that it will adapt quickly to new spam trends, but also means that the decisions are fairly opaque. If a recent spam wave with a relatively unique token was reported many times, and an innocuous email that happens to contain the same token is then seen, that token could well lead to the innocuous email being caught.

This is to say that this module tends to be the most responsive to new trends, but it is also the most prone to false positives.

Note that the SpamC filter also has its own Bayes plugin which should find similar results to NiceBayes, but which only adds a score to the overall SpamC total based on its confidence rather than guaranteeing that the message is flagged.

See our guide on reporting false-positives and false-negatives which provides information on how to help train the Bayes databases.

To see whether a message would be classified differently after a recent Bayes update, you can run bogofilter manually:

/opt/bogofilter/bin/bogofilter -c /usr/mailcleaner/etc/mailscanner/prefilters/bogo_nicebayes.cf -vvv < /var/mailcleaner/spam/<domain>/<recipient>/<exim-id>

The paths provided above are, in order, the bogofilter command, the NiceBayes configuration file, and an example path to an already quarantined item. The output is informational only; it will not cause the quarantined item to be re-classified.
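If you need to re-check several quarantined items, the command can be wrapped in a small script. The sketch below uses only the paths and flags shown above; the function names are hypothetical, and the domain, recipient and Exim ID values must be replaced with those of a real quarantined item.

```python
import posixpath
import subprocess

BOGOFILTER = "/opt/bogofilter/bin/bogofilter"
NICEBAYES_CONF = "/usr/mailcleaner/etc/mailscanner/prefilters/bogo_nicebayes.cf"
SPAM_QUARANTINE = "/var/mailcleaner/spam"


def quarantine_path(domain, recipient, exim_id):
    """Build the path of a quarantined item from its three components."""
    return posixpath.join(SPAM_QUARANTINE, domain, recipient, exim_id)


def bogofilter_command():
    """Return the same command line as shown above, without the redirect."""
    return [BOGOFILTER, "-c", NICEBAYES_CONF, "-vvv"]


def recheck(domain, recipient, exim_id):
    """Feed a quarantined item to bogofilter on stdin and return its output."""
    with open(quarantine_path(domain, recipient, exim_id), "rb") as message:
        # check=True is deliberately not used: bogofilter reports its
        # classification through the exit status.
        result = subprocess.run(
            bogofilter_command(), stdin=message, capture_output=True
        )
    return result.stdout.decode(errors="replace")
```

As with the manual command, this only reports what bogofilter would decide now and leaves the quarantined item untouched.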

ClamSpam Module

The ClamSpam module (Configuration->Anti-spam->ClamSpam) runs a second ClamAV daemon with specific signatures designed to identify spam instead of viruses.

Example header: X-ClamSpam: is spam (ScamNailer.Phish.info_AT_noreply.org.UNOFFICIAL), position 7, decisive

You can manually scan a message with ClamSpam again by either uploading it to the MailCleaner server or locating its path in the quarantine (as shown here):

/opt/clamav/bin/clamdscan --config-file=/usr/mailcleaner/etc/clamav/clamscand.conf /var/mailcleaner/spam/<domain>/<recipient>/<exim-id>

PreRBLs

The PreRBLs module (Configuration->Anti-spam->PreRBLs) performs DNS lookups for listings of the IP (and perhaps hostname) of the sending SMTP server on any configured RBL server. It will flag the message as spam if a listing is found, because this indicates that the sending machine has a poor reputation (i.e. it is known to be a common source of spam).

Example header: X-PreRBLs: is spam (MCIPRBL) position : 8, spam decisive

See the RBLs Wiki for more.
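For readers unfamiliar with how an IP-based RBL lookup works in general, the conventional DNSBL query can be sketched as below. This shows the common technique rather than the PreRBLs implementation itself; the zone rbl.example.org and the function names are placeholders, and only IPv4 addresses are handled.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name queried for an IPv4 address in a DNSBL zone."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Return True if the zone answers for this address (requires DNS access)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False  # no record found: the address is not listed
    return True
```

For the address in the TrustedSources example, dnsbl_query_name("146.4.119.1", "rbl.example.org") produces 1.119.4.146.rbl.example.org; a listing exists if that name resolves.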

UriRBLs

Similar to PreRBLs, except that the UriRBLs module (Configuration->Anti-spam->UriRBLs) performs DNS lookups for content within the email, including URLs, sender addresses, domains, bitcoin addresses, message hashes, etc. It will flag the message as spam if a listing is found, because this indicates that the content has been found in other emails that were reported to be spam.

Example header: X-UriRBLs: is spam (MCURIBL), position : 9, spam decisive

See the RBLs Wiki for more.
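As an illustration of the content-based variant, the sketch below extracts the hostnames of URLs found in a message body and builds the names that would be looked up in a URI blocklist zone. It covers URLs only, not the other content types mentioned above; the zone name, the regular expression and the function names are assumptions, and the real module may normalize content differently.

```python
import re
from urllib.parse import urlparse

# Deliberately simple URL matcher for plain-text bodies.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_hosts(body):
    """Return the unique hostnames of all URLs in the body, in order."""
    hosts = []
    for url in URL_RE.findall(body):
        host = urlparse(url).hostname  # lowercased by urlparse
        if host:
            host = host.rstrip(".")
            if host and host not in hosts:
                hosts.append(host)
    return hosts


def uribl_query_names(body, zone):
    """Build one DNS query name per hostname found in the body."""
    return [f"{host}.{zone}" for host in extract_hosts(body)]
```

Each resulting name can then be resolved in the same way as an IP-based listing: a successful answer means the content is listed.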

SpamC

SpamC is the daemonized instantiation of SpamAssassin, a heuristic-based anti-spam tool. This module (Configuration->Anti-spam->SpamC) searches the header and body of emails for complicated patterns and combinations of patterns and applies a score for each. MailCleaner is configured to treat a cumulative SpamC score of 5 or more as spam.

Example header: X-Spamc: is spam (6.0/5.0) position : 6, spam decisive

SpamC is the most performance-heavy anti-spam module, since the sheer volume of rules takes a fair amount of time to evaluate. Because of this, it is recommended to have it run last.

We have a separate Wiki discussing how most SpamC rules are created and how you can create and adjust them. However, SpamC also has a number of plugins which perform more complex actions and return a score to the parent SpamC process. This includes plugins which duplicate some of the other modules, including a Bayesian analysis plugin, RBL plugins and an OCR plugin. Since all rules and plugins contribute to a single score, no one rule causes a message to be considered spam (unless intentionally set so high that it is impractical for it to not flag the message).

Because of this, SpamC is perhaps the most fickle module, with the possible exception of NiceBayes. Hitting or missing a single, minor rule could bump a message with a score of 4.9 to a score of 5.0, or vice versa, flipping the determination. For the same reason, SpamC rule adjustments and additions are the most dynamic way of altering the false-positive to false-negative ratio, and this is where much of the effort goes for Enterprise Edition.
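The cumulative scoring can be checked against the Spamc portion of the example SpamCheck header at the top of this page. The sketch below simply adds up the rule scores listed there and compares the total with the required score of 5.0; the function names are hypothetical, and the rounding to one decimal place is only there to avoid floating-point noise.

```python
REQUIRED_SCORE = 5.0

# Rule hits copied from the Spamc part of the example header.
example_hits = {
    "HTML_MESSAGE": 0.0,
    "HTML_FONT_LOW_CONTRAST": 0.0,
    "MPART_ALT_DIFF": 0.7,
    "MC_URI_EASYMONEY_LVL3": 1.0,
    "MC_SPF_PASS": -0.0,
    "MIME_QP_LONG_LINE": 0.0,
    "DKIM_VALID": -0.3,
    "MC_CHECK_PROFIL": 2.0,
    "MC_CLIENT_CERT_VALID": 0.0,
    "MIME_HTML_ONLY": 0.0,
    "URIBL_BLOCKED": 0.0,
    "DKIM_SIGNED": 0.6,
    "MIME_HTML_ONLY_MULTI": 0.0,
}


def spamc_total(rule_scores):
    """Sum the scores of all rules that hit."""
    return round(sum(rule_scores.values()), 1)


def is_spam(rule_scores, required=REQUIRED_SCORE):
    """A message is spam when its total reaches the required score."""
    return spamc_total(rule_scores) >= required
```

The example hits add up to 4.0, matching the score=4.0 reported in the header, so that message is ham as far as SpamC is concerned. A message sitting at 4.9 would flip to spam if one additional rule worth 0.1 were to hit.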

Newsletter Module (Newsl)

The Newsletter module is a hidden second instance of the SpamC module which is always run and which has a separate decision process at the end. It runs with a different set of rules than SpamC, dedicated to recognizing newsletters instead of spam.

Example header: Newsl (score=5.0, required=5.0, MC_NEWS_NIPRBL=5)

This functions exactly the same as the SpamC module, except that the decision on how to handle a positive result is done separately, according to a different user preference and with a different Whitelist (called the "Newslist").

The default preference is to quarantine Newsletters (i.e. treat them like spam), but this can be configured per-domain from Configuration->Domains->[select domain]->Filtering, per-user by the admin from Management->Users->[search email address]->Address settings, or by the user themselves at Configuration->Address settings.

Items identified as a Newsletter will be grouped separately in Quarantine Report emails. They will have an extra indicator in the web Quarantine, and the option to Whitelist will be replaced with the option to Newslist.

Note that an email can be recognized as “newsletter” AND “spam”. The two classifications are independent, so an email can be quarantined as spam even if the user has asked to receive newsletters.
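The way the two classifications combine can be summarized in a small decision function. This is a simplified reading of the behavior described in this section, not the SpamHandler logic: the function and parameter names are hypothetical, and spam Whitelists and other preferences are left out.

```python
def final_action(is_spam, is_newsletter, wants_newsletters, on_newslist):
    """Decide what happens to a message given both classifications."""
    if is_spam:
        # The spam verdict is handled on its own, regardless of the
        # newsletter preference.
        return "quarantine"
    if is_newsletter and not (wants_newsletters or on_newslist):
        # Default preference: newsletters are treated like spam.
        return "quarantine"
    return "deliver"
```

For instance, a message flagged as both newsletter and spam is quarantined even when the user has chosen to receive newsletters, while a newsletter from a Newslisted sender that is not spam is delivered.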

MessageSniffer (add-on)

This add-on module uses a curated set of rules from MessageSniffer (similar to SpamC) for an extra level of filtering.