Spam filters use a wide variety of techniques to fight unsolicited e-mail messages, but Bayesian spam filtering is by far the most effective.
The practice of flooding Internet forums and personal mailboxes with junk mail is an increasingly common form of advertising, mainly because it is cheap and easy to maintain. Since the mid-1990s, dedicated computer programs, commonly referred to as antispam filters, have therefore been developed to fight spam using a variety of techniques.
The techniques used by antispam software to identify unsolicited e-mails are typically simple. Experience has in fact shown that such messages are often one or a combination of the following:
Uncommon or misspelled words are inserted in a deliberate, but often unsuccessful, attempt to evade Bayesian filtering, a complex but extremely powerful technique that has proven to be the most effective to date and is therefore being implemented in all modern antispam software.
Bayesian spam filtering relies on the formula for conditional probability. This formula, called Bayes' Law, makes it possible, given two random events A and B, to calculate the probability of A occurring given that B has already occurred. In a spam filter, A is the event "The current message is junk mail", while B is the event "The message contains the words found in it". The likelihood of finding those words in a spam message is what the filter estimates from its word database.
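For reference, Bayes' Law for two events can be written as below. The symbols A and B are the events described above; the numbers in the second line are purely illustrative assumptions, not measured values.

```latex
\[
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)},
\qquad
P(B) = P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)
\]
% Illustrative numbers (assumed): P(A) = 0.5, P(B|A) = 0.9, P(B|not A) = 0.1
\[
P(A \mid B) = \frac{0.9 \times 0.5}{0.9 \times 0.5 + 0.1 \times 0.5} = \frac{0.45}{0.50} = 0.9
\]
```

With these assumed values, observing B raises the probability that the message is junk from 50% to 90%.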
When analyzing a message, the filter parses one word at a time, searching its internal database for the probability P associated with that word. Such values would be high (close to 90%) for words such as "buy", "watch", "investment" or "pharmacy", which are typically found in a junk mail message, and significantly lower for other terms. A cumulative value for all the words in the message is then computed and used as the term for B in Bayes' Law to determine whether the message is spam.
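One common way to compute such a cumulative value can be sketched as follows. This is a minimal sketch, not the method of any particular product: it assumes the words are independent of each other, that spam and legitimate mail are equally likely beforehand, and that unknown words are neutral. The table WORD_SPAM_PROB and its values are hypothetical.

```python
import math

# Hypothetical per-word spam probabilities; a real filter learns these from mail.
WORD_SPAM_PROB = {
    "buy": 0.90,
    "pharmacy": 0.92,
    "investment": 0.88,
    "meeting": 0.10,
    "report": 0.15,
}

UNKNOWN_WORD_PROB = 0.5  # assumption: words not in the database are neutral


def spam_probability(message, word_probs=WORD_SPAM_PROB):
    """Combine per-word probabilities into one score between 0 and 1.

    Computes prod(p) / (prod(p) + prod(1 - p)) in log space so that long
    messages do not underflow to zero.
    """
    log_spam = 0.0
    log_ham = 0.0
    # Simplification: split on whitespace only, no punctuation handling.
    for word in message.lower().split():
        p = word_probs.get(word, UNKNOWN_WORD_PROB)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    diff = log_ham - log_spam
    if diff > 700:  # exp() would overflow; the score is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


if __name__ == "__main__":
    print(spam_probability("buy pharmacy investment"))
    print(spam_probability("meeting report"))
```

A message would then be flagged when the score exceeds a chosen threshold, for example 0.9; the threshold is a tuning choice, not something fixed by Bayes' Law.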
By inserting unusual or misspelled words (Bayesian poisoning), spammers aim to keep the value computed for B below a suspicious level, so that their message is not filtered. In HTML messages, such words are often colored white so that they blend into the background and go unnoticed by users. However, they can still be identified when parsing the HTML code of the message.
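How hidden words might be picked out of the HTML can be sketched with Python's standard html.parser module. This is an illustrative sketch under strong assumptions: the background is white, the text color is set directly on an element through a color attribute or an inline style, and the markup is reasonably well nested. The names HiddenTextFinder and find_hidden_words are hypothetical.

```python
from html.parser import HTMLParser

WHITE_VALUES = {"white", "#fff", "#ffffff"}  # assumption: the background is white
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class HiddenTextFinder(HTMLParser):
    """Collect words that sit inside elements whose text color is white."""

    def __init__(self):
        super().__init__()
        self._stack = []  # one flag per open element: does it set white text?
        self.hidden_words = []

    @staticmethod
    def _is_white(attrs):
        attrs = dict(attrs)
        # Old-style markup: <font color="#FFFFFF">
        color = (attrs.get("color") or "").strip().lower()
        if color in WHITE_VALUES:
            return True
        # Inline CSS: style="color: white" (background-color is ignored)
        style = (attrs.get("style") or "").lower()
        for declaration in style.split(";"):
            name, _, value = declaration.partition(":")
            if name.strip() == "color" and value.strip() in WHITE_VALUES:
                return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append(self._is_white(attrs))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text is hidden if any enclosing element sets a white foreground.
        if any(self._stack):
            self.hidden_words.extend(data.split())


def find_hidden_words(html):
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden_words
```

A filter could then drop the words returned by find_hidden_words before scoring the message, or treat their presence as a sign of spam in its own right. Colors set through external stylesheets or near-white shades are not covered by this sketch.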
The main reason why attempts to avoid a Bayesian filter are largely unsuccessful is that its word database is easy to modify and update, which makes the filter highly adaptive.
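The adaptive behavior can be illustrated with a small word database that is updated every time a message is labeled. This is a simplified sketch: the class WordDatabase, its method names, and the add-one smoothing are assumptions made for the example, not a description of any specific filter.

```python
from collections import Counter


class WordDatabase:
    """Per-word counts that are updated whenever a message is labeled."""

    def __init__(self):
        self.spam_counts = Counter()  # spam messages containing each word
        self.ham_counts = Counter()   # legitimate messages containing each word
        self.spam_messages = 0
        self.ham_messages = 0

    def train(self, message, is_spam):
        # Count each word once per message.
        words = set(message.lower().split())
        if is_spam:
            self.spam_counts.update(words)
            self.spam_messages += 1
        else:
            self.ham_counts.update(words)
            self.ham_messages += 1

    def word_probability(self, word):
        """Estimate how strongly a word indicates spam (0.5 means neutral)."""
        word = word.lower()
        # Add-one smoothing (an assumption) keeps unseen words at 0.5.
        spam_rate = (self.spam_counts[word] + 1) / (self.spam_messages + 2)
        ham_rate = (self.ham_counts[word] + 1) / (self.ham_messages + 2)
        return spam_rate / (spam_rate + ham_rate)


if __name__ == "__main__":
    db = WordDatabase()
    print(db.word_probability("v1agra"))  # unseen word: neutral
    for _ in range(3):
        db.train("cheap v1agra here", is_spam=True)
        db.train("lunch meeting here", is_spam=False)
    print(db.word_probability("v1agra"))  # now leans towards spam
```

Once a few messages containing a deliberately misspelled word have been marked as junk, the misspelling itself becomes a spam indicator, which is why this kind of evasion tends to stop working.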
The most advanced filters, notably Google's Gmail filter, also use spam reports from users to build a comprehensive, shared database that is used in conjunction with the Bayesian filter. Sharing such information can yield extremely positive results, although user privacy issues can sometimes deter software houses from adopting this solution.
Precise data on the actual effectiveness of the Bayesian filter are highly dependent on the implementation and the word database used, but the percentage of false positives in commercial software has been reported to be typically below 0.05%, while over 95% of spam messages are filtered successfully.
Sheldon M. Ross, A First Course in Probability (Pearson Education, Inc.)
Sahami, Dumais, Heckerman, Horvitz, "A Bayesian Approach to Filtering Junk E-mail"