Spam Filtering Techniques

How Antispam Software Detects and Deletes Junk Mail

© Dario Borghino

Spam messages cluttering a mailbox., Kai-Martin Knaak

Spam filters use a wide variety of techniques to fight unsolicited e-mail messages, but Bayesian spam filtering is by far the most effective.

The practice of flooding Internet forums and personal mailboxes with junk mail is an increasingly common form of advertising, mainly because it is highly cost-effective and easy to maintain. Since the mid-90s, ad hoc computer programs, commonly referred to as antispam filters, have therefore been developed to fight spam using a variety of techniques.

Common Ways Filtering Programs Identify a Spam Message

The techniques used by antispam software to identify unsolicited e-mails are typically simple. Experience has in fact shown that such messages are often one or a combination of the following:

Uncommon or misspelled words are inserted in a specific -- but often unsuccessful -- attempt to avoid Bayesian filtering, a complex but extremely powerful technique that has proven to be the most effective to date and is therefore widely implemented in modern antispam software.

What Is a Bayesian Spam Filter?

Bayesian spam filtering relies on the formula for conditional probability. Given two random events A and B, this formula -- called Bayes' Law -- makes it possible to calculate the probability of A occurring, given that B has already occurred. In a spam filter, A is the event "The current message is junk mail", while B is the event "The message contains these particular words", whose likelihood is estimated from how often those words appear in spam messages.
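As a minimal sketch of Bayes' Law itself, the function below computes P(A|B) from P(B|A), P(A) and P(B|not A). The function name, its parameter names and the numbers in the usage line are assumptions chosen for illustration; they are not taken from any particular filter.

```python
def posterior(p_b_given_a, p_a, p_b_given_not_a):
    """Return P(A|B) using Bayes' Law.

    P(B) is expanded with the law of total probability:
    P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    """
    p_not_a = 1.0 - p_a
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    if p_b == 0.0:
        raise ValueError("P(B) is zero, so P(A|B) is undefined")
    return p_b_given_a * p_a / p_b


# Illustrative numbers only: the words are nine times more likely to show up
# in spam than in legitimate mail, and half of all mail is assumed to be spam.
p_spam = posterior(p_b_given_a=0.9, p_a=0.5, p_b_given_not_a=0.1)
print(round(p_spam, 3))
```

Here A stands for "the message is junk mail" and B for "the message contains these words", so the returned value is the probability that a message with those words is spam.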

When analyzing a message, the filter parses one word at a time, searching its internal database for the probability P associated with that word. Such values are high (close to 90%) for words such as "buy", "watch", "investment" or "pharmacy", which are typically found in junk mail, and significantly lower for other terms. A cumulative value for all the words in the message is then computed and assigned to B, and this value is used in Bayes' Law to determine whether the message is spam.
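One common way to compute such a cumulative value is sketched below. It assumes that the words are independent of one another and that spam and legitimate mail are equally likely beforehand, which is a simplification and not necessarily what any given product does. The table WORD_SPAM_PROB, the NEUTRAL default for unknown words and the function names are hypothetical.

```python
import math
import re

# Hypothetical per-word spam probabilities; a real filter learns these from mail.
WORD_SPAM_PROB = {
    "buy": 0.9, "watch": 0.9, "investment": 0.9, "pharmacy": 0.9,
    "meeting": 0.1, "thanks": 0.1,
}
NEUTRAL = 0.5  # assumed value for words missing from the table


def tokenize(text):
    """Split a message into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())


def spam_score(text, table=WORD_SPAM_PROB):
    """Combine per-word probabilities into a single score between 0 and 1.

    Equivalent to prod(p) / (prod(p) + prod(1 - p)), computed with
    logarithms so that long messages do not underflow to zero.
    Table values must lie strictly between 0 and 1.
    """
    log_spam = 0.0
    log_ham = 0.0
    for word in tokenize(text):
        p = table.get(word, NEUTRAL)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Clamp the exponent so math.exp cannot overflow on very long messages.
    diff = min(log_ham - log_spam, 700.0)
    return 1.0 / (1.0 + math.exp(diff))


print(spam_score("Buy pharmacy investment") > 0.9)
print(spam_score("Thanks for the meeting") < 0.1)
```

With a neutral default of 0.5, unknown words leave the score unchanged, which is why the choice of default matters when spammers pad messages with unusual words.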

By inserting unusual or misspelled words (Bayesian poisoning), spammers aim to avoid raising the value of B to a suspicious level and having their message filtered. In HTML messages, such words are often colored white so that they blend into the background and users do not notice them. However, they can still be identified when parsing the HTML code of the message.
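The sketch below shows one way white-on-white text could be picked out while parsing the HTML, using Python's standard html.parser module. It assumes a white page background and only recognizes a few spellings of white; the class name HiddenTextFinder and the helper find_hidden_words are hypothetical.

```python
import re
from html.parser import HTMLParser

WHITE_VALUES = {"white", "#fff", "#ffffff"}
# Matches "color: white" in a style attribute, but not "background-color: white".
STYLE_WHITE = re.compile(r"(?<![-\w])color\s*:\s*(?:white|#fff(?:fff)?)\b",
                         re.IGNORECASE)
# Tags that never have a closing tag, so they must not touch the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HiddenTextFinder(HTMLParser):
    """Collect words separately for visible text and white-colored text."""

    def __init__(self):
        super().__init__()
        self._stack = []   # one flag per open element: is its text white?
        self.visible = []
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr = dict(attrs)
        color = (attr.get("color") or "").strip().lower()
        style = attr.get("style") or ""
        is_white = color in WHITE_VALUES or bool(STYLE_WHITE.search(style))
        # Simplification: children of a white element are treated as white too.
        inherited = bool(self._stack) and self._stack[-1]
        self._stack.append(is_white or inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        in_white = bool(self._stack) and self._stack[-1]
        (self.hidden if in_white else self.visible).extend(data.split())


def find_hidden_words(html):
    """Return (visible_words, hidden_words) for an HTML fragment."""
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    return finder.visible, finder.hidden


sample = '<p>Cheap pills <span style="color: #FFFFFF">xylophone quasar</span><br>buy now</p>'
print(find_hidden_words(sample))
```

A filter could then ignore the hidden words when scoring the message, or treat their presence as a spam signal in its own right.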

Bayesian Filter Effectiveness

The main reason why attempts to avoid a Bayesian filter are largely unsuccessful is that its word database is easy to modify and update, which makes the filter highly adaptive.
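To illustrate what updating the word database can look like, here is a minimal sketch that keeps per-word counts for spam and legitimate mail and derives a spam probability from them. The class WordDatabase, its method names and the add-one smoothing are assumptions made for this example, not a description of a specific product.

```python
import re
from collections import Counter


class WordDatabase:
    """Per-word message counts, updated every time a message is labeled."""

    def __init__(self):
        self.spam_messages = 0
        self.ham_messages = 0
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def train(self, text, is_spam):
        """Record one labeled message; each word counts once per message."""
        words = set(re.findall(r"[a-z']+", text.lower()))
        if is_spam:
            self.spam_messages += 1
            self.spam_counts.update(words)
        else:
            self.ham_messages += 1
            self.ham_counts.update(words)

    def spam_probability(self, word):
        """Estimate P(spam | word), assuming spam and legitimate mail are equally likely.

        Add-one smoothing keeps the estimate away from 0 and 1 for rare words.
        """
        word = word.lower()
        p_word_spam = (self.spam_counts[word] + 1) / (self.spam_messages + 2)
        p_word_ham = (self.ham_counts[word] + 1) / (self.ham_messages + 2)
        return p_word_spam / (p_word_spam + p_word_ham)


db = WordDatabase()
db.train("Buy pharmacy now", is_spam=True)
db.train("Meeting at noon", is_spam=False)
before = db.spam_probability("xqzt")   # made-up word, not seen yet
db.train("xqzt pharmacy deals", is_spam=True)
after = db.spam_probability("xqzt")    # the same word after one spam report
print(before, after)
```

In this sketch a made-up word starts out neutral, but once a message containing it is reported as spam its probability rises, so reusing the same padding words works against the spammer.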

The most advanced filters -- such as Google's Gmail filter -- also use user spam reports to build a comprehensive, shared database that works in conjunction with the Bayesian filter. Sharing such information can yield extremely positive results, although user privacy concerns can sometimes deter software vendors from adopting this solution.

Precise data on the actual effectiveness of a Bayesian filter depend heavily on the implementation and the word database used, but the percentage of false positives in commercial software has been reported to be typically below 0.05%, while over 95% of spam messages are filtered successfully.
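For readers who want to measure these two figures on their own mail, the sketch below shows one way to compute them from a labeled test set. It assumes the false positive percentage is measured against legitimate messages only, which is one possible reading; the function name is hypothetical and the counts in the usage line are made up for illustration, not measured results.

```python
def filter_rates(true_pos, false_neg, false_pos, true_neg):
    """Return (spam_catch_rate, false_positive_rate) as fractions.

    true_pos:  spam messages correctly filtered
    false_neg: spam messages that reached the inbox
    false_pos: legitimate messages wrongly filtered
    true_neg:  legitimate messages correctly delivered
    """
    spam_total = true_pos + false_neg
    ham_total = false_pos + true_neg
    if spam_total == 0 or ham_total == 0:
        raise ValueError("need at least one spam and one legitimate message")
    return true_pos / spam_total, false_pos / ham_total


# Made-up counts, for illustration only.
catch_rate, fp_rate = filter_rates(true_pos=960, false_neg=40,
                                   false_pos=1, true_neg=3999)
print(f"{catch_rate:.1%} of spam filtered, {fp_rate:.3%} false positives")
```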

Sources

Sheldon M. Ross, A First Course in Probability (Pearson Education, Inc.)

Sahami, Dumais, Heckerman, Horvitz, "A Bayesian Approach to Filtering Junk E-mail"


The copyright of the article Spam Filtering Techniques in Internet Security is owned by Dario Borghino. Permission to republish Spam Filtering Techniques must be granted by the author in writing.

