E-mail security: detecting spam (III)

Before talking about other methods for detecting spam, let's take a closer look at Bayesian filters and the programs that use this technique to classify mail. This will be a technical post, so it might not interest all of you. In the next posts we'll look at some software that uses these filters.

I'm not a mathematician, so I might make a few errors when trying to explain the theory behind these filters. Please forgive me. The Wikipedia article on Bayesian spam filtering explains it better than I can.

The main formula on which Bayesian filtering rests is Bayes' theorem:

Pr(spam|words) = (Pr(words|spam) × Pr(spam)) / Pr(words)

which says that the probability of an e-mail being spam, given the words it contains, is equal to the probability of those words appearing in a spam message, multiplied by the probability of any message being spam, divided by the probability of those words appearing in any message.
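
As a quick sanity check with made-up numbers: suppose 80% of our mail is spam, so Pr(spam) = 0.8; the words of a given message appear in 5% of spam messages, so Pr(words|spam) = 0.05; and those words appear in 4.5% of all messages, so Pr(words) = 0.045. Then Pr(spam|words) = (0.05 × 0.8) / 0.045 ≈ 0.89, so this particular message is very likely spam.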

Wow, it looks quite complicated. One of the best-known papers about this kind of filtering is A Plan for Spam by Paul Graham. We'll look at some code from it.

Well, to calculate this result we first need to break the message into words, called tokens, from which the probabilities are taken. This tokenization is really important, as it affects the final result depending on how it is done. If we have the sentence It's a shame, we could break it into the words It-s-a-shame, or maybe Its-a-shame, or even It's-a-shame, and each choice might give different results when used.
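
As an illustration, here is a minimal tokenizer sketch in C++. The function name and the separator rule are my own choices, not from any of the filters discussed here; Graham, for instance, also keeps dashes, apostrophes and dollar signs as part of tokens:

#include <cctype>
#include <string>
#include <vector>

// Naive tokenizer: anything that is not a letter or a digit separates
// tokens. With this rule "It's a shame" becomes It-s-a-shame; a rule
// that keeps apostrophes would give It's-a-shame instead.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current += c;
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        tokens.push_back(current);
    return tokens;
}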

Once the message is broken into tokens, we can calculate Pr(word|spam) with the following code (the original is in Lisp):

(let ((g (* 2 (or (gethash word good) 0)))
      (b (or (gethash word bad) 0)))
  (unless (< (+ g b) 5)
    (max .01
         (min .99 (float (/ (min 1 (/ b nbad))
                            (+ (min 1 (/ g ngood))
                               (min 1 (/ b nbad)))))))))

A few details of this code are worth noting: good counts are doubled to bias the filter against false positives, tokens seen fewer than 5 times in total are ignored, and the result is clamped between 0.01 and 0.99. Once we have calculated the probabilities for all the tokens in the message, we take the most relevant ones (those whose probability is furthest from 0.5, that is, closest to 0 or 1). Paul decided to use the 15 most relevant tokens, storing them in a list called probs and applying the following formula to it:

(let ((prod (apply #'* probs)))
  (/ prod (+ prod (apply #'* (mapcar #'(lambda (x)
                                         (- 1 x))
                                     probs)))))

If the result is bigger than 0.9 we consider the e-mail spam and classify it as such. So, although the theory may look hard, once implemented it is far easier. Maybe the only problem with this code is that it's Lisp, which not many people know.
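
For readers who don't know Lisp, here is one possible C++ transcription of Graham's two snippets. It is only a sketch: the names (wordProbability, combine, the maps good and bad) are mine, not Graham's, but the arithmetic follows the Lisp above:

#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// good/bad map each token to the number of times it was seen in
// non-spam/spam; ngood/nbad are the number of messages of each kind
// used for training.
double wordProbability(const std::string& word,
                       const std::unordered_map<std::string, int>& good,
                       const std::unordered_map<std::string, int>& bad,
                       int ngood, int nbad) {
    auto count = [](const std::unordered_map<std::string, int>& table,
                    const std::string& w) {
        auto it = table.find(w);
        return it == table.end() ? 0 : it->second;
    };
    double g = 2.0 * count(good, word);  // good counts doubled
    double b = count(bad, word);
    if (g + b < 5)      // too rare to be meaningful; Graham's (unless ...)
        return -1.0;    // returns nil here, we use -1 as "no opinion"
    double spamFreq = std::min(1.0, b / nbad);
    double hamFreq  = std::min(1.0, g / ngood);
    return std::max(0.01, std::min(0.99, spamFreq / (hamFreq + spamFreq)));
}

// Combined probability of the 15 most interesting tokens.
double combine(const std::vector<double>& probs) {
    double prod = 1.0, complement = 1.0;
    for (double p : probs) {
        prod       *= p;
        complement *= 1.0 - p;
    }
    return prod / (prod + complement);  // above 0.9 => spam
}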

Let's make this even easier by looking at the source code of Mozilla Thunderbird, the famous open-source mail reader, which includes a Bayesian module to classify mail. Thunderbird's implementation is slightly different from the original, but the concept remains the same.

The algorithm is implemented in the file mozilla/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp, in the function classifyMessage. It's written in C++, but we'll look at it in "pseudo-code". It uses several variables:

  • mGoodCount: number of non-spam messages classified
  • mBadCount: number of spam messages classified
  • mGoodTokens: hash table with good tokens and number of times they have appeared
  • mBadTokens: hash table with spam tokens and number of times they have appeared

Take care, as the same token might appear in both hash tables with different counts. For example, the word hello is about equally probable in spam and non-spam messages. When the algorithm has not yet been trained, default values are assigned:

if (mGoodCount == 0 || mGoodTokens.count() == 0)
    message is spam
if (mBadCount == 0 || mBadTokens.count() == 0)
    message is not spam

If the algorithm has been trained, then the following formula is applied to each token (adapted from Bugzilla):

for each token {
    hamcount  = number of token appearances in non-spam
    spamcount = number of token appearances in spam
    hamratio  = hamcount / mGoodCount
    spamratio = spamcount / mBadCount

    prob = spamratio / (hamratio + spamratio)

    // smooth prob toward 0.5 when the token has few observations
    n = hamcount + spamcount
    prob = (0.225 + n * prob) / (0.45 + n)

    distance = abs(prob - 0.5)
    if (distance >= 0.1) {
        token.distance = distance
        token.prob = prob
    }
}

With this code we have a probability for each token. The tokens are saved in a list sorted by distance (how far each probability is from 0.5) and the first 150 elements are taken. A chi-squared (χ²) probability distribution is then calculated over those probabilities, and if the result is greater than or equal to 0.9 the message is classified as spam.
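
To get a feel for those last steps, here is a rough C++ sketch of the selection and the chi-squared combination. The chi2Q series and the S/H combination follow Gary Robinson's well-known chi-squared method (the one SpamBayes popularized); the names are mine and the exact details of Thunderbird's implementation differ, so take this only as an approximation of the idea:

#include <algorithm>
#include <cmath>
#include <vector>

struct Token {
    double prob;      // per-token spam probability from the loop above
    double distance;  // abs(prob - 0.5)
};

// Probability that a chi-squared variable with df degrees of freedom
// (df must be even) is >= x2, computed with the standard series.
double chi2Q(double x2, int df) {
    double m = x2 / 2.0;
    double sum = std::exp(-m), term = sum;
    for (int i = 1; i < df / 2; ++i) {
        term *= m / i;
        sum += term;
    }
    return std::min(sum, 1.0);
}

double classify(std::vector<Token> tokens) {
    // Keep the 150 tokens whose probability is furthest from 0.5.
    std::sort(tokens.begin(), tokens.end(),
              [](const Token& a, const Token& b) {
                  return a.distance > b.distance;
              });
    if (tokens.size() > 150)
        tokens.resize(150);

    // Robinson's chi-squared combination: S approaches 1 when the
    // token probabilities look spammy, H when they look like ham.
    double lnSpam = 0.0, lnHam = 0.0;
    for (const Token& t : tokens) {
        lnSpam += std::log(1.0 - t.prob);
        lnHam  += std::log(t.prob);
    }
    int n = static_cast<int>(tokens.size());
    double S = 1.0 - chi2Q(-2.0 * lnSpam, 2 * n);
    double H = 1.0 - chi2Q(-2.0 * lnHam,  2 * n);
    return (S - H + 1.0) / 2.0;  // >= 0.9 => classify as spam
}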

But we don't need to know any of this unless we want to write a filter ourselves. There are lots of filters already available which work quite well and reach detection rates around 99%, sometimes even better than a human.



