<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Becoming paranoid &#187; Spam</title>
	<atom:link href="http://becomingparanoid.com/category/spam/feed/" rel="self" type="application/rss+xml" />
	<link>http://becomingparanoid.com</link>
	<description>Tips about computer security, privacy and staying safe online</description>
	<lastBuildDate>Wed, 03 Oct 2007 13:03:29 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.5</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Microsoft Word vulnerability</title>
		<link>http://becomingparanoid.com/2006/05/22/microsoft-word-vulnerability/</link>
		<comments>http://becomingparanoid.com/2006/05/22/microsoft-word-vulnerability/#comments</comments>
		<pubDate>Mon, 22 May 2006 10:49:51 +0000</pubDate>
		<dc:creator>madelman</dc:creator>
				<category><![CDATA[Medium]]></category>
		<category><![CDATA[Spam]]></category>
		<category><![CDATA[Windows]]></category>

		<guid isPermaLink="false">http://becomingparanoid.com/2006/05/22/microsoft-word-vulnerability/</guid>
		<description><![CDATA[Some years ago, macro viruses inside documents became the new trend. Almost any new virus used this, hiding inside Office documents and executing when the unsuspecting user opened the file.
Most users became aware of the risk and disabled macros, so the viruses could no longer run, and many mail providers blocked e-mails with attached Office documents.
This [...]]]></description>
			<content:encoded><![CDATA[<p>Some years ago, macro viruses inside documents became the new trend. Almost any new virus used this, hiding inside Office documents and executing when the unsuspecting user opened the file.</p>
<p>Most users became aware of the risk and disabled macros, so the viruses could no longer run, and many mail providers blocked e-mails with attached Office documents.</p>
<p>This is not the case anymore, as macro viruses are very rare now, but a recent Word vulnerability has made DOC files dangerous again. This time the problem is not macros inside the document, but a vulnerability that allows malicious code to run when the document is opened.</p>
<p>There is no patch for this vulnerability yet, as Microsoft won&rsquo;t release one until June, so you should be extremely careful with the documents you receive, especially if they are unexpected.</p>
<p>For now, this doesn&rsquo;t seem too widespread: only one attack has been detected, against a single company, and it was a highly targeted one aimed specifically at them. Still, it wouldn&rsquo;t be surprising to find it in the wild within a few days.</p>
]]></content:encoded>
			<wfw:commentRss>http://becomingparanoid.com/2006/05/22/microsoft-word-vulnerability/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>E-mail security: avoiding spam</title>
		<link>http://becomingparanoid.com/2006/04/19/e-mail-security-avoiding-spam/</link>
		<comments>http://becomingparanoid.com/2006/04/19/e-mail-security-avoiding-spam/#comments</comments>
		<pubDate>Wed, 19 Apr 2006 10:35:26 +0000</pubDate>
		<dc:creator>madelman</dc:creator>
				<category><![CDATA[Medium]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Spam]]></category>

		<guid isPermaLink="false">http://becomingparanoid.com/2006/04/19/e-mail-security-avoiding-spam/</guid>
		<description><![CDATA[Continuing&#160;the series of articles about spam: last time I wrote about detecting spam by analyzing its content. This usually works well, but it wastes the resources of the user receiving the spam, who has to download the mail (mostly free if you are on a residential line, but possibly expensive [...]]]></description>
			<content:encoded><![CDATA[<p>Continuing&nbsp;the series of articles about spam: last time I wrote about detecting spam by analyzing its content. This usually works well, but it wastes the resources of the user receiving the spam, who has to download the mail (mostly free if you are on a residential line, but possibly expensive if you are on the road) and analyze it (spending computer time).</p>
<p>It would be better if the server could stop these messages before they are delivered. Although some mail servers analyze the content of a message before delivering it, other techniques have been proposed to fight spam. Some of them are even standards, but they haven&rsquo;t been widely deployed yet. Let&rsquo;s have a look at their advantages and disadvantages.</p>
<p><span id="more-58"></span></p>
<p><strong><img alt="Avoidspam" src="http://becomingparanoid.com/images/avoidspam_small.jpg" align="left" border="0" />DNS blacklist. </strong>This is one of the oldest and best-known techniques. It tries to stop spam from being delivered by checking whether the computer sending it is a &ldquo;probable&rdquo; spammer. The server looks up the sender&rsquo;s IP address in an online service (for example, MAPS or ORBS) and, if the address is listed, the mail is rejected (or at least flagged as suspicious).</p>
<p>There are various online services, each listing IP addresses according to different criteria. For example, some of them list dynamic and dial-up IP addresses, which should normally send their mail through their provider&rsquo;s server. Others only list IP addresses which have sent spam in the past.</p>
<p>Some controversy has built up around these services because IP addresses have sometimes been added in error, which leaves a legitimate server unable to send mail to anyone who uses this filtering system. Furthermore, spammers try to take some of these sites down, usually with DDoS attacks, so that users can&rsquo;t check them.</p>
<p>You can find more technical information about <a href="http://en.wikipedia.org/wiki/DNSBL">DNS Blacklists</a>&nbsp;on Wikipedia.</p>
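<p>As a rough sketch of the lookup described above: a DNS blacklist is normally queried by reversing the octets of the sender&rsquo;s IPv4 address and appending the zone name of the list; if that name resolves, the address is listed. The zone <code>dnsbl.example.org</code> and the function name are placeholders of mine, not a real service, and the snippet only builds the name to look up (it performs no network query).</p>

```python
import ipaddress


def dnsbl_query_name(ip, zone="dnsbl.example.org"):
    """Build the DNS name a mail server would look up for an IPv4 sender.

    The zone is a hypothetical placeholder; substitute the zone of the
    blacklist you actually use.
    """
    # Validate the address first; raises ValueError on malformed input.
    address = ipaddress.IPv4Address(ip)
    # Reverse the four octets: 192.0.2.99 becomes 99.2.0.192
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return reversed_octets + "." + zone


print(dnsbl_query_name("192.0.2.99"))
```

<p>A server would then resolve the resulting name and treat a successful answer as &ldquo;listed&rdquo;.</p>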
<p><strong><img alt="Avoidspam2" src="http://becomingparanoid.com/images/avoidspam2_small.jpg" align="right" border="0" />Greylisting. </strong>When a server that uses greylisting receives a new mail, it checks the sender&rsquo;s IP address, the sender&rsquo;s mail address and the recipient&rsquo;s mail address against a local database. If this combination has been seen before, the mail is delivered. If it hasn&rsquo;t, the message is rejected with a &ldquo;Try later&rdquo; message.</p>
<p>This works because most spammers never retry, whereas a legitimate mail server will try again after a short time, and on that retry the mail will be accepted. This can be a really powerful technique as long as spammers don&rsquo;t adapt to it (if many servers use it, spammers will eventually start retrying too).</p>
<p>The disadvantage of this technique is that it delays all messages coming from unknown sources, spam or not, which might not suit everyone, especially online businesses. Moreover, if the sending server is not configured correctly it might never retry. </p>
<p>This can be combined with whitelisting (addresses which are always accepted) and blacklisting (addresses which are always rejected).</p>
<p>More information about <a href="http://en.wikipedia.org/wiki/Greylisting">greylisting</a>&nbsp;on Wikipedia and links to <a href="http://projects.puremagic.com/greylisting/links.html">implementations for different servers</a>.</p>
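<p>The greylisting decision can be sketched in a few lines. This is a toy in-memory version, not the code of any real server: the class name, the return values and the minimum retry delay of 300 seconds are assumptions made for illustration.</p>

```python
class Greylist:
    """Toy in-memory greylist keyed by (sender IP, sender, recipient)."""

    def __init__(self, min_delay=300):
        # min_delay: seconds a new triplet must wait before a retry is
        # accepted (an illustrative value, not taken from any real server).
        self.min_delay = min_delay
        self.first_seen = {}

    def check(self, ip, sender, recipient, now):
        """Return "accept" or "tempfail" for a delivery attempt at time now."""
        triplet = (ip, sender, recipient)
        if triplet not in self.first_seen:
            # Unknown combination: remember it and ask the sender to try later.
            self.first_seen[triplet] = now
            return "tempfail"
        if now - self.first_seen[triplet] >= self.min_delay:
            # The sender came back after the delay, as a real mail server would.
            return "accept"
        # Retried too quickly: keep deferring.
        return "tempfail"


greylist = Greylist(min_delay=300)
print(greylist.check("192.0.2.10", "alice@example.com", "bob@example.org", now=0))
print(greylist.check("192.0.2.10", "alice@example.com", "bob@example.org", now=600))
```

<p>Whitelists and blacklists would simply be checked before this step.</p>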
<p><strong><img alt="Avoidspam3" src="http://becomingparanoid.com/images/avoidspam3_small.jpg" align="left" border="0" />SPF and DomainKeys. </strong>Many spammers put a fake mail address as the sender of the message and send the mail from hacked machines or open relays. To solve this, one can publish a list of the IP addresses allowed to send mail for a domain. This is how SPF works: a DNS record is added which lists the IP addresses authorized to send mail for that domain. This solution is really simple, but it requires the &ldquo;big players&rdquo; to use it in their mail servers, which hasn&rsquo;t happened yet.</p>
<p>DomainKeys is a similar technique proposed by Yahoo which uses cryptography to authenticate the message and check that it comes from the mail server it claims. One possible disadvantage of this method is that checking the cryptographic signatures requires more resources, although this shouldn&rsquo;t be a problem on servers with a low number of users. Also, like SPF, it has to be implemented by most mail servers to be really useful.</p>
<p>More information about <a href="http://en.wikipedia.org/wiki/Sender_Policy_Framework">SPF</a>&nbsp;on Wikipedia or at their <a href="http://www.openspf.org/">homepage</a>.&nbsp;Also about <a href="http://en.wikipedia.org/wiki/DomainKeys">DomainKeys</a>&nbsp;on Wikipedia or at <a href="http://antispam.yahoo.com/domainkeys">Yahoo</a>.</p>
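<p>To make the SPF idea concrete, here is a minimal sketch of the check a receiving server could do once it has fetched the record for the sending domain. The record and the function name are hypothetical, and only the <code>ip4:</code> mechanism is handled; real SPF records can use other mechanisms that this sketch ignores.</p>

```python
import ipaddress


def spf_ip4_allows(record, sender_ip):
    """Return True if sender_ip matches an ip4: mechanism in the record.

    Only the ip4: mechanism is handled; real SPF records may also use
    mechanisms such as a, mx and include, which this sketch ignores.
    """
    address = ipaddress.ip_address(sender_ip)
    for term in record.split():
        if term.startswith("ip4:"):
            # A bare address such as 198.51.100.25 is treated as a /32 network.
            network = ipaddress.ip_network(term[len("ip4:"):], strict=False)
            if address in network:
                return True
    return False


# Hypothetical record for an example domain.
record = "v=spf1 ip4:192.0.2.0/24 ip4:198.51.100.25 -all"
print(spf_ip4_allows(record, "192.0.2.77"))
print(spf_ip4_allows(record, "203.0.113.5"))
```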
]]></content:encoded>
			<wfw:commentRss>http://becomingparanoid.com/2006/04/19/e-mail-security-avoiding-spam/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>E-mail security: detecting spam (V)</title>
		<link>http://becomingparanoid.com/2006/04/05/e-mail-security-detecting-spam-v/</link>
		<comments>http://becomingparanoid.com/2006/04/05/e-mail-security-detecting-spam-v/#comments</comments>
		<pubDate>Wed, 05 Apr 2006 22:44:46 +0000</pubDate>
		<dc:creator>madelman</dc:creator>
				<category><![CDATA[Email]]></category>
		<category><![CDATA[Medium]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Spam]]></category>

		<guid isPermaLink="false">http://becomingparanoid.com/2006/04/05/e-mail-security-detecting-spam-v/</guid>
		<description><![CDATA[We saw some techniques spammers use to try to evade Bayesian spam filters and how the use of these techniques is making spam a bit less effective and, sometimes, even easier to detect.
But spammers know this and they won&#8217;t allow their business to go down so easily. So what is the future of filter [...]]]></description>
			<content:encoded><![CDATA[<p>We saw some techniques spammers use <a href="http://becomingparanoid.com/2006/03/29/e-mail-security-detecting-spam-ii/">to try to evade Bayesian spam filters</a> and how the use of these techniques is making spam a bit less effective and, sometimes, even easier to detect.</p>
<p>But spammers know this and they won&#8217;t allow their business to go down so easily. So what is the future of filter evasion? I have been thinking about some techniques which would probably evade most current filters, and perhaps it&#8217;s time to prepare against them before it&#8217;s too late.</p>
<p>The idea for this list came from a post by <a href="http://vivekjishtu.blogspot.com/2006/03/beware-of-new-form-of-spam-greetings.html">Vivek Jishtu</a> where he explains how a spammer is using the Yahoo greeting cards to send his messages without being detected by filters. This service allows anyone to send a card to someone, who will be notified by e-mail and will receive a link to go to a site to view the card. In this card, the spammer can include arbitrary content so he can put his spam message there and as this will not pass through any filter it won&#8217;t be detected. So, if the user receiving the card visits the link he will see this (translated from Chinese by Google Translator):</p>
<p><center><br />
<img src="http://becomingparanoid.com/images/spamgreeting.png"><br />
</center><br />
<span id="more-51"></span><br />
With another link to the site the spammer is promoting. This is a neat trick and a difficult one to avoid. The only solution is to educate users not to follow links that arrive in unexpected mails or from unknown sources.</p>
<p>But there are also other methods that spammers might use now or in the future (I&#8217;m not aware that any of these is currently in use, but you never know). </p>
<p>The first technique is copied from viruses or worms which have used this for a long time. Instead of sending the content of the spam in the main body of the message, <strong>a ZIP file can be attached containing a text file with the advertisement</strong> from the spammer. If this becomes popular, Bayesian spam filters might be unable to detect it as the analyzed content can have no malicious word and can look innocuous. To be able to analyze the spam, the filter should decompress the ZIP file and search for text files inside it. This also can be avoided with another technique coming from the virus world, the use of ZIP files protected with a password, like the <a href="http://www.f-secure.com/v-descs/bagle_j.shtml">Bagle-J</a> virus did. The user is told to open the ZIP file using a password contained in the main body, so the filter won&#8217;t be able to decompress the file but the user will.</p>
<p>Another technique, similar to the use of images instead of text, is <strong>sending their advertisements in attached files in some popular file format</strong>, like PDF or Microsoft Word files. Again, the content of the main body might be totally innocuous, asking the user to open the attached file. The filter will need to understand the file format to be able to extract the text and analyze it, which will consume resources from the computer, something sometimes not feasible in servers with lots of users. </p>
<p>These two techniques can be stopped by disallowing attached files or, at least, restricting the accepted formats, as some servers already do to keep viruses out. We can also educate users not to open attached files coming from unknown sources, although I doubt this will work, judging by the spread of some viruses which work this way.</p>
<p>Spammers could even do another loop and send their spam inside a PDF file compressed in a ZIP file protected by a password&#8230; OK, enough, enough,&#8230;</p>
<p>I don&#8217;t know if any of these or similar techniques will be used by spammers in the near future. If they do use them it will be harder to filter the spam but, at the same time, it will mean we are winning a battle in this war against spam. We had better be prepared before it&#8217;s too late.</p>
<p>Of course, Bayesian filtering is not the only way to detect spam, although we have been concentrating on it. There are other techniques currently in use, which probably might be more effective against these new attacks and we&#8217;ll see them in another post.</p>
]]></content:encoded>
			<wfw:commentRss>http://becomingparanoid.com/2006/04/05/e-mail-security-detecting-spam-v/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>E-mail security: detecting spam (IV)</title>
		<link>http://becomingparanoid.com/2006/04/03/e-mail-security-detecting-spam-iv/</link>
		<comments>http://becomingparanoid.com/2006/04/03/e-mail-security-detecting-spam-iv/#comments</comments>
		<pubDate>Mon, 03 Apr 2006 09:15:01 +0000</pubDate>
		<dc:creator>madelman</dc:creator>
				<category><![CDATA[Beginner]]></category>
		<category><![CDATA[Email]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Spam]]></category>

		<guid isPermaLink="false">http://becomingparanoid.com/2006/04/03/e-mail-security-detecting-spam-iv/</guid>
		<description><![CDATA[Knowing how Bayesian filtering works we will try to find some programs which use it and see which is the most useful one for us. I&#8217;ll give a list and you should choose the most appropriate for you.
We can split the filtering programs depending on where they work: on the server or on the client. [...]]]></description>
			<content:encoded><![CDATA[<p>Knowing how Bayesian filtering works we will try to find some programs which use it and see which is the most useful one for us. I&#8217;ll give a list and you should choose the most appropriate for you.</p>
<p>We can split the filtering programs depending on where they work: on the server or on the client. Programs working on the server have some advantages: as they look at more mail messages (they see mail from all users in a system), it is easier and faster to train them. Furthermore, there is only one place to administer them, making the administrator&#8217;s task easier. At the same time, the users don&#8217;t need to receive the spam, so they don&#8217;t spend additional bandwidth and time. On the other hand, they are less customizable by the user, who might prefer his own techniques to detect spam and false positives, and if the user doesn&#8217;t have access to the server he will not be able to install one.</p>
<p>One of the best-known server-side filters is <a href="http://spamassassin.apache.org/">SpamAssassin</a>, which uses different checks to test for spam, one of them being Bayesian filtering. Each of these tests adds to or subtracts from the mail&#8217;s score, and at the end of the run this score determines whether the mail is spam or not. Among others, these tests include mail-header tests, text-content rules, white-lists and black-lists, and collaborative databases, making this program one of the most accurate. It can also be used for client-side filtering, although the installation will not be as easy as with other programs.<br />
<span id="more-50"></span><br />
Another approach to server-side filtering is the one used by <a href="http://assp.sourceforge.net/">ASSP</a>, which works with any kind of mail server, as it sits as a proxy (receiving the data and passing it on) in front of the real mail server and filters the data before it is delivered. It also uses Bayesian filtering and allows white-lists to be set up so you can define addresses which will always be accepted. It can also scan messages for viruses, which will drop even more malicious mail.</p>
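<p>The additive scoring described above for SpamAssassin can be illustrated with a small sketch. The rule names, the scores and the threshold below are all made up for this example and are not the actual rules of SpamAssassin; the point is only that every matching test adds or subtracts points and the total decides.</p>

```python
# Hypothetical rules: (name, test, score). Negative scores favour legitimate mail.
RULES = [
    ("SUBJECT_ALL_CAPS", lambda m: m["subject"].isupper(), 2.0),
    ("MENTIONS_FREE_MONEY", lambda m: "free money" in m["body"].lower(), 3.5),
    ("SENDER_WHITELISTED", lambda m: m["sender"] in {"friend@example.com"}, -5.0),
]

THRESHOLD = 5.0  # illustrative cut-off chosen for this sketch


def score_message(message):
    """Sum the scores of every rule whose test matches the message."""
    return sum(score for _name, test, score in RULES if test(message))


def is_spam(message):
    """A message is flagged when its total score reaches the threshold."""
    return score_message(message) >= THRESHOLD


mail = {
    "subject": "BUY NOW",
    "body": "Get FREE MONEY today",
    "sender": "stranger@example.net",
}
print(score_message(mail), is_spam(mail))
```

<p>The same message sent from the white-listed address would lose five points and fall below the threshold, which is how white-lists and content tests combine in one score.</p>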
<p>The last server-side software we are going to see is <a href="http://dspam.nuclearelephant.com/">DSpam</a>. It has some characteristics that set it apart from other Bayesian filters. Here the tokens are analyzed not only one by one but also in pairs, which gives a better picture of whether a mail is spam or not. It also uses Bayesian Noise Reduction and other new approaches to filtering, which promise a high detection rate. It includes a web-based interface to administer it, where each user can train it according to personal taste.</p>
<p>If we don&#8217;t have access to the server or we don&#8217;t want to play with it, we can use a client-based filter installed on our computer, which will analyze the mail once it has been downloaded (or while downloading) and flag it as spam or legitimate mail. The advantage of this approach is that it can be tightly integrated into our mail reader, so it might be easier to use.</p>
<p>If we use <a href="http://www.mozilla.com/thunderbird/">Thunderbird</a>, it already includes a filter, as we saw in <a href="http://becomingparanoid.com/2006/03/30/e-mail-security-detecting-spam-iii/">the last post about spam</a>. This is really easy to use, as we only have to click a button to tell it if we think a message is spam and once it is trained it will move automatically all spam to a predefined folder, or can even delete it automatically (I don&#8217;t recommend it in case of false positives).</p>
<p>If we use Outlook instead of Thunderbird, one good option is <a href="http://spambayes.sourceforge.net/">SpamBayes</a>. This software also uses some new approaches which are explained in the <a href="http://spambayes.sourceforge.net/background.html">background page</a>. One interesting characteristic of SpamBayes is that it doesn&#8217;t have only two states, spam and non-spam, but also a third one, unsure, for when it doesn&#8217;t know how to classify a message. This way, we can choose what we want to do with it: keep it, delete it or use it to train the program. Although it includes a plugin for using it with Outlook, it can also be used with other programs as a proxy, and even on other operating systems like Linux or Mac OS X.</p>
<p>To finish this list, we are going to have a look at one of the first mail filters I used. It&#8217;s called <a href="http://popfile.sourceforge.net">POPFile</a> and works as a proxy in front of the mail server. Our mail client will connect to POPFile and POPFile will connect to the mail server, analyzing the mail as it downloads. One of the things I like most about it is the ability to classify any kind of e-mail, not only spam. So, POPFile can distinguish between work-related mail, mail from our children,&#8230; or any other different classification we want to do. We only have to create the categories and assign some messages to each one to train it and it will classify the received e-mails. It also has a web-based interface to manage all of this.</p>
<p>The list of spam filters is quite long and this is only a selection of some of them. You will have to see which one fits you better and use it. Remember to always train it correctly before you do automatic actions on the mail received as you could lose some mails if you don&#8217;t do it correctly.</p>
]]></content:encoded>
			<wfw:commentRss>http://becomingparanoid.com/2006/04/03/e-mail-security-detecting-spam-iv/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>E-mail security: detecting spam (III)</title>
		<link>http://becomingparanoid.com/2006/03/30/e-mail-security-detecting-spam-iii/</link>
		<comments>http://becomingparanoid.com/2006/03/30/e-mail-security-detecting-spam-iii/#comments</comments>
		<pubDate>Thu, 30 Mar 2006 10:23:23 +0000</pubDate>
		<dc:creator>madelman</dc:creator>
				<category><![CDATA[Advanced]]></category>
		<category><![CDATA[Email]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Spam]]></category>

		<guid isPermaLink="false">http://becomingparanoid.com/2006/03/30/e-mail-security-detecting-spam-iii/</guid>
		<description><![CDATA[Before talking about other methods for detecting spam, let&#8217;s have a closer look at Bayesian filters and programs using this technique to classify mail. This will be a technical post, so it might not interest all of you. In the next posts we&#8217;ll see some software which uses these filters.
I&#8217;m not a mathematician, so I [...]]]></description>
			<content:encoded><![CDATA[<p>Before talking about other methods for detecting spam, let&rsquo;s have a closer look at Bayesian filters and programs using this technique to classify mail. This will be a technical post, so it might not interest all of you. In the next posts we&rsquo;ll see some software which uses these filters.</p>
<p>I&rsquo;m not a mathematician, so I might make a few errors when trying to explain the theory behind the filters. Please forgive me. The article in <a href="http://en.wikipedia.org/wiki/Bayesian_filtering">Wikipedia</a>&nbsp;explains it better than I can.</p>
<p>The main formula on which Bayesian filtering rests is:</p>
<p><img alt="Bayes1" src="http://becomingparanoid.com/images/bayes1.png" border="0" /></p>
<p>which says that the probability of an e-mail being spam given the words contained in it is equal to the probability of these words appearing in a spam message, multiplied by the probability of a message being spam divided by the probability of the words appearing in any message.</p>
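<p>A quick numeric example may help. The probabilities below are invented purely to show the arithmetic, and the function name is mine:</p>

```python
def posterior_spam(p_words_given_spam, p_spam, p_words):
    """Bayes' rule: Pr(spam | words) = Pr(words | spam) * Pr(spam) / Pr(words)."""
    return p_words_given_spam * p_spam / p_words


# Made-up numbers, chosen only to illustrate the arithmetic.
p_spam = 0.4                  # 40% of all mail is spam
p_words_given_spam = 0.05     # the words show up in 5% of spam
p_words_given_ham = 0.001     # ... and in 0.1% of legitimate mail
# Total probability of seeing the words in any message.
p_words = p_words_given_spam * p_spam + p_words_given_ham * (1 - p_spam)

print(round(posterior_spam(p_words_given_spam, p_spam, p_words), 3))
```

<p>With these numbers a message containing the words is spam with a probability of about 0.97, even though only 40% of all mail is spam.</p>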
<p>Wow, it looks quite complicated. One of the best-known papers about this kind of filtering is <a href="http://www.paulgraham.com/spam.html">A Plan for Spam</a>&nbsp;by Paul Graham. We&rsquo;ll see some code from it. </p>
<p><span id="more-48"></span></p>
<p>Well, to calculate this result we first need to break the message into words, which are called <em>tokens</em>, from which the probabilities are taken. How this split is done really matters, as it affects the final result. If we have the sentence <em>It&rsquo;s a shame</em> we could break it into <em>It-s-a-shame</em> or maybe <em>Its-a-shame</em> or even <em>It&rsquo;s-a-shame</em>, and each of them might give different results when used.</p>
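<p>The three ways of splitting that sentence can be reproduced with a few lines of Python. This is only an illustration using regular expressions of my own choosing; real filters use more elaborate tokenizers.</p>

```python
import re

text = "It's a shame"

# 1. Treat the apostrophe as a separator.
split_on_apostrophe = re.findall(r"[A-Za-z]+", text)
# 2. Drop the apostrophe before splitting.
drop_apostrophe = re.findall(r"[A-Za-z]+", text.replace("'", ""))
# 3. Keep the apostrophe as part of the token.
keep_apostrophe = re.findall(r"[A-Za-z']+", text)

print(split_on_apostrophe)
print(drop_apostrophe)
print(keep_apostrophe)
```

<p>The first variant produces a token <em>s</em> that appears in almost every message, while the third keeps the contraction as a single, more specific token.</p>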
<p>Once the message is broken into tokens, we can calculate, for each word, the probability that a message containing it is spam, with the following Lisp code from the paper:</p>
<p><code>(let ((g (* 2 (or (gethash word good) 0)))<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (b (or (gethash word bad) 0)))<br />&nbsp;&nbsp; (unless (&lt; (+ g b) 5)<br />&nbsp;&nbsp;&nbsp;&nbsp; (max .01<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (min .99 (float (/ (min 1 (/ b nbad))<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (+ (min 1 (/ g ngood))<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (min 1 (/ b nbad)))))))))</code> </p>
<p>When we have calculated the probabilities for all the tokens in the message, we pick the most relevant ones (those whose probability is farthest from 0.5, that is, nearest to 0 or 1). Paul decided to use the 15 most relevant, stored in a list called probs, and applies the following formula to it:</p>
<p><code>(let ((prod (apply #'* probs)))<br />&nbsp; (/ prod (+ prod (apply #'* (mapcar #'(lambda (x)<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (- 1 x))<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; probs)))))</code> </p>
<p>If the result is greater than 0.9 we consider the e-mail spam and classify it as such. So, although the theory may look hard, once implemented it is far easier. Maybe the only problem with this code is that it&rsquo;s Lisp, which not so many people know.</p>
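<p>For readers who don&rsquo;t know Lisp, here is my own Python rendering of the two snippets above. The function and parameter names are mine; <code>good</code> and <code>bad</code> are dictionaries of token counts, and <code>ngood</code> and <code>nbad</code> are the numbers of legitimate and spam messages, as in the Lisp code. It assumes both message counts are greater than zero.</p>

```python
import math


def word_probability(word, good, bad, ngood, nbad):
    """Python rendering of the first Lisp snippet.

    Returns None for tokens seen too rarely to be trusted.
    """
    g = 2 * good.get(word, 0)   # counts from legitimate mail are doubled
    b = bad.get(word, 0)
    if g + b >= 5:
        spam_freq = min(1, b / nbad)
        ham_freq = min(1, g / ngood)
        # Clamp the result so no single word is ever fully decisive.
        return max(0.01, min(0.99, spam_freq / (ham_freq + spam_freq)))
    return None


def combined_probability(probs):
    """Python rendering of the second Lisp snippet."""
    prod = math.prod(probs)
    inverse = math.prod(1 - p for p in probs)
    return prod / (prod + inverse)


good = {"hello": 10}
bad = {"hello": 5, "viagra": 20}
print(word_probability("viagra", good, bad, ngood=100, nbad=100))
print(word_probability("hello", good, bad, ngood=100, nbad=100))
print(combined_probability([0.99, 0.99]) > 0.9)
```

<p>Note that <code>math.prod</code> needs Python 3.8 or later.</p>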
<p>Let&rsquo;s make this even easier by looking at the source code of Mozilla Thunderbird, the famous open-source mail reader, which includes a Bayesian module to classify mail. The implementation in Thunderbird is slightly different from the original, but the concept remains the same.</p>
<p>The algorithm is implemented in the file mozilla\mailnews\extensions\bayesian-spam-filter\src\nsBayesianFilter.cpp in the function classifyMessage. It&rsquo;s implemented in C++, but we are seeing it in &ldquo;pseudo-code&rdquo;. It uses some different variables:</p>
<ul>
<li>mGoodCount: number of non-spam messages classified</li>
<li>mBadCount: number of spam messages classified</li>
<li>mGoodTokens: hash table with good tokens and number of times they have appeared</li>
<li>mBadTokens: hash table with spam tokens and number of times they have appeared</li>
</ul>
<p>Take care, as the same token might appear in both hash tables with different counts. For example, the word <em>hello</em> might be equally probable in spam and non-spam messages. When the algorithm is not yet trained, default values are assigned:</p>
<p><code>if&nbsp;(mGoodCount == 0 || mGoodTokens.count() == 0)<br />&nbsp;&nbsp;&nbsp; message is spam<br />if (mBadCount == 0 || mBadTokens.count() == 0)<br />&nbsp;&nbsp;&nbsp; message is not spam</code> </p>
<p>If the algorithm has been trained then it&rsquo;s applied with the following formula (adapted from <a href="https://bugzilla.mozilla.org/attachment.cgi?id=138425&amp;action=view">Bugzilla</a>):</p>
<p><code>for each&nbsp;token {<br />&nbsp;hamcount = number of token appearances in non-spam<br />&nbsp;spamcount = number of token appearances in spam&nbsp;<br />&nbsp;hamratio = hamcount / mGoodCount<br />&nbsp;spamratio = spamcount / mBadCount<br />&nbsp;<br />&nbsp;prob = spamratio / (hamratio + spamratio)<br />&nbsp;<br />&nbsp;n = hamcount +&nbsp; spamcount<br />&nbsp;prob = (0.225 + n * prob) / (.45 + n)<br />&nbsp;<br />&nbsp;distance = abs(prob - 0.5)<br />&nbsp;if (distance &gt;= .1) {<br />&nbsp;&nbsp;token.distance = distance<br />&nbsp;&nbsp;token.prob = prob<br />&nbsp;}<br />}</code> </p>
<p>With this code, we have the probability for each token. The tokens are saved in a list sorted by distance (how far the token&rsquo;s probability lies from 0.5) and the first 150 elements are taken. A chi<sup>2</sup> probability distribution is calculated from them and if the result is greater than or equal to 0.9 the message is classified as spam.</p>
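<p>The per-token step of the pseudo-code can be written as runnable Python. This is my own sketch, not the Thunderbird source: the function names are mine, tokens never seen in training are skipped (an assumption of this sketch), and the final chi<sup>2</sup> combination is left out. It assumes the filter has already been trained, so both message counts are greater than zero.</p>

```python
def token_probability(hamcount, spamcount, good_count, bad_count):
    """Return (prob, distance) for one token, following the pseudo-code."""
    hamratio = hamcount / good_count
    spamratio = spamcount / bad_count
    prob = spamratio / (hamratio + spamratio)
    n = hamcount + spamcount
    # Pull the estimate toward 0.5 when the token has been seen only a few times.
    prob = (0.225 + n * prob) / (0.45 + n)
    return prob, abs(prob - 0.5)


def most_relevant(tokens, good_tokens, bad_tokens, good_count, bad_count, limit=150):
    """Score the tokens and keep the ones farthest from 0.5, up to limit."""
    scored = []
    for token in set(tokens):
        hamcount = good_tokens.get(token, 0)
        spamcount = bad_tokens.get(token, 0)
        if hamcount + spamcount == 0:
            continue  # never seen in training: skipped (assumption of this sketch)
        prob, distance = token_probability(hamcount, spamcount, good_count, bad_count)
        if distance >= 0.1:
            scored.append((distance, token, prob))
    # Largest distance first, so the most decisive tokens come first.
    scored.sort(reverse=True)
    return scored[:limit]


good_tokens = {"hello": 5}
bad_tokens = {"hello": 5, "viagra": 10}
print(most_relevant(["viagra", "hello", "unseen"], good_tokens, bad_tokens, 100, 100))
```

<p>Here <em>hello</em> ends up at a probability of 0.5 and is dropped as uninformative, while <em>viagra</em> is kept with a probability close to 1.</p>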
<p>But we don&rsquo;t need to know all of this unless we want to write a filter ourselves. There are lots of filters already available which work quite well and reach detection rates of around 99%, sometimes even better than a human.</p>
]]></content:encoded>
			<wfw:commentRss>http://becomingparanoid.com/2006/03/30/e-mail-security-detecting-spam-iii/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>E-mail security: detecting spam (II)</title>
		<link>http://becomingparanoid.com/2006/03/29/e-mail-security-detecting-spam-ii/</link>
		<comments>http://becomingparanoid.com/2006/03/29/e-mail-security-detecting-spam-ii/#comments</comments>
		<pubDate>Wed, 29 Mar 2006 09:47:04 +0000</pubDate>
		<dc:creator>madelman</dc:creator>
				<category><![CDATA[Beginner]]></category>
		<category><![CDATA[Email]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Spam]]></category>

		<guid isPermaLink="false">http://becomingparanoid.com/2006/03/29/e-mail-security-detecting-spam-ii/</guid>
		<description><![CDATA[As spam filters get more advanced, less spam reaches users&#8217; inboxes, so the spammers&#8217; business model gets hurt. Instead of accepting that people don&#8217;t really like to receive spam and would prefer less intrusive advertising, they try to work around these filters in, sometimes, really clever ways. [...]]]></description>
			<content:encoded><![CDATA[<p>As spam filters get more advanced, less spam reaches users&rsquo; inboxes, so the spammers&rsquo; business model gets hurt. Instead of accepting that people don&rsquo;t really like to receive spam and would prefer less intrusive advertising, they try to work around these filters in, sometimes, really clever ways. So, spam filters have to be continually modified and adapted so they don&rsquo;t fall for these new tricks.</p>
<p>As Bayesian filtering is the most commonly used technique, it is the one spammers most frequently try to escape. We said that <a href="http://becomingparanoid.com/2006/03/27/e-mail-security-detecting-spam/">Bayesian filtering works</a>&nbsp;by calculating the probability that a word comes from spam or from legitimate mail, so what spammers do is modify their messages so they look more like legitimate mail.</p>
<p>One way to do this is to insert random but common words in the spam, so the <em>spam words</em> contribute less to the score and the message slips past the filter. Here is an example of&nbsp;a real spam message:</p>
<p align="center"><img alt="Spam1" src="http://becomingparanoid.com/images/spam1.png" border="0" /></p>
<p align="left">The real content of the spam is contained at the bottom but at the beginning of the e-mail there are some lines with text which come from the novel <a href="http://en.wikipedia.org/wiki/The_Master_and_Margarita">The Master and Margarita</a>&nbsp;and try to hide the fact that this is spam.</p>
<p><span id="more-47"></span></p>
<p align="left">Another way to try to evade the filters is by sending the content as an image. This technique&nbsp;is also used in the last example we have seen, but it&rsquo;s a really common one, as we can see in this other e-mail:</p>
<p align="center"><img alt="Spam2" src="http://becomingparanoid.com/images/spam2.png" border="0" /></p>
<p align="left">Although this may look like an HTML e-mail, all the content is in fact inside an image, with no text for the filters to analyze, so it is harder to identify the message as spam because there are no words from which to compute the probability. Sometimes this technique works against the spammer: it&rsquo;s quite unusual for a legitimate e-mail to contain only an image with a link, so some more advanced filters may correctly detect the message as spam.</p>
<p align="left">One last technique is the use of unknown or made-up words to confuse the filter. A Bayesian filter works by looking up the probability of words it has already seen and checking whether they are more likely to occur in legitimate mail or in spam. When it finds an unknown word, it can&rsquo;t really know whether the word belongs to spam or not, so it can&rsquo;t classify it correctly and the spam might just evade the filter. Let&rsquo;s see an example:</p>
<p align="center"><img alt="Spam3" src="http://becomingparanoid.com/images/spam3.png" border="0" /></p>
<p align="left">We can see that instead of <em>ordering</em> the message says <em>orderinq</em>, with a Q as the last letter, which looks quite similar to a G. Also, the word Viagra is not written with the letter V at the beginning, but with a backslash and a forward slash, like this: \ /. There are more examples in these two sentences, as almost every word is modified to evade the filters.</p>
<p align="left">Sometimes it gets so difficult for spammers to be sure their junk will reach the recipient that the messages they send make almost no sense, and it is quite hard to know what they are really trying to advertise.</p>
<p align="center"><img alt="Spam4" src="http://becomingparanoid.com/images/spam4.png" border="0" /></p>
<p>Can you guess what they are trying to say?</p>
<p>If a Bayesian filter checks our e-mail, it is a good idea to keep it updated and trained. This is quite easy and shouldn&rsquo;t take much of our time, unless we receive incredible amounts of e-mail. To do this, we should look from time to time in the folder where spam is moved, to see whether there have been any false positives (that is, legitimate messages which have been classified as spam). If there are any, we must tell the filter that the message is not spam so that it adjusts the probabilities of the words in it. Checking this folder from time to time is a good idea anyway, so we don&rsquo;t lose any important e-mail which might have been miscategorised. It&rsquo;s also important, when we receive spam which the filter has missed, not only to delete it but to tell the filter that the message is spam, so the filter stays trained.</p>
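<p>As a rough sketch of what telling the filter amounts to, the Python below keeps a word count per category and moves the words of a misfiled message from the wrong category to the right one. The class and method names are hypothetical, and it assumes the filter had already learned the message under its own wrong verdict. Real mail clients handle this in their own ways.</p>

```python
from collections import Counter


class TrainableCounts:
    """Hypothetical store of how often each word was seen in spam and in ham."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def learn(self, text, label):
        """Record the words of a message under the given label."""
        self.counts[label].update(text.lower().split())

    def relearn(self, text, wrong_label, right_label):
        """Undo a wrong classification and learn the message correctly."""
        words = Counter(text.lower().split())
        # Counter subtraction drops entries that fall to zero or below.
        self.counts[wrong_label] = self.counts[wrong_label] - words
        self.counts[right_label].update(words)

    def spam_ratio(self, word):
        """Share of sightings of `word` that were in spam (0.5 if never seen)."""
        spam = self.counts["spam"][word]
        ham = self.counts["ham"][word]
        total = spam + ham
        return 0.5 if total == 0 else spam / total


store = TrainableCounts()
store.learn("cheap pills offer", "spam")
store.learn("invoice for your order", "spam")  # a false positive, filed as spam
store.relearn("invoice for your order", "spam", "ham")  # the user corrects it
print(store.spam_ratio("invoice"), store.spam_ratio("cheap"))
```

<p>After the correction, the words of the rescued message count towards legitimate mail, while the words of the real spam are unaffected.</p>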
<p>There are other methods of classifying spam which are not based on Bayesian filters, and we will look at them in future posts.</p>
]]></content:encoded>
			<wfw:commentRss>http://becomingparanoid.com/2006/03/29/e-mail-security-detecting-spam-ii/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>E-mail security: detecting spam</title>
		<link>http://becomingparanoid.com/2006/03/27/e-mail-security-detecting-spam/</link>
		<comments>http://becomingparanoid.com/2006/03/27/e-mail-security-detecting-spam/#comments</comments>
		<pubDate>Mon, 27 Mar 2006 18:08:30 +0000</pubDate>
		<dc:creator>madelman</dc:creator>
				<category><![CDATA[Beginner]]></category>
		<category><![CDATA[Email]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Spam]]></category>

		<guid isPermaLink="false">http://becomingparanoid.com/2006/03/27/e-mail-security-detecting-spam/</guid>
		<description><![CDATA[If the volume of spam we receive is overwhelming us and we can&#8217;t keep up with classifying it, we need an automated way to separate spam from legitimate mail. One of the&#160;most famous&#160;methods was proposed by Paul Graham in a paper called A Plan for Spam, where he talked about some algorithms which use [...]]]></description>
			<content:encoded><![CDATA[<p>If the volume of spam we receive is overwhelming us and we can&rsquo;t keep up with classifying it, we need an automated way to separate spam from legitimate mail. One of the&nbsp;most famous&nbsp;methods was proposed by Paul Graham in a paper called <a href="http://www.paulgraham.com/spam.html">A Plan for Spam</a>, where he talked about some algorithms which use probability to classify each&nbsp;message.</p>
<p>The basis for this method is prior training of the algorithm: we must feed it spam messages and legitimate mail, telling it which is which. With this data, the algorithm breaks the messages into words and assigns each word a probability of appearing in a spam message and another of appearing in legitimate mail.</p>
<p>When a new message is received, it&rsquo;s broken into words like the training messages, and the saved probabilities of each word are combined with a formula called <em>Naive Bayes</em>, which returns a final probability that the message is spam.</p>
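<p>The two steps above, training and then classifying, can be sketched in a few lines of Python. This is only an illustration under some assumptions: the class and function names are made up, the tokenizer is very crude, and the add-one smoothing is a common textbook choice rather than the exact method from Graham&rsquo;s paper.</p>

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Break a message into lowercase words."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesFilter:
    """Hypothetical minimal filter; needs at least one message of each kind."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Feed one message, telling the filter which kind it is."""
        self.message_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        """Return the probability that a new message is spam."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            # Add-one smoothing, so unseen words never give probability zero.
            denominator = sum(counts.values()) + len(vocabulary) + 1
            score = math.log(self.message_counts[label] / total_messages)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denominator)
            log_score[label] = score
        # Turn the two log scores back into a single probability.
        diff = log_score["ham"] - log_score["spam"]
        if diff > 700:  # avoid overflow in exp for very ham-like mail
            return 0.0
        return 1 / (1 + math.exp(diff))


mail_filter = NaiveBayesFilter()
mail_filter.train("buy cheap viagra now", "spam")
mail_filter.train("cheap pills buy now", "spam")
mail_filter.train("meeting tomorrow at noon", "ham")
mail_filter.train("lunch with the team tomorrow", "ham")
print(mail_filter.spam_probability("buy cheap pills"))
print(mail_filter.spam_probability("team meeting tomorrow"))
```

<p>Working with logarithms instead of multiplying many small probabilities is just a way to avoid numeric underflow on long messages.</p>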
<p>Most of the known mail classifiers use at least this method, usually combined with others, and it is a really powerful way of classifying mail.</p>
<p>Another approach to classification is the one used by <a href="http://spamassassin.apache.org/">SpamAssassin</a>, which has a series of rules, each of which assigns some points when it matches the mail. The more points a mail accumulates, the more likely it is to be spam, and it is classified as such when its score passes a threshold.</p>
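<p>The idea of rules, points and a threshold can be sketched like this in Python. The rule names, patterns, points and threshold below are all invented for the example; they are not SpamAssassin&rsquo;s real rules or scores.</p>

```python
import re

# Hypothetical rules as (name, pattern, points); not real SpamAssassin rules.
RULES = [
    ("MENTIONS_VIAGRA", re.compile(r"viagra", re.IGNORECASE), 2.5),
    ("SUBJECT_ALL_CAPS", re.compile(r"^Subject: [^a-z]+$", re.MULTILINE), 1.5),
    ("MANY_EXCLAMATIONS", re.compile(r"!{3,}"), 1.0),
    ("CLICK_HERE", re.compile(r"click here", re.IGNORECASE), 1.0),
]
THRESHOLD = 5.0  # assumed value for this sketch


def score_message(raw_message):
    """Return the total points and the names of the rules that matched."""
    hits = [(name, points) for name, pattern, points in RULES
            if pattern.search(raw_message)]
    return sum(points for _, points in hits), [name for name, _ in hits]


def is_spam(raw_message):
    """Classify as spam once the score reaches the threshold."""
    total, _ = score_message(raw_message)
    return total >= THRESHOLD


suspicious = "Subject: BUY NOW!!!\n\nCheap Viagra, click here"
normal = "Subject: Lunch tomorrow\n\nSee you at noon."
print(score_message(suspicious), is_spam(suspicious))
print(score_message(normal), is_spam(normal))
```

<p>Keeping the list of matched rule names alongside the score makes it easier to see why a particular message was flagged.</p>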
<p>SpamAssassin also uses a Bayesian filter, but that is not its only way to check for spam, as spam usually has distinguishing characteristics which may make it different enough from legitimate mail to be easily classifiable.</p>
<p>But spammers are adapting to these measures, modifying the mail they send so that the filters don&rsquo;t detect it as spam, so it&rsquo;s necessary to keep tweaking these filters and finding new ways to send spam to the trash.</p>
]]></content:encoded>
			<wfw:commentRss>http://becomingparanoid.com/2006/03/27/e-mail-security-detecting-spam/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>E-mail security: spam</title>
		<link>http://becomingparanoid.com/2006/03/25/e-mail-security-spam/</link>
		<comments>http://becomingparanoid.com/2006/03/25/e-mail-security-spam/#comments</comments>
		<pubDate>Sat, 25 Mar 2006 19:14:55 +0000</pubDate>
		<dc:creator>madelman</dc:creator>
				<category><![CDATA[Beginner]]></category>
		<category><![CDATA[Email]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Spam]]></category>

		<guid isPermaLink="false">http://becomingparanoid.com/2006/03/25/e-mail-security-spam/</guid>
		<description><![CDATA[Spam is one of the most common types of undesired mail. It is sent in bulk to lots of people to try to sell some product or service. Often these products are not legal at all, such as some drugs, but other times legal services are offered this way.
For an e-mail to be spam it must [...]]]></description>
			<content:encoded><![CDATA[<p>Spam is one of the most common types of undesired mail. It is sent in bulk to lots of people to try to sell some product or service. Often these products are not legal at all, such as some drugs, but other times legal services are offered this way.</p>
<p>For an e-mail to be spam it must be sent without the consent of the recipient, that is, an e-mail with a commercial advertisement is not spam if you have asked for it. The legislation of each country is more specific as to what is spam and what is not.</p>
<p>The products most advertised in spam vary over time, but it is quite common to receive spam about drugs like Viagra or Valium, fake college diplomas, mortgages or illegal software.</p>
<p>The problem of spam is economic. Sending spam is really cheap, so even if only a tiny percentage of the recipients buy the product, it&rsquo;s still profitable. So you must never buy products advertised this way; that way spammers get the message that people don&rsquo;t like to receive this kind of mail and won&rsquo;t buy their products.</p>
<p>In the same way, the most expensive part of spam is not paid for by the spammer. He only has to find somewhere to send the spam from and, once it has been sent, he doesn&rsquo;t have to pay anything more for it. But the message has to travel through other networks, has to be stored somewhere and, finally, has to be read or deleted. This has a cost in network bandwidth, in disk space and, more importantly, in the time the final recipient spends classifying and deleting the e-mail.</p>
<p>For&nbsp;many people, the amount of spam received is greater than the amount of legitimate mail, so they need some way to classify it automatically, as it becomes almost impossible to do it by hand in a short time.</p>
]]></content:encoded>
			<wfw:commentRss>http://becomingparanoid.com/2006/03/25/e-mail-security-spam/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Undesirable mail. What, who, why, how&#8230;</title>
		<link>http://becomingparanoid.com/2006/03/11/undesirable-mail-what-who-why-how/</link>
		<comments>http://becomingparanoid.com/2006/03/11/undesirable-mail-what-who-why-how/#comments</comments>
		<pubDate>Sat, 11 Mar 2006 17:40:25 +0000</pubDate>
		<dc:creator>madelman</dc:creator>
				<category><![CDATA[Beginner]]></category>
		<category><![CDATA[Spam]]></category>

		<guid isPermaLink="false">http://becomingparanoid.com/2006/03/11/undesirable-mail-what-who-why-how/</guid>
		<description><![CDATA[Wow, a really great post from Sergio Hernando&#160;where he talks about all kinds of undesirable mail. His main points are:

Why people send this kind of mail
Different kinds of undesirable mail
Methods for harvesting e-mail addresses
How they send this mail
How to avoid receiving this kind of mail

If you know Spanish I can only recommend going straight to [...]]]></description>
			<content:encoded><![CDATA[<p>Wow, a really great post from <a href="http://www.sahw.com/wp/">Sergio Hernando</a>&nbsp;where he talks about all kinds of undesirable mail. His main points are:</p>
<ul>
<li>Why people send this kind of mail</li>
<li>Different kinds of undesirable mail</li>
<li>Methods for harvesting e-mail addresses</li>
<li>How they send this mail</li>
<li>How to avoid receiving this kind of mail</li>
</ul>
<p>If you know Spanish, I can only recommend going straight to read it. If you don&rsquo;t, I&rsquo;m going to&nbsp;write a series of posts about e-mail with some of the information from that post.</p>
<p>More info&nbsp;| <a href="http://www.sahw.com/wp/archivos/2006/03/10/correo-no-deseado-causas-tipologias-y-medidas-de-prevencion/">Correo no deseado. Causas, tipologías y medidas de prevención.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://becomingparanoid.com/2006/03/11/undesirable-mail-what-who-why-how/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
