I'm Keyvan Nayyeri, a 25-year-old Ph.D. student in the Computer Science department of the University of Texas at San Antonio. I'm also a Software Architect and Developer, and I previously earned a B.Sc. degree in Applied Mathematics. This is my blog, where I publish content about various topics, specifically Programming Languages and Compilers, Software Engineering and Programming.
You know that my brighter side is my background in mathematics, as I’m not good at software development at all! I’ve previously written some posts about applied mathematics and some mathematical tips for software development, but it’s really difficult to keep track of both fields on one blog!
Since I started Waegis as an online spam filter service, clients and friends have been asking me about my methods for catching spam and cleaning up the web. They’re usually surprised that I can automatically identify a piece of content as spam or ham, especially given that the Waegis service receives many different content types and categories, which makes the process hard to understand.
Of course, no one talks about the internal techniques of their service, both for security reasons and because such techniques and formulas are the key to a service’s success. If Waegis has been able to work with an accuracy rate above 99%, it’s mainly because of the amount of time that I spent finding these unique formulas and techniques. After seeing my findings succeed in action, I felt that my forehead had stretched significantly!!
I’m going neither to publish my techniques here nor to demonstrate my implementation of the Bayesian algorithm. Moreover, I’m not going to talk about the main technique that I used on Waegis either, because the Bayesian algorithm can’t be very effective against today’s spammers. But as a matter of fact, the Bayesian algorithm has proven to be an oldie-but-goodie method for filtering much traditional spam, and I’m sure my audience will find it interesting to learn about.
The Bayesian algorithm is a mathematical, statistical or even Artificial Intelligence algorithm that is based on Bayes’ theorem and is one of several algorithms that let you accomplish machine learning. Its fundamental idea is to use the past history of our data to predict the future with some parameters.
First of all, let me talk about Bayes’ theorem, one of the most useful theorems in probability, which is named after Thomas Bayes. This theorem relates the conditional probabilities of two events to their marginal probabilities. Here is the formula of the theorem:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
This is pretty simple and I’m almost sure that you have all seen it before. This simple formula is the foundation of an important part of probability and statistics known as Bayesian probability, which has a variety of applications in real-world situations.
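To see the theorem at work on numbers, here is a minimal Python sketch. The function name `bayes` and all the probabilities are made up for illustration (they are not measurements from any real service): A stands for "the message is spam" and B for "the message contains the word free".

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b


# Illustrative numbers only (assumptions, not real data)
p_free_given_spam = 0.30   # P(B | A)
p_free_given_ham = 0.05    # P(B | not A)
p_spam = 0.40              # P(A)

# P(B) from the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

posterior = bayes(p_free_given_spam, p_spam, p_free)
print(posterior)  # roughly 0.8 with these numbers
```

With these assumed inputs, seeing the word raises the probability of spam from the prior of 0.4 to about 0.8.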
Before continuing my discussion, let me point out the Bayesian analysis of the Monty Hall problem, which has become a favorite topic for .NET community leaders these days after Steve recently introduced the problem and wrote two follow-up posts.
Now let’s return to the main discussion and apply the general method to the case of spam filtering. Bayesian spam filtering applies a derivation of Bayes’ theorem to spam: we use a Naive Bayes classifier and replace its parameters with our desired parameters for spam filtering.
For your information, a Naive Bayes classifier is a derivation of Bayes’ theorem in which we assume that our random events are independent (more precisely, conditionally independent given the class). This independence lets us extract a simpler formula that is much easier to implement in real-world situations.
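This simplified formula for N events $F_1, \dots, F_N$ and a class $C$ is similar to what follows.
$$P(C \mid F_1, \dots, F_N) = \frac{P(C) \prod_{i=1}^{N} P(F_i \mid C)}{P(F_1, \dots, F_N)}$$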
This formula relates the initial conditional probability to a set of constants and probabilities that can be calculated by experiment, and this is the point! We can calculate this conditional probability from the data that we already have, which opens many doors to great applications in real-world problems such as spam filtering.
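For readers who want the intermediate steps, the simplified formula can be reached in three moves. This is the standard textbook derivation, written with $C$ for the class and $F_1, \dots, F_N$ for the events (the symbol names are a notational choice). Start from Bayes’ theorem, expand the joint likelihood with the chain rule, then apply the independence assumption $P(F_i \mid C, F_1, \dots, F_{i-1}) = P(F_i \mid C)$:

$$
\begin{aligned}
P(C \mid F_1, \dots, F_N) &= \frac{P(C)\, P(F_1, \dots, F_N \mid C)}{P(F_1, \dots, F_N)} \\
P(F_1, \dots, F_N \mid C) &= \prod_{i=1}^{N} P(F_i \mid C, F_1, \dots, F_{i-1}) \\
&= \prod_{i=1}^{N} P(F_i \mid C) \quad \text{(independence assumption)} \\
P(C \mid F_1, \dots, F_N) &= \frac{P(C) \prod_{i=1}^{N} P(F_i \mid C)}{P(F_1, \dots, F_N)}
\end{aligned}
$$

The denominator does not depend on $C$, so for a fixed set of events it acts as a constant.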
In the spam filtering field, the above formula can be expressed as follows.
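$$P(\text{spam} \mid w_1, \dots, w_N) = \frac{P(\text{spam}) \prod_{i=1}^{N} P(w_i \mid \text{spam})}{P(w_1, \dots, w_N)}$$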
In other words, the probability that a piece of content is spam, given that it contains a certain set of words, is equal to the product of the conditional probabilities of each word given spam, multiplied by the probability of being spam, and divided by the probability of these words appearing in any content.
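As a rough illustration of this calculation (a generic sketch, not the Waegis implementation), the Python below computes the probability of spam from per-word probabilities. The function name, the word lists and every number are hypothetical. One assumption goes beyond the text: the denominator is expanded with the law of total probability over two classes, spam and ham, using the same independence assumption for ham.

```python
from math import prod


def spam_probability(words, p_word_given_spam, p_word_given_ham, p_spam):
    """Return P(spam | words) under the naive independence assumption."""
    p_ham = 1.0 - p_spam
    # Numerator: P(spam) times the product of P(w_i | spam)
    spam_score = p_spam * prod(p_word_given_spam[w] for w in words)
    # Same quantity for ham, needed only to build the denominator
    ham_score = p_ham * prod(p_word_given_ham[w] for w in words)
    # Denominator P(w_1, ..., w_N) via the law of total probability
    evidence = spam_score + ham_score
    return spam_score / evidence


# Hypothetical, hand-picked probabilities for illustration only
p_word_given_spam = {"free": 0.30, "pills": 0.20, "meeting": 0.01}
p_word_given_ham = {"free": 0.05, "pills": 0.001, "meeting": 0.10}

print(spam_probability(["free", "pills"], p_word_given_spam, p_word_given_ham, 0.4))
print(spam_probability(["meeting"], p_word_given_spam, p_word_given_ham, 0.4))
```

The sketch assumes every word has a known, non-zero probability in both tables. A real filter would also need a policy for unseen words, and would likely sum logarithms instead of multiplying to avoid underflow on long texts.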
Yes, this is a very pure discussion, but as far as I remember it’s one of the simplest definitions that I have ever seen in mathematics and statistics!!
Three factors appear on the right-hand side of this equation, and we need to find their values in order to calculate the probability on the left-hand side. All the values on the right-hand side depend on prior events and on data that we can gather from prior instances, so we can collect a random sample of content and apply some methods to process its words and calculate these probabilities. Here it’s important to know which content is spam and which is not, and this labeling should be done by a human. Later we can process this data and find the values of the factors.
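One possible way to turn hand-labeled content into these factors is sketched below in Python. It is a simplified illustration under several assumptions that are not from this post: words are split on whitespace, each word is counted once per document, and add-one (Laplace) smoothing keeps unseen words from getting a probability of zero. The function name `train` and the tiny corpus are invented for the example.

```python
from collections import Counter


def train(labeled_docs):
    """Estimate P(spam), P(word | spam) and P(word | ham).

    labeled_docs: iterable of (text, is_spam) pairs labeled by a human.
    """
    doc_counts = {True: 0, False: 0}
    word_doc_counts = {True: Counter(), False: Counter()}
    for text, is_spam in labeled_docs:
        doc_counts[is_spam] += 1
        # Count each word once per document (presence, not frequency)
        word_doc_counts[is_spam].update(set(text.split()))

    vocabulary = set(word_doc_counts[True]) | set(word_doc_counts[False])
    p_spam = doc_counts[True] / (doc_counts[True] + doc_counts[False])

    def word_probs(label):
        # Add-one smoothing (an assumption of this sketch)
        return {
            w: (word_doc_counts[label][w] + 1) / (doc_counts[label] + 2)
            for w in vocabulary
        }

    return p_spam, word_probs(True), word_probs(False)


# Invented, hand-labeled sample corpus
corpus = [
    ("buy cheap pills now", True),
    ("cheap pills free shipping", True),
    ("meeting notes for monday", False),
    ("lunch on monday", False),
]
p_spam, p_word_given_spam, p_word_given_ham = train(corpus)
print(p_spam, p_word_given_spam["cheap"], p_word_given_ham["cheap"])
```

With four documents the estimates are of course meaningless; the point is only the shape of the computation.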
Here it’s worthwhile to state that choosing a correct set of initial data is a key factor in the success of your Bayesian filter. You need to choose a certain number of initial items, and they should reflect your real-world model very well, because those probabilities vary from model to model and they can have a huge impact on how your filter works. It’s recommended to spend plenty of time labeling data collected from your own model by hand. Even though this may require much effort, it helps ensure that your probabilities are correct and close to the real values.
In the end, you can apply this formula to find a probability for any given text and choose your own thresholds to filter out content. Again, your thresholds can be set through experiments on your real data.
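The thresholding step could look something like the following Python sketch. The two cut-off values and the three labels are placeholders chosen for illustration, not recommended settings; real values would come from experiments on your own data.

```python
def classify(p_spam, spam_threshold=0.9, ham_threshold=0.1):
    """Map a spam probability to a decision using two hypothetical thresholds."""
    if not 0.0 <= p_spam <= 1.0:
        raise ValueError("p_spam must be a probability between 0 and 1")
    if p_spam >= spam_threshold:
        return "spam"
    if p_spam <= ham_threshold:
        return "ham"
    # Everything in between could be held for human review
    return "unsure"


for p in (0.97, 0.50, 0.03):
    print(p, classify(p))
```

Using two thresholds instead of one leaves a middle band for uncertain content, which is a design choice of this sketch and not something the formula requires.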
I’m leaving a full implementation of this method in a programming language out of this discussion because it’s beyond the scope of this post, but it may be interesting for you to know that with this basic definition you wouldn’t be able to catch more than 10-20% of today’s spam. Modern spammers are aware of these methods and, as I stated before, they evolve quickly over time. Not only do you need to apply many tweaks, techniques and innovations in your implementation of the Bayesian algorithm, but you also need to use other spam filtering methods to block modern spam. To be honest, modern spammers have been able to work around traditional Bayesian filters, and these filters are not able to identify all spam, or even an acceptable share of it.
Paul Graham is one of the best-known names in the refinement of Bayesian spam filters; he has authored a set of articles on methods that can improve Bayesian implementations, such as preserving case sensitivity or omitting common words. Although he has great articles documenting some effective hacks and techniques, they’re not as effective at present as they were at the time of writing.
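To give a feel for the two tweaks named above, keeping case and dropping common words, here is a small Python sketch of a tokenizer. The token pattern and the tiny list of common words are placeholders invented for this example; they are not taken from Graham’s articles.

```python
import re

# Hypothetical, deliberately tiny list of common words to skip
COMMON_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

# Assumed token shape: letters, digits, dollar signs, apostrophes and dashes
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    """Split text into tokens, keeping case and skipping common words."""
    tokens = TOKEN_PATTERN.findall(text)
    # Case is preserved, so "FREE" and "free" stay distinct tokens
    return [t for t in tokens if t.lower() not in COMMON_WORDS]


print(tokenize("FREE pills and a free trial of the $99 plan"))
```

Because case is preserved, a shouted "FREE" and a plain "free" can end up with different spam probabilities.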
Finally, let me summarize this discussion briefly. Bayes’ theorem is a pure foundation of Bayesian probability, which is the source of many real-world applications, and the Naive Bayes classifier is a common special case in this category. Bayesian spam filtering is the application of a Naive Bayes classifier to spam filtering; it relates the probability of being spam to the probabilities of some factors that can be found through experiments. Bayesian spam filtering, in this basic form, isn’t very effective and requires some tweaks and techniques to refine it, but modern spammers are still able to get past Bayesian spam filters, so we need other methods that work in parallel with the Bayesian method.
Anthony Grace
Sep 30, 2008 12:44 PM
Hi Keyvan,
Could you show how you derived the simplified formula for n events?
Love your blog,
Anthony :-)
Keyvan Nayyeri
Sep 30, 2008 1:10 PM
@Anthony
I omitted that part for simplicity as it's available on Wikipedia.
David
Oct 07, 2008 11:00 AM
Bayesian filtering is a very good solution for filtering. I have found it very useful using SpamBully and spam bayes for the last 5 years.