Spam - An Endless Pain

Yesterday was the 15th birthday of spam! Of course, there is no official date for this, but almost everybody agrees on this date because fifteen years ago Joel Furr defined spam and gave it a name. Since then this web phenomenon has evolved, grown up and made us suffer every day!

Millions of people have written about different aspects of spam and spamming before me, and I'm sure I'm not the only one writing about them now!

Two or three days ago, at the military office, I was talking to a friend who is a web user about spamming, and we had a long conversation about different aspects of it. By the way, I haven't been an ordinary bystander in the fight against spam and spamming; I have done some work and studies on it.

Both spamming and spam filtering have become sizable industries and businesses around software and the web over the years. As you saw, this industry was big enough to attract someone like Robert Soloway (the "Spam King").

Like any field of knowledge, like technology and like the web itself, spam has evolved quickly, and new types of spam keep arriving on the web and in communication systems. At the beginning it was just a kind of unsolicited email, but nowadays it has grown to include comment spam, trackback spam, short message spam, chat room spam and some other types!

There are lots of negative points about spam that force users, webmasters and site owners to find ways to fight it! Here are some reasons that would be obvious to everyone:

To defeat spammers, you need to learn how they work and, more importantly, you need a solid definition of spam! You may have a traditional definition of spam in mind, but a simple question can make it ambiguous: how do you classify an email from a site that writes to you just because you have registered there? On top of that, modern types of spam break many of the older definitions that were written for email spam!

In my opinion the worst point about spamming is how cheap it is for spammers compared with what it costs spam filters. A spammer can send millions of spam messages automatically in a short time, but a spam filter has to do much more work and spend more resources to filter them! The low effort needed to send spam is the main reason spammers have survived. They just need to think about their methods and patterns, and there is hardly any other cost for them! That is also why they haven't been very active in video spamming: it takes effort and money, and they usually look for cheap and easy ways.

All the negative points of spam have led to many reactions from web experts over the years. The war between spam filtering methods and services on one side and spammers on the other has been an interesting one, and it doesn't look like it will end any day soon! Both sides evolve to beat the other, and neither can!

In my opinion this is one of the most interesting parts of the software world for someone who loves to wrestle with complex and uncommon problems. Fighting spam manually is impossible, so we need to find automated ways to fight it. Four categories of knowledge come into play here:

You may think that there should be some common methods in these fields to filter spam that everybody can use, but here is the interesting part of the process: you have some tools that have proven effective, but they can't succeed unless you implement them in the best form.

Photo taken from http://bonq.org/blog/wp-content/uploads/2007/05/spam.jpg

Before continuing this discussion, wait a moment and let me give some background on the evolution of spamming and spam types. The first type of spam, email spam, was (and is) the most common and well-known, but other types have since become common among spammers: comment and trackback spam, forum spam, chat room spam, short message spam (for mobile devices), newsletter spam, wiki spam and others, including newer types like Twitter spam!

Professional spam filtering has become a field of its own, and it requires very strong knowledge to be able to evolve and improve your methods along with the spammers!

Generally there are two groups of methods to filter spam: feature-based methods and algorithmic methods.

Feature-based methods are those that rely heavily on the features, properties and characteristics of one spam type and are applicable to that type only. For example, email headers are a good feature to check because spammers often show their hand there! Another instance is the IP address of a comment spammer, which may help you in some cases (if your spammer isn't a professional!). There is a lot of room to work here. For example, one common solution has been using different types of CAPTCHA controls, but this method has failed! Believe it or not, I have never used regular CAPTCHA controls on my sites or on any site that I designed, for two simple reasons:

  1. It's not user-friendly!
  2. It can break!

In less than four months at the end of 2007 and the beginning of 2008, the CAPTCHA controls of Google, Yahoo and Microsoft were broken, and this had to be considered the end of CAPTCHA. Personally, I have never liked regular image and sound CAPTCHA controls at all!

Such methods are specific to the spam type and, of course, aren't very trustworthy. Usually they are used to reduce the error rate of spam filters. Here we rely on the norms of the content and the common characteristics of spammers in a field, but you know that this may go wrong!

Algorithmic methods are those that use statistics, artificial intelligence and mathematics to teach the machine how to tell spam content from legitimate content! These are more complicated and require real knowledge. Such methods have proven their success over the years, and the most common one is Bayesian filtering, which builds on a common concept in statistics and machine learning (artificial intelligence). In a very simple form, for a single word this algorithm calculates the probability that the content is spam based on the past history of occurrences of this word in spam and legitimate content! Of course, this is a very simple definition and there are various implementations of it. Some of these implementations have succeeded while others have failed. Obviously, an initial database of probabilities is mandatory, and a better database may lead to better filtering! But the key point is the amount of work that you do to improve this basic definition. There have been various techniques to improve this algorithm because the basic form isn't very effective!
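To make the single-word idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration of the basic form described above, not the implementation of any particular filter. The function names, the tiny training set, the equal prior for spam and legitimate content, and the add-one smoothing are all my assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its distinct whitespace-separated words."""
    return set(text.lower().split())


def train(spam_docs, ham_docs):
    """Count in how many spam and legitimate documents each word occurs."""
    spam_counts = Counter(w for doc in spam_docs for w in tokenize(doc))
    ham_counts = Counter(w for doc in ham_docs for w in tokenize(doc))
    return spam_counts, ham_counts, len(spam_docs), len(ham_docs)


def word_spam_probability(word, model):
    """P(spam | word) by Bayes' rule.

    Assumptions (not from the article): spam and legitimate content are
    equally likely a priori, and add-one smoothing keeps the result
    strictly between 0 and 1 for words never seen before.
    """
    spam_counts, ham_counts, n_spam, n_ham = model
    p_word_given_spam = (spam_counts[word] + 1) / (n_spam + 2)
    p_word_given_ham = (ham_counts[word] + 1) / (n_ham + 2)
    return p_word_given_spam / (p_word_given_spam + p_word_given_ham)


def message_spam_probability(text, model):
    """Combine the per-word probabilities, naively treating words as independent."""
    log_odds = 0.0
    for word in tokenize(text):
        p = word_spam_probability(word, model)
        log_odds += math.log(p) - math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical training data, only for illustration.
model = train(
    spam_docs=["cheap pills online", "cheap watches buy now"],
    ham_docs=["meeting notes for the .net project", "software development schedule"],
)
print(word_spam_probability("cheap", model))
print(message_spam_probability("software development", model))
```

The "initial database with probabilities" is the `model` tuple here. A real filter would need far more training data and better tokenization, which is exactly where the extra work mentioned above goes.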

Nevertheless, algorithmic methods are commonly used in spam filters, and combining them with feature-based methods can give you a powerful filter.

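But the bad point is the error ratio! All these methods have an error ratio, and the overall system has its own. In the spam filtering world there are two measures that are usually calculated as percentages: false-positives (legitimate content wrongly marked as spam) and false-negatives (spam wrongly let through as legitimate content).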

We always try to reduce both ratios, but it's obvious that false-positives are more critical than false-negatives because no one wants to miss legitimate content.
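As a small illustration of how the two ratios can be computed, here is a sketch in Python. The function name and the sample data are hypothetical, and the choice of denominators (false-positives as a share of all legitimate items, false-negatives as a share of all spam items) is my assumption; some tools report them against the total number of items instead.

```python
def error_rates(predicted_spam, actually_spam):
    """Return (false_positive_percent, false_negative_percent).

    predicted_spam: booleans, True where the filter marked an item as spam.
    actually_spam:  booleans, True where the item really was spam.
    """
    pairs = list(zip(predicted_spam, actually_spam))
    false_positives = sum(1 for p, a in pairs if p and not a)
    false_negatives = sum(1 for p, a in pairs if a and not p)
    legitimate_total = sum(1 for _, a in pairs if not a)
    spam_total = sum(1 for _, a in pairs if a)

    # Guard against empty classes to avoid dividing by zero.
    fp_percent = 100.0 * false_positives / legitimate_total if legitimate_total else 0.0
    fn_percent = 100.0 * false_negatives / spam_total if spam_total else 0.0
    return fp_percent, fn_percent


# Hypothetical results for five messages.
predicted = [True, False, True, False, False]
actual = [True, True, False, False, False]
fp, fn = error_rates(predicted, actual)
print(f"false-positives: {fp:.1f}%  false-negatives: {fn:.1f}%")
```

Because false-positives are the more critical of the two, a filter is often tuned to keep the first number very low even if the second one rises a little.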

On the other hand, a fair number of false-positives and false-negatives are ambiguous cases that depend on our definition of spam. For instance, many spam filters block emails from sites that write to you because you have registered with them or ordered something. You may consider them spam while others may have a different opinion.

Another comparison is between methods and tools that work per client and those that work on a centralized server in a network. Each of these approaches has its pros and cons. Bayesian filtering is sensitive to probabilities, and such probabilities may vary from client to client. For instance, I'm a developer, and words like ".NET", "Software" and "Development" may occur more frequently in my legitimate emails than in those of an average user! So Bayesian filtering can work better on my client. Moreover, passing the content to a centralized server has its own cost and performance effect. On the other hand, a centralized server is able to collect more features, characteristics and statistics about spammers because it sees much more of their spam, and service-oriented solutions usually adapt to spammers' methods faster than client-side solutions. Generally, service-oriented methods have been better, while a combination of both can be the best!

All in all, spam filtering methods are evolving and block a lot of spam, but reaching an acceptable quality costs spam filters much more than it costs a spammer to send millions of spam items!

On the other hand, while spam filters learn how to filter spam, spammers also learn how to defeat these filters and change their spamming patterns. For instance, I can remember the days when I was blogging on a Persian .NET community powered by .Text and got piles of spam every day, but all of it had more than 15-20 links inside. Some simple methods could filter it based on this factor (feature). A short while after that the spammers reduced the number of links, but their Achilles heel was their content, so a simple HttpModule like ReverseDOS could filter them based on some sexual content and other common words that they were publishing. Nowadays they're getting clever and use ordinary words with only a few links, but spam filters are also evolving to block them!
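The two feature-based checks from this story can be sketched in a few lines of Python. This is only an illustration of the idea and not how ReverseDOS or .Text actually worked; the function name, the link pattern, the default threshold and the blocked words are hypothetical placeholders.

```python
import re

# Assumption: a link is anything that starts with http:// or https://.
LINK_PATTERN = re.compile(r"https?://", re.IGNORECASE)
WORD_PATTERN = re.compile(r"[a-z0-9']+")


def looks_like_spam(comment, max_links=15, blocked_words=("casino", "viagra")):
    """Flag a comment that has too many links or contains a blocked word."""
    # Check 1: the number of links, the feature that caught the early spam.
    if len(LINK_PATTERN.findall(comment)) > max_links:
        return True

    # Check 2: known spam words, the feature that still worked
    # after the spammers reduced their link counts.
    words = set(WORD_PATTERN.findall(comment.lower()))
    return any(word in words for word in blocked_words)


print(looks_like_spam("Nice post! See http://example.com for more."))
print(looks_like_spam("http://example.com " * 20))
```

Both checks are easy to evade, which matches the story above: once spammers use ordinary words and only a few links, neither feature fires and something smarter is needed.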

I'm very interested in this field because none of the patterns and methods work without your own thinking and effort! Almost all patterns and methods expire after a short while, and you need to invent something new! There are many common data-in/data-out systems at different scales that may need much effort, but believe me, writing a spam filter is something different! Many attempts have failed just because the programmers couldn't reach an acceptable percentage of false-positives and false-negatives!

As a last point, I think that better laws could help us enormously. Even in the countries most advanced in IT, there aren't very good laws to ban spamming! Many governments are spammers themselves. For instance, here we get unsolicited spam as short messages from the governmental carriers!


2 Comments : 04.03.08

Feedbacks

#1
twitter spam
04.03.2008 @ 10:16 PM

...it's not spam, it's twitter! ;-)

#2
Sarah Testa
05.15.2008 @ 9:54 AM

It is quite amazing how spam has evolved over the years, and as spam filters get tougher on spam, spammers seem to have also gotten smarter by discovering ways to bypass these filters. So I see no end to the spam and spam filtering saga... it will be an ongoing war. Having said that, your last comment re having laws in countries regulating spam, real laws that make a difference, is something which is definitely needed and which many countries should be discussing and implementing.
