Introduction

Email, Spam, and Anti-Spam Method using Bayesian Filter

By Yen-I Chiang

05/06/04

Introduction

As we rely on the Internet more and more to speed up our access to information, email has become one of our most convenient tools of communication. We can send a message to a friend almost instantly, anywhere on earth. However, this convenience brings its own inconvenience. Spam, the unsolicited commercial email, arrives along with ham, the legitimate email. [1] In recent years, spam has taken up a growing share of email traffic; by May 2004 it accounted for 60% of it. [2] In my personal experience, even after Yahoo's official filter, seven of every ten incoming emails are still spam. One of the ten is commercial email from Borders, Target, etc. (Although I sometimes do not like these commercial emails, I did not think about the consequences when I gave out my email address.) Only two of the ten are legitimate emails. Suffering from spam, many people, like me, turn to anti-spam software for help. Unlike others, however, I am more interested in knowing where spam comes from, how anti-spam software works, and what other ways we can use to stop spam.

Where does spam come from?

(1)The basic parts of an email message

Before we can know where spam comes from, we need to know how spammers make it. And of course, before we know how spam is made, we need to know what an email is made of. An email itself is a sequence of bytes, not a file. It is composed of three parts: the header, the body, and the envelope. The header records who sent the email, who it is sent to, time stamps, the sender's IP, and information about the relays that transferred it. The body is the content of the email. The envelope is the addressing information the mail servers use for delivery, such as the sender's username at the sending server.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1| From you@Here.us.edu Fri Dec 13 08:11:44 2002

2| Received: (from you@localhost)

3| by Here.us.edu (8.12.7/8.12.7)

4| id d8BILu12835 for you; Fri, 13 Dec 2002 08:11:44 -0600 (MDT)

5| Date: Fri, 13 Dec 2002 08:11:43

6| From: you@Here.us.edu (your full name)

7| Message-Id: <200201011511.d9BMTuX29709@Here.us.edu>

8| Subject: a test

9| To: you

10| (blank line)

11| This is a one-line message.

Fig 1[3]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

(See Fig 1.) In the header, the line "From you@Here.us.edu Fri Dec 13 08:11:44 2002" is the element that must exist. Other lines, such as "To: you", "From: you@Here.us.edu (your full name)", and "Date: Fri, 13 Dec 2002 08:11:43", are not strictly required for an email. However, recent versions of MUAs (mail user agents) and MTAs (mail transfer agents) do require a "To: you" line, a "Subject: a test" line, and a "Cc: another person" line. The requirements vary from agent to agent.
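The split between header and body can be seen by parsing the message of Fig 1. The following is a minimal sketch, assuming Python and its standard library email module; the variable names are only illustrative, and the figure's line numbers are left out.

```python
from email import message_from_string

# The message of Fig 1, without the figure's line numbers.
RAW = (
    "From you@Here.us.edu Fri Dec 13 08:11:44 2002\n"
    "Received: (from you@localhost)\n"
    "\tby Here.us.edu (8.12.7/8.12.7)\n"
    "\tid d8BILu12835 for you; Fri, 13 Dec 2002 08:11:44 -0600 (MDT)\n"
    "Date: Fri, 13 Dec 2002 08:11:43\n"
    "From: you@Here.us.edu (your full name)\n"
    "Message-Id: <200201011511.d9BMTuX29709@Here.us.edu>\n"
    "Subject: a test\n"
    "To: you\n"
    "\n"
    "This is a one-line message.\n"
)

msg = message_from_string(RAW)

# The leading "From " line is kept apart from the named header fields.
first_line = msg.get_unixfrom()
# Everything else before the blank line is a named header field.
header_names = msg.keys()
# Everything after the blank line is the body.
body = msg.get_payload()

print(first_line)
print(header_names)
print(repr(body))
```

The parser stops reading header fields at the first blank line, which is why line 10 of Fig 1 matters even though it carries no text.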

Most of the time, we send our email through an MUA. Many of us do not know that email can be sent by hand, and we rarely do so. But it is possible to send email by hand using telnet.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1 [hb@redhat6~]$ telnet slack 25

2 Trying 192.168.0.111...

3 Connected to slackware.homeworx.org.

4 Escape character is '^]'.

5 220-slack.homeworx.org Sendmail 8.6.12/8.6.9

6 ready at Mon, 13, Mar 1980 14:01:06

7 GMT

8 220 ESMTP spoken here

9 HELO

10 250 slack.homeworx.org Hello hb@redhat6

11 [192.168.0.166], pleased to meet you

12 MAIL FROM: bigbrother@ms.1984.org

13 250 bigbrother@ms.1984.org... Sender ok

14 RCPT TO: fred@slack

15 250 fred@slack... Recipient ok

16 DATA

17 354 Enter mail, end with "." on a line by itself

18 Hello,

19 this is a message from Big Brother.

20 I am watching you so behave yourself.

21 Bye for now!

22 Big Brother

23 .

24 250 0AA00253 Message accepted for delivery

25 quit

26 221 slack.homeworx.org closing connection

27 Connection closed by foreign host.

Fig 2 [4]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1 From bigbrother@ms.1984.org Mon Mar 13 12:01:53 1980

2 Date: Mon, 13 Mar 1980 12:01:10 GMT

3 From: bigbrother@ms.1984.org

4 Apparently-To: fred@slack.homeworx.org

6 Hello,

7 this is a message from Big Brother.

8 I am watching you so behave yourself.

9 Bye for now!

10 Big Brother

Fig 3[5]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1 From bigbrother@ms.1984.org Mon Mar 13 12:01:53 1980

2 Return-Path: bigbrother@ms.1984.org

3 Received: from redhat6 (hb@redhat6 [192.168.0.166])

4 by slack.homeworx.org (8.6.12/8.6.9) with SMTP id MAA00176 for

5 fred@slack; Mon, 13 Mar 1980 12:01:10 GMT

6 Date: Mon, 13 Mar 1980 12:01:10 GMT

7 From: bigbrother@ms.1984.org

8 Message-ID: <198010131201.MAA00176@slack.homeworx.org>

9 Apparently-To: fred@slack.homeworx.org

10 Status: 0

Fig 4[6]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By following the steps in Fig 2, we can send an email by hand. The email received will look like Fig 3. However, sending an email by hand input also gives a spammer or cracker the chance to forge a fraudulent email.

(2)The weakness of email

Fig 3 is the format of email that we see all the time. There is a "From " line in line 1, a recipient line ("Apparently-To: ") in line 4, and the body in lines 6 to 10. However, more information is hidden in the email's header. So let's take a look at the full, unhidden version of the header, Fig 4, where more information shows up. Remember that this email was sent by hand in Fig 2. Line 1 of Fig 2 indicates who really sent this email by telnet. But line 1 of Fig 3 says the email was sent by bigbrother@ms.1984.org, not by the real person, hb@redhat6. How could this happen? When we send an email by hand, the MTA (Sendmail, speaking SMTP) does not use the user name or IP of the current sender (Fig 2, line 1) as the sender in the "From " line of the email. It uses the name the user typed in, "bigbrother@ms.1984.org" in line 12 of Fig 2. At the same time, this typed-in "bigbrother@ms.1984.org" is used as the envelope sender of the email. However, the sender, hb@redhat6, cannot fake an email completely. There are still clues we can see when we reveal the full header: the real host and IP of the sender appear in line 3 of Fig 4.
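The comparison between the claimed sender and the clue left in the "Received:" line can be sketched in code. This is only an illustration, assuming Python's standard email and re modules; the function name and the regular expression are my own assumptions and fit the header layout of Fig 4, not every mail server's format.

```python
import re
from email import message_from_string

# Header of Fig 4 without the figure's line numbers (Status line omitted).
RAW_HEADER = (
    "From bigbrother@ms.1984.org Mon Mar 13 12:01:53 1980\n"
    "Return-Path: bigbrother@ms.1984.org\n"
    "Received: from redhat6 (hb@redhat6 [192.168.0.166])\n"
    "\tby slack.homeworx.org (8.6.12/8.6.9) with SMTP id MAA00176 for\n"
    "\tfred@slack; Mon, 13 Mar 1980 12:01:10 GMT\n"
    "Date: Mon, 13 Mar 1980 12:01:10 GMT\n"
    "From: bigbrother@ms.1984.org\n"
    "Message-ID: <198010131201.MAA00176@slack.homeworx.org>\n"
    "Apparently-To: fred@slack.homeworx.org\n"
    "\n"
)

# Matches "from HOST (USER [IP])" as it appears in line 3 of Fig 4.
RECEIVED_PATTERN = re.compile(r"from\s+(\S+)\s+\((?:(\S+)\s+)?\[([0-9.]+)\]\)")


def claimed_and_connecting(raw_header):
    """Return the claimed sender next to the host, user and IP that connected."""
    msg = message_from_string(raw_header)
    match = RECEIVED_PATTERN.search(msg.get("Received", ""))
    host, user, ip = match.groups() if match else (None, None, None)
    return {
        "claimed_sender": msg["From"],
        "connecting_host": host,
        "connecting_user": user,
        "connecting_ip": ip,
    }


info = claimed_and_connecting(RAW_HEADER)
for key, value in info.items():
    print(key, "=", value)
```

The claimed sender comes from text the sender typed, while the connecting host and IP were written by the receiving server, which is why the two can disagree.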

This email-faking trick is used by spammers. Why do they need to fake their emails? Because there are thousands of angry spam receivers like us trying to send the spam back. Now that we know where to find the real sender of our spam, can we take action to block their email or hunt them down? Sorry, it is not that easy. Most spammers are crackers. This means that the sender named hb@redhat6 is not the real identity of the spammer. He may have cracked into the server, redhat6, and created the user name hb for the purpose of sending spam. There are many ways to crack into a server, but how to do it is not the point of this paper. Spammers may use more tricks to fake an email or even to cover their real IP. The reason for discussing faked email is that most spam is fraudulent, and some anti-spam methods, such as SPF (Sender Policy Framework, or Sender Permitted From), use this characteristic to stop spam. [7] SPF is discussed in a later section.

Hunting down a fake identity or blocking spam coming from a cracked server's IP is not the way to stop spam. Spammers only need to crack more servers to replace the exposed one. We need something more efficient to stop spam.

How does anti-spam software work?

(1)Former methods

There were a few early types of anti-spam software. One kind remembers the names or email addresses of spam senders and builds a blacklist of them. But because spammers keep changing their sending address or IP, this method fails to work well: the filter cannot know the new name or address of the spam, and the spammers never stop changing it.

Later came methods that try to recognize the style of a spam's subject line. They recognize certain words, like loan, free, sex, etc., and make a blacklist of these sensitive words. Once the subject line of an incoming email includes one of these words, the email is sorted as spam; otherwise it is ham. This is where the problems begin. To a human, LOAN and LO@N are the same word. If the blacklist does not contain LO@N, the spam goes through. The developer of the anti-spam software then upgrades it so that it can recognize LO*N or L**N (where * means any numeric or alphabetic sign, in any language, that looks like "O" or "A"). However, it still cannot recognize L*O*A*N, or anything else that looks like LOAN to a human but not to software. There are endless variations of a single word for the software to recognize, and it must also take care of the other sensitive words, not just one. The efficiency of this method is low, and as the blacklist grows, it will become even lower.
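The weakness of the word blacklist can be shown with a few lines of code. This is a minimal sketch in Python, not the code of any real product; the blacklist, the function name, and the sample subject lines are made up for illustration.

```python
# Hypothetical blacklist of sensitive words, as described in the text.
BLACKLIST = {"loan", "free", "sex"}


def is_spam_by_blacklist(subject, blacklist=BLACKLIST):
    """Flag a subject line if any of its words is exactly on the blacklist."""
    words = subject.lower().split()
    return any(word.strip(".,!?:;") in blacklist for word in words)


subjects = [
    "Cheap LOAN today",      # plain spelling: caught
    "Cheap LO@N today",      # one letter replaced: missed
    "Cheap L*O*A*N today",   # letters separated: missed
    "Lunch on Friday?",      # ordinary mail: passed
]

results = [is_spam_by_blacklist(s) for s in subjects]
for subject, flagged in zip(subjects, results):
    print(subject, "->", "spam" if flagged else "ham")
```

Every new spelling needs a new blacklist entry or a new rule, which is the endless upgrade cycle described above.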

(2)Bayesian filter

In August 2002, Paul Graham proposed a plan for an anti-spam method in his article "A Plan for Spam." [8] He proposed that a Bayesian filter would be a good way to filter spam, supported by astonishingly good experimental results (0.99, with no false positives). Following the influence of his article, many Bayesian filters were developed as freeware, and one project, the SpamBayes project [9], is trying to apply the idea of the Bayesian filter in other places, such as an Outlook plugin, a POP3 proxy, POP3 with IMAP, a procmail environment, etc.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

madam 0.99

promotion 0.99

republic 0.99

shortest 0.047225013

mandatory 0.047225013

standardization 0.07347802

sorry 0.08221981

supported 0.09019077

people's 0.09019077

enter 0.9075001

quality 0.8921298

organization 0.12454646

investment 0.8568143

very 0.14758544

valuable 0.82347786

Fig 5 A spam that does not get through [10]~~~~~~~~~~~~~~~~~~~

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

perl 0.01

python 0.01

tcl 0.01

scripting 0.01

morris 0.01

graham 0.01491078

guarantee 0.9762507

cgi 0.9734398

paul 0.027040077

quite 0.030676773

pop3 0.042199217

various 0.06080265

prices 0.9359873

managed 0.06451222

difficult 0.071706355

Fig 6 A Spam get through [11]~~~~~~~~~~~~~~~~~~~~~~~~

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

P(Spam | all interesting words) =

P(1st word)P(2nd word)P(3rd word)

--------------------------------------------------------------------------------------------------

P(1st word)P(2nd word)P(3rd word) + (1-P(1st word))(1-P(2nd word))(1-P(3rd word))

A Bayesian filter of this kind treats the probabilities of the interesting words (tokens) as independent of each other. Here P(nth word) is the probability that an email containing that word is spam.

Fig 7 An example of Bayes' rule for three components [12]~~~~~~~~~~~~~~~~

I am going to use the examples from "A Plan for Spam" to explain how a Bayesian filter works. Fig 5 is a spam that cannot get through the filter (the content of the email is in [10]), and Fig 6 is a spam that gets through the filter (the content of the spam is in [11]). When the Bayesian filter processes an email, it takes the header and the body and scans them against its list of tokens, which are interesting words, not necessarily sensitive words. It then produces a table of the 15 most interesting words, each with its probability from the filter's list. Finally, it puts these 15 probabilities into Bayes' rule (Fig 7, extended from three words to fifteen) to produce a final probability that tells us whether the email is spam. When the final probability is over 0.9, the email is classified as spam and filtered out. Otherwise, the email is treated as ham and passed.
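The calculation can be tried on the numbers of Fig 5 and Fig 6. The following is a sketch in Python under stated assumptions: the function names are my own, and "most interesting" is taken to mean farthest from the neutral value 0.5, which is how I read "A Plan for Spam". The combining step is the rule of Fig 7 extended to any number of words.

```python
from math import prod

# Token probabilities copied from Fig 5 and Fig 6.
FIG5 = {
    "madam": 0.99, "promotion": 0.99, "republic": 0.99,
    "shortest": 0.047225013, "mandatory": 0.047225013,
    "standardization": 0.07347802, "sorry": 0.08221981,
    "supported": 0.09019077, "people's": 0.09019077,
    "enter": 0.9075001, "quality": 0.8921298,
    "organization": 0.12454646, "investment": 0.8568143,
    "very": 0.14758544, "valuable": 0.82347786,
}
FIG6 = {
    "perl": 0.01, "python": 0.01, "tcl": 0.01, "scripting": 0.01,
    "morris": 0.01, "graham": 0.01491078, "guarantee": 0.9762507,
    "cgi": 0.9734398, "paul": 0.027040077, "quite": 0.030676773,
    "pop3": 0.042199217, "various": 0.06080265, "prices": 0.9359873,
    "managed": 0.06451222, "difficult": 0.071706355,
}


def most_interesting(token_probs, n=15):
    """Keep the n tokens whose probability lies farthest from 0.5."""
    ranked = sorted(token_probs.items(),
                    key=lambda item: abs(item[1] - 0.5), reverse=True)
    return dict(ranked[:n])


def combine(probs):
    """Bayes' rule of Fig 7 for any number of independent tokens."""
    probs = list(probs)
    spam_side = prod(probs)
    ham_side = prod(1.0 - p for p in probs)
    return spam_side / (spam_side + ham_side)


def is_spam(token_probs, threshold=0.9):
    """Classify as spam when the combined probability exceeds the threshold."""
    return combine(most_interesting(token_probs).values()) > threshold


for name, table in (("Fig 5", FIG5), ("Fig 6", FIG6)):
    print(name, combine(table.values()), is_spam(table))
```

With these inputs the email of Fig 5 lands just above the 0.9 threshold, while the email of Fig 6 lands very close to zero, so the first is filtered and the second passes.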

The Bayesian filter needs to be trained to build up its list of interesting words (tokens) and their probabilities. Because people differ in personal life, education, community, correspondents, career, etc., the list of interesting words may be very different from person to person. Because of Paul Graham's career, for him the word "perl" has a probability of 0.01 (lower means a ham token, and higher means a spam token). But for another person, the word "perl" may never appear in the list or be taken as a token. For the same reason, if a spam contains many words that you use frequently in ham, its final probability may drop and it may pass the filter. For example, the spam in Fig 6 has a final probability of roughly 0.01 or lower, far from the borderline of 0.90.

The Bayesian filter also needs frequent training and updating with new tokens. It is very hard to say that, after today's input of 2000 emails, the filter fully fits the pattern of the emails I receive. It is also hard to say that the filter knows the pattern of all spam, since spammers keep making their spam look more like normal email. There is one difference between the Bayesian filter and the former filters: it is useless to try to fool a Bayesian filter with words like LO@N, the weakness of the former filters. After a little training, the filter learns this kind of trick and filters it out, without needing any upgrade from the product's company, as the former filters did.
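How training turns a trick like LO@N into a spam token can be sketched as follows. This is a simplified illustration in Python and not the exact estimate used in "A Plan for Spam": it assumes each token's probability is its share of the per-mail frequency in spam versus ham, clamped to the range 0.01 to 0.99 seen in Fig 5 and Fig 6. The function names and the tiny training sets are made up.

```python
from collections import Counter


def tokenize(text):
    """Split a mail into lowercase tokens, counting each token once per mail."""
    return set(text.lower().split())


def train(spam_mails, ham_mails):
    """Estimate a spam probability for every token (both lists must be non-empty)."""
    spam_counts = Counter(tok for mail in spam_mails for tok in tokenize(mail))
    ham_counts = Counter(tok for mail in ham_mails for tok in tokenize(mail))
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        spam_rate = spam_counts[tok] / len(spam_mails)
        ham_rate = ham_counts[tok] / len(ham_mails)
        estimate = spam_rate / (spam_rate + ham_rate)
        # Clamp so that no single token is ever treated as absolute proof.
        probs[tok] = min(0.99, max(0.01, estimate))
    return probs


# Made-up training mail, sorted by the user into spam and ham.
spam = ["cheap lo@n offer", "lo@n approved today", "free lo@n now"]
ham = ["lunch today with the perl group", "perl script attached",
       "meeting notes for today"]

probs = train(spam, ham)
print("lo@n ", probs["lo@n"])
print("perl ", probs["perl"])
print("today", probs["today"])
```

Because "lo@n" appears only in the mail the user marked as spam, it becomes a strong spam token without anyone adding it to a list by hand.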

One weakness of the Bayesian filter is a spam that resembles ham, the only difference being a URL, its main purpose, placed inside. This kind of spam looks like the normal content of an email, such as a story about someone's day or about research. But at the end, it uses a few words to lead you to the URL. (It must use only a few words; too many words would raise its probability.) It seems very hard for the Bayesian filter to catch this kind of spam. However, when the spammer includes the URL purely for "spamming" purposes, most of the time the URL stays the same, so it can be recognized and added to the token list.

What are other ways we can use to stop spam?

There are other methods used to stop spam. At the legal level, there is the CAN-SPAM Act, passed on November 25, 2003, and ISPs have started to sue individuals and organizations that send spam. ISPs have also changed how people apply for free email accounts. Previously, to apply for a free email account from an ISP, all we had to do was fill in some data. Since these steps can be carried out by a computer program, and spammers need many new accounts for sending spam, spammers started using programs to apply for new accounts. For this reason, ISPs began to use a trick that a machine cannot perform in place of a human. ISPs display a computer-generated picture of a word and ask the applicant to type in that word, in order to tell a human from a program, since there is not yet a way for a program to read a picture of twisted words.

Stopping spammers from applying for new accounts is not really enough for ISPs. ISPs have come up with more plans for fighting back. [14]

AOL is testing a new anti-spam protocol, SPF. Yahoo is developing its DomainKeys plan. MSN is developing a system like caller ID. And some ISPs, including MSN, plan to ask spammers to pay to send mail. Many methods are in development, and many of them need the help and cooperation of ISPs all over the world, which is going to be a huge task.

One of the methods that caught my attention is SPF. SPF is not a program or piece of software that filters spam. It is a protocol. Under this protocol, when an email is handed over, the receiving MTA asks the domain of the claimed sender, through a record that domain publishes in DNS, whether the sending server is allowed to send mail for it. For spam, most of the time, the real sender and the sender in the "From " line are different. So under the SPF protocol, a fraudulent email is rejected before it is delivered. The SPF protocol can block not only spam but also worms and some viruses, which travel in fraudulent email. However, SPF also brings some problems. Under this protocol, users of free email services, like Yahoo, may have problems. For example, I use Charter to connect to the Internet, but I use my Yahoo address to send email. The IP of my sending machine, which belongs to Charter, then differs from the sender name in the "From " line, which belongs to Yahoo. More problems need to be solved before SPF is accepted as a protocol for the whole Internet. But if it is, SPF would be quite helpful.
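The idea behind the SPF check, and the Charter and Yahoo problem, can be sketched in a few lines. This is only an illustration in Python: the table of permitted senders stands in for the records a real check would look up in DNS, and the domains, IP addresses, function name, and result words are made up for the example.

```python
# Hypothetical published policies: domain -> IP addresses allowed to send
# mail for that domain. A real SPF check would query DNS instead.
PUBLISHED_SENDERS = {
    "yahoo.example": {"203.0.113.10", "203.0.113.11"},
    "charter.example": {"198.51.100.7"},
}


def spf_like_check(sender_address, connecting_ip, policies=PUBLISHED_SENDERS):
    """Compare the claimed sender's domain with the IP that is sending."""
    domain = sender_address.rsplit("@", 1)[-1].lower()
    allowed = policies.get(domain)
    if allowed is None:
        return "none"  # the domain has published no policy
    return "pass" if connecting_ip in allowed else "fail"


# Mail sent through the provider's own server.
print(spf_like_check("me@yahoo.example", "203.0.113.10"))
# The same address used from another provider's server, as in the
# Charter and Yahoo example above.
print(spf_like_check("me@yahoo.example", "198.51.100.7"))
# A domain that has not published anything.
print(spf_like_check("someone@unknown.example", "192.0.2.1"))
```

The second call shows why legitimate users can be rejected: the address is genuine, but the server that sends the mail is not on the domain's list.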

Conclusion

More and more methods are coming out to fight spam. Some may be very successful, like the Bayesian filter. Some may be very aggressive and stop spam before it reaches the mailbox, like SPF. Some may punish spammers at the legal level, like the CAN-SPAM Act. Some may just make the door narrower for spammers, as Yahoo, MSN, and AOL do. But all are fighting together. Paul Graham said that spammers are paid according to the response rate of their spam. The less spam that gets through, the less they are paid to make spam, and one day spammers will be out of business. I think that on that day, perhaps, my email inbox will be less crowded than my physical mailbox.

Source

[1] Spam and ham are names used for these kinds of email, not abbreviations of unsolicited commercial email and legitimate email, respectively.

[2] Gabrielle Gagnon, "Detecting Spam," PC Magazine, May 2004 issue, editor in chief Michael J. Miller, page 72.

[3] Bryan Costales with Eric Allman, "Sendmail", 3rd edition, O'Reilly, 2002, page 8, ISBN 1-56592-839-3.

[4] Dr-K, "A Complete H@cker's Handbook", first edition, Carlton Books, London, 2000, 2002, page 109, ISBN 1-84222-724-6.

[5] The same book as [4], page 110.

[6] The same book as [4], page 110.

[7] Web title: "SMTP+SPF"

Author: The website does not clearly state who its author is. However, I followed a URL in the article "SPF, MTAs, and SRS" by Meng Weng Wong (see Note) to reach this website, and Wong is the founder and CTO of pobox.com. So I believe that if the author of this website is not Wong, the website is at least maintained or directed by Wong.

Note: "SPF, MTAs, and SRS", by Meng Weng Wong, Linux Journal, editor in chief Don Marti, May 2004, issue 121, page 50.

Date read: 05/06/04

URL: http://spf.pobox.com/intro.html

[8] Web title: "Paul Graham"

Date read: 05/06/04

URL: http://www.paulgraham.com/spam.html

The Bayesian filter is not a new idea in anti-spam (as Paul Graham writes in another article, "Better Bayesian Filtering", on the same website). In 1998, Pantel and Lin, and Microsoft Research, published papers about Bayesian filtering, but their results were not good enough to catch the public's attention (see [13]).

"Paul Graham is the designer of the Arc language. He most recently worked for Yahoo. Before that he was president of Viaweb, which became Yahoo Store when Viaweb was acquired by Yahoo in 1998." (quoted from his website, at the same URL)

[9] Web title: "Spambayes"

Project held by: Tim Peters, Gary Robinson, Rob Hooft, Mark Hammond, and more. (I cannot find an exact statement of who made this website. The whole project has many contributors, so I list only a few who are important in the project.)

Date read: 05/06/04

URL: http://spambayes.sourceforge.net/

[10] This is the spam example, quoted in [8], which cannot get through the Bayesian filter.

This email is too long to put into the body of the paper, so I include it here to explain the Bayesian filter method.

Return-Path: <z_q_c_x@263.net>

Delivered-To: wg-pg@wg.archub.org

Received: (qmail 17529 invoked from network); 11 Aug 2002 17:32:07 -0000

Received: from unknown (HELO mail100.store.yahoo.com) (216.136.225.204)

by ip67-89-31-66.z31-89-67.customer.algx.net with SMTP; 11 Aug 2002 17:32:07 -0000

Received: from 263.net ([61.50.141.181])

by mail100.store.yahoo.com (8.11.2/8.11.2) with ESMTP id g7BHXfg50998

for <psg@paulgraham.com>; Sun, 11 Aug 2002 10:33:41 -0700 (PDT)

Message-Id: <200208111733.g7BHXfg50998@mail100.store.yahoo.com>

From: "zqcx" <z_q_c_x@263.net>

Subject: permission to enter Chinese market

To: psg@paulgraham.com

Content-Type: text/plain;charset="GB2312"

Reply-To: z_q_c_x@263.net

Date: Mon, 12 Aug 2002 01:32:33 +0800

X-Priority: 3

X-Mailer: Microsoft Outlook Express 6.00.2600.0000

Dear Sir or Madam:

Please reply to

Receiver: China Enterprise Management Co., Ltd. (CMC)

E-mail: unido@chinatop.net

As one technical organization supported by China Investment and Technical Promotion Office of United Nation Industry Development Organization (UNIDO), we cooperate closely with the relevant Chinese Quality Supervision and Standardization Information Organization. We provide the most valuable consulting services to help you to open Chinese market within the shortest time:

1. Consulting Service on Mandatory National Standards of The People's Republic of China.

2. Consulting Service on Inspection and Quarantine Standards of The People's Republic of China.

3. Consulting Service for Permission to Enter Chinese Market

We are very sorry to disturb you!

More information, please check our World Wide Web: http://www.chinatop.net

Sincerely yours

[11] An example of spam, quoted in [8], which gets through the Bayesian filter

Return-Path: <morris@hostex.com>

Delivered-To: wg-pg@wg.archub.org

Received: (qmail 75813 invoked from network); 21 Aug 2002 08:39:41 -0000

Received: from unknown (HELO mail101.store.yahoo.com) (216.136.224.113)

by ip67-89-31-66.z31-89-67.customer.algx.net with SMTP; 21 Aug 2002 08:39:41 -0000

Received: from morris.hostex.com (morris.sales.hostex.com [213.197.134.181])

by mail101.store.yahoo.com (8.11.2/8.11.2) with ESMTP id g7L8ei478111

for <pg@bugbear.com>; Wed, 21 Aug 2002 01:40:44 -0700 (PDT)

Received: by morris.hostex.com (Postfix, from userid 1003)

id 65CFA2027; Wed, 21 Aug 2002 08:30:35 +0000 (GMT)

To: pg@bugbear.com

From: Irene Morris <morris@hostex.com>

X-Mailer: Microsoft Outlook Express 5.50.4807.1700

Subject: Re: hosting services

Message-Id: <20020821083035.65CFA2027@morris.hostex.com>

Date: Wed, 21 Aug 2002 08:30:35 +0000 (GMT)

Dear Paul Graham,

As a person involved in website development you recognize that finding a

good web host offering reasonably priced quality services can be quite

difficult. We believe that offering a combination of quality services,

timely and responsive customer support and good prices is where our company

excels.

HOSTEX ( http://www.hostex.com ) is a flexible company providing quality and

cost effective web hosting solutions with focus on small and medium businesses.

Recently we introduced new service plans offering flexibility, performance and value:

Our plans are:

Basic - $6.95 / month

- 40 Mb of diskspace

- 10 POP3 mailboxes, unlimited aliases, autoresponders

- unlimited ftp accounts, URL protection, shell access and more

- no setup fee

Advanced - $14.95 / month

- 80 Mb of diskspace

- 15 POP3 mailboxes, unlimited aliases, autoresponders

- unlimited ftp accounts, URL protection, shell access

- CGI/SSI scripting (Perl, Python, Tcl)

- PHP4 scripting

- Frontpage 2002 extensions

- Graphical web statistics

- various optional services and more

- no setup fee

Professional - customizable, make-your-own plan

- select only those services that you need

All plans include full 30-day money back guarantee and there is no setup fee.

Services and features associated with your hosting account can be

managed via our easy to use Control Center. Everything is right there at

your finger tips and if you are stuck our customer support team will provide

you with friendly and personal attention 24 hours a day, 7 days a week.

Please visit http://www.hostex.com for more information.

Sincerely,

Irene Morris

Sales Manager

morris@hostex.com

[12] Karl Henrik Borch, "The Economics of Uncertainty", Princeton University Press, 1968, page 206, ISBN 691-04124-5.

Also read: Robert L. Winkler, "Introduction to Bayesian Inference and Decision", Holt, Rinehart and Winston Inc., 1972, page 41, ISBN 0-03-081327-1.

Also read: Gabrielle Gagnon, "Detecting Spam", PC Magazine, May 2004 issue, editor in chief Michael J. Miller, page 72.

[13] Web title: "Paul Graham"

Date read: 05/06/04

Article title: "Better Bayesian Filtering"

URL: http://www.paulgraham.com/better.html, second paragraph.

[14] Web title: PC World

Date read: 05/06/04

Article title: "Spam Weapon of Tomorrow", published in PC World magazine

Article author: Tim Spring

Published date: Monday, March 01, 2004

URL: http://www.pcworld.com/news/article/0,aid,114995,00.asp
