Email, Spam, and Anti-Spam Methods Using a Bayesian Filter
By Yen-I Chiang
Introduction
As we use the Internet more and more to speed up our access to information,
email has become one of our most convenient tools of communication. We can send
a message to a friend almost instantly, anywhere on earth. However, this
convenience brings its own inconvenience. Spam, the unsolicited commercial
email, arrives together with ham, the legitimate email. [1] In recent years,
spam has taken up a larger and larger share of our email traffic. In May 2004,
spam made up 60% of email traffic. [2] In my personal experience, even after
Yahoo's official filter, seven of every ten incoming emails are spam. One of ten
is commercial email from Borders, Target, etc. (I sometimes dislike these
commercial emails, but I did not think about the consequences when I gave out my
email address.) Only two of ten are legitimate emails. Suffering from spam, many
people, like me, turn to anti-spam software for help. Unlike others, however, I
am more interested in knowing where spam comes from, how anti-spam software
works, and what other ways we can use to stop spam.
Where does spam come from?
(1) The basic parts of an email message
Before we know where spam comes from, we need to know how spammers make spam.
And of course, before that, we need to know what an email is made of. An email
is a sequence of bytes, not a file. It is composed of three parts: the header,
the body, and the envelope. The header records who sent the email, whom it is
sent to, time stamps, the sender's IP, and information about the relays that
transferred the email. The body is the content of the email. The envelope is the
addressing information, such as the username at the sender's server, that mail
servers use to deliver the email.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From you@Here.us.edu Fri Dec 13 08:11:44 2002
Received: (from you@localhost)
        by Here.us.edu (8.12.7/8.12.7)
        id d8BILu12835 for you; Fri, 13 Dec 2002 08:11:44 -0600 (MDT)
Date: Fri, 13 Dec 2002 08:11:43
From: you@Here.us.edu (your full name)
Message-Id: <200201011511.d9BMTuX29709@Here.us.edu>
Subject: a test
To: you

This is a one-line message.
Fig 1 [3]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(See Fig 1.) In the header, the line "From you@Here.us.edu Fri Dec 13 08:11:44
2002" is the element that must exist. Other lines, like "To: you", "From:
you@Here.us.edu (your full name)", and "Date: Fri, 13 Dec 2002 08:11:43", are
not strictly required for an email. However, recent versions of MUAs (mail user
agents) and MTAs (mail transfer agents) do require the "To: you" line, the
"Subject: a test" line, and the "Cc: another person" line. The requirements vary
from agent to agent.
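To make the header/body split concrete, here is a minimal sketch in Python that
feeds the message of Fig 1 to the standard library's email parser. The variable
and function names (RAW_MESSAGE, split_message) are my own and only for
illustration; the sketch does not show how any particular MUA or MTA works.

```python
from email import message_from_string

# The sample message of Fig 1 (continuation lines start with a tab).
RAW_MESSAGE = """\
From you@Here.us.edu Fri Dec 13 08:11:44 2002
Received: (from you@localhost)
\tby Here.us.edu (8.12.7/8.12.7)
\tid d8BILu12835 for you; Fri, 13 Dec 2002 08:11:44 -0600 (MDT)
Date: Fri, 13 Dec 2002 08:11:43
From: you@Here.us.edu (your full name)
Message-Id: <200201011511.d9BMTuX29709@Here.us.edu>
Subject: a test
To: you

This is a one-line message.
"""


def split_message(raw):
    """Return the leading "From " line, the header fields, and the body."""
    msg = message_from_string(raw)
    headers = {name: value for name, value in msg.items()}
    return msg.get_unixfrom(), headers, msg.get_payload()


from_line, headers, body = split_message(RAW_MESSAGE)
print(from_line)                 # the "From " line that must exist
print(headers["Subject"])        # an optional header field
print(body.strip())              # everything after the blank line
```

The blank line is what separates the header from the body; everything the
parser reports as a header field comes from the lines above it.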
Most of the time, we send our email through an MUA. Few of us know that email
can be sent by hand, and we rarely do so. But it is possible to send email by
hand using telnet.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1  [hb@redhat6~]$ telnet slack 25
2  Trying 192.168.0.111...
3  Connected to slcakware.homeworx.org.
4  Escape character is '^]'.
5  220-slack.homeworx.org Sendmail 8.6.12/8.6.9
6  ready at Mon, 13, Mar 1980 14:01:06
7  GMT
8  220 ESMTP spoken here
9  HELO
10 250 slack.homeworx.org Hello hb@redhat6
11 [192.168.0.166], pleased to meet you
12 MAIL FROM: bigbrother@ms.1984.org
13 250 bigbrother@ms.1984.org... Sender ok
14 RCPT TO: fred@slack
15 250 fred@slack... Recipient ok
16 DATA
17 354 Enter mail, end with "." on a line by itself
18 Hello,
19 this is a message from Big Brother.
20 I am watching you so behave yourself.
21 Bye for now!
22 Big Brother
23 .
24 250 0AA00253 Message accepted for delivery
25 quit
26 221 slack.homeworx.org closing connection
27 Connection closed by foreign host.
Fig 2 [4]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1  From bigbrother@ms.1984.org Mon Mar 13 12:01:53 1980
2  Date: Mon, 13 Mar 1980 12:01:10 GMT
3  From: bigbrother@ms.1984.org
4  Apparently-To: fred@slack.homeworx.org
5
6  Hello,
7  this is a message from Big Brother.
8  I am watching you so behave yourself.
9  Bye for now!
10 Big Brother
Fig 3 [5]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1  From bigbrother@ms.1984.org Mon Mar 13 12:01:53 1980
2  Return-Path: bigbrother@ms.1984.org
3  Received: from redhat6 (hb@redhat6 [192.168.0.166])
4  by slack.homeworx.org (8.6.12/8.6.9) with SMTP id MAA00176 for
5  fred@slack; Mon, 13 Mar 1980 12:01:10 GMT
6  Date: Mon, 13 Mar 1980 12:01:10 GMT
7  From: bigbrother@ms.1984.org
8  Message-ID: <198010131201.MAA00176@slack.homeworx.org>
9  Apparently-To: fred@slack.homeworx.org
10 Status: 0
Fig 4 [6]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By following the steps in Fig 2, we can send an email by hand. The email
received will look like Fig 3. However, sending email by hand also gives a
spammer or cracker a chance to forge a fraudulent email.
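The client side of the session in Fig 2 can be sketched as a short Python
function that only builds the lines a person would type; it does not open any
network connection. The function name smtp_commands and its parameters are
hypothetical, chosen for this illustration. The point it shows is that the
sender address is simply whatever the client writes after MAIL FROM.

```python
def smtp_commands(helo_name, envelope_sender, recipient, body_lines):
    """Return, in order, the lines a client types in a minimal SMTP session."""
    commands = [
        f"HELO {helo_name}",
        f"MAIL FROM: {envelope_sender}",  # supplied by the client, not verified here
        f"RCPT TO: {recipient}",
        "DATA",
    ]
    for line in body_lines:
        # A body line that starts with "." gets an extra "." so that it is
        # not mistaken for the end-of-message marker.
        commands.append("." + line if line.startswith(".") else line)
    commands.append(".")     # a lone dot ends the message
    commands.append("quit")
    return commands


session = smtp_commands(
    "redhat6",
    "bigbrother@ms.1984.org",
    "fred@slack",
    ["Hello,", "this is a message from Big Brother.", "Big Brother"],
)
for line in session:
    print(line)
```

Nothing in this sequence ties the address after MAIL FROM to the user who is
actually logged in at the keyboard.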
(2) The weakness of email
Fig 3 shows the format of an email as we usually see it. There is "From " in
line 1, "To: " in line 4, and the body in lines 6 to 10. However, more
information is hidden in the header. So let's take a look at the full header of
the email, Fig 4, where more information shows up. Remember that this email was
sent by hand in Fig 2. Line 1 of Fig 2 shows who really sent this email through
telnet. But line 1 of Fig 3 says the email was sent by bigbrother@ms.1984.org,
not by the real person, hb@redhat6. How could this happen? When we send an email
by hand, the MTA (Sendmail, speaking SMTP) does not use the user name or IP of
the current sender (Fig 2, line 1) as the sender in the "From " line of the
email. It uses the name the user typed in, "bigbrother@ms.1984.org" in line 12
of Fig 2. The same user-supplied "bigbrother@ms.1984.org" is also used as the
envelope of the email. However, the sender, hb@redhat6, cannot fake the email
completely. There are still clues that we can see in the full header: the real
IP of the sender appears in line 3 of Fig 4.
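As a rough illustration of reading these clues, the following Python sketch
pulls the claimed sender and the connecting user and IP out of the header of
Fig 4 with regular expressions. The names FIG4_HEADER and header_clues are my
own, and the patterns only assume the Sendmail-style Received line shown in
Fig 4; other mail servers format this line differently.

```python
import re

# The first lines of the full header in Fig 4, without the figure's line numbers.
FIG4_HEADER = """\
From bigbrother@ms.1984.org Mon Mar 13 12:01:53 1980
Return-Path: bigbrother@ms.1984.org
Received: from redhat6 (hb@redhat6 [192.168.0.166])
\tby slack.homeworx.org (8.6.12/8.6.9) with SMTP id MAA00176 for
\tfred@slack; Mon, 13 Mar 1980 12:01:10 GMT
From: bigbrother@ms.1984.org
"""

RETURN_PATH_RE = re.compile(r"^Return-Path: (\S+)", re.MULTILINE)
RECEIVED_RE = re.compile(
    r"^Received: from \S+ \(([^\s\[]+) \[([0-9.]+)\]\)", re.MULTILINE
)


def header_clues(header_text):
    """Return the claimed sender and the connecting user and IP, if present."""
    claimed = RETURN_PATH_RE.search(header_text)
    received = RECEIVED_RE.search(header_text)
    return {
        "claimed_sender": claimed.group(1) if claimed else None,
        "connecting_user": received.group(1) if received else None,
        "connecting_ip": received.group(2) if received else None,
    }


clues = header_clues(FIG4_HEADER)
print(clues)
```

The claimed sender is whatever was typed in Fig 2, while the connecting user
and IP were recorded by the receiving server itself.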
Spammers use this forging trick. Why do they need to forge their emails? Because
there are thousands of angry spam receivers like us trying to send the spam
back. Now that we know where the real sender of our spam is, can we take action
to block their email or hunt them down? Sorry, it is not that easy. Most
spammers are crackers. This means the sender, hb@redhat6, may not be the real
identity of the spammer. He may have cracked into the server, redhat6, and
created the user name hb for the purpose of sending spam. There are many ways to
crack into a server, but how to do it is not the point of this paper. There may
be more tricks that spammers use to forge an email or even to hide their real
IP. The reason for discussing forged email is that most spam is fraudulent, and
some anti-spam methods, for example SPF (Sender Policy Framework, or Sender
Permitted From), use this characteristic to stop spam. [7] SPF is discussed in a
later section.
Hunting down a fake identity or blocking spam from a cracked server's IP is not
the way to stop spam. Spammers only need to crack more servers to replace the
exposed one. We need something more efficient to stop spam.
How does anti-spam software work?
(1) Earlier methods
There are a few early types of anti-spam software. One kind remembers the sender
name or email address of each spam and builds a blacklist of names and email
addresses. But spammers keep changing their sending address or IP, so this
method fails to work well: the software does not know the new name or address,
and the spammers never stop changing it.
Later came methods that try to recognize the style of a spam's subject line.
They look for certain words, like loan, free, sex, etc., and build a blacklist
of these sensitive words. If the subject line of an incoming email includes one
of these words, the email is sorted as spam. Otherwise the email is ham. Here
the problems begin. To a human, LOAN and LO@N are the same word. If the
blacklist does not contain LO@N, the spam goes through. Then the developer of
the anti-spam software upgrades it so that it recognizes LO*N or L**N (where *
stands for any numeric or alphabetic sign of any language that looks like "O" or
"A"). However, it still cannot recognize L*O*A*N, or anything else that looks
like LOAN to a human but not to software. There will be endless variations of a
single word for the software to recognize, and it still has to take care of the
other sensitive words, not just one. The efficiency of this method is low, and
as the blacklist grows, the efficiency will get even lower.
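The weakness of a word blacklist is easy to reproduce. The Python sketch below
is a deliberately naive subject-line filter; the three blacklisted words are the
ones named above, while the function name and the sample subjects are made up
for this illustration and do not come from any real product.

```python
BLACKLIST = {"loan", "free", "sex"}  # the sensitive words named in the text


def is_spam_by_subject(subject, blacklist=BLACKLIST):
    """Flag a subject line if any of its words is on the blacklist."""
    words = subject.lower().split()
    return any(word.strip(".,!?:;") in blacklist for word in words)


for subject in ["Cheap LOAN today", "Cheap LO@N today", "Cheap L*O*A*N today"]:
    print(subject, "->", is_spam_by_subject(subject))
```

Only the first subject is caught. Each disguised spelling needs its own entry or
pattern, which is why the list keeps growing.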
(2) Bayesian filter
In August 2002, Paul Graham proposed an anti-spam method in his article "A Plan
for Spam." [8] He proposed that a Bayesian filter would be a good method for
filtering spam, supported by his astonishingly good experimental results: a
spam catch rate of 0.99 with no false positives. Following his article, many
Bayesian filters were developed as freeware, and one project, the SpamBayes
project [9], is trying to apply the idea of the Bayesian filter in other places,
such as an Outlook plug-in, a POP3 proxy, POP3 with IMAP, a procmail
environment, etc.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
madam 0.99
promotion 0.99
republic 0.99
shortest 0.047225013
mandatory 0.047225013
standardization 0.07347802
sorry 0.08221981
supported 0.09019077
people's 0.09019077
enter 0.9075001
quality 0.8921298
organization 0.12454646
investment 0.8568143
very 0.14758544
valuable 0.82347786
Fig 5 A spam that does not get through [10]~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
perl 0.01
python 0.01
tcl 0.01
scripting 0.01
morris 0.01
graham 0.01491078
guarantee 0.9762507
cgi 0.9734398
paul 0.027040077
quite 0.030676773
pop3 0.042199217
various 0.06080265
prices 0.9359873
managed 0.06451222
difficult 0.071706355
Fig 6 A Spam
get through [11]~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
P(Spam | all interesting words) =

              P(1st word) P(2nd word) P(3rd word)
  ----------------------------------------------------------------
  P(1st word) P(2nd word) P(3rd word)
      + (1 - P(1st word)) (1 - P(2nd word)) (1 - P(3rd word))

For a Bayesian filter, the probabilities of the interesting words (tokens) are
assumed to be independent of each other.
Fig 7 An example of Bayes' rule for three components [12]~~~~~~~~~~~~~~~~
I am going to use the examples from "A Plan for Spam" to explain how a Bayesian
filter works. Fig 5 is a spam that cannot get through the filter (the content of
the email is in [10]). Fig 6 is a spam that gets through the filter (the content
of the spam is in [11]). When a Bayesian filter processes an email, it takes the
header and the body and scans them against its list of tokens, which are
interesting words, not necessarily sensitive words. It then produces a table of
the 15 most interesting words together with each word's probability in the
filter's list. Finally, it puts these 15 probabilities into Bayes' rule to
produce a final probability that tells us whether the email is spam. When the
final probability is over 0.9, the email is classified as spam and filtered out.
Otherwise the email is treated as ham and passed.
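The rule of Fig 7, extended from three words to fifteen, can be tried directly
on the numbers of Fig 5 and Fig 6. The Python sketch below is only my own
restatement of that formula with the 0.9 threshold described above; the names
combined_probability and is_spam are hypothetical, and it is not Paul Graham's
code.

```python
from math import prod

# The 15 token probabilities listed in Fig 5 and Fig 6.
FIG5 = [0.99, 0.99, 0.99, 0.047225013, 0.047225013, 0.07347802, 0.08221981,
        0.09019077, 0.09019077, 0.9075001, 0.8921298, 0.12454646, 0.8568143,
        0.14758544, 0.82347786]
FIG6 = [0.01, 0.01, 0.01, 0.01, 0.01, 0.01491078, 0.9762507, 0.9734398,
        0.027040077, 0.030676773, 0.042199217, 0.06080265, 0.9359873,
        0.06451222, 0.071706355]


def combined_probability(probabilities):
    """Combine per-token probabilities as in Fig 7, for any number of tokens."""
    spam_side = prod(probabilities)
    ham_side = prod(1.0 - p for p in probabilities)
    return spam_side / (spam_side + ham_side)


def is_spam(probabilities, threshold=0.9):
    """Apply the threshold from the text: over 0.9 means spam."""
    return combined_probability(probabilities) > threshold


print(round(combined_probability(FIG5), 4), is_spam(FIG5))
print(combined_probability(FIG6), is_spam(FIG6))
```

With the numbers of Fig 5 the result is about 0.903, just over the threshold, so
the email is filtered out. With the numbers of Fig 6 the many ham-like tokens
pull the result far below the threshold, so the spam passes.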
The Bayesian filter needs to be trained to build up its list of interesting
words (tokens) and their probabilities. Because people differ in personal life,
education, community, correspondents, career, etc., the list of interesting
words may be very different from person to person. Because of Paul's career, for
him the word "perl" has a probability of 0.01 (a lower value marks a ham token,
a higher value a spam token). For another person, the word "perl" may never
appear in the list or be taken as a token. For the same reason, if a spam
contains many words that you frequently use in ham, its final probability may
drop and it may pass the filter. For example, Fig 6 has a final probability of
0.01, far below the borderline of 0.90.
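A simplified sketch of the training step, in Python, may help here. It counts in
how many spam and ham messages each token appears and turns the two rates into a
probability clamped to the range 0.01 to 0.99, the extremes seen in Fig 5 and
Fig 6. The tokenizer, the function names, and the four tiny sample messages are
assumptions made for this illustration; the weighting used in [8] differs in
its details.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase tokens, keeping signs such as @ and *."""
    return re.findall(r"[a-z0-9'$@*-]+", text.lower())


def train(spam_messages, ham_messages):
    """Build a token -> spam probability table from two non-empty corpora."""
    spam_counts = Counter(t for m in spam_messages for t in set(tokenize(m)))
    ham_counts = Counter(t for m in ham_messages for t in set(tokenize(m)))
    n_spam, n_ham = len(spam_messages), len(ham_messages)
    table = {}
    for token in set(spam_counts) | set(ham_counts):
        spam_rate = spam_counts[token] / n_spam
        ham_rate = ham_counts[token] / n_ham
        probability = spam_rate / (spam_rate + ham_rate)
        table[token] = min(0.99, max(0.01, probability))  # clamp the extremes
    return table


# Made-up training data for one hypothetical user.
spam = ["free lo@n guarantee", "best prices guarantee"]
ham = ["perl scripting question", "free perl and python notes"]
table = train(spam, ham)
print(table["perl"], table["guarantee"], table["lo@n"], table["free"])
```

For this user "perl" ends up as a strong ham token and "guarantee" as a strong
spam token, while a word seen equally often in both kinds of mail stays neutral.
The disguised word "lo@n" is learned as a spam token like any other, with no
special rule for it.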
The Bayesian filter also needs frequent training to pick up new tokens. It is
very hard to say that after today's 2000 emails the filter fully fits the
pattern of the emails I receive. It is also hard to say that the filter knows
the pattern of all spam, since spammers keep making their spam look more like
normal email. Still, there is a difference between the Bayesian filter and the
earlier filters. It is useless to try to fool a Bayesian filter with words like
LO@N, the weakness of the earlier filters. After a little training, the filter
learns this kind of trick and filters it out without any upgrade from the
vendor, which the earlier filters needed.
One weakness of the Bayesian filter is spam that looks just like ham, with the
only difference being a URL, its real purpose, inside. This kind of spam reads
like normal email content, such as a story about someone's day or about
research. But at the end, it uses a few words to lead you to the URL. (It must
use only a few words; too many words would increase its probability.) This kind
of spam seems very hard for a Bayesian filter to catch. However, when the
spammer includes the URL purely for "spamming" purposes, most of the time the
URL stays the same, so it can be recognized and added to the token list.
What are other ways we can use to stop spam?
There are other methods used to stop spam. On the legal level, there is the
CAN-SPAM Act, passed on November 25, 2003. ISPs have also started to sue
individuals and organizations that send spam, and they have changed how free
email accounts are applied for. In the past, when we wanted to apply for a free
email account from an ISP, all we did was fill in a form. Since these steps can
be carried out by a computer program, and spammers need lots of new accounts for
sending spam, spammers started to use programs to apply for new accounts. For
this reason, ISPs started to use a trick that a machine cannot perform in place
of a human. ISPs show a computer-generated picture of a word and ask the
applicant to type in the word, in order to tell humans from programs, since
there is not yet a way for a program to read a picture of distorted words.
Stopping spammers from applying for new accounts is not enough for ISPs. ISPs
have come up with more plans for fighting back. [14] AOL is testing a new
anti-spam protocol, SPF. Yahoo is developing its DomainKeys plan. And MSN is
developing a system like caller ID. Some ISPs, including MSN, plan to make
spammers pay to send mail. Many methods are under development, and many of them
need the help and cooperation of ISPs all over the world, which is going to be a
huge task.
One of the methods that caught my attention is SPF. SPF is not a program or
software that filters spam. It is a protocol. Under this protocol, before an
email is accepted, the receiving MTA checks with the sender's domain whether the
machine that sent the email is permitted to send mail for that domain. For a
spam, most of the time, the real sender and the sender named in the "From " line
are different. So under SPF, the fraudulent email is rejected before it is
delivered. SPF can block not only spam but also worms and some viruses that
travel in forged email. However, SPF also brings some problems. Under this
protocol, users of free email services, like Yahoo, may have problems. For
example, I use Charter to connect to the Internet, but I use Yahoo to send
email. So my IP as sender, which belongs to Charter, does not match the sender
name in the "From " line, which belongs to Yahoo. More problems need to be
solved before SPF is accepted as a protocol for the whole Internet. But if it
is, SPF will be quite helpful.
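The idea of the check can be sketched in a few lines of Python. In this sketch a
small dictionary stands in for the policy that a domain would publish, and the
domain name, the address ranges, and the function name spf_check are all made up
for illustration. A real SPF check looks the policy up over the network and
handles many more cases.

```python
import ipaddress

# Hypothetical published policy: domain -> networks allowed to send its mail.
# The domain and the address ranges are placeholders, not real data.
PUBLISHED_POLICY = {
    "mail-provider.example": ["192.0.2.0/24"],
}


def spf_check(envelope_sender, connecting_ip, policy=PUBLISHED_POLICY):
    """Return "pass", "fail", or "none" for a sender address and a sending IP."""
    domain = envelope_sender.rsplit("@", 1)[-1].lower()
    networks = policy.get(domain)
    if networks is None:
        return "none"  # the domain publishes no policy
    ip = ipaddress.ip_address(connecting_ip)
    if any(ip in ipaddress.ip_network(network) for network in networks):
        return "pass"
    return "fail"


# Mail sent through the provider's own servers.
print(spf_check("me@mail-provider.example", "192.0.2.25"))
# The same address used from an unrelated connection, as in the Charter case.
print(spf_check("me@mail-provider.example", "198.51.100.7"))
```

The second call shows the problem described above: a legitimate user who sends
from an address range outside the provider's policy fails the same check that
stops a forger.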
Conclusion
More and more methods are coming out to fight spam. Some may be very successful,
like the Bayesian filter. Some may be very aggressive and stop spam before it is
delivered, like SPF. Some may punish spammers at the legal level, like the
CAN-SPAM Act. Some may just make the door narrower for spammers, like the
measures of Yahoo, MSN, and AOL. But all of them are fighting together. Paul
Graham said that a spammer is paid by the response rate of his spam. If less
spam gets through, spammers are paid less to make spam, and one day spammers
will be out of business. I like to think that on that day my email inbox will be
less crowded than my mailbox.
Sources
[1] Spam and ham are names used for these two kinds of email, not abbreviations
of unsolicited commercial email and legitimate email, respectively.
[2] Gabrielle Gagnon, "Detecting Spam", PC Magazine, editor in chief Michael J.
Miller, May 2004 issue, page 72.
[3] Bryan Costales with Eric Allman, "Sendmail", 3rd edition, O'Reilly, 2002,
page 8, ISBN 1-56592-839-3.
[4] Dr-K, "A Complete H@cker's Handbook", first edition, Carlton Books, London,
2000, 2002, page 109, ISBN 1-84222-724-6.
[5] Same book as [4], page 110.
[6] Same book as [4], page 110.
[7] Web title: "SMTP+SPF"
Author: The website does not clearly state who its author is. However, I
followed a URL in an article, "SPF, MTAs, and SRS", by Meng Weng Wong (see
Note), to reach this website. Wong is the founder and CTO of pobox.com. So I
believe that, if the author of this website is not Wong, the website is hosted
or directed by Wong.
Note: "SPF, MTAs, and SRS", by Meng Weng Wong, Linux Journal, editor in chief
Don Marti, May 2004, issue 121, page 50.
Date read: 05/06/04
URL: http://spf.pobox.com/intro.html
[8] Web title: "Paul Graham"
Date read: 05/06/04
URL: http://www.paulgraham.com/spam.html
The Bayesian filter is not a new idea for anti-spam (as noted in another article
by Paul Graham, "Better Bayesian Filtering", on the same website). In 1998,
Pantel and Lin, and Microsoft Research, published papers about Bayesian
filtering. But their results were not good enough to catch the public's
attention. (See [13].)
"Paul Graham is the designer of the Arc language. He most recently worked for
Yahoo. Before that he was president of Viaweb, which became Yahoo Store when
Viaweb was acquired by Yahoo in 1998." (Quoted from his website, same URL.)
[9] Web title: "Spambayes"
Project held by: Tim Peters, Gary Robinson, Rob Hooft, Mark Hammond, and more.
(I cannot find an exact statement of who made this website. The whole project is
contributed to by many people; I list only a few who are important in the
project.)
Date read: 05/06/04
URL: http://spambayes.sourceforge.net/
[10] This is the spam example, given in [8], that cannot get through the
Bayesian filter. The email is too long to put in the body of the paper, so I
include it here to explain the Bayesian filter method.
Return-Path: <z_q_c_x@263.net>
Delivered-To: wg-pg@wg.archub.org
Received: (qmail 17529 invoked from network); 11 Aug 2002 17:32:07 -0000
Received: from unknown (HELO mail100.store.yahoo.com) (216.136.225.204)
  by ip67-89-31-66.z31-89-67.customer.algx.net with SMTP; 11 Aug 2002 17:32:07 -0000
Received: from 263.net ([61.50.141.181])
  by mail100.store.yahoo.com (8.11.2/8.11.2) with ESMTP id g7BHXfg50998
  for <psg@paulgraham.com>; Sun, 11 Aug 2002 10:33:41 -0700 (PDT)
Message-Id: <200208111733.g7BHXfg50998@mail100.store.yahoo.com>
From: "zqcx" <z_q_c_x@263.net>
Subject: permission to enter Chinese market
To: psg@paulgraham.com
Content-Type: text/plain;charset="GB2312"
Reply-To: z_q_c_x@263.net
Date: Mon, 12 Aug 2002 01:32:33 +0800
X-Priority: 3
X-Mailer: Microsoft Outlook Express 6.00.2600.0000

Dear Sir or Madam:

Please reply to
Receiver: China Enterprise Management Co., Ltd. (CMC)
E-mail: unido@chinatop.net

As one technical organization supported by China Investment and Technical
Promotion Office of United Nation Industry Development Organization (UNIDO), we
cooperate closely with the relevant Chinese Quality Supervision and
Standardization Information Organization. We provide the most valuable
consulting services to help you to open Chinese market within the shortest time:

1. Consulting Service on Mandatory National Standards of The People's Republic of China.
2. Consulting Service on Inspection and Quarantine Standards of The People's Republic of China.
3. Consulting Service for Permission to Enter Chinese Market

We are very sorry to disturb you!

More information, please check our World Wide Web: http://www.chinatop.net

Sincerely yours
[11] An example of spam, given in [8], that gets through the Bayesian filter.
Return-Path: <morris@hostex.com>
Delivered-To: wg-pg@wg.archub.org
Received: (qmail 75813 invoked from network); 21 Aug 2002 08:39:41 -0000
Received: from unknown (HELO mail101.store.yahoo.com) (216.136.224.113)
  by ip67-89-31-66.z31-89-67.customer.algx.net with SMTP; 21 Aug 2002 08:39:41 -0000
Received: from morris.hostex.com (morris.sales.hostex.com [213.197.134.181])
  by mail101.store.yahoo.com (8.11.2/8.11.2) with ESMTP id g7L8ei478111
  for <pg@bugbear.com>; Wed, 21 Aug 2002 01:40:44 -0700 (PDT)
Received: by morris.hostex.com (Postfix, from userid 1003)
  id 65CFA2027; Wed, 21 Aug 2002 08:30:35 +0000 (GMT)
To: pg@bugbear.com
From: Irene Morris <morris@hostex.com>
X-Mailer: Microsoft Outlook Express 5.50.4807.1700
Subject: Re: hosting services
Message-Id: <20020821083035.65CFA2027@morris.hostex.com>
Date: Wed, 21 Aug 2002 08:30:35 +0000 (GMT)

Dear Paul Graham,

As a person involved in website development you recognize that finding a
good web host offering reasonably priced quality services can be quite
difficult. We believe that offering a combination of quality services,
timely and responsive customer support and good prices is where our company
excels.

HOSTEX ( http://www.hostex.com ) is a flexible company providing quality and
cost effective web hosting solutions with focus on small and medium businesses.
Recently we introduced new service plans offering flexibility, performance and value:

Our plans are:

Basic - $6.95 / month
- 40 Mb of diskspace
- 10 POP3 mailboxes, unlimited aliases, autoresponders
- unlimited ftp accounts, URL protection, shell access and more
- no setup fee

Advanced - $14.95 / month
- 80 Mb of diskspace
- 15 POP3 mailboxes, unlimited aliases, autoresponders
- unlimited ftp accounts, URL protection, shell access
- CGI/SSI scripting (Perl, Python, Tcl)
- PHP4 scripting
- Frontpage 2002 extensions
- Graphical web statistics
- various optional services and more
- no setup fee

Professional - customizable, make-your-own plan
- select only those services that you need

All plans include full 30-day money back guarantee and there is no setup fee.

Services and features associated with your hosting account can be
managed via our easy to use Control Center. Everything is right there at
your finger tips and if you are stuck our customer support team will provide
you with friendly and personal attention 24 hours a day, 7 days a week.

Please visit http://www.hostex.com for more information.

Sincerely,
Irene Morris
Sales Manager
morris@hostex.com
[12] Karl Henrik Borch, "The Economics of Uncertainty", Princeton University
Press, 1968, page 206, ISBN 691-04124-5.
Also read: Robert L. Winkler, "Introduction to Bayesian Inference and Decision",
Holt, Rinehart and Winston Inc., 1972, page 41, ISBN 0-03-081327-1.
Also read: Gabrielle Gagnon, "Detecting Spam", PC Magazine, editor in chief
Michael J. Miller, May 2004 issue, page 72.
[13] Web title: "Paul Graham"
Date read: 05/06/04
Article title: "Better Bayesian Filtering"
URL: http://www.paulgraham.com/better.html, second paragraph.
[14] Web title: PC World
Date read: 05/06/04
Article title: "Spam Weapon of Tomorrow", published in PC World magazine
Article author: Tim Spring
Published date: Monday, March 01, 2004
URL: http://www.pcworld.com/news/article/0,aid,114995,00.asp