Page for Independent Study students Fall 2009: CS 4980 Section 001

This is a 1-unit independent study course, which is expected to take about 30 hours of work over the course of the semester. The project for this course is:

0. Keep track of your time on a spreadsheet.

1. Read the article:
Patrick Pantel and Dekang Lin. 1998. SpamCop – A Spam Classification & Organization Program. In Proceedings of AAAI Workshop on Learning for Text Categorization. pp. 95-98. Madison, Wisconsin.

2. We will meet as a group to discuss the article (hopefully mid-October). You should come to my office hours or make an appointment at least every other week to update me on your progress.

3. Implement the Naive Bayes algorithm described in the article, using Python, to create a natural language classifier.

4. Test your classifier on data that I will provide later in the semester.

5. Write up your results:
       I expect approximately 2-3 typed pages explaining how your classifier works, what its key data structures and algorithms are, and what its limitations are.


We met on October 15th at 11:00 am.
Discussed

For the next meeting (October 29th at noon)
Write a program in Python to count words in a text document. Your program should:


How to Classify Messages

Classify the message as a flame if:
P(Flame | Message) > P(Not_Flame | Message)

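where w_1, …, w_n are the words of the message and, dropping the common factor 1/P(Message) that cancels in the comparison,
P(Flame | Message)  ∝  P(w_1 | Flame) … P(w_n | Flame) P(Flame)
and
P(Not_Flame | Message)  ∝  P(w_1 | Not_Flame) … P(w_n | Not_Flame) P(Not_Flame)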
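
Multiplying many small probabilities can underflow to zero in floating point, so a common approach is to compare sums of logarithms instead; the comparison gives the same answer because the logarithm is monotonic. The following is a minimal sketch, assuming the per-word probabilities are already available in two dictionaries. The names is_flame, p_word_flame, and p_word_not_flame are hypothetical and do not come from the article.

```python
import math

def is_flame(words, p_word_flame, p_word_not_flame, p_flame):
    """Return True if the message scores higher as a flame than as a non-flame.

    words: list of tokens in the message.
    p_word_flame / p_word_not_flame: dicts mapping word -> P(word | class).
    p_flame: prior P(Flame); P(Not_Flame) is taken to be 1 - p_flame.
    Assumes every word has a nonzero probability in both dictionaries.
    """
    # Start from the log of each class prior.
    score_flame = math.log(p_flame)
    score_not_flame = math.log(1.0 - p_flame)
    # Add the log of each per-word probability (a product becomes a sum).
    for w in words:
        score_flame += math.log(p_word_flame[w])
        score_not_flame += math.log(p_word_not_flame[w])
    # Strict inequality: ties are classified as not a flame.
    return score_flame > score_not_flame
```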

Now each piece is computed by:

P(w_i | Flame) = (N(w_i, Flame) + 1/k) / (N(Flame) + 1)

similarly

P(w_i | Not_Flame) = (N(w_i, Not_Flame) + 1/k) / (N(Not_Flame) + 1)

where N(w_i, Flame) is the number of occurrences of w_i in flames and N(Flame) is the number of flames in the corpus. N(w_i, Not_Flame) and N(Not_Flame) are defined the same way for messages that are not flames.
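
The estimate above can be sketched as a small helper. This is only an illustration: the function name word_probability is hypothetical, and because this page does not define k, the sketch simply takes it as a parameter chosen by the caller (one plausible choice, assumed here, is the number of distinct words in the corpus).

```python
def word_probability(word, word_counts, n_class, k):
    """Estimate P(word | class) as (N(word, class) + 1/k) / (N(class) + 1).

    word_counts: dict mapping word -> N(word, class).
    n_class: N(class) as defined on this page.
    k: smoothing constant (ASSUMPTION: not defined on this page).
    """
    # Words never seen in this class count as 0, so the 1/k term
    # keeps the estimate strictly positive.
    return (word_counts.get(word, 0) + 1.0 / k) / (n_class + 1.0)
```

For example, word_probability("idiot", {"idiot": 3}, 10, 100) evaluates (3 + 0.01) / 11, roughly 0.274, while a word that never appears in the class gets 0.01 / 11 rather than zero.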

P(Flame) is the proportion of flames in the corpus (i.e. total number of flame messages divided by total number of messages).
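
The counts and the prior used in these formulas could be gathered in one pass over a labeled corpus, roughly as follows. This is a sketch under stated assumptions, not the required design: the function name count_corpus, the (text, label) input format, and the tokenization (lowercase, split on whitespace) are all hypothetical choices.

```python
from collections import Counter

def count_corpus(labeled_messages):
    """Collect counts and the prior from (text, label) pairs.

    label is True for a flame and False otherwise.
    ASSUMPTION: tokens are lowercase, whitespace-separated words.
    """
    flame_words = Counter()      # N(w, Flame) for each word w
    not_flame_words = Counter()  # N(w, Not_Flame) for each word w
    n_flame = 0                  # number of flame messages
    n_not_flame = 0              # number of non-flame messages
    for text, label in labeled_messages:
        tokens = text.lower().split()
        if label:
            flame_words.update(tokens)
            n_flame += 1
        else:
            not_flame_words.update(tokens)
            n_not_flame += 1
    total = n_flame + n_not_flame
    # P(Flame): flame messages divided by all messages (0.0 for an empty corpus).
    p_flame = n_flame / total if total else 0.0
    return flame_words, not_flame_words, n_flame, n_not_flame, p_flame
```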

Otherwise the message is classified as not a flame.

I will update this to explain how to train, test, and measure your results.


Flame Corpus
Annotation