Page for Independent Study students Fall 2009: CS 4980 Section 001

This is a 1-unit independent study course, expected to take about 30 hours of work over the course of the semester. The project for this course is:

0. Keep track of your time on a spreadsheet.

1. Read the following article:

Patrick Pantel and Dekang Lin. 1998. SpamCop – A Spam Classification & Organization Program. In Proceedings of the AAAI Workshop on Learning for Text Categorization, pp. 95-98. Madison, Wisconsin.

2. We will meet as a group to discuss the article (hopefully mid-October). You should come to my office hours or make an appointment at least every other week to update me on your progress.

3. Implement the Naive Bayes algorithm described in the article, using Python, to create a natural language classifier.

4. Test your classifier on data that I will provide later in the semester.

I expect approximately 2-3 typed pages explaining how your classifier works, what the key data structures and algorithms are, and what the limitations of the classifier are.

We met on October 15th at 11:00 am.
Discussed:
• conditional probability
• conditional independence assumptions
• features and feature extraction
• need for dampening
• classification task and building a classifier

For the next meeting (October 29th at noon)
Write a program in Python to count words in a text document. Your program should:
• open a text file given on the command line to read (input)
• open a text file given on the command line to write (output)
• open a text file given on the command line to read a stoplist (could be either common_words or an empty file)
• possibly take a command line argument indicating that only words with more than n occurrences should be printed
• remove punctuation (use a regular expression and substitution)
• read the words from the file into a dictionary (key=word, value=count)
• write (both to the screen and to the output file) the words and their counts, sorted by frequency (descending) and then alphabetically
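The steps above can be sketched roughly as follows. This is a minimal sketch, not the required solution: the command-line layout (input file, output file, stoplist file, optional minimum count) and the exact punctuation regex are assumptions, and you should adapt them to your own design.

```python
import re
import sys

def count_words(text, stoplist=frozenset()):
    """Count word occurrences, skipping any word in the stoplist."""
    # Remove punctuation with a regular expression substitution,
    # then lowercase and split on whitespace.
    words = re.sub(r"[^\w\s]", " ", text).lower().split()
    counts = {}  # key = word, value = count
    for w in words:
        if w not in stoplist:
            counts[w] = counts.get(w, 0) + 1
    return counts

def main():
    # Assumed usage: wordcount.py input.txt output.txt stoplist.txt [min_count]
    in_name, out_name, stop_name = sys.argv[1:4]
    min_count = int(sys.argv[4]) if len(sys.argv) > 4 else 0
    with open(stop_name) as f:
        # The stoplist may be common_words or an empty file.
        stoplist = frozenset(f.read().split())
    with open(in_name) as f:
        counts = count_words(f.read(), stoplist)
    # Sort by descending frequency, then alphabetically.
    items = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    with open(out_name, "w") as out:
        for word, n in items:
            if n > min_count:
                line = f"{word}\t{n}"
                print(line)            # to the screen
                out.write(line + "\n") # and to the output file

# Guard so the script only runs when given its file arguments.
if __name__ == "__main__" and len(sys.argv) > 3:
    main()
```

Sorting on the tuple (-count, word) gets both orderings in one pass: frequency descending, and alphabetical order among ties.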

How to Classify Messages

Classify the message as a flame if:
P(Flame | Message) > P(Not_Flame | Message)

where
P(Flame | Message)  = P(w_1 | Flame) … P(w_n | Flame) P(Flame)
and
P(Not_Flame | Message)  = P(w_1 | Not_Flame) … P(w_n | Not_Flame) P(Not_Flame)

Now each piece is computed by:

P(w_i | Flame) = (N(w_i, Flame) + 1/k) / (N(Flame) + 1)

similarly

P(w_i | Not_Flame) = (N(w_i, Not_Flame) + 1/k) / (N(Not_Flame) + 1)

where N(w_i, Flame) is the number of occurrences of w_i in flames and N(Flame) is the number of flames in the corpus.

P(Flame) is the proportion of flames in the corpus (i.e. total number of flame messages divided by total number of messages).

Otherwise the message is classified as not a flame.
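Putting the pieces above together, a classifier might look like the sketch below. This is only an illustration of the decision rule, not the assigned implementation: the function names (`train`, `classify`), the message representation (each message as a list of words), and the choice k = 2 are all assumptions, and it works in log space (summing logs instead of multiplying probabilities) to avoid numeric underflow on long messages, which the handout does not require.

```python
import math

def train(flames, not_flames, k=2):
    """Collect the counts the classifier needs from two lists of
    tokenized messages (each message is a list of words)."""
    def word_counts(msgs):
        counts = {}
        for msg in msgs:
            for w in msg:
                counts[w] = counts.get(w, 0) + 1
        return counts
    return {
        "flame_counts": word_counts(flames),
        "not_flame_counts": word_counts(not_flames),
        "n_flames": len(flames),
        "n_not_flames": len(not_flames),
        "k": k,
    }

def classify(model, message):
    """Return True if the message is classified as a flame."""
    k = model["k"]
    n_f, n_nf = model["n_flames"], model["n_not_flames"]
    # Priors: proportion of flames / non-flames in the corpus.
    log_flame = math.log(n_f / (n_f + n_nf))
    log_not = math.log(n_nf / (n_f + n_nf))
    for w in message:
        # Dampened estimate: (N(w, class) + 1/k) / (N(class) + 1)
        log_flame += math.log((model["flame_counts"].get(w, 0) + 1 / k) / (n_f + 1))
        log_not += math.log((model["not_flame_counts"].get(w, 0) + 1 / k) / (n_nf + 1))
    # Flame iff P(Flame | Message) > P(Not_Flame | Message).
    return log_flame > log_not
```

Note how the 1/k term in the numerator is the dampening discussed at our October 15th meeting: a word never seen in one class still gets a small nonzero probability, so a single unseen word cannot zero out the whole product.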

I will update this to explain how to train, test, and measure your results.

Flame Corpus
Annotation