Page for Independent Study students
Fall 2009: CS 4980 Section 001
This is a 1-unit independent study course, which is expected to require about
30 hours of work over the semester. The project for this
0. Keep track of your time on a spreadsheet.
1. Read the article:
2. We will meet as a group to discuss the article (hopefully
mid-October). You should come to my office hours or make an appointment
at least every other week to update me on your progress.
3. Implement the Naive Bayes algorithm described in the article, using
Python, to create a natural language classifier.
4. Test your classifier on data that I will provide later in the
5. Write up your results:
I expect approximately 2-3 typed pages
with an explanation of how your classifier works, what the key data
structures and algorithms are and what the limitations of the
We met on October 15th at 11:00
- conditional probability
- conditional independence assumptions
- features and feature extraction
- need for dampening
- classification task and building a classifier
For the next meeting (October 29th at noon):
Write a program in Python to count words in a text document. Your program should:
- open a text file given on the command line to read (input)
- open a text file given on the command line to write (output)
- open a text file given on the command line to read a stoplist
(could be either common_words or an
- possibly take a command line argument to print only
words with more than n occurrences
- remove punctuation (use a regular expression and substitution)
- read the words from the file into a dictionary (key=word,
- write (both to the screen and to the output file) the words
and their counts, sorted by frequency (descending) and alphabetically
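The requirements above can be sketched roughly as follows. This is only one possible layout, not the required solution: the function names `count_words` and `main`, the script name `wordcount.py`, the argument order (input, output, stoplist, optional n), and the choice to lowercase words and keep apostrophes are all assumptions made for illustration.

```python
import re
import sys
from collections import Counter


def count_words(text, stopwords=frozenset(), min_count=0):
    """Return (word, count) pairs sorted by count (descending), then alphabetically."""
    # Replace every character that is not a word character, whitespace,
    # or an apostrophe with a space (assumed definition of "punctuation").
    cleaned = re.sub(r"[^\w\s']", " ", text.lower())
    # Dictionary-like mapping: key = word, value = number of occurrences.
    counts = Counter(w for w in cleaned.split() if w not in stopwords)
    # Keep only words with MORE than min_count occurrences.
    kept = [(w, n) for w, n in counts.items() if n > min_count]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


def main(argv):
    # Assumed usage: wordcount.py INPUT OUTPUT STOPLIST [N]
    if len(argv) < 4:
        print("usage: wordcount.py INPUT OUTPUT STOPLIST [N]", file=sys.stderr)
        return
    in_path, out_path, stop_path = argv[1:4]
    min_count = int(argv[4]) if len(argv) > 4 else 0
    with open(stop_path) as f:
        stopwords = set(f.read().lower().split())
    with open(in_path) as f:
        text = f.read()
    with open(out_path, "w") as out:
        for word, n in count_words(text, stopwords, min_count):
            line = f"{word}\t{n}"
            print(line)             # to the screen
            out.write(line + "\n")  # and to the output file


if __name__ == "__main__":
    main(sys.argv)
```

Keeping the counting in its own function, separate from the file handling, makes it easy to try on a short string before running it on a whole document.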
How to Classify Messages
Classify the message as a flame if:
P(Flame | Message) > P(Not_Flame | Message)
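Dropping the denominator P(Message), which is the same on both sides, and assuming the words are conditionally independent given the class:
P(Flame | Message) ∝ P(w_1 | Flame) … P(w_n | Flame) P(Flame)
P(Not_Flame | Message) ∝ P(w_1 | Not_Flame) … P(w_n | Not_Flame) P(Not_Flame)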
Now each piece is computed by:
P(w_i | Flame) = (N(w_i, Flame) + 1/k) / (N(Flame) + 1)
P(w_i | Not_Flame) = (N(w_i, Not_Flame) + 1/k) / (N(Not_Flame) + 1)
where N(w_i, Flame) is the number of occurrences of w_i in flames and
N(Flame) is the number of flames in the corpus.
P(Flame) is the proportion of flames in the corpus (i.e. total number
of flame messages divided by total number of messages).
Otherwise the message is classified as not a flame.
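A minimal sketch of the rule above in Python, under a few stated assumptions. The names `train`, `log_score`, and `is_flame` are hypothetical, and the default `k=2` is a placeholder, since this page does not fix a value for k. The sketch compares sums of logarithms instead of raw products. That gives the same decision but avoids underflow on long messages. It also assumes the training corpus contains at least one message of each class.

```python
import math
from collections import Counter


def train(messages):
    """messages: list of (list_of_words, label) pairs; label is True for a flame."""
    word_counts = {True: Counter(), False: Counter()}  # N(w_i, class)
    class_counts = {True: 0, False: 0}                 # N(class)
    for words, label in messages:
        word_counts[label].update(words)
        class_counts[label] += 1
    return word_counts, class_counts


def log_score(words, label, word_counts, class_counts, k):
    """Log of P(w_1 | class) ... P(w_n | class) P(class)."""
    total = class_counts[True] + class_counts[False]
    score = math.log(class_counts[label] / total)      # log P(class)
    for w in words:
        # Dampened estimate: (N(w_i, class) + 1/k) / (N(class) + 1)
        p = (word_counts[label][w] + 1.0 / k) / (class_counts[label] + 1)
        score += math.log(p)
    return score


def is_flame(words, word_counts, class_counts, k=2):
    """True if the flame score is strictly greater; otherwise not a flame."""
    flame = log_score(words, True, word_counts, class_counts, k)
    not_flame = log_score(words, False, word_counts, class_counts, k)
    return flame > not_flame
```

For example, after calling `train` on a small hand-labeled list of messages, `is_flame(["idiot"], word_counts, class_counts)` returns True only if the word was seen more often in the flame examples than in the others. The 1/k term keeps a word that never appeared in one class from forcing that class's score to zero.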
I will update this to explain how to
train, test, and measure your results.