Page for Independent Study students
Fall 2009: CS 4980 Section 001
This is a 1-unit independent study course, which is expected to take
about 30 hours of work over the course of the semester. The project for
this course is:
2. We will meet as a group to discuss the article (hopefully
mid-October). You should come to my office hours or make an appointment
at least every other week to update me on your progress.
3. Implement the Naive Bayes algorithm described in the article, using
Python, to create a natural language classifier.
4. Test your classifier on data that I will provide later in the
semester.
5. Write up your results. I expect approximately 2-3 typed pages
explaining how your classifier works, what the key data structures and
algorithms are, and what the limitations of the classifier are.
We met on October 15th at 11:00 am.
Discussed
conditional probability
conditional independence assumptions
features and feature extraction
need for dampening
classification task and building a classifier
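As a reminder of how these topics fit together, here is a rough summary in the usual textbook notation. The symbols (a class C, message words w_1 ... w_n, vocabulary V) are my assumptions and may not match the article's notation, and I am reading "dampening" as add-one smoothing of zero counts, which is only one common choice.

```latex
% Bayes' rule for a class C given the words w_1, ..., w_n of a message
P(C \mid w_1, \dots, w_n) = \frac{P(C)\, P(w_1, \dots, w_n \mid C)}{P(w_1, \dots, w_n)}

% Conditional independence assumption: words are independent given the class
P(w_1, \dots, w_n \mid C) \approx \prod_{i=1}^{n} P(w_i \mid C)

% Add-one smoothing (assumed meaning of "dampening"), so an unseen word
% never forces the whole product to zero
P(w \mid C) \approx \frac{\mathrm{count}(w, C) + 1}{\sum_{w' \in V} \mathrm{count}(w', C) + |V|}
```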
For the next meeting (October 29th at noon)
Write a program in Python to count words in a text document. Your
program should:
open a text file given on the command line to read (input)
open a text file given on the command line to write (output)
open a text file given on the command line to read a stoplist
(could be either common_words or an empty file)
possibly take a command line argument to indicate print only
words with more than n occurrences
remove punctuation (use a regular expression and substitution)
read the words from the file into a dictionary (key=word,
value=count)
write (both to the screen and to the output file) the words
and their counts, sorted by frequency (descending) and alphabetically
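One possible shape for such a program is sketched below. It is a starting point, not the required solution. The function names, the argument order (input, output, stoplist, optional n), and the output format are all assumptions of this sketch. It also assumes words are compared in lowercase and that "sorted by frequency (descending) and alphabetically" means ties in frequency are broken alphabetically.

```python
import re
import sys


def load_stoplist(path):
    """Return the set of lowercase words in the stoplist file (may be empty)."""
    with open(path) as f:
        return set(f.read().lower().split())


def count_words(text, stopwords=frozenset()):
    """Count words in text, ignoring case, punctuation, and stopwords."""
    # Delete every character that is neither a word character nor whitespace.
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    counts = {}  # key = word, value = count
    for word in cleaned.split():
        if word not in stopwords:
            counts[word] = counts.get(word, 0) + 1
    return counts


def format_counts(counts, more_than=0):
    """Return 'word count' lines, most frequent first, ties alphabetical."""
    items = [(w, c) for w, c in counts.items() if c > more_than]
    items.sort(key=lambda wc: (-wc[1], wc[0]))
    return ["%s %d" % (w, c) for w, c in items]


def main(argv):
    # Assumed usage: wordcount.py INPUT OUTPUT STOPLIST [N]
    if len(argv) < 4:
        print("usage: wordcount.py INPUT OUTPUT STOPLIST [N]")
        return
    in_path, out_path, stop_path = argv[1:4]
    more_than = int(argv[4]) if len(argv) > 4 else 0
    with open(in_path) as f:
        counts = count_words(f.read(), load_stoplist(stop_path))
    with open(out_path, "w") as out:
        for line in format_counts(counts, more_than):
            print(line)             # to the screen
            out.write(line + "\n")  # and to the output file


if __name__ == "__main__":
    main(sys.argv)
```

Note that deleting punctuation outright turns "don't" into "dont". Substituting a space instead would split it into two words, so decide which behavior you want and be ready to explain your choice.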
How to Classify Messages
Classify the message as a flame if:
P(Flame | Message) > P(Not_Flame | Message)
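The comparison above can be sketched in Python as follows. This is an illustration under stated assumptions, not the article's exact method: the function names, the labels "flame" and "not_flame", and the use of add-one smoothing are my choices. Both sides of the inequality share the denominator P(Message), so the sketch compares only P(Class) times the product of P(word | Class), and it sums logarithms to avoid underflow on long messages.

```python
import math
from collections import Counter


def train(labeled_messages):
    """Build counts from (list_of_words, label) training pairs."""
    word_counts = {}          # label -> Counter of word occurrences
    class_counts = Counter()  # label -> number of training messages
    for words, label in labeled_messages:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
    vocab = set()
    for counter in word_counts.values():
        vocab.update(counter)
    return word_counts, class_counts, vocab


def log_score(words, label, model):
    """Return log P(label) plus the sum of log P(word | label)."""
    word_counts, class_counts, vocab = model
    total_msgs = sum(class_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / total_msgs)
    for w in words:
        # Add-one smoothing (an assumption of this sketch) keeps a word
        # unseen in this class from driving the probability to zero.
        p = (word_counts[label][w] + 1) / (total_words + len(vocab))
        score += math.log(p)
    return score


def is_flame(words, model):
    """Apply the rule: flame if P(Flame | Message) > P(Not_Flame | Message)."""
    return log_score(words, "flame", model) > log_score(words, "not_flame", model)


# Tiny made-up training set, for illustration only.
training = [
    (["you", "idiot"], "flame"),
    (["stupid", "idiot"], "flame"),
    (["thanks", "for", "helping"], "not_flame"),
    (["you", "are", "helpful"], "not_flame"),
]
model = train(training)
print(is_flame(["idiot"], model))
print(is_flame(["thanks"], model))
```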