Large Data Sets
AHPCRC Grant

Data Mining with Dr. Martin



Flame Recognition Project
Big Corpus
Newsgroup1
Newsgroup2

Creating similarity matrices

README
format_matrix.py
process_format_matrix.py
run_matrix.py

N-Closest
nclosest5.py

http://www.unc.edu/~hhalpin/lsi/

Notes from January 4th meeting
We divided into two groups:

    Classification using Weka - Juan and Roberto
       For next week: come up with a list of possible attributes building on Smokey and the other papers
                Keep in mind the issue of included text in the messages
                Also keep in mind preprocessing issues and how to get it into ".arff" format

    Clustering using LSA and Dr. Carter's algorithms - Jesus and Cameron
       For next week: get the data ready for input to LSA
             Keep id, but remove the rest of the header and included text
             Each message is one document
              Output should be three files (a sketch follows this list):
                   One with ids (index file)
                   One with the documents only, each document separated by a blank line
                   One with ids and documents:
                            id1
                            doc1

                            id2
                            doc2
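
A minimal sketch of this split (illustrative only): it assumes each stripped message is a plain-text file in a messages/ directory whose first line is the id and whose remaining lines are the body; the actual message format, file layout, and output file names are not specified in these notes.

    # split_for_lsa.py -- hypothetical sketch, not project code
    import glob

    with open("ids.txt", "w") as ids_out, \
         open("docs.txt", "w") as docs_out, \
         open("ids_and_docs.txt", "w") as both_out:
        for path in sorted(glob.glob("messages/*.txt")):
            with open(path) as f:
                msg_id = f.readline().strip()      # first line is the id
                body = f.read().strip()            # the rest is the document
            ids_out.write(msg_id + "\n")                      # index file
            docs_out.write(body + "\n\n")                     # documents only, blank-line separated
            both_out.write(msg_id + "\n" + body + "\n\n")     # id followed by its document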

Reminder: please keep track of what you do, so that we can write a procedure manual once we are done.

Papers to read:
Spertus
Pantel and Lin
Altaf
Razavi
Xu
Deerwester et al (LSA)

Data to look at:
Flame Corpus
Annotation

Work Report


Census Data Project
2000 Census Gateway
TIGER: Topologically Integrated Geographic Encoding and Referencing system



IVF Project

Papers to look at:
Can we predict IVF outcomes?, Julie Goodside, Leah Passmore, Lutz Hamel, Liliana Gonzalez, Tali Silberstein, Richard Hackett, David L. Keefe and James R. Trimarchi, Abstract presented at the 2004 First Quarterly Meeting of The New England Fertility Society and The Annual Assembly of the New England Fertility Society (NEFS2004), March 12 – 14, 2004.

Comparing Data Mining and Logistic Regression for Predicting IVF Outcome, J. R. Trimarchi, J. Goodside, L. Passmore, T. Silberstein, L. Hamel, L. Gonzalez, Abstract presented at the 59th Annual Meeting of the American Society for Reproductive Medicine (ASRM 2003), San Antonio, TX, October 11-15, 2003.

Assessing Decision Tree Models for Clinical In-Vitro Fertilization Data, J. R. Trimarchi, J. Goodside, L. Passmore, T. Silberstein, L. Hamel, L. Gonzalez, Technical Report TR03-296, Dept. of Computer Science and Statistics, University of Rhode Island, 2003.

google "preficting implantation outcome from imbalanced ivf dataset"


Data to look at:
Society for Assisted Reproductive Technology
CDC


Month-long Data Mining Introduction

Some tools we will use:
    Python
    Awk
    Excel
    Weka

Assignments are from KDnuggets

Preliminary materials
     slides for 8/17/10
     slides for 8/18/10
     slides for 8/19/10


Some Python Examples
    counts.py
    process_counts.py
    fixnewlines.py
    pull_ids.py

Some awk examples
    ls -l | awk '$5 == 0 {print $0}' 
       this finds empty files in a directory

    awk -f awk_test/fmt print3 > print4
        (print3 is the input file, print4 is the output file)
        reformats lines to 75 characters

    awk '$4 ~ /SS-A-.*/ && $2 ~ /DF-A-.*/ {print $0}' Tm05M04codes.comb
    awk '$4 ~ /SS-A[-]*.*/ && $2 ~ /DF-A[-]*.*/ {print $0}' Tm05M04codes.comb
    awk '$2 ~ /DF-.*E\^.*-*.*/  {print $0}' Tm05M04codes.comb
       some matching examples
awk link

Week 1
   Assignment 1
   Assignment 2
   Read Course Notes

Week 2
    Assignment 3

Week 3
    Continue with Assn 3

Week 4
    Some code from Jonathan Ebie

tabToComma.py
Replaces all tabs in a file with commas. The first argument is the tab-separated input file; the second is the file where the comma-separated values are stored.
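
A minimal sketch of what such a script might look like (illustrative only, not Jonathan Ebie's code; it assumes the input and output file names are the two command-line arguments):

    # tab_to_comma_sketch.py -- illustrative only
    import sys

    infile, outfile = sys.argv[1], sys.argv[2]       # tab-separated input, comma-separated output
    with open(infile) as src, open(outfile, "w") as dst:
        for line in src:
            dst.write(line.replace("\t", ","))       # replace every tab with a comma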

Normalize.py
Normalizes the values in a file to a preset minimum and maximum. The first argument is the file with the values to be normalized; the second is the file where the normalized values are stored.
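
A sketch of the min-max scaling involved (again illustrative, not the original script; it assumes one numeric value per line and a target range of 0 to 1):

    # normalize_sketch.py -- illustrative only
    import sys

    NEW_MIN, NEW_MAX = 0.0, 1.0                      # preset target range (assumed here)

    infile, outfile = sys.argv[1], sys.argv[2]
    values = [float(line) for line in open(infile) if line.strip()]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0                          # guard against all-equal values
    with open(outfile, "w") as dst:
        for v in values:
            dst.write("%g\n" % (NEW_MIN + (v - lo) * (NEW_MAX - NEW_MIN) / span))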

Invert.py
Transposes a file from genes in rows to genes in columns (or from genes in columns to genes in rows). The first argument is the input file with genes in row format; the second is the file where the genes are stored in column format.
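
A transpose sketch (illustrative; assumes a tab-separated file small enough to fit in memory):

    # invert_sketch.py -- illustrative only
    import sys

    infile, outfile = sys.argv[1], sys.argv[2]
    rows = [line.rstrip("\n").split("\t") for line in open(infile)]
    with open(outfile, "w") as dst:
        for column in zip(*rows):                    # zip(*rows) swaps rows and columns
            dst.write("\t".join(column) + "\n")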

addClassField.py
Adds the class field name and values to a file containing genes in column format. The first argument is the file with genes in column format; the second is the file with the class field name and values in column format; the third is the output file that stores the result of appending the class field to the genes in column format.
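
A sketch of the column append (illustrative; assumes both input files are tab-separated, have one row per sample, and list the samples in the same order):

    # add_class_field_sketch.py -- illustrative only
    import sys

    genes_file, class_file, outfile = sys.argv[1], sys.argv[2], sys.argv[3]
    with open(genes_file) as genes, open(class_file) as classes, open(outfile, "w") as dst:
        for gene_row, class_value in zip(genes, classes):
            # paste the class column onto the end of each gene row
            dst.write(gene_row.rstrip("\n") + "\t" + class_value.strip() + "\n")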




Page for Independent Study students Fall 2009: CS 4980 Section 001

This is a 1-unit independent study course, which is expected to be about 30 hours of work over the course of the semester. The project for this course is:

0. Keep track of your time on a spreadsheet.

1. Read the article:
Patrick Pantel and Dekang Lin. 1998. SpamCop – A Spam Classification & Organization Program. In Proceedings of AAAI Workshop on Learning for Text Categorization. pp. 95-98. Madison, Wisconsin.

2. We will meet as a group to discuss the article (hopefully mid-October). You should come to my office hours or make an appointment at least every other week to update me on your progress.

3. Implement the Naive Bayes algorithm described in the article, using Python, to create a natural language classifier.

4. Test your classifier on data that I will provide later in the semester.

5. Write up your results:
       I expect approximately 2-3 typed pages with an explanation of how your classifier works, what the key data structures and algorithms are and what the limitations of the classifier are.


We met on October 15th at 11:00 am.
Discussed

For the next meeting (October 29th at noon)
Write a program in Python to count words in a text document. Your program should:
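
As a starting point, a minimal word counter might look like the sketch below (the specific requirements are not listed in these notes; this version assumes simple whitespace tokenization and lowercasing):

    # count_words_sketch.py -- illustrative starting point only
    import sys

    counts = {}
    with open(sys.argv[1]) as f:
        for line in f:
            for word in line.lower().split():        # naive whitespace tokenization
                counts[word] = counts.get(word, 0) + 1

    # report words from most to least frequent
    for word, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        print(word, n)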


How to Classify Messages

Classify the message as a flame if:
P(Flame | Message) > P(Not_Flame | Message)

where
P(Flame | Message)  = P(w_1 | Flame) … P(w_n | Flame) P(Flame)
and
P(Not_Flame | Message)  = P(w_1 | Not_Flame) … P(w_n | Not_Flame) P(Not_Flame)

(Bayes' rule also divides each side by P(Message), but that factor is the same for both and cancels in the comparison, so it is dropped here.)

Now each piece is computed by:

P(w_i | Flame) = ( N(w_i, Flame) + 1/k ) / ( N(Flame) + 1 )

similarly

P(w_i | Not_Flame) = ( N(w_i, Not_Flame) + 1/k ) / ( N(Not_Flame) + 1 )

where N(w_i, Flame) is the number of occurrences of w_i in flame messages and N(Flame) is the number of flame messages in the corpus.

P(Flame) is the proportion of flames in the corpus (i.e., the total number of flame messages divided by the total number of messages); P(Not_Flame) is defined the same way.

Otherwise the message is classified as not a flame.
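
A minimal sketch of this rule in Python (not a reference implementation: it assumes the word counts have already been collected into dictionaries, treats k as a tunable constant since its value is not pinned down above, and compares log-probabilities, which is equivalent to comparing the products but avoids underflow):

    # flame_nb_sketch.py -- illustrative sketch of the rule above
    import math

    def log_score(words, word_counts, n_class, prior, k):
        # sum of log P(w_i | class) plus log P(class), using
        # P(w_i | class) = (N(w_i, class) + 1/k) / (N(class) + 1)
        score = math.log(prior)
        for w in words:
            score += math.log((word_counts.get(w, 0) + 1.0 / k) / (n_class + 1))
        return score

    def classify(message, flame_counts, not_flame_counts, n_flame, n_not_flame, k):
        # return "flame" if P(Flame | Message) > P(Not_Flame | Message)
        words = message.lower().split()              # naive tokenization
        total = float(n_flame + n_not_flame)
        flame_score = log_score(words, flame_counts, n_flame, n_flame / total, k)
        not_flame_score = log_score(words, not_flame_counts, n_not_flame, n_not_flame / total, k)
        return "flame" if flame_score > not_flame_score else "not flame"

Here flame_counts and not_flame_counts map each word to its number of occurrences in the flame and non-flame parts of the training corpus, and n_flame and n_not_flame are the corresponding message counts.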

I will update this to explain how to train, test, and measure your results.


Flame Corpus
Annotation