Large Data Sets
AHPCRC Grant

Data Mining with Dr. Martin



Flame Recognition Project
Big Corpus
Newsgroup1
Newsgroup2

Creating similarity matrices

README
format_matrix.py
process_format_matrix.py
run_matrix.py

N-Closest
nclosest5.py

http://www.unc.edu/~hhalpin/lsi/

Notes from January 4th meeting
We divided into two groups:

    Classification using Weka - Juan and Roberto
       For next week: come up with a list of possible attributes building on Smokey and the other papers
                Keep in mind the issue of included text in the messages
                Also keep in mind preprocessing issues and how to get it into ".arff" format

    Clustering using LSA and Dr. Carter's algorithms - Jesus and Cameron
       For next week: get the data ready for input to LSA
             Keep id, but remove the rest of the header and included text
             Each message is one document
              Output should be three files (a sketch follows this list):
                   One with ids (index file)
                   One with the documents only, each document separated by a blank line
                   One with ids and documents:
                            id1
                            doc1

                            id2
                            doc2
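
A minimal sketch of this split (illustrative only): it assumes each stripped message is a plain-text file in a messages/ directory whose first line is the id and whose remaining lines are the body; the actual message format, file layout, and output file names are not specified in these notes.

    # split_for_lsa.py -- hypothetical sketch, not project code
    import glob

    with open("ids.txt", "w") as ids_out, \
         open("docs.txt", "w") as docs_out, \
         open("ids_and_docs.txt", "w") as both_out:
        for path in sorted(glob.glob("messages/*.txt")):
            with open(path) as f:
                msg_id = f.readline().strip()      # first line is the id
                body = f.read().strip()            # the rest is the document
            ids_out.write(msg_id + "\n")                      # index file
            docs_out.write(body + "\n\n")                     # documents only, blank-line separated
            both_out.write(msg_id + "\n" + body + "\n\n")     # id followed by its document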

Reminder: please keep track of what you do, so that we can write a procedure manual once we are done.

Papers to read:
Spertus
Pantel and Lin
Altaf
Razavi
Xu
Deerwester et al (LSA)

Data to look at:
Flame Corpus
Annotation

Work Report


Census Data Project
2000 Census Gateway
TIGER: Topologically Integrated Geographic Encoding and Referencing system



IVF Project

Papers to look at:
Can we predict IVF outcomes?, Julie Goodside, Leah Passmore, Lutz Hamel, Liliana Gonzalez, Tali Silberstein, Richard Hackett, David L. Keefe and James R. Trimarchi, Abstract presented at the 2004 First Quarterly Meeting of The New England Fertility Society and The Annual Assembly of the New England Fertility Society (NEFS2004), March 12 – 14, 2004.

Comparing Data Mining and Logistic Regression for Predicting IVF Outcome, J. R. Trimarchi, J. Goodside, L. Passmore, T. Silberstein, L. Hamel, L. Gonzalez, Abstract presented at the 59th Annual Meeting of the American Society for Reproductive Medicine (ASRM 2003), San Antonio, TX, October 11-15, 2003.

Assessing Decision Tree Models for Clinical In-Vitro Fertilization Data, J. R. Trimarchi, J. Goodside, L. Passmore, T. Silberstein, L. Hamel, L. Gonzalez, Technical Report TR03-296, Dept. of Computer Science and Statistics, University of Rhode Island, 2003.

google "preficting implantation outcome from imbalanced ivf dataset"


Data to look at:
Society for Assisted Reproductive Technology
CDC


Month-long Data Mining Introduction

Some tools we will use:
    Python
    Awk
    Excel
    Weka

Assignments are from KDnuggets

Preliminary materials
     slides for 8/17/10
     slides for 8/18/10
     slides for 8/19/10


Some Python Examples
    counts.py
    process_counts.py
    fixnewlines.py
    pull_ids.py

Some awk examples
    ls -l | awk '$5 == 0 {print $0}' 
       this finds empty files in a directory

    awk -f awk_test/fmt print3 > print4
        (print3 is the input file, print4 is the output file)
        reformats lines to 75 characters

    awk '$4 ~ /SS-A-.*/ && $2 ~ /DF-A-.*/ {print $0}' Tm05M04codes.comb
    awk '$4 ~ /SS-A[-]*.*/ && $2 ~ /DF-A[-]*.*/ {print $0}' Tm05M04codes.comb
    awk '$2 ~ /DF-.*E\^.*-*.*/  {print $0}' Tm05M04codes.comb
       some matching examples
awk link

Week 1
   Assignment 1
   Assignment 2
   Read Course Notes

Week 2
    Assignment 3

Week 3
    Continue with Assn 3

Week 4
    Some code from Jonathan Ebie

tabToComma.py
Replaces all tabs in a file with commas. The first argument is the tab-separated input file; the second is the file where the comma-separated values are stored.
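
A minimal sketch of what such a script might look like (illustrative only, not Jonathan Ebie's code; it assumes the input and output file names are the two command-line arguments):

    # tab_to_comma_sketch.py -- illustrative only
    import sys

    infile, outfile = sys.argv[1], sys.argv[2]       # tab-separated input, comma-separated output
    with open(infile) as src, open(outfile, "w") as dst:
        for line in src:
            dst.write(line.replace("\t", ","))       # replace every tab with a comma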

Normalize.py
Normalizes the values in a file to a preset minimum and maximum. The first argument is the file with the values to be normalized; the second is the file where the normalized values are stored.
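
A sketch of the min-max scaling involved (again illustrative, not the original script; it assumes one numeric value per line and a target range of 0 to 1):

    # normalize_sketch.py -- illustrative only
    import sys

    NEW_MIN, NEW_MAX = 0.0, 1.0                      # preset target range (assumed here)

    infile, outfile = sys.argv[1], sys.argv[2]
    values = [float(line) for line in open(infile) if line.strip()]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0                          # guard against all-equal values
    with open(outfile, "w") as dst:
        for v in values:
            dst.write("%g\n" % (NEW_MIN + (v - lo) * (NEW_MAX - NEW_MIN) / span))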

Invert.py
Transposes a file from genes in rows to genes in columns (or from genes in columns to genes in rows). The first argument is the input file with genes in row format; the second is the file where the genes are stored in column format.
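
A transpose sketch (illustrative; assumes a tab-separated file small enough to fit in memory):

    # invert_sketch.py -- illustrative only
    import sys

    infile, outfile = sys.argv[1], sys.argv[2]
    rows = [line.rstrip("\n").split("\t") for line in open(infile)]
    with open(outfile, "w") as dst:
        for column in zip(*rows):                    # zip(*rows) swaps rows and columns
            dst.write("\t".join(column) + "\n")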

addClassField.py
Adds the class field name and values to a file containing genes in column format. The first argument is the file with genes in column format; the second is the file with the class field name and values in column format; the third is the output file that stores the result of appending the class field to the genes in column format.
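
A sketch of the column append (illustrative; assumes both input files are tab-separated, have one row per sample, and list the samples in the same order):

    # add_class_field_sketch.py -- illustrative only
    import sys

    genes_file, class_file, outfile = sys.argv[1], sys.argv[2], sys.argv[3]
    with open(genes_file) as genes, open(class_file) as classes, open(outfile, "w") as dst:
        for gene_row, class_value in zip(genes, classes):
            # paste the class column onto the end of each gene row
            dst.write(gene_row.rstrip("\n") + "\t" + class_value.strip() + "\n")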




Page for Independent Study students Fall 2009: CS 4980 Section 001

This is a 1-unit independent study course, which is expected to be about 30 hours of work over the course of the semester. The project for this course is:

0. Keep track of your time on a spreadsheet.

1. Read the article:
Patrick Pantel and Dekang Lin. 1998. SpamCop – A Spam Classification & Organization Program. In Proceedings of AAAI Workshop on Learning for Text Categorization. pp. 95-98. Madison, Wisconsin.

2. We will meet as a group to discuss the article (hopefully mid-October). You should come to my office hours or make an appointment at least every other week to update me on your progress.

3. Implement the Naive Bayes algorithm described in the article, using Python, to create a natural language classifier.

4. Test your classifier on data that I will provide later in the semester.

5. Write up your results:
       I expect approximately 2-3 typed pages with an explanation of how your classifier works, what the key data structures and algorithms are and what the limitations of the classifier are.


We met on October 15th at 11:00 am.
Discussed

For the next meeting (October 29th at noon)
Write a program in Python to count words in a text document. Your program should:
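
As a starting point, a minimal word counter might look like the sketch below (the specific requirements are not listed in these notes; this version assumes simple whitespace tokenization and lowercasing):

    # count_words_sketch.py -- illustrative starting point only
    import sys

    counts = {}
    with open(sys.argv[1]) as f:
        for line in f:
            for word in line.lower().split():        # naive whitespace tokenization
                counts[word] = counts.get(word, 0) + 1

    # report words from most to least frequent
    for word, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        print(word, n)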


How to Classify Messages

Classify the message as a flame if:
P(Flame | Message) > P(Not_Flame | Message)

where
P(Flame | Message)  = P(w_1 | Flame) … P(w_n | Flame) P(Flame)
and
P(Not_Flame | Message)  = P(w_1 | Not_Flame) … P(w_n | Not_Flame) P(Not_Flame)

(Bayes' rule also divides each side by P(Message), but that factor is the same for both and cancels in the comparison, so it is dropped here.)

Now each piece is computed by:

P(w_i | Flame) = ( N(w_i, Flame) + 1/k ) / ( N(Flame) + 1 )

similarly

P(w_i | Not_Flame) = ( N(w_i, Not_Flame) + 1/k ) / ( N(Not_Flame) + 1 )

where N(w_i, Flame) is the number of occurrences of w_i in flame messages and N(Flame) is the number of flame messages in the corpus.

P(Flame) is the proportion of flames in the corpus (i.e., the total number of flame messages divided by the total number of messages); P(Not_Flame) is defined the same way.

Otherwise the message is classified as not a flame.
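
A minimal sketch of this rule in Python (not a reference implementation: it assumes the word counts have already been collected into dictionaries, treats k as a tunable constant since its value is not pinned down above, and compares log-probabilities, which is equivalent to comparing the products but avoids underflow):

    # flame_nb_sketch.py -- illustrative sketch of the rule above
    import math

    def log_score(words, word_counts, n_class, prior, k):
        # sum of log P(w_i | class) plus log P(class), using
        # P(w_i | class) = (N(w_i, class) + 1/k) / (N(class) + 1)
        score = math.log(prior)
        for w in words:
            score += math.log((word_counts.get(w, 0) + 1.0 / k) / (n_class + 1))
        return score

    def classify(message, flame_counts, not_flame_counts, n_flame, n_not_flame, k):
        # return "flame" if P(Flame | Message) > P(Not_Flame | Message)
        words = message.lower().split()              # naive tokenization
        total = float(n_flame + n_not_flame)
        flame_score = log_score(words, flame_counts, n_flame, n_flame / total, k)
        not_flame_score = log_score(words, not_flame_counts, n_not_flame, n_not_flame / total, k)
        return "flame" if flame_score > not_flame_score else "not flame"

Here flame_counts and not_flame_counts map each word to its number of occurrences in the flame and non-flame parts of the training corpus, and n_flame and n_not_flame are the corresponding message counts.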

I will update this to explain how to train, test, and measure your results.


Flame Corpus
Annotation