LargeDataSets WekaAnalysis 2012 13

From Wiki
Revision as of 12:30, 13 November 2012 by MThomas (Talk | contribs) (10/30/12)

Jump to: navigation, search

WPST data analysis group, Fall 2012

Record here: SQL queries, and the create tables statements used to create intermediate tables.


SELECT DISTINCT w.id FROM wpst_data w WHERE w.id NOT IN (SELECT DISTINCT c.id FROM cyerss c)


This query should create a table with only the data from students appear in wpst and also in the cyerss data.

** Create Table InfoWeHave SELECT DISTINCT w.* FROM wpst_data w WHERE w.id IN (SELECT DISTINCT c.id FROM cyerss c) 

This query should join all three tables into one table. Needs some fixes since it provides duplicates of some of the tuples more than one time. For example for student #20 you may notice that there is duplicates for the rows that contain full data and no null fields. This query should work it only outputs less than 90 records with null values, but they can be explain. e.g. id #20 took the test while not being enroll in school and also as a graduate which we do not have record for that.

** create table  WekaFiles SELECT distinct 
W.ID AS identification, c.*, f.year, f.date, f.efc,  f.academic, f.low_income, w.esl, w.db, w.score, w.level, w.lang, w.fresh,
 w.day_taken, w.loc_taken, w.year_taken,   w.spec  
FROM infowehave w 
LEFT JOIN fin_aid f ON w.ID = f.ID AND w.year_taken = f.year
LEFT JOIN CYERSS c ON c.ID=w.ID   and c.schoolyear = w.year_taken

11/13/12

Decisions made about data sets:

Ethicity tags inconsistent from year to year -- ignore/remove and use first-language information (ESL) instead. better predictor

Put in "M"s where data is missing, and 0's for missing numeric data

Keep IDs in CSV files, but "remove" when load into Weka so Weka won't try to classify based on ID. Make sure score is also "remove"d.

Assignment for all of us:

Play around with decision trees (J48) and Naive Bayes, predicting score (pass/fail) (decision trees first, because visualization of "important" is easier)

Remember we want the kappa to be as close to 1 as possible (better than .4 or .5 is ok, lower than that is close rolling dice to predict pass/fail)

10/30/12

Note: ** This query is the statement to create a table and store the values in the table. This should generate the "most trustworthy" (and smallest) data set, where IDs and semesters match up. About 1,600 tuples.

Emmanuel's Query

SELECT * FROM wpst_data2 w, CYERSS c, distFinAid f WHERE w.id=f.id AND c.id=w.id AND c.Semester = w.term

Now we need a query to generate the second most trustworthy set, where IDs match up from all three tables, but financial aid data may not be available (outer join??) and semester information matches only "in the same year", not necessarily in the exact same semester.

To Do:

0) Email most trustworthy data set to Dr Martin and Dr Thomas

1) Put the 'most trustworthy' data set into Weka

2) Translate actual WPST scores into pass/not pass binary. (7 and up passes, 6 and down not pass)

3) Try predicting pass/not pass using the rest of the columns (or a subset), using Naive Bayes and decision trees.

4) Continue working, in parallel, on perfecting the query to generate the second-most-trustworthy data set.


Second Query????

SELECT  c.*, f.year, f.date, f.efc,  f.academic, f.low_income, w.esl, w.db, w.score, w.level, w.lang, w.fresh,
w.day_taken, w.loc_taken, w.year_taken,   w.spec  
FROM infowehave w 
LEFT JOIN fin_aid f ON w.ID = f.ID AND w.year_taken = f.year
LEFT JOIN CYERSS c ON c.ID=w.ID and c.schoolyear = w.year_taken where c.semester = w.term

10/23/12

To do: collapse fin_aid table so one row per student (either first for each student, or last for each student) so that can use 'left join' in create wekafiles on just ID, without 'and year=year', and won't lose as many students from resulting table.--->This has also been created by using GROUP BY to re-create a table with just the id appearing once and not multiple times in the list.

update "Spring II 20xx" terms in data to be simply "Spring 20xx". Fix "W/ Spring" semesters, too. (GIGO! GIGO!) --->This has been fixed, "W/Spring 20xx" has been changed into "Spring 20xx" since the test occurred on February. Also Spring II 20xx has been modified to Spring 20xx

Drop graduate students and students who never enrolled at CSUS from data set.

11/11/12

On Hopper you will find that most of the empty cells now contains a 'M' for missing. Also colleges have been change to the following:

CBA - College of Business Administration
CHHS - College of Human and Health Sciences
CHSS - College of Humanities and Social Sciences
CNS - College of Natural Sciences
COA - College of the Arts
COE - College of Education

The remaining colleges have been kept the same, the Interdisciplinary program and undeclared