LargeDataSets WekaAnalysis 2012 13
- 1 WPST data analysis group, Fall 2012
- 2 February 22, 2012
- 3 Updates to Table
- 4 Queries to find missing information
- 5 Facts about data files that need to be preserved
- 6 SQL queries for generating tables and intermediate tables
- 7 Dec 4
- 8 November 27, 2012
- 9 11/20/12
- 10 11/13/12
- 11 10/30/12
- 12 10/23/12
- 13 11/11/12
WPST data analysis group, Fall 2012
February 22, 2012
The following updates were done on the table by the name of adjustedTable:
Accounting & Finance --> Accounting and Finance
update AdjustedTable` set department = "Accounting and Finance" where department = "Accounting & Finance"
Anthropology & Geography --> Anthropology and Geography
update AdjustedTable set department = "Anthropology and Geography" where department = "Anthropology & Geography"
Ethnic & Gender Studies --> Ethnic and Gender Studies
update AdjustedTable set department = "Ethnic and Gender Studies" where department = "Ethnic & Gender Studies"
Sociology & Gerontology --> Sociology and Gerontology
update AdjustedTable set department = "Sociology and Gerontology" where department = "Sociology & Gerontology"
Politics & Public Administration --> Politics and Public Administration
UPDATE ADJUSTEDTABLE SET department = 'Politics and Public Administration' where department = 'Politics & Public Administration'
Undeclared Undergrad - Undeclared Undergraduate
update AdjustedTable set major = "Undeclared Undergraduate" where major = "Undeclared Undergrad"
Had problems when fixing the department name of Philosophy and Modern Language, the word Language was misspelled.
UPDATE `adjustedTable` SET `Department` = 'Philosophy and Modern Languages' WHERE `adjustedTable`.`department` = 'Philosophy and Modern Langauages'
I also moved the row/column by the name of PassOrFail to the end. Also added a column/row with the name of EFC2 to the table as well. EFC2 was composed with the following queries.
*UPDATE adjustedTable SET EFC2 = EFC; *UPDATE adjustedTable SET EFC2 = 24000 WHERE EFC2 > 24000;
Updates to Table
The updates being mention below this comment has been done in the database. Also the following update statements have been used to fix any errors that came with the data previously.
update finAid_wpst_cyerss set loc_taken = 'Turlock' where loc_taken = 'Turlock00*******' update finAid_wpst_cyerss set loc_taken = 'Stockton' where loc_taken = ' Stockton' update finAid_wpst_cyerss set loc_taken = 'Turlock' where loc_taken = ' 'update finAid_wpst_cyerss set day_taken = '30-Apr' where day_taken = '30-Apr-11' update finaid_wpst_cyerss set spec = 'M' where spec = '\r'
Queries to find missing information
CREATE TABLE remaining_missing_finAid_wpst SELECT * FROM missing_finAid_wpstData f WHERE term NOT IN (SELECT fc.term_semester FROM missing_finAid_wpst_cyerss fc WHERE f.ID = fc.ID)
took while enroll
create table missing2_finAid_wpstData SELECT c.*, r.esl, r.term as term_semester, r.db, r.score, r.level, r.lang, r.fresh, r.day_taken, r.loc_taken, r.year_taken, r.spec FROM distCYERSS c, remaining_missing_finAid_wpst r WHERE c.ID = r.ID
This is obtained from the 635, not every person took the test during the time they were enrolled in the semester. 293 enrolled, 342 not enrolled
Facts about data files that need to be preserved
Values in Transfer units earned column are high because a decimal point has been dropped. Where a value appears to be 1117, it is actually 111.7.
Comment: on languages. Several students are recorded as having "Filipino" as their language; their real language is probably Tagalog. However, Tagalog and Ilocano are both languages of the Philippines; we may at some point wind up changing all students whose first language is from that country to the same value.
The same issue shows up for Chinese / Mandarin / Cantonese.
Changes to make:
There are errors in specific rows of the data that must be fixed.
Anonymous ID 3004 was a Sophomore in Spring 2010
Anonymous ID 73 was a Senior in Spring 2010
Anonymous ID 9256 was a Senior in Fall 2010
Anonymous ID 9883 was a Senior in Spring 2011
Anonymous ID 11664 was a Junior in Spring 2011 (and is recorded as a freshman in Spring 2011, as well as a "Select One", so two rows of data need to be fixed)
Exam location column has problems. Rows with " Stockton" as location should be changed to "Stockton". Rows with no location specified should be filled in with "Turlock". And there is one row where a random real student ID has been attached to the end of "Turlock" as the location. (Look for Turlock00)
SQL queries for generating tables and intermediate tables
Record here: SQL queries, and the create tables statements used to create intermediate tables.
SELECT DISTINCT w.id FROM wpst_data w WHERE w.id NOT IN (SELECT DISTINCT c.id FROM cyerss c)
This query should create a table with only the data from students appear in wpst and also in the cyerss data.
** Create Table InfoWeHave SELECT DISTINCT w.* FROM wpst_data w WHERE w.id IN (SELECT DISTINCT c.id FROM cyerss c)
This query should join all three tables into one table. Needs some fixes since it provides duplicates of some of the tuples more than one time. For example for student #20 you may notice that there is duplicates for the rows that contain full data and no null fields. This query should work it only outputs less than 90 records with null values, but they can be explain. e.g. id #20 took the test while not being enroll in school and also as a graduate which we do not have record for that.
** create table WekaFiles SELECT distinct W.ID AS identification, c.*, f.year, f.date, f.efc, f.academic, f.low_income, w.esl, w.db, w.score, w.level, w.lang, w.fresh, w.day_taken, w.loc_taken, w.year_taken, w.spec FROM infowehave w LEFT JOIN fin_aid f ON w.ID = f.ID AND w.year_taken = f.year LEFT JOIN CYERSS c ON c.ID=w.ID and c.schoolyear = w.year_taken
The data that contains an 'M' was update with a simple UPDATE table_name SET col_name = 'M'. This was done because of the missing information that it was not provided to us.
I have also updated the column with the language students speak; it was updated with similar query statement the only differences is that it is UPDATE table_name SET col_name = LOWER(col_name)
November 27, 2012
When exporting a csv file from the database, click on the section that will remove the CRLF
WP faculty group meets 12/3. Would be GREAT to have a clean WP data set, with documented assumptions, to give to the English professors then.
Decisions made about data sets:
Ethnicity tags inconsistent from year to year -- ignore/remove and use first-language information (ESL) instead. better predictor
Put in "M"s where data is missing, and 0's for missing numeric data
Keep IDs in CSV files, but "remove" when load into Weka so Weka won't try to classify based on ID. Make sure score is also "remove"d.
Assignment for all of us:
Play around with decision trees (J48) and Naive Bayes, predicting score (pass/fail) (decision trees first, because visualization of "important" is easier)
Remember we want the kappa to be as close to 1 as possible (better than .4 or .5 is ok, lower than that is close rolling dice to predict pass/fail)
Note: ** This query is the statement to create a table and store the values in the table. This should generate the "most trustworthy" (and smallest) data set, where IDs and semesters match up. About 1,600 tuples.
SELECT * FROM wpst_data2 w, CYERSS c, distFinAid f WHERE w.id=f.id AND c.id=w.id AND c.Semester = w.term
Now we need a query to generate the second most trustworthy set, where IDs match up from all three tables, but financial aid data may not be available (outer join??) and semester information matches only "in the same year", not necessarily in the exact same semester.
0) Email most trustworthy data set to Dr Martin and Dr Thomas
1) Put the 'most trustworthy' data set into Weka
2) Translate actual WPST scores into pass/not pass binary. (7 and up passes, 6 and down not pass)
3) Try predicting pass/not pass using the rest of the columns (or a subset), using Naive Bayes and decision trees.
4) Continue working, in parallel, on perfecting the query to generate the second-most-trustworthy data set.
SELECT c.*, f.year, f.date, f.efc, f.academic, f.low_income, w.esl, w.db, w.score, w.level, w.lang, w.fresh, w.day_taken, w.loc_taken, w.year_taken, w.spec FROM infowehave w LEFT JOIN fin_aid f ON w.ID = f.ID AND w.year_taken = f.year LEFT JOIN CYERSS c ON c.ID=w.ID and c.schoolyear = w.year_taken where c.semester = w.term
To do: collapse fin_aid table so one row per student (either first for each student, or last for each student) so that can use 'left join' in create wekafiles on just ID, without 'and year=year', and won't lose as many students from resulting table.--->This has also been created by using GROUP BY to re-create a table with just the id appearing once and not multiple times in the list.
update "Spring II 20xx" terms in data to be simply "Spring 20xx". Fix "W/ Spring" semesters, too. (GIGO! GIGO!) --->This has been fixed, "W/Spring 20xx" has been changed into "Spring 20xx" since the test occurred on February. Also Spring II 20xx has been modified to Spring 20xx
Drop graduate students and students who never enrolled at CSUS from data set.
On Hopper you will find that most of the empty cells now contains a 'M' for missing. Also colleges have been change to the following:
CBA - College of Business Administration CHHS - College of Human and Health Sciences CHSS - College of Humanities and Social Sciences CNS - College of Natural Sciences COA - College of the Arts COE - College of Education
The remaining colleges have been kept the same, the Interdisciplinary program and undeclared