Talk:LargeDataSets WekaAnalysis 2012 13

From Wiki
Jump to: navigation, search

Documentation

Documentation of how we ended up with data in finAid_wpst_cyerss table 12/02/12 - Emmanuel

1. How we ended up with a table (finAid_wpst_cyerss) that contained data from the three initial tables we started with, namely: CYERSS, fin_Aid and wpst_data

CYERSS (50,525 rows): Enrollment data for the years Fall 2009 – Fall 2010

fin_Aid (32,003 rows): Financial Aid data for the academic years 2008/09 – 2010/11

wpst_data (4,118 rows): wpst data for the years Fall 2009 – Fall 2010


2. The first table we created was distFinAid (20,379 rows), which does not have repeated data for each ID. Only distinct IDs using the query:

//query CREATE TABLE distFinAid (SELECT * FROM `fin_Aid` GROUP BY ID);

We created this table because in our analysis we are only interested in determining if a student qualified for Financial Aid and whether his or her low income status can be an indication of whether he or she will pass the wpst.


3. Next we created the table finAid_wpst (3,402 rows) using the following query, which connects the data in the wpst_data table and the data in the distFinAid by joining the IDs in both tables.

//query CREATE TABLE finAid_wpst (SELECT * FROM distFinAid f, wpst_data w WHERE f.ID = w.ID);


We created this table because we are interested in only analyzing the wpst pass rate for students that were enrolled in the university and have ever qualified for financial aid.


4. Next we created the table infoWeHave_finAid_wpst_cyerss (1,605) using the following query, which connects the data in the finAid_wpst table and the data in the CYERSS by joining the IDs in three tables and the Semester COLUMN in CYERSS = Term in finAid_wpst.

//query

CREATE TABLE infoWeHave_finAid_wpst_cyerss (SELECT * FROM finAid_wpst f, CYERSS c WHERE f.ID = c.ID AND f.term = c.Semester);



5. Next we got the remaining data (1797 rows) that was lost in the infoWeHave_finAid_wpst_cyerss (1,605) table and stored them in a table called missing_finAid_wpst_IN_cyerss (1797 rows) using the following query:

(Note 1797 + 1605 = 3402 which is the number of rows in the finAid_wpst table)

//query equivalent to finAid_wpst – finAid_wpst_cyerss

CREATE TABLE missing_finAid_wpst_IN_cyerss (SELECT * FROM finAid_wpst f WHERE f.term NOT IN (SELECT fc.Semester FROM finAid_wpst_cyerss fc WHERE f.ID = fc.ID));

//however we lose 46 rows from the 1797 because some students that took the wpst never enrolled in CSUSTAN. Therefore we end up with 1751 records.



6. Next we got the distinct rows from the CYERSS table using the following query

//query CREATE TABLE distCYERSS (SELECT * FROM `CYERSS` GROUP BY ID);


Then used this table to match the enrollment data with the remaining data (1751 rows) that was lost in the infoWeHave_finAid_wpst_cyerss (1,605) table because of students who took the wpst during a Semester, which they were not enrolled in the school. For example in the Summer or Winter Semester.

We first run the following query to get the remaining data (1751) then export this data into a csv file and delete one of the duplicate ID columns. Then upload the data into a table called remaining_finAid_wpst_cyerss (1,751 rows).

//query SELECT * FROM distCyerss c, missing_finAid_wpst_IN_cyerss f WHERE f.ID=c.ID ORDER BY c.ID DESC

Export to a csv file called remaining_finAid_wpst_cyerss.csv

Copy structure only of table finAid_wpst_cyerss and create a new table called remaining_finAid_wpst_cyerss.

Import data from remaining_finAid_wpst_cyerss.csv into table remaining_finAid_wpst_cyerss.


7. Finally join the infoWeHave_finAid_wpst_cyerss (1,605) and remaining_finAid_wpst_cyerss (1,751) to form the final table finAid_wpst_cyerss (3,356)