LargeDataSets DreamTeam 2012

From Wiki
Revision as of 15:22, 15 May 2012 by MThomas (Talk | contribs) (5/14/2012)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Spring 2012 Dream Team Wiki

List of data sets in the order Dr Sanchez cares about them:

student and school data files Cal Ed school data - overall page with all data

  1. public schools data
  2. staff data files
  3. enrollment by school
  4. Graduates by race and gender (on student and school data files page)
  5. Dropouts by race and gender (ditto)
  6. all data sets on student and school data files page, if possible
  7. English Learner files
  8. School-Level Fluent-English Proficient Students (FEPs)
  9. Staff demographic data
  10. Staff assignment and course data
  11. School-Level English Learner Staff


Place user manual, code & data files placed in one of your student home directories and tell me where. I will copy the files straight from your home directory to mine. (Don't forget setting the permissions appropriately so I will be allowed to copy them.)

Reflective essays and other files due via email May 24th (Thursday) midnight.

User manual. 1 page "what I learned this semester" reflective essays. (1 page is about 500 words.)

Export all tables from database (clean data) and gather up all php, zip up all files and put in home directory for Dr Thomas to copy elsewhere. Backup. With data files and all PHP files, can reconstruct database/web site if something crashes during the summer.

Command to make a file world-readable. "chmod ogu+r"


DataQuest competitor to our web site

Ed-Data Website, contractor with dept of education


Demo: add "next" feature so can look at rows 1001-2000, etc, after looking at rows 1-1000.

Add sorting by columns, if at all possible.

Add list of tables to left side of 'query database' web page. (done)

staff_assignment table not quite full

SQL query, raw, not joining on CDS... or joining VERY slowly

Should get some queries for the users to see.

Need to be able to save query results into plain text/CSV filess

From Jan 25 meeting: Sample questions: what is the high school's graduation rate? rate by gender? graduation satisfying CSU/UC requirements, by gender?


  1. Database must be created to handle other languages, and not just english.
  2. School in English Learners staff must be incremented to 100


Bugs in uploading-to-hopper code. Design of web interface classic and simple, now code needs to work!

Demo to Dr Thomas on 4/16, hopefully demo to Dr Sanchez on 4/18.


Load all the data into hopper DBMS. Polish up web site for "user demo" to fake-user next Monday. (Dr Thomas will pretend to be Dr Sanchez for the first-round demo.) Email Dr Sanchez for a real demo after next Monday's meeting. Jose & Armando will load up the data; Emmanuel with polish up the web site.

Pre-write some SQL queries to pull out information Dr Sanchez expressed interest in. Graduation rate for high schools, graduation rate by gender, satisfying CSU/UC requirements (by gender)?


This week: make a poster for Saturday CCSC-SW. Meeting Tue/Thu to plan and print poster.

Prepping for a demo for Dr Sanchez & team. Need to make "enter spring year" into an automatically calculated detail. (Fall year plus one.) Need to get data into hopper database for access by other people. Integrate Emmanuel's web page material with all the SQL tables Jose / Armando have. Confirmation page for after new data is uploaded.

Looks like we need to implement the use of transactions to "wrap" the insertion of a new file-full of data. If the insertion goes correctly and all the file data is successfully inserted, then the transaction will commit. If there is a failure or error during insertion, then the transaction will NOT be committed, and MySQL can "roll back" the unsuccessful transaction. (If some rows of data were inserted and others were not, the roll back code in MySQL will undo all the insertions.)

Need to make sure old copies of inserted data files don't pile up on hopper, wasting space.

Jose: it is possible that some plain text files from different data sets have the same file name.

Need a query demo page, with the top few queries the IR office is interested in made "easy to run"

Overall web site design

sign in page (index.php)

if ok:

menu page: upload, view, query, link-to-CDE-site, how-to/help

if query:

query page: raw-SQL, predetermined queries list with drop-down fields

for all queries:

results page, with results plus a save-to-CSV option

When to give demo to IR? Discuss next Monday.


Re-write of 5 page paper due to Heather by 4pm Friday, for state competition.

Must add validating output of our database by comparing it to reports created by the Cal Dept of Education, per Dr Sanchez's comments last Friday.

Need PHP to insert Fall-year information (as part of primary key) from web page form as part of data-loading process.

Note for future functionality: should make it possible for users to save results of SQL queries as .csv or other Excel-readable file format.

Link that would help us write the queries into txt files

To Do for next week:

Send Heather revised copy of 5 page paper, plus form with application info. Don't forget to mention your need for Internet access. Send CCSC-SW poster chair your paper abstract, for a poster presentation on 3/24.

Jose will run 'create table' statements on hopper, first, before we try loading the data. Don't forget to add the 'year' field (data describing year of data of the text file, not year contained within text file).

Load data into MySQL DB on hopper. ALL of it.


PHP code successfully loads in staff data files. don't know if it works on any of the rest. Should work on .csv files. dBase conversion to other formats works; can output to text, CSV, etc.

Emmanuel: Get PHP code to properly handle header information in incoming data files. Grab field names and use to set up insertion commands correctly.

Armando: organized documentation of where Cal Ed documentation is incorrect, and describe the actual data formats. Pay particular attention to files that are likely to cause difficulties loading the data in and with PHP code. Help on PHP coding.

Jose: create table table statements for all the data sets / data files.

All: Start work on powerpoint slides for SRC presentation on 3/9.

Bear in mind that we are going to have to put in some security protections, in the MySQL and PHP, to keep malicious users out, protect against SQL injection attacks, protect Office of IR staff from accidentally deleting the whole database, etc.


Student Research Competition

Title so far: "Building a useful California Dept of Education Database"

Start of abstract: The goal of our project is to assemble an easily accessible and manageable database for the CSU Stanislaus Office of Institutional Research...

Outline of introduction section:

  1. student research project. opportunity offered by Institutional Research.
  2. goal: put together existing data from Cal Dept of Ed into a useful, accessible, flexible database
  3. students have added to project the goal of making database web - accessible and usable by future staff of the IR office
  4. IR wants to be able to analyze and track K-12 students as they do and do not matriculate to CSU Stanislaus

Outline of conclusion section:

  1. Students making good progress loading data into database software
  2. Have discovered many areas where documentation of data and actual data do not agree with one another
  3. In process of building web interface
  4. hope to have IR-usable, web-accessible database before end of semester

Send a work-able draft to Dr Thomas by Wednesday evening, early (5-7pm), for editing.

To Do for Feb 20:

Emmanuel has a starting-to-look-nice set of PHP to load in new data files. Trying to set it up so that, in 2012, 2013, 2014, users could load in new data to existing tables.

  1. Jose: Getting .dbf files into format we can read into MySQL. Either by directly loading the data into MySQL, or by translating into plain text files that can be handed to Emmanuel's code.
  2. Emmanuel: working on file uploading PHP code. Error checking. Input fields for school year information per file.
  3. Armando: also working on file uploading PHP code, especially generalizing the code to read all of the different types of files we have -- in some automatic way?

Create table statements will have to add in school year fields (in SQL date-datatype-ok formats) in addition to the 4digit year information that some data files have. (0910, 1011, etc)

Create table statements at bottom of this wiki. N.B.: we will have to add in fallyear, springyear fields to ALL tables, and always populate those. springyear will have to be part of the primary key for all tables, to separate out "enrollment for school X in 2010" from "enrollment for school X in 2009."

Document, document, document!

To Do for Feb 13:

  1. Jose will continue working on straight public schools data with all the blanks. Having trouble with MySQL complaining about empty fields.
  2. Armando will focus on 'staff' data files
  3. Emmanuel will try 'enrollment by school' data files.
  4. Side projects: Emmanuel continues with Php auto-read-in data files code project. Armando and Jose start crash course on basics of Php and PhpMyAdmin. (Spelling of that might be wrong.)

Don't forget to post useful SQL commands here! At bottom of page.

Feb 6 meeting notes:

CDSCode, used to ID schools in most files, is actually three different numbers - county, school district and school. Should unpack into 3 different columns to allow more flexible querying.

In "grads" data files, they have 'year graduated' recorded as something like '0708' for graduating in 2008. We'll need to bust that (non-standard data format) "year" information up to construct real year data.

MySQL load data is complaining about the (many) empty fields in the public schools data files. There has to be a way to get empty fields tidily read in as NULL data.

Emmanuel wants to get a PHP script to read in data files neatly.

Dr Thomas thinks she may have found some free Mac software that will read .dbf files.

Four weeks until student research competition!!! Would be great to be able to do "live" demos of queries to judges. Also, include pictures / diagrams of plans for final product (in PPT slides??).

To Do for Feb 6:

  1. try loading top three data files, for most recent year, into local installations of MySQL
    1. HOW EASY IS THIS? straightforward? GIGO issues?
    2. try public schools data and enrollment by school data first
  2. document problems, likely GIGO issues
  3. start looking into .dbf file format importing techniques, for older years
  4. Dr Thomas will see if we can get a group account on local MySQL server


  1. Jose is able to open dbf file in Excel (Windows computer)
  2. "load data infile" has this option: The IGNORE number LINES option can be used to ignore lines at the start of the file. For example, you can use IGNORE 1 LINES to skip over an initial header line containing column names: ..."


Link to add files with empty data values

If you run into a file that contains duplicates of the primary key you can use the following command,

load data infile 'location of file' ignore into table 'name of table';

Armando - I got all the Staff information to load in correctly.

Here are some useful Unix commands I found useful:

  1. sudo sed 's/;1011/;2010-2011/g' cbedsora10a.txt > cbedsora10a_edited.txt #replaces every instance of ;1011(used as year) with ;2010-2011.
  2. sudo sed 's/\(.*\)0910/\12009-2010/' cbedsora09a.txt > cbedsora09a_edited.txt #replaces only the last iinstance of 0910 in each line with 2009-2010.
  3. (Dr Thomas comment: unless you have accidentally set security permissions so that you are not allowed to edit the cb...txt files, the above commands should work without 'sudo' at the front.)

I first used the first command to replace the 1011 with 2010-2011 and it worked perfectly because the file was semicolon separated. Then when I opened the second file (year 2009) the text file was not semicolon separated but it was tab separated, so I couldn't get the first command to work. I looked around and found the second command which worked perfectly since every line has a date and it's always at the end of the line.

Note: Need to see if there is any way to input data into mysql which is not in english.


it talks about how we should create our database so that it will not come out with warnings concerning the names of the people.

loading data into public schools

load data infile 'filelocation' ignore into table publicschools lines terminated by '\r\n' ignore 1 line;

information about using PHP to insert dBase files (.dbf), found by Jose. There is a problem with using the php to open the dbf files. Php does no longer support them, but if we wish to use those methods we can download an extension and compile php to work. Here is the link where it talks about installing the extension and compiling it

Dbase files to TXT: The nightmare with dbase files is over, I found a webpage that has some java source code to convert the dbase file into a useable format. is the link to the page where it talks about how to convert it. I have created a Source code that would allow us to just tell a name of a directory were we have the dbase files and it would convert them into a txt file i have test it and saw that it is working ok. I will email the source codes to your emails.

Creating table statements for all "top 10" data sets:

  1. CREATE TABLE ENROLLMENT(CDS_CODE varchar(14), ETHNIC varchar(1), SEX varchar(1), SFT_PP int(3), SPT_PP int(3), SFT_CLER int(3), SPT_CLER int(3), SFT_OTH int(3), SPT_OTH int(3), KDGN int(4), GR_1 int(4), GR_2 int(4), GR_3 int(4), GR_4 int(4), GR_5 int(4), GR_6 int(4), GR_7 int(4), GR_8 int(4), UNGR_ELM int(4), GR_9 int(4), GR_10 int(4), GR_11 int(4), GR_12 int(4), UNGR_SEC int(4), ENR_TOTAL int(4), ADULT int(4), GRADS int(4), UC_GRADS int(4), VOC_GRADS int(4), INT_ALG int(4), ADV_MATH int(4), CHEM int(4), PHYS int(4), VOC_ED int(4), FALLYEAR year, SPRINGYEAR year, primary key(CDS_CODE, ETHNIC. SEX, FALLYEAR), foreign key (CDS_CODE) references PUBLICSCHOOLS(CDSCode));
  2. CREATE TABLE DROPOUTS(CDS_CODE varchar(14), ETHNIC varchar(1), GENDER varchar(1), E7 int(5), E8 int(5), E9 int(5), E10 int(5), E11 int(5), E12 int(5), EOS int(5), ETOT int(6), D7 int(4), D8 int(4), D9 int(4), D10 int(4), D11 int(4), D12 int(4), DTOT int(5), YEAR varchar(5), FALLYEAR year, SPRINGYEAR year, primary key(CDS_CODE, FALLYEAR, ETHNIC, GENDER), foreign key (CDS_CODE) references PUBLICSCHOOLS(CDSCode));
  3. CREATE TABLE GRADUATES(CDS_CODE varchar(14), SCH_CODE varchar(7), ETHNIC varchar(1), GENDER varchar(1), GRADS int(4), UC_GRADS int(4), YEAR varchar(4),FALLYEAR year, SPRINGYEAR year, primary key(CDS_CODE, FALLYEAR, ETHNIC, GENDER), foreign key (CDS_CODE) references PUBLICSCHOOLS(CDSCode));
  4. CREATE TABLE SCHOOL_ELSTAFF(CDS_Code varchar(14), County varchar(15), District varchar(50), School varchar(50), TYPE varchar(2), LC varchar(2), Language varchar(25), C00 int(6), C01 int(6), C02 int(6), C03 int(6), C04 int(6), C05 int(6), C06 int(6), TOTAL int(6),VALUE int(4), FALLYEAR year, SPRINGYEAR year, primary key(CDS_Code, FALLYEAR, LC, Language), foreign key (CDS_Code) references PUBLICSCHOOLS(CDSCode));
  5. CREATE TABLE STAFFDEMOGRAPHICS(REC_ID varchar(17), CDS_CODE varchar(14), COUNTY varchar(15), DISTRICT varchar(35), SCHOOL varchar(35), GENDER varchar(1),ED_LEVEL varchar(1), ETHNIC_GP varchar(1), YRS_TEACH int(3), YRS_DIST int(3),OVER_100 varchar(1), STATUS varchar(1), F_P_TIME varchar(1), PERC_TIME int(4), FULL_CRED varchar(1), UNIV_INT varchar(1), DIST_INT varchar(1),PRE_INT varchar(1),EMERGENCY varchar(1), WAIVER varchar(1), ELEM varchar(1), SEC varchar(1), GEN_SEC varchar(1), AGRI varchar(1), ART varchar(1), BUSINESS varchar(1), ENGLISH varchar(1), FOR_LANG varchar(1), HEALTH varchar(1), HOME_ECON varchar(1), IND_TECH varchar(1), LIFE_SCI varchar(1), MATH varchar(1), MUSIC varchar(1), PHYS_ED varchar(1), PHYS_SCI varchar(1), SOC_SCI varchar(1), VOC varchar(1), SPEC_ED varchar(1), READING varchar(1), BCC varchar(1), ELD varchar(1), SDAIE varchar(1), ADULT varchar(1), SPECSUBJ varchar(1), TEACH varchar(1),PCTTEACH int(3), ADMIN varchar(1), PCTADMIN int(3), PUPIL varchar(1),PIP varchar(1), STSP varchar(1), PCTPUPIL int(3), FALLYEAR year, SPRINGYEAR year, primary key(CDS_CODE, FALLYEAR, GENDER, ETHNIC_GP, ED_LEVEL), foreign key (CDS_CODE) references PUBLICSCHOOLS(CDSCode));
  6. CREATE TABLE FEPS(CDS varchar(14), COUNTY varchar(15), DISTRICT varchar(50), SCHOOL varchar(50), LC varchar(2), LANGUAGE varchar(25), KDGN int(6), GR_1 int(6), GR_2 int(6), GR_3 int(6), GR_4 int(6), GR_5 int(6), GR_6 int(6), GR_7 int(6), GR_8 int(6), GR_9 int(6), GR_10 int(6), GR_11 int(6), GR_12 int(6),UNGR int(6), TOTAL int(6), FALLYEAR year, SPRINGYEAR year, primary key(CDS, FALLYEAR, LC, LANGUAGE),foreign key (CDS) references PUBLICSCHOOLS(CDSCode));
  7. CREATE TABLE ELSBYGRADEANDLANGUAGE(CDS varchar(14), COUNTY varchar(15), DISTRICT varchar(50), SCHOOL varchar(50), LC varchar(2), LANGUAGE varchar(25), KDGN int(6), GR_1 int(6), GR_2 int(6), GR_3 int(6), GR_4 int(6), GR_5 int(6), GR_6 int(6), GR_7 int(6), GR_8 int(6), GR_9 int(6), GR_10 int(6), GR_11 int(6), GR_12 int(6), UNGR int(6), TOTAL_EL int(6), TOTAL int(6), TOTAL_LEP int(6), FALLYEAR year, SPRINGYEAR year, primary key(CDS, FALLYEAR, LC, LANGUAGE), foreign key (CDS) references PUBLICSCHOOLS(CDSCode));
  8. CREATE TABLE ASSIGNMENTCODE(ASGN_CODE varchar(4), ASGN_NAME varchar(47), CTYPE varchar(1), SUBJ_CODE varchar(2), SUBJECT varchar(15), PIC_CODE varchar(2), TOPIC_HEAD varchar(15), START_DT varchar(4), LAST_DT varchar(4), UC_CSU_REQ varchar(1), Comments varchar(100),primary key(ASG_CODE, SUBJ_CODE));
  9. CREATE TABLE ASSIGNMENT(REC_ID varchar(10), CDS_CODE varchar(14), ASN_CODE varchar(4), ASN_PCT int(4), M_ENROLL int(4), F_ENROLL int(4),C_CSU varchar(1), GD_LEV varchar(1), PCT_TIME int(4),PERC_TIME int(4), FTE float(8), CTYPE varchar(1), NCLB_CORE varchar(1), NCLB_HQT varchar(1), FALLYEAR year, SPRINGYEAR year, primary key(CDS_CODE, FALLYEAR, ASN_CODE), foreign key (CDS_CODE) references PUBLICSCHOOLS(CDSCode), foreign key(ASN_CODE) references ASSIGNMENTCODE(ASGN_CODE));
  10. CREATE TABLE PUBLICSCHOOLS(CDSCode varchar(14), NCESDist varchar(7), NCESSchool varchar(5), StatusType varchar(50), County varchar(15), District varchar(90),School varchar(90), Street varchar(211), StreetAbr varchar(201), City varchar(25), Zip varchar(10), State varchar(2), MailStreet varchar(211), MailStrAbr varchar(201), MailCity varchar(25), MailZip varchar(10), MailState varchar(2), Phone varchar(14), Ext varchar(6), Website varchar(100), OpenDate date, ClosedDate date, Charter varchar(1), CharterNum varchar(4), DOC varchar(2), DOCType varchar(50), SOC varchar(2), SOCType varchar(50), EdOpsCode varchar(20), EdOpsName varchar(100), EILCode varchar(50), EILName varchar(50), GSOffered varchar(101), GSserved varchar(101), PopStat varchar(2), Latitude varchar(10), Longitude varchar(10), AdmFName1 varchar(20), AdmLName1 varchar(40), AdmEmail1 varchar(50), AdmFName2 varchar(20), AdmLName2 varchar(40), AdmEmail2 varchar(50), AdmFName3 varchar(20), AdmLName3 varchar(40), AdmEmail3 varchar(50), LastUpDate date, primary key (CDSCode));