LargeDataSets Analysis IR 2012
Good draft of paper due via email to Dr Thomas on or before 5/17. Will give feedback in a couple days. Final draft due 5/24 (final exams seem to be done by about 5/23). Write as if your audience is another undergrad starting up analysis work in September, starting where you left off.
Dr Sanchez interested in: WPST pass rates by majors. Tables generated for 2003 thru 2010 (combine old IR report data with our new data).
Kappa above .8 is likely to be good -- what we're looking for. .4 to .8 is a grey area.
(If the classifier task is easy, a .8 Kappa might not actually be that good.)
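For anyone who wants to see what the Kappa statistic in Weka's evaluation summary is measuring, here is a minimal Python sketch of Cohen's kappa computed from a confusion matrix. The function name and the counts below are made up for illustration and are not from our data.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    Rows are actual classes, columns are predicted classes.
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Fraction of cases where prediction and actual class agree.
    observed = sum(confusion[i][i] for i in range(size)) / total
    # Agreement expected by chance, from the row and column totals.
    expected = sum(
        (sum(confusion[i]) / total) * (sum(row[i] for row in confusion) / total)
        for i in range(size)
    )
    if expected == 1:
        # Kappa is undefined when chance agreement is perfect;
        # returning 0.0 is an assumption made for this sketch.
        return 0.0
    return (observed - expected) / (1 - expected)


# Hypothetical pass / not pass counts: [[true pass, missed pass],
#                                       [false pass, true not pass]]
example = [[40, 10], [5, 45]]
kappa = cohen_kappa(example)
print(round(kappa, 3))
```

With these made-up counts, 85% of predictions are correct but chance alone would give 50%, so Kappa lands in the grey area rather than above .8.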
Note: to not pass the WPST as a native English speaker is a Very Bad Sign. We should probably take a look at those cases.
WPST scores as pass/not pass, 10-fold cross validation. Try Naive Bayes (baseline), decision trees (human readable results), SMO (support vector machines, elephant gun, lots of knobs to turn).
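To make the 10-fold idea concrete, here is a small Python sketch of how the rows get split and scored. It uses a majority-class guesser as a stand-in classifier (even simpler than Naive Bayes), and the labels are invented. Weka does all of this for you when you choose cross-validation, so this is only an illustration, not Weka's own procedure.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate_majority(labels, k=10):
    """Mean accuracy of a majority-class baseline over k folds."""
    accuracies = []
    for test_fold in k_fold_indices(len(labels), k):
        held_out = set(test_fold)
        # Train only on rows outside the held-out fold.
        train = [labels[i] for i in range(len(labels)) if i not in held_out]
        majority = max(set(train), key=train.count)
        # Score the "always guess the majority" rule on the held-out fold.
        correct = sum(1 for i in test_fold if labels[i] == majority)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / len(accuracies)


# Made-up labels: 70 passes and 30 non-passes.
labels = ["pass"] * 70 + ["not pass"] * 30
mean_accuracy = cross_validate_majority(labels)
print(round(mean_accuracy, 2))
```

Any real classifier we try should beat this kind of baseline, otherwise it has not learned anything beyond "most students pass."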
Weka Explorer, top bar, "select attributes" button will let you play with minimizing the number of attributes we need to try to predict WPST pass. Try it yourself, using human intuition, and also let Weka try to do it without you. (Can help you track down weirdness, outliers, errors in data.) Look for 'information gain.'
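As a rough picture of what 'information gain' means, here is a short Python sketch that computes it for one attribute against the pass / not pass label. The attribute names and values are hypothetical, and this is a textbook calculation rather than a copy of what Weka runs internally.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(attribute_values, labels):
    """Drop in label entropy after splitting the rows by one attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the groups created by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical rows: one attribute that separates the classes perfectly,
# and one that tells us nothing.
result = ["not pass", "not pass", "pass", "pass"]
helpful = ["yes", "yes", "no", "no"]
useless = ["a", "b", "a", "b"]

gain_helpful = information_gain(helpful, result)
gain_useless = information_gain(useless, result)
print(gain_helpful, gain_useless)
```

Attributes with gain near zero are candidates to drop; an attribute with suspiciously high gain is worth checking for data errors or for leaking the answer.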
(Plans to be revised at a moment's notice, if we get the rest of the WPST data!)
Kristi - put some of the NSSE data into the database (columns to be selected by Cassie & Melissah). Put queries up on wiki, with results. Try a few NSSE predictions using Weka.
Melissah - keep running NSSE tests, put results in wiki.
Cassie - keep running NSSE tests, put results in wiki.
Put results (summaries of, not the whole Weka report page) on the new 'results page' we created.
Want very accurate prediction of "students who need tutoring to pass WPST." Might wind up with results that miss/ignore some students who need tutoring, but we want 100% of the set of predicted 'students who need tutoring' to be students who absolutely do need tutoring, so that 100% of the money spent on tutoring is money well spent.
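In classifier terms, the goal above is very high precision on the 'needs tutoring' class, even if recall suffers. A minimal Python sketch of the two measures, using invented labels and a hypothetical function name:

```python
def precision_recall(actual, predicted, positive="needs tutoring"):
    """Precision and recall for one class of interest."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == positive and p == positive)
    false_pos = sum(1 for a, p in pairs if a != positive and p == positive)
    false_neg = sum(1 for a, p in pairs if a == positive and p != positive)
    # Precision: of the students we flagged, how many really needed help?
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of the students who needed help, how many did we flag?
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Made-up example: a cautious classifier that flags only one student.
actual = ["needs tutoring", "needs tutoring", "needs tutoring", "ok", "ok"]
predicted = ["needs tutoring", "ok", "ok", "ok", "ok"]

precision, recall = precision_recall(actual, predicted)
print(precision, round(recall, 2))
```

Here every flagged student truly needs tutoring (precision 1.0), but two of the three who need it are missed, which is the trade-off the note above says we are willing to accept.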
With WPST data, our classifiers are more likely to work well if we try to predict "passed / not passed" or even "high pass / not high pass", rather than trying to predict exact scores.
While we wait for rest of WPST data, pick some columns from NSSE data and see if Weka (decision trees? Naive Bayes?) can come up with accurate classifiers. Pick a few "this might actually work" theories to find classifiers for, and pick at least a couple "there is no way this will work" classifiers (what the heck -- they might actually work) to test. If you try to use Weka to predict a column with only a few possible values, Weka is likely to do better. (Predicting gender might be easier than predicting major.) 10-fold cross validation is preferred.
Kristi will spend some time polishing up MySQL database before tackling Weka work. (We should try to work on a set of SQL queries that generate interesting results, once that database is ready to roll. Add NSSE data as another relation in the database?)
Sam / Kristi going to 4/21 mathematics conference at CSUS.
To Do for next week:
Kristi: get new fields into (My)SQL table. Get Weka working. putting data into Weka and generating graphs
Cassie: putting data into Weka and generating graphs
Melissah: putting data into Weka and generating graphs
Sam: create gender tables for updated IR office report. (last week's work) get Weka working. putting data into Weka and generating graphs
Student research competition on Friday!!
After SRC, send (via email) around recreating-IR-report figures. Try putting NSSE / WPST data into Weka after the SRCompetition. Either pick a theory and see if Weka can find a correlation, or pick a favorite data mining algorithm and see what it says about the data.
To Do for next week:
Still need to do more "characterizing the data set" gathering. Counts, percentages.
All work on sections of paper
Kristi: paper draft, veteran's data analysis
Cassie: paper draft, NSSE data, study habits and communication with faculty habits
Melissah: paper draft work, plus work with Max/Cameron
Sam: paper draft, stick NSSE data into SPSS, look at WPST scores and class level
Student Research competition paper draft
Kristi - has WPST stuff in MySQL database
Everyone - make 3-4 slides for student research presentation (maybe more)
Take old WPST report and recreate figures. Cassie recreating figure 1 (by WPST attempts), 4 (age group),
Sam: gender WPST, ESL status
Kristi: pass rate by race/gender - original language, since don't have race info, class level
If get done, try WPST pass rate by institution took freshman English at
MT: Still need a draft of this! 100-200 words, no more.
What we're doing
Where data is from
(and tiny snippets from rest of paper)
Here's what I have, it doesn't seem like much, so any feedback is greatly appreciated.
We are working with the Office of Institutional Research (IR) located here on the California State University Stanislaus (CSUS) campus. They collect large amounts of data from a variety of sources, such as national surveys and local surveys of students and faculty. Peoplesoft is an important source of data for IR, especially since it contains information on students who are attending CSUS. All of this information is analyzed and used in a variety of ways, such as planning, evaluating, and making improvements to the campus. The IR department has supplied our research group with student Writing Proficiency Screening Test (WPST) results. They also have supplied us with data from the National Survey of Student Engagement (NSSE). We will be studying this data to develop a hypothesis on what can be done to improve student scores on the writing proficiency test.
One of the tools we will be using is WEKA. Data mining is looking at large data sets and finding patterns, or relationships that are meaningful. Classifying, clustering, identifying associations and sequential patterns are all part of data mining. The results can be analyzed and used to predict future trends.
We plan to use the tools of data mining to analyze the WPST data, hoping to identify students at risk for not passing the test.
(MT, Wed night: Still needs work! Note that the company named Peoplesoft is not the source of student records data. Peoplesoft is merely the vendor supplying the software we happen to use -- probably shouldn't be mentioned by you at all, since you're not using Peoplesoft software. Just say something about IR having access to university records about students' academic performance and demographic data. Mention that IR supplied several years of WPST data -- not just one year, which the current prose implies. And you'll probably be developing 'hypotheses' not only one 'hypothesis.')
(MT: more than improvements. Campus planning, campus evaluation. IR also has access to Peoplesoft data -- not just surveys. "supplied", not 'is supplying'. Insert a citation to http://www.csustan.edu/wpst/ where you first mention the WPST, and http://nsse.iub.edu/ for NSSE. Cite the Weka web site the first time you mention that. Might want a bit more segue between your general descriptions of data mining and what it is for, and your comments about the WPST. That leap from the general to the local is abrupt. "We plan to use the tools provided by the field of data mining to analyze WPST data, hoping to identify ....")
(For me, newspaper reporters just say stuff. Scientists say stuff, but cite their sources. Be scientists!)
What NSSE / WPST is (separate paragraphs)
The Writing Proficiency Screening Test, or WPST, is a method for determining if a student is sufficiently prepared to enroll in a Writing Proficiency (WP) course, which satisfies the University’s Graduation Writing Assessment Requirement. This implements a CSU systemwide writing requirement. Students must pass the WPST before they are allowed to enroll in any WP course. All undergraduates attending CSU Stanislaus without exemption must take the WPST, preferably at the end of their sophomore year. The essays produced are read and scored holistically by at least two different readers. Students who fail the test at least twice are provided with options to improve their writing skills, such as a workshop or courses specifically aimed at developing writing abilities. http://www.csustan.edu/wpst/
The National Survey of Student Engagement (NSSE) was developed in 1998 as a way to gather information about collegiate quality on a national level. Its main purpose is to aid universities in improving undergraduate education and involvement. First year and senior students seeking baccalaureate degrees at participating institutions are surveyed. CSU Stanislaus participated in the NSSE in 2003, 2004, 2006, 2009, and 2011. The NSSE was initially supported by a grant from The Pew Charitable Trusts and is now fully self-supported by institutional participation fees.
(MT, Wed night: segue with introduction section, or merge? Still dictionary-like...)
(MT comment: Traditionally, a background section has a more 'narrative' or 'story-like' flow. This is rather dictionary-ish, a list. Your project actually does have a story -- we were asked/offered to put your skills to use for IR, on the job of analyzing... so that IR would have more information about ...)
What we're doing
Analysis with Excel to start with, but too much data.
Weka, data mining tool, SPSS
CC: This is another draft of the experimental approach. No other modifications have been made to the other topics and I have not received anything via email. For now, I shall begin bridging all aspects of the paper together.
A few notes: Disregard the figure part for now, I am still working on it. At the bottom of my section are the citations I have for my sections. The WEKA citation looks different because there's a link on their site that states that if you want to cite them, you should use this citation (which is the citation I used).
Also, I am assuming that only one example will be used in this paper because I haven't received anything else. That is why there is only one example (Ethnicity versus Majors).
The experimental approach of our research project involves a lot of time prepping the data as well as running queries in order to achieve a result. Using tools such as Microsoft Excel (“Microsoft Excel-Spreadsheet-Office.com”), IBM’s Statistical Package for the Social Sciences, also known as SPSS (“IBM SPSS software for predictive analysis.”), and WEKA, a data mining software tool (Hall et al.), allows us to create queries and identify meaningful correlations. These three programs provide the means for us to conduct our research in a way that can provide insightful conclusions.
Our data was originally provided in Excel format and from this we have been running various queries. The formulas used throughout the experiments include count, average, max, and min. Running these queries provides the correlations we need in order to achieve our results. A key focus of our research is student majors and whether there is a determining factor for choosing a given major. Correlations that we are currently looking into are between student majors and ethnicity, parent’s education, and gender. A majority of the experimental process is taken up by piecing together the data to run these queries. We eventually hope to find correlations where we can place the data in WEKA, test against algorithms inside the software, and see if those algorithms can in fact find the same correlations that we have found from our experiments so far. However, due to the time that it took to prepare the data, we are unable to test it yet.
Although we are still in the process of conducting our research, we can already see some correlations based on tests that have been run. For example, one of our first tests involves looking at NSSE data (more specifically, the ethnicity and majors of students attending CSU Stanislaus) and determining if any correlations can be found. During this process we noticed that ethnicity was reported twice: once by the students and once by the university. We decided to test both reported ethnicities against student majors by using a “countifs” formula inside Excel, which counts the number of students with a certain ethnicity for each major (“Microsoft Excel-Spreadsheet-Office.com”). Results of this test were placed in graphs, and interesting outcomes can be seen in both figures one and two (figure 1 and figure 2).
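For readers who have not used the “countifs” formula, the same counting can be sketched in a few lines of Python. The records, category labels, and function names here are invented for illustration only; they are not the NSSE data or the formulas in our spreadsheets.

```python
from collections import Counter

# Hypothetical (ethnicity, major category) pairs, one per student.
records = [
    ("Hispanic", "Business"),
    ("White/Caucasian", "Business"),
    ("Hispanic", "Social Sciences"),
    ("Hispanic", "Business"),
    ("Asian/Pacific Islander", "Social Sciences"),
]

# One pass over the rows counts every ethnicity / major combination.
counts = Counter(records)


def countifs(ethnicity, major):
    """Number of students matching both conditions, like Excel's COUNTIFS."""
    return counts[(ethnicity, major)]


def percent_within_major(ethnicity, major):
    """Share of one major category made up by one ethnicity, as a percentage."""
    major_total = sum(n for (_, m), n in counts.items() if m == major)
    return 100.0 * counts[(ethnicity, major)] / major_total if major_total else 0.0


hispanic_business = countifs("Hispanic", "Business")
share = percent_within_major("Hispanic", "Business")
print(hispanic_business, round(share, 1))
```

Running the same counts once with the student-reported ethnicity column and once with the university-reported column gives the two sets of numbers that the two figures compare.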
Figure one depicts student-reported ethnicities gathered from an independent survey, while figure two illustrates ethnicities collected by the university; both are compared against student majors. The two reports differ slightly, but the variation is marginal, especially when comparing the results for the four main major categories of Arts and Humanities, Business, Social Sciences, and Other. This suggests that the method employed by the university to categorize student ethnicities agrees closely with what students report themselves.
Arts and Humanities is the major category with the highest number of student enrollments. Within the Arts and Humanities category, the most common majors are Philosophy, History, Music, and Art, along with other less common majors. In the student ethnicity breakdown, the most common ethnicity is White/Caucasian, constituting 39.9% of all Arts and Humanities students. The second most common ethnicity is Hispanic (comprising Mexican, Puerto Rican, and other Hispanic ethnicities) with 35.1%, followed by Asian/Pacific Islander students with 11.5% and African American students with 2.0%.
Another of the key major categories is the Social Sciences, which consist of Psychology, Sociology, Political Science, Economics, and other less common majors. Within this category, the most common ethnicity is Hispanic with 37.8% of students. The next most common ethnicity is White/Caucasian with 32.2%, while Asian/Pacific Islander and African American students are 11.9% and 3.5% respectively.
The third most popular major category among students attending CSU Stanislaus is Business, consisting of majors like Marketing, Finance, and Accounting, to name a few. In this category, White/Caucasian is the most common ethnicity with 34.4%, followed by Hispanic with 32.8%. The third most common student ethnicity is Asian/Pacific Islander with 10.16%, while African American students make up 4.7% of all Business majors.
The fourth most popular major category is known as Other and consists of Computer Science, Communications, Criminal Justice, Kinesiology, and other less common majors. In this category, Hispanic students are the most common ethnicity with 46.7%, followed by White/Caucasian students with 30.0%. Asian/Pacific Islanders are third with 6.7%, and African Americans are the lowest of the four with 3.3% of students.
Overall, the most common ethnicities are Hispanic and White/Caucasian, with Asian/Pacific Islander and African American students the least common of these four. Although the most popular majors among students lie within Arts and Humanities, this data provides insights that major programs here at CSU Stanislaus can analyze in order to increase the diversity of students in the majors they pursue. Not only can certain major programs at CSU Stanislaus benefit from this data, but CSU Stanislaus as a whole can use the data and insights to develop focused strategies with the goal of improving the spread of ethnic diversity among majors.
“IBM SPSS software for predictive analysis.” ibm.com. IBM. n.d. Web. 19 Feb. 2012.
Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten (2009); The WEKA Data Mining Software: An Update; SIGKDD Explorations, Volume 11, Issue 1.
“Microsoft Excel-Spreadsheet-Office.com.” office.microsoft.com. Microsoft. n.d. Web. 19 Feb. 2012.
Conclusions / Future Work
Thanks to IR and AHPCRC (put in acknowledgement section right below conclusions??)
Cite web page with WPST old report
weka web page
NSSE code book
(This is my second draft, please be brutal and let me know what to change. Also the works cited page is forthcoming after the meeting today.)
Our first task in the future is to recreate reports from the Office of Institutional Research supplemented with statistically accurate information regarding the studies used. From there we will be moving on to attempting to make predictions regarding student success with the WPST, allowing us to develop tools to accurately identify students who will have trouble with this important milestone. These techniques are not just restricted to WPST information but can be applied to student success in a wide range of areas such as math remediation.
Later we hope to expand the sources we use in our analysis to include data collected from other areas. One such source is the PeopleSoft database here on campus. This will allow us to make more accurate predictions while giving us the ability to ask more in-depth questions about our school and student body. These questions will be relevant not just to students in an academic sense, but also relevant to the university so we can approach policy and decision making with sound evidence.
Our work with the analysis tools and techniques in data mining has included Weka, Excel, and SPSS. The whole process is an ongoing effort that will ultimately lead to better equipped students, ones who can go forward in their fields of study with the confidence that they are using scientifically sound processes. This is not limited to the existing data that we have been working on, but will be expanded to encompass other sources of data, such as pass rates of course work at the school, enabling us to widen the questions that student researchers can ask.
In closing we would like to extend our thanks to the California State University Stanislaus Office of Institutional Research for their invaluable contribution to our research, and the Army High Performance Computing Research Center for the grant which made this research project possible.
(Cassie: Here's my rough draft page for the experimentation approach.)
The experimentation approach of our research project involved a lot of time prepping the data as well as running the queries in order to achieve a result. Using tools such as Microsoft Excel, IBM’s Statistical Package for the Social Sciences (SPSS), and WEKA (a data mining software tool) allowed us to create queries and make the correlations that were made. These three programs provided the means for us to conduct our research in a way that provided successful conclusions.
Our data was originally provided in Excel and from there we began by running queries based on the data we obtained. Various formulas used throughout the experimentation process ranged from count, average, max, and min. Running these queries provided the correlations needed to achieve our results. A key focus of our research geared toward various student majors and whether there was a determining factor for choosing that major. Correlations were made between student majors versus parent’s education, ethnicity, and gender. Although the purpose of our research was to analyze student performance during their time in college, we also aimed to see how they reacted socially in a college environment. This involved queries such as the age of students attending the university, where they lived, the class environment, professor environment, and whether or not they participated in extracurricular activities.
A majority of the experimentation process was taken up by piecing the data together to run these queries. We eventually hope to find correlations where we can place the data in WEKA, test against algorithms inside the software, and see if those algorithms can in fact achieve the same correlations that were found in our experiments. However, due to the time that it took to prepare the data, we are unable to test it yet. Even though we faced time constraints, we were able to see some correlations based on graphs and charts that we have created inside Excel and SPSS. There were many more experiments that we could have tested and plan to test in the future, however, the analysis made from the tests ran were sufficient to make our research stand out.
(MT: "experiments" are the usual word. Or "experimental approach." I'd recommend using the present tense for most of this, since we're still in the process of prepping the data and haven't even started running most queries. 'Excel format,' not 'in Excel.' IR didn't hand us a computer with a running Excel program that had the data file open. I hope you plan to cite the web pages of the Weka project, whoever sells SPSS, etc.)
2/23/12 CC: Here's the draft that was previously in my section:
The experimental approach of our research project involves a lot of time prepping the data as well as running the queries in order to achieve a result. Using tools such as Microsoft Excel, IBM’s Statistical Package for the Social Sciences (SPSS), and WEKA (a data mining software tool) allows us to create queries and make correlations. These three programs provide the means for us to conduct our research in a way that will result in successful conclusions.
Our data was originally in Excel and from this we have been running queries. Various formulas we use throughout the experimental process allow us to do a wide range (what to put here? Mind is blank) count, average, max, and min. Running these queries provides the correlations we need to achieve our results. A key focus of our research is geared toward various student majors and whether there is a determining factor for choosing that major. Correlations are made between student majors versus parent’s education, ethnicity, and gender. Although the purpose of our research is to analyze student performance during their time in college, we also aim to see how they react socially in a college environment. This involves queries such as the age of students attending the university, where they live, the class environment, professor environment, and whether or not they participate in extracurricular activities.
A majority of the experimentation process is taken up by piecing the data together to run these queries. We eventually hope to find correlations where we can place the data in WEKA, test against algorithms inside the software, and see if those algorithms can in fact achieve the same correlations that were found in our experiments. However, due to the time it's taken to prepare the data, we are unable to test it yet. Even though we were facing time constraints, we were able to see some correlations based on graphs and charts that we have created inside Excel and SPSS. There are many more experiments that we would like to test, and plan to in the future. However, the analysis made from the tests we have done so far are sufficient to make our research stand out.
(MT, Wed night: hopefully you will be FINDING correlations that exist in the data, not making them. Again, data was originally in MS Excel *format*, not "in Excel" -- MS Excel is a software program, not a file folder. HOW do you aim to see how students react socially? You won't be video-taping them! Describe how -- that you will analyze student answers to questions about their social lives, given in the NSSE survey, etc, etc. Don't assume the readers know everything about the NSSE survey, or about finding correlations, or even what a correlation is. Describe the basics of finding one, or of using Weka. This descriptive material is all quite vague! Do not state anything about "making your research stand out". That sounds like bragging. It is the judges' job to decide if your work stands out or not.)