LargeDataSets ProjectIdeas 2011 12
Project Ideas for Spring 2012
- LSA (latent semantic analysis) of any of the smaller data sets below; feed the results to a spectral clustering group working with Dr Carter
- Analyze astronomical data in R, Matlab, or Python (with Dr DeVries)
- Straightforward operating system comparison: MySQL on one operating system vs MySQL on another, same hardware. Pick two operating systems and run the same set of queries over the same data set (generated by the data generation program students wrote in Fall 2011?). Which is faster, or is there no difference? We could run tests on the netbooks, or we might hang an 'out of order' sign on one of the CS lab machines for a little while. (Early in the Spring semester, or before the semester. Not during Spring finals.) If there is a difference, can we figure out why the operating systems cause performance differences in the DBMS?
- Straightforward DBMS comparison: MySQL vs PostgreSQL, same hardware, same data, same set of queries. Which DBMS should a user wanting to cope with Census data use? (Suppose the Census Bureau, for example, needs to install a DBMS.)
- Validate our data generation program. Load real Census data into a DBMS; load data from the student data generation program; then compare the two.
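The two comparison projects above both come down to timing the same set of queries over the same data while changing one thing (the operating system or the DBMS). A minimal sketch of such a timing harness in Python is given here. Everything in it is an assumption for illustration: the function names, the `person` table, and the two queries are hypothetical, and the standard-library `sqlite3` module is only a stand-in so the sketch runs without a database server. For a real test, a DB-API connection from a MySQL or PostgreSQL driver would be passed to `time_queries` instead.

```python
import sqlite3
import statistics
import time


def time_queries(conn, queries, repeats=5):
    """Run each query `repeats` times; return median wall-clock seconds per query."""
    results = {}
    cur = conn.cursor()
    for sql in queries:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            cur.execute(sql)
            cur.fetchall()  # fetch every row so the whole query is timed
            samples.append(time.perf_counter() - start)
        results[sql] = statistics.median(samples)
    return results


def build_demo_db(rows=1000):
    """Hypothetical stand-in data in an in-memory SQLite table (not real Census data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person (id INTEGER PRIMARY KEY, county INTEGER, age INTEGER)"
    )
    conn.executemany(
        "INSERT INTO person (county, age) VALUES (?, ?)",
        [(i % 58, i % 90) for i in range(rows)],
    )
    conn.commit()
    return conn


# Hypothetical query set; a real project would use the same list on every system.
QUERIES = [
    "SELECT COUNT(*) FROM person",
    "SELECT county, AVG(age) FROM person GROUP BY county",
]

if __name__ == "__main__":
    conn = build_demo_db()
    for sql, seconds in time_queries(conn, QUERIES).items():
        print(f"{seconds * 1000:8.3f} ms  {sql}")
```

The median over several repeats is used because a single run can be skewed by caching or by other activity on the machine. Running the identical script and query list on each operating system, or against each DBMS, keeps the comparison fair.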
Interesting Data Sets for Spring 2012 Projects
- Census data, 1990, 2000, 2010
- American Community Survey Data
- any data set with spatial data can be mapped and compared to Census Data
- FBI Uniform Crime Data?
- FCC broadband access data
- Spitzer Space Telescope data, stars, Dr DeVries - looking for young stars.
- Atacama Large Millimeter Array (ALMA), Dr DeVries - looking for signs of collapsing gas (data not really available yet - late Spring 2012)
- Data from the CSU Stanislaus Institutional Research Office
- CSU Stan has test results from 6 years of WP exams
- National Survey of Student Engagement data, 6 years' worth, and about 2 years of Faculty Survey of Student Engagement data
- Dept of Education 'California Basic Educational Data System' (CBEDS) database. Can be organized by county / district / school to look at the student population that "feeds" CSU Stanislaus
- other data sets also available
- Usenet flame data (Dr Martin)
- medical web pages, annotated (Dr Martin)
- group communication, annotated with effectiveness of outcomes of communication (Dr Martin)
Safecast offers data (under a Creative Commons license) on radiation levels in Japan, gathered by volunteers who put Safecast sensors on their cars and drive around taking readings. (They also have some fixed sensors.)