Spring 2011, Spring Research Project
Goal: draft of introduction and background sections to Dr Thomas by 5pm on Friday, 5/6. Draft of whole shebang (looked at and edited by ALL!) on Monday, 5/9.
First draft of student project report, summarizing all the work since January (whew!) due next week. Would be nice if it were ready by Monday so we can start editing, improving. Graphs of assorted run-times, too.
Random query generation... ???
template file might look like:
select count(*) from people where ethnicity like "$$";
[Hispanic, African-American, Native-American]
Code could look for the $$s, and use them as a command to read the next line, create an array of the values inside the s (actually, might be easier just to list the values without the s), and pick values randomly out of that array to generate X queries.
(Long gap between meetings due to Tapia conference work)
Due to time constraints, not using "income" field -- would like incomes to be skewed in a semi-realistic fashion and we don't have time to do that.
We also don't have ethnicity or job category yet, either.
Code to generate random queries. Include all the queries that got one or more votes (below) -- with simplifications some of those queries will probably merge into one. Should leave us with around a dozen queries.
Note for experimenting on hopper. "users", typed into a command line, will tell you who else is logged into the machine at the same time you are. ("whoami", which I erroneously wrote instead of "users" earlier, reports which user name you are currently logged in as. Useful for sysadmins with numerous user names.)
Continue running data loading and 'mega-query' experiments. Carefully note timings. We'll try using hopper as a data-generation machine, to take advantage of speed.
Query from list below: How many adult men are in the database (over 18)?
Turn this into: How many [gender] are in the database with age greater than [random number between 10 and 70]?
When generating queries from above pattern, pick male/female 50/50, and the age to be the bottom of the range randomly.
Generalizing the sample queries y'all liked in this way, we can get classes of queries to generate.
Put your student research competition slides and final draft papers into the CSHomework Submission System, under the "instructor" name of ahpcrc.
Revising our relations for better fake-Census queries. New relations:
People - first name, last name, age, address, zip code, birth date, gender, income, job category, ethnicity
Zip Codes / Cities - zip code (primary key), city, state, latitude, longitude
Note that birthdate and age are a little redundant, but the Census form asks for both, so we're following their lead. Also, job category and ethnicity should be randomly selected from lists of the "usual" choices. The Bureau of Labor and Statistics should have a broad list of 20 or so job categories we can use.
1) put lists of "reasonable" questions here, in one big list
2) "vote" for top 10-15 'best' questions
3) write code to re-do the random data generation, using new relations
4) write code to generate random queries, following the templates of the top 10-15 favorite queries. It would be convenient if we could control the mix of queries, so that we can select that 10% of the queries in a generated set should be based on Query A, and 5% on Query B, and 2% on Query C, etc.
5) start re-running the "big, bad query" tests on the new people relation data, using two hardware platforms (but still MySQL as our DBMS)
List of questions you made up:
1 How many adult men are in the database (over 18)? LC AR OC
2 How many adult women are in the database (over 18)? LC AR
3 Total work force (men + women) > 16 and below 65? LC OC
4 What is the average income in California? AR
5 How many unemployed people exist in the database?
6 Find the number or children (<16), adult (>18) and retired (>65) LC AR OC
7 How many Caucasians exist in the database? AR
8 How many Hispanics exist in the database? AR
MT - Suggestion. Replace the above two queries with "How many choose-your-ethnicity people are in the database?" for assorted values of "ethnicity."
9 Find the number of people from other minorities?
10 Which race has the highest income? AR
11 Find the number of households with more than 5 members. OC
12 Average household size per zipcode. OC
13 Average household size per state.
14 Average household size All.
15 Average household size per city. OC
16 Average Income per household per city AR
17 Average Income per household per zipcode AR
18 Average Income per household per state
19 Average Income per household All.
20 How many homes per zipcode. AR
21 How many homes per city.
22 How many homes per state.
23 How many people per zipcode LC AR OC
24 How many people in California LC AR OC
25 How many people in Stanislaus and San Joaquin county LC AR OC
26 List the population per county in California LC
27 How many people per city LC AR
28 How many people per state LC OC
29 How many people All. LC AR OC
Need to put experimental results so far here.
query: select fname,lname,age,address,city,phone,sex,state,zipcode,birthday,count(*) FROM people GROUP BY fname,lname,age,address,city,phone,sex,state,zipcode,birthday;
Done in Asus Eee PC 1000 ha netbook running Ubuntu 10.04
Specs: intel atom 1.66 GHz
memory 1.0 GB DDR 2 667 MHz
mysql server version 5.1
Tuples Time 1,000 0.05 sec 10,000 0.32 sec 100,000 8.08 sec 1,000,000 2 min 43.67 sec 2,000,000 7 min 30.68 sec 5,000,000 29 min 31.64 sec 10,000,000 472 min 33.72 sec = 7 hrs 52 min 33.72 sec 15,000,000 601 min 31.82 sec = 10 hrs 1 min 31.82 sec 20,000,000 1206 min 53.44 sec = 20 hrs 6 min 53.44 sec 25,000,000 need to start query again because netbook got unplugged.
Done in HP ProBook 4525s Laptop running Ubuntu 10.04
Processor: Athlon II dual core processor 2.2 Ghz
memory : 4.0 GB DDR 3 1066 MHz
mysql server version 5.1
Tuples Time 1,000 0.02 sec 10,000 0.11 sec 100,000 3.11 sec 1,000,000 1 min 14.60 sec 2,000,000 3 min 24.00 sec 5,000,000 12 min 46.20 sec 10,000,000 107 min 43.28 sec = 1 hrs 47 min 43.28 sec 15,000,000 20,000,000 pending 25,000,000 pending
For next Monday (mostly to be completed after Friday's Student Research Competition), write 10-20 queries (each) in English. Reasonable queries, such as any Regular Human might ask of (fake) Census data. City Planners, State Senators, that kind of thing.
Then translate the questions into SQL. So we will have a pool of possible SQL queries to consider for the "overwhelm with many queries" experiments.
We may need to modify the fake-Census schema, to make it more rich, for more interesting queries. Discuss that next Monday, after the SRC.
Fall 2010, November Group, Cameron, Tony, Aaron
Final Reflection (Aaron):
Getting The DBMS working was not that difficult due to the work of previos students. I was the only one that was able to get the mySQL Benchmark working. I was able to collect data from our standardized tests with 1, 2, 4, 8, 16, 32 warehouses, as well as 2, 16 and 32 terminals for 10 min within the benchmark. I found that 10 minutes was not long enough when the amount of warehouses was above 8. The reason is that it would start, but before it would complete a couple of transactions the time would end. Even though the 10 minutes was up it would still take another min to 30 min for the benchmark to finalize the tasks. The length is dependent on the number of warehouses.
I wish i would have had time to test the same parameters on battery, but due to time restraints, I am forced to finish this at a later time. Another future project will be getting another benchmark up and running and running the same tests for comparison.
What I learned: I learned that mac OSX is much faster at loading warehouses than Windows, But is very similar to Linux.
I also learned that the benchmark works just fine on Mac OSX, but has problems on windows OS and Linux.
I learned that the more warehouses you load the slower the benchmark runs and the lower the ave. tpmC:.
I can't say that I just learned this since I have known this for a long time, but since we did experience it quite a bit i must mention it.
That no test or project goes as planned.
Over all I enjoyed this part of the project and plan on continuing my tests.
(Dr Thomas comments: Never too early -- or late -- to learn that no project ever goes as planned.)
Next time, need to use different columns to record results of benchmarks run. # terminals, transactions completed, etc.
Benchmarking GUI ran successfully on Mac OS (Aaron), froze on Ubuntu/Asus (Thomas), Ubuntu/HP (Tony), and Cameron's Windows box complained about missing the JDK.
Final Reflection (Tony):
Testing databases sure gave me some idea of how different ways of testing will give different results. For example, one might think that running something on battery power or running it on cord will give the same result since the computer is using the same amount of power continuously when is turned on, but it is not the case. When the computer is running on batter it tends to be slower than plugging it to the wall. Also the hardware embedded into the computer also helps to achieve faster results.
In order start working in this program there is many things to get started.
First the Benchmark testing program must be downloaded (unless you want to create your own).
Always read the readme file, it might be helpful when don't know what to do.
Make sure the paths the program is using the find java are correct (Most computers have different pathways and remember than windows uses '\' instead of '/').
Run the program by reading the readme file
Once you are running the Benchmark GUI, there are chances that it freezes just like my computer did.
Overall, very entertaining and fun!!!
(Dr Thomas comments: It would be interesting to look into more of why battery goes more slowly. Perhaps there are some not-so-well-documented aspects of the operating system that make this happen.)
Final Reflection (Cameron):
I was surprised at the performance between a windows laptop versus the other two. At first comparison both the apple laptop and mine had a 2.2 ghz dual core processor and 2 gigs of RAM. After running some tests and discussing the results with Aaron and Tony the Windows laptop seemed to take twice as much time as the others. So it appears the Windows fails to optimize database management. Unfortunately I wasn't able to do any further benchmarking because of a missing file. I am curious if there is a database software that performs better for windows than other OS's.
(Dr Thomas comments: Well, Microsoft's SQL Server might be an obvious DBMS to look at for good Windows performance...)
Compare load data results, wall vs battery, for the range of warehouse sizes 1, 2, 4, 8, 16, 32, 64... until battery battery goes poof before loading finishes (or you run out of hard drive to store the data).
For running the benchmark querying, vary the sizes of the warehouses. For each warehouse size, run with 2, 32 and 64 terminals. Run all tests for 10 minutes. Start out on wall power for all tests, try out battery if you have time / interest.
Keep adding results into the tables below.
We have (just barely) gotten BenchmarkSQL running on each of your machines. So, this week, we will run experiments in a scattershot, not systematic (or scientific) fashion. Just to see what we can do and where some of the performance limits are.
(Thanks to Cameron, we now know that the README file for BenchmarkSQL correctly lists that loadData ... numWarehouses=10 is correct on Windows. A space must be added in between the = and the number if running loadData on Ubuntu / Mac.)
We will make a more organized testing plan next week.
Record the results of the experiments here. Be careful! Do not accidentally delete someone else's test results!
Loading the database tests:
|What kind of Computer? (OS/hardware)||Who ran the test?||How many warehouses?||Wall or battery power?||Run time||Start time||End time||rows loaded||rows per second||Comments|
|Macbook Pro OSX 10.6.4||Aaron||10||wall||1101 secs||11/2/10, 16:07:40||11/2/10, 16:26:01||5090197||4623||This was the first run|
|Macbook Pro OSX 10.6.4||Aaron||15||wall||1735 secs||11/2/10, 18:31:18||11/2/10, 19:00:13||7584690||4371||2nd run|
|Macbook Pro OSX 10.6.4||Aaron||30||wall||4280 secs||11/3/10, 22:41:18||11/3/10, 23:39:58||15068164||3520||3rd run|
|Ubuntu/Asus Eee PC||Thomas||128||wall||60148 secs||11/7/10, 15:25:31||11/8/10, 08:08:00||63979129||1063|
|Ubuntu/Asus Eee PC||Thomas||64||battery||11/11/10 and 11/12/10||battery died. went from 98% to 0%. tried twice; battery died both times|
|Ubuntu/Asus Eee PC||Thomas||64||wall||23288 secs||11/6/10, 23:19:06||11/7/10, 04:47:15||32032418||1375|
|Ubuntu/Asus Eee PC||Thomas||32||battery||11740 secs||11/10/10, 22:09:45||11/11/10, 01:25:25||16071757||1368|
|Ubuntu/Asus Eee PC||Thomas||32||wall||11684 secs||11/6/10, 19:27:04||11/6/10, 22:41:48||16071068||1375|
|Ubuntu/Asus Eee PC||Thomas||16||battery||5846 secs||11/10/10, 18:38:40||11/10/10, 20:16:07||8086644||1383|
|Ubuntu/Asus Eee PC||Thomas||16||wall||5966 secs||11/5/10, 22:46:32||11/6/10, 00:25:59||8083684||1354|
|Ubuntu/Asus Eee PC||Thomas||8||battery||3041 secs||11/9/10, 22:12:08||11/9/10, 23:02:50||4091721||1345|
|Ubuntu/Asus Eee PC||Thomas||8||wall||3025 secs||11/5/10, 21:20:55||11/5/10, 22:11:20||4091302||1352|
|Ubuntu/Asus Eee PC||Thomas||4||battery||2850 secs||11/9/10, 20:21:57||11/9/10, 21:09:27||2095565||735||netbook may have gone to sleep partway through this|
|Ubuntu/Asus Eee PC||Thomas||4||wall||1592 secs||11/5/10, 20:45:51||11/5/10, 21:12:24||2096072||1316|
|Ubuntu/Asus Eee PC||Thomas||3||wall||1143 secs||11/3/10, 22:55:16||11/3/10, 23:14:19||1597169||1397|
|Ubuntu/Asus Eee PC||Thomas||2||battery||893 secs||11/9/10, 19:56:39||11/9/10, 20:11:33||1098295||1229||without indices|
|Ubuntu/Asus Eee PC||Thomas||2||wall||792 secs||11/6/10, 19:08:40||11/6/10, 19:21:53||1098097||1386||without indices|
|Ubuntu/Asus Eee PC||Thomas||2||wall||889 secs||11/3/10, 21:10:21||11/3/10, 21:25:11||1098732||1235||with indices|
|Ubuntu/Asus Eee PC||Thomas||1||battery||425 secs||11/9/10, 19:45:49||11/9/10, 19:52:55||598692||1408||without indices|
|Ubuntu/Asus Eee PC||Thomas||1||wall||422 secs||11/6/10, 18.56:20||11/6/10, 19:03:23||599761||1421||without indices|
|Ubuntu/Asus Eee PC||Thomas||1||wall||475 secs||11/3/10, 20:54:40||11/3/10, 21:02:36||599152||1261||with indices|
|HP Pavillion Windows7||Cameron||20||wall||8651 secs||11/7/10, 12:03:05||11/7/10, 14:27:17||10082885||1165||with indices|
|HP Pavillion Windows 7||Cameron||1||wall||11/16/10||14:10:35||14:22:23||598420||845||no indices|
|HP Pavillion Windows 7||Cameron||2||wall||11/16/10||14:29:31||14:51:06||1097752||847||no indices|
|HP Pavillion Windows 7||Cameron||16||wall||11/15/10||20:21:40||22:48:12||8084855||919||no indices|
|HP Pavillion Windows 7||Cameron||32||wall||11/11/10||21:125:33||01:41:24||16071771||1047||no indices|
|HP Pavillion Windows 7||Cameron||64||wall||11/14/10||23:13:30||09:05:05||32030042||902||no indices|
|hp probook 4510s (Ubuntu)||Tony||10||wall||870 secs||11/1/10, 20:20:13||11/1/10, 20:34:44||5092056||5852||tables dropped before running command|
|hp probook 4510s (Ubuntu)||Tony||10||battery||892 secs||11/09/10, 11:06:18||11/09/10, 11:21:10||5092289||5708||tables dropped and battery takes a little longer than wall|
|hp probook 4510s (Ubuntu)||Tony||20||wall||1786 secs||11/09/10, 11:24:49||11/09/10, 11:54:35||10082034||5645||speed is not constant by increasing the number of warehouses|
|hp probook 4510s (Ubuntu)||Tony||20||battery||1819 secs||11/09/10, 11:57:48||11/09/10, 12:28:07||10079029||5540|
|hp probook 4510s (Ubuntu)||Tony||30||wall||2646 secs||11/09/10, 12:35:47||11/09/10, 13:19:54||15074045||5696|
|hp probook 4510s (Ubuntu)||Tony||1||wall||100 secs||11/11/10, 16:48:30||11/11/10, 16:50:11||599095||5990|
|hp probook 4510s (Ubuntu)||Tony||2||wall||183 secs||11/11/10, 16:51:07||11/11/10, 16:54:11||1097957||5999|
|hp probook 4510s (Ubuntu)||Tony||4||wall||353 secs||11/11/10, 16:55:21||11/11/10, 17:01:15||2095931||5937|
|hp probook 4510s (Ubuntu)||Tony||8||wall||696 secs||11/11/10, 17:02:07||11/11/10, 17:13:44||4092383||5879|
|hp probook 4510s (Ubuntu)||Tony||16||wall||1386 secs||11/11/10, 17:20:25||11/11/10, 17:43:32||8084345||5832|
|hp probook 4510s (Ubuntu)||Tony||32||wall||2774 secs||11/11/10, 17:44:54||11/11/10, 18:31:08||16068110||5792|
|hp probook 4510s (Ubuntu)||Tony||64||wall||5483 secs||11/11/10, 18:33:56||11/11/10, 20:05:20||32037935||5843|
|hp probook 4510s (Ubuntu)||Tony||128||wall||11267 secs||11/12/10, 00:48:57||11/12/10, 03:56:44||63974262||5678|
|Macbook Pro OSX 10.6.4||Aaron||1||wall||124 secs||11/15/10, 13:47:08||11/15/10, 13:49:13||5090197||4830||This is the 1st true test|
|Macbook Pro OSX 10.6.4||Aaron||2||wall||233 secs||11/15/10, 14:34:20||11/15/10, 14:38:14||1097710||4711|
|Macbook Pro OSX 10.6.4||Aaron||4||wall||447 secs||11/15/10, 15:20:30||11/15/10, 15:27:57||2096206||4689|
|Macbook Pro OSX 10.6.4||Aaron||8||wall||877 secs||11/15/10, 18:25:22||11/15/10, 18:38:59||4094963||4669|
|Macbook Pro OSX 10.6.4||Aaron||16||wall||1743 secs||11/15/10, 19:50:55||11/15/10, 20:19:59||8085881||4639|
|Macbook Pro OSX 10.6.4||Aaron||32||wall||3458 secs||11/15/10, 22:22:12||11/15/10, 23:19:51||16071063||4647|
Running the TPC-C-like benchmarks:
|What kind of computer?||OS used||DBMS used||Who ran the test?||How many warehouses?||Number of Terminals||Wall or battery power?||Total seconds to run test(Date?)||start time||end time||rows loaded(Running Ave tpmC:?)||rows per second(Errors(yes/no)?)||Comments|
|Macbook Pro||OSX 10.6.4||mySQL.sh||Aaron||1||2||Wall||11/15/10||13:57:01||14:07:11||tpmC 8.96||No errors|
|Macbook Pro||OSX 10.6.4||mySQL.sh||Aaron||1||10||Wall||11/15/10||14:21:19||14:32:20||tpmC 8.89||No errors||Unable to run 16 or 32 terminals. Recieved error stating invalid number of terminals so i ran 10|
|Macbook Pro||OSX 10.6.4||mySQL.sh||Aaron||2||2||Wall||11/15/10||14:50:03||15:00:28||tpmC 5.27||No errors|
|Macbook Pro||OSX 10.6.4||mySQL.sh||Aaron||2||16||Wall||11/15/10||15:03:29||15:15:00||tpmC 4.95||No errors|
|Macbook Pro||OSX 10.6.4||mySQL.sh||Aaron||4||2||Wall||11/15/10||16:19:28||16:29:55||tpmC 5.48||No errors|
|Macbook Pro||OSX 10.6.4||mySQL.sh||Aaron||4||16||Wall||11/15/10||17:42:43||17:56:56||tpmC 2.18||No errors|
|Macbook Pro||OSX 10.6.4||mySQL.sh||Aaron||4||32||Wall||11/15/10||17:59:16||18:17:33||tpmC 2.67||No errors|
|Macbook Pro||OSX 10.6.4||mySQL.sh||Aaron||8||2||Wall||??||18:42:07||18:52:48||tpmC 1.21||No errors|
|Macbook Pro||OSX 10.6.4||mySQL.sh||Aaron||8||16||Wall||11/15/10||18:56:17||19:15:12||tpmC 1.26||No errors|
|Macbook Pro||OSX 10.6.4||mySQL.sh||Aaron||8||32||Wall||11/15/10||19:17:38||19:44:50||tpmC 1.39||No errors|
|Macbook Pro||OSX 10.6.4||mySQL.sh||Aaron||16||2||Wall||11/15/10||20:24:18||20:37:14||tpmC 0.54||No errors|
|Macbook Pro||OSX 10.6.4||mySQL.sh||Aaron||16||16||Wall||11/15/10||20:41:19||21:10:59||tpmC 0.67||No errors|
|Macbook Pro||OSX 10.6.4||mySQL.sh||Aaron||16||32||Wall||11/15/10||21:19:51||22:12:34||tpmC 0.64||No errors|
|Macbook Pro||OSX 10.6.4||mySQL.sh||Aaron||32||2||Battery||11/16/10||09:36:59||14:07:11||tpmC 8.96||No errors|
Questions to answer:
Observed oddities during course of testing...
"Pathologies of Big Data", Communications of the ACM, August 2009. Read this by next week. You might need to be on a campus computer to access the full text of the article.
Transaction Processing Council -- we are going to try to get one of their benchmarks running.
Report on a TPC-H test someone ran on MonetDB. Included here to give you an idea of the kind of report we might want to generate. Their results are, now, old, so our results would be fresher.
PostgresQL - another free DBMS we could compare to MySQL.
Oracle 10g Express DBMS -- a free, starter DBMS product from the giant DBMS company, Oracle
BenchmarkSQL open source software for running TPC-C on a variety of DBMS products.
List of database benchmarking tools, plus DBMonster. On the list, the "Open Source Development Labs Database Test Suite", BenchmarkSQL, Bristlecone, Java TPC-W, TPCC-UVa and Hammerora look most promising. Sourceforge seems to have more free DBMS benchmarking tools, too.
Try to get a TPC benchmark running against MySQL on your netbooks. Any TPC benchmark will do.
Keep track of:
- Useful commands to the computer. Ex. "mysql -h hostname -u username -ppassword", for starting mysql up at the command line in Ubuntu.
- Useful web sites, useful software
- Points where installing or using software got hard/confusing, and how you got around the problem (if you do) -- this tells us where software is not ready for "Joe Public" to use it, which is valuable information for improving the software.