LargeDataSets DBMSTesting

From wiki
Jump to navigationJump to search

Spring 2011, Spring Research Project


Goal: draft of introduction and background sections to Dr Thomas by 5pm on Friday, 5/6. Draft of whole shebang (looked at and edited by ALL!) on Monday, 5/9.


First draft of student project report, summarizing all the work since January (whew!) due next week. Would be nice if it were ready by Monday so we can start editing, improving. Graphs of assorted run-times, too.

Random query generation... ???

template file might look like:

select count(*) from people where ethnicity like "$$";

[Hispanic, African-American, Native-American]


Code could look for the $$s, and use them as a command to read the next line, create an array of the values inside the []s (actually, might be easier just to list the values without the []s), and pick values randomly out of that array to generate X queries.


(Long gap between meetings due to Tapia conference work)

Due to time constraints, not using "income" field -- would like incomes to be skewed in a semi-realistic fashion and we don't have time to do that.

We also don't have ethnicity or job category yet, either.

To Do:

Code to generate random queries. Include all the queries that got one or more votes (below) -- with simplifications some of those queries will probably merge into one. Should leave us with around a dozen queries.

Note for experimenting on hopper. "users", typed into a command line, will tell you who else is logged into the machine at the same time you are. ("whoami", which I erroneously wrote instead of "users" earlier, reports which user name you are currently logged in as. Useful for sysadmins with numerous user names.)

Continue running data loading and 'mega-query' experiments. Carefully note timings. We'll try using hopper as a data-generation machine, to take advantage of speed.

Query from list below: How many adult men are in the database (over 18)?

Turn this into: How many [gender] are in the database with age greater than [random number between 10 and 70]?

When generating queries from above pattern, pick male/female 50/50, and the age to be the bottom of the range randomly.

Generalizing the sample queries y'all liked in this way, we can get classes of queries to generate.


Put your student research competition slides and final draft papers into the CSHomework Submission System, under the "instructor" name of ahpcrc.

Revising our relations for better fake-Census queries. New relations:

People - first name, last name, age, address, zip code, birth date, gender, income, job category, ethnicity

Zip Codes / Cities - zip code (primary key), city, state, latitude, longitude

Note that birthdate and age are a little redundant, but the Census form asks for both, so we're following their lead. Also, job category and ethnicity should be randomly selected from lists of the "usual" choices. The Bureau of Labor and Statistics should have a broad list of 20 or so job categories we can use.

To Do:

1) put lists of "reasonable" questions here, in one big list

2) "vote" for top 10-15 'best' questions

3) write code to re-do the random data generation, using new relations

4) write code to generate random queries, following the templates of the top 10-15 favorite queries. It would be convenient if we could control the mix of queries, so that we can select that 10% of the queries in a generated set should be based on Query A, and 5% on Query B, and 2% on Query C, etc.

5) start re-running the "big, bad query" tests on the new people relation data, using two hardware platforms (but still MySQL as our DBMS)

List of questions you made up:

1 How many adult men are in the database (over 18)? LC AR OC

2 How many adult women are in the database (over 18)? LC AR

3 Total work force (men + women) > 16 and below 65? LC OC

4 What is the average income in California? AR

5 How many unemployed people exist in the database?

6 Find the number or children (<16), adult (>18) and retired (>65) LC AR OC

7 How many Caucasians exist in the database? AR

8 How many Hispanics exist in the database? AR

MT - Suggestion. Replace the above two queries with "How many choose-your-ethnicity people are in the database?" for assorted values of "ethnicity."

9 Find the number of people from other minorities?

10 Which race has the highest income? AR

11 Find the number of households with more than 5 members. OC

12 Average household size per zipcode. OC

13 Average household size per state.

14 Average household size All.

15 Average household size per city. OC

16 Average Income per household per city AR

17 Average Income per household per zipcode AR

18 Average Income per household per state

19 Average Income per household All.

20 How many homes per zipcode. AR

21 How many homes per city.

22 How many homes per state.

23 How many people per zipcode LC AR OC

24 How many people in California LC AR OC

25 How many people in Stanislaus and San Joaquin county LC AR OC

26 List the population per county in California LC

27 How many people per city LC AR

28 How many people per state LC OC

29 How many people All. LC AR OC


Need to put experimental results so far here.

query: select fname,lname,age,address,city,phone,sex,state,zipcode,birthday,count(*) FROM people GROUP BY fname,lname,age,address,city,phone,sex,state,zipcode,birthday;

Done in Asus Eee PC 1000 ha netbook running Ubuntu 10.04

Specs: intel atom 1.66 GHz

memory 1.0 GB DDR 2 667 MHz

mysql server version 5.1

  Tuples                 Time
     1,000                  0.05 sec
    10,000                  0.32 sec
   100,000                  8.08 sec
 1,000,000          2  min 43.67 sec
 2,000,000          7  min 30.68 sec
 5,000,000         29  min 31.64 sec
10,000,000        472  min 33.72 sec = 7  hrs  52 min 33.72 sec
15,000,000        601  min 31.82 sec = 10 hrs  1  min 31.82 sec
20,000,000       1206  min 53.44 sec = 20 hrs  6  min 53.44 sec
25,000,000       need to start query again because netbook got unplugged.

Done in HP ProBook 4525s Laptop running Ubuntu 10.04


Processor: Athlon II dual core processor 2.2 Ghz

memory : 4.0 GB DDR 3 1066 MHz

mysql server version 5.1

  Tuples                 Time
     1,000                  0.02 sec
    10,000                  0.11 sec
   100,000                  3.11 sec
 1,000,000          1  min 14.60 sec
 2,000,000          3  min 24.00 sec
 5,000,000         12  min 46.20 sec
10,000,000        107  min 43.28 sec = 1  hrs 47  min 43.28 sec
20,000,000        pending
25,000,000        pending

For next Monday (mostly to be completed after Friday's Student Research Competition), write 10-20 queries (each) in English. Reasonable queries, such as any Regular Human might ask of (fake) Census data. City Planners, State Senators, that kind of thing.

Then translate the questions into SQL. So we will have a pool of possible SQL queries to consider for the "overwhelm with many queries" experiments.

We may need to modify the fake-Census schema, to make it more rich, for more interesting queries. Discuss that next Monday, after the SRC.

Fall 2010, November Group, Cameron, Tony, Aaron

Fall 2010, October Group, Carlos, Jesus, and Manuel

Fall 2010, September Group, Roberto, Juan, and Max


Final Reflection (Aaron):

Getting The DBMS working was not that difficult due to the work of previos students. I was the only one that was able to get the mySQL Benchmark working. I was able to collect data from our standardized tests with 1, 2, 4, 8, 16, 32 warehouses, as well as 2, 16 and 32 terminals for 10 min within the benchmark. I found that 10 minutes was not long enough when the amount of warehouses was above 8. The reason is that it would start, but before it would complete a couple of transactions the time would end. Even though the 10 minutes was up it would still take another min to 30 min for the benchmark to finalize the tasks. The length is dependent on the number of warehouses.

I wish i would have had time to test the same parameters on battery, but due to time restraints, I am forced to finish this at a later time. Another future project will be getting another benchmark up and running and running the same tests for comparison.

What I learned: I learned that mac OSX is much faster at loading warehouses than Windows, But is very similar to Linux.

I also learned that the benchmark works just fine on Mac OSX, but has problems on windows OS and Linux.

I learned that the more warehouses you load the slower the benchmark runs and the lower the ave. tpmC:.

I can't say that I just learned this since I have known this for a long time, but since we did experience it quite a bit i must mention it.

That no test or project goes as planned.

Over all I enjoyed this part of the project and plan on continuing my tests.

(Dr Thomas comments: Never too early -- or late -- to learn that no project ever goes as planned.)


Next time, need to use different columns to record results of benchmarks run. # terminals, transactions completed, etc.

Benchmarking GUI ran successfully on Mac OS (Aaron), froze on Ubuntu/Asus (Thomas), Ubuntu/HP (Tony), and Cameron's Windows box complained about missing the JDK.

Final Reflection (Tony):

Testing databases sure gave me some idea of how different ways of testing will give different results. For example, one might think that running something on battery power or running it on cord will give the same result since the computer is using the same amount of power continuously when is turned on, but it is not the case. When the computer is running on batter it tends to be slower than plugging it to the wall. Also the hardware embedded into the computer also helps to achieve faster results.

In order start working in this program there is many things to get started.

First the Benchmark testing program must be downloaded (unless you want to create your own).

Always read the readme file, it might be helpful when don't know what to do.

Make sure the paths the program is using the find java are correct (Most computers have different pathways and remember than windows uses '\' instead of '/').

Run the program by reading the readme file

Once you are running the Benchmark GUI, there are chances that it freezes just like my computer did.

Overall, very entertaining and fun!!!

(Dr Thomas comments: It would be interesting to look into more of why battery goes more slowly. Perhaps there are some not-so-well-documented aspects of the operating system that make this happen.)

Final Reflection (Cameron):

I was surprised at the performance between a windows laptop versus the other two. At first comparison both the apple laptop and mine had a 2.2 ghz dual core processor and 2 gigs of RAM. After running some tests and discussing the results with Aaron and Tony the Windows laptop seemed to take twice as much time as the others. So it appears the Windows fails to optimize database management. Unfortunately I wasn't able to do any further benchmarking because of a missing file. I am curious if there is a database software that performs better for windows than other OS's.

(Dr Thomas comments: Well, Microsoft's SQL Server might be an obvious DBMS to look at for good Windows performance...)



Compare load data results, wall vs battery, for the range of warehouse sizes 1, 2, 4, 8, 16, 32, 64... until battery battery goes poof before loading finishes (or you run out of hard drive to store the data).

For running the benchmark querying, vary the sizes of the warehouses. For each warehouse size, run with 2, 32 and 64 terminals. Run all tests for 10 minutes. Start out on wall power for all tests, try out battery if you have time / interest.

Keep adding results into the tables below.


We have (just barely) gotten BenchmarkSQL running on each of your machines. So, this week, we will run experiments in a scattershot, not systematic (or scientific) fashion. Just to see what we can do and where some of the performance limits are.

(Thanks to Cameron, we now know that the README file for BenchmarkSQL correctly lists that loadData ... numWarehouses=10 is correct on Windows. A space must be added in between the = and the number if running loadData on Ubuntu / Mac.)

We will make a more organized testing plan next week.

Record the results of the experiments here. Be careful! Do not accidentally delete someone else's test results!

Loading the database tests:

What kind of Computer? (OS/hardware)Who ran the test?How many warehouses?Wall or battery power?Run timeStart timeEnd timerows loadedrows per secondComments
Macbook Pro OSX 10.6.4Aaron10wall1101 secs11/2/10, 16:07:4011/2/10, 16:26:0150901974623This was the first run
Macbook Pro OSX 10.6.4Aaron15wall1735 secs11/2/10, 18:31:1811/2/10, 19:00:13758469043712nd run
Macbook Pro OSX 10.6.4Aaron30wall4280 secs11/3/10, 22:41:18 11/3/10, 23:39:581506816435203rd run
Ubuntu/Asus Eee PCThomas128wall60148 secs11/7/10, 15:25:3111/8/10, 08:08:00639791291063
Ubuntu/Asus Eee PCThomas64battery11/11/10 and 11/12/10battery died. went from 98% to 0%. tried twice; battery died both times
Ubuntu/Asus Eee PCThomas64wall23288 secs11/6/10, 23:19:0611/7/10, 04:47:15320324181375
Ubuntu/Asus Eee PCThomas32battery11740 secs11/10/10, 22:09:4511/11/10, 01:25:25160717571368
Ubuntu/Asus Eee PCThomas32wall11684 secs11/6/10, 19:27:0411/6/10, 22:41:48160710681375
Ubuntu/Asus Eee PCThomas16battery5846 secs11/10/10, 18:38:4011/10/10, 20:16:0780866441383
Ubuntu/Asus Eee PCThomas16wall5966 secs11/5/10, 22:46:3211/6/10, 00:25:5980836841354
Ubuntu/Asus Eee PCThomas8battery3041 secs11/9/10, 22:12:0811/9/10, 23:02:5040917211345
Ubuntu/Asus Eee PCThomas8wall3025 secs11/5/10, 21:20:5511/5/10, 22:11:2040913021352
Ubuntu/Asus Eee PCThomas4battery2850 secs11/9/10, 20:21:5711/9/10, 21:09:272095565735netbook may have gone to sleep partway through this
Ubuntu/Asus Eee PCThomas4wall1592 secs11/5/10, 20:45:5111/5/10, 21:12:2420960721316
Ubuntu/Asus Eee PCThomas3wall1143 secs11/3/10, 22:55:1611/3/10, 23:14:1915971691397
Ubuntu/Asus Eee PCThomas2battery893 secs11/9/10, 19:56:3911/9/10, 20:11:3310982951229without indices
Ubuntu/Asus Eee PCThomas2wall792 secs11/6/10, 19:08:4011/6/10, 19:21:5310980971386without indices
Ubuntu/Asus Eee PCThomas2wall889 secs11/3/10, 21:10:2111/3/10, 21:25:1110987321235with indices
Ubuntu/Asus Eee PCThomas1battery425 secs11/9/10, 19:45:4911/9/10, 19:52:555986921408without indices
Ubuntu/Asus Eee PCThomas1wall422 secs11/6/10, 18.56:2011/6/10, 19:03:235997611421without indices
Ubuntu/Asus Eee PCThomas1wall475 secs11/3/10, 20:54:4011/3/10, 21:02:365991521261with indices
HP Pavillion Windows7Cameron20wall8651 secs11/7/10, 12:03:0511/7/10, 14:27:17100828851165with indices
HP Pavillion Windows 7Cameron1wall11/16/1014:10:3514:22:23598420845no indices
HP Pavillion Windows 7Cameron2wall11/16/1014:29:3114:51:061097752847no indices
HP Pavillion Windows 7Cameron16wall11/15/1020:21:4022:48:128084855919no indices
HP Pavillion Windows 7Cameron32wall11/11/1021:125:3301:41:24160717711047no indices
HP Pavillion Windows 7Cameron64wall11/14/1023:13:3009:05:0532030042902no indices
hp probook 4510s (Ubuntu)Tony10wall870 secs11/1/10, 20:20:1311/1/10, 20:34:4450920565852 tables dropped before running command
hp probook 4510s (Ubuntu)Tony10battery892 secs11/09/10, 11:06:1811/09/10, 11:21:1050922895708 tables dropped and battery takes a little longer than wall
hp probook 4510s (Ubuntu)Tony20wall1786 secs11/09/10, 11:24:4911/09/10, 11:54:35100820345645 speed is not constant by increasing the number of warehouses
hp probook 4510s (Ubuntu)Tony20battery1819 secs11/09/10, 11:57:4811/09/10, 12:28:07100790295540
hp probook 4510s (Ubuntu)Tony30wall2646 secs11/09/10, 12:35:4711/09/10, 13:19:54150740455696
hp probook 4510s (Ubuntu)Tony1wall100 secs11/11/10, 16:48:3011/11/10, 16:50:115990955990
hp probook 4510s (Ubuntu)Tony2wall183 secs11/11/10, 16:51:0711/11/10, 16:54:1110979575999
hp probook 4510s (Ubuntu)Tony4wall353 secs11/11/10, 16:55:2111/11/10, 17:01:1520959315937
hp probook 4510s (Ubuntu)Tony8wall696 secs11/11/10, 17:02:0711/11/10, 17:13:4440923835879
hp probook 4510s (Ubuntu)Tony16wall1386 secs11/11/10, 17:20:2511/11/10, 17:43:3280843455832
hp probook 4510s (Ubuntu)Tony32wall2774 secs11/11/10, 17:44:5411/11/10, 18:31:08160681105792
hp probook 4510s (Ubuntu)Tony64wall5483 secs11/11/10, 18:33:5611/11/10, 20:05:20320379355843
hp probook 4510s (Ubuntu)Tony128wall11267 secs11/12/10, 00:48:5711/12/10, 03:56:44639742625678
Macbook Pro OSX 10.6.4Aaron1wall124 secs11/15/10, 13:47:0811/15/10, 13:49:1350901974830This is the 1st true test
Macbook Pro OSX 10.6.4Aaron2wall233 secs11/15/10, 14:34:2011/15/10, 14:38:1410977104711
Macbook Pro OSX 10.6.4Aaron4wall447 secs11/15/10, 15:20:3011/15/10, 15:27:5720962064689
Macbook Pro OSX 10.6.4Aaron8wall877 secs11/15/10, 18:25:2211/15/10, 18:38:5940949634669
Macbook Pro OSX 10.6.4Aaron16wall1743 secs11/15/10, 19:50:5511/15/10, 20:19:5980858814639
Macbook Pro OSX 10.6.4Aaron32wall3458 secs11/15/10, 22:22:1211/15/10, 23:19:51160710634647

Running the TPC-C-like benchmarks:

What kind of computer?OS usedDBMS usedWho ran the test?How many warehouses?Number of TerminalsWall or battery power?Total seconds to run test(Date?)start timeend timerows loaded(Running Ave tpmC:?)rows per second(Errors(yes/no)?)Comments
Macbook Pro OSX 10.6.4mySQL.shAaron12Wall11/15/1013:57:0114:07:11tpmC 8.96No errors
Macbook Pro OSX 10.6.4mySQL.shAaron110Wall11/15/1014:21:1914:32:20tpmC 8.89No errorsUnable to run 16 or 32 terminals. Recieved error stating invalid number of terminals so i ran 10
Macbook Pro OSX 10.6.4mySQL.shAaron22Wall11/15/1014:50:0315:00:28tpmC 5.27No errors
Macbook Pro OSX 10.6.4mySQL.shAaron216Wall11/15/1015:03:2915:15:00tpmC 4.95No errors
Macbook Pro OSX 10.6.4mySQL.shAaron42Wall11/15/1016:19:2816:29:55tpmC 5.48No errors
Macbook Pro OSX 10.6.4mySQL.shAaron416Wall11/15/1017:42:4317:56:56tpmC 2.18No errors
Macbook Pro OSX 10.6.4mySQL.shAaron432Wall11/15/1017:59:1618:17:33tpmC 2.67No errors
Macbook Pro OSX 10.6.4mySQL.shAaron82Wall??18:42:0718:52:48tpmC 1.21No errors
Macbook Pro OSX 10.6.4mySQL.shAaron816Wall11/15/1018:56:1719:15:12tpmC 1.26No errors
Macbook Pro OSX 10.6.4mySQL.shAaron832Wall11/15/1019:17:3819:44:50tpmC 1.39No errors
Macbook Pro OSX 10.6.4mySQL.shAaron162Wall11/15/1020:24:1820:37:14tpmC 0.54No errors
Macbook Pro OSX 10.6.4mySQL.shAaron1616Wall11/15/1020:41:1921:10:59tpmC 0.67No errors
Macbook Pro OSX 10.6.4mySQL.shAaron1632Wall11/15/1021:19:5122:12:34tpmC 0.64No errors
Macbook Pro OSX 10.6.4mySQL.shAaron322Battery11/16/1009:36:5914:07:11tpmC 8.96No errors

Questions to answer:

Observed oddities during course of testing...


"Pathologies of Big Data", Communications of the ACM, August 2009. Read this by next week. You might need to be on a campus computer to access the full text of the article.

Transaction Processing Council -- we are going to try to get one of their benchmarks running.

Report on a TPC-H test someone ran on MonetDB. Included here to give you an idea of the kind of report we might want to generate. Their results are, now, old, so our results would be fresher.

PostgresQL - another free DBMS we could compare to MySQL.

Oracle 10g Express DBMS -- a free, starter DBMS product from the giant DBMS company, Oracle

BenchmarkSQL open source software for running TPC-C on a variety of DBMS products.

List of database benchmarking tools, plus DBMonster. On the list, the "Open Source Development Labs Database Test Suite", BenchmarkSQL, Bristlecone, Java TPC-W, TPCC-UVa and Hammerora look most promising. Sourceforge seems to have more free DBMS benchmarking tools, too.

Try to get a TPC benchmark running against MySQL on your netbooks. Any TPC benchmark will do.

Keep track of:

  1. Useful commands to the computer. Ex. "mysql -h hostname -u username -ppassword", for starting mysql up at the command line in Ubuntu.
  2. Useful web sites, useful software
  3. Points where installing or using software got hard/confusing, and how you got around the problem (if you do) -- this tells us where software is not ready for "Joe Public" to use it, which is valuable information for improving the software.