LargeDataSets DBMSTesting 2011 12

From Wiki
Jump to: navigation, search

Nullius in verba (Motto of the Royal Society, the oldest "learned society for science" in the world. Roughly translated: "Trust no one's word.")

DBMS Performance Testing Group work

main AHPCRC wiki page

2012 Spring Performance Group

5/7/2012

Good first draft, by email, to Dr Thomas by midnight 5/16 (Reading Day). Will try for 48 hour turnaround on comments, and final draft due by 5/23 or 5/25 (depending on whether or not you're attending commencement on 5/24).

4/24/2012

Matt - bizarre behavior on the part of Postgres. There are some queries (not all) which run very rapidly on MySQL and take multiple minutes to run (same query!) on Postgres. Experimental results may be peculiar. But results are results, so generate 'em anyway.

Lue -- Generating a few sample graphs. Right hand side (really big data set) results not available.

Chris -- Ran more tests. Doesn't really have enough untainted data results to generate graphs. Confident can generate graphs.

By next week, OUTLINES of papers (list of section titles, sub-titles, sketches of what material will be covered where).

Abstract, introduction, background, experimental set-up, experimental results, discussion of experimental results, future work, conclusions, bibliography.

Your bibliographies may be dominated by web sites - MySQL web site, Postgres web site, BenchmarkSQL web site, discussion boards that answered oddball questions that came up during experiments, etc, etc.

Submit your work as posters for the Student Poster Celebration. Link on Dr Thomas's web page.

4/17/2012

Lue - running tests.

Matt - generating data. Has run tests on smaller data sets.

Chris - found flaw in BenchmarkSQL -- seems to be a bug in how Terminals are matched up with "warehouses". If more than one Terminal assigned to a warehouse, the terminals-past-first one fail at a high rate and test results are unreliable.

Goal: generate partial graphs of results for next week, so we can all look at them to see how things are shaping up.

4/3/2012

Matt: catching up. Taking longer to run query sets over new-fake-data, on MySQL. Will continue catching up.

Lue: Solved the mysterious slow Windows-side testing problem. Piping output to 'null' in the Windows OS is bizarrely slow. Why? Unknown. But piping output of MySQL queries into text files is definitely faster than piping to null, on the Windows side.

So Windows vs Mac tests can now be run, piping output into text files. Also, Lue wants to test piping output to local hard drive file vs. piping to file server over the network. (Be careful: other users on the network may affect performance of the latter tests in unpredictable ways.)

Chris: "Terminal" behavior in BenchmarkSQL slightly erratic. May require scrapping (again!  :< ) test results and trying again. Will do more tests, look at results to see how many / if / when Terminal 'dropping off' from participating in testing process.

3/27/2012

Chris: email me outline of testing plan for semester, or post plan to wiki. Proceed with first round of tests.

Plan: Every week, perform at least four separate tests, such that every number of terminals (from 1 to 10) is covered at least twice over a 5-week period (beginning with the seven days that start on the 27th). The data will be scrutinized not only for trends, but for irregularities.

Matt: regenerate, with newer fake-data generation code, new versions of already-run tests, and compare. Smallest, biggest, middle tests. If similar, might not need to re-run rest of intermediate tests. If different, need to re-run tests on more realistic fake data.

Lue: continue probing mac vs windows, but use two different query sets. One query set with only equality comparisons in the where clause, one query set with range (less than, greater than) comparisons in where clause.

3/13/2012

Lue's to do list: figure out why the peculiarly large speed difference between Win & Mac. We expect some difference between Mac & Win, but 3,000 seconds to run 20 queries is just goofy.

Matt's to do list: send Dr Thomas his To Do list by tomorrow. Finish some test runs of small query sets with larger data sets.

Chris's to do list: BenchmarkSQL now runs well enough to gather data. Time to start systematically running tests, to gather data.

3/6/2012

SRC on Friday. PPT draft outline -- Matt is emailing around outline. Meet Thursday at 2:30 ish for talk rehearsal. Each of you work on 3-ish slides.

Each of you create 1-2 "graph of results" (even if not much results yet) that you can have up while you explain what kind of testing you are focusing on.

Title slide done.

Lue: things that affect performance slide, What is performance slide (speed, reliability - metrics for measuring it)

Chris: why does DBMS performance matter? Data sets growing -- Google, genome data sets examples. What is a DBMS slide.

Matt: Defn of terms slide (MySQL is, PostgreSQL is, Mac OSX is...), conclusions/future work slide

Bibliography slide.

Draft of SRC paper, Feb 2012

Title: "Database Management Systems Performance Testing Using Several Different Approaches"

Dr Thomas will be logging on Wednesday night to edit what is here.

Matt will be here noon-2 on Wed and Friday. Works Thurs. Classes Wed / Fri morning. Lue will be here, in the CS lab, most of Thursday. Friday classes back-to-back. Chris will be around Thurs 1:45 - 3:30, Fri 1-2, 3+

Abstract (one paragraph)

Knowing how the performance of a database management system (DBMS) varies with factors such as computer hardware, the operating system (OS) the DBMS runs on, or concurrency (multiple users using the DBMS at the same time) is important: technology managers need this information to decide which DBMS is right for their needs. Which DBMS has the best performance under which conditions? To answer this question, our group looked at three different ways in which DBMS performance can be tested. The first approach runs the same DBMS, with the same sets of data, on the same computer, but under different OSs. The second approach compares two DBMSs (MySQL and PostgreSQL) on a single machine and operating system to see how their performance differs. Finally, the third approach examines what effect concurrency can have on a DBMS, with an analysis emphasis on factors other than CPU speed or memory size. Together, these approaches help us understand how both hardware and software affect the performance of a DBMS under different constraints.

(DONE BY MATT PLEASE EDIT AS YOU SEE FIT)

(MT: Hardware, operating systems are nouns, not names. No capitalization (unless starting a sentence). DBMS is an acronym, so capitalize. DBMS's run, their performance does not. "Knowing how the performance of a database management system (DBMS) will vary based on certain factors, like computer hardware (...), operating system (OS), or how concurrency can affect (AFFECT, not effect -- effect is a noun) performance is important. Technology managers need to know so they can decide what DBMS is right for their needs. Our group has decided to look at three different ways in which the performance of DBMS's can be tested. ..."

"The first approach will be to run the same DBMS..." Platform is not a name; don't capitalize. (For heavens sake. Capitalizing A? Running?) Performance differs, it doesn't correlate. Don't need to mention specific like Windows 7, Asus, Dell, etc, in the abstract. Save that for the experiments section(s).

"Finally, the third performance testing approach will examine what effect concurrency can have on a DBMS, with an analysis emphasis on factors other than CPU speed or memory size." )

Introduction

Database management systems (DBMS) are widely used in business and government to support every kind of information gathering and analysis humans engage in. For this reason, DBMS performance -- speed, ease of use, predictability -- is crucial, and must continuously improve in order to keep up with the increasing demands of industry.

For example, the first human genome was decoded in 2003, and the work took 10 years. In 2009, sequencing a single human genome took a month. The 1000 Genomes project (cite http://en.wikipedia.org/wiki/1000_Genomes_Project) aims to sequence 2,000 genomes. Similar explosions in the capacity to discover and store new information are occurring in many areas of our everyday lives, and much of that information is being stored in database management systems.

This semester, our research group is looking at different aspects of DBMS performance issues. We are exploring three different approaches to evaluating DBMS performance. The first approach is to compare the performance of one DBMS software package, the popular freeware DBMS MySQL (cite here), on two different operating systems, Windows 7 and Mac OS X. The second approach is to compare performance changes between MySQL and another popular, freeware DBMS, Postgres (cite here), on one operating system, Windows 7. The third approach is to test the effect of changes in the number of simultaneous users on MySQL installed on Windows 7. We will discuss each of these approaches in more detail below.

Background

Leave blank for now, until we see what information is getting presented multiple times in your separate subsections

In order to efficiently test DBMS performance, we require easily accessed large data sets. We wrote a Java program to randomly generate fake data, or tuples, that represent a small portion of the census data set (appendix: fake data). We also extended the Java program to randomly generate simple and complex DBMS queries (appendix: queries).
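The generator itself is the Java program described above; as an illustration only (the field names ID, Income, IsMale, LastName, and Age come from the draft's description of the test table, while the name pool and value ranges here are invented), the idea can be sketched in Python:

```python
import random

# Hypothetical last-name pool; the real generator's name list is not shown in this draft.
LAST_NAMES = ["Smith", "Johnson", "Yang", "Garcia", "Lee"]

def generate_fake_census(n, seed=None):
    """Generate n complete fake census tuples: (ID, Income, IsMale, LastName, Age)."""
    rng = random.Random(seed)
    rows = []
    for i in range(1, n + 1):
        rows.append((
            i,                          # ID: unique, sequential
            rng.randint(0, 200000),     # Income (invented range)
            rng.choice([0, 1]),         # IsMale as 0/1
            rng.choice(LAST_NAMES),     # LastName
            rng.randint(0, 100),        # Age (invented range)
        ))
    return rows

rows = generate_fake_census(10000, seed=42)
```

Because the rows are generated rather than sampled, every record is complete, with no null or missing values.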

Comparing MySQL Performance on Different Operating Systems

Although DBMS performance has improved vastly over the past decades, it can still be affected by the environment - the machine and its operating system - the DBMS is executed on. This leads us to comparing one DBMS across different operating systems. Operating systems are important system software on computers; they are the software that runs and schedules software jobs on a computer. There are different types of operating systems. Some are built for specific applications; for example, FreeBSD, NetBSD, and OpenBSD are operating systems commonly used on web servers. (cite wiki: http://en.wikipedia.org/wiki/Operating_system) Others, such as Mac OS and the Microsoft Windows family, are personal computer operating systems. These differences imply that operating systems are likely to influence DBMS performance. Naturally, comparing one DBMS on different OSs is one approach to evaluating DBMS performance.


The computer that we used as the testing platform is a dual-booted iMac with an Intel Core i5 @ 2.5 GHz and 4 GB of RAM. A dual-booted machine is a machine that has two operating systems installed on it, and either one can be booted when starting the computer. We chose MySQL version 5.5.15 as the test subject, on two operating systems: 32-bit Windows 7 Enterprise (Service Pack 1) and Mac OS X version 10.6.8.

(MT: good to explain dual-boot to the newbie readers)

With the testing machine set up and ready, we started by generating the large fake census data sets: 10,000 tuples, 100,000 tuples, 1,000,000 tuples, 10,000,000 tuples, and 100,000,000 tuples (see figure). In terms of file size, these data sets are 193 KB, 1.9 MB, 19 MB, 189 MB, and 4.78 GB, respectively. These data sets may not be very large in comparison to files you may have seen, but generating the 100,000,000 tuples of fake census data (roughly 5 GB) took more than 10 hours. We hope to generate even larger data sets in the future, but, for now, these data sets suit our purpose.


Next, we inserted the data sets into a table, called "fakecensus", in the MySQL database using scripts. Scripts are small programs written for command interpreters: the terminal (Mac OS X) and the command prompt (Windows 7). In the scripts, we ask the command interpreter for the "time" before inserting the data set into MySQL and again after the data insertion is completed. The difference between the end time and the start time is a good measure of MySQL performance, since both operating systems run on the same iMac hardware with the same clock speed, the speed at which a microprocessor executes instructions (cite? http://www.webopedia.com/TERM/C/clock_speed.html). We repeated this process three times on each of Mac OS X and Windows 7 to see if the last two sets of results are comparable to the first set. It turns out that the last two trials had better (shorter) times.
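The actual measurements come from the OS-level scripts described above; the same start/end-time bracketing can be sketched in Python, using SQLite as a stand-in for MySQL since the real load commands are not reproduced in this draft:

```python
import sqlite3
import time

# SQLite stand-in for the MySQL "fakecensus" table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fakecensus (ID INTEGER, Income INTEGER, "
             "IsMale INTEGER, LastName TEXT, Age INTEGER)")

rows = [(i, 50000, i % 2, "Smith", 30) for i in range(1, 100001)]

start = time.perf_counter()              # ask for the "time" before insertion
conn.executemany("INSERT INTO fakecensus VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()
elapsed = time.perf_counter() - start    # end time minus start time

print(f"Inserted {len(rows)} tuples in {elapsed:.3f} s")
```

The real scripts do the same thing at the shell level, bracketing the MySQL load with the interpreter's time/date facility.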


Our next task is to feed MySQL large sets of simple queries. Earlier, we were concerned with MySQL's performance in creating and building a table in the DBMS; here we intend to gauge MySQL's performance in retrieving data from the "fakecensus" table. There are various ways in which we can query MySQL for answers. For the sake of evaluating performance, we will feed 10,000, 100,000, 1,000,000, and then 10,000,000 randomly generated queries to the "fakecensus" table containing 100,000,000 tuples. Again, the "time" it takes MySQL to answer the queries is noted down as described above. As a side note, it is extremely important to "drop", "create", and reinsert the large data sets into the "fakecensus" table before each querying run; this eliminates sources of error and sets up a standard condition for all querying tests.
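As a sketch of what "randomly generated queries" might look like (the real generator is the Java program mentioned earlier; these query templates and value ranges are invented for illustration, and SQLite stands in for MySQL):

```python
import random
import sqlite3
import time

rng = random.Random(0)

def random_query():
    # Hypothetical template: an equality test on a randomly chosen field/value.
    field, lo, hi = rng.choice([("Income", 0, 200000), ("Age", 0, 100)])
    return f"SELECT COUNT(*) FROM fakecensus WHERE {field} = {rng.randint(lo, hi)}"

# Small stand-in table so the queries can actually run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fakecensus (ID INTEGER, Income INTEGER, "
             "IsMale INTEGER, LastName TEXT, Age INTEGER)")
conn.executemany("INSERT INTO fakecensus VALUES (?, ?, ?, ?, ?)",
                 [(i, rng.randint(0, 200000), i % 2, "Lee", rng.randint(0, 100))
                  for i in range(1000)])

queries = [random_query() for _ in range(100)]
start = time.perf_counter()
for q in queries:
    conn.execute(q).fetchone()
elapsed = time.perf_counter() - start
```

In the real tests the query batch is piped to the MySQL client and the whole batch is timed, rather than timing queries one at a time.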

Performance comparison for MySQL and Postgres on the same platform and operating system

(----Matt, maybe you should write a sentence or two to introduce your approach. Why compare two DBMSs? Could there be a difference in performance?) The second approach our team decided to take was testing MySQL and Postgres on the same platform and operating system. The system specifications include a Dell Precision 490 with an Intel Xeon 5150 @ 2.66 GHz and 6 GB of RAM, running the Windows 7 Professional (64-bit) operating system. The version of MySQL is 5.6.4-m7 and Postgres is 9.1. The initial test data was a 100,000,000-tuple table, consisting of five fields (ID, Income, IsMale, LastName, and Age) that we generated using the same program that generated "fakecensus," above. By generating the data ourselves we ensured every record was complete, with no null or missing values.

(MT: I'm deleting the explanations of 32- and 64-bit. We just haven't got space to explain it properly to the judges.)

Though the initial intent was purely to test the performance of MySQL and Postgres individually, we also want to note the installation process. MySQL installed with relative ease, and we were able to use the DBMS fairly quickly afterward. On the other hand, there were initial problems installing Postgres. Though Postgres does have an installation manager, we had to uninstall and reinstall the Postgres DBMS multiple times. Once a correct installation was accomplished, connecting to the database was the next challenge to overcome. After these initial obstacles both DBMSs were configured, but due to these challenges and our time constraints we have not yet been able to compare the DBMSs' performance.

The approach we will take once the fake census data has been successfully loaded into both MySQL and Postgres will be to start timing DBMS performance on simple queries. This set of simple queries will use aggregate functions (functions that collect many rows of input data into one row of output), including count, average, min, and max. Once these are tested, the next step will be to add conditions to the queries while still using the aggregate functions. This increases the complexity of the queries and may force the two DBMSs to compute the same results in different ways. We will need to reload the data in between test runs over our queries, to ensure that results from previous queries are not simply stored in memory by the DBMS. Finally, we will test DBMS performance over a set of nested queries (a query within a query) as a final way to increase the level of complexity; the purpose of this test is to strain DBMS performance, since nested queries are known to be difficult for a DBMS to analyze when it is looking for the fastest way to run a query.
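For concreteness, the three escalating query shapes (plain aggregates, aggregates with conditions, and nested queries) might look like the following. These are illustrative queries against the fakecensus fields, run here against SQLite with a tiny hand-made table rather than the real test data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fakecensus (ID INTEGER, Income INTEGER, "
             "IsMale INTEGER, LastName TEXT, Age INTEGER)")
conn.executemany("INSERT INTO fakecensus VALUES (?, ?, ?, ?, ?)",
                 [(1, 40000, 1, "Smith", 30),
                  (2, 60000, 0, "Yang", 45),
                  (3, 80000, 1, "Lee", 62)])

# Step 1: plain aggregate functions.
simple = "SELECT COUNT(*), AVG(Income), MIN(Age), MAX(Age) FROM fakecensus"

# Step 2: aggregates plus a condition in the WHERE clause.
conditional = "SELECT AVG(Income) FROM fakecensus WHERE Age > 40"

# Step 3: a nested query (a query within a query).
nested = ("SELECT COUNT(*) FROM fakecensus "
          "WHERE Income > (SELECT AVG(Income) FROM fakecensus)")

print(conn.execute(simple).fetchone())
print(conn.execute(conditional).fetchone())
print(conn.execute(nested).fetchone())
```

The same SQL text would be fed to MySQL and Postgres in the actual tests; only the timing harness around it differs.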

Performance on a concurrent workload

//Shortened Revision

Some devices used as DBMS servers will be machines designed for things other than raw performance, such as low cost, low power usage, and portability. How will a DBMS handle these situations when expected to respond to many simultaneous queries?

To answer that question, we have chosen to simulate the scenario on a particularly low-end machine, the ASUS EEE netbook. It uses an Intel Atom N280 processor with a speed of 1.66 GHz and 1 GB of RAM, running Windows 7 32-bit Enterprise Edition. This is reasonably similar to running hardware that has become obsolescent for its workload in terms of raw performance. Even high-end hardware will find itself over-matched in this manner as its data grows, if upgrades cannot easily be afforded, as they often cannot. DBMS software may not be well optimized for this situation.

A computer can service one query at a time trivially, but people do not wait patiently unless they must, so concurrency (serving users at the same time) is important. It is also much trickier: queries can interfere in many ways, possibly leading to false results, or even to destroyed data. DBMSs take measures to prevent corruption and incorrect results, but these measures can degrade performance by more than the increase in workload alone would predict.

To simulate concurrent users and their transactions, we will utilize a free program called BenchmarkSQL (cite http://sourceforge.net/projects/benchmarksql/). BenchmarkSQL is inspired by the TPC-C (cite http://www.tpc.org) test, and employs a similar configuration. To simulate a reasonable workload, it creates nine different tables. It varies the kind of workload, the number of terminals, and either the number of queries or the length of use per terminal. It records tpmC (transactions per minute), start time, finish time, and number of transactions; the number of errors is also available. BenchmarkSQL has little overhead, and should not notably affect results. For reported results, no other programs will be running. The number of concurrent terminals and the type of workload are expected to affect results, and will be varied separately.
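Since tpmC is just transactions per minute, it can be recovered from BenchmarkSQL's recorded start time, finish time, and transaction count. The arithmetic, with made-up numbers:

```python
# tpmC (transactions per minute) from BenchmarkSQL's recorded fields.
# All numbers below are hypothetical, for illustration only.
transactions = 4500
start_seconds = 0.0
finish_seconds = 300.0          # a five-minute run

elapsed_minutes = (finish_seconds - start_seconds) / 60.0
tpmC = transactions / elapsed_minutes
print(tpmC)
```

(Strictly, TPC-C's tpmC counts only new-order transactions; the draft uses the looser "transactions per minute" reading, which is what is computed here.)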





//Before Revision is below

Not every device relied upon for use as a host for a DBMS will be a high end machine. In fact, in some places the machines used are more likely to be at the low end spectrum. Other scenarios to be considered include machines optimized for reasons other than performance, including price, low power usage, and portability. How will the DBMS handle scenarios like these, as they respond to many simultaneous queries?

To answer that question, we have chosen to simulate that scenario on a considerably low end computer, the Asus EEE netbook. This platform has an Intel Atom N280 processor with a speed of 1.66 GHz, and 1 GB of RAM, running Windows 7 (32 bit) Enterprise edition. This is reasonably similar to having a new database on hardware that is obsolescent. With the rate at which available data and the necessity to manage that data increase, hardware considered high end that is now outdated will be unsuited for the role in which it may find itself. Especially where upgrades are a luxury that cannot be afforded. While it is perhaps reasonable for the designers to not have taken this into consideration as much as on higher end devices, how well they do here still matters.

It can be relatively trivial for a computer to service one query at a time, but people do not wish to wait in line for their turn, especially if there is a large number of queries before theirs. Serving users concurrently (at the same time) is quite a bit trickier. Queries might interfere with one another in a number of ways. This could lead to incorrect results, up to and including destroying the very data with which they are working. DBMSs take measures to prevent data corruption and incorrect query results, but these measures could easily cause performance to degrade faster than the number of simultaneous queries grows, if the DBMS is not carefully coded. Additionally, most programmers are used to writing code that executes sequentially, and so may produce incorrect concurrency code. There are sufficient tests to ensure concurrency results of a major DBMS are correct, but how will this affect performance?

(MT: drop supralinear, just in case. My simple dictionary doesn't have it.)

In order to simulate concurrent users (users querying the system at the exact same time) and their transactions, we will be utilize a free program called BenchmarkSQL (cite http://sourceforge.net/projects/benchmarksql/). BenchmarkSQL is inspired by the TPC-C (cite http://www.tpc.org/) test, and employs a similar configuration. To simulate a reasonable workload it creates 9 tables: warehouse, district, customer, history, oorder, new_order, order_line, stock, & item. A number of options may be selected, including number of terminals, number of warehouses (how big is the database), transactions per terminal (or per minute), what percentage of transactions are payment, order-status, delivery, and stock-level (new orders are equal to whatever percentage remains). It records data such as the options used to run it, tpmC (transactions per minute), start time, finish time, and total transactions. BenchmarkSQL has only a small amount of overhead, and should not notably affect the results. For the reported results, no other programs will be running.

For the testing, number of concurrent terminals and type of workload are the primary variables. Performance is expected to vary with both of them.

Preliminary Results

Due to complexities in setting up our testing stations for the three different approaches to comparing DBMS performance, we have few results from the tests so far.

As a sample of what we hope to do over the rest of the semester, we have run three trials of the operating system comparison approach on each of the Windows 7 and Mac OS X operating systems, and calculated the average time it takes each operating system to insert the fake census tuples into MySQL.

(Lue's mac vs win chart here)

Brief commentary on chart: In the graph (figure #), the overall MySQL performance on Windows 7 and Mac OS X can be seen. Based on the data and the graph, the performance of MySQL on Windows 7 is better than on Mac OS X for smaller data sets (10,000 to 1,000,000 tuples). However, for large data sets, the performance of MySQL is significantly better on Mac OS X than on Windows 7. These differences in MySQL performance are likely due to the various algorithms, data structures, and techniques used in each operating system. We will analyze the data and operating systems to determine why MySQL performs better on one operating system and not the other.


(Matts MySQL Vs. PostgreSQL Results)

To better visualize our results we used a graph (figure #) of number of records vs. time in seconds. The graph clearly shows that PostgreSQL performs better with larger data sets than MySQL does. The trend line produced by PostgreSQL looks linear, and its results are very close to linear. MySQL starts out looking more logarithmic and then straightens into what looks like a line, but its results are not at all close to linear. With smaller data sets, around 100,000 records or less, MySQL seems to perform better. Thus, in the MySQL vs. PostgreSQL performance testing, PostgreSQL outperforms MySQL on larger data sets.

Conclusions and Future Work

In this small-scale DBMS testing research project, we have already seen some indications of the differences in performance among DBMSs. DBMSs come in many shapes and sizes, depending on the purpose a DBMS is used for and where the DBMS is actually installed. As we have established, DBMSs are used everywhere. Thus, improving DBMS performance on the machines that host them is of great importance.

Database management system performance is strongly affected by the operating system, the particular DBMS installed, the computer hardware used, and the number of users being supported by the DBMS.


Future Work:

Lue

We wish to gather performance data when querying MySQL, to see how it compares to the data we gathered by inserting large data sets into MySQL. From the results on the graph, there seems to be a gap between 10 million and 100 million tuples. This gap requires closer scrutiny, so that we can correctly evaluate MySQL performance on the Windows 7 and Mac OS X operating systems. After this initial DBMS performance testing, we hope to use real census data so we can truly analyze DBMS performance and draw our research to a conclusion.

Acknowledgements

We thank the Army High Performance Research Grant for their support of our research.



Works Cited

[1] Oracle Corp. (2012). MySQL :: The world's most popular open source database. Retrieved February 26, 2012, from MySQL: http://www.mysql.com/

[2] Microsoft Corporation. (2012). Database Management | Data Mining & Warehousing | Microsoft SQL Server. Retrieved February 26, 2012, from Microsoft SQL Server: http://www.microsoft.com/sqlserver/en/us/default.aspx

[3] PostgreSQL Global Development Group. (2012). PostgreSQL: Welcome. Retrieved February 26, 2012, from PostgreSQL: http://www.postgresql.org/

[4] 1000 Genomes Project - Wikipedia, the free encyclopedia. (2012). Retrieved February 26, 2012, from Wikipedia: http://en.wikipedia.org/wiki/1000_Genomes_Project

[5] Geeknet, Inc. (2012). BenchmarkSQL | Free System Administration software downloads at SourceForge.net. Retrieved February 26, 2012, from SourceForge: http://sourceforge.net/projects/benchmarksql/

[6] Operating system - Wikipedia, the free encyclopedia. (2012). Retrieved February 26, 2012, from Wikipedia: http://en.wikipedia.org/wiki/Operating_system

[7] Clock speed - Webopedia. (2012). Retrieved February 26, 2012, from Webopedia: http://www.webopedia.com/TERM/C/clock_speed.html

Old Draft Material, Kept in case we want it later

//Not Ready for prime time

Performance on a concurrent workload

Not every device relied upon for use as a DBMS host will be a high end machine. In fact, in some places the machines used are likely to be quite low end. How does DBMS software handle this scenario? Specifically, how does MySQL handle the scenario where it is hosted on a machine optimized for factors other than performance, such as price, low power usage, and portability, but still needs to respond to many simultaneous queries?

(---No I's, first person. Use group reference, "we") (Give the model of your low end computer and specification at the beginning.) To answer that question, I have chosen to simulate that scenario on a very low end computer, the EEE netbook. It has an Intel Atom N280 processor at 1.66 gigaherz, and 1 gigabyte of RAM, running 32 bit Windows. This is reasonably similar to having a new database on hardware that is becoming obsolescent; with the rate at which data available and necessary to manage is increasing, much previously decent hardware will be unsuited for the role in which it will be necessary find itself, especially in the poorer regions where upgrades are a luxury that cannot be afforded. While it is perhaps reasonable for the designers to not have taken this into consideration as much as on higher end devices, how well they do here still matters.

(---Why do you want to simulate concurrent users? Why is DBMS performance important for concurreny?) It is relatively trivial for a computer to service a single request at a time, but people do not wish to wait in line to have their turn come up, especially since it may be behind a very large number of other people. Serving multiple users at the same time is quite a bit trickier than one, however, since they might interfere with each other in a number of ways, all the way through giving out incorrect results, up to and including destroying the very data with which they are working. Preventing that brings a measure of performance degradation that could easily be supralinear with the number of simultaneous queries, if not carefully coded. Additionally, most programmers are used to writing code that executes serially, and so may produce incorrect simultaneous code. There are sufficient tests and checks to be relatively sure that the results of a major dbms will be correct, but will the performance be what it could be? In order to simulate concurrent users (users querying the system at the exact same time) and their transactions, a free program called BenchmarkSQL is utilized. Benchmark SQL is inspired by the TPC-C test, and uses a similar setup thereunto. This is used to simulate a reasonable workload. It creates 9 tables: warehouse, district, customer, history, oorder, new_order, order_line, stock, & item. A number of options may be selected; ones that are used include Number of Terminals, Number of Warehouses (how big is the database), transactions per terminal (or by minute), and what percentage of transactions are payment, order-status, delivery, and stock-level (new orders are equal to whatever percentage remains). It records data such as the options used to run it, tpmC, start time, finish time, and total transactions. BenchmarkSQL is only a small amount of overhead, and should not notably affect the results. 
For the reported results, no other programs will be running.


For the testing, number of concurrent terminals and type of workload are the primary variables. Performance is expected to vary with both of them.

(MT, Wed night: 'in some places, for example, in developing countries, machines used...' And I don't know that netbooks are "inferior," they are just built with different priorities. Lower cost, the ability to run reliably with intermittent power supplies, lower weight. Your choice, I suppose.)


Edited copy as I'm not sure if Chris still needs to change anything I am not going to delete his original. I'll let him.

2/14/12

Chris: Percona software seems to be UNIX specific. So that is a no-go on his Windows machine. Try out BenchmarkSQL, most recent version. Make plans (with or without BenchmarkSQL, depending on how that goes) and meet with Dr Thomas on Thursday to discuss them. (Update 2/16/2012: BenchmarkSQL is thiiiiis close to working. Will continue getting it to go.)

Lue: Write random-query with random-data-fields-in-query generation code for Top 5 queries. Continue wrestling with MySQL installation in computer lab, so that will be able to do Windows vs. Mac testing after queries ready to roll.

Matt: Unable to attend meeting. Will coordinate via email. Might want to talk to Lue, since likely that both of you will be using some kind of random-query-generation code.

Student Research Competition deadline next Thursday!! Each student must draft a one-page summary of plan and experimental approach. We'll have to work on the abstract, intro, and conclusions to glue the parts together, next Tuesday.

2/7/12

Matt: MySQL installed, Postgres needs to be. Plan for week, get loading of data working on both DBMSs, start testing on smaller end of data set scale. Bring list of 3+ sample queries for testing to next meeting.

Chris: look into DBT1, DBT3, DBT2 benchmarks. Definitions, vocabulary, etc. Percona benchmarks might allow Chris to run num-of-user comparisons fairly easily... if he can get them working.

Percona benchmarking software seems to run open source versions of the (trademarked) TPC benchmarks.

Lue: Work on generating all the fake-census data sets necessary for testing plans. Work on writing sets of SQL queries for querying part of testing plans. Try trial runs on netbook computer (Windows/MySQL) before trying installation in CS lab.

1/31/12

For next week: each student writes out a testing plan for your individual testing goals. We have OS/OS/DBMS-same testing, and query optimizer effectiveness and/or concurrency scaling testing.

10/31/11

Ball, Cardoza, Davis, Yang group begins.

Start by reading: "Pathologies of Big Data", Communications of the ACM, August 2009. Read this by next week. You might need to be on a campus computer to access the full text of the article.

"Getting our feet wet" assignment: Start/finish job of installing MySQL and getting to the point of being able to load data from a plain text file and time how long loading takes, using simple 'date' reports from the OS. If possible, run the code written by previous groups to generate fake census sets of sample sizes and time how long they take to load.

10/10/11

Sam's script for inserting files full of tuples into a MySQL DBMS and timing the insertion, on a Unix-based OS. Might need to run "chmod ogu+x scriptname.script" at the command line before the script will work. (Changes permissions on the script file, telling the OS to make it executable.)
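For anyone setting this up fresh, a minimal sketch of the 'date'-based timing the earlier assignment calls for. The sleep is a stand-in: swap in your actual mysql load command, since the user, database, table, and file names vary by machine.

```shell
# Time a command using plain 'date' stamps.
start=$(date +%s)
sleep 1   # stand-in for e.g.: mysql -uroot test -e "load data local infile 'tuples.txt' into table fakecensus fields terminated by ','"
end=$(date +%s)
echo "load took $((end - start)) seconds"
```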

10/5/11

A helpful page for MySQL can be found at http://dev.mysql.com/doc/refman/5.6/en/index.html

New files for generating fake Census data. I think I've stored the most recent versions. I renamed the GenerateData code file to reflect the students currently working on it. I also renamed Jose's "Test" to be "LastNameBGPJ" -- a name which reflects the purpose of the class and the people working on the code.

GenerateDataBGPJ.java, LastNameBGPJ.java, ExpDistribBGPJ.java, and, to support the generation of last names (Names are copied from http://names.mongabay.com/data/1000.html), lastNames1-250.txt, lastNames251-500.txt, lastNames501-750.txt, lastNames751-1000.txt. Lists of names are courtesy of Mongabay.com.

New SQL create table statement to match above text file:

drop table fakecensus;

create table fakecensus ( identifier int auto_increment, gender char(1), income int, lastname char(50), age int, primary key (identifier) );

load data local infile '/Users/wherever/whatever.txt' into table fakecensus fields terminated by ',';

Notes: There must be a comma at the front of each line, so that the auto_increment identifier field is generated properly.
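With that leading comma, a couple of lines of the input text file would look like this (values made up; blank identifier first, then gender, income, lastname, age):

```
,M,52000,Smith,34
,F,48000,Jones,29
```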


From Jose: If you are having trouble with MySQL, there is a webpage that can probably explain how to work with SQL: http://dev.mysql.com/doc/refman/5.6/en/loading-tables.html

9/28/11

create table fakecensus ( identifier int auto_increment, is_male bit, income int, lastname char(50), age int, primary key (identifier))

Hardware Details:

Baer: Custom built Desktop; Processor - AMD Phenom II X6 1100T 3.7GHz - 12GB DDR3 1600 RAM; Ubuntu 11.04 64-bit

Jose: Model - HP Pavilion g6 Notebook PC; Processor - Intel(R) Core(TM) i3 CPU M380 @ 2.53 GHz; RAM - 4.00 GB; 64-bit OS

Robert: Model- ASUS Eee PC; Processor- Intel(R) Atom(TM) CPU N270 @1.60GHz; RAM - 1.00GB; 32-bit OS

Armando Garcia: Model - Macbook Air; Processor - 1.6 GHz Intel Core i5; RAM - 4.00GB; 64-bit OS

Test Results:

Num. tuples    Baer           Garcia          Perez           Jimenez
100,000        <1 second      1 second        2 seconds       5 seconds
1,000,000      2 seconds      11 seconds      27 seconds      63 seconds
10,000,000     23 seconds     110 seconds     259 seconds     712 seconds
100,000,000    275 seconds    1097 seconds    2681 seconds    7973 seconds

Note: My computer heated up after a while, so the last test might have been slower than it should have been. -Robert J.

9/21/11

Baer, Garcia, Perez, Jimenez group begins.

Start by reading: "Pathologies of Big Data", Communications of the ACM, August 2009. Read this by next week. You might need to be on a campus computer to access the full text of the article.

9/15/11

'Warehouse-Scale Computing' talk by Luiz Barroso of Google, June 2011 Click on the Table of Contents tab to see the plenary talks, at the top. The MP4 videos are big files, so download with care. (If that link doesn't work, try this link to the Barroso talk.) A talk about all the challenges that come up when trying to do really, really, we're not kidding now, MASSIVE computing.

The Watson/DeepQA talk, also on the same page (first link) as the Barroso talk, is also good, but cuts off after 35 minutes. As far as the recorded part of the talk went, it is a nice survey of some of the (human) language and AI challenges Watson's team had to deal with.

9/8/11

After Dr Thomas puts up a program to print out/generate files of tuples, everyone will run timing tests on own MySQL installations, timing how long it takes MySQL to load tuples.

Time loading tests for: 100,000 tuples, 1,000,000 tuples, 10,000,000 tuples... go up by a power of 10 each time.

Please insert a brief description of your hardware / OS here: (Chris) The computer I'm using for benchmarking uses an Atom processor of unknown speed and has 1 GB of memory. I am unsure whether it is dual core or simply hyperthreaded, but I suspect the latter. I'm running a 32-bit version of Windows 7. (/Chris)

(Matt)

The computer I'm benchmarking with is Windows 7 x64 with dual core Intel Xeon 5150 @ 2.66 GHz with 6 GB memory.

(/Matt)

(Lon)

TOSHIBA Satellite C655D,
AMD Athlon(tm) P320, Dual-Core Processor, 2.10 GHz
4.00 GB RAM
Windows 7 Home Premium, 64-bit Operating System, Service Pack 1

(/Lon)

Test results:

Num. tuples    Chris           Emmanuel        Lon    Matt
100,000        6 seconds       3 seconds
1,000,000      64 seconds      26 seconds
10,000,000     613 seconds     184 seconds
100,000,000    7042 seconds    2146 seconds

Source code gluing together all of the group's methods. The program prints directly to the Terminal/Xterm/Command Prompt window. For debugging, that might be what you want. If you want to create a plain text file full of tuples, just use 'java GenerateDataCELM 10 > fileOf10tuples.txt'. (Replace the 10 with however many tuples you want, and name the whatever.txt file whatever you want.)

Since we want MySQL to auto-generate the identifier, I suspect the given code may be buggy. I do not, offhand, know if MySQL will want the auto-generated column to be present, but blank (i.e., start each line of the plain text file with a comma) or not present, or what. When you figure that out, put the answer here so we'll all know.
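One answer, for the record: MySQL's load data can also take an explicit column list, so the file can simply omit the auto-generated column entirely (no leading comma needed). A sketch against the testdata table from the 9/1/11 plan, with the usual placeholder file path:

```sql
-- Name the columns the file actually contains; MySQL then generates
-- the auto_increment identifier itself.
load data local infile '/Users/wherever/whatever.txt'
into table testdata
fields terminated by ','
(is_male, income, lastname, age);
```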

I guessed (wildly?) at the meanings of all the magic numbers y'all left in your code; I may have guessed wrong and mis-named the constants. Let me know, y'all.

Lon, if you could modify the code with appropriately named constants to replace the magic numbers in your code, I'd appreciate it. Offhand, what is 65 for? 97?

65 is the ASCII decimal value offset to uppercase 'A' character. 97 is the ASCII decimal offset value to the lower case 'a' character. The code has been updated with appropriately named variables and comments. -Lon


I found a utility from Microsoft, rktools.exe. This is a utility package for Server 2003 which will allow us to do exactly what we wanted to do. -Matt

To download go here: http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=17657

As we are running Windows 7, the installer will say "this program has known compatibility issues," but I have installed it on my systems (all Windows 7, x86 & x64) multiple times without problems. If you don't want to use it, that's fine; I just wanted to show everyone it's there, and you can always uninstall it if you don't like it. These are the results from my system (Win 7 x64):

M:\My Databases>now & mysql -uroot schooltest<20110320_dks.sql & now

Thu Sep 08 21:24:52 2011

<dataset being added to database>

Thu Sep 08 21:24:53 2011

Thursday Sep 15, 2011: I, Lon Porter, removed 'magic numbers' from my method, and added additional constants and comments. Final output is available here as text:

/*  September 2011
AHPCRC project, data generation program

Generates random tuples for the schema:
create table testdata ( 
identifier int auto_increment, 
is_male bool/bit (emmanuel), 
income decimal/real/money (chris), 
lastname char(50) (lon), 
age int (matt), 
primary key (identifier))
Intended to create files for use with MySQL 'load data' command
*/

import java.io.*;
import java.util.*;

public class GenerateDataCELM
{

//create and initialize variables - this variable is for Matt's generateRandomAge()
private static Random randomAgeNumber = new Random();
public static final int MINAGE = 15; //Minimum age value allowed
public static final int MAXAGE = 99;
 //this variable is for Lon's generateLastName()
public static final int LON_SEED = 210986;
public static final int MAXNAMELENGTH = 50;
public static final int ALPHABET = 26;
public static final int LOWERCASEASCII = 97;
public static final int UPPERCASEASCII = 65;
public static Random randomLastName = new Random(LON_SEED);
 //this variable is for Chris's income generation
public static Random chrisRandom = new Random();
public static final int MAXINCOME = 101001;
public static final int BIGINCOME = 100000;
public static final int INCOMESTRINGLENGTH = 6;
public static final int NUMINCOMESTOGENERATE = 1;
 //this variable is for Emmanuel's gender generation
public static Random EmmanuelRandomGenerator = new Random();
public static final int GENDERS = 2;

public static void main(String[] args) {
 if (args.length < 1) {
  System.err.println("Please specify the number of tuples you want on the command line.");
  System.err.println("Example, to generate 100 tuples: java GenerateDataCELM 100");
  System.exit(0);
 }
 
 int numTuples = 0;
 try {
  numTuples = Integer.parseInt(args[0]);
 } catch (NumberFormatException e) {
  System.err.println("Please specify the number of tuples you want on the command line.");
  System.err.println("Example, to generate 100 tuples: java GenerateDataCELM 100");
  System.exit(0);
 }
 
 // loop around and print out fields generated by methods, comma-separated, \n on the end of each line
 
 for (int i = 0; i < numTuples; i++) {
  // Leading comma leaves the auto_increment identifier blank, which the
  // 10/5/11 note above says MySQL 'load data' needs.
  System.out.println("," + randomGenderGenerator() + "," + generateUniformIncome(NUMINCOMESTOGENERATE) + "," + generateLastName()
      + "," + generateRandomAge());
 }
}


// Emmanuel's method
public static String randomGenderGenerator()
{
       
 int randomInt = EmmanuelRandomGenerator.nextInt(GENDERS);
 String randomString = Integer.toString(randomInt);
 //System.out.println(randomInt);
 
 return randomString;
}
   
/*
 By: Matt Stephens
 Created: 09/02/2011
 Updated: 09/08/2011
 
 When the program is run, main calls generateRandomAge, which generates a
 value between 15-99 (min/max values) and returns it as a string; main
 then prints the value to the screen.
 */
public static String generateRandomAge(){
 //this method is used to generate a numeric value
 //between 15-99 then return that value as a string
 String randomAge = "";
 
 //nextInt(MAXAGE-MINAGE+1) gives a value in [0, MAXAGE-MINAGE]; adding
 //MINAGE shifts the range to [15, 99] inclusive (without the +1, 99 could never occur).
 randomAge = Integer.toString(MINAGE + (randomAgeNumber.nextInt(MAXAGE - MINAGE + 1)));
 return randomAge;
}

/* generateLastName() returns a String of random US alphabet characters.
 * First character of the returned String is a capitalized letter,
 * all other characters of the returned String are lowercase letters.
 *
 * Class containing method will need to contain the following line for
 *  consistent 'random' number generation:
 *
 * public static Random randomLastName = new Random(seed_value);
 *
 * Dependent on:
 *    java.util.Random;
 * 
 * Thus the class must import the java.util.Random package/library.
 * This can be done by including the line:
 *    import java.util.Random;
 *
 * Written by Lon Porter
 * 8 September 2011
 * Updated 
 * 15 September 2011 - name 'magic number', ASCII offset values.
 */
public static String generateLastName()
{
 //Instance variables
 String name = ""; // the name to be returned, 
 int nameLen = 0; // length of name to be outputted, less than or equal to maxLen
 int randInt = 0; // used to hold randomly generated integers.
 int maxLen = MAXNAMELENGTH; //max length of returned string
 int currLen = 0; //current length of string to be returned
 int uppercaseOffset = UPPERCASEASCII; //ASCII decimal offset value to uppercase 'A' character
 int lowercaseOffset = LOWERCASEASCII; //ASCII decimal offset value to lowercase 'a' character
 int alphabet = ALPHABET; // 26 is expected, as there are 26 chars in the US alphabet [a-z]; could take other values.

 // nameLen is assigned a random integer between 1 and maxLen
 nameLen =  randomLastName.nextInt(maxLen)+1;

 //First letter of name set to be a random capital letter
 randInt=randomLastName.nextInt(alphabet)+uppercaseOffset; //[A-Z]
 name = name + (char) randInt;
 currLen = currLen + 1;

 // Add random letter to name until name length is met.
 while(currLen<nameLen)
 {
  randInt = randomLastName.nextInt(alphabet)+lowercaseOffset; //[a-z]
  name = name + (char) randInt;
  currLen = currLen + 1;
 }
  
 return name;
} //end generateLastName

/*On my computer, this could generate 14.9 million numbers separated by commas within normally allocated memory*/
/*Uses StringBuilder internally for performance: 2 seconds for 14.9 million numbers, versus plain String concatenation's 10 seconds for just 20 thousand, 40 seconds for 40 thousand, etc.*/
public static String generateUniformIncome(int sizeOfOutput)
{
 int[] result = new int[sizeOfOutput];
 Random numGenerator = chrisRandom; /* use the random number generator defined within the class */
 int temp;
 
 
 for (int x = 0; x < sizeOfOutput; x++)
 {
  temp = numGenerator.nextInt(MAXINCOME); /*generates a number between 0 and 101,000 (inclusive)*/
  
  if (temp > BIGINCOME) /*numbers above 100,000 are used to populate all higher incomes*/
  {
   float temp2 = numGenerator.nextFloat();
   temp2 *= BIGINCOME;
   int temp3 = (int) temp2;
   /* multiply in long arithmetic: temp3 * temp can exceed Integer.MAX_VALUE,
      which previously overflowed to a negative income */
   long bigIncome = (long) temp3 * temp;
   temp = (int) Math.min(bigIncome, Integer.MAX_VALUE);
  }
  result[x] = temp;
 }
 
 
 
 StringBuilder resultStringBuilder =  new StringBuilder(sizeOfOutput * INCOMESTRINGLENGTH);
 
 for (int x = 0; x < sizeOfOutput; x++)
 {
  resultStringBuilder.append(result[x]);
 }
 
 
 return resultStringBuilder.toString();
 
}

/*Since I have not yet determined how to make a properly exponential decay curve, this somewhat approximates what it would look like.*/
/*Returns a normal distribution with the left side chopped off at zero (below 0 causes a redo of that number). The seed determines the scale of numbers given out. The size of the seed does alter how many numbers may be generated before running out of default space. Suggested, 30 to 40 thousand as the seed.*/
/*Uses StringBuilder for performance. 15 million records works with 40000 seed on my machine.*/
public static String generateNonUniformIncome(int sizeOfOutput, int seed)
{
 int[] result = new int[sizeOfOutput];
 Random numGenerator = chrisRandom; /* use the random number generator defined within the class */
 double temp;
 
 for (int x = 0; x < sizeOfOutput; x++)
 {
  temp = numGenerator.nextGaussian() * seed;
  //temp = temp + (int) .1 * maxIncome;
  if (temp < 0)
  {
   x = x - 1;
   continue;
  }
  result[x] = (int) temp;
 }
 StringBuilder resultStringBuilder =  new StringBuilder(sizeOfOutput * INCOMESTRINGLENGTH);
 
 for (int x = 0; x < sizeOfOutput; x++)
 {
  resultStringBuilder.append(result[x]);
 }
 
 
 return resultStringBuilder.toString();
 
 
}


}  //end of class
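On Chris's open question above about a properly exponential decay curve: inverse-transform sampling gives one directly, since -mean * ln(U) is exponentially distributed when U is uniform on (0,1). A standalone sketch (class and parameter names are mine, not part of the group's code):

```java
import java.util.Random;

// Sketch: exponential-distribution incomes via inverse-transform sampling.
public class ExpIncomeSketch {
    // Returns an exponentially distributed income with the given mean.
    public static int sampleExponentialIncome(Random rng, double meanIncome) {
        double u = rng.nextDouble();   // uniform on [0,1)
        while (u == 0.0) {             // avoid log(0)
            u = rng.nextDouble();
        }
        return (int) (-meanIncome * Math.log(u));
    }

    public static void main(String[] args) {
        Random rng = new Random(210986);   // fixed seed, as in Lon's generator
        int n = 100000;
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += sampleExponentialIncome(rng, 40000.0);
        }
        // the sample mean should land near the requested mean of 40000
        System.out.println("mean income over " + n + " samples: " + (sum / n));
    }
}
```

Unlike the clipped-Gaussian approach, the mean of the distribution is set directly and no retries are needed, since the samples are always non-negative.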

9/1/11

We need to select a different time to meet next week -- fill out the Doodle survey. (Done)

On 9/8, and only on 9/8, we will meet at 3:15 in the conference room.

Plan for the week: each of us write a method to generate one of the fields in our fake mini-table. By next week, be generating fake data, loading and timing how long it takes to load data sets of different sizes.

Fake mini-table: "create table testdata ( identifier int auto_increment, is_male bool/bit (emmanuel), income decimal/real/money (chris), lastname char(50) (lon), age int (matt), primary key (identifier))"

I've labeled the fields with the names of the folks who are writing the Java methods to generate Strings for those fields. Post your methods to this wiki: public static String generateX() {}.

8/25/11

Start by reading: "Pathologies of Big Data", Communications of the ACM, August 2009. Read this by next week. You might need to be on a campus computer to access the full text of the article.

Student discussion commentary

(Matt)

/*
By: Matt Stephens
Created: 09/02/2011
Updated: 09/08/2011

When the program is run, main calls GetRandomAge, which generates a
value between 15-99 (min/max values) and returns it as a string; main
then prints the value to the screen.
*/

import java.io.*;
import java.util.*;
public class AgeTestSet
{

        //create and initialize variables
        private static Random RandomAgeNumber = new Random();
        
        public static void main(String[] args) {
                System.out.println(GetRandomAge());
        }
       
        public static String GetRandomAge(){
                //this method is used to generate a numeric value
                //between 15-99 then return that value as a string
                int MinValue = 15;	//Minimum value allowed
                int MaxValue = 99;	//Maximum value allowed
                String RandomAge = "";
                
                //nextInt(MaxValue-MinValue+1) gives a value in [0, MaxValue-MinValue]; adding
                //MinValue shifts the range to [15, 99] inclusive (without the +1, 99 could never occur).
                RandomAge = Integer.toString(MinValue + (RandomAgeNumber.nextInt(MaxValue - MinValue + 1)));
                return RandomAge;
        }
 }

(/Matt)

(Lon)

/* generateLastName() returns a String of random US alphabet characters.
 * First character of the returned String is a capitalized letter,
 * all other characters of the returned String are lowercase letters.
 *
 * Class containing method will need to contain the following line for
 * consistent 'random' number generation:
 *
 * public static Random randomLastName = new Random(210986);
 *
 * Dependent on:
 *    java.util.Random;
 * 
 * Thus the class must import the java.util.Random package/library.
 * This can be done by including the line:
 *    import java.util.Random;
 *
 * Written by Lon Porter
 * 8 September 2011
 * Updated 
 * 15 September 2011 - name 'magic number', ASCII offset values.
 */
 public static String generateLastName()
 {
   //Instance variables
   String name = "";
   int nameLen = 0;
   int randInt = 0;
   int maxLen = 50; //max length of returned string
   int currLen = 0; //current length of string to be returned
   int uppercaseOffset = 65; //ASCII decimal offset value to uppercase 'A' character
   int lowercaseOffset = 97; //ASCII decimal offset value to lowercase 'a' character

   // nameLen is assigned a random integer between 1 and maxLen
   nameLen =  randomLastName.nextInt(maxLen)+1;

   //First letter of name set to be a random capital letter
   randInt=randomLastName.nextInt(26)+uppercaseOffset; //[A-Z]
   name = name + (char) randInt;
   currLen = currLen + 1;

   // Add random letter to name until name length is met.
   while(currLen<nameLen)
   {
      randInt = randomLastName.nextInt(26)+lowercaseOffset; //[a-z]
      name = name + (char) randInt;
      currLen = currLen + 1;
   }

   return name;
}

(/Lon)

(Thomas comment -- If we view Chris's code in the 'edit' window of the wiki, copy and paste from that window into a nice code editor should have ok formatting.)

(Chris) In the interest of putting out what I have so far, here follows some code for generating numbers (works unless I made some copy/paste/cleanup errors)(wiki messed up the formatting):


import java.util.Random;

/* I need chrisRandom placed as a variable in the class */
public class GenerateIncome
{

/*On my computer, could generate 14.9 million numbers separated by commas within normally allocated memory*/
/*Uses StringBuilder internally for performance reasons (2 seconds for 14.9 million, versus 10 seconds for 20 thousand, 40 seconds for 40 thousand, etc)*/
public static String generateUniformIncome(int sizeOfOutput)
{
 int[] result = new int[sizeOfOutput];
 Random numGenerator = new Random();
 numGenerator = chrisRandom; /* to change the random number generator to the one within the class */
 int temp;

 for (int x = 0; x < sizeOfOutput; x++)
 {
  temp = numGenerator.nextInt(101001); /*generates a number between 0 and 101,000 (inclusive)*/

  if (temp >= 100001) /*numbers above 100,000 are used to populate all higher incomes*/
  {
   float temp2 = numGenerator.nextFloat();
   temp2 *= 100000;
   int temp3 = (int) temp2;
   temp = temp3 * temp;
  }
  result[x] = temp;
 }

 StringBuilder resultStringBuilder = new StringBuilder(sizeOfOutput * 6);

 for (int x = 0; x < sizeOfOutput; x++)
 {
  resultStringBuilder.append(result[x]);
 }

 return resultStringBuilder.toString();
}

/*Since I have not yet determined how to make a properly exponential decay curve, this somewhat approximates what it would look like.*/
/*Returns a normal distribution with the left side chopped off at zero (below 0 causes a redo of that number). The seed determines the scale of numbers given out. The size of the seed does alter how many numbers may be generated before running out of default space. Suggested, 30 to 40 thousand as the seed.*/
/*Uses StringBuilder for performance. 15 million records works with 40000 seed on my machine.*/
public static String generateNonUniformIncome(int sizeOfOutput, int seed)
{
 int[] result = new int[sizeOfOutput];
 Random numGenerator = new Random();
 numGenerator = chrisRandom; /*used to move this up into the class*/
 double temp;

 for (int x = 0; x < sizeOfOutput; x++)
 {
  temp = numGenerator.nextGaussian() * seed;
  //temp = temp + (int) .1 * maxIncome;
  if (temp < 0)
  {
   x = x - 1;
   continue;
  }
  result[x] = (int) temp;
 }

 StringBuilder resultStringBuilder = new StringBuilder(sizeOfOutput * 6);

 for (int x = 0; x < sizeOfOutput; x++)
 {
  resultStringBuilder.append(result[x]);
 }

 return resultStringBuilder.toString();
}

}

(/Chris)


(Emmanuel) ***Updated*** 9/8/11

/*
  Random Gender Generator
*/

import java.util.Random;
import java.util.Scanner;

public class RandomGender {

 private int randomNoCount; 
 private int randomInt;
 public int randomGenderGenerator()
 {
      Random randomGenerator = new Random();
       
      randomInt = randomGenerator.nextInt(2);
      System.out.println(randomInt);
      return randomInt;
 }
   
 public static void main(String[] Args)
 {
      RandomGender num = new RandomGender();
      System.out.println("");
      num.randomGenderGenerator();
 }

}

(/Emmanuel)

Useful Links

MySQL -- free DBMS software -- try the Community Server, or the "Windows download". Here is extra help installing MySQL on a Mac.

PostgresQL - another free DBMS we could compare to MySQL.

Oracle 10g Express DBMS -- a free, starter DBMS product from the giant DBMS company, Oracle

'Warehouse-Scale Computing' talk by Luiz Barroso of Google - given at FCRC '11 conference in June. All about the problems that arise when you try to do truly LARGE scale computing.