Hashing Problem

(Latest Revision: 04/10/2005)

Hashing Social Security Numbers

PRELIMINARIES:

Before working on this problem you need to read about hashing in the second half of chapter 12. As always, you need to be familiar with the directions and examples here: http://www.cs.csustan.edu/~john/Classes/General_Info/progAsgRules/

THE ASSIGNMENT:

Write a program that implements a hash table of social security numbers.

INPUT & OUTPUT:

The program reads from standard input. Input must have the format illustrated by this example:

20
15
259-97-6223 808-93-0709 636-86-3660 399-88-2968
041-39-9350 397-67-7732 522-23-1899 762-23-8562
927-06-2355 087-75-3173 296-59-8912 052-61-3718
910-26-9707 080-42-6115 917-34-3630
s
r 917-34-3630
r 911-34-3623
i 911-43-6298
d 808-93-0709
q

The first number in the input (20 in the example) must be a positive integer, minTableSize. After reading minTableSize the program must calculate a table size and allocate a hash table of that size. The table size must be the smallest prime number not less than minTableSize. (The table size must equal minTableSize if minTableSize is prime. Otherwise the table size must be the first prime larger than minTableSize.) In the example above the table size is 23.

The second number in the input (15 in the example) is numKeys. It must be a positive integer no greater than minTableSize. After reading numKeys, the program must read that many keys (social security numbers) from the input. The keys must be formatted with hyphens as shown. The program must insert each key into the hash table. The input must be constructed so that no two of the keys are equal. The number of keys in this section of the input must be exactly equal to the value of numKeys. This is illustrated by the example above, where numKeys = 15 and there are exactly 15 keys in the section of the input following numKeys.

The program must calculate the hash address by treating the key as a 9-digit number N and taking N % tableSize as the address. In the example above, 041-39-9350 hashes to the address

41399350 % 23 = 17

(Note that C++ interprets an integer literal with a leading zero as an *octal* constant, so the expression above must not be written in a program as 041399350 % 23. In fact that literal would not even compile, because 9 is not an octal digit.)
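As a minimal sketch of the hash calculation described above, the following reads the digits of a hyphenated key one at a time, which also sidesteps the octal issue. The function names ssnToNumber and hashAddress are hypothetical, chosen only for this illustration. A 9-digit number is at most 999,999,999, so it fits in a 32-bit long.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Convert a key such as "041-39-9350" to the number 41399350 by
// accumulating its decimal digits and skipping the hyphens.
// (Hypothetical helper name, not part of the assignment text.)
long ssnToNumber(const std::string& ssn)
{
    long n = 0;
    for (char c : ssn) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            n = n * 10 + (c - '0');
        }
    }
    return n;
}

// Hash address = N % tableSize, as the assignment specifies.
int hashAddress(const std::string& ssn, int tableSize)
{
    assert(tableSize > 0);
    return static_cast<int>(ssnToNumber(ssn) % tableSize);
}
```

With the example input, hashAddress("041-39-9350", 23) gives 17, matching the calculation shown above.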

The program must use counters to keep track of how many keys have hashed to each address in the hash table. The program must have a different counter for each address. The addresses are 0 .. tableSize-1. Each time the program encounters a new key, it must calculate the hash address using the hash function and increment the counter corresponding to that address.

[Note: the collision-resolution policy has no effect on which counter the program increments. The program always increments the counter that corresponds to the (original) hash address, even if the collision-resolution policy places the element in a location different from the original hash address.]
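To make the note concrete, here is a small sketch that assumes linear probing as the collision-resolution policy (the assignment does not require that policy). The names EMPTY_SLOT and insertWithLinearProbing are hypothetical. The sketch omits the duplicate check, and leaving the counters unchanged when the table is full is an assumption of this sketch, not a rule from the assignment.

```cpp
#include <cassert>

const long EMPTY_SLOT = -1;  // assumed sentinel: no valid key is negative

// Store key using linear probing and return the slot where it landed,
// or -1 if the table is full. The counter that is incremented is the one
// for the ORIGINAL hash address, not the slot that was finally used.
int insertWithLinearProbing(long* slots, int* counters, int tableSize, long key)
{
    assert(slots != nullptr && counters != nullptr);
    assert(tableSize > 0 && key >= 0);
    const int home = static_cast<int>(key % tableSize);
    for (int step = 0; step < tableSize; ++step) {
        const int slot = (home + step) % tableSize;
        if (slots[slot] == EMPTY_SLOT) {
            slots[slot] = key;
            ++counters[home];  // always the home address, never `slot`
            return slot;
        }
    }
    return -1;  // table full; this sketch leaves the counters unchanged
}
```

If two keys both hash to address 2 in a table of size 5, the second one is stored at address 3, but counters[2] becomes 2 and counters[3] stays 0.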

After processing the numKeys key values, the program must output a tabular report showing how many of the counters stopped when equal to 0, equal to 1, equal to 2, and so on, up to the largest value to which any of the counters grew. The report must also compare this information about the counters with values predicted by the "perfect randomizer" expression:


Expected Frequency(k) = (e^(-α) * α^k) / k!

This expression, simply put, gives the expected frequency of locations in the table whose counter stops at k.

The expression is valid if the hash function has no bias. The frequency is expressed as a decimal fraction of the table size.

The quantity α is the load factor, α = (numKeys / tableSize). Of course, (k!) means "k factorial".

Incidentally, if you include the math.h library, you will have an "exp" function available to you for computing e^(-α), and also a "pow" function for calculating powers of numbers (like α^k). You can read about how to use these functions by using the "man" pages on your Unix system. Just do "man exp" and "man pow."

To illustrate with a specific example, suppose that minTableSize is 1575 and numKeys is 800. Then tableSize is 1579 and α = 800/1579 ≈ 0.5066. (I'm using the symbol "≈" to mean 'approximately equal to'.) Using k = 3, the formula (e^(-α) * α^k) / k! ≈ (0.6025) * (0.1300) / 6, which yields a value of 0.0131 (approximately).

The 0.0131 means that if we are hashing 800 keys into a table with 1579 slots, the statistical model predicts there will be 3-way collisions at about 1.31% of the locations in the table. In this example, the 1.31% works out to about 21 locations. That calculation goes as follows: (0.0131) * (tableSize) = (0.0131) * (1579) ≈ 20.685, which rounds to 21.

In the case at hand, one of the rows in the report will show the number 21 alongside the actual number of 3-way collisions recorded when the program hashed the 800 keys from the input. The report allows us to make a comparison between the value predicted by the statistical model and the value that actually turned up.
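The calculation above can be sketched in code as follows. The function names expectedFrequency and predictedCount are hypothetical. One point worth noticing is the cast to double when computing α: with two int operands, numKeys / tableSize would be integer division and would give 0 whenever numKeys is less than tableSize.

```cpp
#include <cassert>
#include <cmath>

// "Perfect randomizer" estimate: e^(-alpha) * alpha^k / k!
// The result is a fraction of the table size.
double expectedFrequency(int numKeys, int tableSize, int k)
{
    assert(numKeys >= 0 && tableSize > 0 && k >= 0);
    const double alpha = static_cast<double>(numKeys) / tableSize;  // load factor
    double factorial = 1.0;
    for (int i = 2; i <= k; ++i) {
        factorial *= i;
    }
    return std::exp(-alpha) * std::pow(alpha, k) / factorial;
}

// Predicted number of table locations whose counter stops at k.
double predictedCount(int numKeys, int tableSize, int k)
{
    return expectedFrequency(numKeys, tableSize, k) * tableSize;
}
```

For numKeys = 800, tableSize = 1579 and k = 3 this yields a frequency of about 0.0131 and a predicted count that rounds to 21, in agreement with the worked example.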

Lay out the report as a 4-column array: the first column is for the number k, the second column for the predicted values, the third column for the actual values, and the fourth column expressing the actual value as a percentage of the predicted value.

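The table corresponding to the example of 800 keys hashed into a table with 1579 slots might look like the one below. (The percentages in this sample are computed from the unrounded predicted values, which is why the row for k = 5 shows 378% even though the predicted value rounds to 0.)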


slot count      predicted       actual          act/pred

0                951             952             100%
1                482             486             101%
2                122             111              91%
3                 21              29             141%
4                  3               0               0%
5                  0               1             378%
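One possible way to produce such a report is sketched below. The names counterHistogram and buildReport are hypothetical, the column widths are arbitrary, and the text is left-aligned for simplicity, so the spacing will not match the sample exactly. The percentage is taken against the unrounded prediction.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// histogram[k] = number of addresses whose counter stopped at k.
std::vector<int> counterHistogram(const int* counters, int tableSize)
{
    std::vector<int> histogram(1, 0);
    for (int a = 0; a < tableSize; ++a) {
        if (counters[a] >= static_cast<int>(histogram.size())) {
            histogram.resize(counters[a] + 1, 0);
        }
        ++histogram[counters[a]];
    }
    return histogram;
}

// Build the 4-column report as text; the caller can send it to cout.
std::string buildReport(const int* counters, int tableSize, int numKeys)
{
    assert(counters != nullptr && tableSize > 0 && numKeys >= 0);
    const std::vector<int> histogram = counterHistogram(counters, tableSize);
    const double alpha = static_cast<double>(numKeys) / tableSize;

    std::ostringstream out;
    out << std::left << std::setw(16) << "slot count"
        << std::setw(16) << "predicted"
        << std::setw(16) << "actual" << "act/pred\n\n";

    double predicted = std::exp(-alpha) * tableSize;  // value for k = 0
    for (std::size_t k = 0; k < histogram.size(); ++k) {
        if (k > 0) {
            // Multiplying by alpha/k turns the k-1 value into the k value.
            predicted *= alpha / static_cast<double>(k);
        }
        const double percent = 100.0 * histogram[k] / predicted;
        out << std::setw(16) << k
            << std::setw(16) << std::lround(predicted)
            << std::setw(16) << histogram[k]
            << std::lround(percent) << "%\n";
    }
    return out.str();
}
```

The rows run from k = 0 up to the largest counter value, because the histogram grows only as far as the largest counter seen.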

After printing the tabular report, the program must process commands coming from the input. The possible commands are s, r, i, d, and q.

The s command stands for show. In response to this command, the program must echo the command and print a report representing what is in the hash table. The report must have a 'two-column' format: the first column is for table indices, and the second column is for key(s) stored in the location given by the index. The program must write social security numbers in "standard form" -- that is with dashes in the right places.

A couple of things to keep in mind here: First, in this case your program must report the locations where the keys are actually stored. If the program uses open addressing, the address where a key is stored is not necessarily the address computed by the hash function. Second, if the program uses chaining for collision resolution there will typically be more than one key in some slots.

The s command is not meant to be used for large tables, so just write your program to use one line for each occupied table slot. However, write the code so it entirely skips any empty table slot. The program must write nothing, not even a newline, corresponding to an empty slot in the table.

For example: when there are 80 keys and a minimum table size of 100, the first few lines of the response to the s command might look like this:


===========================================================================
s
SHOW COMMAND

INDEX      VALUES

0      254-40-5163  
5      576-29-0451  
9      957-32-7903  861-27-9125  031-39-2930  
13     126-56-3315  059-97-5934  202-75-3271  114-71-9146  
14     261-64-6675  
15     461-60-6678  
16     962-95-6236  435-92-9550  
17     612-12-0516  579-19-6940

This particular display was written by a version of the program that uses chaining as the collision-resolution policy. There is more than one key stored in some of the table slots. Therefore the 'column' for VALUES contains several keys in some rows. Note that there are "lines missing" in the display where the program has refrained from printing output corresponding to empty table slots. This is a desirable thing: the output is more compact this way.
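A display of this kind could be generated along the following lines. This is only a sketch under several assumptions: keys are stored as numbers, std::list stands in for whatever chain structure the program really uses, and the names ChainTable, formatSsn and buildShow are hypothetical. The separator line of equal-signs is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <list>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::list<long> > ChainTable;  // one chain per table slot

// Write a key in "standard form", restoring any leading zeros.
std::string formatSsn(long n)
{
    assert(n >= 0 && n <= 999999999L);
    std::ostringstream out;
    out << std::setfill('0')
        << std::setw(3) << n / 1000000 << '-'
        << std::setw(2) << (n / 10000) % 100 << '-'
        << std::setw(4) << n % 10000;
    return out.str();
}

// Build the response to the s command: one line per occupied slot.
std::string buildShow(const ChainTable& table)
{
    std::ostringstream out;
    out << "s\nSHOW COMMAND\n\nINDEX      VALUES\n\n";
    for (std::size_t index = 0; index < table.size(); ++index) {
        if (table[index].empty()) {
            continue;  // write nothing at all for an empty slot
        }
        out << std::left << std::setw(7) << index;
        for (long key : table[index]) {
            out << formatSsn(key) << "  ";
        }
        out << '\n';
    }
    return out.str();
}
```

Note that formatSsn(41399350) gives 041-39-9350, so the leading zero that was lost when the key was treated as a number reappears in the output.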

The r command stands for retrieve. This command requires a parameter -- a single social security number (a key) to appear after the r. The program must echo the command and if possible find the key in the table. If the key is found, the program must then report the address to which the key hashes, the current value of the counter for that address, and the address at which the key is actually stored. Each of these three pieces of information must be appropriately labelled. For example consider the output appropriate to the commands:

r 598-21-8634 
r 564-64-7123

The first key above is in the table and the second is not. The following would be appropriate output:


===========================================================================
r 598-21-8634
RETRIEVE COMMAND

Key             Hash Address    Counter Value   Location in Table
598-21-8634     78              2               78
===========================================================================
r 564-64-7123
RETRIEVE COMMAND

Key             Hash Address    Counter Value   Location in Table
564-64-7123
NOT FOUND
===========================================================================

Note the use of lines of equal-signs to separate the output corresponding to different commands. Note also the error message. The program is required to write such a message when the key does not exist in the table.

The i command stands for insert. Like the r command, this command requires a parameter -- a key value coming next in the input. The program will echo the command and insert the key into the table (unless the table already contains such an element or the table is full). If insertion fails the program will print an appropriate message. If insertion succeeds the program will print the same three pieces of information as the r command. Of course the counter value will be the new value: one greater than the old value, because of the insertion.

The d command stands for delete. Like r and i, this command requires a parameter. In this case the program will echo the command and delete the element if it is present. Like r and i, it will print the hash address, new counter value, and actual location. If the operation fails because the key was not found (the only valid reason) then the program will print an appropriate message.

The q command stands for quit. It will occur exactly once, at the end of the input.

There can be any number of s, r, i, and d commands after the initial input. They can appear mixed in any order.
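The overall shape of the command-processing loop might look like the sketch below. It only echoes each command. The name runCommands and the ruleWidth parameter are hypothetical, and the places where a real program would call its show, retrieve, insert and delete routines are marked with comments.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>

// Read commands until q (or end of input) and echo each one after a
// separator line. Returns the number of commands handled before q.
int runCommands(std::istream& in, std::ostream& out, int ruleWidth)
{
    assert(ruleWidth >= 0);
    const std::string rule(static_cast<std::size_t>(ruleWidth), '=');
    int handled = 0;
    char command;
    while (in >> command && command != 'q') {
        out << rule << '\n' << command;
        if (command == 'r' || command == 'i' || command == 'd') {
            std::string key;
            if (in >> key) {
                out << ' ' << key;
            }
            // A real program would call its retrieve, insert, or
            // delete routine here and print the result.
        }
        // For 's', a real program would print the table contents here.
        out << '\n';
        ++handled;
    }
    return handled;
}

// Example of exercising the loop without a file:
//   std::istringstream in("s\nr 917-34-3630\nq\n");
//   runCommands(in, std::cout, 75);
```

Because operator>> skips whitespace, the loop does not care whether commands are separated by newlines or by spaces.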

MAKEFILE GOTCHA, or (NOT) USING A MAKEFILE:

Normally it is a good idea to use a makefile. I had a big problem with my makefile for this program. I don't know for sure, but maybe the problem is due to a bug in the compiler that 'kicks in' when math.h or time.h is included. What I experienced is that the program would compile but then not write anything - nothing at all. The same program worked fine and produced lots of output when I compiled using the command g++ *.cpp. So please watch out for this problem.

(Update: The 'problem' referred to above may have just stemmed from some conflict with /usr/bin/hash on the Ultras. I tested a makefile in which the name of the program was changed from 'hash' to 'hasher' and that makefile seems to be working fine.)

PSEUDO-RANDOM NUMBER GENERATION:

In order to test your program, you will need a small program that can generate sets of pseudo-random social security numbers. For that, you can use the little program called makeInFile.cpp included in the directory with this assignment.

Unless the set of keys you generate is very large indeed, it will be almost certain that all the keys are distinct (different from one another). However, this is not a perfect certainty.

You can use the unix sort and uniq commands to check your key sets for duplicates. If you find your set has a duplicate, you can repair the set by replacing the duplicate with a number you know is not already in the set. If you read the man pages on sort and uniq, I think you will probably know what to do, but speak to me if you have questions.

CALCULATION OF PRIMES:

In this program you need to have a function that inputs a number m and returns the smallest prime number p >= m. We can discuss this in class, and I can give you some simple pseudo-code.
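For reference, one simple approach is trial division, sketched below. This is not necessarily the pseudo-code that will be given in class, and the names isPrime and smallestPrimeAtLeast are hypothetical.

```cpp
#include <cassert>

// Trial division: n is prime if no d with d*d <= n divides it.
// The test is written as d <= n / d to avoid overflow in d * d.
bool isPrime(int n)
{
    if (n < 2) {
        return false;
    }
    for (int d = 2; d <= n / d; ++d) {
        if (n % d == 0) {
            return false;
        }
    }
    return true;
}

// Smallest prime p with p >= m.
int smallestPrimeAtLeast(int m)
{
    assert(m > 0);
    int p = m;
    while (!isPrime(p)) {
        ++p;
    }
    return p;
}
```

With the values used in this document, smallestPrimeAtLeast(20) is 23 and smallestPrimeAtLeast(1575) is 1579.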

DYNAMIC ALLOCATION OF ARRAYS:

This program must not allocate the hash table statically. It must read the value of minTableSize, compute the size of the hash table, and then dynamically allocate the correct amount of space for the table. See pages 171-172 in your text book for information on how to perform dynamic allocation of an array.
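As a rough illustration (the text book's presentation may differ), dynamic allocation of an array with new[] and delete[] looks like this. The names allocateTable and releaseTable are hypothetical, and storing keys as long values with a sentinel for empty slots is an assumption of the sketch.

```cpp
#include <cassert>

// Allocate a table of tableSize slots at run time and mark every
// slot as empty. The caller owns the array and must release it.
long* allocateTable(int tableSize, long emptyValue)
{
    assert(tableSize > 0);
    long* table = new long[tableSize];
    for (int i = 0; i < tableSize; ++i) {
        table[i] = emptyValue;
    }
    return table;
}

// Arrays obtained from new[] must be released with delete[].
void releaseTable(long* table)
{
    delete[] table;
}
```

The size passed to allocateTable would be the prime computed from minTableSize, which is not known until the program has read its input.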

CHOICE OF IMPLEMENTATION:

In the 'story' of this program, the hash table is a prominent 'noun'. It would be a good idea to implement the hash table as a class. You don't necessarily have to do that, but you are required to use good design principles - modularity, small functions, information hiding, abstraction, top down structure, and so forth.

If you want to save time writing this program, consider using chaining as your collision resolution policy.

Unless you get a lot of help by re-using classes from some source, I think chaining will be the easiest way to implement the collision resolution. Deletion is a little complex to implement when using open-addressing, but it is straightforward when using chaining.

You can use the linked list code to implement chains. I have included linked list code that will work well in the directory containing this assignment. (See ListP.h and ListP.cpp )
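To show why deletion is straightforward with chaining, here is a compact sketch of a chained table class. It uses std::list and std::vector from the standard library as stand-ins, not the ListP code and not a dynamically allocated array, so it illustrates the idea only and is not a model solution. The class and member names are hypothetical. Decrementing the counter on deletion is an assumption based on the phrase "new counter value" in the description of the d command.

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <vector>

class ChainedHashTable {
public:
    explicit ChainedHashTable(int tableSize)
        : chains_(tableSize), counters_(tableSize, 0)
    {
        assert(tableSize > 0);
    }

    int home(long key) const
    {
        return static_cast<int>(key % static_cast<long>(chains_.size()));
    }

    bool contains(long key) const
    {
        const std::list<long>& chain = chains_[home(key)];
        return std::find(chain.begin(), chain.end(), key) != chain.end();
    }

    // Returns false if the key is already present.
    bool insert(long key)
    {
        if (contains(key)) {
            return false;
        }
        chains_[home(key)].push_back(key);
        ++counters_[home(key)];
        return true;
    }

    // Returns false if the key is not present. With chaining, deletion
    // just unlinks one node; no other key has to move.
    bool remove(long key)
    {
        std::list<long>& chain = chains_[home(key)];
        std::list<long>::iterator pos = std::find(chain.begin(), chain.end(), key);
        if (pos == chain.end()) {
            return false;
        }
        chain.erase(pos);
        --counters_[home(key)];  // assumption: deletion lowers the counter
        return true;
    }

    int counter(int address) const { return counters_[address]; }

private:
    std::vector<std::list<long> > chains_;
    std::vector<int> counters_;
};
```

With chaining, the location where a key is stored is always its hash address, so the "Hash Address" and "Location in Table" columns of the r command would show the same number.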

TESTING:

You are responsible for deciding what sort of tests need to be done on your program to check it for correctness. I expect you to design your own test inputs. The sample inputs illustrated in this document are not adequate for thorough testing. Your test scripts must demonstrate adequate code coverage and data coverage.

When I do my testing of your program, time will be of the essence because I will be testing the programs of many students. I will prepare one or more files containing inputs to test your programs. I will run your program in a mode like this:

myNameForYourProgram < myinput > myoutput

I will then examine the contents of myoutput to see if it differs significantly from the output file I get when I run my own solution program.

If I have to change your program or one of my input files in order to get your program to accept my input, you will lose significant credit.

WHAT TO TURN IN:

You will turn in two "phases" of this assignment:

a level 3 version, and
a final version.

For each phase of the assignment, you will turn in a printer output (hardcopy) and you will send me an e-mail message. Please follow these rules:

Always send me e-mail as plain text in the main message body. Never send me attachments.
Always use the exact subject line I specify for each message. (I often get hundreds of e-mail messages in a week. The subject line allows me to find, filter and sort messages.) You will lose a significant number of points on the assignment if you use the wrong subject line.
Be very careful when you send the e-mail. You may use the instructions in your Hello World! lab exercise for guidance. Of course, you will need to make the obvious changes to those directions -- you have to use the correct subject line and filename.
Always send yourself a copy of each e-mail message you send to me, and check immediately to see if you receive the message intact. You are responsible for sending e-mail correctly.

Here is the list of things you have to turn in:

At the start of class on the first due date, place the following item on the "counter" in front of me:
- a hardcopy of your level 3 (or greater) program. (All the source code except the list code.) Make sure all the code is properly formatted and that it all shows on the paper.

Using the subject line: CS3100,prog5.3 send the following item to me by e-mail before midnight on the first due date:

One shell archive file (only one) containing items 1-4.
1. All source files for your level 3 program (all *.cpp files and *.h files -- list code too)
2. Your test script showing adequate testing of your level 3 program.
3. A file named 'README' containing the compilation command one should use to compile your program.
4. A copy of your 'makefile' if you used one (see the MAKEFILE GOTCHA section above for why you might choose not to use one).

At the start of class on the second due date, place the following item on the "counter" in front of me:
- a hardcopy of your final level program. (All the source code except the list code.) Make sure all the code is properly formatted and that it all shows on the paper.

Using the subject line: CS3100,prog5.f send the following item to me by e-mail before midnight on the second due date:

One shell archive file (only one) containing items 1-4.
1. All source files for your final level program (all *.cpp files and *.h files -- list code too)
2. Your test script showing adequate testing of your final level program.
3. A file named 'README' containing the compilation command one should use to compile your program.
4. A copy of your 'makefile' if you used one (see the MAKEFILE GOTCHA section above for why you might choose not to use one).

Note that there are no spaces in the subject lines given. It is important that you do not insert any spaces. My e-mail address is: john@ishi.csustan.edu.

DUE DATES:

For the due dates, see the class schedule.