Hashing Problem

(Latest Revision: 11/18/2000)

Hashing Social Security Numbers

PRELIMINARIES:

Before working on this problem you need to read about hashing in the second half of chapter 12. As always, you need to be familiar with the directions and examples here: http://shalim.csustan.edu/~john/Classes/General_Info/progAsgRules/

THE ASSIGNMENT:

Write a program that implements a hash table of social security numbers.

INPUT & OUTPUT:

The program reads from standard input. Input looks like this:

20
15
259-97-6223 808-93-0709 636-86-3660 399-88-2968
041-39-9350 397-67-7732 522-23-1899 762-23-8562
927-06-2355 087-75-3173 296-59-8912 052-61-3718
910-26-9707 080-42-6115 917-34-3630
s
r 917-34-3630
r 911-34-3623
i 911-43-6298
d 808-93-0709
q

The first number in the input (20 in the example) is an integer, minTableSize. After reading it the program will calculate the table size, and allocate a hash table of that size. The table size must be the smallest prime number not less than minTableSize. In the example above, the table size will be 23.

The second number in the input (15 in the example) is an integer, numKeys. After reading numKeys, the program will read that many keys (social security numbers) from the input. The keys will be formatted with hyphens as shown. The program will insert each key into the hash table. No two of these keys are allowed to be equal. The number of keys in this section of the input must be exactly equal to the value of numKeys. This is illustrated by the example above, where numKeys = 15 and there are exactly 15 keys in the section of the input following numKeys.

The program will calculate the hash address by treating the key as a 9-digit number N and taking N % tableSize as the address. In the example above, 041-39-9350 hashes to the address

41399350 % 23 = 17

(Note that C++ interprets an integer literal with a leading zero as an *octal* constant, so the expression above must not be written in a program as 041399350 % 23. In fact that particular literal will not even compile, because 9 is not a valid octal digit.)
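
As a minimal sketch of the hash calculation described above (the function names ssnToNumber and hashAddress are my own illustrations, not required names), the program can build N from the digits of the key and skip the hyphens. Because N is assembled digit by digit, no integer literal is involved and the octal issue never comes up.

```cpp
#include <cctype>
#include <string>

// Convert a key such as "041-39-9350" to the 9-digit number N,
// ignoring the hyphens. (Hypothetical helper, not a required name.)
long ssnToNumber(const std::string& key) {
    long n = 0;
    for (char c : key) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            n = n * 10 + (c - '0');
        }
    }
    return n;
}

// Hash address = N % tableSize, as the assignment specifies.
int hashAddress(const std::string& key, int tableSize) {
    return static_cast<int>(ssnToNumber(key) % tableSize);
}
```

With the example above, hashAddress("041-39-9350", 23) yields 17.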

The program will use counters to keep track of how many keys have hashed to each address in the hash table. The program needs to have a different counter for each address. The addresses are 0 .. tableSize-1. Each time the program encounters a new key, it calculates the hash address using the hash function, and increments the counter corresponding to that address.

[Note: the collision-resolution policy has no effect on which counter the program increments. The program always increments the counter that corresponds to the (original) hash address, even if the collision-resolution policy places the element in a location different from the original hash address.]

After processing the numKeys key values, the program will output a tabular report showing how many of the counters stopped when equal to 0, equal to 1, equal to 2, and so on, up to the largest value to which any of the counters grew. The report will also compare this information about the counters with values predicted by the "perfect randomizer" expression:


( Expected Frequency(k) ) = exp(-alpha) * (alpha # k) / (k!)

This expression, simply put, is the expected frequency of locations in the table whose counter stops at k.

The expression is valid if the hash function has no bias. The frequency is expressed as a decimal fraction of the table size.

The quantity exp(-alpha) refers to the exponential function: exp(-alpha) is "e to the (-alpha) power." Alpha is the load factor, the ratio (numKeys / tableSize). The pound sign (#) stands for exponentiation (alpha # k means "alpha to the kth power"). Of course, (k!) means "k factorial".

Incidentally, if you include the math.h library, you will have an "exp" function available to you, and also a "pow" function for calculating powers of numbers. You can read about how to use these functions by using the "man" pages on your Unix system. Just do "man exp" and "man pow."

To illustrate with a specific example, suppose that minTableSize is 1000 and numKeys is 800. Then tableSize is 1009 and alpha = 800/1009 = 0.793. Using k = 3, the formula exp(-alpha) * (alpha # k) / (k!) = (0.453) * (0.498) / 6, which yields a value of 0.038 (approximately).

The 0.038 means that if we are hashing 800 keys into a table with 1009 slots, the statistical model predicts there will be 3-way collisions at about 38 different locations in the table. (We get the number 38 by this calculation: (0.038) * (tableSize) = (0.038) * (1009) = 38.342, which rounds to 38. )
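
The calculation above might be coded along the following lines. This is only a sketch: the names factorial, expectedFrequency, and predictedSlots are my own choices, and rounding with std::lround is one reasonable way to turn the prediction into a whole number of slots.

```cpp
#include <cmath>

// k! computed as a double; k stays small here because it never
// exceeds the largest value reached by any counter.
double factorial(int k) {
    double result = 1.0;
    for (int i = 2; i <= k; ++i) result *= i;
    return result;
}

// The "perfect randomizer" expression: exp(-alpha) * (alpha # k) / (k!)
double expectedFrequency(double alpha, int k) {
    return std::exp(-alpha) * std::pow(alpha, k) / factorial(k);
}

// Predicted number of table slots whose counter stops at k.
long predictedSlots(int numKeys, int tableSize, int k) {
    double alpha = static_cast<double>(numKeys) / tableSize;  // load factor
    return std::lround(expectedFrequency(alpha, k) * tableSize);
}
```

For 800 keys and 1009 slots, predictedSlots(800, 1009, 3) gives 38, in agreement with the hand calculation.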

In the case at hand, one of the rows in the report will show the number 38 alongside the actual number of 3-way collisions recorded when the program hashed the 800 keys from the input. The report allows us to make a comparison between the value predicted by the statistical model and the value that actually turned up.

Lay out the report as a 4-column table: the first column is for the number k, the second column for the predicted values, the third column for the actual values, and the fourth column for the actual value expressed as a percentage of the predicted value.

The table corresponding to the example of 800 keys hashed into a table with 1009 slots might look like this:


slot count      predicted       actual          act/pred

0                457             452              99%
1                362             382             106%
2                144             119              83%
3                 38              45             119%
4                  8              10             133%
5                  1               1              84%
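
The "actual" column comes from tallying the counters. Here is one possible sketch, assuming the counters are kept in a std::vector<int> with one entry per hash address (the name counterHistogram is hypothetical):

```cpp
#include <algorithm>
#include <vector>

// Returns freq, where freq[k] is the number of addresses whose counter
// stopped at k, for k = 0 up to the largest counter value.
std::vector<int> counterHistogram(const std::vector<int>& counters) {
    int maxCount = 0;
    for (int c : counters) maxCount = std::max(maxCount, c);

    std::vector<int> freq(maxCount + 1, 0);
    for (int c : counters) ++freq[c];
    return freq;
}
```

Each row k of the report can then pair freq[k] with the predicted value for the same k.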

After printing the tabular report, the program will process commands coming from the input. The possible commands are s, r, i, d, and q.

The s command stands for show. In response to this command, the program will echo the command and print a report representing what is in the hash table. The report must have a two-column format: the first column is for table indices, and the second column is for key(s) stored in the location given by the index. The program has to write social security numbers in "standard form" -- that is with dashes in the right places.
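
If your program stores each key as a number rather than as text (an assumption, since either choice can work), it must restore the leading zeros and the dashes when it prints. A sketch, with formatSsn as a hypothetical helper name:

```cpp
#include <cstdio>
#include <string>

// Format a 9-digit number as a social security number in standard
// form, e.g. 41399350 becomes "041-39-9350".
std::string formatSsn(long n) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%03ld-%02ld-%04ld",
                  n / 1000000,         // first three digits
                  (n / 10000) % 100,   // middle two digits
                  n % 10000);          // last four digits
    return std::string(buf);
}
```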

A couple of things to keep in mind here: First, in this case you will be reporting on the locations where the keys are actually stored. If you are using open addressing, the address where a key is stored is not necessarily the address computed by the hash function. Second, if you are doing chaining, there could be more than one key in a given slot.

The s command is not meant to be used for large tables, so just write your program to use one line for each occupied table slot. However, write the code so it entirely skips any empty table slot. The program must write nothing, not even a newline, corresponding to an empty slot in the table.

For example: when there are 80 keys and a minimum table size of 100, the first few lines of the response to the s command might look like this:


===========================================================================
s
SHOW COMMAND

INDEX      VALUES

0      254-40-5163  
5      576-29-0451  
9      957-32-7903  861-27-9125  031-39-2930  
13     126-56-3315  059-97-5934  202-75-3271  114-71-9146  
14     261-64-6675  
15     461-60-6678  
16     962-95-6236  435-92-9550  
17     612-12-0516  579-19-6940

This particular display was written by a version of the program that uses chaining as the collision-resolution policy, so there is more than one key stored in some of the table slots. Note that there are "lines missing" in the display where the program has refrained from printing output corresponding to empty table slots. This is a desirable thing: the output is more compact this way.
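
The rule about empty slots can be sketched as follows for a chained table. The ChainTable layout and the name showRows are assumptions made for illustration, and the column spacing is simpler than in the display above.

```cpp
#include <cstddef>
#include <list>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical table layout: one chain of formatted keys per slot.
typedef std::vector<std::list<std::string> > ChainTable;

// Builds the body of the show report: one line per occupied slot,
// and nothing at all (not even a newline) for an empty slot.
std::string showRows(const ChainTable& table) {
    std::ostringstream out;
    for (std::size_t i = 0; i < table.size(); ++i) {
        if (table[i].empty()) continue;   // skip empty slots entirely
        out << i;
        for (const std::string& key : table[i]) out << "  " << key;
        out << '\n';
    }
    return out.str();
}
```

The caller would print the header line first and then write the returned text to standard output.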

The r command stands for retrieve. This command requires a parameter -- a single social security number (a key) to appear after the r. The program will echo the command and if possible find the key in the table. The program will then report the address to which the key hashes, the current value of the counter for that address, and the address at which the key is actually stored. Each of these three pieces of information must be appropriately labelled. For example, consider the output appropriate to the commands:


r 598-21-8634 
r 564-64-7123

The first number is in the table and the second is not. The following would be appropriate output:


===========================================================================
r 598-21-8634
RETRIEVE COMMAND

Key             Hash Address    Counter Value   Location in Table
598-21-8634     78              2               78
===========================================================================
r 564-64-7123
RETRIEVE COMMAND

Key             Hash Address    Counter Value   Location in Table
564-64-7123
NOT FOUND
===========================================================================

Note the use of lines of equal-signs to separate the output corresponding to different commands. Note also the error message. The program is required to write such a message when the key does not exist in the table.

The i command stands for insert. Like the r command, this command requires a parameter -- a key value coming next in the input. The program will echo the command and insert the key into the table (unless the table already contains such an element or the table is full). If insertion fails the program will print an appropriate message. If insertion succeeds the program will print the same three pieces of information as the r command. Of course the counter value will be the new value: one greater than the old value, because of the insertion.

The d command stands for delete. Like r and i, this command requires a parameter. In this case the program will echo the command and delete the element if it is present. Like r and i, it will print the hash address, new counter value, and actual location. If the operation fails because the key was not found (the only valid reason) then the program will print an appropriate message.

The q command stands for quit. It will occur exactly once, at the end of the input.

There can be any number of s, r, i, and d commands after the initial input. They can appear mixed in any order. The numbers minTableSize and numKeys will both be positive integers and numKeys will be less than minTableSize.
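
The command-processing phase might be organized as a simple loop. The sketch below only echoes each command beneath a separator line. The name echoCommands and the separator width of 75 are arbitrary choices of mine, and a real program would call its own show, retrieve, insert, or delete routine where the comment indicates.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reads commands until q (or end of input) and returns the echoed
// text. The commands r, i, and d are each followed by one key.
std::string echoCommands(std::istream& in) {
    std::ostringstream out;
    const std::string rule(75, '=');   // separator line; width is arbitrary
    char cmd;
    while (in >> cmd && cmd != 'q') {
        out << rule << '\n' << cmd;
        if (cmd == 'r' || cmd == 'i' || cmd == 'd') {
            std::string key;
            in >> key;                 // the command's parameter
            out << ' ' << key;
        }
        out << '\n';
        // A full program would dispatch on cmd here and print the
        // report for that command.
    }
    return out.str();
}
```

Reading with the >> operator skips spaces and newlines, so the loop does not depend on how the commands are spread across lines.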

PSEUDO-RANDOM NUMBER GENERATION:

In order to test your program, you will need a small program that can generate sets of pseudo-random social security numbers. For that, you can use the little program called makeInFile.cpp included in the directory with this assignment.

Unless the set of keys you generate is very large indeed, it will be almost certain that all the keys are distinct (different from one another). However, this is not a perfect certainty.

You can use the unix sort and uniq commands to check your key sets for duplicates. If you find your set has a duplicate, you can repair the set by replacing the duplicate with a number you know is not already in the set. If you read the man pages on sort and uniq, I think you will probably know what to do, but speak to me if you have questions.

CALCULATION OF PRIMES:

In this program you need to have a function that inputs a number m and returns the smallest prime number p >= m. We can discuss this in class, and I can give you some simple pseudo-code.
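
In the meantime, here is one simple sketch based on trial division. It is not necessarily the pseudo-code I will hand out, and the names isPrime and nextPrime are only suggestions.

```cpp
// True when n is prime. Tries each divisor d with d * d <= n,
// written as d <= n / d to avoid overflow.
bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d <= n / d; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

// Smallest prime p with p >= m.
int nextPrime(int m) {
    int p = (m < 2) ? 2 : m;
    while (!isPrime(p)) ++p;
    return p;
}
```

For the examples in this document, nextPrime(20) is 23 and nextPrime(1000) is 1009.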

DYNAMIC ALLOCATION OF ARRAYS:

This program has to allocate space for the hash table based on the size value from the input. See page 156 in your text book for information on how to allocate an array dynamically. Incidentally, my copy of the book has a typographical error there. The example should read:


int ArraySize = 50 ;
double * B = new double[ArraySize] ;

CHOICE OF IMPLEMENTATION:

If you want to save time writing this program, consider using chaining as your collision resolution policy.

Unless you get a lot of help by re-using classes from some source, chaining is probably the easiest way to implement the collision resolution. Deletion is a little complex to implement when using open-addressing, but it is straightforward when using chaining.

You can use the linked list class from the text book to implement chains. I have included the code for the linked list class in the directory containing this assignment.
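
To show why deletion is straightforward with chaining, here is a minimal sketch. It makes several assumptions: std::list stands in for the text book's linked list class, keys are stored as numbers, and the counter for an address is decremented when a key is deleted. The struct name ChainedTable and its members are hypothetical.

```cpp
#include <algorithm>
#include <list>
#include <vector>

struct ChainedTable {
    std::vector<std::list<long> > slots;  // one chain per address
    std::vector<int> counters;            // one counter per address

    explicit ChainedTable(int tableSize)
        : slots(tableSize), counters(tableSize, 0) {}

    int address(long key) const {
        return static_cast<int>(key % static_cast<long>(slots.size()));
    }

    bool contains(long key) const {
        const std::list<long>& chain = slots[address(key)];
        return std::find(chain.begin(), chain.end(), key) != chain.end();
    }

    // Returns false if the key is already in the table.
    bool insert(long key) {
        if (contains(key)) return false;
        int a = address(key);
        slots[a].push_back(key);
        ++counters[a];
        return true;
    }

    // Returns false if the key was not found.
    bool remove(long key) {
        if (!contains(key)) return false;
        int a = address(key);
        slots[a].remove(key);   // unlink the key from its chain
        --counters[a];          // assumption: deletion lowers the counter
        return true;
    }
};
```

With chaining, a key is always stored at its hash address, so the "location in table" reported by r, i, and d is simply address(key).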

TESTING:

You are responsible for deciding what sort of tests need to be done on your program to check it for correctness. I expect you to design your own test inputs. The sample inputs illustrated in this document are not adequate for thorough testing. Your test scripts must demonstrate adequate code coverage and data coverage.

When I do my testing of your program, time will be of the essence because I will be testing the programs of many students in your class and in other classes. I will prepare one or more files containing inputs to test your programs. I will run your program in a mode like this:

myNameForYourProgram < myinput > myoutput

I will then examine the contents of myoutput to see if it differs significantly from the output file I get when I run my own solution program.

If I have to change your program or one of my input files in order to get your program to accept my input, you will lose points.

WHAT TO TURN IN:

You will be sending me two e-mail messages. Please follow these rules:

Always send me e-mail as plain text in the main message body. Never send me attachments.
I will tell you what subject line to use with each message, and I need you to use exactly the subject lines I give you. (I get hundreds of e-mail messages at a time and your subject line allows me to sort messages.)

Here is the list of things you have to turn in:

Send the following items to me by e-mail before midnight on the first due date:

A level three version of the source code and a script showing your test runs. Include all the source files (*.h files and *.cpp files) that are required to compile the program into an executable image -- everything I need so that I can compile and test your program. Combine all the source files and the script into one shell archive file and e-mail me the archive file with the subject line: CS3100,prog5.3.

Send the following items to me by e-mail before midnight on the second due date:

A final version of the source code and a script showing your test runs. Include all the source files (*.h files and *.cpp files) that are required to compile the program into an executable image -- everything I need so that I can compile and test your program. Combine all the source files and the script into one shell archive file and e-mail me the archive file with the subject line: CS3100,prog5.f.

Note that there are no spaces in the subject lines given. It is important that you do not insert any spaces. My e-mail address is: john@ishi.csustan.edu.

DUE DATES:

For the due dates, see the class schedule.