(Latest Revision: 11/09/99)

HASHING SOCIAL SECURITY NUMBERS


DUE DATES:

This is a version of a problem that appeared in "Data Structures with Abstract Data Types and Pascal," by Stubbs and Webre.

Write a program that implements a hash table of social security numbers.

All input will come from standard input and all output will go to standard output.

Input looks like this:
20
15
259-97-6223 808-93-0709 636-86-3660 399-88-2968
041-39-9350 397-67-7732 522-23-1899 762-23-8562
927-06-2355 087-75-3173 296-59-8912 052-61-3718
910-26-9707 080-42-6115 917-34-3630
s
r 917-34-3630
r 911-34-3623
i 911-43-6298
d 808-93-0709
q
The first number in the input (20 in the example) is an integer, minTableSize. After reading it the program will calculate the table size, and allocate a hash table of that size. The table size must be the smallest prime number not less than minTableSize. In the example above, the table size will be 23.

The second number in the input (15 in the example) is an integer, numKeys. After reading numKeys, the program will read numKeys keys (social security numbers) from the input, formatted as shown, and insert each one into the hash table. (No two of these keys are allowed to be equal.) There must be exactly numKeys keys following numKeys. In the example, there are exactly 15 keys.

The program will calculate the hash address by treating the key as a 9-digit number N and taking N % tableSize as the address. In the example above, 041-39-9350 hashes to the address

41399350 % 23 = 17

(Note that C++ interprets an integer with a leading zero as an *octal* constant, and so the expression above would be incorrect if written as 041399350 % 23 = 17.)

The program will use counters to keep track of how many keys have hashed to each possible address. The possible addresses are 0 .. tableSize-1. Note that these counter values are independent of the collision resolution policy. When a key is hashed, a counter corresponding to the original hash address is incremented, regardless of where the program decides to actually place the key.

After processing the numKeys key values, the program will output a tabular report showing how many of the locations in the hash table had their counter end up equal to 0, 1, 2, ... , up to the largest size of any counter. The report will also compare that data with the numbers predicted by the "perfect randomizer" expression:

exp(-alpha) * (alpha ** k) / k!

This expression, simply put, is the expected frequency of locations in the table whose counter stops at k, if the hash function has no "bias." The frequency is expressed as a decimal fraction of the table size. The quantity alpha is the "load factor", the ratio of numKeys to tableSize.

For example, suppose that minTableSize is 1000 and numKeys is 800. Then tableSize is 1009, alpha = 800/1009 = 0.793, and for k = 3 the formula exp(-alpha) * (alpha ** k) / k! = (0.453) * (0.498) / 6 = 0.038 (approximately).

Thus if we are hashing 800 keys into a table with 1009 slots, we would expect that a hash function that is a perfect randomizer would result in about 38 positions with counter equal to 3.

Your program's output will show the number 38 and the actual number of locations that had counter size 3, so that a comparison of the experimental and ideal values can be made. The report will be set up so that this comparison can be done for all counter sizes that occur.

One good way to do this is to print a 4-column report: first column is for the number k, second column for the predicted values, third column for the actual values, and fourth column expressing the actual value as a percentage of the ideal value.

After printing the tabular report, the program will process commands coming from the input. The possible commands are s, r, i, d, and q.

The s command stands for show. In response to this command, the program will print a representation of what is in the hash table. This should be a two-column report: first column is table indices, and second column is key(s) stored in the location given by the index.

A couple of things to note here: First, in this case you will be reporting on the locations where the keys are actually stored. If you are using open addressing, this is not necessarily the address the key hashes to initially. Second, If you are doing chaining, there could be more than one key in a given slot.

The s command is not meant to be used for large tables, so just write the code so it uses one line for each occupied array slot. Also, write the code so it entirely skips any empty slot No index is written in that case and no line is used up on the output. The program has to write the social security number in "standard form" -- that is with dashes in the right places.

The r command stands for retrieve. This command requires a single social security number (a key) to appear right after the r. The program will find the key in the table, report which address the key hashes to, the current value of the counter for that address, and also which address the key is actually stored in. Each of these three pieces of information must be appropriately labelled. For example:

hash address: 4 ; counter: 3 ; location: 6

If the key is not found, an appropriate message must be printed.

The i command stands for insert. This command, just like r, requires a key value next in the input. The program will insert the key into the table, unless the table already contains such an element, or unless the table is full. If insertion fails the program will print an appropriate message. If insertion succeeds the program will print the same three pieces of information as the r command. Of course the counter value will be the new value: one greater than the old value, because of the insertion.

The d command stands for delete. This command requires a parameter just like r and i. In this case the program will delete the element if it is present. Like r and i, it will print the hash address, new counter value, and actual location. If the operation fails because the key was not found (the only valid reason) then the program will print an appropriate message.

The q command stands for quit.

There can be any number of s, r, i, and d commands after the initial input. They can appear mixed into any order. The numbers minTableSize and numKeys will both be positive integers and numKeys will be less than minTableSize.

ECHOING

Except for the list of keys that comes after numKeys, the program will echo each line of input before writing the output called for by that line of input. The output for different commands will be appropriately delimited by lines of pound signs (#) or some such character.

RANDOM NUMBER GENERATION

In order to test your program, you will have to write a small program that can generate sets of random social security numbers. We will discuss this problem in class. I can show you an easy way to do it.

Unless the set of keys you generate is very large, they are all likely to be different from one another, but this is not a certainty.

You can use the unix sort and uniq commands to check your key sets for duplicates, and then "repair" any sets you find had duplicates. If you read the man pages on sort and uniq, I think you'll probably know what to do, but let me know if you have questions.

CALCULATION OF PRIMES

In this program you need to have a function that inputs a number m and returns the smallest prime number p >= m. We can discuss in class too, and I can give you some simple pseudo code.

DYNAMIC ALLOCATION OF ARRAYS

This program has to allocate space for the hash table based on the size value from the input. See page 156 in your text book for information on how to allocate an array dynamically.

CHOICE OF IMPLEMENTATION

If you want to save time writing this program, consider using chaining as your collision resolution policy. Unless you get a lot of help by reusing classes from somewhere, chaining is probably the easiest way to implement. Deletion is a little complex when open addressing is used. You can use the linked list class from the text book to implement chains.