(Latest Revision: 11/09/99)
HASHING SOCIAL SECURITY NUMBERS
DUE DATES:
- Level 03 Program Due Friday, Dec 03. Turn in copies of all
your source files plus a script showing your test runs. You
must demonstrate adequate testing of the program. You must
shar all these materials into a single archive file and
send them via e-mail before midnight, or they will be
considered late and will not be accepted.
- Final Level Program Due Friday, Dec 10. Turn in copies of
all your source files plus a script showing your test runs.
You must demonstrate adequate testing of the program. You must
shar all these materials into a single archive file and
send them via e-mail before midnight, or they will be
considered late and will not be accepted.
This is a version of a problem that appeared in "Data
Structures with Abstract Data Types and Pascal," by Stubbs and
Webre.
Write a program that implements a hash table of social security
numbers.
All input will come from standard input and all output will go
to standard output.
Input looks like this:
20
15
259-97-6223 808-93-0709 636-86-3660 399-88-2968
041-39-9350 397-67-7732 522-23-1899 762-23-8562
927-06-2355 087-75-3173 296-59-8912 052-61-3718
910-26-9707 080-42-6115 917-34-3630
s
r 917-34-3630
r 911-34-3623
i 911-43-6298
d 808-93-0709
q
The first number in the input (20 in the example) is an
integer, minTableSize. After reading it the program will
calculate the table size, and allocate a hash table of that
size. The table size must be the smallest prime number not
less than minTableSize. In the example above, the table size
will be 23.
The second number in the input (15 in the example) is an
integer, numKeys. After reading numKeys, the program will read
numKeys keys (social security numbers) from the input,
formatted as shown, and insert each one into the hash table.
(No two of these keys are allowed to be equal.) There must be
exactly numKeys keys following numKeys. In the example, there
are exactly 15 keys.
The program will calculate the hash address by treating the key
as a 9-digit number N and taking N % tableSize as the address.
In the example above, 041-39-9350 hashes to the address
41399350 % 23 = 17
(Note that C++ interprets an integer with a leading zero as an
*octal* constant, and so the expression above would be
incorrect if written as 041399350 % 23 = 17.)
The program will use counters to keep track of how many keys
have hashed to each possible address. The possible addresses
are 0 .. tableSize-1. Note that these counter values are
independent of the collision resolution policy. When a key is
hashed, a counter corresponding to the original hash address is
incremented, regardless of where the program decides to
actually place the key.
After processing the numKeys key values, the program will
output a tabular report showing how many of the locations in
the hash table had their counter end up equal to 0, 1, 2, ... ,
up to the largest size of any counter. The report will also
compare that data with the numbers predicted by the "perfect
randomizer" expression:
exp(-alpha) * (alpha ** k) / k!
This expression, simply put, is the expected frequency of
locations in the table whose counter stops at k, if the hash
function has no "bias." The frequency is expressed as a
decimal fraction of the table size. The quantity alpha is the
"load factor", the ratio of numKeys to tableSize.
For example, suppose that minTableSize is 1000 and numKeys is
800. Then tableSize is 1009, alpha = 800/1009 = 0.793, and for
k = 3 the formula exp(-alpha) * (alpha ** k) / k! = (0.453) *
(0.498) / 6 = 0.038 (approximately).
Thus if we are hashing 800 keys into a table with 1009 slots,
we would expect that a hash function that is a perfect
randomizer would result in about 38 positions with counter
equal to 3.
Your program's output will show the number 38 and the actual
number of locations that had counter size 3, so that a
comparison of the experimental and ideal values can be made.
The report will be set up so that this comparison can be done
for all counter sizes that occur.
One good way to do this is to print a 4-column report: first
column is for the number k, second column for the predicted
values, third column for the actual values, and fourth column
expressing the actual value as a percentage of the ideal value.
After printing the tabular report, the program will process
commands coming from the input. The possible commands are
s, r, i, d, and q.
The s command stands for show. In response to this
command, the program will print a representation of what is in
the hash table. This should be a two-column report: first
column is table indices, and second column is key(s) stored in
the location given by the index.
A couple of things to note here: First, in this case you will
be reporting on the locations where the keys are actually
stored. If you are using open addressing, this is not
necessarily the address the key hashes to initially. Second,
If you are doing chaining, there could be more than one key in
a given slot.
The s command is not meant to be used for large tables, so just
write the code so it uses one line for each occupied array
slot. Also, write the code so it entirely skips any empty slot
No index is written in that case and no line is used up on the
output. The program has to write the social security number in
"standard form" -- that is with dashes in the right places.
The r command stands for retrieve. This command
requires a single social security number (a key) to appear
right after the r. The program will find the key in the table,
report which address the key hashes to, the current value of
the counter for that address, and also which address the key is
actually stored in. Each of these three pieces of information
must be appropriately labelled. For example:
hash address: 4 ; counter: 3 ; location: 6
If the key is not found, an appropriate message must be
printed.
The i command stands for insert. This command, just
like r, requires a key value next in the input. The program
will insert the key into the table, unless the table already
contains such an element, or unless the table is full. If
insertion fails the program will print an appropriate message.
If insertion succeeds the program will print the same three
pieces of information as the r command. Of course the counter
value will be the new value: one greater than the old
value, because of the insertion.
The d command stands for delete. This command requires a
parameter just like r and i. In this case the program will
delete the element if it is present. Like r and i, it will
print the hash address, new counter value, and actual
location. If the operation fails because the key was not found
(the only valid reason) then the program will print an
appropriate message.
The q command stands for quit.
There can be any number of s, r, i, and d commands after the
initial input. They can appear mixed into any order. The
numbers minTableSize and numKeys will both be positive integers
and numKeys will be less than minTableSize.
ECHOING
Except for the list of keys that comes after numKeys, the
program will echo each line of input before writing the output
called for by that line of input. The output for different
commands will be appropriately delimited by lines of pound
signs (#) or some such character.
RANDOM NUMBER GENERATION
In order to test your program, you will have to write a small
program that can generate sets of random social security
numbers. We will discuss this problem in class. I can show
you an easy way to do it.
Unless the set of keys you generate is very large, they are all
likely to be different from one another, but this is not a
certainty.
You can use the unix sort and uniq commands to check your key
sets for duplicates, and then "repair" any sets you find had
duplicates. If you read the man pages on sort and uniq, I
think you'll probably know what to do, but let me know if you
have questions.
CALCULATION OF PRIMES
In this program you need to have a function that inputs a
number m and returns the smallest prime number p >= m. We can
discuss in class too, and I can give you some simple pseudo
code.
DYNAMIC ALLOCATION OF ARRAYS
This program has to allocate space for the hash table based on
the size value from the input. See page 156 in your text book
for information on how to allocate an array dynamically.
CHOICE OF IMPLEMENTATION
If you want to save time writing this program, consider using
chaining as your collision resolution policy. Unless you get a
lot of help by reusing classes from somewhere, chaining is
probably the easiest way to implement. Deletion is a little
complex when open addressing is used. You can use the linked
list class from the text book to implement chains.