(Latest Revision: 11/18/2000)
Hashing Social Security Numbers
PRELIMINARIES:
Before working on this problem you need to read about hashing in the second
half of chapter 12.
As always, you need to be familiar with the directions and examples here:
http://shalim.csustan.edu/~john/Classes/General_Info/progAsgRules/
THE ASSIGNMENT:
Write a program that implements a hash table of social security
numbers.
INPUT & OUTPUT:
The program reads from standard input. Input looks like this:
20
15
259-97-6223 808-93-0709 636-86-3660 399-88-2968
041-39-9350 397-67-7732 522-23-1899 762-23-8562
927-06-2355 087-75-3173 296-59-8912 052-61-3718
910-26-9707 080-42-6115 917-34-3630
s
r 917-34-3630
r 911-34-3623
i 911-43-6298
d 808-93-0709
q
The first number in the input (20 in the example) is an
integer, minTableSize. After reading it the program will
calculate the table size, and allocate a hash table of that
size. The table size must be the smallest prime number not
less than minTableSize. In the example above, the table size
will be 23.
The second number in the input (15 in the example) is an
integer, numKeys. After reading numKeys, the program will read
that many keys (social security numbers) from the input. The
keys will be formatted with hyphens as shown.
The program will insert each key into the hash table.
No two of these keys are allowed to be equal. The number of keys in
this section of the input must be exactly equal to the value of numKeys.
This is illustrated by the example above, where numKeys = 15 and there are
exactly 15 keys in the section of the input following numKeys.
The program will calculate the hash address by treating the key
as a 9-digit number N and taking N % tableSize as the address.
In the example above, 041-39-9350 hashes to the address
41399350 % 23 = 17
(Note that C++ interprets an integer literal with a leading zero
as an *octal* constant, so the number must not be written in
source code as 041399350. That literal would not even compile,
because 9 is not an octal digit.)
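One way to sidestep the octal problem is to build N from the key's digit characters at run time. The following is only a sketch: the function names ssnToNumber and hashAddress are made up for illustration and are not required by the assignment.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Convert a key such as "041-39-9350" to the number 41399350 by
// keeping only the digit characters. Building the value digit by
// digit means no octal interpretation can occur.
long ssnToNumber(const std::string& ssn) {
    long n = 0;
    for (char c : ssn) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            n = n * 10 + (c - '0');
        }
    }
    return n;
}

// Hash address = N % tableSize, as described above.
int hashAddress(const std::string& ssn, int tableSize) {
    assert(tableSize > 0);
    return static_cast<int>(ssnToNumber(ssn) % tableSize);
}
```

With a table size of 23, hashAddress("041-39-9350", 23) gives 17, matching the example above.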
The program will use counters to keep track of how many keys
have hashed to each address in the hash table. The program needs
to have a different counter for each address. The addresses are
0 .. tableSize-1. Each time the program encounters a new key, it
calculates the hash address using the hash function, and
increments the counter corresponding to that address.
[Note: the collision-resolution policy has no effect on which
counter the program increments. The program always increments the
counter that corresponds to the (original) hash address, even if
the collision-resolution policy places the element in a location
different from the original hash address.]
After processing the numKeys key values, the program will
output a tabular report showing how many of the counters ended
at 0, at 1, at 2, and so on, up to the largest value that any of
the counters reached. The report will also compare these counts
with the values predicted by the "perfect randomizer" expression:
( Expected Frequency(k) ) = exp(-alpha) * (alpha # k) / (k!)
This expression is the expected frequency of locations in the
table whose counter stops at k. The expression is valid if the
hash function has no bias. The frequency is expressed as a
decimal fraction of the table size.
The quantity exp(-alpha) refers to the exponential function:
exp(-alpha) is "e to the (-alpha) power." Alpha is the load
factor, the ratio (numKeys / tableSize). The pound sign (#)
stands for exponentiation (alpha # k means "alpha to the kth
power"). Of course, (k!) means "k factorial".
Incidentally, if you include the math.h library, you will have an
"exp" function available to you, and also a "pow" function for
calculating powers of numbers. You can read about how to use
these functions in the "man" pages on your Unix system. Just do
"man exp" and "man pow."
To illustrate with a specific example, suppose that minTableSize
is 1000 and numKeys is 800. Then tableSize is 1009 and
alpha = 800/1009 = 0.793. Using k = 3, the formula
exp(-alpha) * (alpha # k) / (k!) = (0.453) * (0.498) / 6, which
yields a value of 0.038 (approximately).
The 0.038 means that if we are hashing 800 keys into a table with
1009 slots, the statistical model predicts there will be 3-way
collisions at about 38 different locations in the table. (We get
the number 38 by this calculation:
(0.038) * (tableSize) = (0.038) * (1009) = 38.342, which rounds
to 38.)
In the case at hand, one of the rows in the report will show the
number 38 alongside the actual number of 3-way collisions
recorded when the program hashed the 800 keys from the input. The
report allows us to make a comparison between the value predicted
by the statistical model and the value that actually turned up.
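The arithmetic in the example above can be checked with a few lines of C++. This is a sketch only. The name expectedFrequency is illustrative, and it uses the cmath header (the C++ spelling of math.h).

```cpp
#include <cassert>
#include <cmath>

// Expected fraction of table slots whose counter stops at k:
//     exp(-alpha) * (alpha to the kth power) / (k factorial)
double expectedFrequency(double alpha, int k) {
    assert(k >= 0);
    double factorial = 1.0;
    for (int i = 2; i <= k; ++i) {
        factorial *= i;
    }
    return std::exp(-alpha) * std::pow(alpha, k) / factorial;
}
```

Calling expectedFrequency(800.0 / 1009.0, 3) gives roughly 0.038, and multiplying by the table size of 1009 gives a predicted count that rounds to 38.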
Lay out the report as a 4-column array: the first column is for
the number k, the second column for the predicted values, the
third column for the actual values, and the fourth column for the
actual value expressed as a percentage of the predicted value.
The table corresponding to the example of 800 keys hashed into a
table with 1009 slots might look like this:
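slot count   predicted   actual   act/pred
0            457         452      99%
1            362         382      106%
2            144         119      83%
3            38          45       119%
4            8           10       133%
5            1           1        84%
(The percentages here are computed from the unrounded predicted
values, which is why 10 against a displayed prediction of 8 shows
as 133%.)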
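A report like the one above could be produced along the following lines. This is a sketch under assumptions: the counters are held in a std::vector<int> with one entry per table address, the table is not empty, and the names countFrequencies and printReport are made up for illustration. The column widths are arbitrary.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// actual[k] = number of table addresses whose counter ended at exactly k.
std::vector<int> countFrequencies(const std::vector<int>& counters) {
    int maxCount = 0;
    for (int c : counters) maxCount = std::max(maxCount, c);
    std::vector<int> actual(maxCount + 1, 0);
    for (int c : counters) ++actual[c];
    return actual;
}

// Print one row per k from 0 up to the largest counter value.
// Assumes counters is non-empty (tableSize > 0).
void printReport(const std::vector<int>& counters, int numKeys) {
    const double tableSize = static_cast<double>(counters.size());
    const double alpha = numKeys / tableSize;            // load factor
    const std::vector<int> actual = countFrequencies(counters);

    std::printf("%-12s %-11s %-8s %s\n",
                "slot count", "predicted", "actual", "act/pred");
    double term = std::exp(-alpha);                      // value for k = 0
    for (std::size_t k = 0; k < actual.size(); ++k) {
        if (k > 0) term *= alpha / k;                    // now exp(-alpha) * alpha^k / k!
        const double predicted = term * tableSize;       // unrounded prediction
        std::printf("%-12zu %-11.0f %-8d %.0f%%\n",
                    k, predicted, actual[k], 100.0 * actual[k] / predicted);
    }
}
```

The percentage is taken against the unrounded prediction, which is one way to get figures like those in the sample table.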
After printing the tabular report, the program will process
commands coming from the input. The possible commands are
s, r, i, d, and q.
The s command stands for show. In response to this
command, the program will echo the command and print a report
representing what is in
the hash table. The report must have a two-column format: the first
column is for table indices, and the second column is for key(s) stored in
the location given by the index. The program has to write social
security numbers in "standard form" -- that is with dashes in the right places.
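If you store keys as numbers, you need to restore the leading zeros and dashes when printing. Here is a minimal sketch. The name formatSsn is illustrative, and it assumes the key is in the range 0 to 999999999.

```cpp
#include <cstdio>
#include <string>

// Format a 9-digit number as DDD-DD-DDDD, padding each group with
// leading zeros so that 41399350 prints as 041-39-9350.
std::string formatSsn(long n) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%03ld-%02ld-%04ld",
                  n / 1000000,          // first three digits
                  (n / 10000) % 100,    // middle two digits
                  n % 10000);           // last four digits
    return std::string(buf);
}
```

If you store the keys as strings exactly as they were read, no conversion is needed at all.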
A couple of things to keep in mind here: First, in this case you
will be reporting on the locations where the keys are actually
stored. If you are using open addressing, the address where a key
is stored is not necessarily the address computed by the hash
function. Second, if you are doing chaining, there could be more
than one key in a given slot.
The s command is not meant to be used for large tables, so just
write your program to use one line for each occupied table slot.
However, write the code so it entirely skips any empty table
slot. The program must write nothing, not even a newline,
corresponding to an empty slot in the table.
For example: when there are 80 keys and a minimum table size of 100, the
first few lines of the response to the s command might look like this:
===========================================================================
s
SHOW COMMAND
INDEX   VALUES
0       254-40-5163
5       576-29-0451
9       957-32-7903 861-27-9125 031-39-2930
13      126-56-3315 059-97-5934 202-75-3271 114-71-9146
14      261-64-6675
15      461-60-6678
16      962-95-6236 435-92-9550
17      612-12-0516 579-19-6940
This particular display was written by a version of the program
that uses chaining as the collision-resolution policy, so there
is more than one key stored in some of the table slots. Note that
there are "lines missing" in the display where the program has
refrained from printing output corresponding to empty table
slots. This is a desirable thing: the output is more compact this
way.
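The body of the s command for a chained table might be sketched as follows. The ChainedTable type (a std::vector of std::list chains holding already-formatted keys) and the name formatShow are assumptions made for this illustration. Your own table may be organized differently.

```cpp
#include <list>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical table type: one chain of formatted keys per slot.
typedef std::vector<std::list<std::string> > ChainedTable;

// Build the two-column body of the show report: the index, then
// every key stored in that slot, one line per occupied slot.
std::string formatShow(const ChainedTable& table) {
    std::ostringstream out;
    for (std::size_t i = 0; i < table.size(); ++i) {
        if (table[i].empty()) continue;   // nothing at all for an empty slot
        out << i;
        for (const std::string& key : table[i]) {
            out << ' ' << key;
        }
        out << '\n';
    }
    return out.str();
}
```

Returning a string rather than writing directly to cout is just a convenience that makes the function easy to test.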
The r command stands for retrieve. This command requires a
parameter -- a single social security number (a key) that appears
after the r. The program will echo the command and, if possible,
find the key in the table. The program will then report the
address to which the key hashes, the current value of the counter
for that address, and the address at which the key is actually
stored. Each of these three pieces of information must be
appropriately labelled. For example, consider the output
appropriate to the commands:
r 598-21-8634
r 564-64-7123
The first number is in the table and the second is
not. The following would be appropriate output:
===========================================================================
r 598-21-8634
RETRIEVE COMMAND
Key           Hash Address   Counter Value   Location in Table
598-21-8634   78             2               78
===========================================================================
r 564-64-7123
RETRIEVE COMMAND
Key           Hash Address   Counter Value   Location in Table
564-64-7123
NOT FOUND
===========================================================================
Note the use of lines of equal-signs to separate the output corresponding
to different commands. Note also the error message. The program is
required to write such a message when the key does not
exist in the table.
The i command stands for insert. Like the r command, this command
requires a parameter -- a key value coming next in the input. The
program will echo the command and insert the key into the table
(unless the table already contains that key or the table is
full). If insertion fails the program will print an appropriate
message. If insertion succeeds the program will print the same
three pieces of information as the r command. Of course the
counter value will be the new value: one greater than the old
value, because of the insertion.
The d command stands for delete. Like r and i, this command
requires a parameter. In this case the program will echo the
command and delete the element if it is present. Like r and i, it
will print the hash address, new counter value, and actual
location. If the operation fails because the key was not found
(the only valid reason) then the program will print an
appropriate message.
The q command stands for quit. It will occur exactly once,
at the end of the input.
There can be any number of s, r, i, and d commands after the
initial input. They can appear mixed in any order. The
numbers minTableSize and numKeys will both be positive integers
and numKeys will be less than minTableSize.
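The overall shape of the input can be sketched as a small parser. This is an illustration under assumptions: the Command and ParsedInput types and the parseInput function are made-up names. A real solution would more likely act on each key and command as it is read rather than collect them first.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct Command {
    char op;            // 's', 'r', 'i', 'd', or 'q'
    std::string key;    // empty for s and q
};

struct ParsedInput {
    int minTableSize;
    int numKeys;
    std::vector<std::string> keys;
    std::vector<Command> commands;
};

// Read minTableSize, numKeys, exactly numKeys keys, and then the
// commands up to and including q. The >> operator skips spaces
// and newlines, so the line layout of the keys does not matter.
ParsedInput parseInput(std::istream& in) {
    ParsedInput p;
    p.minTableSize = 0;
    p.numKeys = 0;
    in >> p.minTableSize >> p.numKeys;

    std::string token;
    for (int i = 0; i < p.numKeys && (in >> token); ++i) {
        p.keys.push_back(token);
    }
    while (in >> token) {
        Command c;
        c.op = token[0];
        if (c.op == 'r' || c.op == 'i' || c.op == 'd') {
            in >> c.key;                // these commands take one key
        }
        p.commands.push_back(c);
        if (c.op == 'q') break;         // q ends the input
    }
    return p;
}
```

In the real program you would pass std::cin as the stream, since the input arrives on standard input.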
PSEUDO-RANDOM NUMBER GENERATION:
In order to test your program, you will need a small
program that can generate sets of pseudo-random social security
numbers. For that, you can use the little program called makeInFile.cpp
included in the directory with this assignment.
Unless the set of keys you generate is very large indeed, it will
be almost certain that all the keys are distinct (different from
one another). However, this is not a perfect certainty.
You can use the unix sort and uniq commands to check your key
sets for duplicates. If you find your set has a duplicate, you
can repair the set by replacing the duplicate with a number you
know is not already in the set. If you read the man pages on sort
and uniq, I think you will probably know what to do, but speak to
me if you have questions.
CALCULATION OF PRIMES:
In this program you need to have a function that takes a
number m and returns the smallest prime number p >= m. We can
discuss this in class, and I can give you some simple pseudo-code.
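As a starting point, here is one simple way to do it with trial division. This is a sketch, not the pseudo-code mentioned above. The names isPrime and nextPrime are illustrative, and the code assumes m is small enough that the search does not overflow an int.

```cpp
// True if n is prime. Tries every divisor d with d * d <= n; the
// test is written as d <= n / d to avoid overflow in d * d.
bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d <= n / d; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

// Smallest prime p with p >= m.
int nextPrime(int m) {
    int p = (m < 2) ? 2 : m;
    while (!isPrime(p)) {
        ++p;
    }
    return p;
}
```

For the examples in this document, nextPrime(20) is 23 and nextPrime(1000) is 1009.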
DYNAMIC ALLOCATION OF ARRAYS:
This program has to allocate space for the hash table based on
the size value from the input. See page 156 in your text book
for information on how to allocate an array dynamically.
Incidentally, my copy of the book has a typographical error there.
The example should read:
int ArraySize = 50 ;
double * B = new double[ArraySize] ;
CHOICE OF IMPLEMENTATION:
If you want to save time writing this program, consider using
chaining as your collision resolution policy.
Unless you get a lot of help by re-using classes from some
source, chaining is probably the easiest way to implement the
collision resolution. Deletion is a little complex to implement
when using open-addressing, but it is straightforward when using
chaining.
You can use the linked list class from the text book
to implement chains. I have included the code for the linked
list class in the directory containing this assignment.
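To give a feel for how little code chaining needs, here is a rough sketch that uses std::list from the standard library in place of the text book's linked list class. The class name ChainedHashTable and its member names are made up for illustration. The sketch assumes that a successful delete decrements the counter for the key's hash address, which is one reading of "new counter value" in the description of the d command.

```cpp
#include <algorithm>
#include <list>
#include <vector>

class ChainedHashTable {
public:
    explicit ChainedHashTable(int tableSize)
        : slots_(tableSize), counters_(tableSize, 0) {}

    // Hash address of a key: N % tableSize.
    int address(long key) const {
        return static_cast<int>(key % static_cast<long>(slots_.size()));
    }

    // Current counter value for the key's hash address.
    int counter(long key) const { return counters_[address(key)]; }

    // Returns false if the key is already in the table.
    bool insert(long key) {
        std::list<long>& chain = slots_[address(key)];
        if (std::find(chain.begin(), chain.end(), key) != chain.end()) {
            return false;
        }
        chain.push_back(key);
        ++counters_[address(key)];
        return true;
    }

    // Returns false if the key was not found.
    bool remove(long key) {
        std::list<long>& chain = slots_[address(key)];
        std::list<long>::iterator it =
            std::find(chain.begin(), chain.end(), key);
        if (it == chain.end()) return false;
        chain.erase(it);                 // no probing or tombstones needed
        --counters_[address(key)];       // assumption: delete lowers the counter
        return true;
    }

private:
    std::vector<std::list<long> > slots_;   // one chain per address
    std::vector<int> counters_;             // one counter per address
};
```

With chaining, a key is always stored at its own hash address, so the "location in table" reported by r, i, and d is the same as the hash address.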
TESTING:
You are responsible for deciding what sort of tests need to be done on your
program to check it for correctness. I expect you to design your own test
inputs. The sample inputs illustrated in this document are not adequate for
thorough testing. Your test scripts must demonstrate adequate code
coverage and data coverage.
When I do my testing of your program, time will be of the essence because I
will be testing the programs of many students in your class and in other
classes. I will prepare one or more files containing inputs to test your
programs. I will run your program in a mode like this:
myNameForYourProgram < myinput > myoutput
I will then examine the contents of myoutput to see if it differs
significantly from the output file I get when I run my own solution program.
If I have to change your program or one of my input files
in order to get your program to
accept my input, you will lose points.
WHAT TO TURN IN:
You will be sending me two e-mail messages. Please follow these rules:
- Always send me e-mail as plain text in the main
message body. Never send me attachments.
- I will tell you what subject line to use with each
message, and I need you to use exactly the
subject lines I give you. (I get hundreds of e-mail
messages at a time and your subject line allows me to
sort messages.)
Here is the list of things you have to turn in:
- Send the following items to me by e-mail before midnight on the
first due date:
A level three version of the source code and a script showing your
test runs. Include all the source files (*.h files and *.cpp
files) that are required to compile the program into an executable
image -- everything I need so that I can compile and test your
program. Combine all the source files and the script into one
shell archive file
and e-mail me the archive file with the subject line:
CS3100,prog5.3.
- Send the following items to me by e-mail before midnight on the
second due date:
A final version of the source code and a script showing your test
runs. Include all the source files (*.h files and *.cpp files)
that are required to compile the program into an executable image --
everything I need so that I can compile and test your program.
Combine all the source files and the script into one
shell archive file
and e-mail me the archive file with the subject line:
CS3100,prog5.f.
Note that there are no spaces in the subject lines given. It is important
that you do not insert any spaces. My e-mail address is: john@ishi.csustan.edu.
DUE DATES:
For the due dates, see
the class schedule.