(Latest Revision: 04/10/2005)
Hashing Social Security Numbers
PRELIMINARIES:
Before working on this problem you need to read about hashing in the second
half of chapter 12.
As always, you need to be familiar with the directions and examples here:
http://www.cs.csustan.edu/~john/Classes/General_Info/progAsgRules/
THE ASSIGNMENT:
Write a program that implements a hash table of social security
numbers.
INPUT & OUTPUT:
The program reads from standard input. Input must have the format illustrated
by this example:
20
15
259-97-6223 808-93-0709 636-86-3660 399-88-2968
041-39-9350 397-67-7732 522-23-1899 762-23-8562
927-06-2355 087-75-3173 296-59-8912 052-61-3718
910-26-9707 080-42-6115 917-34-3630
s
r 917-34-3630
r 911-34-3623
i 911-43-6298
d 808-93-0709
q
The first number in the input (20 in the example), minTableSize, must be a
positive integer. After reading minTableSize the program must
calculate a table size and allocate a hash table of that size. The table size
must be the smallest prime number not less than minTableSize. (The table size
must equal minTableSize if minTableSize is prime. Otherwise the table size
must be the first prime larger than minTableSize.) In the example above the
table size is 23.
The second number in the input (15 in the example), numKeys, must be a
positive integer no greater than minTableSize. After reading numKeys,
the program must read that many keys (social security numbers) from the input.
The keys must be formatted with hyphens as shown. The program must insert
each key into the hash table. The input must be constructed so that no two of
the keys are equal. The number of keys in this section of the input must be
exactly equal to the value of numKeys. This is illustrated by the example
above, where numKeys = 15 and there are exactly 15 keys in the section of the
input following numKeys.
The program must calculate the hash address by treating the key as a 9-digit
number N and taking N % tableSize as the address. In the example above,
041-39-9350 hashes to the address
41399350 % 23 = 17
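(Note that C++ interprets an integer literal with a leading zero as an *octal*
constant, so the expression above must not be written in source code as
041399350 % 23. In fact that literal contains the digit 9, which is not a
valid octal digit, so it would not even compile.)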
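As an illustration of the hash calculation described above, here is a minimal sketch. The function names ssnToNumber and hashAddress are hypothetical, not names required by the assignment. Because the key is read as text and converted digit by digit in base 10, the octal issue never comes up.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Convert a key such as "041-39-9350" to the number 41399350 by keeping only
// the digit characters. A 9-digit value fits in a 32-bit long.
long ssnToNumber(const std::string& key) {
    long n = 0;
    for (char c : key) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            n = n * 10 + (c - '0');
        }
    }
    return n;
}

// Hash address = N % tableSize, as the assignment specifies.
int hashAddress(const std::string& key, int tableSize) {
    assert(tableSize > 0);
    return static_cast<int>(ssnToNumber(key) % tableSize);
}
```

With tableSize = 23, hashAddress("041-39-9350", 23) gives 17, which agrees with the example above.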
The program must use counters to keep track of how many keys have hashed to
each address in the hash table. The program must have a different counter for
each address. The addresses are 0 .. tableSize-1. Each time the program
encounters a new key, it must calculate the hash address using the hash
function and increment the counter corresponding to that address.
[Note: the collision-resolution policy has no effect on which counter the
program increments. The program always increments the counter that
corresponds to the (original) hash address, even if the collision-resolution
policy places the element in a location different from the original hash
address.]
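The note above can be illustrated with a small sketch that uses open addressing with linear probing. Linear probing is only one possible policy, chosen here for illustration. The names EMPTY and insertLinearProbing are hypothetical, keys are assumed to be already converted to non-negative numbers, and the choice to leave the counter alone when the table is full is an assumption.

```cpp
#include <cassert>
#include <vector>

const long EMPTY = -1;  // hypothetical marker for an unused slot

// Insert with linear probing. Returns the slot where the key was stored,
// or -1 if every slot is occupied.
int insertLinearProbing(std::vector<long>& slots, std::vector<int>& counters,
                        long key) {
    const int tableSize = static_cast<int>(slots.size());
    assert(tableSize > 0 && counters.size() == slots.size());
    const int home = static_cast<int>(key % tableSize);
    for (int step = 0; step < tableSize; ++step) {
        const int where = (home + step) % tableSize;
        if (slots[where] == EMPTY) {
            slots[where] = key;
            ++counters[home];  // always the home address, never `where`
            return where;
        }
    }
    return -1;  // table full: nothing stored, counter left unchanged
}
```

If two keys both hash to address 17 in a table of size 23, the second one is stored at location 18, but counters[17] becomes 2 and counters[18] stays 0.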
After processing the numKeys key values, the program must output a tabular
report showing how many of the counters stopped when equal to 0, equal to 1,
equal to 2, and so on, up to the largest value to which any of the counters
grew. The report must also compare this information about the counters with
values predicted by the "perfect randomizer" expression:
( Expected Frequency(k) ) = (e^(-α) * α^k) / (k!)
This expression, simply put, gives the expected frequency of locations in the
table whose counter stops at k.
The expression is valid if the hash function has no bias. The frequency is
expressed as a decimal fraction of the table size.
The quantity α is the load factor, α = (numKeys /
tableSize). Of course, (k!) means "k factorial".
Incidentally, if you include the math.h library, you will have an "exp"
function available to you for computing e^(-α), and also a "pow"
function for calculating powers of numbers (like α^k). You
can read about how to use these functions by using the "man" pages on your
Unix system. Just do "man exp" and "man pow."
To illustrate with a specific example, suppose that minTableSize is 1575 and
numKeys is 800. Then tableSize is 1579 and α = 800/1579 ≈ 0.5066. (I'm
using the symbol "≈" to mean 'approximately equal to'.) Using k = 3,
the formula gives
(e^(-α) * α^k) / (k!) ≈ (0.6025) * (0.1300) / 6,
which yields a value of 0.0131 (approximately).
The 0.0131 means that if we are hashing 800 keys into a table with 1579 slots,
the statistical model predicts there will be 3-way collisions at about 1.31%
of the locations in the table. In this example, the 1.31% works out to about
21 locations. That calculation goes as follows: (0.0131) * (tableSize) =
(0.0131) * (1579) ≈ 20.685, which rounds to 21.
In the case at hand, one of the rows in the report will show the number 21
alongside the actual number of 3-way collisions recorded when the program
hashed the 800 keys from the input. The report allows us to make a comparison
between the value predicted by the statistical model and the value that
actually turned up.
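The worked example above can be checked with a short sketch. The function names expectedFrequency and predictedSlots are hypothetical, and rounding to the nearest whole slot with lround is an assumption based on the "rounds to 21" step in the example.

```cpp
#include <cassert>
#include <cmath>

// Expected fraction of table slots whose counter stops at k:
// e^(-alpha) * alpha^k / k!
double expectedFrequency(double alpha, int k) {
    assert(k >= 0);
    double factorial = 1.0;
    for (int i = 2; i <= k; ++i) factorial *= i;
    return std::exp(-alpha) * std::pow(alpha, k) / factorial;
}

// Predicted number of slots with counter k, rounded to the nearest slot.
long predictedSlots(int numKeys, int tableSize, int k) {
    assert(tableSize > 0);
    const double alpha = static_cast<double>(numKeys) / tableSize;
    return std::lround(expectedFrequency(alpha, k) * tableSize);
}
```

For numKeys = 800 and tableSize = 1579, expectedFrequency(800.0 / 1579, 3) is about 0.0131 and predictedSlots(800, 1579, 3) is 21, matching the figures above.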
Lay out the report as a 4-column array: the first column is for the number k,
the second column for the predicted values, the third column for the actual
values, and the fourth column expressing the actual value as a percentage of
the predicted value.
The table corresponding to the example of 800 keys hashed into a
table with 1579 slots might look like this:
slot count   predicted   actual   act/pred
         0         951      952       100%
         1         482      486       101%
         2         122      111        91%
         3          21       29       141%
         4           3        0         0%
         5           0        1       378%
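One way to produce a report like the one above is sketched here. The names tallyCounters and printReport are hypothetical, and holding the counters in a std::vector is an assumption made for brevity. The 378% in the last row of the sample suggests the percentage is taken against the unrounded prediction; the sketch follows that reading, which is also an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// tally[k] = number of table addresses whose counter stopped at exactly k.
std::vector<int> tallyCounters(const std::vector<int>& counters) {
    assert(!counters.empty());
    const int largest = *std::max_element(counters.begin(), counters.end());
    std::vector<int> tally(largest + 1, 0);
    for (int c : counters) ++tally[c];
    return tally;
}

// Print one row per k, from 0 up to the largest counter value seen.
void printReport(const std::vector<int>& counters, int numKeys) {
    const std::vector<int> tally = tallyCounters(counters);
    const double tableSize = static_cast<double>(counters.size());
    const double alpha = numKeys / tableSize;
    std::printf("%10s %10s %10s %10s\n",
                "slot count", "predicted", "actual", "act/pred");
    double frequency = std::exp(-alpha);  // expected frequency for k = 0
    for (std::size_t k = 0; k < tally.size(); ++k) {
        const double predicted = frequency * tableSize;
        std::printf("%10zu %10.0f %10d %9.0f%%\n",
                    k, predicted, tally[k], 100.0 * tally[k] / predicted);
        frequency *= alpha / (k + 1);  // step to the frequency for k + 1
    }
}
```

Each step multiplies the previous frequency by α/(k+1), which gives the same values as the formula without computing a factorial or a power for every row.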
After printing the tabular report, the program must process commands coming
from the input. The possible commands are s, r, i, d, and q.
The s command stands for show. In response to this command,
the program must echo the command and print a report representing what is in
the hash table. The report must have a 'two-column' format: the first column
is for table indices, and the second column is for key(s) stored in the
location given by the index. The program must write social security numbers
in "standard form" -- that is with dashes in the right places.
A couple of things to keep in mind here: First, in this case your program must
report the locations where the keys are actually stored. If the program uses
open addressing, the address where a key is stored is not necessarily the
address computed by the hash function. Second, if the program uses chaining
for collision resolution there will typically be more than one key in some
slots.
The s command is not meant to be used for large tables, so just
write your program to use one line for each occupied table
slot. However, write the code so it entirely skips any empty table slot.
The program must write nothing, not even a newline, corresponding
to an empty slot in the table.
For example: when there are 80 keys and a minimum table size of 100, the
first few lines of the response to the s command might look like this:
===========================================================================
s
SHOW COMMAND
INDEX VALUES
0 254-40-5163
5 576-29-0451
9 957-32-7903 861-27-9125 031-39-2930
13 126-56-3315 059-97-5934 202-75-3271 114-71-9146
14 261-64-6675
15 461-60-6678
16 962-95-6236 435-92-9550
17 612-12-0516 579-19-6940
This particular display was written by a version of the program that uses
chaining as the collision-resolution policy. There is more than one key
stored in some of the table slots. Therefore the 'column' for VALUES contains
several keys in some rows. Note that there are "lines missing" in the display
where the program has refrained from printing output corresponding to empty
table slots. This is a desirable thing: the output is more compact this way.
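A display like the one above could be produced along these lines. This is only a sketch: the ChainedTable type (one std::list of keys per slot) and the function name showTable are hypothetical, and echoing the command and printing the separator line are left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical representation: one chain of keys per table slot.
typedef std::vector<std::list<std::string> > ChainedTable;

// Write one line per occupied slot and nothing at all for an empty slot.
void showTable(const ChainedTable& table, std::ostream& out) {
    out << "SHOW COMMAND\n" << "INDEX VALUES\n";
    for (std::size_t i = 0; i < table.size(); ++i) {
        if (table[i].empty()) continue;  // not even a newline
        out << i;
        for (const std::string& key : table[i]) out << ' ' << key;
        out << '\n';
    }
}
```

Passing std::cout as the second argument writes the display to standard output. Taking the stream as a parameter also makes the function easy to test with a string stream.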
The r command stands for retrieve. This command requires a
parameter -- a single social security number (a key) to appear after the
r. The program must echo the command and if possible find the key in the
table. If the key is found, the program must then report the address to which
the key hashes, the current value of the counter for that address, and the
address at which the key is actually stored. Each of these three pieces of
information must be appropriately labelled. For example, consider the output
appropriate to the commands:
r 598-21-8634
r 564-64-7123
The first key above is in the table and the second is
not. The following would be appropriate output:
===========================================================================
r 598-21-8634
RETRIEVE COMMAND
Key           Hash Address   Counter Value   Location in Table
598-21-8634   78             2               78
===========================================================================
r 564-64-7123
RETRIEVE COMMAND
Key           Hash Address   Counter Value   Location in Table
564-64-7123
NOT FOUND
===========================================================================
Note the use of lines of equal-signs to separate the output corresponding to
different commands. Note also the error message. The program is required to
write such a message when the key does not exist in the table.
The i command stands for insert. Like the r
command, this command requires a parameter -- a key value coming next in
the input. The program will echo the command and insert the key into the
table (unless the table already contains such an element or the table is
full). If insertion fails the program will print an appropriate message. If
insertion succeeds the program will print the same three pieces of information
as the r command. Of course the counter value will be the new
value: one greater than the old value, because of the insertion.
The d command stands for delete. Like r and
i, this command requires a parameter. In this case the program will
echo the command and delete the element if it is present. Like r and
i, it will print the hash address, new counter value, and actual
location. If the operation fails because the key was not found (the only
valid reason) then the program will print an appropriate message.
The q command stands for quit. It will occur exactly once, at
the end of the input.
There can be any number of s, r, i, and d commands after the initial input.
They can appear mixed in any order.
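The command-processing loop described above might be organized as in the following sketch. The function name processCommands is hypothetical, the separator width is an arbitrary choice, and skipping unrecognized tokens is an assumption, since the assignment does not say how to treat malformed commands. The calls to the actual table operations are left as a comment.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Reads commands until "q" (or end of input) and echoes each one after a
// separator line. Returns the number of commands echoed.
int processCommands(std::istream& in, std::ostream& out) {
    const std::string separator(75, '=');  // width is an arbitrary choice
    int handled = 0;
    std::string command;
    while (in >> command && command != "q") {
        std::string key;
        if (command == "r" || command == "i" || command == "d") {
            if (!(in >> key)) break;       // parameter missing: stop reading
        } else if (command != "s") {
            continue;                      // assumption: skip unknown tokens
        }
        out << separator << '\n' << command;
        if (!key.empty()) out << ' ' << key;
        out << '\n';
        // The real program would call its show, retrieve, insert, or delete
        // routine here, passing `key` where one was read.
        ++handled;
    }
    return handled;
}
```

Calling processCommands(std::cin, std::cout) after the tabular report has been printed would handle the rest of the input.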
MAKEFILE GOTCHA, or (NOT) USING A MAKEFILE:
Normally it is a good idea to use a makefile. I had a big problem with my
makefile for this program. I don't know, but maybe the problem is due to a
bug in the compiler that 'kicks' in when math.h or time.h is included. What I
experienced is that the program would compile but then not write anything -
nothing at all. The same program worked fine and produced lots of output when
I compiled using the command g++ *.cpp. So please watch
out for this problem.
(Update: The 'problem' referred to above may have just stemmed from some
conflict with /usr/bin/hash on the Ultra's. I tested a makefile in which the
name of the program was changed from 'hash' to 'hasher' and that makefile
seems to be working fine. )
PSEUDO-RANDOM NUMBER GENERATION:
In order to test your program, you will need a small
program that can generate sets of pseudo-random social security
numbers. For that, you can use the little program called
makeInFile.cpp
included in the directory with this assignment.
Unless the set of keys you generate is very large indeed, it will be
almost certain that all the keys are distinct (different from one another).
However, this is not a perfect certainty.
You can use the unix sort and uniq commands to check your key
sets for duplicates. If you find your set has a duplicate, you
can repair the set by replacing the duplicate with a number you
know is not already in the set. If you read the man pages on sort
and uniq, I think you will probably know what to do, but speak to
me if you have questions.
CALCULATION OF PRIMES:
In this program you need to have a function that takes a
number m and returns the smallest prime number p >= m. We can
discuss this in class, and I can give you some simple pseudo-code.
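In the meantime, here is one simple way to do it, using trial division. This is a sketch only and is not necessarily the pseudo-code we will discuss in class. The function names isPrime and nextPrime are hypothetical.

```cpp
#include <cassert>

// Trial division: m is prime if m >= 2 and no d with d*d <= m divides it.
bool isPrime(int m) {
    if (m < 2) return false;
    for (int d = 2; static_cast<long long>(d) * d <= m; ++d) {
        if (m % d == 0) return false;
    }
    return true;
}

// Smallest prime p with p >= m.
int nextPrime(int m) {
    int p = (m < 2) ? 2 : m;
    while (!isPrime(p)) ++p;
    assert(isPrime(p) && p >= m);  // the contract this function promises
    return p;
}
```

For the two examples in this document, nextPrime(20) is 23 and nextPrime(1575) is 1579. Trial division is slow for very large numbers but is more than fast enough for table sizes of this kind.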
DYNAMIC ALLOCATION OF ARRAYS:
This program must not allocate the hash table statically. It must read the
value of minTableSize, compute the size of the hash table, and then
dynamically allocate the correct amount of space for the table. See pages
171-172 in your text book for information on how to perform dynamic allocation
of an array.
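As a reminder of the general idea, here is a minimal sketch of allocating an array whose size is known only at run time. It is not taken from the text book. The function name allocateCounters is hypothetical, and it allocates the array of counters rather than the table itself, but the same new[] / delete[] pattern applies to the table.

```cpp
#include <cassert>
#include <cstddef>

// Allocate one counter per address, all starting at zero. The caller owns
// the array and must release it with delete[] when finished.
int* allocateCounters(std::size_t tableSize) {
    assert(tableSize > 0);
    return new int[tableSize]();  // "()" value-initializes each element to 0
}
```

A typical use is int* counters = allocateCounters(tableSize); followed, at the end of the program, by delete[] counters;.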
CHOICE OF IMPLEMENTATION:
In the 'story' of this program, the hash table is a prominent 'noun'. It
would be a good idea to implement the hash table as a class. You don't
necessarily have to do that, but you are required to use good design
principles - modularity, small functions, information hiding, abstraction, top
down structure, and so forth.
If you want to save time writing this program, consider using chaining as your
collision resolution policy.
Unless you get a lot of help by re-using classes from some source, I think
chaining will be the easiest way to implement the collision resolution.
Deletion is a little complex to implement when using open-addressing, but it
is straightforward when using chaining.
You can use the linked list code to implement chains. I have included linked
list code that will work well in the directory containing this assignment.
(See ListP.h and ListP.cpp)
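To show why deletion is straightforward with chaining, here is a minimal sketch of a chained table. It uses std::list as a stand-in for the supplied ListP code, whose interface is not reproduced in this document. The class name ChainedHashTable and its member names are hypothetical, keys are assumed to be already converted to non-negative numbers, and the per-address counters are left out to keep the sketch short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

class ChainedHashTable {
public:
    explicit ChainedHashTable(std::size_t tableSize) : chains_(tableSize) {
        assert(tableSize > 0);
    }

    bool contains(long key) const {
        const std::list<long>& chain = chains_[home(key)];
        return std::find(chain.begin(), chain.end(), key) != chain.end();
    }

    // Returns false if the key is already in the table.
    bool insert(long key) {
        if (contains(key)) return false;
        chains_[home(key)].push_back(key);
        return true;
    }

    // Returns false if the key was not found. Only the key's own chain is
    // touched, so no other key ever has to move.
    bool remove(long key) {
        std::list<long>& chain = chains_[home(key)];
        std::list<long>::iterator it =
            std::find(chain.begin(), chain.end(), key);
        if (it == chain.end()) return false;
        chain.erase(it);
        return true;
    }

private:
    std::size_t home(long key) const {
        return static_cast<std::size_t>(key) % chains_.size();
    }
    std::vector<std::list<long> > chains_;
};
```

Because a key is always stored in the chain at its own hash address, removing it cannot break the search path to any other key. With open addressing, by contrast, a deleted slot usually has to be specially marked so that later searches keep probing past it.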
TESTING:
You are responsible for deciding what sort of tests need to be done on your
program to check it for correctness. I expect you to design your own test
inputs. The sample inputs illustrated in this document are not adequate
for thorough testing. Your test scripts must demonstrate adequate code
coverage and data coverage.
When I do my testing of your program, time will be of the essence because I
will be testing the programs of many students. I will prepare one or more
files containing inputs to test your programs. I will run your program in a
mode like this:
myNameForYourProgram < myinput > myoutput
I will then examine the contents of myoutput to see if it differs
significantly from the output file I get when I run my own solution program.
If I have to change your program or one of my input files in order to get your
program to accept my input, you will lose significant credit.
WHAT TO TURN IN:
You will turn in two "phases" of this assignment:
- a level 3 version, and
- a final version.
For each phase of the assignment, you will turn in a printer output (hardcopy)
and you will send me an e-mail message. Please follow these rules:
- Always send me e-mail as plain text in the main message body.
Never send me attachments.
- Always use the exact subject line I specify for each
message. (I often get hundreds of e-mail messages in a week. The
subject line allows me to find, filter and sort messages.) You will lose
a significant number of points on the assignment if you use the wrong
subject line.
- Be very careful when you send the e-mail. You may use the
instructions in your
Hello World! lab exercise
for guidance. Of course, you will need to make the obvious changes to
those directions -- you have to use the correct subject line and
filename.
- Always send yourself a copy of each e-mail message you send to me,
and check immediately to see if you receive the message intact.
You are responsible for sending
e-mail correctly.
Here is the list of things you have to turn in:
- At the start of class on the
first due date,
place the following item on the "counter" in front of me:
- a hardcopy of your level 3 (or
greater) program. (All the source code except the list code.) Make
sure all the code is properly formatted and that it all shows on the
paper.
- Using the subject line: CS3100,prog5.3 send the following item
to me by e-mail before midnight on the
first due date:
One shell archive file (only one) containing the following four items:
- All source files for your level 3
program (all *.cpp files and *.h files -- list code too)
- Your test script showing adequate testing of your level 3 program.
- A file named 'README' containing the compilation command one should
use to compile your program.
- A copy of your 'makefile' if you used one (see the makefile section above
for why you might choose not to).
- At the start of class on the
second due date,
place the following item on the "counter" in front of me:
- a hardcopy of your final level
program. (All the source code except the list code.) Make
sure all the code is properly formatted and that it all shows on the
paper.
- Using the subject line: CS3100,prog5.f send the following item
to me by e-mail before midnight on the
second due date:
One shell archive file (only one) containing the following four items:
- All source files for your final level
program (all *.cpp files and *.h files -- list code too)
- Your test script showing adequate testing of your final level program.
- A file named 'README' containing the compilation command one should
use to compile your program.
- A copy of your 'makefile' if you used one (see the makefile section above
for why you might choose not to).
Note that there are no spaces in the subject lines given. It is important
that you do not insert any spaces. My e-mail address is: john@ishi.csustan.edu.
DUE DATES:
For the due dates, see
the class schedule.