Common Methods of Hashing
Note: we assume that keys are always first converted to integers.
A good hashing function
- is fast to compute
- produces a near-uniform distribution of the keys over the set of table
index values. (This is a case where creamy peanut butter is superior to
chunky.)
The Collision Problem: A collision occurs if two different
keys hash to the same address in the table, that is,
H(key1) = H(key2) but key1 != key2.
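
A collision is easy to produce with a toy example. The sketch below
assumes a table of 10 slots and uses the remainder on division by the
table size as the hash; the name h, the table size, and the two keys are
chosen only for illustration.

```python
def h(key, m=10):
    """Toy hash (an assumed choice): remainder of key on division by m."""
    return key % m

key1, key2 = 27, 47

# Two different keys that land on the same table address collide.
collides = (key1 != key2) and (h(key1) == h(key2))
print(h(key1), h(key2), collides)  # 7 7 True
```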
The important operations are search, insert, and delete. The
goal is to get near O(1) performance on all of these operations.
Compared to a binary search tree implementation, hash tables are not as
versatile for implementing a dictionary. For example, hashing does not
support an efficient traversal in key order. Also, hash tables are subject
to overloading: if too many keys are added, performance degrades severely.
Many common implementations rely on static memory allocation, so the table
cannot simply grow as keys are added.
The fundamental property of hashing is near "associative access"
-- Address(object) = H(key value). Hash function H maps the key
space to the table address space.
It is possible to create a "perfect" (no collisions) hash function if all keys
are known in advance. This guarantees O(1) search. Problems with that: it is
very time-consuming to create the hash function, and there are applications
where it is impossible to know all the keys in advance.
Normally a compromise method is used involving a hash function
that generates a table address and possibly some additional
searching.
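
One way such a compromise could look is sketched below. It assumes
separate chaining (each table slot holds a short list that is searched
sequentially), which is only one of several possible strategies and is
not named in these notes. The class name ChainedTable, the table size 7,
and the use of key mod m as the hash function are all illustrative
assumptions.

```python
class ChainedTable:
    """Minimal sketch: hash to a slot, then search a short list in that slot."""

    def __init__(self, m=7):
        self.m = m                            # table size (assumed)
        self.slots = [[] for _ in range(m)]   # one list ("chain") per address

    def _address(self, key):
        return key % self.m                   # hash function: table address

    def insert(self, key, value):
        chain = self.slots[self._address(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                      # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def search(self, key):
        # The "additional searching": scan only the chain at this address.
        for k, v in self.slots[self._address(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self.slots[self._address(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False
```

Keys 3 and 10 both hash to address 3 in a table of size 7, so they share
a chain; each is still found by scanning that one short list.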
Digit Selection Hash:
Suppose the key is a social security number such as 987-65-4321.
The hash function could choose the last three digits, 321, or use
the 3rd, 5th, and 8th digits: 752.
The digits selected must be 'random' - no patterns that would
increase the number of collisions.
It is often worthwhile to analyze the key population in advance.
Choose digits yielding the most uniform distribution. (Learn
about statistics.) Common sense usually suffices. Realize that
correlations among digits are possible. For example, if we look
at the first three digits of the social security numbers of
people who live in Stanislaus County, we will find many instances
of a certain few patterns, and low concentrations of most
patterns. Thus hashing by the first three digits of the social
security number is not a good idea. On the other hand, selecting
the last three or four digits of the social security number tends
to be a very good hash function.
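
Digit selection on the example key can be sketched as follows. The
function name digit_selection_hash and its positions parameter (1-based
digit positions, counted after removing the dashes) are illustrative
assumptions.

```python
def digit_selection_hash(ssn, positions):
    """Build a hash value from the digits at the given 1-based positions."""
    digits = [c for c in ssn if c.isdigit()]      # drop the dashes
    return int("".join(digits[p - 1] for p in positions))

ssn = "987-65-4321"
print(digit_selection_hash(ssn, [7, 8, 9]))  # last three digits: 321
print(digit_selection_hash(ssn, [3, 5, 8]))  # 3rd, 5th, 8th digits: 752
```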
Division Hash:
H(key) = key mod m (the remainder on division by m). This yields
a hash value between 0 and m-1 (inclusive).
A prime table size is often used to avoid placing keys with
common factors in a restricted number of table locations. For
example if by chance all the keys and the table size too were
multiples of 5 then multiples of 5 would be the only table
locations keys would hash to. (Suppose tablesize is 5t and key
value is 5k. When we express 5k = q(5t)+r where 0<=r<=5t-1 is a
remainder, note that r = 5(k-qt) is a multiple of 5.)
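
The effect of a shared factor can be checked with a small experiment. In
this sketch the keys are the multiples of 5 from 5 to 200, and the table
sizes 20 (a multiple of 5) and 19 (a prime) are chosen only for
illustration.

```python
def division_hash(key, m):
    """H(key) = key mod m, a value between 0 and m-1 inclusive."""
    return key % m

keys = [5 * k for k in range(1, 41)]    # 5, 10, ..., 200

# Table size 20 shares the factor 5 with every key.
used_20 = {division_hash(key, 20) for key in keys}
# Table size 19 is prime and shares no factor with the keys.
used_19 = {division_hash(key, 19) for key in keys}

print(sorted(used_20))   # [0, 5, 10, 15]
print(len(used_19))      # 19
```

With table size 20 only 4 of the 20 locations are ever used, while with
the prime table size 19 every location is reached.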
Multiplicative Hash: Say the key is 1234567. The hash
function could calculate 1234567^2 = 1524155677489.
Then the hash function could finish up by doing digit selection
on 1524155677489. Usually it is best to do selection of middle
digits because they are more "mixed up". We could get the middle
five digits of this 13-digit number by dropping the last four
digits and keeping the next five, using integer division:
(1524155677489/10,000)%100,000 = 15567. Thus our hash function
could be H(key) = (key^2/10000)%100000.
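
The calculation can be sketched as below. The function name
mid_square_hash is an illustrative assumption. Dividing by 10,000 (with
integer division) discards the last four digits, and reducing mod
100,000 keeps the next five, which are the middle five digits when the
square has 13 digits.

```python
def mid_square_hash(key):
    """Square the key, drop the low 4 digits, keep the next 5 digits."""
    square = key * key
    return (square // 10_000) % 100_000

print(1234567 * 1234567)         # 1524155677489
print(mid_square_hash(1234567))  # 15567
```

Hash values from this function lie between 0 and 99999, so it fits a
table with 100,000 locations.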
Folding Hash: This involves combining portions of a key to
get a smaller result. For example H(1234567893572) =
123+456+789+357+2. Folding is often used in combination with
other methods, for example folding followed by division hash.
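
Folding on the example key can be sketched as below. The function names,
the group size of three digits, and the table size 101 used in the
combined version are illustrative assumptions.

```python
def folding_hash(key, group=3):
    """Split the decimal digits into groups from the left and add them up."""
    digits = str(key)
    return sum(int(digits[i:i + group]) for i in range(0, len(digits), group))

def fold_then_divide(key, m=101):
    """Folding followed by division hash, to bring the sum into 0..m-1."""
    return folding_hash(key) % m

print(folding_hash(1234567893572))      # 123+456+789+357+2 = 1727
print(fold_then_divide(1234567893572))  # 1727 mod 101 = 10
```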