Common Methods of Hashing
Note: we assume that keys are always first converted to integers.
A good hashing function
- is fast to compute
- produces a near-uniform distribution of the keys over the set of table
index values. (This is a case where creamy peanut butter is superior to
chunky.)
The Collision Problem: A collision occurs if two different keys hash to
the same address in the table, that is, H(key1) = H(key2) but key1 != key2.
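As a small illustration of the definition above, the sketch below groups a handful of keys by hash value and reports every table address that receives more than one key. The names find_collisions and toy_hash, the table size of 10, and the sample keys are all assumptions made up for this example.

```python
from collections import defaultdict

def find_collisions(keys, h):
    """Group keys by hash value; keep only addresses hit by 2+ keys."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[h(key)].append(key)
    return {addr: ks for addr, ks in buckets.items() if len(ks) > 1}

def toy_hash(key):
    """Hypothetical hash for a 10-slot table: the last decimal digit."""
    return key % 10

# 23 and 53 both hash to 3; 17 and 87 both hash to 7; 40 is alone at 0.
collisions = find_collisions([23, 53, 17, 40, 87], toy_hash)
print(collisions)
```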
The important operations are search, insert, and delete. The object is to get
near O(1) performance on all these operations.
Compared to a binary search tree implementation, hash tables are not as
versatile for implementing a dictionary. For example, hashing does not support
an efficient traversal in key order. Also, hash tables are subject to
overloading: if too many keys are added, performance degrades severely. Many
common implementations rely on static memory allocation.
The fundamental property of hashing is near "associative access" --
Address(object) = H(key value). Hash function H maps the key space to the
table address space.
It is possible to create a "perfect" (no collisions) hash function if all keys
are known in advance. This guarantees O(1) search. Problems with that: it is
very time-consuming to create the hash function, and there are applications
where it is impossible to know all the keys in advance.
Normally a compromise method is used involving a hash function that generates
a table address and possibly some additional searching.
Digit Selection Hash:
Suppose the key is a social security number such as 987-65-4321. The hash
function could choose the last three digits 321 or use the 3rd, 5th, and 8th
digits: 752
The digits selected must be 'random' - no patterns that would increase the
number of collisions.
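The social security number example above could be sketched as follows. This is only one possible way to write it: the function name digit_selection_hash, the 1-based positions argument, and the width parameter (used to restore leading zeros after the key has been converted to an integer) are assumptions introduced here.

```python
def digit_selection_hash(key: int, positions, width: int = 9) -> int:
    """Build a hash from the decimal digits of key at the given 1-based positions."""
    digits = str(key).zfill(width)          # pad so position 1 is always the first digit
    return int("".join(digits[p - 1] for p in positions))

ssn = 987654321                              # 987-65-4321 as an integer

last_three = digit_selection_hash(ssn, (7, 8, 9))   # digits 3, 2, 1 -> 321
picked = digit_selection_hash(ssn, (3, 5, 8))       # digits 7, 5, 2 -> 752
print(last_three, picked)
```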
It is often worthwhile to analyze the key population in advance. Choose
digits yielding the most uniform distribution. (Learn about statistics.)
Common sense usually suffices. Realize that correlations among digits are
possible. For example, if we look at the first three digits of the social
security numbers of people who live in Stanislaus County, we will find many
instances of a certain few patterns, and low concentrations of most patterns -
thus hashing by the first three digits of the social security number is not a
good idea. On the other hand, the last three or four digits of the social
security number tend to make a very good hash function.
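One rough way to analyze a key population is to count how many keys land on each hash value. The sketch below does this for synthetic keys only: the three prefixes are made up, the remaining six digits are pseudo-random, and the helper names (first_three, last_three, load_summary) are assumptions for illustration. It is not real social security data.

```python
import random
from collections import Counter

def first_three(key: int) -> int:
    """First three digits of a 9-digit key."""
    return key // 10**6

def last_three(key: int) -> int:
    """Last three digits of a key."""
    return key % 1000

def load_summary(keys, h):
    """Return (number of distinct hash values used, size of the fullest bucket)."""
    counts = Counter(h(k) for k in keys)
    return len(counts), max(counts.values())

rng = random.Random(0)                       # fixed seed so runs are repeatable
made_up_prefixes = [111, 222, 333]           # hypothetical regional prefixes
keys = [made_up_prefixes[i % 3] * 10**6 + rng.randrange(10**6)
        for i in range(3000)]

first_summary = load_summary(keys, first_three)
last_summary = load_summary(keys, last_three)
print("first three digits:", first_summary)
print("last three digits: ", last_summary)
```

With keys like these, hashing on the first three digits crowds every key into three buckets, while the last three digits spread the same keys over most of the 1000 possible values.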
Division Hash:
H(key) = key mod m (the remainder on division by m). This yields a hash value
between 0 and m-1 (inclusive).
A prime table size is often used to avoid placing keys with common factors in
a restricted number of table locations. For example if by chance all the keys
and the table size too were multiples of 5 then multiples of 5 would be the
only table locations keys would hash to. (Suppose tablesize is 5t and key
value is 5k. When we express 5k = (5t)q+r where 0<=r<=5t-1 is a remainder,
note that r = 5(k-tq) is a multiple of 5.)
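The effect of a common factor can be checked directly. In this sketch the keys are the multiples of 5 from 5 to 500, and the table sizes 20 and 19 are arbitrary choices for the demonstration; division_hash is simply the formula above given a name.

```python
def division_hash(key: int, m: int) -> int:
    """H(key) = key mod m, a value between 0 and m-1 inclusive."""
    return key % m

keys = [5 * k for k in range(1, 101)]        # 5, 10, ..., 500

# Table size 20 shares the factor 5 with every key.
used_composite = {division_hash(k, 20) for k in keys}

# Table size 19 is prime and shares no factor with the keys.
used_prime = {division_hash(k, 19) for k in keys}

print(sorted(used_composite))                # only multiples of 5 appear
print(len(used_prime), "of 19 locations used")
```

With table size 20 only 4 of the 20 locations are ever used, whereas with table size 19 the same keys reach every location.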
Multiplicative Hash: Say the key is 1234567. The hash function could
calculate (1234567)^2 = 1524155677489. Then the hash function could
finish up by doing digit selection on 1524155677489. Usually it is best to do
selection of middle digits because they are more "mixed up". We could get the
middle five digits of this 13-digit number, 15567, with the formula:
(1524155677489/10,000)%100,000, where / is integer division. Thus our hash
function could be H(key) = (key^2/10000)%100000.
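The square-then-select-middle-digits formula (sometimes called the mid-square method) translates almost directly into code. The function name mid_square_hash is an assumption. Note that the constants 10000 and 100000 pick out the exact middle five digits only when the square has 13 digits, as it does for the example key.

```python
def mid_square_hash(key: int) -> int:
    """Square the key, drop the last 4 digits, keep the next 5."""
    squared = key * key
    return (squared // 10_000) % 100_000     # // is integer division

square = 1234567 * 1234567                   # 1524155677489
hashed = mid_square_hash(1234567)            # middle five digits: 15567
print(square, hashed)
```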
Folding Hash: This involves combining portions of a key to get a
smaller result. For example, H(1234567893572) = 123+456+789+357+2 = 1727.
Folding is
often used in combination with other methods, for example folding followed by
division hash.
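A minimal sketch of folding, assuming the key is cut into groups of three decimal digits from the left and the groups are added. The function name folding_hash, the chunk parameter, and the table size 101 used for the follow-up division step are all assumptions for illustration.

```python
def folding_hash(key: int, chunk: int = 3) -> int:
    """Split the decimal digits of key into groups of `chunk` and add the groups."""
    digits = str(key)
    return sum(int(digits[i:i + chunk]) for i in range(0, len(digits), chunk))

folded = folding_hash(1234567893572)         # 123 + 456 + 789 + 357 + 2 = 1727

# Folding followed by division hash, with a hypothetical table size of 101.
folded_then_divided = folded % 101           # 1727 mod 101 = 10
print(folded, folded_then_divided)
```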