Common Methods of Hashing

Note: we assume that keys are always first converted to integers.

A good hashing function
  1. is fast to compute, and
  2. produces a near-uniform distribution of the keys over the set of table index values. (This is a case where creamy peanut butter is superior to chunky.)

The Collision Problem: A collision occurs if two different keys hash to the same address in the table, that is, H(key1) = H(key2) but key1 != key2.

The important operations are search, insert, and delete. The goal is to get near-O(1) performance on all of these operations.

Compared to a binary search tree implementation, hash tables are not as versatile for implementing a dictionary. For example, hashing does not support an efficient traversal in key order. Also, hash tables are subject to overloading: if too many keys are added, performance degrades severely. Many common implementations rely on static memory allocation.

The fundamental property of hashing is near "associative access" -- Address(object) = H(key value). Hash function H maps the key space to the table address space.

It is possible to create a "perfect" (no collisions) hash function if all keys are known in advance. This guarantees O(1) search. There are two problems with this approach: it is very time-consuming to create the hash function, and there are applications where it is impossible to know all the keys in advance.

Normally a compromise method is used involving a hash function that generates a table address and possibly some additional searching.
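One way to picture "a table address and possibly some additional searching" is the sketch below. It is only an illustration under stated assumptions: the notes do not name a particular collision strategy, so linear probing (try the next slot, wrapping around) is swapped in here, and the class name ProbingTable, the table size 11, and the sample keys are all hypothetical.

```python
class ProbingTable:
    """Fixed-size table: hash to an address, then search forward if needed."""

    def __init__(self, m: int = 11):
        self.m = m
        self.slots = [None] * m  # statically sized, so the table can fill up

    def insert(self, key: int) -> int:
        """Store key and return the slot it landed in."""
        for step in range(self.m):
            i = (key % self.m + step) % self.m  # home address, then neighbours
            if self.slots[i] is None or self.slots[i] == key:
                self.slots[i] = key
                return i
        raise OverflowError("table is full")

    def search(self, key: int):
        """Return the slot holding key, or None if it is absent."""
        for step in range(self.m):
            i = (key % self.m + step) % self.m
            if self.slots[i] is None:  # an empty slot ends the search
                return None
            if self.slots[i] == key:
                return i
        return None


table = ProbingTable(11)
print(table.insert(15))  # 15 mod 11 = 4, slot 4 is free
print(table.insert(26))  # 26 mod 11 = 4 as well: a collision, so slot 5
print(table.search(26))  # found after one extra probe
print(table.search(37))  # also hashes to 4, but is not in the table
```

Deletion is left out on purpose: with probing, simply emptying a slot would cut off searches for keys stored past it, so a real table needs extra bookkeeping for that case.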

Digit Selection Hash:

Suppose the key is a social security number such as 987-65-4321. The hash function could choose the last three digits, 321, or use the 3rd, 5th, and 8th digits, 752.

The digits selected must be 'random' - no patterns that would increase the number of collisions.

It is often worthwhile to analyze the key population in advance. Choose digits yielding the most uniform distribution. (Learn about statistics.) Common sense usually suffices. Realize that correlations among digits are possible. For example, if we look at the first three digits of the social security numbers of people who live in Stanislaus County, we will find many instances of a certain few patterns, and low concentrations of most patterns. Thus hashing by the first three digits of the social security number is not a good idea. On the other hand, the last three or four digits of the social security number tend to make a very good hash function.
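The social security number example can be sketched in a few lines. This is a minimal illustration, not a recommended production hash; the function name digit_selection_hash and its parameters are made up for the example, and the key is assumed to be a 9-digit number with the dashes already removed.

```python
def digit_selection_hash(key: int, positions=(3, 5, 8), width: int = 9) -> int:
    """Build a table address from the digits at the given 1-based positions."""
    digits = str(key).zfill(width)  # pad so leading zeros keep positions fixed
    return int("".join(digits[p - 1] for p in positions))


ssn = 987654321  # 987-65-4321 with the dashes removed
print(digit_selection_hash(ssn))             # 3rd, 5th, 8th digits -> 752
print(digit_selection_hash(ssn, (7, 8, 9)))  # last three digits -> 321
print(ssn % 1000)                            # same last three digits -> 321
```

Note that selecting the last three digits is the same as taking the key mod 1000, so that particular choice is also a division hash with a table of size 1000.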

Division Hash:

H(key) = key mod m (the remainder on division by m). This yields a hash value between 0 and m-1 (inclusive).

A prime table size is often used to avoid placing keys with common factors in a restricted number of table locations. For example, if by chance all the keys and the table size were multiples of 5, then multiples of 5 would be the only table locations keys would hash to. (Suppose the table size is 5t and the key value is 5k. When we express 5k = (5t)q + r, where 0 <= r <= 5t-1 is the remainder, note that r = 5k - 5tq = 5(k-tq) is a multiple of 5.)
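The effect of a shared factor is easy to check by experiment. The sketch below is only a demonstration under assumed values: the keys (the multiples of 5 from 5 to 500) and the table sizes 20 and 19 are invented for the example.

```python
def division_hash(key: int, m: int) -> int:
    """H(key) = key mod m, a value between 0 and m-1 inclusive."""
    return key % m


keys = [5 * k for k in range(1, 101)]  # hypothetical keys, all multiples of 5

used_with_20 = {division_hash(key, 20) for key in keys}  # 20 shares the factor 5
used_with_19 = {division_hash(key, 19) for key in keys}  # 19 is prime

print(sorted(used_with_20))  # [0, 5, 10, 15]
print(len(used_with_19))     # 19
```

With table size 20, only 4 of the 20 locations are ever used, so collisions are five times as frequent as they need to be. With the prime size 19, every location receives keys.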

Multiplicative Hash: Say the key is 1234567. The hash function could calculate (1234567)^2 = 1524155677489. Then the hash function could finish up by doing digit selection on 1524155677489. Usually it is best to select the middle digits because they are more "mixed up". We could get the middle five digits of this 13-digit number with the formula (1524155677489/10000)%100000 = 15567, where / is integer division. Thus our hash function could be H(key) = (key^2/10000)%100000.
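The arithmetic above can be checked with a short sketch. The function name mid_square_hash is made up for the example, and the constants 10000 and 100000 are taken as fixed, which picks out the true middle digits only when the square has 13 digits, as it does for 7-digit keys near this one.

```python
def mid_square_hash(key: int) -> int:
    """Square the key, drop the last four digits, keep the next five."""
    squared = key * key
    return (squared // 10_000) % 100_000  # // is integer division


print(1234567 * 1234567)         # 1524155677489
print(mid_square_hash(1234567))  # 15567
```

In Python the integer-division operator // matters here: a plain / would produce a floating-point number and could lose low-order digits for large squares.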

Folding Hash: This involves combining portions of a key to get a smaller result. For example, H(1234567893572) = 123+456+789+357+2 = 1727. Folding is often used in combination with other methods, for example folding followed by a division hash.
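A possible sketch of folding, and of folding followed by a division hash, is shown below. The function names, the chunk width of three digits, and the table size 101 are assumptions chosen to match the worked example; other chunk sizes and ways of combining the pieces are equally valid.

```python
def folding_hash(key: int, chunk: int = 3) -> int:
    """Split the decimal digits into groups of `chunk` and add the groups."""
    digits = str(key)
    return sum(int(digits[i:i + chunk]) for i in range(0, len(digits), chunk))


def fold_then_divide(key: int, m: int, chunk: int = 3) -> int:
    """Folding followed by a division hash into a table of size m."""
    return folding_hash(key, chunk) % m


print(folding_hash(1234567893572))           # 123+456+789+357+2 = 1727
print(fold_then_divide(1234567893572, 101))  # 1727 mod 101 = 10
```

The final mod step is what keeps the result inside the table: the folded sum by itself is smaller than the key but is not bounded by any particular table size.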