Reading assignment for chapter 12 has changed
a little -- read about hashing first.
Finish discussion of chapter 11 material --
priority queues.
Array-based implementation of heap/priority
queue
Binary search tree implementation of
heap/priority queue
Balanced binary search tree implementation of
heap/priority queue
advantages/disadvantages
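As a sketch of the array-based approach above, here is a minimal min-heap priority queue in Python (class and method names are illustrative, not from the text):

```python
class MinHeapPQ:
    """Array-based min-heap: the parent of index i lives at index (i-1)//2."""

    def __init__(self):
        self.data = []

    def insert(self, item):
        # Append at the end, then sift up to restore the heap property.
        self.data.append(item)
        i = len(self.data) - 1
        while i > 0 and self.data[(i - 1) // 2] > self.data[i]:
            parent = (i - 1) // 2
            self.data[i], self.data[parent] = self.data[parent], self.data[i]
            i = parent

    def remove_min(self):
        # Move the last leaf to the root, then sift down.
        top = self.data[0]
        last = self.data.pop()
        if self.data:
            self.data[0] = last
            i, n = 0, len(self.data)
            while True:
                left, right = 2 * i + 1, 2 * i + 2
                smallest = i
                if left < n and self.data[left] < self.data[smallest]:
                    smallest = left
                if right < n and self.data[right] < self.data[smallest]:
                    smallest = right
                if smallest == i:
                    break
                self.data[i], self.data[smallest] = self.data[smallest], self.data[i]
                i = smallest
        return top
```

Both insert and remove_min are O(log N) because the sift paths follow the height of the implicit complete tree.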
Discussion of hashing
What is the problem we want to solve with
hashing?
What is the ADT hash table?
Hashing approximates the idea of having an
array indexed by the set of all possible keys.
When the idea above is not practical, we use
the hash function as a mapping key ---> index.
A good hash function is easy and fast to
calculate and produces an even distribution
of keys across the set of indices.
A perfect hash function is one-to-one:
different keys never hash to the same index in the
table. If all keys are known in advance it
is practical to create a perfect hash
function, otherwise not.
To increase the likelihood of even
distribution of keys:
calculation of hash function should
involve the entire key.
If the hash function uses modular
arithmetic, the base (the table size)
should have no factors less than 25 or
so. To be on the safe side, choose a
prime for the base.
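A small example of both guidelines -- involve the entire key, and reduce modulo a prime table size. This is one common scheme (a polynomial rolling hash); the function name and constants are mine, not from the text:

```python
def string_hash(key: str, table_size: int = 101) -> int:
    """Hash a string to an index in [0, table_size); 101 is prime."""
    h = 0
    for ch in key:                      # every character of the key participates
        h = (h * 31 + ord(ch)) % table_size
    return h
```

Because the whole key contributes, keys that differ only in character order usually hash differently.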
Collisions are likely
Wednesday, November 29
Take Roll.
Check out this week's schedule
Announcement(s)
Level three version of program #5 is due
Friday, Dec 01.
Load factor
alpha = (#items in table)/(#addresses in table)
Load factor alpha measures how full the table
is
Average work required to do a search in a
hash table increases with alpha.
Linear Probing for Collision Resolution
Average #comparisons for successful search:
(1/2){1+ [1/(1-alpha)]}
Average #comparisons for unsuccessful
search: (1/2){1+ [1/(1-alpha)^2]}
Both terms approach 1 as alpha-->0.
Both terms approach infinity as alpha-->1.
(These approximations lose accuracy as
alpha gets close to 1.)
When alpha=2/3, [1/(1-alpha)]=3 and
[1/(1-alpha)^2]=9, so a successful search averages
(1+3)/2=2 comparisons and an unsuccessful search
averages (1+9)/2=5 comparisons.
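As a sanity check, the two linear-probing formulas can be evaluated directly (function names are mine):

```python
def linear_probe_success(alpha: float) -> float:
    """Average comparisons for a successful search under linear probing."""
    return 0.5 * (1 + 1 / (1 - alpha))

def linear_probe_fail(alpha: float) -> float:
    """Average comparisons for an unsuccessful search under linear probing."""
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)
```

At alpha = 2/3 these give 2 and 5 comparisons respectively.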
Quadratic probing and double hashing for Collision Resolution
Average #comparisons for successful
search: -ln(1-alpha)/alpha
Average #comparisons for unsuccessful
search: 1/(1-alpha).
Both terms approach 1 as alpha-->0.
Both terms approach infinity as alpha-->1.
(These approximations lose accuracy as
alpha gets close to 1.)
These functions are smaller than the
corresponding work functions for linear
probing.
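The same check for the quadratic-probing/double-hashing formulas (again, function names are mine):

```python
import math

def double_hash_success(alpha: float) -> float:
    """Average comparisons for a successful search: -ln(1-alpha)/alpha."""
    return -math.log(1 - alpha) / alpha

def double_hash_fail(alpha: float) -> float:
    """Average comparisons for an unsuccessful search: 1/(1-alpha)."""
    return 1 / (1 - alpha)
```

At alpha = 2/3 the unsuccessful-search cost is 3 comparisons and the successful-search cost is about 1.65, both smaller than the corresponding linear-probing figures (5 and 2).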
The methods described above are open-addressing
methods. They are problematic when the number of
keys cannot be bounded in advance, because the
table can fill up.
Separate Chaining for Collision Resolution
alpha can be greater than 1.
alpha = (#items)/(#addresses) = average length
of a chain.
Average #comparisons for successful search:
1+(alpha/2) -- 1 probe plus the average
number of additional probes required in the
average non-empty list.
Average #comparisons for unsuccessful
search: alpha -- we have to probe until we
get to the end of the average list.
Advice: If you can afford a large table (small
alpha) and if you are willing to trade occasional
slowness of search for very fast average search,
then hashing may be the right choice.
Advice: If you are performing "life-and-death search
for your city's poison control center" a (balanced)
search-tree implementation will at least assure you
an O(log N) bound on the number of probes required.
Typically keys do not map into a hash table in any
semblance of key order and it is not possible to
efficiently implement a hash table operation that
traverses the table elements in key-order.
Practically speaking, you would have to sort the
items in a hash table to traverse the table in
sorted order. If that traversal operation is
necessary, think seriously about using a search tree
instead of a hash table.
When memory accesses are exclusively to primary
memory, the time savings gained by hashing is often
not significant. However, the savings may very well
be significant when accessing a very slow external
data device such as a disk.
Friday, December 01
Take Roll.
Check out this week's schedule
Announcement(s)
Discuss hashing assignment
If you are doing separate (external) chaining, the probe counter can simply be the length of the chain.
Calculating (alpha^k) / (k!) can be
"unstable" because (k!) gets very large very
quickly as k increases, and because (alpha^k)
can get either very large or very small very
quickly as k increases. (Which of the two
happens depends on whether the load factor
alpha is greater than or less than one.)
If alpha > 1, consider calculating it this way:
(alpha/1)*(alpha/2)* ... *(alpha/k). That way
the numerator and denominator tend to "cancel"
and there is less chance of overflow or underflow.
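The product form can be sketched in a few lines of Python (the function name is mine):

```python
def poisson_term(alpha: float, k: int) -> float:
    """Compute alpha^k / k! as the product (alpha/1)*(alpha/2)*...*(alpha/k).

    Each factor alpha/i stays moderate in size, so intermediate results
    never blow up the way a separate alpha**k or factorial(k) can.
    """
    term = 1.0
    for i in range(1, k + 1):
        term *= alpha / i
    return term
```

For small k this agrees with the direct calculation, e.g. poisson_term(0.8, 2) is 0.8 * 0.4 = 0.32; for large k the direct factorial blows up while the product form degrades gracefully.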
It is easy to make an error doing the math to
generate the information in the table. Test
with 800 keys and 1000 min table size to see if
you get the same output as in the assignment
sheet. The sample was actually generated by a
solution program.
Gotcha: remember that you need to do something
"fancier" than integer division to calculate
alpha.
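The gotcha in Python terms (in C or C++ the equivalent fix is casting one operand to double before dividing):

```python
items, addresses = 800, 1000

alpha_wrong = items // addresses   # integer division truncates to 0 -- wrong!
alpha = items / addresses          # floating-point division gives 0.8
```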
Discuss "Birthday Surprise/Paradox", collisions, and
value of statistics that the hashing program
calculates.