QUESTION I: WHAT IS THE REAL BIG-O OF SEARCH IN A BST?
AND HOW DOES ONE CALCULATE IT?
An important question is:
I. What is the number of probes required to do a (binary) search in a binary
search tree (BST)?
If you think about question I for a while, you realize that it doesn't have
a simple answer. That's because the number of probes you need to do in a
search depends on many things, like the number of nodes in the tree, the
exact shape of the tree, which key value you are looking for, and whether
that key value is actually in the tree.
Even if you just consider one particular tree, you find that you must think
in terms of AVERAGES to get useful information. Here are two well-formed
questions that apply to any particular binary search tree:
II. "What is the AVERAGE number of probes required for a successful
search in a BST?", and
III. "What is the AVERAGE number of probes required for an unsuccessful
search in a BST?"
(There are good reasons for separating the cases of successful and
unsuccessful search -- for one thing, this will figure prominently in the
method we eventually use to calculate a general answer to question I.)
The answer to question II can be calculated "by hand" if the tree is not
very large. For every node A in the tree, one just counts the number of
nodes in the path from the root to A (including the root and A). That gives
the number of probes required to find A, and so if we sum all those numbers
and divide by the number of nodes in the tree, we get the number of probes
required to find the "average node in the tree" (assuming that all node keys
are equally likely to be sought.)
To calculate the average number of probes for an UNsuccessful search, we
note that such searches always end when a nil pointer is encountered, and
that there is one "class" of unsuccessful search for each nil pointer in the
tree. Assuming again that all keys not in the tree are equally likely to be
sought in an unsuccessful search, we can count the number of nodes in all
the paths from the root to these nil pointers, sum those, and divide by the
number of nil pointers, thus getting the average number of probes for an
unsuccessful search.
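For a small tree, both of these by-hand calculations are easy to mechanize. Here is a sketch in Python; the tuple representation of a BST and the function names are my own, chosen for illustration:

```python
# A BST is either None (a nil pointer) or a tuple (key, left, right).
# The example tree:
#         10
#        /  \
#       5    15
#      / \
#     2   6
tree = (10,
        (5, (2, None, None), (6, None, None)),
        (15, None, None))

def probe_counts(t, depth=1):
    """Probes needed to find each node: the count of nodes on its root path."""
    if t is None:
        return []
    _, left, right = t
    return [depth] + probe_counts(left, depth + 1) + probe_counts(right, depth + 1)

def nil_probe_counts(t, depth=0):
    """Probes done before each nil pointer is reached: nodes on the path to it."""
    if t is None:
        return [depth]
    _, left, right = t
    return nil_probe_counts(left, depth + 1) + nil_probe_counts(right, depth + 1)

succ = probe_counts(tree)               # [1, 2, 3, 3, 2]
avg_success = sum(succ) / len(succ)     # 11/5 = 2.2 probes on average
fails = nil_probe_counts(tree)          # one entry per nil pointer (6 of them)
avg_fail = sum(fails) / len(fails)      # 16/6, about 2.67 probes on average
```

Note that a tree with n nodes always has n+1 nil pointers, which is why `fails` has six entries for this five-node tree.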
The considerations above are all well and good, but how do we proceed when
the number of nodes is very large? And also, what if we want some over-all
information about search in binary search trees with a given number of nodes?
For example, there are 14 distinct binary search tree "shapes" a tree with
only four nodes can have. How can we succinctly describe the amount of work
required to search in "a BST with 4 nodes"?
One rather nice way to begin to answer is to imagine the following
"experiment": Pick a number n. Randomly choose n distinct key values from
some totally ordered set, and shuffle them so they are in some random order.
Then build a binary search tree by starting with an empty tree, and inserting
nodes with those keys in the order they got in the shuffle. The result is a
"random BST" of size n. Now calculate the average probe numbers as
described before.
If you repeat the experiment many times, and then average the respective
"average probe numbers", you will get a very good estimate of a more
generalized average -- the average probe numbers for a BST of size n. Of
course, for large trees, the amount of work you (or a computer) might have
to do to REALLY calculate these quantities could easily be too much to be
practical. But thinking about this experiment leads to a mathematical
shortcut.
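The experiment itself can be sketched directly in Python. The helper names here are my own, and the trial count is an arbitrary choice; this is a Monte Carlo estimate, not the mathematical shortcut developed below:

```python
import random

def insert(t, key):
    """Insert key into a tuple-BST (key, left, right); None is a nil pointer."""
    if t is None:
        return (key, None, None)
    k, left, right = t
    if key < k:
        return (k, insert(left, key), right)
    return (k, left, insert(right, key))

def path_sums(t, depth=1):
    """(sum of node probe counts, node count, sum of nil probe counts, nil count)."""
    if t is None:
        return (0, 0, depth - 1, 1)
    _, left, right = t
    a = path_sums(left, depth + 1)
    b = path_sums(right, depth + 1)
    return (depth + a[0] + b[0], 1 + a[1] + b[1], a[2] + b[2], a[3] + b[3])

def estimate(n, trials=2000, seed=1):
    """Estimate S(n) and U(n), n >= 1: build `trials` random BSTs of size n
    and average their per-tree average probe counts."""
    rng = random.Random(seed)
    s_total = u_total = 0.0
    for _ in range(trials):
        keys = list(range(n))
        rng.shuffle(keys)        # a random insertion order
        t = None
        for k in keys:
            t = insert(t, k)
        node_sum, nodes, nil_sum, nils = path_sums(t)
        s_total += node_sum / nodes
        u_total += nil_sum / nils
    return s_total / trials, u_total / trials
```

For n = 1 the estimate is exact (every shuffle gives the same one-node tree), and for larger n it settles near the true averages as the trial count grows.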
Why should we care about those numbers? Because in fact the part of the
experiment up to the construction of the tree is very like what is happening
out there in the world! People are taking sets of keys and constructing
BST's for them. If we can figure out what would be found by doing the rest
of the experiment, then we will know a lot about how efficient it is, on
average, to do search in BST's. Why should we care about that? Well, if
there's a better way to do these things, don't you think we computer
scientists ought to try to find it? How will we know it when we see it?
We'll have to know something about how efficient search is in BST's.
Let's give the objects of our quest names. Let S(n) be the average number
of probes required in doing a successful search in a BST with n nodes. S is
a function, there is a value of S for each value of n. Remember that S(n)
stands for the average that you get by averaging over all the different key
orderings that you might get in the shuffle, and that different shuffle
orderings can result in differently shaped trees.
Let U(n) be the average number of probes required in doing an UNsuccessful
search in a BST with n nodes. We would like to know the big-O measurements
of S(n) and U(n). If they are O(log(n)), then search in BST's is no worse,
ON AVERAGE, than binary search in ordered arrays. But if they are more than
O(log(n)), then we would have to seriously question our use of BST's.
Here is a key observation:
Consider the "average node" that gets put into a tree in our series of
experiments. Suppose the shuffled insertion order is

    10  05  15  02  06  N  X  X  X  X
                        ^average node

where N is the average node and the X's are the nodes that come after it.
The partial tree that N is inserted into is built from the keys before it:

              10
             /  \
           05    15
          /  \
        02    06      <-------- N is placed at one of this tree's nil pointers
Wherever the node is placed in the tree that gets built, FINDING THE NODE IN
A SEARCH LATER, AFTER THE TREE IS COMPLETE, WILL REQUIRE ONE MORE PROBE THAN
IT TOOK TO DO THE UNSUCCESSFUL SEARCH FOR IT THAT PRECEDED ITS INSERTION
INTO THE TREE. The "partial" tree that this node is inserted into is
determined only by the permutation (ordering) of the nodes that precede it
into the tree. All orderings are equally likely to occur. (It does not
matter what particular KEYS are in those nodes, only their relative
ordering. All sets of keys are equally likely to occur as the predecessors
of our average node.) Therefore, if our node is the Kth node to enter the
tree, it enters, on the average, into an average BST of K-1 nodes, after AN
AVERAGE UNSUCCESSFUL SEARCH, requiring U(K-1) probes.
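The key observation can be checked empirically. The sketch below (my own helper names) records, for each inserted key, how many probes the failed search preceding its insertion made, and then confirms that finding that key in the finished tree takes exactly one probe more:

```python
import random

def insert_count(t, key):
    """Insert key into a tuple-BST; also return the number of probes made
    by the unsuccessful search that precedes the insertion."""
    if t is None:
        return (key, None, None), 0
    k, left, right = t
    if key < k:
        new_left, probes = insert_count(left, key)
        return (k, new_left, right), probes + 1
    new_right, probes = insert_count(right, key)
    return (k, left, new_right), probes + 1

def find_count(t, key):
    """Number of probes a successful search for key makes."""
    k, left, right = t
    if key == k:
        return 1
    return 1 + find_count(left if key < k else right, key)

rng = random.Random(0)
keys = rng.sample(range(1000), 200)   # 200 distinct keys, random order
t = None
fail_probes = {}
for key in keys:
    t, fail_probes[key] = insert_count(t, key)

# Every key is found with exactly one more probe than its pre-insertion
# unsuccessful search required: later insertions never move an existing node.
assert all(find_count(t, k) == fail_probes[k] + 1 for k in keys)
```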
So our average node, if Kth to enter, will later be found in successful
searches at the cost of [U(K-1) + 1] probes. All values of K between 1 and
n are equally likely. So the AVERAGE cost of a successful search for this
average node is:
Eqn 1. S(n) = { [U(0) + 1] + [U(1) + 1] + ... + [U(n-1) + 1] } / n
= { [ U(0) + U(1) + ... + U(n-1) ] / n } + 1
Eqn 1. is a key to discovering formulas for S(n) and U(n). We will put it
together with some other equations, and "solve for the unknowns."
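Eqn 1 can be verified exactly for small n by brute force: average over all n! insertion orders, using exact rational arithmetic. The code below is my own construction for that check, not part of the derivation:

```python
from fractions import Fraction
from itertools import permutations

def insert(t, key):
    """Insert key into a tuple-BST (key, left, right); None is a nil pointer."""
    if t is None:
        return (key, None, None)
    k, left, right = t
    if key < k:
        return (k, insert(left, key), right)
    return (k, left, insert(right, key))

def path_sums(t, depth=1):
    """(sum of node probe counts, node count, sum of nil probe counts, nil count)."""
    if t is None:
        return (0, 0, depth - 1, 1)
    _, left, right = t
    a = path_sums(left, depth + 1)
    b = path_sums(right, depth + 1)
    return (depth + a[0] + b[0], 1 + a[1] + b[1], a[2] + b[2], a[3] + b[3])

def exact_S_U(n):
    """Exact S(n) and U(n), averaging over all n! insertion orders (small n only)."""
    s = u = Fraction(0)
    count = 0
    for perm in permutations(range(n)):
        t = None
        for k in perm:
            t = insert(t, k)
        node_sum, nodes, nil_sum, nils = path_sums(t)
        s += Fraction(node_sum, nodes) if nodes else Fraction(0)
        u += Fraction(nil_sum, nils)
        count += 1
    return s / count, u / count

U = [exact_S_U(k)[1] for k in range(5)]            # U(0) .. U(4)
for n in range(1, 5):
    assert exact_S_U(n)[0] == sum(U[:n]) / n + 1   # Eqn 1 holds exactly
```

For instance, U(3) comes out to exactly 13/6, matching the value Eqn 5 will produce later.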
S(n) = [I(n) + n] / n, { i.e. I(n) = n*S(n) - n }
where I(n) is the average internal path length of a BST of size n. The
method of calculating the average number of probes for a successful search
in a particular tree was to sum the number of nodes in every path from the
root to a node, and to divide by n. The formula above is just the
mathematical expression of that method, generalized to the AVERAGE BST of
size n. I(n) is the AVERAGE sum you would get by adding all the EDGES in
paths from the root to nodes in the tree. By adding n to I(n), you get the
number of nodes in all those paths, and hence the number of probes required.
U(n) = E(n)/(n+1), { i.e. E(n) = (n+1)*U(n) }
where E(n) is the average external path length of a BST of size n. In any
particular BST with n nodes there are n+1 "nil pointers". The number of
edges in a path from the root to one of these nil pointers is the same as
the number of probes that will be done in an unsuccessful search that
terminates at that nil pointer. Thus, the average number of probes required
in an unsuccessful search in this BST is the sum, E, of all these
edge-path-lengths, divided by n+1, the number of places where the search can
end. E(n) is the AVERAGE value of E. Therefore the formula above gives the
number of probes required for unsuccessful search in the AVERAGE BST of size
n.
E(n) = I(n) + 2n
E = I + 2n is true for all BST's of size n, and so the identity is true for
the averages E(n) and I(n).
(E = I + 2n can be readily proved by induction. One method of proof is to
define I and E to be zero for an empty tree, to verify that E = I + 2n when
n=0 and n=1, and to get the rest of the result by assuming that a larger
tree's left and right subtrees satisfy the identity: If El and Er are the
external path lengths of the left and right subtrees, if Il and Ir are the
internal path lengths, and if nl and nr are the numbers of nodes in the left
and right subtrees, then one can calculate the value of E for the whole tree
as E = (El + nl + 1) + (Er + nr + 1), and the value of I as I = (Il + nl) +
(Ir + nr). Applying the inductive hypotheses El = Il + 2*nl and Er = Ir +
2*nr then gives E - I = 2*nl + 2*nr + 2 = 2n, where n = nl + nr + 1 is the
number of nodes in the whole tree.)
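Since E = I + 2n holds for every individual BST, it is also easy to spot-check mechanically. A minimal sketch, using my own tuple representation of a BST:

```python
import random

def insert(t, key):
    """Insert key into a tuple-BST (key, left, right); None is a nil pointer."""
    if t is None:
        return (key, None, None)
    k, left, right = t
    if key < k:
        return (k, insert(left, key), right)
    return (k, left, insert(right, key))

def ipl(t, d=0):
    """Internal path length I: sum over nodes of edge-distance from the root."""
    return 0 if t is None else d + ipl(t[1], d + 1) + ipl(t[2], d + 1)

def epl(t, d=0):
    """External path length E: sum over nil pointers of edge-distance from the root."""
    return d if t is None else epl(t[1], d + 1) + epl(t[2], d + 1)

rng = random.Random(0)
for n in range(50):
    keys = rng.sample(range(1000), n)
    t = None
    for k in keys:
        t = insert(t, k)
    assert epl(t) == ipl(t) + 2 * n    # E = I + 2n, for every shape tried
```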
A substitution in the last equation above can be done, using the values of
I(n) and E(n) that one gets from the two prior equations:
(n+1)*U(n) = n*S(n) - n + 2n = n*S(n) + n
Then a little algebra done on the previous equation yields:
Eqn 2: S(n) = [1 + (1/n)]*U(n) - 1
If we substitute the value for S(n) given by Eqn 1 into Eqn 2, and simplify
the resulting equation a bit, we get:
Eqn 3: (n+1)*U(n) = U(0) + U(1) + ... + U(n-1) + 2n
Eqn 3 is true for n = 1, 2, 3, .... So we get an equation that is true for
n = 2, 3, 4, ... if we substitute n-1 for n in Eqn 3. That gives:
Eqn 4: n*U(n-1) = U(0) + U(1) + ... + U(n-2) + 2n-2
Subtracting Eqn 4 from Eqn 3 and solving for U(n) gives:
Eqn 5: U(n) = U(n-1) + [2/(n+1)] (true for n = 2, 3, 4, ... )
It's obvious that it always requires one probe for an unsuccessful search in
a BST with one node, i.e. U(1) = 1. Therefore Eqn 5 tells us that
U(2) = 1 + [2/(2+1)] = 1 + [2/3]. Using Eqn 5 again, we then see that
U(3) = 1 + [2/3] + [2/4].
U(4) = 1 + [2/3] + [2/4] + [2/5].
U(5) = 1 + [2/3] + [2/4] + [2/5] + [2/6].
Continuing, we see that in general:
Eqn 6: U(n) = 1 + 2*{ [1/3] + [1/4] + ... + [1/(n+1)] }, or
Eqn 7: U(n) = 2*[H(n+1) - 1],
where H(n) = 1 + [1/2] + [1/3] + ... + [1/n] is the nth "harmonic number".
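The unrolling of Eqn 5 into Eqn 7 can be confirmed in exact arithmetic. A small sketch (function names are my own):

```python
from fractions import Fraction

def U_rec(n):
    """U(n) via Eqn 5: U(1) = 1, U(n) = U(n-1) + 2/(n+1)."""
    u = Fraction(1)
    for k in range(2, n + 1):
        u += Fraction(2, k + 1)
    return u

def H(n):
    """nth harmonic number, 1 + 1/2 + ... + 1/n."""
    return sum(Fraction(1, k) for k in range(1, n + 1))

for n in range(1, 30):
    assert U_rec(n) == 2 * (H(n + 1) - 1)   # Eqn 7
```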
So we see that U is O(H). Let's analyze H. By looking at a histogram of
y = 1/n, and noting that ln(n) is the area under the graph of y = 1/x from
x=1 to x=n, it is easy to be convinced that
H(n) - 1 = [1/2] + ... + [1/n] < ln(n) < H(n-1) = 1 + [1/2] + ...+ [1/(n-1)]
Therefore 0 < ln(n) - H(n) + 1 < 1 - [1/n] < 1
Let b(n) = ln(n) - H(n) + 1. Then b(n+1) - b(n) = ln(n+1) - ln(n) -
[1/(n+1)], and another look at the graph described just above will show that
ln(n+1) - ln(n) is the area under the graph of y = 1/x from x=n to x=(n+1),
while [1/(n+1)] is the area of the rectangle with base 1 and height
[1/(n+1)] that lies completely under that section of the graph (except for
one point). Therefore, the difference b(n+1) - b(n) is POSITIVE for
n = 1, 2, 3, .... Therefore, {b(n)} is a strictly increasing, bounded
sequence of real numbers, and by the COMPLETENESS axiom of the REAL NUMBERS,
there exists a LIMIT to the sequence {b(n)}. So
[H(n)/ln(n)] = 1 + {[1-b(n)]/ln(n)} ====> 1 as n ====> infinity.
(Note we just used the fact that {b(n)} converges to get the conclusion that
{[1-b(n)]/ln(n)} ====> 0.)
A little algebra (using H(n+1) = H(n) + 1/(n+1) and H(n) = ln(n) + 1 - b(n))
shows that
[U(n)/ln(n)] = 2*(1 - [b(n)/ln(n)] + 1/[(n+1)*ln(n)]) ===> 2 as n ===> infinity.
Thus U(n) and ln(n) are big-O of each other (in fact, asymptotically
proportional). We can see that for reasonably large values of n,
U(n) = approximately 2*ln(n) (we can make the error as small as we like, as a
percentage of ln(n), by choosing n sufficiently large).
But ln(n) = ln(2) * L(n), where L is the logarithm with base 2, and ln(2) =
approximately 0.693, which means that 2*ln(2) = approximately 1.386, and
that
U(n) = approximately 1.386*L(n) = approximately 1.4*L(n).
THAT is why it is said that search in a random BST is "only 40% worse" than
binary search in an ordered array. Finally, note that by Eqn 2
S(n) = (1+[1/n])*U(n) - 1,
so that it's clear that S(n)/U(n) ===> 1 as n ===> infinity, and that
S(n) = 1.386*L(n) is also a very good approximation for large values of n.
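A closing numerical sanity check: evaluate U(n) from Eqn 7 in floating point and compare it with L(n) = log2(n). The ratio climbs toward 2*ln(2), approximately 1.386, though the convergence is slow because of the constant term hiding inside H(n+1). This sketch is my own:

```python
import math

def U(n):
    """U(n) = 2*(H(n+1) - 1), i.e. Eqn 7 evaluated in floating point."""
    return 2 * sum(1.0 / k for k in range(2, n + 2))

# Ratio of U(n) to L(n) = log2(n) for increasing n: each value is larger
# than the last, and all stay below the limit 2*ln(2).
ratios = [U(n) / math.log2(n) for n in (10, 10**3, 10**6)]
```

Even at n = 10^6 the ratio is only about 1.34, so "U(n) is about 1.386*L(n)" should be read as an asymptotic statement, not an equality for modest n.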