QUESTION I: WHAT IS THE REAL BIG-O OF SEARCH IN A BST?
AND HOW DOES ONE CALCULATE IT?
An important question is:
I. What is the number of probes required to do a (binary) search in a binary
search tree (BST)?
If you think about question I for a while, you realize that it doesn't have
a simple answer. That's because the number of probes you need to do in a
search depends on many things, like the number of nodes in the tree, the
exact shape of the tree, which key value you are looking for, and whether
that key value is actually in the tree.
Even if you just consider one particular tree, you find that you must think
in terms of AVERAGES to get useful information. Here are two well-formed
questions that apply to any particular binary search tree:
II. "What is the AVERAGE number of probes required for a successful
search in a BST?", and
III. "What is the AVERAGE number of probes required for an unsuccessful
search in a BST?"
(There are good reasons for separating the cases of successful and
unsuccessful search -- for one thing, this will figure prominently in the
method we eventually use to calculate a general answer to question I.)
The answer to question II can be calculated "by hand" if the tree is not
very large. For every node A in the tree, one just counts the number of
nodes in the path from the root to A (including the root and A). That gives
the number of probes required to find A, and so if we sum all those numbers
and divide by the number of nodes in the tree, we get the number of probes
required to find the "average node in the tree" (assuming that all node keys
are equally likely to be sought.)
To calculate the average number of probes for an UNsuccessful search, we
note that such searches always end when a nil pointer is encountered, and
that there is one "class" of unsuccessful search for each nil pointer in the
tree. Assuming again that all keys not in the tree are equally likely to be
sought in an unsuccessful search, we can count the number of nodes in all
the paths from the root to these nil pointers, sum those, and divide by the
number of nil pointers, thus getting the average number of probes for an
unsuccessful search.
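For a small tree, both of these by-hand calculations are easy to mechanize. Here is a sketch in Python; the tuple representation of a BST and the function names are my own, chosen for illustration:

```python
# A BST is either None (a nil pointer) or a tuple (key, left, right).
# The example tree:
#         10
#        /  \
#       5    15
#      / \
#     2   6
tree = (10,
        (5, (2, None, None), (6, None, None)),
        (15, None, None))

def probe_counts(t, depth=1):
    """Probes needed to find each node: the count of nodes on its root path."""
    if t is None:
        return []
    _, left, right = t
    return [depth] + probe_counts(left, depth + 1) + probe_counts(right, depth + 1)

def nil_probe_counts(t, depth=0):
    """Probes done before each nil pointer is reached: nodes on the path to it."""
    if t is None:
        return [depth]
    _, left, right = t
    return nil_probe_counts(left, depth + 1) + nil_probe_counts(right, depth + 1)

succ = probe_counts(tree)               # [1, 2, 3, 3, 2]
avg_success = sum(succ) / len(succ)     # 11/5 = 2.2 probes on average
fails = nil_probe_counts(tree)          # one entry per nil pointer (6 of them)
avg_fail = sum(fails) / len(fails)      # 16/6, about 2.67 probes on average
```

Note that a tree with n nodes always has n+1 nil pointers, which is why `fails` has six entries for this five-node tree.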
The considerations above are all well and good, but how do we proceed when
the number of nodes is very large? And also, what if we want some over-all
information about search in binary search trees with a given number of nodes?
For example, there are 14 distinct binary search tree "shapes" a tree with
only four nodes can have. How can we succinctly describe the amount of work
required to search in "a BST with 4 nodes"?
One rather nice way to begin to answer is to imagine the following
"experiment": Pick a number n. Randomly choose n distinct key values from
some totally ordered set, and shuffle them so they are in some random order.
Then build a binary search tree by starting with an empty tree, and inserting
nodes with those keys in the order they got in the shuffle. The result is a
"random BST" of size n. Now calculate the average probe numbers as
described before.
If you repeat the experiment many times, and then average the respective
"average probe numbers", you will get a very good estimate of a more
generalized average -- the average probe numbers for a BST of size n. Of
course, for large trees, the amount of work you (or a computer) might have
to do to REALLY calculate these quantities could easily be too much to be
practical. But thinking about this experiment leads to a mathematical
shortcut.
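The experiment itself can be sketched directly in Python. The helper names here are my own, and the trial count is an arbitrary choice; this is a Monte Carlo estimate, not the mathematical shortcut developed below:

```python
import random

def insert(t, key):
    """Insert key into a tuple-BST (key, left, right); None is a nil pointer."""
    if t is None:
        return (key, None, None)
    k, left, right = t
    if key < k:
        return (k, insert(left, key), right)
    return (k, left, insert(right, key))

def path_sums(t, depth=1):
    """(sum of node probe counts, node count, sum of nil probe counts, nil count)."""
    if t is None:
        return (0, 0, depth - 1, 1)
    _, left, right = t
    a = path_sums(left, depth + 1)
    b = path_sums(right, depth + 1)
    return (depth + a[0] + b[0], 1 + a[1] + b[1], a[2] + b[2], a[3] + b[3])

def estimate(n, trials=2000, seed=1):
    """Estimate S(n) and U(n), n >= 1: build `trials` random BSTs of size n
    and average their per-tree average probe counts."""
    rng = random.Random(seed)
    s_total = u_total = 0.0
    for _ in range(trials):
        keys = list(range(n))
        rng.shuffle(keys)        # a random insertion order
        t = None
        for k in keys:
            t = insert(t, k)
        node_sum, nodes, nil_sum, nils = path_sums(t)
        s_total += node_sum / nodes
        u_total += nil_sum / nils
    return s_total / trials, u_total / trials
```

For n = 1 the estimate is exact (every shuffle gives the same one-node tree), and for larger n it settles near the true averages as the trial count grows.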
Why should we care about those numbers? Because in fact the part of the
experiment up to the construction of the tree is very like what is happening
out there in the world! People are taking sets of keys and constructing
BST's for them. If we can figure out what would be found by doing the rest
of the experiment, then we will know a lot about how efficient it is, on
average, to do search in BST's. Why should we care about that? Well, if
there's a better way to do these things, don't you think we computer
scientists ought to try to find it? How will we know it when we see it?
We'll have to know something about how efficient search is in BST's.
Let's give the objects of our quest names. Let S(n) be the average number
of probes required in doing a successful search in a BST with n nodes. S is
a function, there is a value of S for each value of n. Remember that S(n)
stands for the average that you get by averaging over all the different key
orderings that you might get in the shuffle, and that different shuffle
orderings can result in differently shaped trees.
Let U(n) be the average number of probes required in doing an UNsuccessful
search in a BST with n nodes. We would like to know the big-O measurements
of S(n) and U(n). If they are O(log(n)), then search in BST's is no worse,
ON AVERAGE, than binary search in ordered arrays. But if they are more than
O(log(n)), then we would have to seriously question our use of BST's.
Here is a key observation:
Consider the "average node" that gets put into a tree in our series of
experiments. Suppose the shuffled insertion order is

    10  05  15  02  06  N  X  X  X  X
                        ^average node

where N is the average node and the X's are the nodes that come after it.
The partial tree that N is inserted into is built from the keys before it:

              10
             /  \
           05    15
          /  \
        02    06      <-------- N is placed at one of this tree's nil pointers
Wherever the node is placed in the tree that gets built, FINDING THE NODE IN
A SEARCH LATER, AFTER THE TREE IS COMPLETE, WILL REQUIRE ONE MORE PROBE THAN
IT TOOK TO DO THE UNSUCCESSFUL SEARCH FOR IT THAT PRECEDED ITS INSERTION
INTO THE TREE. The "partial" tree that this node is inserted into is
determined only by the permutation (ordering) of the nodes that precede it
into the tree. All orderings are equally likely to occur. (It does not
matter what particular KEYS are in those nodes, only their relative
ordering. All sets of keys are equally likely to occur as the predecessors
of our average node.) Therefore, if our node is the Kth node to enter the
tree, it enters, on the average, into an average BST of K-1 nodes, after AN
AVERAGE UNSUCCESSFUL SEARCH, requiring U(K-1) probes.
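The key observation can be checked empirically. The sketch below (my own helper names) records, for each inserted key, how many probes the failed search preceding its insertion made, and then confirms that finding that key in the finished tree takes exactly one probe more:

```python
import random

def insert_count(t, key):
    """Insert key into a tuple-BST; also return the number of probes made
    by the unsuccessful search that precedes the insertion."""
    if t is None:
        return (key, None, None), 0
    k, left, right = t
    if key < k:
        new_left, probes = insert_count(left, key)
        return (k, new_left, right), probes + 1
    new_right, probes = insert_count(right, key)
    return (k, left, new_right), probes + 1

def find_count(t, key):
    """Number of probes a successful search for key makes."""
    k, left, right = t
    if key == k:
        return 1
    return 1 + find_count(left if key < k else right, key)

rng = random.Random(0)
keys = rng.sample(range(1000), 200)   # 200 distinct keys, random order
t = None
fail_probes = {}
for key in keys:
    t, fail_probes[key] = insert_count(t, key)

# Every key is found with exactly one more probe than its pre-insertion
# unsuccessful search required: later insertions never move an existing node.
assert all(find_count(t, k) == fail_probes[k] + 1 for k in keys)
```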
So our average node, if Kth to enter, will later be found in successful
searches at the cost of [U(K-1) + 1] probes. All values of K between 1 and
n are equally likely. So the AVERAGE cost of a successful search for this
average node is:
Eqn 1. S(n) = { [U(0) + 1] + [U(1) + 1] + ... + [U(n-1) + 1] } / n
= { [ U(0) + U(1) + ... + U(n-1) ] / n } + 1
Eqn 1. is a key to discovering formulas for S(n) and U(n). We will put it
together with some other equations, and "solve for the unknowns."
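Eqn 1 can be verified exactly for small n by brute force: average over all n! insertion orders, using exact rational arithmetic. The code below is my own construction for that check, not part of the derivation:

```python
from fractions import Fraction
from itertools import permutations

def insert(t, key):
    """Insert key into a tuple-BST (key, left, right); None is a nil pointer."""
    if t is None:
        return (key, None, None)
    k, left, right = t
    if key < k:
        return (k, insert(left, key), right)
    return (k, left, insert(right, key))

def path_sums(t, depth=1):
    """(sum of node probe counts, node count, sum of nil probe counts, nil count)."""
    if t is None:
        return (0, 0, depth - 1, 1)
    _, left, right = t
    a = path_sums(left, depth + 1)
    b = path_sums(right, depth + 1)
    return (depth + a[0] + b[0], 1 + a[1] + b[1], a[2] + b[2], a[3] + b[3])

def exact_S_U(n):
    """Exact S(n) and U(n), averaging over all n! insertion orders (small n only)."""
    s = u = Fraction(0)
    count = 0
    for perm in permutations(range(n)):
        t = None
        for k in perm:
            t = insert(t, k)
        node_sum, nodes, nil_sum, nils = path_sums(t)
        s += Fraction(node_sum, nodes) if nodes else Fraction(0)
        u += Fraction(nil_sum, nils)
        count += 1
    return s / count, u / count

U = [exact_S_U(k)[1] for k in range(5)]            # U(0) .. U(4)
for n in range(1, 5):
    assert exact_S_U(n)[0] == sum(U[:n]) / n + 1   # Eqn 1 holds exactly
```

For instance, U(3) comes out to exactly 13/6, matching the value Eqn 5 will produce later.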
S(n) = [I(n) + n] / n, { i.e. I(n) = n*S(n) - n }
where I(n) is the average internal path length of a BST of size n. The
method of calculating the average number of probes for a successful search
in a particular tree was to sum the number of nodes in every path from the
root to a node, and to divide by n. The formula above is just the
mathematical expression of that method, generalized to the AVERAGE BST of
size n. I(n) is the AVERAGE sum you would get by adding all the EDGES in
paths from the root to nodes in the tree. By adding n to I(n), you get the
number of nodes in all those paths, and hence the number of probes required.
U(n) = E(n)/(n+1), { i.e. E(n) = (n+1)*U(n) }
where E(n) is the average external path length of a BST of size n. In any
particular BST with n nodes there are n+1 "nil pointers". The number of
edges in a path from the root to one of these nil pointers is the same as
the number of probes that will be done in an unsuccessful search that
terminates at that nil pointer. Thus, the average number of probes required
in an unsuccessful search in this BST is the sum, E, of all these
edge-path-lengths, divided by n+1, the number of places where the search can
end. E(n) is the AVERAGE value of E. Therefore the formula above gives the
number of probes required for unsuccessful search in the AVERAGE BST of size
n.
E(n) = I(n) + 2n
E = I + 2n is true for all BST's of size n, and so the identity is true for
the averages E(n) and I(n).
(E = I + 2n can be readily proved by induction. One method of proof is to
define I and E to be zero for an empty tree, to verify that E = I + 2n when
n=0 and n=1, and to get the rest of the result by assuming that a larger
tree's left and right subtrees satisfy the identity: If El and Er are the
external path lengths of the left and right subtrees, if Il and Ir are the
internal path lengths, and if nl and nr are the numbers of nodes in the left
and right subtrees, then one can calculate the value of E for the whole tree
as E = (El + nl + 1) + (Er + nr + 1), and the value of I as I = (Il + nl) +
(Ir + nr). Applying the inductive hypotheses El = Il + 2*nl and Er = Ir +
2*nr then gives E - I = 2*nl + 2*nr + 2 = 2n, where n = nl + nr + 1 is the
number of nodes in the whole tree.)
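Since E = I + 2n holds for every individual BST, it is also easy to spot-check mechanically. A minimal sketch, using my own tuple representation of a BST:

```python
import random

def insert(t, key):
    """Insert key into a tuple-BST (key, left, right); None is a nil pointer."""
    if t is None:
        return (key, None, None)
    k, left, right = t
    if key < k:
        return (k, insert(left, key), right)
    return (k, left, insert(right, key))

def ipl(t, d=0):
    """Internal path length I: sum over nodes of edge-distance from the root."""
    return 0 if t is None else d + ipl(t[1], d + 1) + ipl(t[2], d + 1)

def epl(t, d=0):
    """External path length E: sum over nil pointers of edge-distance from the root."""
    return d if t is None else epl(t[1], d + 1) + epl(t[2], d + 1)

rng = random.Random(0)
for n in range(50):
    keys = rng.sample(range(1000), n)
    t = None
    for k in keys:
        t = insert(t, k)
    assert epl(t) == ipl(t) + 2 * n    # E = I + 2n, for every shape tried
```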
A substitution in the last equation above can be done, using the values of
I(n) and E(n) that one gets from the two prior equations:
(n+1)*U(n) = n*S(n) - n + 2n = n*S(n) + n
Then a little algebra done on the previous equation yields:
Eqn 2: S(n) = [1 + (1/n)]*U(n) - 1
If we substitute the value for S(n) given by Eqn 1 into Eqn 2, and simplify
the resulting equation a bit, we get:
Eqn 3: (n+1)*U(n) = U(0) + U(1) + ... + U(n-1) + 2n
Eqn 3 is true for n = 1, 2, 3, .... So we get an equation that is true for
n = 2, 3, 4, ... if we substitute n-1 for n in Eqn 3. That gives:
Eqn 4: n*U(n-1) = U(0) + U(1) + ... + U(n-2) + 2n-2
Subtracting Eqn 4 from Eqn 3 and solving for U(n) gives:
Eqn 5: U(n) = U(n-1) + [2/(n+1)] (true for n = 2, 3, 4, ... )
It's obvious that it always requires one probe for an unsuccessful search in
a BST with one node, i.e. U(1) = 1. Therefore Eqn 5 tells us that
U(2) = 1 + [2/(2+1)] = 1 + [2/3]. Using Eqn 5 again, we then see that
U(3) = 1 + [2/3] + [2/4].
U(4) = 1 + [2/3] + [2/4] + [2/5].
U(5) = 1 + [2/3] + [2/4] + [2/5] + [2/6].
Continuing, we see that in general:
Eqn 6: U(n) = 1 + 2*{ [1/3] + [1/4] + ... + [1/(n+1)] }, or
Eqn 7: U(n) = 2*[H(n+1) - 1],
where H(n) = 1 + [1/2] + [1/3] + ... + [1/n] is the nth "harmonic number".
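The unrolling of Eqn 5 into Eqn 7 can be confirmed in exact arithmetic. A small sketch (function names are my own):

```python
from fractions import Fraction

def U_rec(n):
    """U(n) via Eqn 5: U(1) = 1, U(n) = U(n-1) + 2/(n+1)."""
    u = Fraction(1)
    for k in range(2, n + 1):
        u += Fraction(2, k + 1)
    return u

def H(n):
    """nth harmonic number, 1 + 1/2 + ... + 1/n."""
    return sum(Fraction(1, k) for k in range(1, n + 1))

for n in range(1, 30):
    assert U_rec(n) == 2 * (H(n + 1) - 1)   # Eqn 7
```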
So we see that U is O(H). Let's analyze H. By looking at a histogram of
y = 1/n, and noting that ln(n) is the area under the graph of y = 1/x from
x=1 to x=n, it is easy to be convinced that
H(n) - 1 = [1/2] + ... + [1/n] < ln(n) < H(n-1) = 1 + [1/2] + ...+ [1/(n-1)]
Therefore 0 < ln(n) - H(n) + 1 < 1 - [1/n] < 1
Let b(n) = ln(n) - H(n) + 1. Then b(n+1) - b(n) = ln(n+1) - ln(n) -
[1/(n+1)], and another look at the graph described just above will show that
ln(n+1) - ln(n) is the area under the graph of y = 1/x from x=n to x=(n+1),
while [1/(n+1)] is the area of the rectangle with base 1 and height
[1/(n+1)] that lies completely under that section of the graph (except for
one point). Therefore, the difference b(n+1) - b(n) is POSITIVE for
n = 1, 2, 3, .... Therefore, {b(n)} is a strictly increasing, bounded
sequence of real numbers, and by the COMPLETENESS axiom of the REAL NUMBERS,
there exists a LIMIT to the sequence {b(n)}. So
[H(n)/ln(n)] = 1 + {[1-b(n)]/ln(n)} ====> 1 as n ====> infinity.
(Note we just used the fact that {b(n)} converges to get the conclusion that
{[1-b(n)]/ln(n)} ====> 0.)
A little algebra (using H(n+1) = H(n) + 1/(n+1) and H(n) = ln(n) + 1 - b(n))
shows that
[U(n)/ln(n)] = 2*(1 - [b(n)/ln(n)] + 1/[(n+1)*ln(n)]) ===> 2 as n ===> infinity.
Thus U(n) and ln(n) are big-O of each other (in fact, asymptotically
proportional). We can see that for reasonably large values of n,
U(n) = approximately 2*ln(n) (we can make the error as small as we like, as a
percentage of ln(n), by choosing n sufficiently large).
But ln(n) = ln(2) * L(n), where L is the logarithm with base 2, and ln(2) =
approximately 0.693, which means that 2*ln(2) = approximately 1.386, and
that
U(n) = approximately 1.386*L(n) = approximately 1.4*L(n).
THAT is why it is said that search in a random BST is "only 40% worse" than
binary search in an ordered array. Finally, note that by Eqn 2
S(n) = (1+[1/n])*U(n) - 1,
so that it's clear that S(n)/U(n) ===> 1 as n ===> infinity, and that
S(n) = 1.386*L(n) is also a very good approximation for large values of n.
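A closing numerical sanity check: evaluate U(n) from Eqn 7 in floating point and compare it with L(n) = log2(n). The ratio climbs toward 2*ln(2), approximately 1.386, though the convergence is slow because of the constant term hiding inside H(n+1). This sketch is my own:

```python
import math

def U(n):
    """U(n) = 2*(H(n+1) - 1), i.e. Eqn 7 evaluated in floating point."""
    return 2 * sum(1.0 / k for k in range(2, n + 2))

# Ratio of U(n) to L(n) = log2(n) for increasing n: each value is larger
# than the last, and all stay below the limit 2*ln(2).
ratios = [U(n) / math.log2(n) for n in (10, 10**3, 10**6)]
```

Even at n = 10^6 the ratio is only about 1.34, so "U(n) is about 1.386*L(n)" should be read as an asymptotic statement, not an equality for modest n.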