(Latest Revision: 04/11/2005)
QUESTION I: WHAT IS THE REAL BIG-O OF SEARCH IN A BST?
AND HOW DOES ONE CALCULATE IT?
An important question is:
I. What is the number of probes required to do a (binary) search in a binary
search tree (BST)?
If you think about question I for a while, you realize that it doesn't have
a simple answer. That's because the number of probes you need to do in a
search depends on many things, like the number of nodes in the tree, the
exact shape of the tree, which key value you are looking for, and whether
that key value is actually in the tree.
Even if you just consider one particular tree, you find that you must think
in terms of AVERAGES to get useful information. Here are two well-formed
questions that apply to any particular binary search tree:
II. "What is the AVERAGE number of probes required for a successful search
in a BST?", and
III. "What is the AVERAGE number of probes required for an unsuccessful
search in a BST?"
(There are good reasons for separating the cases of successful and
unsuccessful search -- for one thing, this will figure prominently in the
method we eventually use to calculate a general answer to question I.)
The answer to question II can be calculated "by hand" if the tree is not
very large. For every node A in the tree, one just counts the number of
nodes in the path from the root to A (including the root and A). That
gives the number of probes required to find A, and so if we sum all those
numbers and divide by the number of nodes in the tree, we get the number of
probes required to find the "average node in the tree" (assuming that all
node keys are equally likely to be sought.)
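The by-hand calculation just described can be sketched in Python. (The tree
representation, helper names, and example keys below are my own illustration,
not from the text.)

```python
def insert(node, key):
    """Insert key into the BST rooted at node; return the (possibly new) root."""
    if node is None:
        return {'key': key, 'left': None, 'right': None}
    side = 'left' if key < node['key'] else 'right'
    node[side] = insert(node[side], key)
    return node

def probe_sum(node, depth=1):
    """Sum, over all nodes, of the number of nodes on the root-to-node path."""
    if node is None:
        return 0
    return depth + probe_sum(node['left'], depth + 1) \
                 + probe_sum(node['right'], depth + 1)

root = None
keys = [10, 5, 15, 2, 6]        # hypothetical example keys
for k in keys:
    root = insert(root, k)

# Probe counts are 1 (for 10), 2, 2, 3, 3; average = 11/5 = 2.2
avg_success = probe_sum(root) / len(keys)
```

For this particular 5-node tree the average successful search costs 2.2 probes.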
To calculate the average number of probes for an UNsuccessful search, we
note that such searches always end when a nil pointer is encountered, and
that there is one "class" of unsuccessful search for each nil pointer in
the tree. Assuming that all these classes of unsuccessful search are
equally likely, we can count the number of nodes in each path from the root
to a nil pointer, sum all those numbers, and divide by the number of nil
pointers, thus getting the average number of probes for an unsuccessful
search.
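The unsuccessful-search average can be computed the same way, by walking to
every nil pointer. (Again, the representation and the example keys are my own
illustration.)

```python
def insert(node, key):
    """Insert key into the BST rooted at node; return the (possibly new) root."""
    if node is None:
        return {'key': key, 'left': None, 'right': None}
    side = 'left' if key < node['key'] else 'right'
    node[side] = insert(node[side], key)
    return node

def nil_probe_counts(node, depth=0):
    """For each nil pointer, the number of nodes probed before reaching it."""
    if node is None:
        return [depth]
    return (nil_probe_counts(node['left'], depth + 1)
            + nil_probe_counts(node['right'], depth + 1))

root = None
for k in [10, 5, 15, 2, 6]:     # hypothetical example keys
    root = insert(root, k)

counts = nil_probe_counts(root)      # a 5-node tree has 6 nil pointers
avg_unsuccess = sum(counts) / len(counts)   # 16/6, about 2.67
```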
The considerations above are all well and good, but how do we proceed when
the number of nodes is very large? And also, what if we want some overall
information about search in binary search trees with a given number of
nodes? For example, there are 14 distinct binary search tree "shapes" a
binary tree with only four nodes can have. How can we succinctly describe
the amount of work required to search in "a BST with 4 nodes"?
One rather nice way to begin to answer is to imagine the following
"experiment": Pick a number n. Randomly choose n distinct key values from
some totally ordered set, and shuffle them so they are in some random
order. Then build a binary search tree by starting with an empty tree, and
inserting nodes with those keys in the order they got in the shuffle. The
result is a random BST of size n. Now calculate the average probe
numbers as described before.
If you repeat the experiment many times, and then average the respective
"average probe numbers", you will get a very good estimate of a more
generalized average -- the average probe numbers for a BST of size n. Of
course, for large trees, the amount of work you (or a computer) might have
to do to REALLY calculate these quantities could easily be too much to be
practical. But thinking about this experiment leads to a mathematical
shortcut.
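The experiment itself is easy to run mechanically. Here is a sketch (my own,
under the stated assumptions of distinct keys in random order) that estimates
the average successful-search cost for trees of size n:

```python
import random

def insert(node, key):
    if node is None:
        return {'key': key, 'left': None, 'right': None}
    side = 'left' if key < node['key'] else 'right'
    node[side] = insert(node[side], key)
    return node

def probe_sum(node, depth=1):
    if node is None:
        return 0
    return depth + probe_sum(node['left'], depth + 1) \
                 + probe_sum(node['right'], depth + 1)

def estimate_S(n, trials=500, rng=random.Random(42)):
    """Estimate the average successful-search cost by repeated experiment."""
    total = 0.0
    for _ in range(trials):
        keys = list(range(n))    # n distinct keys ...
        rng.shuffle(keys)        # ... inserted in random order
        root = None
        for k in keys:
            root = insert(root, k)
        total += probe_sum(root) / n
    return total / trials
```

For n = 100 the estimate comes out around 7.5 probes, consistent with the
logarithmic behavior derived later in the text.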
Why should we care about those numbers? Because in fact the part of the
experiment up to the construction of the tree is very like what is
happening out there in the world! People are taking sets of keys and
constructing BST's for them. If we can figure out what would be found by
doing the rest of the experiment, then we will know a lot about how
efficient it is, on average, to do search in BST's. Why should we care
about that? Well, if there's a better way to do these things, don't you
think we computer scientists ought to try to find it? How will we know it
when we see it? We'll have to know something about how efficient search is
in BST's.
Let's give the objects of our quest names.
Let S(n) be the average number of probes required in doing a
successful search in a BST with n nodes.
S is a function: there is a value of S(n) for each value of n. Remember that
S(n) stands for the average that you get by averaging over all the different
key orderings that you might get in the shuffle.
Let U(n) be the average number of probes required in doing an
UNsuccessful search in a BST with n nodes.
We would like to know the big-O measurements
of S(n) and U(n). If they are O(log(n)), then search in BST's is no worse,
ON AVERAGE, than binary search in ordered arrays. But if they are more
than O(log(n)), then we would have to seriously question our use of BST's.
Here is a key observation:
Consider the "average node", that gets put into a tree in our series of
experiments,
       Insertion order:  10  05  15  02  06   N   X  X  X  X
                                              ^ average node

       Partial tree when N arrives (one possible placement):

                 10
                /  \
              05    15
             /  \     \
           02    06     N   <-------- placement of node
Wherever the node is placed in the tree that gets built, finding the
node in a search later, after the tree is complete, will require one more
probe than it took to do the unsuccessful search for it that preceded its
insertion into the tree. The "partial" tree that this node is inserted
into is determined only by the permutation (ordering) of the nodes that
precede it into the tree. All orderings are equally likely to occur. (It
does not matter what particular keys are in those nodes, only their
relative ordering. All sets of keys are equally likely to occur as the
predecessors of our average node.) Therefore, if our node is the Kth node
to enter the tree, it enters, on the average, into an average BST of K-1
nodes, after an average unsuccessful search, requiring U(K-1)
probes.
So our average node, if Kth to enter, will later be found in successful
searches at the cost of [U(K-1) + 1] probes. All values of K between 1 and
n are equally likely. So the average cost of a successful search
for this average node is:
Eqn 1. S(n) = { [U(0) + 1] + [U(1) + 1] + ... + [U(n-1) + 1] } / n
= { [ U(0) + U(1) + ... + U(n-1) ] / n } + 1
Eqn 1. is a key to discovering formulas for S(n) and U(n). We will put it
together with some other equations, and "solve for the unknowns."
S(n) = [I(n) + n] / n, { i.e. I(n) = n*S(n) - n }
where I(n) is the average internal path length of a BST of size n. The
method of calculating the average number of probes for a successful search
in a particular tree was to sum the number of nodes in every path from the
root to a node, and to divide by n. The formula above is just the
mathematical expression of that method, generalized to the average
bst of size n. I(n) is the average sum you would get by adding
all the edges in paths from the root to nodes in the tree. By
adding n to I(n), you get the number of nodes in all those paths, and hence
the number of probes required.
U(n) = E(n)/(n+1), { i.e. E(n) = (n+1)*U(n) }
where E(n) is the average external path length of a BST of size n. In any
particular BST with n nodes there are n+1 "nil pointers". The number of
edges in a path from the root to one of these nil pointers is the same as
the number of probes that will be done in an unsuccessful search that
terminates at that nil pointer. Thus, the average number of probes
required in an unsuccessful search in this BST is the sum, E, of all these
edge-path-lengths, divided by n+1, the number of places where the search
can end. E(n) is the average value of E. Therefore the formula
above gives the number of probes required for unsuccessful search in the
average BST of size n.
E(n) = I(n) + 2n
E = I + 2n is true for all BST's of size n, and so the identity is true for
the averages E(n) and I(n).
(One can readily prove by induction that E = I + 2n. One method of proof
is to Define I and E to be zero for an empty tree, to verify that E = I +
2n when n=0 and n=1, and to get the rest of the result by assuming that a
larger tree's left and right subtrees satisfy the identity: If El and Er
are the external path lengths of the left and right subtrees, if Il and Ir
are the internal path lengths, and if nl and nr are the numbers of nodes in
the left and right subtrees, then one can calculate the value of E for the
whole tree as E = (El + nl + 1) + (Er + nr +1), and the value of I = (Il +
nl) + (Ir + nr). The result follows by noting that E = I + 2n, where n =
nl + nr + 1 -- the number of nodes in the whole tree.)
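The identity E = I + 2n is also easy to check mechanically on random trees;
the sketch below (my own helper names) counts edges rather than nodes, matching
the definitions of internal and external path length above:

```python
import random

def insert(node, key):
    if node is None:
        return {'key': key, 'left': None, 'right': None}
    side = 'left' if key < node['key'] else 'right'
    node[side] = insert(node[side], key)
    return node

def internal_path_length(node, depth=0):
    """I: sum of edge counts on paths from the root to every node."""
    if node is None:
        return 0
    return depth + internal_path_length(node['left'], depth + 1) \
                 + internal_path_length(node['right'], depth + 1)

def external_path_length(node, depth=0):
    """E: sum of edge counts on paths from the root to every nil pointer."""
    if node is None:
        return depth
    return external_path_length(node['left'], depth + 1) \
         + external_path_length(node['right'], depth + 1)

rng = random.Random(7)
holds = True
for n in (1, 2, 5, 20, 100):
    keys = list(range(n))
    rng.shuffle(keys)
    root = None
    for k in keys:
        root = insert(root, k)
    holds = holds and (external_path_length(root)
                       == internal_path_length(root) + 2 * n)
```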
One may make a substitution in the last equation above, using the values of
I(n) and E(n) that one gets from the two prior equations:
Eqn 2: (n+1)*U(n) = n*S(n) -n + 2n = n*S(n) + n
Basically, Eqn 1 is:
S(n) = { [ U(0) + U(1) + ... + U(n-1) ] / n } + 1
If we multiply Eqn 1 by n we get:
n*S(n) = { [ U(0) + U(1) + ... + U(n-1) ] } + n
If we then substitute the right hand side of the equation above into Eqn 2, we
get:
Eqn 3: (n+1)*U(n) = U(0) + U(1) + ... + U(n-1) + 2n
Eqn 3 is true for n = 1, 2, 3, ..., so we get an equation that is true for n
= 2, 3, 4, ... if we substitute n-1 for n in Eqn 3. That gives:
Eqn 4: n*U(n-1) = U(0) + U(1) + ... + U(n-2) + 2n-2
Subtracting Eqn 4 from Eqn 3 and solving for U(n) gives:
Eqn 5: U(n) = U(n-1) + [2/(n+1)] (true for n = 2, 3, 4, ... )
It's obvious that it always requires one probe for an unsuccessful search
in a BST with one node, i.e. U(1) = 1. Therefore Eqn 5 tells us that
U(2) = 1 + [2/(2+1)] = 1 + [2/3]. Using Eqn 5 again, we then see that
U(3) = 1 + [2/3] + [2/4].
U(4) = 1 + [2/3] + [2/4] + [2/5].
U(5) = 1 + [2/3] + [2/4] + [2/5] + [2/6].
Continuing, we see that in general:
Eqn 6: U(n) = 1 + 2*{ [1/3] + [1/4] + ... + [1/(n+1)] }, or
Eqn 7: U(n) = 2*[H(n+1) - 1],
where H(n) = 1 + [1/2] + [1/3] + ... + [1/n] is the nth "harmonic number".
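Eqn 5 and Eqn 7 must agree, and a quick numeric check (my own sketch) confirms
that the recurrence and the harmonic-number closed form produce the same values:

```python
from math import isclose

def U_recurrence(n):
    """U(n) via Eqn 5: U(1) = 1, U(k) = U(k-1) + 2/(k+1)."""
    u = 1.0
    for k in range(2, n + 1):
        u += 2.0 / (k + 1)
    return u

def harmonic(n):
    """H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

def U_closed(n):
    """U(n) via Eqn 7: 2*[H(n+1) - 1]."""
    return 2.0 * (harmonic(n + 1) - 1.0)

ok = all(isclose(U_recurrence(n), U_closed(n)) for n in (1, 2, 10, 1000))
```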
So we see that U is O(H). Let's analyze H. By looking at a histogram of y
= 1/n, and noting that ln(n) is the area under the graph of y = 1/x from
x=1 to x=n, it is easy to be convinced that
H(n) - 1 = [1/2] + ... + [1/n] < ln(n) < H(n-1) = 1 + [1/2] + ...+ [1/(n-1)]
Therefore 0 < ln(n) - H(n) + 1 < 1 - [1/n] < 1
^^^^^^^^^^^^^^^^
b(n)
Let b(n) = ln(n) - H(n) + 1. Then b(n+1) - b(n) = ln(n+1) - ln(n) -
[1/(n+1)], and another look at the graph described just above will show
that ln(n+1) - ln(n) is the area under the graph of y = 1/x from x=n to
x=(n+1), while [1/(n+1)] is the area of the rectangle with base 1 and
height [1/(n+1)] that lies completely under that section of the graph
(except for one point). Therefore, the difference b(n+1) - b(n) is
positive for n = 1, 2, 3, .... Therefore, {b(n)} is a strictly
increasing, bounded sequence of real numbers, and by the
completeness axiom of the real numbers, there exists a
limit to the sequence {b(n)}. So
[H(n)/ln(n)] = 1 + {[1-b(n)]/ln(n)} ====> 1 as n ====> infinity.
(Note we just used the fact that {b(n)} converges to get the conclusion
that
{[1-b(n)]/ln(n)} ====> 0.)
A little algebra shows that
[U(n)/ln(n)] = 2*(1 - {b(n)/ln(n)} + {1/[(n+1)*ln(n)]}) ===> 2 as n ===> infinity.
Thus U(n) and ln(n) are big-O of each other (in fact, asymptotically
proportional). We can see that for reasonably large values of n, U(n)
≈ 2*ln(n) is a very good approximation (we can make the error as small
as we like, as a percentage of ln(n), by choosing n sufficiently large.)
But ln(n) = ln(2) * log2(n), and ln(2) ≈ 0.693, which means
that 2*ln(2) ≈ 1.386, and that
U(n) ≈ 1.386*log2(n) ≈ 1.4*log2(n).
THAT is why it is said that search in a random BST is "only 40% worse" than
binary search in an ordered array. Finally, note that by Eqn 2
S(n) = (1+[1/n])*U(n) - 1,
so that it's clear that S(n)/U(n) ===> 1 as n ===> infinity, and that
S(n) ≈ 1.386*log2(n) is also a very good approximation for
large values of n.
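These approximations can be checked numerically from the closed forms above
(a sketch of my own; recall the approximation is asymptotic, so at n = 10^6
it is accurate only to within a few percent):

```python
import math

def harmonic(n):
    """H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

n = 10**6
U = 2.0 * (harmonic(n + 1) - 1.0)      # Eqn 7
S = (1.0 + 1.0 / n) * U - 1.0          # from Eqn 2
approx = 1.386 * math.log2(n)

ratio_U = U / approx                   # close to 1 for large n
ratio_S = S / U                        # also tends to 1 as n grows
```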