Details About Select2

Details About The Linear-Time Selection Algorithm

(Follows the treatment in Horowitz and Sahni)

(* Theory of Algorithms *)

(* Algorithm to Find the kth Smallest Element of the Array A,
   Where A Has n Elements. *)

procedure SELECT2 (A,k,n);

BEGIN SELECT2

0) If n <= r then sort A with a simple sort algorithm and
      return(A[k]);
        (* r is a small integer, like 5,6, or 7.  Step 0 is O(1) *)

1)  Divide A into floor(n/r) subsets of size r each, and leave
      the remaining elements out of the subsets ;
        (* Step 1 is O(floor(n/r)).  *)

2)  Compute the floor(n/r) medians of the subsets with a simple
      sort algorithm, and put them in array M ;
        (* Step 2 is O(floor(n/r)). *)

3)  Medmed := SELECT2( M, ceiling(floor(n/r)/2), floor(n/r) ) ;

        (* i.e. Medmed is the median of the medians in M -- note
           that the size of the list is about n/r, and the
           element we are selecting is about the n/2r'th *)

        (* Step 3 is O(T(floor(n/r))) -- where T(x) is the
           cumulative worst case function for SELECT2.  In
           other words T(x) = the most work that can be done in
           a SELECT2 performed on a set with x or fewer
           elements (note: the "or fewer" guarantees that T is
           non-decreasing, which makes some of the analysis of
           this algorithm simpler to do). *)

4)  Use a quicksort style PARTITION algorithm to partition A
      into 3 sets:  S := (elements of A smaller than medmed);
      E := (elements of A equal to medmed); L := (elements of A
      larger than medmed) ;

      (* Now A is "partially sorted" so that elements of S come
         first, then E, then L.  In other words A = S|E|L.
         Step 4 is O(n). *)

5)  Let j be the position of the first element of E within A and
      let i be the position of the last element of E within A.

      (* Step 5 is O(n) -- it can be done as part of step 4 *)

6)  case

      j <= k <= i: return(medmed) ;         (* O(1) step *)
      k < j:  return(SELECT2(S,k,j-1));     (* O(T(j-1)) step *)
      k > i:  return(SELECT2(L,k-i,n-i));   (* O(T(n-i)) step *)
    end case;

END SELECT2
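
The listing above can be sketched in Python as follows.  This is
a minimal sketch under stated assumptions, not the textbook's
code: it builds S, E, and L as new lists rather than partitioning
A in place, it takes the lower of the two middle elements when r
is even, and k is 1-based as in the listing.  The name select2
and the default r = 5 are choices made for this illustration.

```python
def select2(a, k, r=5):
    """Return the k-th smallest element of list a (k is 1-based)."""
    n = len(a)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(a)")
    # Step 0: small inputs are sorted directly.
    if n <= r:
        return sorted(a)[k - 1]
    # Steps 1-2: medians of the floor(n/r) full groups; the
    # left-over elements are not placed in any group.
    s = n // r
    medians = []
    for g in range(s):
        group = sorted(a[g * r:(g + 1) * r])
        medians.append(group[(r - 1) // 2])
    # Step 3: median of the medians, i.e. the ceiling(s/2)-th smallest.
    medmed = select2(medians, (s + 1) // 2, r)
    # Step 4: three-way partition around medmed.
    smaller = [x for x in a if x < medmed]
    equal = [x for x in a if x == medmed]
    larger = [x for x in a if x > medmed]
    # Step 5: 1-based positions of the first and last element of E
    # within the arrangement S|E|L.
    j = len(smaller) + 1
    i = len(smaller) + len(equal)
    # Step 6: recurse only into the part that contains position k.
    if k < j:
        return select2(smaller, k, r)
    if k > i:
        return select2(larger, k - i, r)
    return medmed
```

For example, select2([7, 1, 5, 3, 9, 2, 8], 4) returns 5: the
single full group {7, 1, 5, 3, 9} has median 5, so medmed = 5,
S has three elements, and k = 4 falls inside E.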

-----------------------------------------------------------------------
DISCUSSION OF SELECT2:

How large can the sets S and L be?  The answer to this question
has a direct bearing on how efficient SELECT2 is.

We divided A into s groups G(1), G(2), ... G(s) of size r,
where s = floor(n/r), possibly with a few elements left over if
n is not divisible by r.  The number of left-over elements is
less than r and is equal to:

n - r*s = n - r*floor(n/r)

Consider the elements of M = {m(1), m(2), ... , m(s)}.  The
m(i) are the medians of the elements of the groups G(i).

If r is odd then the groups G(i) "look" like this:

  #  #  #  #  #  #  #
           ^
           ^
           m(i)

We can't really say that m(i) is GREATER than, say, half the
elements of G(i) because we don't know how many times values
are duplicated in G(i).  However m(i) is greater than or equal
to (r+1)/2 of the elements of G(i).  (counting m(i).)
Similarly, m(i) is less than or equal to (r+1)/2 of the
elements of G(i).  

If r is even then the G(i) "look" this way:

  #  #  #  #  #  #                  #  #  #  #  #  #
           ^                              ^
           ^               OR             ^
           m(i)                           m(i)

In either case above, we can say m(i) is less than or equal to
at least r/2 of the elements of G(i) and m(i) is greater than
or equal to at least r/2 of the elements of G(i).

So whether r is odd or even, m(i) is greater than or equal to
at least ceiling(r/2) of the elements of G(i), and m(i) is less
than or equal to at least ceiling(r/2) of the elements of G(i).

Similar considerations tell us that medmed, which is the median
element of M has the property that medmed is greater than or
equal to at least ceiling(s/2) of the elements of M and medmed
is less than or equal to at least ceiling(s/2) of the elements
of M.  Here s is the number of elements of M.

           G1   G2   G3   G4   G5 |  G6   G7   G8   G9   Think
in                                                       of each
this       #    #    #    #    #  |  #    #    #    #    Gi    
corner,                           |                      increasing
elts       #    #    #    #    #  |  #    #    #    #    in this
known                             |                      direction.
to be      #    #    #    #    #  |  #    #    #    #      |
<= mm                        -----|-------------------     |
           m1   m2   m3   m4 | mm | m5   m6   m7   m8      V
          -------------------------
           #    #    #    #  | #    #    #    #    #     
                             |                           
           #    #    #    #  | #    #    #    #    #   in this corner,
                             |                         elts known to
           #    #    #    #  | #    #    #    #    #   be >= mm

Think of the medians increasing in this direction ----->

To summarize, medmed is greater than or equal to at least
ceiling(s/2) of the elements of M, each of which is greater
than or equal to at least ceiling(r/2) of the elements of the
group it belongs to.  This means that medmed is greater than
or equal to at least ceiling(s/2)*ceiling(r/2) of the elements
of A.

Similarly medmed is less than or equal to at least
ceiling(s/2)*ceiling(r/2) of the elements of A.

Let Q = ceiling(s/2)*ceiling(r/2).

Note that s = floor(n/r), so ceiling(s/2) is about n/2r and Q is
about n/4.  What we have shown (with slightly more precision) is
that medmed is greater than or equal to at least about 1/4 of
the elements of A, and less than or equal to at least about 1/4
of the elements of A.

Continuing on with this, the fact that medmed is GREATER THAN
OR EQUAL TO Q OR MORE elements of A implies that medmed is LESS
THAN NO MORE THAN n-Q elements of A.  Thus the size of the set
L is no more than n-Q.

  SSSSS EEE LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL
  
In the figure, L can't be as big as is indicated, because then
the union of S and E would have size smaller than Q.  Thus
medmed would be greater than or equal to fewer than Q elements.

Similarly, since medmed is less than or equal to at least Q
elements of A, there are no more than n-Q elements in S.

Therefore, step 6 in SELECT2 requires no more than T(n-Q) work,
where T is the cumulative worst-case work function for SELECT2
defined in the comment after step 3 of the algorithm listing
above.
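
The claim that neither S nor L can have more than n-Q elements
can be spot-checked on random inputs.  The sketch below is only
an illustration: the function names are hypothetical, the median
of medians is found by sorting M rather than by a recursive
call, and the lower middle element is used when r is even.  It
assumes n > r so that there is at least one full group.

```python
import random

def partition_sizes(a, r=5):
    """Return (n, Q, |S|, |L|) for one pass of steps 1-4 on list a."""
    n = len(a)
    s = n // r
    # Group medians; left-over elements are ignored, as in step 1.
    medians = sorted(sorted(a[g * r:(g + 1) * r])[(r - 1) // 2]
                     for g in range(s))
    # Median of medians, obtained here by sorting instead of recursion.
    medmed = medians[(s + 1) // 2 - 1]
    q = ((s + 1) // 2) * ((r + 1) // 2)   # ceiling(s/2) * ceiling(r/2)
    size_s = sum(1 for x in a if x < medmed)
    size_l = sum(1 for x in a if x > medmed)
    return n, q, size_s, size_l

def bound_holds_on_random_inputs(trials=200, r=5, seed=0):
    """Check |S| <= n-Q and |L| <= n-Q on random lists with duplicates."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randrange(r + 1, 300)
        a = [rng.randrange(40) for _ in range(n)]  # small range forces ties
        n, q, size_s, size_l = partition_sizes(a, r)
        if size_s > n - q or size_l > n - q:
            return False
    return True
```

On the list 1, 2, ..., 25 with r = 5 the medians are 3, 8, 13,
18, 23, so medmed = 13, Q = 9, and S and L have 12 elements
each, comfortably below n-Q = 16.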

Steps 0, 1, 2, 4, and 5 are all O(n) in terms of the work
done.  Step 3 requires no more work than T(floor(n/r)), and
step 6 requires no more than T(n-Q) work.  Thus

T(n) <= T(n/r) + T(n-Q) + C*n,

where C is some constant reflecting the upper bound to the
amount of work that is required by steps 1, 2, 4, and 5.

How big is n-Q?

Q = ceiling(r/2) * ceiling(s/2) where s = floor(n/r).

Since division by r leaves a remainder of at most r-1,

s = floor(n/r) >= [n/r] - [(r-1)/r] = (n-r+1)/r.

Thus ceiling(s/2) >= s/2 >= (n-r+1)/(2r),

and so Q = ceiling(r/2) * ceiling(s/2)
                  >= (r/2)[(n-r+1)/(2r)] = (n-r+1)/4.

Consequently n-Q <= n - (n-r+1)/4 = (3n+r-1)/4.
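
The bound n-Q <= (3n+r-1)/4 can be confirmed numerically.  The
sketch below multiplies both sides by 4 so that only integer
arithmetic is needed; the function names are hypothetical.

```python
def q_value(n, r):
    """Q = ceiling(s/2) * ceiling(r/2) with s = floor(n/r)."""
    s = n // r
    return ((s + 1) // 2) * ((r + 1) // 2)

def bound_holds(n, r):
    """True when n - Q <= (3n + r - 1)/4, tested as 4(n-Q) <= 3n + r - 1."""
    return 4 * (n - q_value(n, r)) <= 3 * n + r - 1
```

The bound is attained, so it cannot be improved in general: for
r = 6 and n = 17 we get s = 2, Q = 3, and n-Q = 14 = (3*17+5)/4.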

(Note:  The smaller the value of r, the smaller is this bound.
        This stems from the fact that there are potentially
        fewer "left-over" elements after dividing A into
        subsets of size r.  Thus 3/4 of the diagram above more
        closely approximates the set of elements of A that
        could possibly be larger than mm.  As we shall see, the
        proof that the algorithm is O(n) requires r >= 5.  Thus
        r=5 is the value, among those the proof allows, that
        gives the lowest worst-case bound for the recursive
        call in step 6.  However, for the recursive call in
        step 3, the higher the value of r the better.  On the
        face of it, it is better to worry about optimizing
        step 6, since in any case the size of the problem
        passed to SELECT2 in step 3 is quite a bit smaller
        than the one that will be passed, on average or at
        worst, in step 6.)
        
############
Consider this:

If for some positive integer K we can show that

(3n+r-1)/4  <= [1 - (1/r) - (1/K)] * n

for all sufficiently large n, then we can prove that T(n) is
O(n) in the following way:

Since n-Q <= (3n+r-1)/4 <= [1 - (1/r) - (1/K)] * n for all n>m

where m is some positive integer, we have

T(n) <= T(n/r) + T([1 - (1/r) - (1/K)]n) + C*n, for all n>m.

Now choose a constant D >= C such that T(n) <= Dn for n<=m.
Then

T(n) <= T(n/r) + T([1 - (1/r) - (1/K)] * n) + D*n, for all n>=1.

We CLAIM that T(n) <= KDn for all n>=1.

PROOF:  The base case is established by the choice of D.  If
the result is true for n = 1,2, ..., q-1 then since

T(q) <= T(q/r) + T([1 - (1/r) - (1/K)] * q) + D*q,

we get

T(q) <= KDq/r + KDq[1-(1/r)-(1/K)] + Dq

by invoking the inductive hypothesis.  Thus

T(q) <= KDq[(1/r) + 1 - (1/r) - (1/K)] + Dq

        = KDq[1-(1/K)] + Dq
        = Dq[K-1] + Dq = KDq
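
The induction can be illustrated by evaluating the recurrence
with equality in place of "<=" and comparing the result with
KDn.  The sketch below is an assumption-laden illustration, not
part of the proof: make_recurrence is a hypothetical name, the
arguments n/r and [1-(1/r)-(1/K)]n are rounded down to integers,
and T(n) is set to D*n for n <= m to play the role of the base
case.

```python
from functools import lru_cache

def make_recurrence(r, K, D=1, m=1):
    """Return T with T(n) = T(n/r) + T([1 - 1/r - 1/K] n) + D*n for n > m."""
    # 1 - 1/r - 1/K written as the fraction num/den.
    num, den = r * K - K - r, r * K

    @lru_cache(maxsize=None)
    def T(n):
        if n <= m:
            return D * n          # base case: T(n) <= D*n by choice of D
        # Both arguments are rounded down, so each is smaller than n.
        return T(n // r) + T(n * num // den) + D * n

    return T
```

With r = 5, K = 21, and D = 1, calling T = make_recurrence(5, 21)
and evaluating T(n) for n = 1, 2, 3, ... gives values that stay
at or below 21n, as the claim predicts.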

############
----------------------------------------------------------
|  COROLLARY:  SELECT2 is O(n) for r = 5, 6, 7, ...      |
----------------------------------------------------------
PROOF:  The inequality

(3n+r-1)/4 <= [1 - (1/r) - (1/K)]n

is true if and only if

(r-1) <= 4n[1 - (1/r) - (1/K)] - 3n

if and only if

 (*)  (r-1) <= n - (4n/r) - (4n/K) = n[1-(4/r)-(4/K)].

When r > 4 and 4/K is sufficiently small that

[1-(4/r)-(4/K)] > 0,

the inequality (*) above is preserved by dividing both sides by

[1-(4/r)-(4/K)].

Thus

(3n+r-1)/4 <= [1 - (1/r) - (1/K)]n

is true when r > 4, K is sufficiently large, and

n >= (r-1)/[1-(4/r)-(4/K)]

NOTE:  r > 4 ==> r >= 5 ==> (4/r) <= (4/5) ==> 1-(4/r) >= (1/5)

Therefore a value of K such that (4/K) < (1/5) will suffice for
all values of r > 4.  Thus any K > 20 will do.  (We find below
that K=20 works for r=5).  Taking K = 21, the corresponding
value of n is

(r-1)/[1-(4/r)-(4/21)] = (21/17)(r-1)/[1-84/(17r)],

which is asymptotic to (21/17)r, or about (5r/4).

Here are some values of n versus r calculated with a Pascal
program called selFact:

For r =   5 n must be at least 421.
For r =   6 n must be at least 35.
For r =   7 n must be at least 26.
For r =   8 n must be at least 23.
For r =   9 n must be at least 22.
For r =  10 n must be at least 22.
For r =  11 n must be at least 23.
For r =  12 n must be at least 24.
For r =  13 n must be at least 24.
For r =  14 n must be at least 25.
For r =  15 n must be at least 26.
For r =  16 n must be at least 27.
For r =  17 n must be at least 28.
For r =  18 n must be at least 29.
For r =  19 n must be at least 31.
For r =  20 n must be at least 32.
.
.
.
For r = 100 n must be at least 129.
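
The source of selFact is not shown here, so the following is
only a reconstruction in Python of what such a program might
compute.  It assumes K = 21, which is inferred from the factor
21/17 in the formula above, and it uses exact rational
arithmetic; min_n is a hypothetical name.

```python
from fractions import Fraction
from math import ceil

def min_n(r, K=21):
    """Smallest integer n with n >= (r-1)/[1 - 4/r - 4/K], computed exactly."""
    denom = 1 - Fraction(4, r) - Fraction(4, K)
    if denom <= 0:
        raise ValueError("need 1 - (4/r) - (4/K) > 0")
    return ceil(Fraction(r - 1) / denom)
```

This agrees with the table for r = 6 through r = 20 and for
r = 100.  For r = 5 the exact quotient is 4/(1/105) = 420, and
(3n+r-1)/4 = [1-(1/r)-(1/K)]n holds with equality at n = 420,
whereas the table lists 421.  One possible explanation is
rounding in floating-point arithmetic, but that is a guess.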


          
-----------------------------------------------------------------------
SPECIALIZING THE DISCUSSION OF SELECT2:

Suppose r = 5.

Then ceiling(r/2) is 3.  Also ceiling(s/2) >= s/2, so

Q = ceiling(r/2) * ceiling(s/2) >= 3*s/2 = (1.5)*s

Thus n-Q <= n - (1.5)*s.

How much are we taking away when we subtract (1.5)*s above?
Consider that s = floor(n/5).  When you divide any integer by
5, the result has a decimal part of 0, 0.2, 0.4, 0.6, or 0.8.
Therefore

floor(n/5) >= n/5 - 0.8, so

(1.5)*s >= (0.3)*n - 1.2, so

n - (1.5)*s <= n - (0.3)*n + 1.2 = (0.7)*n + 1.2.

Now for sufficiently large n,

(0.7)*n + 1.2 <= (0.75)*n = 3n/4

(n >= 24 is sufficiently large.)  So we get:

T(n) <= T(n/5) + T(3n/4) + C*n, for n >= 24,

and by enlarging C if need be so that T(n) <= C*n for n <= 24,
we can say that

T(n) <= T(n/5) + T(3n/4) + C*n  for ALL n >= 0.
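
The arithmetic behind the threshold n >= 24 can be checked
directly.  The sketch below tests the chain

n - (1.5)*s <= (0.7)*n + 1.2 <= (0.75)*n

with every term multiplied by 20 so that only integers appear;
the function name is hypothetical.

```python
def r5_chain_holds(n):
    """True when n - 1.5*s <= 0.7*n + 1.2 <= 0.75*n, where s = floor(n/5)."""
    s = n // 5
    left = 20 * n - 30 * s      # 20 * (n - 1.5*s)
    middle = 14 * n + 24        # 20 * (0.7*n + 1.2)
    right = 15 * n              # 20 * (0.75*n)
    return left <= middle <= right
```

The first inequality holds for every n, and the second holds
exactly when n >= 24, so r5_chain_holds(23) is False while
r5_chain_holds(24) is True.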

Now using the HINT that T(n) <= 20*C*n, we can easily prove it
by induction.  The fact that C is chosen so that T(n) <= C*n
for n <= 24 gives us the base case of n=1 (and 23 more).  Now
assume that the bound of 20*C*n holds for n = 1, 2, ..., k-1.
Then

T(k) <= T(k/5) + T(3k/4) + C*k
     <= 20Ck/5        + 20C(3k/4)    + Ck
     =  4Ck           + 15Ck         + Ck
     =  20Ck

This proves that SELECT2 is an O(n) algorithm when r = 5.