A Lecture on Abstraction of Data Structures and the Meaning of Big-O Complexity

def: a DATA TYPE is a COMBINATION of two sets:
  1. a set of values (like integer, or real, or char)
  2. a set of operations on those values (e.g. addition, mult, print, ord)

The OPERATIONS are inextricable from the notion of the DATA TYPE. It is wrong to think of a data type as just a certain set of values.

def: a STRUCTURED DATA TYPE, or DATA STRUCTURE, is a data type whose values
  1. can be decomposed into a set of component data elements, each of which is either atomic or another data structure (e.g. array of integers, set of pointers, array of arrays), and
  2. include a set of associations, or relationships (structure), involving the component elements (e.g. the random access structure of an array, the FIFO structure of a queue, the LIFO structure of a stack, the sequential access structure of a linked list).

This course is about DATA STRUCTURES -- obviously you need to know what they are. The definitions above are FUNDAMENTAL. You are expected to know them by heart.

The way we view things like pointers, sets, real numbers, integers, records, binary search trees, and so on, involves abstraction -- inside a computer, these structures really do not exist in the way we like to think of them. They are all implemented as a jumble of bits in the vast majority of modern computers. The way we see these things in our "mind's eye" is an abstraction -- our model of what these things are is taken from other areas, and we have found ways to IMPLEMENT our abstraction on the computer.

Def: an ABSTRACT data type, in words used in the first edition of our text, is "a data type that exists as a product of our imagination and concentrates on the essential properties of the data type, ignoring implementation constraints and details." The special SPECIFICATION modules found in our text are descriptions of ABSTRACT data structures and types.

Def: a VIRTUAL data type is one that exists on a virtual processor such as a programming language.

Def: a PHYSICAL data type is one that exists on a physical processor such as the machine level of a computer.

Our text puts forward a particular FORM for SPECIFYING AN ABSTRACT DATA TYPE or STRUCTURE. This is extremely important, because it gives us all a reliable standard to use when we refer to a given data type, and it allows us to learn about new data structures in such a way that we can understand them with precision. This FORM has the same value to the programmer that the conventions for making blueprints have to a builder.

SPECIFICATION OF A DATA STRUCTURE: To describe an abstract data structure is to provide adequate information in the following four areas:
  1. elements -- what are the component data types from which the data structure is constructed?
  2. structure -- how are the elements related to each other in the composite?
  3. domain -- what is the set of allowable structured values?
  4. operations -- the actions that can be carried out upon the data. The operations are important in defining the data -- you can not really be clear on the structure without knowing what operations exist.

Each operation is described as a "BLACK BOX". One describes what inputs are allowed, and what outputs result from any allowed input. But one does not describe "how" the operation accomplishes the task of furnishing the output. Not "how", just "what". "How" is up to the implementer of the specified data structure -- and may vary greatly from implementation to implementation.
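As an illustration of this form (a sketch of my own, not one of the text's specification modules), a simple stack of integers might be specified roughly as follows:

  Elements:    the component elements are values of type integer.
  Structure:   the elements form a linear sequence with LIFO access;
               only the most recently added element is directly accessible.
  Domain:      all finite sequences of integers, including the empty
               sequence, up to some maximum length.
  Operations:  Push(S, x)  -- adds integer x to stack S as its new top element.
               Pop(S, x)   -- removes the top element of S and returns it in x.
               Empty(S)    -- true exactly when S contains no elements.
               Full(S)     -- true exactly when S has reached its maximum length.

Notice that each operation is given as a black box: the description says what Push and Pop do, but nothing about arrays, pointers, or any other means of doing it.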
Our text uses the terms "requires" and "results" in the operation headers in the specification modules. The "requires" section is used to give input and precondition information about the operation, and the "results" section is for output and postcondition information.

PRECONDITIONS are assertions that are assumed to be true immediately prior to execution of the operation. (Preconditions are called "assumptions" elsewhere.) The outcome of execution is UNDEFINED by the specification in cases where one or more of the preconditions is not true. An IMPORTANT CONVENTION is that wherever possible there be an operation provided in the specification to TEST for every precondition. (Without such tests, the operation can not be used without the risk of failure of some kind due to the use of "bad input".)

POSTCONDITIONS are DESCRIPTIONS OF THE RESULTS of the actions performed by an operation whose preconditions have been met.

IMPLEMENTATION CONSIDERATIONS: We can IMPLEMENT abstract data types -- these "blueprints" -- by using the built-in data types of a programming language, or by directly using the physical data types of a given computer. The text contains many implementations of data structures, constructed as "PACKAGES" of Pascal code. The code includes type and variable declarations to implement ("represent") elements and, partly, structure. Operations are implemented with procedure and function declarations.
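To make the idea of a package concrete, here is a rough sketch (my own illustration, not one of the text's packages) of how the stack specified earlier might be implemented in Pascal. The "requires" and "results" information is recorded as comments, and the precondition-test operations Empty and Full are provided so that Push and Pop never have to be called on "bad input":

  program StackSketch (output);

  const
    MaxSize = 100;

  type
    Stack = record
              items : array [1..MaxSize] of integer;
              top   : 0..MaxSize
            end;

  var
    S : Stack;
    n : integer;

  procedure Clear (var S : Stack);
  { requires: nothing.  results: S is the empty stack. }
  begin
    S.top := 0
  end;

  function Empty (S : Stack) : boolean;
  { requires: nothing.  results: true exactly when S holds no elements. }
  begin
    Empty := (S.top = 0)
  end;

  function Full (S : Stack) : boolean;
  { requires: nothing.  results: true exactly when S holds MaxSize elements. }
  begin
    Full := (S.top = MaxSize)
  end;

  procedure Push (var S : Stack; x : integer);
  { requires: not Full(S).  results: x is the new top element of S. }
  begin
    S.top := S.top + 1;
    S.items[S.top] := x
  end;

  procedure Pop (var S : Stack; var x : integer);
  { requires: not Empty(S).  results: the old top element of S has been
    removed from S and returned in x. }
  begin
    x := S.items[S.top];
    S.top := S.top - 1
  end;

  begin  { a tiny usage example: test each precondition before relying on it }
    Clear(S);
    if not Full(S) then Push(S, 42);
    if not Empty(S) then Pop(S, n);
    writeln(n)   { prints 42 }
  end.

A different implementation -- say, a linked list of dynamically allocated nodes -- could keep exactly the same headers and the same "requires"/"results" comments while changing everything between begin and end.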
In many cases we discuss DIFFERENT implementations of a given data structure, pointing out the advantages and disadvantages of each one. An important consideration in this is the question of HOW EFFICIENT the operations turn out to be in the various implementations. And to evaluate efficiency, we need to know about the various measures used, including big-O analyses of (asymptotic) complexity.

THE BIG-O MEASURE OF COMPLEXITY: We talk about the big-O of mathematical FUNCTIONS, usually positive real-valued functions defined on the set of positive integers. That is because we are usually interested in functions that count the number of steps, or the amount of time, required to carry out a given algorithm on a data set of a given size. The size of the data set is always a positive integer, and the number of steps or amount of time is always a positive real number.

NOTATION: The set of positive integers is usually denoted by the symbol N. R denotes the set of real numbers -- the familiar set of numbers corresponding to the points on a line. R+ denotes the set of POSITIVE real numbers. A positive real-valued function defined on N is a correspondence or "rule" which ASSIGNS (pairs) A POSITIVE REAL NUMBER TO EACH POSITIVE INTEGER.

Take as an example the case of a program that sorts lists. The CORRESPONDENCE of a number to the average TIME required for the program to sort a list containing that number of items is a positive real-valued function on N. The various times are the positive real numbers assigned by the rule.

We often denote functions using lower case letters, particularly "f", "g", "h", and "k". A familiar short-hand which states that the function f is a positive real-valued function defined on N is "f: N --> R+". We customarily denote the assigned value of a function by using the notation f(s), where s is any symbol we are using to denote a positive integer. For example, if f is the function described above, then f(3) would stand for the average time required to sort a list of 3 items, and f(5) would stand for the average time needed for sorting a list of 5 items. And, in general, if "s" stands for a positive integer then "f(s)" stands for the average time required to sort a list of s items.

DEFINITION OF BIG-O: Suppose that f: N --> R+ and g: N --> R+. If for SOME positive constants m and C,

  (1)   f(n) < [C * g(n)], for EVERY positive integer n that is greater than m,

then we say that f is big-O of g (also written "f is O(g)").

For example, suppose that car F is capable of great speed, but takes quite a while to accelerate to its top speed. Suppose that car G is very quick to accelerate, but not capable of F's top speed. G may be able to beat F in some short races, but there must be a number of miles m such that F can beat G in any race of more than m miles. Let s be any positive integer. If f is the function that assigns the time f(s) that car F requires to race s miles, and if g is the function that assigns the time g(s) that car G requires to race s miles, then according to our definition of big-O, f is O(g). Since F's elapsed time is less than G's elapsed time in any race of more than m miles, f(n) < 1 * g(n) whenever n > m. So the conditions given in line (1) hold true, with the constant C having the value 1.

Now suppose that car G can not hold its top speed for the whole race, but can always manage to travel at a rate at least half as fast as car F is traveling at the same time. Then g is O(f), because g(s) < 2 * f(s) for all s > 1. It may seem odd, but two positive real-valued functions on N can each be big-O of the other, and such pairs of functions are considered roughly EQUIVALENT -- in mathematical parlance, "ASYMPTOTICALLY PROPORTIONATE".

f is big-O of g if and only if f(n) < [C * g(n)] whenever n > m, for some positive constants m and C. This is true if and only if f(n)/g(n) < C whenever n > m, and if and only if g(n)/f(n) > 1/C whenever n > m. On the other hand, if g is big-O of f, then for some positive constants K and r, g(a) < [K * f(a)] whenever a > r, and this is the same as saying g(a)/f(a) < K whenever a > r, which in turn means f(a)/g(a) > 1/K whenever a > r. So C > f(s)/g(s) > 1/K whenever s is greater than both m and r. If f(s)/g(s) were a constant, not depending at all on s, then we would say that f and g are proportionate. The last inequality above shows that the ratio of f(s) to g(s), while quite possibly not a constant, does become "trapped" between C and 1/K for large enough values of s. This phenomenon is called "asymptotic proportionality". We think of the two functions as being "roughly proportionate".

Computer scientists classify algorithms by how they compare in the "big-O" sense. Typically the functions f(n) = n, g(n) = n^2, k(n) = log(n), h(n) = n*log(n) are used as yardsticks to measure algorithms. If you know that a sorting algorithm requires big-O of n^2 steps to sort n items, then generally speaking you can say that it is inefficient. On the other hand, a sort that requires only big-O of n*log(n) steps is generally considered quite efficient. A more detailed analysis is needed to find out which method is better for any specific sorting task, but knowledge of the big-O information is almost always the starting point for such an analysis.
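A small worked example (mine, not the text's) of how the definition and the yardsticks are used: suppose a careful count shows that a particular algorithm performs f(n) = 3n + 5 steps on a data set of size n. Then f is O(g) for the yardstick g(n) = n, because

  3n + 5 < 4n   whenever n > 5,

so the conditions in line (1) hold with C = 4 and m = 5. To see why the yardsticks are ranked the way they are, compare them at n = 1000: n*log(n) is about 10,000 (using logarithms base 2), while n^2 is 1,000,000 -- which is why an O(n*log(n)) sort is generally considered efficient and an O(n^2) sort is not.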