Notes for Data Structures + Algorithms: CS 3100

Soon all students will have accounts and we'll go to the lab to perform the Hello World Assignment.

Section 5.1 Topics - Basics of Grammars & Simple Languages

One way to look at the idea of a language is that it consists of a bunch of strings that are "legal" according to some set of rules.

A set of rules that tells which strings are "legal" for a language is a GRAMMAR for the language.

It can be challenging to write down a CORRECT grammar for a language.

Many kinds of languages have a structure that is recursive, and creating a correct grammar may be made easier by using recursive structure in the grammar rules. Also, if we want to write a program (a recognition algorithm) that tests a string to see if it is legal in the language, the task may be made easier if we use recursion in the program.

XY means X followed by Y

<word> means any instance of word, where word is a symbol that must be defined elsewhere in the grammar

<identifier> = <letter> | <identifier><letter> | <identifier><digit>

<letter> = a | b | ... | z | A | B | ... | Z | _

<digit> = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

The rules say that an identifier is either a letter, or an identifier followed by a letter, or an identifier followed by a digit.

Notice the recursive nature of the definition.

After studying the grammar for a while, one may notice that if a string is empty it's not an identifier, and if a string consists of just one character, then it is an identifier if and only if it is a letter.

The observations above can be used as base cases for a recursive recognition algorithm.

If the length of the string is more than one, then it is an identifier if and only if the last character is a letter OR digit, and the initial characters are an identifier.

This suggests the basic idea of the algorithm at the bottom of page 161. Notice the recursive nature of that algorithm.

The algorithm goes basically as given below in pseudo-C++.

Next let's trace the way the algorithm works on a few example input strings: r, 2, r2, 2D, r2D. Does the following sample program implement the algorithm and work correctly?

    bool isAnIdentifier (string s)
    {
        if (length(s) < 1)
            return false ;                  // base case
        else if (length(s) == 1)            // another base case
        {
            if (s is a letter)
                return true ;
            else
                return false ;
        }
        else if (the last character of s is NOT a letter or a digit)
            return false ;
        else
            return isAnIdentifier(s minus its last character) ;  // the recursive call
    }

    #include <iostream>
    #include <string>
    using namespace std;

    bool isAnIdentifier (string str) ;
    bool isLetter (char c) ;
    bool isDigit (char p) ;

    int main ()
    {
        string testStr;
        cout << endl << endl << "The Identifier Checker is running." ;
        cout << endl << endl ;
        while (cin >> testStr)
        {
            cout << testStr << endl;
            if (isAnIdentifier(testStr))
                cout << "The string IS an identifier." << endl ;
            else
                cout << "The string IS NOT an identifier." << endl ;
        }
        cout << endl << endl << "The Identifier Checker Thanks You!" ;
        cout << endl << endl ;
        return 0 ;
    }

    /* The grammar:
       <identifier> = <letter> | <identifier><letter> | <identifier><digit>
       <letter> = a | b | ... | z | A | B | ... | Z | _
       <digit> = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    */

    bool isAnIdentifier (string str)
    {
        int len = str.length() ;
        if (len < 1)
            return false ;
        else if (len == 1)
        {
            if ( isLetter (str[0]) )
                return true ;
            else
                return false ;
        }
        else if ( !isLetter(str[len-1]) && !isDigit(str[len-1]) )
            return false ;
        else
            return isAnIdentifier (str.substr(0,len-1)) ;
    }

    bool isLetter (char c)
    {
        if ( ( ('a' <= c) && (c <= 'z') ) || ( ('A' <= c) && (c <= 'Z') ) || ( c == '_') )
            return true ;
        else
            return false;
    }

    bool isDigit (char p)
    {
        if ( ('0' <= p) && (p <= '9') )
            return true ;
        else
            return false;
    }

There's a similar example for strings of the form A^nB^n (n A's followed by n B's).

A grammar for PALINDROMES:

    <pal> = empty string | <ch> | a<pal>a | b<pal>b | ... | Z<pal>Z
    <ch>  = a | b | ... | z | A | B | ... | Z

The text discusses on page 163 how to construct a recognition algorithm based on this grammar.

Then follows Question 1, asking us to write a similar grammar.

Try working the problems on page 170 to get a feeling for how to work recursively.

Section 5.2.2 ... introduces the ideas of postfix and prefix expressions instead of infix. They have the advantage of being unambiguous without any need for parentheses, precedence rules, or associativity rules. If an infix expression is fully parenthesized, there is a simple algorithm for converting it to either postfix or prefix form. It's possible to write very simple programs that recognize and/or find the value of either postfix or prefix expressions. On the other hand, it's harder to write code that recognizes or evaluates infix expressions.

Below is a grammar for prefix expressions. To make it simpler, only single lower-case letters are allowed as identifiers, and literal numerical values are not allowed.

A grammar for PREFIX expressions:

    <prefix> = <identifier> | <operator><prefix><prefix>
    <operator> = + | - | * | /
    <identifier> = a | b | c | ... | z

When we write an algorithm to look for the form

    <operator><prefix><prefix>

we have a technical problem of figuring out where the boundary should be between the first and second prefix (sub)expressions. For example, if our algorithm is checking something like

    +-+xy/cd*ab

to see if it is a prefix expression of the form "+prefix1prefix2", what division point should it use as the boundary between "prefix1" and "prefix2"?

There's an easy base case for this problem. If we are testing something like *ab to see if it is a prefix expression, then it's obvious that if anything will work, it's going to involve the 'a' being "prefix1". It's not super easy to see exactly why the following idea works, but we can leverage the easy base case into a recursive algorithm that finds the end of "prefix1", no matter how complicated it is. For example, the following algorithm will figure out that "prefix1" for

    +-+xy/cd*ab

is

    -+xy/cd

The algorithm inputs a string and an integer representing a first position. If the string contains a prefix expression that starts at the first position, the algorithm returns the position of the end of that prefix expression. Otherwise the algorithm returns -1 as a failure code.

    algorithm name: endPre
    inputs: string strExp, and integer first
    outputs: integer
    {
        if first < 0, return (-1) ;       // no place to start the expression
        let integer last = the index of the last character in strExp
        if first > last, return (-1)      // no characters for the expression
        let ch = the character at position first of strExp
        if (ch is an identifier)
            return (first)                // base case
        else if (ch is NOT an operator)
            return (-1)                   // there's no expression
        else
        {
            /* recursively find the end of the first prefix expression
               after the operator ch, if it's there. */
            let firstEnd = endPre (strExp, first+1)
            if (firstEnd == -1)           // If there was no expression
                return (-1)
            /* Otherwise return the position of the end of the second
               expression after ch, if it exists, or the failure
               code (-1) if it does not exist. */
            else
                return endPre(strExp, firstEnd+1)
        }
    }

Once we have the algorithm above, we can write a very simple algorithm, using endPre, to input a string and return true or false, depending on whether the string is a prefix expression.

    algorithm name: isPrefix
    inputs: string strExp
    outputs: boolean
    {
        let lastChar = endPre(strExp, 0)
        if ( (lastChar >= 0) && (lastChar == strExp.length()-1) )
            return (true)
        else
            return (false)
    }

What is the answer to Question 5 on page 171?

The student should read section 5.3 as an introduction to ideas of backtracking that will be studied in more depth later. There's no need to study this section for "mastery". Read it to get an appreciation of the methodology.

There's a close relationship between recursion and mathematical induction. People often use mathematical induction proofs to demonstrate that a recursive algorithm really works. Also, people can use mathematical induction to prove how much or how little work an algorithm does.

With mathematical induction, one starts out, for instance, with the problem of proving a series of statements - one statement P(n) for each number n = 1, 2, 3, ... , where the list of numbers n is the infinite list of natural counting numbers. The proof technique is to give a proof that P(1) is true, and then to prove the general result that P(n) --> P(n+1) for all values of n >= 1. In other words, the second part is to give a proof that whenever P(n) is true, P(n+1) must also be true.

With a simple recursive algorithm, we write code for one or more base cases, which is like making the proof for P(1), and then we have another part of the algorithm that completes the solution of a problem of size n+1 by calling the algorithm recursively to solve a problem of size n (or less).

The discussion in the last two paragraphs should give you an idea of the relationship between recursion and mathematical induction.

Read section 5.4 and try to follow the logic of the example proofs. Ask me about anything that seems difficult to understand. Keep in mind that if you have a question, probably a lot of the other students have the same question, so it might be good to ask the question in class, if it's the kind of thing I'm likely to be able to answer there.

Look at the sample program earlier in these notes - the one that checks strings to see if they are legal C++ identifiers. You can solve the problem given in the programming assignment with a program that is VERY similar to that one.

To do that, the central thing will be to figure out what to use in the new program instead of the function isAnIdentifier. You'll need a similar function - one that tests to see if the string conforms to this grammar:

<expr> = S | I(C)T(<expr>)

The new function will need a NEW NAME, one that helps everyone who looks at the function remember what the purpose of the function is, which is to test a string to see if it is an expression or not. (Don't use the old name - that would be very bad style.)

The new function should work with the new grammar. The first part of the grammar is the base case. If the string is literally an upper-case 'S', then the function should return true. If the string is empty, or it's a single character that is NOT an 'S', then the function should return false.

If the string is MORE than one character, then the only way it can be an expression is if it fits the second pattern in the grammar. So for that case the function has to check whether the string is long enough to contain the initial "I(C)T(", the final ")", and at least one character between the "(" and the ")". If there are not enough characters for that, the function should return false.

Still thinking about the case where the second grammar rule applies: if the string is long enough, then to have a chance of being an expression it needs to start with "I(C)T(" and end with ")". The function should use str.substr to compare substrings of str to check this. If str doesn't contain those pieces, then the function should return false.

Finally, if the function has verified that str starts with "I(C)T(" and ends with ")", then it should make a recursive call to see if the substring of str that is between the "(" and the ")" is an expression. In that case str is an expression if and only if that substring is an expression.

We'll be concerned with analyzing the amount of work that algorithms perform.

If we know, for example, how many operations an algorithm will execute, on a problem of size N, in the worst case, then when we learn how fast our computer can execute the operations, we can calculate approximately how much CPU time the program may require.

Below is a version of the code for findIndexofLargest, the helper function used by selection sort.

The code for the function searches from the beginning of the list, down to the position numElts-1. It does one comparison for each value of indexNow from 1 to numElts-1. Therefore, each call to findIndexofLargest performs numElts-1 comparisons of list items.

    int findIndexofLargest (const ItemType theArray[ ], int numElts)
    {
        int indexSoFar = 0;    // Index of largest entry found so far
        for (int indexNow = 1; indexNow < numElts; indexNow++)
        {
            if (theArray[indexNow] > theArray[indexSoFar])
                indexSoFar = indexNow;
        }
        return indexSoFar;     // Index of largest entry
    }

Next below is a version of the code for selection sort. The purpose of selectionSort is to sort a list of items into ascending order. The input is the list, represented as an array of items and a size value. The output is the list, now sorted.

Each repetition of the for-loop body executes one pass of the selection sort. In the first pass, lastPstn is listSize-1, which is the index of the last item in the list.

    void selectionSort (ItemType theArray[ ], int listSize)
    {
        /* lastPstn is the index of the last item in the sub-array
           of items that are not sorted yet. */
        int lastPstn, maxLoc ;
        for (lastPstn = listSize-1; lastPstn >= 1; lastPstn--)
        {
            maxLoc = findIndexofLargest (theArray, lastPstn + 1);
            std::swap (theArray[maxLoc], theArray[lastPstn]);
        }
    }

In the first pass, selectionSort calls the helper function findIndexofLargest(). The helper function searches through the entire list, finds the index of the maximum element of the list, and returns it. (If there happen to be two or more list items with the maximum value, the helper function returns the index of the first one found.) SelectionSort stores the returned value in the variable maxLoc, and then interchanges whatever is in position lastPstn in the list with the maximum item that is in position maxLoc of the list. At the end of the first pass, the maximum item in the list has been moved to the last position in the list, which is where it is supposed to be when the list is sorted. How many comparisons of list items are done by findIndexofLargest() in the first pass? Since the parameter to the call to findIndexofLargest() is lastPstn + 1, the number of comparisons is the value of lastPstn when the parameter is passed, namely listSize-1.

If the length of the list is more than two, the for-loop does another pass with the control variable lastPstn now equal to listSize-2. So now the unsorted portion of the list is assumed to range from index 0 to index listSize-2. The second pass works like the first, except that the helper function searches only the unsorted part of the list, and returns an index where the maximum of that part of the list is located. So the second swap - the one that happens in the second pass, moves the second largest element of the list into the second to the last position of the list. How many comparisons of list items are done by findIndexofLargest() in the second pass? Since the value of lastPstn in the second pass is listSize-2, that is the number of such comparisons in the second pass.

The sort continues on like this, doing pass after pass, each time with the unsorted part of the list smaller by one slot. The last pass happens when the unsorted part of the list contains only two elements. After that pass, the list is sure to be in sorted order.

The total number of list comparisons done by the sort is

(listSize-1) + (listSize-2) + ... + 1.

There's a standard "math trick" - the formula for the sum 1 + 2 + ... + m, which equals m(m+1)/2 - that tells us how much that is. Taking m = listSize-1:

(listSize)(listSize-1)/2 == (1/2)(listSize^2 - listSize)

The point is: the number of list item comparisons done by selectionSort is (1/2)(listSize^2 - listSize).

It's roughly proportional to the square of the size of the list.

Now, that's not the most precise way computer scientists have of saying that, but what's important now is that you get the idea.

P.S. Computer scientists would express the idea of "roughly proportional" by saying that the number of list item comparisons is Θ(N^2), where N is the size of the list.

If the size of a list is N = 2^k for some positive integer k, then mergeSort splits the list in half over and over, producing levels of merging numbered 0 (the top level, where the final merge happens) through k-1.

How many list comparisons and moves are performed at each level of the sort? There is a merge that assembles the two sorted halves at level 0 into the final sorted list. With a bit of thought you can see that at most N - 1 (= 2^k - 1) comparisons are required for that merge, along with 2*N moves (each item is moved into a temporary array and then moved back again).

At level 1, there are two pairs of lists of size 2^(k-2) to merge. Each merge requires at most 2^(k-1) - 1 comparisons, so the level requires at most N - 2 comparisons, and again 2*N moves in all.

At level 2, there are four pairs of lists of size 2^(k-3) to merge. Each merge requires at most 2^(k-2) - 1 comparisons, so the level requires at most N - 4 = N - 2^2 comparisons, and again 2*N moves in all.

So counting up all the moves that have to be done, we get 2*N moves for each level, and the levels are 0, 1, 2, ..., k-1. So that's k levels, and a total of 2*N*k moves, which is 2*N*log2(N) moves.

For the comparisons, in the worst case, here's the sum:

(N-1) + (N-2) + (N-2^2) + ... + (N-2^(k-1))

There's one term above for each level where merging is performed. There are k such levels, levels 0 through k-1. So the sum of all the terms is k*N - (1 + 2 + 2^2 + ... + 2^(k-1)). There's a handy formula for geometric sums:

1 + x + x^2 + ... + x^(k-1) = (x^k - 1)/(x - 1)

When x = 2, (x - 1) is just 1, so we get

(1 + 2 + 2^2 + ... + 2^(k-1)) = 2^k - 1 = N - 1.

So, in the worst case, the total number of comparisons done by merge sort is k*N - (N - 1) = k*N - N + 1, which equals N*log2(N) - N + 1.

So we see that the amount of data moves and data comparisons performed by mergeSort, when sorting a list of size N, are both O(N*log2(N)).

The story is not really any different for the case when N is NOT a power of two, but the details of the analysis are more challenging, and I don't want to take the time to present them all.

To sum up, no matter what the size, N, of the list, mergeSort performs O(N*log2(N)) data moves and data comparisons.

The idea of quicksort is to sort a list in this manner: Start with a list of N elements, then

- Pick an element of the list to use as a *pivot*,
- Doing O(N) work, *partition* the list into a first segment containing only elements less than or equal to the pivot, a second segment containing the pivot, and a third segment containing only elements greater than or equal to the pivot, and
- Sort the first and third segments (using quicksort).

The challenging part of implementing the sort is the partition algorithm. There are different approaches. The idea presented in the text is to use two indices to scan the list. One scans from the left, the other from the right. Scanning from the left, you stop when you come to an item >= the pivot. Scanning from the right, you stop when you come to an item <= the pivot. If the indices have met or crossed, you stop the scan, else swap the two items and continue the scan. There can be a final step where the pivot is swapped into position. The basic idea of this partitioning algorithm is easy to understand, but it can be tricky to code it correctly.

The algorithm in the text begins the partition process by choosing the pivot and placing items at the ends of the list that serve as 'sentinels' that prevent either of the scans from running off the end of the list.

The quicksort of the text has a further optimization - it uses insertion sort on small sublists (below some size MIN_SIZE) to avoid the overhead of quicksort on small sublists.

If you consider the partitioning algorithm, it's easy to see it will require m+1 comparisons on a sublist of size m, including the three comparisons required to perform the "first, middle, last" sort.

The best case for quicksort occurs when the value of the pivot is a median value that causes the partition process to split the list into 'halves' of equal size. In that case, the sizes of all the sublists shrink quickly through the levels of recursion, the number of levels of recursion is O(log2(N)), and the total amount of work is O(N*log2(N)).

The big disadvantage of quicksort is its worst case behavior. If the pivot happens to always be close to the minimum or maximum value in the sublist, then the number of levels of the recursion will be more than a constant times N, the size of the list. The number of list comparisons and list data moves required at each level of recursion will be roughly proportional to the size of the sublist at that level, and the average size of the sublists will be about N/2. Therefore quicksort will do more than a constant times N*N/2 list comparisons and list data moves in the worst case. This is no better than the performance of a simple sort, which in many cases would be unacceptable.

Without severely harming its best case and average case performance, there's no known sure-fire way to prevent quicksort's worst case behavior.

One way to look at radix sort is that you use a queue for each possible digit of a number, or each possible character of a string. You start by queueing each item, according to its last digit or character, in the corresponding queue. Then you empty the queues in sort order and re-queue each item according to the next symbol to the left. After the final queueing, the items will come off the queues in sorted order.

If each item is a string of k symbols, then each item has to be queued k times. So the amount of work required in the sort is roughly proportional to k*N, where N is the number of items to sort. Thus radix sort is technically an O(N) algorithm. However, due to the overhead of expressing all sort keys as strings of the same length, and of extracting individual characters within keys, radix sort tends not to be competitive with quicksort, mergesort, and other O(N*log(N)) sorts, except in limited cases.

Distribution sort is an O(N) sort that can be extremely fast. However it is only practical in limited situations. Suppose the key set to be sorted is suitable to be the index set of an array, X. Then the algorithm is just to make one pass over the data, and as each item is encountered, you look at its key, and place the item in X[key]. For example if you have items with keys from 000 to 999 all jumbled up, you can just make a pass over them and put the item with key j into X[j]. Now the array contains the items in sorted order.