(Latest Revision: Sun Sep 19, 2005)
Chapter Eighteen
--
Distributed Coordination
--
Lecture Notes
- Introduction
This chapter is about two problems:
- How do we synchronize concurrent processes in a distributed
system?
- How do we handle deadlock in a distributed system?
- Section 18.1 -- Event Ordering
One obvious way to settle competition over resources is to grant requests
in first come first served (FCFS) order.
You can do that if you can tell who asked first.
If all processes share the same system clock, then it's easy to get
that information.
It is more of a challenge in a distributed system.
- 18.1.1 The Happened-Before Relation
How do we determine which of two events in a distributed system
happened first? Simple reasoning tells us that the following
three rules hold for any three events X, Y, and Z:
- Local "before" is global "before": If X and Y are
events in the same (single-threaded) process, and X was
executed before Y (going by the local system clock), then
X-->Y
(X happened before Y globally -- within the context of the
entire distributed system.)
- Causality: If X is the event of sending a message by
one process and Y is the event of receiving that message by
another process, then X-->Y.
- Transitivity: If X-->Y and Y-->Z
then X-->Z.
If we can establish that
E1-->E2 by using the rules above then
it is possible that event E1 affected event E2
causally.
On the other hand it is possible that two events are simply not
related by the happened-before relation. In that case
neither event could have affected the other causally. (Neither
could have executed when information about the other was
available.) Such pairs of events are said to be concurrent
events.
This is illustrated by figure 18.1 on page 665. The figure depicts
three processes and some messages going between processes. We can
use rules 1-3 to conclude that, for example, p0 happened before r4.
However, there is no path in the graph from q0 to p2, or from p2 to
q0.
Suppose we want to order all the events in figure 18.1 on a single
timeline. Suppose we want all the processes to agree to act as
though that timeline indicates the actual order in which all the
events happened.
We must make the timeline "respect" the happened-before
relation, else we cannot guarantee that on the timeline events
come after the events that caused them.
On the other hand, we are free to arbitrarily choose the relative
order of two concurrent events. Whichever order we choose, it
will be consistent with a possible reality -- a way things
could have happened. There will be no paradoxical logical
consequences.
- 18.1.2 Implementation
This section is about how we can create the kind of global timeline
discussed above. (To keep the details of this discussion simple, we
assume that there is just one process running at each site -- each
processor -- in the distributed system.)
- Each process Pi starts with a counter Ci
initialized to zero.
- Each Pi increments Ci after each
significant event X and assigns the timestamp Ci to
event X.
- When a process Pm sends a message it puts the
current value Cm of its counter in the message.
- When a process Pi receives a message it checks the
counter value Cm in the message. Pi sets
the value of the counter to
C=1+max(Cm,Ci). C becomes the timestamp
of the event of receiving the message.
The overall effect is that if X and Y are any two events throughout
the global distributed system, and if X-->Y then the
timestamp of X is less than the timestamp of Y.
Note that according to this scheme we can assign a unique
timestamp to each and every event of possible interest.
Therefore, the timestamping procedure implements a global
timeline.
Given any pair W, Z of concurrent events, each will have a timestamp
that defines its position on the timeline. Thus the timestamping
procedure chooses a relative ordering for each concurrent pair. All
these choices are consistent with causality. Therefore the timeline
produced is a possible reality.
If two events have the same timestamp then we can break the tie
using the process id numbers.
All in all then, we have a way to impose a total ordering of all
events. Given any two events, we can say that one of them is
"earlier" than the other and we can determine which is which just
by looking at timestamps (and pid's if necessary to break a tie).
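The timestamping scheme above can be sketched in a few lines of Python. This is a minimal illustration, not code from the text: the class and method names are invented, and the send event is treated as a significant event that increments the counter before the message is stamped.

```python
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.clock = 0                  # Ci starts at zero

    def local_event(self):
        self.clock += 1                 # increment for each significant event
        return (self.clock, self.pid)   # pid breaks timestamp ties

    def send(self):
        self.clock += 1                 # the send is itself an event
        return self.clock               # Cm travels in the message

    def receive(self, msg_clock):
        # C = 1 + max(Cm, Ci) becomes the timestamp of the receive event.
        self.clock = 1 + max(msg_clock, self.clock)
        return (self.clock, self.pid)

p0, p1 = Process(0), Process(1)
cm = p0.send()                     # event X on P0
y = p1.receive(cm)                 # event Y on P1, so X --> Y
assert (cm, p0.pid) < y            # X's timestamp precedes Y's
```

Comparing (timestamp, pid) pairs lexicographically gives exactly the total order described above: timestamps first, pids to break ties.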
- Section 18.2 -- Mutual Exclusion
This section discusses three methods of implementing mutual exclusion in
a distributed system. All three algorithms achieve mutual
exclusion, satisfy bounded waiting, and are deadlock- and
starvation-free. All three violate the progress requirement.
- 18.2.1 Centralized Approach
There is a mutex coordinator. When a client wants exclusive
access it sends a request message to the coordinator. After the
coordinator replies the client may enter its critical section
(CS). After the client leaves the CS it sends a release message
to the coordinator.
The coordinator maintains a FIFO queue of clients. (A client is
at the front of the queue while it executes in its CS.) When the
coordinator receives a request from a client it puts the client
in the queue. If the queue now has one element the coordinator
sends a reply to the client. When the coordinator receives a
release message it removes the front client from the queue. If
the queue is not empty now, it sends a reply to the client at the
front of the queue.
This simple method requires only three messages per access --
request, reply, release.
The disadvantage is that the coordinator can become a bottleneck.
If the coordinator fails "we're in trouble" -- however the remaining
processes can "elect" a new coordinator.
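The coordinator's queue logic might be sketched as follows. This is an illustrative outline only: the send(pid, msg) callback is a stand-in for real message delivery, which the text does not specify.

```python
from collections import deque

class Coordinator:
    def __init__(self, send):
        self.queue = deque()    # front client is the one in its CS
        self.send = send        # hypothetical message-delivery callback

    def on_request(self, pid):
        self.queue.append(pid)
        if len(self.queue) == 1:          # no one waiting or in the CS
            self.send(pid, "reply")       # pid may enter its CS

    def on_release(self, pid):
        self.queue.popleft()              # the releasing client leaves
        if self.queue:
            self.send(self.queue[0], "reply")

sent = []
c = Coordinator(lambda pid, msg: sent.append((pid, msg)))
c.on_request(1); c.on_request(2)   # only client 1 gets a reply
c.on_release(1)                    # now client 2 gets its reply
# sent == [(1, "reply"), (2, "reply")]
```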
- 18.2.2 Fully Distributed Approach
- When a process Pi wants to enter its CS it generates
timestamp TSi and sends request(Pi,
TSi) to all peers.
- When Pi has received a reply message from all
peers it may enter its CS.
- If a process P receives a request(Pj,
TSj):
- If P is in its CS, it queues and defers the reply.
- If P is not in its CS and not interested, it replies
immediately.
- If P is not in its CS but wants to enter it, then if
P's request timestamp TS>TSj (Pj
asked first) P sends a reply immediately; otherwise P
queues and defers its reply.
- When a process P leaves the CS it replies to all deferred
requests.
This algorithm requires 2*(N-1) messages per access to the CS.
For this algorithm to work all the processes have to know about
each other. There must be "introductions all around" when a new
process joins the group.
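The heart of this algorithm is the decision a process makes when a request arrives. Here is a sketch of just that decision; the state names (in_cs, wants_cs, my_ts) are invented for illustration, and ties between equal timestamps are broken by pid as in section 18.1.

```python
def should_defer(in_cs, wants_cs, my_ts, my_pid, req_ts, req_pid):
    """Return True if P must queue and defer the reply to
    request(Pj, TSj); False if P replies immediately."""
    if in_cs:
        return True                       # reply only after leaving the CS
    if not wants_cs:
        return False                      # not interested: reply now
    # Both want the CS: the earlier request (pid breaks ties) wins,
    # so P defers only if its own request came first.
    return (my_ts, my_pid) < (req_ts, req_pid)

assert should_defer(True, False, None, 1, 5, 2)       # in CS: defer
assert not should_defer(False, False, None, 1, 5, 2)  # reply now
assert not should_defer(False, True, 7, 1, 5, 2)      # Pj asked first
assert should_defer(False, True, 3, 1, 5, 2)          # P asked first
```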
- 18.2.3 Token-Passing Approach
Pass a token around a logical ring -- the process with the token
may enter the CS.
If the token is lost the processes can hold an election to
generate a new token.
If a process fails, the remaining processes can hold an election
to form a new ring.
This algorithm is efficient when processes "almost always" want to
enter the CS -- there is one message per access if all processes
always want to enter.
- Section 18.3 -- Atomicity
- 18.3.1 The Two-Phase Commit Protocol
The 2PC is one way to assure that a transaction commits on all
involved sites in a distributed system, or aborts on all sites.
Typically each site S has a coordinator C which will be in charge of
a transaction initiated at S. C takes care of assigning other sites
to perform parts of the transaction. The participating sites
perform the required actions but defer writing their
records to their logs. (See the discussion of log-based recovery in
chapter 6, Process Synchronization. Assume all logs mentioned
below are on stable storage.) Instead of committing (or aborting)
their subset of the transaction they send a message to C informing
it that they have "completed" their part of the transaction.
Phase 1: The site coordinator C then puts a <prepare
T> record in its log. Then C sends a prepare(T)
message to all the sites where T executed.
Sites may decide to abort or commit. If abort, they write <no
T> to the log and then send abort(T) to C. If commit,
they put <ready T> in the log and send a ready(T)
message to C. This is a "solemn promise" to obey C when it gives
the order to commit or abort. (A marriage is similar to a 2PC.
The parties say "I do" instead of "ready.")
Phase 2: If C gets ready(T) messages from all the
sites within a certain time-out period it writes <commit T>
to log and sends a commit(T) message to all sites.
Otherwise C writes <abort T> to log and sends an
abort(T) message to all sites.
The receiving sites then obediently write the commit or abort
record to their logs.
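The two phases can be sketched from the coordinator's side as follows. This is a simplified outline: the participant interface (objects with prepare and decide methods) is an assumption for illustration, and timeouts and message passing are elided.

```python
def two_phase_commit(log, participants, t):
    # Phase 1: C records <prepare T>, then polls the sites.
    log.append(f"<prepare {t}>")
    votes = [p.prepare(t) for p in participants]   # "ready" or "abort"
    # Phase 2: commit only if every site voted ready (in time).
    if all(v == "ready" for v in votes):
        log.append(f"<commit {t}>")
        decision = "commit"
    else:
        log.append(f"<abort {t}>")
        decision = "abort"
    for p in participants:
        p.decide(t, decision)       # sites obediently log the record
    return decision

class Site:
    def __init__(self, vote):
        self.vote, self.log = vote, []
    def prepare(self, t):
        # <ready T> is the "solemn promise"; <no T> precedes abort(T).
        self.log.append(f"<{'ready' if self.vote == 'ready' else 'no'} {t}>")
        return self.vote
    def decide(self, t, d):
        self.log.append(f"<{d} {t}>")

log = []
assert two_phase_commit(log, [Site("ready"), Site("ready")], "T1") == "commit"
assert two_phase_commit(log, [Site("ready"), Site("abort")], "T2") == "abort"
```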
- 18.3.2 Failure Handling in 2PC
- 18.3.2.1 Failure of a Participating Site
If upon recovery a participating site Sk finds a
<commit T> record in the log then Sk executes
redo(T). (The reason: all "intentions" of the transaction will
be in the log but possibly some changes to the target data did
not get flushed to stable storage before the failure.)
Sk performs undo(T) if there is an <abort T>
record in the log.
If Sk contains a <ready T> record but no
<commit T> or <abort T> then Sk promised
to obey the directive of the coordinator C. We need to find
out what C said to do. If C can tell us, then this is handled
as one of the two cases above (same as finding a commit or
abort record.)
If C cannot answer then Sk may poll all other sites
to see if any of them committed or aborted T. If it finds one
that committed it executes redo(T). If it finds one that
aborted, it executes undo(T).
If it cannot get the information immediately Sk must
ask the other sites from time to time until one answers. (At
least C should be able to answer eventually.)
If there is no <abort T>, <commit T> or <ready
T> record then Sk did not send a ready(T)
to C. In this case Sk makes sure never to
send a ready(T) to C. Sk performs an undo(T).
(C will eventually tell all the sites to abort -- it may have
done so already.)
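The recovery cases above amount to a small decision procedure. The sketch below is illustrative only: the log is modeled as a list of record strings, and ask_peers is a hypothetical callable that polls C and the other sites, returning "commit", "abort", or None if no one can answer yet.

```python
def recover(log, t, ask_peers):
    """Decide what a recovering participating site Sk does for T."""
    if f"<commit {t}>" in log:
        return "redo"              # changes may not have been flushed
    if f"<abort {t}>" in log:
        return "undo"
    if f"<ready {t}>" in log:
        # Sk promised to obey C: find out what was decided.
        decision = ask_peers(t)    # poll C and the other sites
        if decision == "commit":
            return "redo"
        if decision == "abort":
            return "undo"
        return "retry-later"       # keep asking until someone answers
    # No <ready T> record: Sk never promised, so it aborts unilaterally
    # (and must never send a ready(T) to C).
    return "undo"

assert recover(["<ready T>", "<commit T>"], "T", lambda t: None) == "redo"
assert recover(["<ready T>"], "T", lambda t: "abort") == "undo"
assert recover([], "T", lambda t: None) == "undo"
```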
- 18.3.2.2 Failure of the Coordinator
If one of the participating sites contains a <commit
T> in its log then T must be committed everywhere. (C
decided to commit.)
If one of the participating sites contains an <abort T>
in its log then T must be aborted everywhere. (C decided to
abort.)
If a participating site Sk does not contain a
<ready T> in its log then Sk can't have sent a
ready(T) message to C, so C can't have decided to
commit. C may have decided to abort. Whether or not C decided
to abort, everything will be consistent if Sk now
decides to abort. In this case Sk decides to
never send a ready(T) message to C.
The only remaining possibility is that all active sites have
a <ready T> in their logs and no <commit T> or
<abort T>. In that case it is impossible to tell
whether C has decided, or will decide, to commit or abort. The sites
have "solemnly promised" to obey C's decision. There's no
choice but to wait for C to recover. Problem: the pending
transaction may tie up resources.
- 18.3.2.3 Failure of the Network
If the link between Sk and Si fails, then
Sk can take the same actions it would take if
Si had failed, and conversely for Si.
- Section 18.4 -- Concurrency Control
- 18.4.1 Locking Protocols
Recall the following points that were made in chapter 6 (Process
Synchronization).
- When we want to ensure atomicity of some transaction, we may
not have to treat the whole section as a single critical
section to be protected by a single lock or semaphore.
- It is enough to ensure serializability -- to ensure that
when two transactions execute concurrently the effect
on the data is the same as if one transaction was
carried out completely first and then the other.
- If we use a lock for each data item and require transactions
to follow a locking protocol, we can ensure serializability.
- The so-called two-phase locking protocol may be used.
- The transaction may obtain but not release locks during
the growing phase.
- The transaction may release locks but not obtain any new
locks during the shrinking phase.
- The two-phase locking protocol ensures conflict
serializability but does not ensure freedom from
deadlock.
- There are conflict-serializable schedules that cannot
be obtained through two-phase locking.
We can use the two-phase locking protocol in a distributed
environment. However,
we have to consider how the lock manager will function in the
distributed system. Five different schemes are presented.
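The growing/shrinking discipline can be enforced with a simple guard, sketched below. The class and the lock-manager interface are invented for illustration; a real distributed 2PL would talk to remote lock managers instead.

```python
class TwoPhaseTransaction:
    """Enforces 2PL: every lock acquisition must precede every release."""
    def __init__(self, manager):
        self.manager = manager     # illustrative lock-manager object
        self.held = set()
        self.shrinking = False     # set once the first lock is released

    def lock(self, item):
        if self.shrinking:
            # Acquiring after any release would violate two-phase locking.
            raise RuntimeError("2PL violated: cannot lock after unlocking")
        self.manager.lock(item)
        self.held.add(item)

    def unlock(self, item):
        self.shrinking = True      # transaction enters its shrinking phase
        self.manager.unlock(item)
        self.held.discard(item)

class M:                           # trivial stand-in lock manager
    def lock(self, item): pass
    def unlock(self, item): pass

t = TwoPhaseTransaction(M())
t.lock("a"); t.lock("b")           # growing phase
t.unlock("a")                      # shrinking phase begins
```

Note that nothing here prevents deadlock: two such transactions can still block each other, which is exactly the limitation stated above.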
- 18.4.1.1 Non-replicated Scheme
If there is no replicated data it is simple to use a lock
manager at each site. A process executing a transaction may
lock data at various sites simply by communicating with the
respective lock managers using a request-wait-grant-release
paradigm.
This approach makes handling deadlock more complicated. (See
section 18.5)
- 18.4.1.2 Single-Coordinator Approach
If there is one system-wide lock manager then administering
locks on replicated data is straightforward. Deadlock can be
handled as on a centralized system. However the lock
manager can be a bottleneck and a single point of failure.
A compromise is to have a multiplicity of lock managers.
Each manager is responsible for locks on only some of
the data. Arrange it so that all replicas of any
particular datum are managed by the same lock
manager. That way a process has to talk to only one manager
to lock any particular piece of data.
This approach makes handling deadlock more complicated. (See
section 18.5)
- 18.4.1.3 Majority Protocol
Put a lock manager at each site, responsible for all the
data at that site (only).
A transaction sends lock requests to more than half of the
managers of replicas of the desired datum. The transaction has to
wait until a majority of those lock managers have granted the
lock request.
The scheme otherwise has the standard
request-wait-grant-release pattern.
The implementation is complex. Acquiring a lock requires
2*((N/2)+1) messages and releasing it requires (N/2)+1 messages.
(The formulas use integer division.)
This approach makes handling deadlock more complicated. (See
section 18.5) Deadlock can happen even if processes are only
trying to lock one datum.
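The counting involved can be sketched as follows. This differs from the protocol above in one deliberate simplification: instead of waiting for a majority of grants, the sketch tries each manager once and backs off on failure. The try_lock/unlock interface is invented for illustration.

```python
def majority_lock(managers, item):
    """Ask every replica manager; the lock is held once a majority
    ((N // 2) + 1 of the N managers) have granted it."""
    grants = [m for m in managers if m.try_lock(item)]
    needed = len(managers) // 2 + 1
    if len(grants) >= needed:
        return grants                  # caller now holds the majority lock
    for m in grants:                   # failed: release partial grants
        m.unlock(item)                 # (a real protocol would wait instead)
    return None

class Manager:                         # one lock manager per site
    def __init__(self):
        self.locked = set()
    def try_lock(self, item):
        if item in self.locked:
            return False
        self.locked.add(item)
        return True
    def unlock(self, item):
        self.locked.discard(item)

ms = [Manager() for _ in range(3)]
assert majority_lock(ms, "x") is not None   # 3 of 3 grants: success
```

Because two transactions can each collect partial grants on the same datum, this structure shows why deadlock is possible even over a single datum, as noted above.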
- 18.4.1.4 Biased Protocol
This scheme is like the majority protocol except that shared
and exclusive locks are handled differently.
To get a shared lock all you need is permission from the
manager of one of the replicas.
To get an exclusive lock you must get permission from the
managers of all the replicas.
Low overhead on reading but high on writing.
This approach makes handling deadlock more complicated. (See
section 18.5)
- 18.4.1.5 Primary Copy
Designate a primary copy of each datum.
To get a lock, just get permission from the manager of the
primary copy.
The design is simple.
If the manager at the primary site fails then we can't lock
the data -- even if replicas are still accessible.
- 18.4.2 Timestamping
- 18.4.2.1 Generation of Unique Timestamps
The discussion in this part of the text is a little confusing.
What it boils down to is that we can use the event-ordering
scheme that was developed in section 18.1.
- 18.4.2.2 Timestamp-Ordering Scheme
This section is basically an exercise in the text which I am
skipping this term in the interest of saving time :-).
- Section 18.5 -- Deadlock Handling
- 18.5.1 Deadlock Prevention
In a distributed system:
- We can perform the resource ordering deadlock prevention
scheme by defining a global ordering of the resources. This
algorithm is simple and has low overhead. However it requires
that we get all sites and processes to agree on the resource
ordering.
- If we designate one process to be the banker, we can perform
deadlock avoidance using the banker's algorithm, but apparently
the banker would inevitably be a severe bottleneck.
This section develops new methods of deadlock prevention based on
numbering schemes similar to the resource ordering scheme.
The difference is that the new methods do not depend on "compliance"
on the part of the participating processes.
The Wait-Die Scheme: Each process gets a unique timestamp
before it starts to execute. A younger process that attempts
to wait for a resource held by an older process is rolled
back. (The young process "dies.") This implies that age
decreases monotonically as we go forward along any chain in
the wait-for graph of the system. Therefore there can be no cycles
in the wait-for graph.
The Wound-Wait Scheme: This algorithm is similar to wait-die,
except it works like this: if an older process attempts to
wait for a resource held by a younger process the younger
process is rolled back and the older process gets the resource. (The
young process is "wounded.") In this scheme, age increases
monotonically as we go forward along any chain in the wait-for
graph.
When processes are rolled back and restarted, they keep their old
timestamps. This prevents starvation.
Wound-wait preempts resources, but wait-die does not.
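The two schemes reduce to small decision tables, sketched below. The function names are invented; a smaller timestamp means an older process, since rolled-back processes keep their original timestamps.

```python
def wait_die(req_ts, holder_ts):
    """Requester's fate when the resource is held: only an older
    process may wait on a younger one."""
    return "wait" if req_ts < holder_ts else "die"   # young requester dies

def wound_wait(req_ts, holder_ts):
    """An older requester preempts (wounds) a younger holder; a
    younger requester is allowed to wait."""
    return "wound holder" if req_ts < holder_ts else "wait"

assert wait_die(1, 5) == "wait"            # old waits on young
assert wait_die(5, 1) == "die"             # young requester rolled back
assert wound_wait(1, 5) == "wound holder"  # young holder rolled back
assert wound_wait(5, 1) == "wait"          # young requester waits
```

In both tables, chains of waiting processes are monotone in age, which is what rules out cycles in the wait-for graph.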
In the wait-die scheme, after a young process Y dies because it
tried to wait on an old process O, the operating system is likely to
quickly restart Y, whereupon Y is likely to try to wait on O again.
Thus, time and resources may be wasted by repeated roll-back and
restart.
In the wound-wait scheme, suppose a young process Y is rolled back
because an older process O wounds it. When Y is restarted it may
immediately attempt to acquire the resource O took from it. If O
still holds the resource then Y will be allowed to wait for it.
Thus, depending on other conditions, wound-wait may result in fewer
rollbacks than wait-die.
We may quibble over which scheme causes more rollbacks. However it
is a significant problem if there are any rollbacks.
Processing time is lost, and it is difficult to decide
algorithmically (in a program) what to do with stateful resources
held by a process that is rolled back.
- 18.5.2 Deadlock Detection
To eliminate unnecessary rollbacks and preemption of resources we
can utilize a deadlock detection algorithm.
(To keep details in this discussion simple, we assume there is just
one instance of each resource type, and we use a wait-for graph to
keep track of the system state.)
Since the processes and resources are scattered all over the
distributed system there doesn't seem to be an obvious answer to the
question of how the wait-for graph should be represented.
One way to handle the problem is to store parts of the graph on each
local system. Let G be the graph. Denote by Gs
that part of G which is stored at site S. Gs has a node
for each process (local or not) that either holds or is waiting for
a resource located at S.
G is the union of all the Gs. There can be a
cycle in G even if there is no cycle in any of the Gs's.
- 18.5.2.1 Centralized Approach
In the centralized approach we choose a
deadlock-detection coordinator to accept copies
of the local graphs and construct a union Gc.
The constructed graph Gc is seldom if ever the real
wait-for graph G of the system because the system is constantly
in flux and there is communication lag involved in creating
Gc.
Unfortunately, looking at examples we can see it is possible
that Gc will contain false cycles.
Using our distributed global event-ordering methodology (cf.
section 18.1)
we can use an algorithm based on the following ideas to
avoid detecting false cycles:
- When a process Pi requests a resource at site S
held by process Pj at the same site S then the
system at site S inserts an edge
[Pi-->Pj] into the local wait-for
graph.
- When process Pi, at site S1,
requests a resource held by process Pj at a
different site S2, Pi generates a
timestamp TS and sends the request and timestamp to
S2.
- The system at S1 inserts a labelled edge [TS,
Pi-->Pj] into the local wait-for
graph.
- S2 inserts a copy of [TS,
Pi-->Pj] into its local wait-for
graph if and only if S2 cannot
immediately grant the requested resource when the request
arrives.
- When the deadlock-detection coordinator decides to check
for cycles it sends a "let's do it" message to all the
sites in the system.
- When a site gets the "let's do it" message from the
coordinator, the site sends the coordinator its copy of
the local wait-for graph.
- After the coordinator receives the expected reply from
every site it constructs a "union" graph
like this:
- It makes one vertex for each process found in any
of the local graphs.
- Into the graph it puts all edges that have the form
Pi-->Pj, where Pi
and Pj reside at the same site.
- If an edge of the form [TS,
Pi-->Pj] is found in more than
one local wait-for graph, the coordinator puts that
edge in the constructed union.
The construction has the property that if the constructed
graph has a cycle then the (actual) system is deadlocked.
Also if the constructed graph has no cycle then the system
was not deadlocked when the "let's do it" message was sent
out.
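The coordinator's construction and cycle test can be sketched as follows. The graph representation is an invented one: same-site edges are (src, dst) pairs, and labelled cross-site edges are (ts, src, dst) triples kept only when reported by at least two sites, per the rules above.

```python
from collections import Counter

def build_union(local_edges, cross_edges_per_site):
    """local_edges: per-site sets of (src, dst) same-site edges.
    cross_edges_per_site: per-site lists of (ts, src, dst) edges."""
    edges = set()
    for es in local_edges:                  # same-site edges: take all
        edges |= set(es)
    counts = Counter(e for es in cross_edges_per_site for e in es)
    for (ts, src, dst), n in counts.items():
        if n >= 2:                          # edge seen in more than one graph
            edges.add((src, dst))
    return edges

def has_cycle(edges):
    """True if the directed graph given by `edges` contains a cycle."""
    graph = {}
    for s, d in edges:
        graph.setdefault(s, []).append(d)
    def reachable(start, goal, seen):
        if start == goal:
            return True
        if start in seen:
            return False
        seen.add(start)
        return any(reachable(n, goal, seen) for n in graph.get(start, []))
    # A cycle exists iff some edge's head can reach back to its tail.
    return any(reachable(d, s, set()) for s, d in edges)

u = build_union([{("P1", "P2")}], [[(7, "P2", "P1")], [(7, "P2", "P1")]])
assert has_cycle(u)         # P1 --> P2 --> P1: deadlock reported
```

Dropping cross-site edges seen at only one site is what filters out the false cycles discussed above.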
- 18.5.2.2 Fully Distributed Approach
Our text gives an overview of a fully distributed
deadlock-detection scheme published by Obermarck
(R. Obermarck, "Distributed Deadlock Detection Algorithm,"
ACM Transactions on Database Systems, Volume 7, Number 2
(1982), pp. 187-208).
In a "fully distributed approach" all sites in the
distributed system participate equally in the work of
determining if a deadlock has occurred. Instead of
designating one single process to play the role of the
deadlock-detection coordinator, there is a
deadlock-detection coordinator at every site.
In this scheme the local wait-for graph contains the usual
nodes and edges corresponding to local processes waiting for
local resources.
Additionally, a special node Pex may be in the local
wait-for graph at any site S. If a process P at site S is
waiting for a resource external to S that is held by a process
Q, then the local wait-for graph at S has an edge of the form
P-->Pex. Similarly if a process P' at some site
other than S is waiting for a resource held by a process Q' at
S, then the local wait-for graph at S has an edge of the form
Pex-->Q'.
Obviously the system is deadlocked if there is a cycle not
involving Pex in any of the local wait-for graphs.
Obviously the system is not deadlocked if there are no
cycles at all in any of the local wait-for graphs.
If a local wait-for graph has a cycle containing
Pex, but no cycles without Pex, then the
system may be deadlocked but further information is
required to find out.
Through the use of a numbering scheme, one particular site S
is "elected" to investigate further.
The wait-for graph at S contains a (simple) cycle in which one
of the edges is of the form P-->Pex. P is waiting
for a resource at a site S'. S sends the information about the
cycle to S' and S' forms the union of the cycle with its local
wait-for graph. S' examines the union. If S' finds a cycle
not containing Pex, then the system is deadlocked.
If S' only finds a cycle involving Pex, then it
ships it off to another site and the algorithm "recurs."
Eventually a site will either find no cycle -- in which case we
conclude there is no deadlock, or find a cycle not containing
Pex -- in which case we conclude that there
is a deadlock.
- Section 18.6 -- Election Algorithms
Many of the schemes described in chapter 18 depend on the existence of
a coordinator performing a service at one of the sites. Section 18.6
provides information about what can be done if the coordinator process
fails (e.g. its platform computer crashes.)
Specifically, section 18.6 presents election algorithms that can be used
to choose a unique site where a new coordinator will be started.
In this section we go back to assuming that there is one process at each
site. Also we assume that each site has a unique id number, and that the
coordinator is supposed to be the "living" process with the highest
number. Thus basically the problem is for the "surviving" processes to
collectively determine which of them has the highest number.
- 18.6.1 The Bully Algorithm
If the coordinator does not respond to a process Pi for a
sufficiently long time then Pi decides to run for the
office of the coordinator.
Pi "starts an election" by sending an "I am running"
election message to every Pj such that j>i. If there
are no replies within a timeout period then Pi assumes
the role of coordinator and sends "I am coordinator" messages to all
Pk where k<i.
However if Pi gets an "I'm bigger than you" reply from
some Pj with j>i then Pi waits for some
process to send it an "I am coordinator" message.
If Pi does not get that message before a timeout expires it will
have to start another election. (All higher processes may have
failed.)
Here is how a process must handle two kinds of messages:
- If a process Pi gets an "I am coordinator" message
from a process Pj with j>i then Pi
should record the information and try to use Pj as
the coordinator in the future.
- If a process Pi gets an "I am running" message from
Pk where k<i then Pi responds to Pk with an "I am
bigger than you" message. Next Pi starts an
election, unless it is already running one.
When a failed process restarts, it starts an election. It will
win and become the coordinator if it has the highest number.
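Two fragments of the algorithm can be sketched directly. The alive predicate is an illustrative stand-in for "responded within the timeout"; real message passing is elided.

```python
def bully_winner(ids, alive):
    """The process that ultimately becomes coordinator: the
    highest-numbered process that is still alive."""
    return max(i for i in ids if alive(i))

def on_election_message(my_id, sender_id):
    """Pi's response to an 'I am running' message from Pk: reply
    only if Pi outranks the sender (and then Pi starts its own
    election, not shown here)."""
    if sender_id < my_id:
        return "I am bigger than you"
    return None

assert bully_winner([1, 2, 3, 4], lambda i: i != 4) == 3
assert on_election_message(3, 1) == "I am bigger than you"
```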
- 18.6.2 Ring Algorithm
If a process Pi decides that the coordinator may be down,
it creates an initially empty active list and sends an
"elect(Pi)" message to its neighbor on the right. It then
adds itself ('i') to its active list.
If Pi gets an elect(j) message from the process on the left
- If this is the first elect message Pi has received or
sent, Pi creates a new active list, puts i and j in
it and sends an elect(i) message followed by an elect(j) message
to the process on the right;
- otherwise if i != j then Pi adds j to its active list
and passes the elect(j) message to the right;
- otherwise i == j and now Pi has the id numbers of all
active processes. Pi computes the max and in the
future tries to use that process as the coordinator.
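The three cases above can be exercised with a small simulation. This sketch is illustrative: it assumes a fixed ring of live processes and models message passing with a simple FIFO queue rather than a real network.

```python
from collections import deque

def ring_election(ring, starter):
    """ring: pids in ring order; messages travel to the right.
    Returns the elected coordinator (the maximum active pid)."""
    n = len(ring)
    right = {ring[k]: ring[(k + 1) % n] for k in range(n)}
    active = {p: None for p in ring}       # None: no elect msgs yet
    msgs = deque()
    active[starter] = [starter]            # starter adds itself
    msgs.append((right[starter], starter)) # and sends elect(starter)
    coordinator = None
    while msgs:
        at, j = msgs.popleft()             # process `at` gets elect(j)
        if active[at] is None:             # case 1: first elect message
            active[at] = [at, j]
            msgs.append((right[at], at))   # send elect(at), then elect(j)
            msgs.append((right[at], j))
        elif at != j:                      # case 2: add j, pass it on
            active[at].append(j)
            msgs.append((right[at], j))
        else:                              # case 3: elect(at) came home;
            coordinator = max(active[at])  # `at` knows every active pid
    return coordinator

assert ring_election([1, 2, 3], 1) == 3
```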
- Section 18.7 -- Reaching Agreement
- 18.7.1 Unreliable Communications
- 18.7.2 Faulty Processes
- Section 18.8 -- Summary