(Latest Revision: Fri Dec 6 12:32:57 PST 2002)
Chapter Seventeen
--
Distributed Coordination
--
Lecture Notes
- Introduction
This chapter is about two problems:
- How do we synchronize concurrent processes in a distributed
system?
- How do we handle deadlock in a distributed system?
- Section 17.1 -- Event Ordering
One obvious way to settle competition over resources is to grant requests
in first come first served (FCFS) order.
You can do that if you can tell who asked first.
If all processes share the same system clock, then it's easy to get
that information.
It is more of a challenge in a distributed system.
- 17.1.1 The Happened-Before Relation
How do we determine which of two events in a distributed system
happened first? The logic of it tells us that the following
three laws are true for any three events X, Y, and Z:
- Local "before" is global "before": If X and Y are
events in the same (single-threaded) process, and X was
executed before Y (going by the local system clock), then
X-->Y
(X happened before Y globally -- within the context of the
entire distributed system.)
- Causality: If X is the event of sending a message by
one process and Y is the event of receiving that message by
another process, then X-->Y.
- Transitivity: If X-->Y and Y-->Z
then X-->Z.
If we can establish that E1-->E2 by using the
rules above then it is possible that event E1 affected event E2
causally.
On the other hand it is possible that two events are simply not
related by the happened-before relation. In that case
neither event could have affected the other causally. (Neither
could have executed when information about the other was
available.) Such pairs of events are said to be concurrent
events.
This is illustrated by figure 17.1 on page 597. The figure
depicts three processes and some messages going between
processes. We can use rules 1-3 to conclude that, for example,
p0 happened before r4. However, there is no path in the graph
from q0 to p2, or from p2 to q0, so those two events are
concurrent.
Suppose we want to order all the events in figure 17.1 on a
single timeline. Suppose we want all the processes to agree to
act as though that timeline indicates the actual order in which
all the events happened.
We must make the timeline "respect" the happened-before
relation, else we cannot guarantee that on the timeline events
come after the events that caused them.
On the other hand, we are free to arbitrarily choose the relative
order of two concurrent events. Whichever order we choose, it
will be consistent with a possible reality -- a way things
could have happened. There will be no paradoxical logical
consequences.
- 17.1.2 Implementation
This section is about how we can create the kind of global timeline
discussed above. (To keep the details of this discussion simple, we
assume that there is just one process running at each site -- each
processor -- in the distributed system.)
- Each process Pi starts with a counter Ci initialized to
zero.
- Each Pi increments Ci after each significant event X and
assigns the timestamp Ci to event X.
- When a process Pm sends a message it puts the current value
Cm of its counter in the message.
- When a process Pi receives a message it checks the counter
value Cm in the message. Pi sets its own counter Ci to
1+max(Cm,Ci). That new value becomes the timestamp of the
event of receiving the message.
The overall effect is that if X and Y are any two events
throughout the global distributed system, and if X-->Y
then the timestamp of X is less than the timestamp of Y.
Note that according to this scheme we can assign a timestamp to
each and every event of possible interest. Together with the
tie-breaking rule below, the timestamping procedure therefore
implements a global timeline.
Given any pair W, Z of concurrent events, each will have a
timestamp that defines its position on the timeline. Thus the
timestamping procedure chooses a relative ordering for each
concurrent pair. All these choices are consistent with
causality. Therefore the timeline produced is a possible
reality.
If two events have the same timestamp then we can break the tie
using the process id numbers.
All in all then, we have a way to impose a total ordering of all
events. Given any two events, we can say that one of them is
"earlier" than the other and we can determine which is which just
by looking at timestamps (and pid's if necessary to break a tie).
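The counter scheme above is small enough to sketch directly. Here is a
minimal Python sketch (mine, not from the text) of one process's counter;
it follows the common convention of incrementing the counter before
stamping an event, which gives the same ordering property.

    class LamportClock:
        """Per-process counter Ci used to timestamp events (section 17.1.2)."""

        def __init__(self, pid):
            self.pid = pid        # used only to break timestamp ties
            self.counter = 0      # Ci starts at zero

        def local_event(self):
            # Bump the counter and use the new value as the event's timestamp.
            self.counter += 1
            return (self.counter, self.pid)

        def send_event(self):
            # Sending is an event; the counter value travels with the message.
            return self.local_event()

        def receive_event(self, msg_counter):
            # Ci becomes 1 + max(Cm, Ci); that value stamps the receive event.
            self.counter = max(msg_counter, self.counter) + 1
            return (self.counter, self.pid)

Comparing the (counter, pid) pairs lexicographically gives exactly the
total order described above: counters respect happened-before and pids
break ties.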
- Section 17.2 -- Mutual Exclusion
This section discusses three methods of implementing mutual exclusion
in a distributed system. All three algorithms achieve mutual
exclusion, satisfy bounded waiting, and are deadlock- and
starvation-free. All three violate the progress requirement.
- 17.2.1 Centralized Approach
There is a mutex coordinator. When a client wants exclusive
access it sends a request message to the coordinator. After the
coordinator replies the client may enter its critical section
(CS). After the client leaves the CS it sends a release message
to the coordinator.
The coordinator maintains a FIFO queue of clients. (A client is
at the front of the queue while it executes in its CS.) When the
coordinator receives a request from a client it puts the client
in the queue. If the queue now has one element the coordinator
sends a reply to the client. When the coordinator receives a
release message it removes the front client from the queue. If
the queue is not empty now, it sends a reply to the client at the
front of the queue.
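A minimal Python sketch of the coordinator's bookkeeping just described;
the send callback is a stand-in (my assumption) for whatever messaging
layer the system actually uses.

    from collections import deque

    class MutexCoordinator:
        def __init__(self, send):
            self.queue = deque()    # FIFO queue; the front client is in its CS
            self.send = send        # assumed callback: send(client, message)

        def on_request(self, client):
            self.queue.append(client)
            if len(self.queue) == 1:          # nobody else waiting or in the CS
                self.send(client, "reply")    # client may enter its CS

        def on_release(self, client):
            self.queue.popleft()              # the releasing client was at the front
            if self.queue:
                self.send(self.queue[0], "reply")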
This simple method requires only three messages per access --
request, reply, release.
The disadvantage is that the coordinator can become a bottleneck.
If the coordinator fails "we're in trouble" until the remaining
processes detect the failure and "elect" a new coordinator (see
section 17.6).
- 17.2.2 Fully Distributed Approach
- When a process Pi wants to enter its CS it generates
timestamp TSi and sends request(Pi,TSi) to all
peers.
- When Pi has received a reply message from all peers
it may enter its CS.
- If a process P receives a request(Pj,TSj):
- If P is in its CS P queues and defers the reply.
- If P is not in its CS and not interested it replies
immediately.
- If P is not in its CS but desires to be in its CS then
if P's request timestamp TS>TSj (Pj asked first) then P
sends a reply immediately, else P queues and defers its
reply.
- When a process P leaves the CS it replies to all deferred
requests.
This algorithm requires 2*(N-1) messages per access to the CS.
For this algorithm to work all the processes have to know about
each other. There must be "introductions all around" when a new
process joins the group.
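The heart of this algorithm is the rule a process uses, on receiving
request(Pj,TSj), to decide whether to reply at once or defer. A hedged
Python sketch of just that decision (the function name and parameters
are mine):

    def should_reply_now(in_cs, wants_cs, my_ts, my_pid, req_ts, req_pid):
        """True: send the reply immediately.  False: queue and defer it."""
        if in_cs:
            return False                  # defer until we leave our CS
        if not wants_cs:
            return True                   # not interested, reply immediately
        # Both processes want the CS: the older request (smaller timestamp,
        # with pid as a tie-breaker as in section 17.1) goes first.
        return (req_ts, req_pid) < (my_ts, my_pid)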
- 17.2.3 Token-Passing Approach
Pass a token around a logical ring -- the process with the token
may enter the CS.
If the token is lost the processes can hold an election to
generate a new token.
If a process fails, the remaining processes can hold an election
to form a new ring.
This algorithm is efficient when processes "almost always" want to
enter the CS -- there is one message per access if all processes
always want to enter.
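A toy Python simulation of the token-passing idea, run inside one
program; a real implementation would pass the token in network messages
around the logical ring.

    def simulate_token_ring(wants_cs, laps=2):
        """wants_cs: one boolean per process, in ring order."""
        n = len(wants_cs)
        holder = 0                           # index of the process holding the token
        for _ in range(laps * n):
            if wants_cs[holder]:
                print(f"P{holder} enters and then leaves its critical section")
            holder = (holder + 1) % n        # pass the token to the right neighbor

    simulate_token_ring([True, False, True])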
- Section 17.3 -- Atomicity
- 17.3.1 The Two-Phase Commit Protocol
The 2PC is one way to assure that a transaction commits on all
involved sites in a distributed system, or aborts on all sites.
Typically each site S has a coordinator C which will be in charge
of a transaction initiated at S. C takes care of assigning other
sites to perform parts of the transaction. The participating
sites perform the required actions but defer writing their
records to their logs. (See discussion of log-based
recovery in chapter 7. Assume all logs mentioned below are on
stable storage.) Instead of committing (or aborting) their
subset of the transaction they send a message to C informing it
that they have "completed" their part of the transaction.
Phase 1: The site coordinator C then puts a <prepare
T> record in its log. Then C sends a prepare(T)
message to all the sites where T executed.
Sites may decide to abort or commit. If abort, they write <no
T> to the log and then send abort(T) to C. If commit,
they put <ready T> in the log and send a ready(T)
message to C. This is a "solemn promise" to obey C when it gives
the order to commit or abort. (A marriage is similar to a 2PC.
The parties say "I do" instead of "ready.")
Phase 2: If C gets ready(T) messages from all the
sites within a certain time-out period it writes <commit T>
to log and sends a commit(T) message to all sites.
Otherwise C writes <abort T> to log and sends an
abort(T) message to all sites.
The receiving sites then obediently write the commit or abort
record to their logs and then commit or abort T accordingly.
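A skeletal Python sketch of the coordinator's two phases. The log, send,
and collect_replies calls are placeholders (my assumptions) for
stable-storage logging and messaging, and the timeout handling is
reduced to a single collection step.

    def two_phase_commit(T, sites, send, collect_replies, log):
        # Phase 1: ask every participating site to prepare.
        log(f"<prepare {T}>")
        for s in sites:
            send(s, ("prepare", T))
        replies = collect_replies(T, timeout=5.0)   # e.g. {site: "ready" or "abort"}

        # Phase 2: commit only if every site answered ready within the timeout.
        if len(replies) == len(sites) and all(r == "ready" for r in replies.values()):
            decision = "commit"
        else:
            decision = "abort"
        log(f"<{decision} {T}>")
        for s in sites:
            send(s, (decision, T))
        return decision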
- 17.3.2 Failure Handling in 2PC
- 17.3.2.1 Failure of a Participating Site
If upon recovery a participating site Sk finds a <commit
T> record in the log then Sk executes redo(T). (The
reason: all "intentions" of the transaction will be in the
log but possibly some changes to the target data did not get
flushed to stable storage before the failure.)
Sk performs undo(T) if there is an <abort T> record in
the log.
If Sk contains a <ready T> record but no <commit
T> or <abort T> then Sk promised to obey the
directive of the coordinator C. We need to find out what C
said to do. If C can tell us, then this is handled as one
of the two cases above (same as finding a commit or abort
record.)
If C cannot answer then Sk may poll all other sites to see
if any of them committed or aborted T. If it finds one that
committed it executes redo(T). If it finds one that
aborted, it executes undo(T).
If it cannot get the information immediately Sk must ask the
other sites from time to time until one answers. (At least
C should be able to answer eventually.)
If there is no <abort T>, <commit T> or
<ready T> record then Sk did not send a
ready(T) to C. In this case Sk makes sure
never to send a ready(T) to C. Sk performs an
undo(T). (C will eventually tell all the sites to abort --
it may have done so already.)
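The case analysis above amounts to a small decision procedure on Sk's
log. A Python sketch with hypothetical helper functions standing in for
the two queries described:

    def recover_participant(T, log_records, ask_coordinator, ask_other_sites):
        """log_records: the records for T found in Sk's log, e.g. {"ready"}."""
        if "commit" in log_records:
            return "redo(T)"                  # re-apply T's logged changes
        if "abort" in log_records:
            return "undo(T)"
        if "ready" in log_records:
            # Sk promised to obey C; find out what was decided.
            decision = ask_coordinator(T) or ask_other_sites(T)
            if decision is None:
                return "ask again later"      # keep polling until someone answers
            return "redo(T)" if decision == "commit" else "undo(T)"
        # No record at all: Sk never sent ready(T), so it is safe to abort.
        return "undo(T)"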
- 17.3.2.2 Failure of the Coordinator
If one of the participating sites contains a <commit
T> in its log then T must be committed everywhere. (C
decided to commit.)
If one of the participating sites contains an <abort T>
in its log then T must be aborted everywhere. (C decided to
abort.)
If a participating site Sk does not contain a
<ready T> in its log then Sk can't have sent a
ready(T) message to C, so C can't have decided to
commit. C may have decided to abort. Whether or not C
decided to abort, everything will be consistent if Sk now
decides to abort. In this case Sk decides to never
send a ready(T) message to C.
The only remaining possibility is that all active sites have
a <ready T> in their logs and no <commit T> or
<abort T>. In that case it is impossible to tell
whether C has or will decide to commit or abort. The sites
have "solemnly promised" to obey C's decision. There's no
choice but to wait for C to recover. Problem: the pending
transaction may tie up resources.
- 17.3.2.3 Failure of the Network
If the link between Sk and Si fails, then Sk can take the
same actions it would take if Si had failed, and conversely
for Si.
- Section 17.4 -- Concurrency Control
- 17.4.1 Locking Protocols
Recall the following points that were made in chapter 7 (Process
Synchronization).
- When we want to insure atomicity of some transaction, we may
not have to treat the whole section as a single critical
section to be protected by a single lock or semaphore.
- It is enough to insure serializability -- to insure that
when two transactions execute concurrently the effect
on the data is the same as if one transaction was
carried out completely first and then the other.
- If we use a lock for each data item and require transactions
to follow a locking protocol, we can ensure serializability.
- The so-called two-phase locking protocol may be used.
- The transaction may obtain but not release locks during
the growing phase.
- The transaction may release locks but not obtain any new
locks during the shrinking phase.
- The two-phase locking protocol ensures conflict
serializability but does not ensure freedom from
deadlock.
- There are conflict-serializable schedules that cannot
be obtained through two-phase locking.
We can use the two-phase locking protocol in a distributed
environment. However, we have to consider how the lock manager
will function in the distributed system. Five different schemes
are presented.
- 17.4.1.1 Non-replicated Scheme
If there is no replicated data it is simple to use a lock
manager at each site. A process executing a transaction may
lock data at various sites simply by communicating with the
respective lock managers using a request-wait-grant-release
paradigm.
This approach makes handling deadlock more complicated. (See
section 17.5)
- 17.4.1.2 Single-Coordinator Approach
If there is one system-wide lock manager then administering
locks on replicated data is no great task. Deadlock can be
handled as on a centralized system. However the lock
manager can be a bottleneck and a single point of failure.
A compromise is to have a multiplicity of lock managers.
Each manager is responsible for locks on only some of
the data. Arrange it so that all replicas of any
particular datum are managed by the same lock
manager. That way a process has to talk to only one manager
to lock any particular piece of data.
This approach makes handling deadlock more complicated. (See
section 17.5)
- 17.4.1.3 Majority Protocol
Put a lock manager at each site, responsible for all the
data at that site (only).
A transaction sends lock requests to more than half of the
managers of replicas of the desired data. The transaction has to
wait until a majority of the lock managers have granted the
lock request.
The scheme otherwise has the standard
request-wait-grant-release pattern.
The implementation is more complex than the single-coordinator
scheme: 2*((N/2)+1) messages are required to acquire a lock and
(N/2)+1 messages to release it, where N is the number of replicas.
(The formulas use "integer division.")
This approach makes handling deadlock more complicated. (See
section 17.5) Deadlock can happen even if processes are only
trying to lock one datum. For example, with four replicas, two
transactions could each obtain locks on two of the replicas and
then wait forever for a third.
- 17.4.1.4 Biased Protocol
This scheme is like the majority protocol except that shared
and exclusive locks are handled differently.
To get a shared lock all you need is permission from the
manager of one of the replicas.
To get an exclusive lock you must get permission from the
managers of all the replicas.
Low overhead on reading but high on writing.
This approach makes handling deadlock more complicated. (See
section 17.5)
- 17.4.1.5 Primary Copy
Designate a primary copy of each datum.
To get a lock, just get permission from the manager of the
primary copy.
Simple design.
If the manager at the primary site fails then we can't lock
the data -- even if replicas are still accessible.
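The three replicated-data schemes (majority, biased, primary copy)
differ mainly in how many replica managers must grant a lock. A small
Python summary of that difference (my own tabulation, with N the number
of replicas):

    def managers_needed(scheme, mode, n_replicas):
        """How many replica lock managers must grant the request."""
        if scheme == "majority":
            return n_replicas // 2 + 1            # more than half, read or write
        if scheme == "biased":
            return 1 if mode == "shared" else n_replicas
        if scheme == "primary copy":
            return 1                              # only the primary copy's manager
        raise ValueError(scheme)

    for scheme in ("majority", "biased", "primary copy"):
        for mode in ("shared", "exclusive"):
            print(scheme, mode, managers_needed(scheme, mode, 4))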
- 17.4.2 Timestamping
- 17.4.2.1 Generation of Unique Timestamps
The discussion in this part of the text is a little confusing.
What it boils down to is that we can use the event-ordering
scheme that was developed in section 17.1.
- 17.4.2.2 Timestamp-Ordering Scheme
This section is basically an exercise in the text which I am
skipping this term in the interest of saving time :-).
- Section 17.5 -- Deadlock Handling
- 17.5.1 Deadlock Prevention
In a distributed system:
- We can perform the resource ordering deadlock prevention
scheme by defining a global ordering of the resources. This
algorithm is simple and has low overhead. However it requires
that we get all sites and processes to agree on the resource
ordering.
- If we designate one process to be the banker, we can perform
deadlock avoidance using the banker's algorithm, but the banker
process would likely become a severe bottleneck.
This section develops new methods of deadlock prevention based on
numbering schemes similar to the resource ordering scheme. The
difference is that the new methods do not depend on "compliance" on
the part of the participating processes.
The Wait-Die Scheme: Each process gets a unique timestamp
before it starts to execute. A younger process that attempts
to wait for a resource held by an older process is rolled
back. (The young process "dies.") This implies that age decreases
monotonically along all chains in the wait-for graph of the system.
Therefore there can be no cycles in the wait-for graph.
The Wound-Wait Scheme: This algorithm is similar to wait-die,
except it works like this: if an older process attempts to
wait for a resource held by a younger process the younger
process is rolled back and the older process gets the resource. (The
young process is "wounded.") In this scheme, age increases
monotonically along all chains in the wait-for graph.
When processes are rolled back and restarted, they keep their old
timestamps. This prevents starvation.
Wound-wait preempts resources, but wait-die does not.
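Both schemes reduce to a timestamp comparison made when a process
requests a resource that another process holds. A short Python sketch of
the two rules (smaller timestamp = older process):

    def wait_die(requester_ts, holder_ts):
        # Only older processes wait; a younger requester is rolled back ("dies").
        return "wait" if requester_ts < holder_ts else "roll back requester"

    def wound_wait(requester_ts, holder_ts):
        # An older requester preempts ("wounds") the holder; a younger one waits.
        return "roll back holder" if requester_ts < holder_ts else "wait"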
In the wait-die scheme, after a young process Y dies because it
tried to wait on an old process O, the operating system is likely to
quickly restart Y, whereupon Y is likely to try to wait on O again.
Thus, time and resources may be wasted by repeated roll-back and
restart.
In the wound-wait scheme, suppose a young process Y is rolled back
because an older process O wounds it. When Y is restarted it may
immediately attempt to acquire the resource O took from it. If O
still holds the resource then Y will be allowed to wait for it.
Thus, depending on other conditions, wound-wait may result in fewer
rollbacks than wait-die.
We may quibble over which scheme causes more rollbacks. However,
any rollback at all is a significant problem.
Processing time is lost, and it is difficult to decide
algorithmically what to do with stateful resources held by the
process that is rolled back.
- 17.5.2 Deadlock Detection
To eliminate unnecessary rollbacks and preemption of resources we
can utilize a deadlock detection algorithm.
(To keep details in this discussion simple, we assume there is just
one instance of each resource type, and we use a wait-for graph to
keep track of the system state.)
Since the processes and resources are scattered all over the
distributed system there doesn't seem to be an obvious answer to the
question of how the wait-for graph should be represented.
One way to handle the problem is to store parts of the graph on each
local system. Let G be the graph. Denote by Gs that part of G which
is stored at site S. Gs has a node for each process (local or not)
that either holds or is waiting for a resource located at S.
G is the union of all the Gs. There can be a cycle in G
even if there is no cycle in any of the Gs's.
- 17.5.2.1 Centralized Approach
In the centralized approach we choose a
deadlock-detection coordinator to accept
copies of the local graphs and construct a union Gc.
The constructed graph Gc is seldom if ever the real wait-for
graph G of the system because the system is constantly in flux
and there is communication lag involved in creating Gc.
Unfortunately, looking at examples we can see it is possible
that Gc will contain false cycles.
Using our distributed global event-ordering methodology (cf.
section 17.1) we can construct an algorithm based on the following
ideas to avoid detecting false cycles:
- When a process Pi at site S requests a resource held by
process Pj at the same site S then the system at site S
inserts an edge [Pi-->Pj] into the local wait-for graph.
- When process Pi at site S1 requests a resource held by
process Pj at a different site S2, Pi generates a
timestamp TS and sends the request and timestamp to S2.
- The system at S1 inserts a labelled edge [TS,Pi-->Pj] into
the local wait-for graph.
- S2 inserts a copy of [TS,Pi-->Pj] into its local wait-for
graph if and only if S2 cannot immediately grant
the requested resource when the request arrives.
- When the deadlock-detection coordinator decides to check
for cycles it sends a "let's do it" message to all the
sites in the system.
- When a site gets the "let's do it" message from the
coordinator, the site sends the coordinator its copy of
the local wait-for graph.
- After the coordinator receives the expected reply from
every site it constructs a "union" graph
like this:
- It makes one vertex for each process found in any
of the local graphs
- Into the graph it puts all edges that have the form
Pi-->Pj, where Pi and Pj reside at the same site.
- If an edge of the form [TS,Pi-->Pj] is found in more
than one local wait-for graph, the coordinator puts
that edge in the constructed union.
The construction has the property that if the constructed
graph has a cycle then the (actual) system is deadlocked.
Also, if the constructed graph has no cycle then the system
was not deadlocked when the "let's do it" message was sent
out.
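A Python sketch of the coordinator's construction and cycle test,
assuming each site reports its local graph as a set of edges: a local
edge is a pair (Pi, Pj) and a labelled cross-site edge is a triple
(TS, Pi, Pj). The names and representation are mine.

    def build_union_graph(local_graphs):
        union, seen_labelled = set(), set()
        for g in local_graphs:
            for edge in g:
                if len(edge) == 2:               # Pi and Pj reside at the same site
                    union.add(edge)
                else:                            # labelled edge (TS, Pi, Pj)
                    if edge in seen_labelled:    # found in more than one local graph
                        union.add(edge[1:])      # keep just Pi --> Pj
                    seen_labelled.add(edge)
        return union

    def has_cycle(edges):
        graph = {}
        for a, b in edges:
            graph.setdefault(a, []).append(b)
        state = {}                               # node -> "visiting" or "done"
        def visit(node):
            if state.get(node) == "visiting":
                return True                      # back edge: a cycle
            if state.get(node) == "done":
                return False
            state[node] = "visiting"
            if any(visit(m) for m in graph.get(node, [])):
                return True
            state[node] = "done"
            return False
        return any(visit(n) for n in list(graph))

If has_cycle(build_union_graph(...)) is True the system really is
deadlocked; if it is False the system was not deadlocked when the
"let's do it" message went out.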
- 17.5.2.2 Fully Distributed Approach
Our text gives an overview of a fully distributed
deadlock-detection scheme published by Obermarck
(R. Obermarck, "Distributed Deadlock Detection Algorithm,"
ACM Transactions on Database Systems, Volume 7,
Number 2 (1982), pp. 187-208).
In a "fully distributed approach" all sites in the
distributed system participate equally in the work of
determining if a deadlock has occurred. Instead of
designating one single process to play the role of the
deadlock-detection coordinator, there is a
deadlock-detection coordinator at every site.
In this scheme the local wait-for graph contains the usual
nodes and edges corresponding to local processes waiting for
local resources.
Additionally, a special node P-ex may be in the local
wait-for graph at any site S. If a process P is waiting for
a resource external to S that is held by a process Q, then
the local wait-for graph at S has an edge of the form
P-->P-ex. Similarly if a process P' at some site other than S
is waiting for a resource held by a process Q' at S, then
the local wait-for graph at S has an edge of the form
P-ex-->Q'.
Obviously the system is deadlocked if some local wait-for
graph contains a cycle that does not involve P-ex.
Obviously the system is not deadlocked if there are no
cycles at all in any of the local wait-for graphs.
If a local wait-for graph has a cycle containing P-ex, but no
cycles without P-ex, then the system may be deadlocked
but further information is required to find out.
Through the use of a numbering scheme, one particular site S
is "elected" to investigate further.
The wait-for graph at S contains a (simple) cycle in which
one of the edges is of the form P-->P-ex. P is waiting for
a resource at a site S'. S sends the information about the
cycle to S' and S' forms the union of the cycle with its
local wait-for graph. S' examines the union. If S' finds a
cycle not containing P-ex, then the system is deadlocked.
If S' only finds a cycle involving P-ex, then it ships the cycle
information off to another site and the algorithm "recurs." Eventually
a site will either find no cycle -- in which case we
conclude there is no deadlock, or find a cycle not
containing P-ex -- in which case we conclude that there
is a deadlock.
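The local test can be summarized as a three-way classification of a
site's wait-for graph. A small Python sketch of that classification,
reusing the has_cycle helper from the sketch in 17.5.2.1:

    def classify_local_graph(edges, ex="P-ex"):
        """edges: the local wait-for graph as a set of (waiter, holder) pairs."""
        without_ex = {(a, b) for (a, b) in edges if a != ex and b != ex}
        if has_cycle(without_ex):
            return "deadlock"                  # a cycle that avoids P-ex
        if has_cycle(edges):
            return "possible deadlock"         # every cycle goes through P-ex; investigate
        return "no deadlock detected locally"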
- Section 17.6 -- Election Algorithms
Many of the schemes described in chapter 17 depend on the existence of
a coordinator performing a service at one of the sites. Section 17.6
provides information about what can be done if the coordinator process
fails (e.g. the computer it runs on crashes).
Specifically, section 17.6 presents election algorithms that can be
used to choose one unique site where a new coordinator will be
started.
In this section we go back to assuming that there is one process at
each site. Also we assume that each site has a unique id number, and
that the coordinator is supposed to be the "living" process with the
highest number. Thus basically the problem is for the "surviving"
processes to collectively determine which of them has the highest
number.
- 17.6.1 The Bully Algorithm
If the coordinator does not respond to a process Pi for a
sufficiently long time then Pi decides to run for the office of
the coordinator.
Pi "starts an election" by sending an "I am running " election
message to every Pj such that j>i. If there are no replies
within a timeout period then Pi assumes the role of coordinator
and sends "I am coordinator" messages to all Pk where k<i.
However if Pi gets an "I'm bigger than you" reply from some Pj
with j>i then Pi waits for some process to send it an "I am
coordinator" message.
If Pi does not get that message before a timeout expires it will
have to start another election. (All higher processes may have
failed.)
Here is a list telling how a process must handle a couple of kinds
of messages:
- If a process Pi gets an "I am coordinator" message from a
process Pj with j>i then Pi should record the information
and try to use Pj as the coordinator in the future.
- If a process Pi gets an "I am running" message from Pk where
k<i then Pi responds to Pk with an "I am bigger than you"
message. Next Pi starts an election, unless it is already
running one.
When a failed process restarts, it starts an election. It will
win and become the coordinator if it has the highest number.
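A toy Python simulation of the bully rule, assuming we can simply see
which processes are alive; the real algorithm discovers this by sending
messages and waiting out timeouts.

    def bully_election(initiator, alive):
        """alive: set of ids of processes that are up (includes initiator).
        Returns the id that ends up announcing "I am coordinator"."""
        pid = initiator
        while True:
            bigger = [q for q in alive if q > pid]   # targets of "I am running"
            if not bigger:
                # No replies within the timeout: pid takes over and sends
                # "I am coordinator" to every smaller id.
                return pid
            # Some bigger process replies "I'm bigger than you" and runs its
            # own election; follow the smallest such process upward.
            pid = min(bigger)

    print(bully_election(initiator=2, alive={1, 2, 3, 5}))   # 5 wins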
- 17.6.2 Ring Algorithm
If a process Pi decides that the coordinator may be down, it
creates an initially empty active list and sends an
elect(i) message to its neighbor on the right. It then adds
itself (i) to its active list.
If Pi gets an elect(j) message from the process on the left:
- If this is the first elect message Pi has received or sent,
Pi creates a new active list, puts i and j in it and sends an
elect(i) message followed by an elect(j) message to the
process on the right;
- otherwise if i != j then Pi adds j to its active list and
passes the elect(j) message to the right;
- otherwise i == j and now Pi has the id numbers of all active
processes. Pi computes the max and in the future tries to
use that process as the coordinator.
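A Python simulation of the ring algorithm on a fixed ring of active
processes, assuming messages are delivered in order. The bookkeeping
follows the three rules above; the names are mine.

    from collections import deque

    def ring_election(ring, initiator):
        """ring: ids of the active processes, in ring order.
        Returns a dict mapping each id to the coordinator it settles on."""
        n = len(ring)
        right = {ring[k]: ring[(k + 1) % n] for k in range(n)}
        active = {p: None for p in ring}        # None: not participating yet
        chosen, pending = {}, deque()

        def send(frm, j):                       # deliver elect(j) to frm's right neighbor
            pending.append((right[frm], j))

        active[initiator] = {initiator}         # the initiator starts the election
        send(initiator, initiator)

        while pending:
            p, j = pending.popleft()            # p receives elect(j) from its left
            if active[p] is None:               # first elect message p has seen
                active[p] = {p, j}
                send(p, p)
                send(p, j)
            elif j != p:                        # someone else's id: record and pass it on
                active[p].add(j)
                send(p, j)
            else:                               # p's own id came all the way around
                chosen[p] = max(active[p])
        return chosen

    print(ring_election([3, 1, 4, 2], initiator=1))   # everyone settles on 4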
- Section 17.7 -- Reaching Agreement
- 17.7.1 Unreliable Communications
- 17.7.2 Faulty Processes
- Section 17.8 -- Summary