(Latest Revision: Fri Dec 6 12:32:57 PST 2002)
Chapter Seventeen
--
Distributed Coordination
--
Lecture Notes
- Introduction
This chapter is about two problems:
- How do we synchronize concurrent processes in a distributed
system?
- How do we handle deadlock in a distributed system?
- Section 17.1 -- Event Ordering
One obvious way to settle competition over resources is to grant requests
in first come first served (FCFS) order.
You can do that if you can tell who asked first.
If all processes share the same system clock, then it's easy to get
that information.
It is more of a challenge in a distributed system.
- 17.1.1 The Happened-Before Relation
How do we determine which of two events in a distributed system
happened first? The logic of it tells us that the following
three laws are true for any three events X, Y, and Z:
- Local "before" is global "before": If X and Y are
events in the same (single-threaded) process, and X was
executed before Y (going by the local system clock), then
X-->Y
(X happened before Y globally -- within the context of the
entire distributed system.)
- Causality: If X is the event of sending a message by
one process and Y is the event of receiving that message by
another process, then X-->Y.
- Transitivity: If X-->Y and Y-->Z
then X-->Z.
If we can establish that E1-->E2 by using the
rules above then it is possible that event E1 affected event E2
causally.
On the other hand it is possible that two events are simply not
related by the happened-before relation. In that case
neither event could have affected the other causally. (Neither
could have executed when information about the other was
available.) Such pairs of events are said to be concurrent
events.
This is illustrated by figure 17.1 on page 597. The figure
depicts three processes and some messages going between
processes. We can use rules 1-3 to conclude that, for example,
p0 happened before r4. However, there is no path in the graph
from q0 to p2, or from p2 to q0, so those two events are
concurrent.
Suppose we want to order all the events in figure 17.1 on a
single timeline. Suppose we want all the processes to agree to
act as though that timeline indicates the actual order in which
all the events happened.
We must make the timeline "respect" the happened-before
relation, else we cannot guarantee that on the timeline events
come after the events that caused them.
On the other hand, we are free to arbitrarily choose the relative
order of two concurrent events. Whichever order we choose, it
will be consistent with a possible reality -- a way things
could have happened. There will be no paradoxical logical
consequences.
- 17.1.2 Implementation
This section is about how we can create the kind of global timeline
discussed above. (To keep the details of this discussion simple, we
assume that there is just one process running at each site -- each
processor -- in the distributed system.)
- Each process Pi starts with a counter Ci initialized to
zero.
- Each Pi increments Ci after each significant event X and
assigns the timestamp Ci to event X.
- When a process Pm sends a message it puts the current value
Cm of its counter in the message.
- When a process Pi receives a message it checks the counter
value Cm in the message. Pi sets its own counter Ci to
1+max(Cm,Ci). That new value becomes the timestamp of the
event of receiving the message.
The overall effect is that if X and Y are any two events
throughout the global distributed system, and if X-->Y
then the timestamp of X is less than the timestamp of Y.
Note that according to this scheme we can assign a timestamp to
each and every event of possible interest. Together with the
tie-breaking rule below, the timestamping procedure therefore
implements a global timeline.
Given any pair W, Z of concurrent events, each will have a
timestamp that defines its position on the timeline. Thus the
timestamping procedure chooses a relative ordering for each
concurrent pair. All these choices are consistent with
causality. Therefore the timeline produced is a possible
reality.
If two events have the same timestamp then we can break the tie
using the process id numbers.
All in all then, we have a way to impose a total ordering of all
events. Given any two events, we can say that one of them is
"earlier" than the other and we can determine which is which just
by looking at timestamps (and pid's if necessary to break a tie).
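The counter scheme above is small enough to sketch directly. Here is a
minimal Python sketch (mine, not from the text) of one process's counter;
it follows the common convention of incrementing the counter before
stamping an event, which gives the same ordering property.

    class LamportClock:
        """Per-process counter Ci used to timestamp events (section 17.1.2)."""

        def __init__(self, pid):
            self.pid = pid        # used only to break timestamp ties
            self.counter = 0      # Ci starts at zero

        def local_event(self):
            # Bump the counter and use the new value as the event's timestamp.
            self.counter += 1
            return (self.counter, self.pid)

        def send_event(self):
            # Sending is an event; the counter value travels with the message.
            return self.local_event()

        def receive_event(self, msg_counter):
            # Ci becomes 1 + max(Cm, Ci); that value stamps the receive event.
            self.counter = max(msg_counter, self.counter) + 1
            return (self.counter, self.pid)

Comparing the (counter, pid) pairs lexicographically gives exactly the
total order described above: counters respect happened-before and pids
break ties.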
- Section 17.2 -- Mutual Exclusion
This section discusses three methods of implementing mutual exclusion
in a distributed system. All three algorithms achieve mutual
exclusion, satisfy bounded waiting, and are deadlock- and
starvation-free. All three violate the progress requirement.
- 17.2.1 Centralized Approach
There is a mutex coordinator. When a client wants exclusive
access it sends a request message to the coordinator. After the
coordinator replies the client may enter its critical section
(CS). After the client leaves the CS it sends a release message
to the coordinator.
The coordinator maintains a FIFO queue of clients. (A client is
at the front of the queue while it executes in its CS.) When the
coordinator receives a request from a client it puts the client
in the queue. If the queue now has one element the coordinator
sends a reply to the client. When the coordinator receives a
release message it removes the front client from the queue. If
the queue is not empty now, it sends a reply to the client at the
front of the queue.
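A minimal Python sketch of the coordinator's bookkeeping just described;
the send callback is a stand-in (my assumption) for whatever messaging
layer the system actually uses.

    from collections import deque

    class MutexCoordinator:
        def __init__(self, send):
            self.queue = deque()    # FIFO queue; the front client is in its CS
            self.send = send        # assumed callback: send(client, message)

        def on_request(self, client):
            self.queue.append(client)
            if len(self.queue) == 1:          # nobody else waiting or in the CS
                self.send(client, "reply")    # client may enter its CS

        def on_release(self, client):
            self.queue.popleft()              # the releasing client was at the front
            if self.queue:
                self.send(self.queue[0], "reply")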
This simple method requires only three messages per access --
request, reply, release.
The disadvantage is that the coordinator can become a bottleneck.
If the coordinator fails "we're in trouble" until the remaining
processes detect the failure and "elect" a new coordinator (see
section 17.6).
- 17.2.2 Fully Distributed Approach
- When a process Pi wants to enter its CS it generates
timestamp TSi and sends request(Pi,TSi) to all
peers.
- When Pi has received a reply message from all peers
it may enter its CS.
- If a process P receives a request(Pj,TSj):
- If P is in its CS P queues and defers the reply.
- If P is not in its CS and not interested it replies
immediately.
- If P is not in its CS but desires to be in its CS then
if P's request timestamp TS>TSj (Pj asked first) then P
sends a reply immediately, else P queues and defers its
reply.
- When a process P leaves the CS it replies to all deferred
requests.
This algorithm requires 2*(N-1) messages per access to the CS.
For this algorithm to work all the processes have to know about
each other. There must be "introductions all around" when a new
process joins the group.
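The heart of this algorithm is the rule a process uses, on receiving
request(Pj,TSj), to decide whether to reply at once or defer. A hedged
Python sketch of just that decision (the function name and parameters
are mine):

    def should_reply_now(in_cs, wants_cs, my_ts, my_pid, req_ts, req_pid):
        """True: send the reply immediately.  False: queue and defer it."""
        if in_cs:
            return False                  # defer until we leave our CS
        if not wants_cs:
            return True                   # not interested, reply immediately
        # Both processes want the CS: the older request (smaller timestamp,
        # with pid as a tie-breaker as in section 17.1) goes first.
        return (req_ts, req_pid) < (my_ts, my_pid)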
- 17.2.3 Token-Passing Approach
Pass a token around a logical ring -- the process with the token
may enter the CS.
If the token is lost the processes can hold an election to
generate a new token.
If a process fails, the remaining processes can hold an election
to form a new ring.
This algorithm is efficient when processes "almost always" want to
enter the CS -- there is one message per access if all processes
always want to enter.
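A toy Python simulation of the token-passing idea, run inside one
program; a real implementation would pass the token in network messages
around the logical ring.

    def simulate_token_ring(wants_cs, laps=2):
        """wants_cs: one boolean per process, in ring order."""
        n = len(wants_cs)
        holder = 0                           # index of the process holding the token
        for _ in range(laps * n):
            if wants_cs[holder]:
                print(f"P{holder} enters and then leaves its critical section")
            holder = (holder + 1) % n        # pass the token to the right neighbor

    simulate_token_ring([True, False, True])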
- Section 17.3 -- Atomicity
- 17.3.1 The Two-Phase Commit Protocol
The 2PC is one way to assure that a transaction commits on all
involved sites in a distributed system, or aborts on all sites.
Typically each site S has a coordinator C which will be in charge
of a transaction initiated at S. C takes care of assigning other
sites to perform parts of the transaction. The participating
sites perform the required actions but defer writing their
records to their logs. (See discussion of log-based
recovery in chapter 7. Assume all logs mentioned below are on
stable storage.) Instead of committing (or aborting) their
subset of the transaction they send a message to C informing it
that they have "completed" their part of the transaction.
Phase 1: The site coordinator C then puts a <prepare
T> record in its log. Then C sends a prepare(T)
message to all the sites where T executed.
Sites may decide to abort or commit. If abort, they write <no
T> to the log and then send abort(T) to C. If commit,
they put <ready T> in the log and send a ready(T)
message to C. This is a "solemn promise" to obey C when it gives
the order to commit or abort. (A marriage is similar to a 2PC.
The parties say "I do" instead of "ready.")
Phase 2: If C gets ready(T) messages from all the
sites within a certain time-out period it writes <commit T>
to log and sends a commit(T) message to all sites.
Otherwise C writes <abort T> to log and sends an
abort(T) message to all sites.
The receiving sites then obediently write the commit or abort
record to their logs and then commit or abort T accordingly.
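A skeletal Python sketch of the coordinator's two phases. The log, send,
and collect_replies calls are placeholders (my assumptions) for
stable-storage logging and messaging, and the timeout handling is
reduced to a single collection step.

    def two_phase_commit(T, sites, send, collect_replies, log):
        # Phase 1: ask every participating site to prepare.
        log(f"<prepare {T}>")
        for s in sites:
            send(s, ("prepare", T))
        replies = collect_replies(T, timeout=5.0)   # e.g. {site: "ready" or "abort"}

        # Phase 2: commit only if every site answered ready within the timeout.
        if len(replies) == len(sites) and all(r == "ready" for r in replies.values()):
            decision = "commit"
        else:
            decision = "abort"
        log(f"<{decision} {T}>")
        for s in sites:
            send(s, (decision, T))
        return decision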
- 17.3.2 Failure Handling in 2PC
- 17.3.2.1 Failure of a Participating Site
If upon recovery a participating site Sk finds a <commit
T> record in the log then Sk executes redo(T). (The
reason: all "intentions" of the transaction will be in the
log but possibly some changes to the target data did not get
flushed to stable storage before the failure.)
Sk performs undo(T) if there is an <abort T> record in
the log.
If Sk contains a <ready T> record but no <commit
T> or <abort T> then Sk promised to obey the
directive of the coordinator C. We need to find out what C
said to do. If C can tell us, then this is handled as one
of the two cases above (same as finding a commit or abort
record.)
If C cannot answer then Sk may poll all other sites to see
if any of them committed or aborted T. If it finds one that
committed it executes redo(T). If it finds one that
aborted, it executes undo(T).
If it cannot get the information immediately Sk must ask the
other sites from time to time until one answers. (At least
C should be able to answer eventually.)
If there is no <abort T>, <commit T> or
<ready T> record then Sk did not send a
ready(T) to C. In this case Sk makes sure
never to send a ready(T) to C. Sk performs an
undo(T). (C will eventually tell all the sites to abort --
it may have done so already.)
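The case analysis above amounts to a small decision procedure on Sk's
log. A Python sketch with hypothetical helper functions standing in for
the two queries described:

    def recover_participant(T, log_records, ask_coordinator, ask_other_sites):
        """log_records: the records for T found in Sk's log, e.g. {"ready"}."""
        if "commit" in log_records:
            return "redo(T)"                  # re-apply T's logged changes
        if "abort" in log_records:
            return "undo(T)"
        if "ready" in log_records:
            # Sk promised to obey C; find out what was decided.
            decision = ask_coordinator(T) or ask_other_sites(T)
            if decision is None:
                return "ask again later"      # keep polling until someone answers
            return "redo(T)" if decision == "commit" else "undo(T)"
        # No record at all: Sk never sent ready(T), so it is safe to abort.
        return "undo(T)"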
- 17.3.2.2 Failure of the Coordinator
If one of the participating sites contains a <commit
T> in its log then T must be committed everywhere. (C
decided to commit.)
If one of the participating sites contains an <abort T>
in its log then T must be aborted everywhere. (C decided to
abort.)
If a participating site Sk does not contain a
<ready T> in its log then Sk can't have sent a
ready(T) message to C, so C can't have decided to
commit. C may have decided to abort. Whether or not C
decided to abort, everything will be consistent if Sk now
decides to abort. In this case Sk decides to never
send a ready(T) message to C.
The only remaining possibility is that all active sites have
a <ready T> in their logs and no <commit T> or
<abort T>. In that case it is impossible to tell
whether C has or will decide to commit or abort. The sites
have "solemnly promised" to obey C's decision. There's no
choice but to wait for C to recover. Problem: the pending
transaction may tie up resources.
- 17.3.2.3 Failure of the Network
If the link between Sk and Si fails, then Sk can take the
same actions it would take if Si had failed, and conversely
for Si.
- Section 17.4 -- Concurrency Control
- 17.4.1 Locking Protocols
Recall the following points that were made in chapter 7 (Process
Synchronization).
- When we want to insure atomicity of some transaction, we may
not have to treat the whole section as a single critical
section to be protected by a single lock or semaphore.
- It is enough to insure serializability -- to insure that
when two transactions execute concurrently the effect
on the data is the same as if one transaction was
carried out completely first and then the other.
- If we use a lock for each data item and require transactions
to follow a locking protocol, we can ensure serializability.
- The so-called two-phase locking protocol may be used.
- The transaction may obtain but not release locks during
the growing phase.
- The transaction may release locks but not obtain any new
locks during the shrinking phase.
- The two-phase locking protocol ensures conflict
serializability but does not ensure freedom from
deadlock.
- There are conflict-serializable schedules that cannot
be obtained through two-phase locking.
We can use the two-phase locking protocol in a distributed
environment. However, we have to consider how the lock manager
will function in the distributed system. Five different schemes
are presented.
- 17.4.1.1 Non-replicated Scheme
If there is no replicated data it is simple to use a lock
manager at each site. A process executing a transaction may
lock data at various sites simply by communicating with the
respective lock managers using a request-wait-grant-release
paradigm.
This approach makes handling deadlock more complicated. (See
section 17.5)
- 17.4.1.2 Single-Coordinator Approach
If there is one system-wide lock manager then administering
locks on replicated data is no great task. Deadlock can be
handled as on a centralized system. However the lock
manager can be a bottleneck and a single point of failure.
A compromise is to have a multiplicity of lock managers.
Each manager is responsible for locks on only some of
the data. Arrange it so that all replicas of any
particular datum are managed by the same lock
manager. That way a process has to talk to only one manager
to lock any particular piece of data.
This approach makes handling deadlock more complicated. (See
section 17.5)
- 17.4.1.3 Majority Protocol
Put a lock manager at each site, responsible for all the
data at that site (only).
A transaction sends lock requests to more than half of the
managers of replicas of the desired data. The transaction has to
wait until a majority of the lock managers have granted the
lock request.
The scheme otherwise has the standard
request-wait-grant-release pattern.
The implementation is more complex than the single-coordinator
scheme: 2*((N/2)+1) messages are required to acquire a lock and
(N/2)+1 messages to release it, where N is the number of replicas.
(The formulas use "integer division.")
This approach makes handling deadlock more complicated. (See
section 17.5) Deadlock can happen even if processes are only
trying to lock one datum. For example, with four replicas, two
transactions could each obtain locks on two of the replicas and
then wait forever for a third.
- 17.4.1.4 Biased Protocol
This scheme is like the majority protocol except that shared
and exclusive locks are handled differently.
To get a shared lock all you need is permission from the
manager of one of the replicas.
To get an exclusive lock you must get permission from the
managers of all the replicas.
Low overhead on reading but high on writing.
This approach makes handling deadlock more complicated. (See
section 17.5)
- 17.4.1.5 Primary Copy
Designate a primary copy of each datum.
To get a lock, just get permission from the manager of the
primary copy.
Simple design.
If the manager at the primary site fails then we can't lock
the data -- even if replicas are still accessible.
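The three replicated-data schemes (majority, biased, primary copy)
differ mainly in how many replica managers must grant a lock. A small
Python summary of that difference (my own tabulation, with N the number
of replicas):

    def managers_needed(scheme, mode, n_replicas):
        """How many replica lock managers must grant the request."""
        if scheme == "majority":
            return n_replicas // 2 + 1            # more than half, read or write
        if scheme == "biased":
            return 1 if mode == "shared" else n_replicas
        if scheme == "primary copy":
            return 1                              # only the primary copy's manager
        raise ValueError(scheme)

    for scheme in ("majority", "biased", "primary copy"):
        for mode in ("shared", "exclusive"):
            print(scheme, mode, managers_needed(scheme, mode, 4))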
- 17.4.2 Timestamping
- 17.4.2.1 Generation of Unique Timestamps
The discussion in this part of the text is a little confusing.
What it boils down to is that we can use the event-ordering
scheme that was developed in section 17.1.
- 17.4.2.2 Timestamp-Ordering Scheme
This section is basically an exercise in the text which I am
skipping this term in the interest of saving time :-).
- Section 17.5 -- Deadlock Handling
- 17.5.1 Deadlock Prevention
In a distributed system:
- We can perform the resource ordering deadlock prevention
scheme by defining a global ordering of the resources. This
algorithm is simple and has low overhead. However it requires
that we get all sites and processes to agree on the resource
ordering.
- If we designate one process to be the banker, we can perform
deadlock avoidance using the banker's algorithm, but the banker
process would likely become a severe bottleneck.
This section develops new methods of deadlock prevention based on
numbering schemes similar to the resource ordering scheme. The
difference is that the new methods do not depend on "compliance" on
the part of the participating processes.
The Wait-Die Scheme: Each process gets a unique timestamp
before it starts to execute. A younger process that attempts
to wait for a resource held by an older process is rolled
back. (The young process "dies.") This implies that age decreases
monotonically along all chains in the wait-for graph of the system.
Therefore there can be no cycles in the wait-for graph.
The Wound-Wait Scheme: This algorithm is similar to wait-die,
except it works like this: if an older process attempts to
wait for a resource held by a younger process the younger
process is rolled back and the older process gets the resource. (The
young process is "wounded.") In this scheme, age increases
monotonically along all chains in the wait-for graph.
When processes are rolled back and restarted, they keep their old
timestamps. This prevents starvation.
Wound-wait preempts resources, but wait-die does not.
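Both schemes reduce to a timestamp comparison made when a process
requests a resource that another process holds. A short Python sketch of
the two rules (smaller timestamp = older process):

    def wait_die(requester_ts, holder_ts):
        # Only older processes wait; a younger requester is rolled back ("dies").
        return "wait" if requester_ts < holder_ts else "roll back requester"

    def wound_wait(requester_ts, holder_ts):
        # An older requester preempts ("wounds") the holder; a younger one waits.
        return "roll back holder" if requester_ts < holder_ts else "wait"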
In the wait-die scheme, after a young process Y dies because it
tried to wait on an old process O, the operating system is likely to
quickly restart Y, whereupon Y is likely to try to wait on O again.
Thus, time and resources may be wasted by repeated roll-back and
restart.
In the wound-wait scheme, suppose a young process Y is rolled back
because an older process O wounds it. When Y is restarted it may
immediately attempt to acquire the resource O took from it. If O
still holds the resource then Y will be allowed to wait for it.
Thus, depending on other conditions, wound-wait may result in fewer
rollbacks than wait-die.
We may quibble over which scheme causes more rollbacks. However,
any rollback at all is a significant problem.
Processing time is lost, and it is difficult to decide
algorithmically what to do with stateful resources held by the
process that is rolled back.
- 17.5.2 Deadlock Detection
To eliminate unnecessary rollbacks and preemption of resources we
can utilize a deadlock detection algorithm.
(To keep details in this discussion simple, we assume there is just
one instance of each resource type, and we use a wait-for graph to
keep track of the system state.)
Since the processes and resources are scattered all over the
distributed system there doesn't seem to be an obvious answer to the
question of how the wait-for graph should be represented.
One way to handle the problem is to store parts of the graph on each
local system. Let G be the graph. Denote by Gs that part of G which
is stored at site S. Gs has a node for each process (local or not)
that either holds or is waiting for a resource located at S.
G is the union of all the Gs. There can be a cycle in G
even if there is no cycle in any of the Gs's.
- 17.5.2.1 Centralized Approach
In the centralized approach we choose a
deadlock-detection coordinator to accept
copies of the local graphs and construct a union Gc.
The constructed graph Gc is seldom if ever the real wait-for
graph G of the system because the system is constantly in flux
and there is communication lag involved in creating Gc.
Unfortunately, looking at examples we can see it is possible
that Gc will contain false cycles.
Using our distributed global event-ordering methodology (cf.
section 17.1) we can construct an algorithm based on the following
ideas to avoid detecting false cycles:
- When a process Pi at site S requests a resource held by
process Pj at the same site S then the system at site S
inserts an edge [Pi-->Pj] into the local wait-for graph.
- When process Pi at site S1 requests a resource held by
process Pj at a different site S2, Pi generates a
timestamp TS and sends the request and timestamp to S2.
- The system at S1 inserts a labelled edge [TS,Pi-->Pj] into
the local wait-for graph.
- S2 inserts a copy of [TS,Pi-->Pj] into its local wait-for
graph if and only if S2 cannot immediately grant
the requested resource when the request arrives.
- When the deadlock-detection coordinator decides to check
for cycles it sends a "let's do it" message to all the
sites in the system.
- When a site gets the "let's do it" message from the
coordinator, the site sends the coordinator its copy of
the local wait-for graph.
- After the coordinator receives the expected reply from
every site it constructs a "union" graph
like this:
- It makes one vertex for each process found in any
of the local graphs
- Into the graph it puts all edges that have the form
Pi-->Pj, where Pi and Pj reside at the same site.
- If an edge of the form [TS,Pi-->Pj] is found in more
than one local wait-for graph, the coordinator puts
that edge in the constructed union.
The construction has the property that if the constructed
graph has a cycle then the (actual) system is deadlocked.
Also, if the constructed graph has no cycle then the system
was not deadlocked when the "let's do it" message was sent
out.
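A Python sketch of the coordinator's construction and cycle test,
assuming each site reports its local graph as a set of edges: a local
edge is a pair (Pi, Pj) and a labelled cross-site edge is a triple
(TS, Pi, Pj). The names and representation are mine.

    def build_union_graph(local_graphs):
        union, seen_labelled = set(), set()
        for g in local_graphs:
            for edge in g:
                if len(edge) == 2:               # Pi and Pj reside at the same site
                    union.add(edge)
                else:                            # labelled edge (TS, Pi, Pj)
                    if edge in seen_labelled:    # found in more than one local graph
                        union.add(edge[1:])      # keep just Pi --> Pj
                    seen_labelled.add(edge)
        return union

    def has_cycle(edges):
        graph = {}
        for a, b in edges:
            graph.setdefault(a, []).append(b)
        state = {}                               # node -> "visiting" or "done"
        def visit(node):
            if state.get(node) == "visiting":
                return True                      # back edge: a cycle
            if state.get(node) == "done":
                return False
            state[node] = "visiting"
            if any(visit(m) for m in graph.get(node, [])):
                return True
            state[node] = "done"
            return False
        return any(visit(n) for n in list(graph))

If has_cycle(build_union_graph(...)) is True the system really is
deadlocked; if it is False the system was not deadlocked when the
"let's do it" message went out.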
- 17.5.2.2 Fully Distributed Approach
Our text gives an overview of a fully distributed
deadlock-detection scheme published by Obermarck
(R. Obermarck, "Distributed Deadlock Detection Algorithm,"
ACM Transactions on Database Systems, Volume 7,
Number 2 (1982), pp. 187-208).
In a "fully distributed approach" all sites in the
distributed system participate equally in the work of
determining if a deadlock has occurred. Instead of
designating one single process to play the role of the
deadlock-detection coordinator, there is a
deadlock-detection coordinator at every site.
In this scheme the local wait-for graph contains the usual
nodes and edges corresponding to local processes waiting for
local resources.
Additionally, a special node P-ex may be in the local
wait-for graph at any site S. If a process P is waiting for
a resource external to S that is held by a process Q, then
the local wait-for graph at S has an edge of the form
P-->P-ex. Similarly if a process P' at some site other than S
is waiting for a resource held by a process Q' at S, then
the local wait-for graph at S has an edge of the form
P-ex-->Q'.
Obviously the system is deadlocked if some local wait-for
graph contains a cycle that does not involve P-ex.
Obviously the system is not deadlocked if there are no
cycles at all in any of the local wait-for graphs.
If a local wait-for graph has a cycle containing P-ex, but no
cycles without P-ex, then the system may be deadlocked
but further information is required to find out.
Through the use of a numbering scheme, one particular site S
is "elected" to investigate further.
The wait-for graph at S contains a (simple) cycle in which
one of the edges is of the form P-->P-ex. P is waiting for
a resource at a site S'. S sends the information about the
cycle to S' and S' forms the union of the cycle with its
local wait-for graph. S' examines the union. If S' finds a
cycle not containing P-ex, then the system is deadlocked.
If S' only finds a cycle involving P-ex, then it ships the cycle
information off to another site and the algorithm "recurs." Eventually
a site will either find no cycle -- in which case we
conclude there is no deadlock, or find a cycle not
containing P-ex -- in which case we conclude that there
is a deadlock.
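The local test can be summarized as a three-way classification of a
site's wait-for graph. A small Python sketch of that classification,
reusing the has_cycle helper from the sketch in 17.5.2.1:

    def classify_local_graph(edges, ex="P-ex"):
        """edges: the local wait-for graph as a set of (waiter, holder) pairs."""
        without_ex = {(a, b) for (a, b) in edges if a != ex and b != ex}
        if has_cycle(without_ex):
            return "deadlock"                  # a cycle that avoids P-ex
        if has_cycle(edges):
            return "possible deadlock"         # every cycle goes through P-ex; investigate
        return "no deadlock detected locally"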
- Section 17.6 -- Election Algorithms
Many of the schemes described in chapter 17 depend on the existence of
a coordinator performing a service at one of the sites. Section 17.6
provides information about what can be done if the coordinator process
fails (e.g. the computer it runs on crashes).
Specifically, section 17.6 presents election algorithms that can be
used to choose one unique site where a new coordinator will be
started.
In this section we go back to assuming that there is one process at
each site. Also we assume that each site has a unique id number, and
that the coordinator is supposed to be the "living" process with the
highest number. Thus basically the problem is for the "surviving"
processes to collectively determine which of them has the highest
number.
- 17.6.1 The Bully Algorithm
If the coordinator does not respond to a process Pi for a
sufficiently long time then Pi decides to run for the office of
the coordinator.
Pi "starts an election" by sending an "I am running " election
message to every Pj such that j>i. If there are no replies
within a timeout period then Pi assumes the role of coordinator
and sends "I am coordinator" messages to all Pk where k<i.
However if Pi gets an "I'm bigger than you" reply from some Pj
with j>i then Pi waits for some process to send it an "I am
coordinator" message.
If Pi does not get that message before a timeout expires it will
have to start another election. (All higher processes may have
failed.)
Here is a list telling how a process must handle a couple of kinds
of messages:
- If a process Pi gets an "I am coordinator" message from a
process Pj with j>i then Pi should record the information
and try to use Pj as the coordinator in the future.
- If a process Pi gets an "I am running" message from Pk where
k<i then Pi responds to Pk with an "I am bigger than you"
message. Next Pi starts an election, unless it is already
running one.
When a failed process restarts, it starts an election. It will
win and become the coordinator if it has the highest number.
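A toy Python simulation of the bully rule, assuming we can simply see
which processes are alive; the real algorithm discovers this by sending
messages and waiting out timeouts.

    def bully_election(initiator, alive):
        """alive: set of ids of processes that are up (includes initiator).
        Returns the id that ends up announcing "I am coordinator"."""
        pid = initiator
        while True:
            bigger = [q for q in alive if q > pid]   # targets of "I am running"
            if not bigger:
                # No replies within the timeout: pid takes over and sends
                # "I am coordinator" to every smaller id.
                return pid
            # Some bigger process replies "I'm bigger than you" and runs its
            # own election; follow the smallest such process upward.
            pid = min(bigger)

    print(bully_election(initiator=2, alive={1, 2, 3, 5}))   # 5 wins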
- 17.6.2 Ring Algorithm
If a process Pi decides that the coordinator may be down, it
creates an initially empty active list and sends an
elect(i) message to its neighbor on the right. It then adds
itself (i) to its active list.
If Pi gets an elect(j) message from the process on the left:
- If this is the first elect message Pi has received or sent,
Pi creates a new active list, puts i and j in it and sends an
elect(i) message followed by an elect(j) message to the
process on the right;
- otherwise if i != j then Pi adds j to its active list and
passes the elect(j) message to the right;
- otherwise i == j and now Pi has the id numbers of all active
processes. Pi computes the max and in the future tries to
use that process as the coordinator.
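A Python simulation of the ring algorithm on a fixed ring of active
processes, assuming messages are delivered in order. The bookkeeping
follows the three rules above; the names are mine.

    from collections import deque

    def ring_election(ring, initiator):
        """ring: ids of the active processes, in ring order.
        Returns a dict mapping each id to the coordinator it settles on."""
        n = len(ring)
        right = {ring[k]: ring[(k + 1) % n] for k in range(n)}
        active = {p: None for p in ring}        # None: not participating yet
        chosen, pending = {}, deque()

        def send(frm, j):                       # deliver elect(j) to frm's right neighbor
            pending.append((right[frm], j))

        active[initiator] = {initiator}         # the initiator starts the election
        send(initiator, initiator)

        while pending:
            p, j = pending.popleft()            # p receives elect(j) from its left
            if active[p] is None:               # first elect message p has seen
                active[p] = {p, j}
                send(p, p)
                send(p, j)
            elif j != p:                        # someone else's id: record and pass it on
                active[p].add(j)
                send(p, j)
            else:                               # p's own id came all the way around
                chosen[p] = max(active[p])
        return chosen

    print(ring_election([3, 1, 4, 2], initiator=1))   # everyone settles on 4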
- Section 17.7 -- Reaching Agreement
- 17.7.1 Unreliable Communications
- 17.7.2 Faulty Processes
- Section 17.8 -- Summary