(Latest Revision: Sun Sep 19, 2005)
Chapter Eighteen
--
Distributed Coordination
--
Lecture Notes
- Introduction
This chapter is about two problems:
- How do we synchronize concurrent processes in a distributed
system?
- How do we handle deadlock in a distributed system?
- Section 18.1 -- Event Ordering
One obvious way to settle competition over resources is to grant requests
in first come first served (FCFS) order.
You can do that if you can tell who asked first.
If all processes share the same system clock, then it's easy to get
that information.
It is more of a challenge in a distributed system.
- 18.1.1 The Happened-Before Relation
How do we determine which of two events in a distributed system
happened first? Simple reasoning tells us that the following
three rules hold for any three events X, Y, and Z:
- Local "before" is global "before": If X and Y are
events in the same (single-threaded) process, and X was
executed before Y (going by the local system clock), then
X-->Y
(X happened before Y globally -- within the context of the
entire distributed system.)
- Causality: If X is the event of sending a message by
one process and Y is the event of receiving that message by
another process, then X-->Y.
- Transitivity: If X-->Y and Y-->Z
then X-->Z.
If we can establish that
E1-->E2 by using the rules above then
it is possible that event E1 affected event E2
causally.
On the other hand it is possible that two events are simply not
related by the happened-before relation. In that case
neither event could have affected the other causally. (Neither
could have executed when information about the other was
available.) Such pairs of events are said to be concurrent
events.
This is illustrated by figure 18.1 on page 665. The figure depicts
three processes and some messages going between processes. We can
use rules 1-3 to conclude that, for example, p0 happened before r4.
However, there is no path in the graph from q0 to p2, or from p2 to
q0.
Suppose we want to order all the events in figure 18.1 on a single
timeline. Suppose we want all the processes to agree to act as
though that timeline indicates the actual order in which all the
events happened.
We must make the timeline "respect" the happened-before
relation, else we cannot guarantee that on the timeline events
come after the events that caused them.
On the other hand, we are free to arbitrarily choose the relative
order of two concurrent events. Whichever order we choose, it
will be consistent with a possible reality -- a way things
could have happened. There will be no paradoxical logical
consequences.
- 18.1.2 Implementation
This section is about how we can create the kind of global timeline
discussed above. (To keep the details of this discussion simple, we
assume that there is just one process running at each site -- each
processor -- in the distributed system.)
- Each process Pi starts with a counter Ci
initialized to zero.
- Each Pi increments Ci after each
significant event X and assigns the timestamp Ci to
event X.
- When a process Pm sends a message it puts the
current value Cm of its counter in the message.
- When a process Pi receives a message it checks the
counter value Cm in the message. Pi sets
the value of the counter to
C=1+max(Cm,Ci). C becomes the timestamp
of the event of receiving the message.
The overall effect is that if X and Y are any two events throughout
the global distributed system, and if X-->Y then the
timestamp of X is less than the timestamp of Y.
Note that according to this scheme we can assign a unique
timestamp to each and every event of possible interest.
Therefore, the timestamping procedure implements a global
timeline.
Given any pair W, Z of concurrent events, each will have a timestamp
that defines its position on the timeline. Thus the timestamping
procedure chooses a relative ordering for each concurrent pair. All
these choices are consistent with causality. Therefore the timeline
produced is a possible reality.
If two events have the same timestamp then we can break the tie
using the process id numbers.
All in all then, we have a way to impose a total ordering of all
events. Given any two events, we can say that one of them is
"earlier" than the other and we can determine which is which just
by looking at timestamps (and pid's if necessary to break a tie).
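The timestamping scheme above can be sketched in a few lines of Python. This is a minimal illustration, not code from the text: the class and method names are invented, and the send event is treated as a significant event that increments the counter before the message is stamped.

```python
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.clock = 0                  # Ci starts at zero

    def local_event(self):
        self.clock += 1                 # increment for each significant event
        return (self.clock, self.pid)   # pid breaks timestamp ties

    def send(self):
        self.clock += 1                 # the send is itself an event
        return self.clock               # Cm travels in the message

    def receive(self, msg_clock):
        # C = 1 + max(Cm, Ci) becomes the timestamp of the receive event.
        self.clock = 1 + max(msg_clock, self.clock)
        return (self.clock, self.pid)

p0, p1 = Process(0), Process(1)
cm = p0.send()                     # event X on P0
y = p1.receive(cm)                 # event Y on P1, so X --> Y
assert (cm, p0.pid) < y            # X's timestamp precedes Y's
```

Comparing (timestamp, pid) pairs lexicographically gives exactly the total order described above: timestamps first, pids to break ties.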
- Section 18.2 -- Mutual Exclusion
This section discusses three methods of implementing mutual exclusion in
a distributed system. All three algorithms achieve mutual
exclusion, satisfy bounded waiting, and are deadlock- and
starvation-free. All three violate the progress requirement.
- 18.2.1 Centralized Approach
There is a mutex coordinator. When a client wants exclusive
access it sends a request message to the coordinator. After the
coordinator replies the client may enter its critical section
(CS). After the client leaves the CS it sends a release message
to the coordinator.
The coordinator maintains a FIFO queue of clients. (A client is
at the front of the queue while it executes in its CS.) When the
coordinator receives a request from a client it puts the client
in the queue. If the queue now has one element the coordinator
sends a reply to the client. When the coordinator receives a
release message it removes the front client from the queue. If
the queue is not empty now, it sends a reply to the client at the
front of the queue.
This simple method requires only three messages per access --
request, reply, release.
The disadvantage is that the coordinator can become a bottleneck.
If the coordinator fails "we're in trouble" -- however the remaining
processes can "elect" a new coordinator.
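The coordinator's queue logic might be sketched as follows. This is an illustrative outline only: the send(pid, msg) callback is a stand-in for real message delivery, which the text does not specify.

```python
from collections import deque

class Coordinator:
    def __init__(self, send):
        self.queue = deque()    # front client is the one in its CS
        self.send = send        # hypothetical message-delivery callback

    def on_request(self, pid):
        self.queue.append(pid)
        if len(self.queue) == 1:          # no one waiting or in the CS
            self.send(pid, "reply")       # pid may enter its CS

    def on_release(self, pid):
        self.queue.popleft()              # the releasing client leaves
        if self.queue:
            self.send(self.queue[0], "reply")

sent = []
c = Coordinator(lambda pid, msg: sent.append((pid, msg)))
c.on_request(1); c.on_request(2)   # only client 1 gets a reply
c.on_release(1)                    # now client 2 gets its reply
# sent == [(1, "reply"), (2, "reply")]
```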
- 18.2.2 Fully Distributed Approach
- When a process Pi wants to enter its CS it generates
timestamp TSi and sends request(Pi,
TSi) to all peers.
- When Pi has received a reply message from all
peers it may enter its CS.
- If a process P receives a request(Pj,
TSj):
- If P is in its CS, it queues and defers the reply.
- If P is not in its CS and not interested, it replies
immediately.
- If P is not in its CS but wants to enter it, then if
P's request timestamp TS>TSj (Pj
asked first) P sends a reply immediately; otherwise P
queues and defers its reply.
- When a process P leaves the CS it replies to all deferred
requests.
This algorithm requires 2*(N-1) messages per access to the CS.
For this algorithm to work all the processes have to know about
each other. There must be "introductions all around" when a new
process joins the group.
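The heart of this algorithm is the decision a process makes when a request arrives. Here is a sketch of just that decision; the state names (in_cs, wants_cs, my_ts) are invented for illustration, and ties between equal timestamps are broken by pid as in section 18.1.

```python
def should_defer(in_cs, wants_cs, my_ts, my_pid, req_ts, req_pid):
    """Return True if P must queue and defer the reply to
    request(Pj, TSj); False if P replies immediately."""
    if in_cs:
        return True                       # reply only after leaving the CS
    if not wants_cs:
        return False                      # not interested: reply now
    # Both want the CS: the earlier request (pid breaks ties) wins,
    # so P defers only if its own request came first.
    return (my_ts, my_pid) < (req_ts, req_pid)

assert should_defer(True, False, None, 1, 5, 2)       # in CS: defer
assert not should_defer(False, False, None, 1, 5, 2)  # reply now
assert not should_defer(False, True, 7, 1, 5, 2)      # Pj asked first
assert should_defer(False, True, 3, 1, 5, 2)          # P asked first
```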
- 18.2.3 Token-Passing Approach
Pass a token around a logical ring -- the process with the token
may enter the CS.
If the token is lost the processes can hold an election to
generate a new token.
If a process fails, the remaining processes can hold an election
to form a new ring.
This algorithm is efficient when processes "almost always" want to
enter the CS -- there is one message per access if all processes
always want to enter.
- Section 18.3 -- Atomicity
- 18.3.1 The Two-Phase Commit Protocol
The 2PC is one way to assure that a transaction commits on all
involved sites in a distributed system, or aborts on all sites.
Typically each site S has a coordinator C which will be in charge of
a transaction initiated at S. C takes care of assigning other sites
to perform parts of the transaction. The participating sites
perform the required actions but defer writing their
records to their logs. (See the discussion of log-based recovery in
chapter 6, Process Synchronization. Assume all logs mentioned
below are on stable storage.) Instead of committing (or aborting)
their subset of the transaction they send a message to C informing
it that they have "completed" their part of the transaction.
Phase 1: The site coordinator C then puts a <prepare
T> record in its log. Then C sends a prepare(T)
message to all the sites where T executed.
Sites may decide to abort or commit. If abort, they write <no
T> to the log and then send abort(T) to C. If commit,
they put <ready T> in the log and send a ready(T)
message to C. This is a "solemn promise" to obey C when it gives
the order to commit or abort. (A marriage is similar to a 2PC.
The parties say "I do" instead of "ready.")
Phase 2: If C gets ready(T) messages from all the
sites within a certain time-out period it writes <commit T>
to log and sends a commit(T) message to all sites.
Otherwise C writes <abort T> to log and sends an
abort(T) message to all sites.
The receiving sites then obediently write the commit or abort
record to their logs.
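The two phases can be sketched from the coordinator's side as follows. This is a simplified outline: the participant interface (objects with prepare and decide methods) is an assumption for illustration, and timeouts and message passing are elided.

```python
def two_phase_commit(log, participants, t):
    # Phase 1: C records <prepare T>, then polls the sites.
    log.append(f"<prepare {t}>")
    votes = [p.prepare(t) for p in participants]   # "ready" or "abort"
    # Phase 2: commit only if every site voted ready (in time).
    if all(v == "ready" for v in votes):
        log.append(f"<commit {t}>")
        decision = "commit"
    else:
        log.append(f"<abort {t}>")
        decision = "abort"
    for p in participants:
        p.decide(t, decision)       # sites obediently log the record
    return decision

class Site:
    def __init__(self, vote):
        self.vote, self.log = vote, []
    def prepare(self, t):
        # <ready T> is the "solemn promise"; <no T> precedes abort(T).
        self.log.append(f"<{'ready' if self.vote == 'ready' else 'no'} {t}>")
        return self.vote
    def decide(self, t, d):
        self.log.append(f"<{d} {t}>")

log = []
assert two_phase_commit(log, [Site("ready"), Site("ready")], "T1") == "commit"
assert two_phase_commit(log, [Site("ready"), Site("abort")], "T2") == "abort"
```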
- 18.3.2 Failure Handling in 2PC
- 18.3.2.1 Failure of a Participating Site
If upon recovery a participating site Sk finds a
<commit T> record in the log then Sk executes
redo(T). (The reason: all "intentions" of the transaction will
be in the log but possibly some changes to the target data did
not get flushed to stable storage before the failure.)
Sk performs undo(T) if there is an <abort T>
record in the log.
If Sk contains a <ready T> record but no
<commit T> or <abort T> then Sk promised
to obey the directive of the coordinator C. We need to find
out what C said to do. If C can tell us, then this is handled
as one of the two cases above (same as finding a commit or
abort record.)
If C cannot answer then Sk may poll all other sites
to see if any of them committed or aborted T. If it finds one
that committed it executes redo(T). If it finds one that
aborted, it executes undo(T).
If it cannot get the information immediately Sk must
ask the other sites from time to time until one answers. (At
least C should be able to answer eventually.)
If there is no <abort T>, <commit T> or <ready
T> record then Sk did not send a ready(T)
to C. In this case Sk makes sure never to
send a ready(T) to C. Sk performs an undo(T).
(C will eventually tell all the sites to abort -- it may have
done so already.)
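The recovery cases above amount to a small decision procedure. The sketch below is illustrative only: the log is modeled as a list of record strings, and ask_peers is a hypothetical callable that polls C and the other sites, returning "commit", "abort", or None if no one can answer yet.

```python
def recover(log, t, ask_peers):
    """Decide what a recovering participating site Sk does for T."""
    if f"<commit {t}>" in log:
        return "redo"              # changes may not have been flushed
    if f"<abort {t}>" in log:
        return "undo"
    if f"<ready {t}>" in log:
        # Sk promised to obey C: find out what was decided.
        decision = ask_peers(t)    # poll C and the other sites
        if decision == "commit":
            return "redo"
        if decision == "abort":
            return "undo"
        return "retry-later"       # keep asking until someone answers
    # No <ready T> record: Sk never promised, so it aborts unilaterally
    # (and must never send a ready(T) to C).
    return "undo"

assert recover(["<ready T>", "<commit T>"], "T", lambda t: None) == "redo"
assert recover(["<ready T>"], "T", lambda t: "abort") == "undo"
assert recover([], "T", lambda t: None) == "undo"
```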
- 18.3.2.2 Failure of the Coordinator
If one of the participating sites contains a <commit
T> in its log then T must be committed everywhere. (C
decided to commit.)
If one of the participating sites contains an <abort T>
in its log then T must be aborted everywhere. (C decided to
abort.)
If a participating site Sk does not contain a
<ready T> in its log then Sk can't have sent a
ready(T) message to C, so C can't have decided to
commit. C may have decided to abort. Whether or not C decided
to abort, everything will be consistent if Sk now
decides to abort. In this case Sk decides to
never send a ready(T) message to C.
The only remaining possibility is that all active sites have
a <ready T> in their logs and no <commit T> or
<abort T>. In that case it is impossible to tell
whether C has decided, or will decide, to commit or abort. The sites
have "solemnly promised" to obey C's decision. There's no
choice but to wait for C to recover. Problem: the pending
transaction may tie up resources.
- 18.3.2.3 Failure of the Network
If the link between Sk and Si fails, then
Sk can take the same actions it would take if
Si had failed, and conversely for Si.
- Section 18.4 -- Concurrency Control
- 18.4.1 Locking Protocols
Recall the following points that were made in chapter 6 (Process
Synchronization).
- When we want to ensure atomicity of some transaction, we may
not have to treat the whole section as a single critical
section to be protected by a single lock or semaphore.
- It is enough to ensure serializability -- to ensure that
when two transactions execute concurrently the effect
on the data is the same as if one transaction was
carried out completely first and then the other.
- If we use a lock for each data item and require transactions
to follow a locking protocol, we can ensure serializability.
- The so-called two-phase locking protocol may be used.
- The transaction may obtain but not release locks during
the growing phase.
- The transaction may release locks but not obtain any new
locks during the shrinking phase.
- The two-phase locking protocol ensures conflict
serializability but does not ensure freedom from
deadlock.
- There are conflict-serializable schedules that cannot
be obtained through two-phase locking.
We can use the two-phase locking protocol in a distributed
environment. However,
we have to consider how the lock manager will function in the
distributed system. Five different schemes are presented.
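The growing/shrinking discipline can be enforced with a simple guard, sketched below. The class and the lock-manager interface are invented for illustration; a real distributed 2PL would talk to remote lock managers instead.

```python
class TwoPhaseTransaction:
    """Enforces 2PL: every lock acquisition must precede every release."""
    def __init__(self, manager):
        self.manager = manager     # illustrative lock-manager object
        self.held = set()
        self.shrinking = False     # set once the first lock is released

    def lock(self, item):
        if self.shrinking:
            # Acquiring after any release would violate two-phase locking.
            raise RuntimeError("2PL violated: cannot lock after unlocking")
        self.manager.lock(item)
        self.held.add(item)

    def unlock(self, item):
        self.shrinking = True      # transaction enters its shrinking phase
        self.manager.unlock(item)
        self.held.discard(item)

class M:                           # trivial stand-in lock manager
    def lock(self, item): pass
    def unlock(self, item): pass

t = TwoPhaseTransaction(M())
t.lock("a"); t.lock("b")           # growing phase
t.unlock("a")                      # shrinking phase begins
```

Note that nothing here prevents deadlock: two such transactions can still block each other, which is exactly the limitation stated above.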
- 18.4.1.1 Non-replicated Scheme
If there is no replicated data it is simple to use a lock
manager at each site. A process executing a transaction may
lock data at various sites simply by communicating with the
respective lock managers using a request-wait-grant-release
paradigm.
This approach makes handling deadlock more complicated. (See
section 18.5)
- 18.4.1.2 Single-Coordinator Approach
If there is one system-wide lock manager then administering
locks on replicated data is straightforward. Deadlock can be
handled as on a centralized system. However the lock
manager can be a bottleneck and a single point of failure.
A compromise is to have a multiplicity of lock managers.
Each manager is responsible for locks on only some of
the data. Arrange it so that all replicas of any
particular datum are managed by the same lock
manager. That way a process has to talk to only one manager
to lock any particular piece of data.
This approach makes handling deadlock more complicated. (See
section 18.5)
- 18.4.1.3 Majority Protocol
Put a lock manager at each site, responsible for all the
data at that site (only).
A transaction sends lock requests to more than half of the
managers of replicas of the desired datum. The transaction has to
wait until a majority of those lock managers have granted the
lock request.
The scheme otherwise has the standard
request-wait-grant-release pattern.
The implementation is complex. Acquiring a lock requires
2*((N/2)+1) messages and releasing it requires (N/2)+1 messages.
(The formulas use integer division.)
This approach makes handling deadlock more complicated. (See
section 18.5) Deadlock can happen even if processes are only
trying to lock one datum.
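The counting involved can be sketched as follows. This differs from the protocol above in one deliberate simplification: instead of waiting for a majority of grants, the sketch tries each manager once and backs off on failure. The try_lock/unlock interface is invented for illustration.

```python
def majority_lock(managers, item):
    """Ask every replica manager; the lock is held once a majority
    ((N // 2) + 1 of the N managers) have granted it."""
    grants = [m for m in managers if m.try_lock(item)]
    needed = len(managers) // 2 + 1
    if len(grants) >= needed:
        return grants                  # caller now holds the majority lock
    for m in grants:                   # failed: release partial grants
        m.unlock(item)                 # (a real protocol would wait instead)
    return None

class Manager:                         # one lock manager per site
    def __init__(self):
        self.locked = set()
    def try_lock(self, item):
        if item in self.locked:
            return False
        self.locked.add(item)
        return True
    def unlock(self, item):
        self.locked.discard(item)

ms = [Manager() for _ in range(3)]
assert majority_lock(ms, "x") is not None   # 3 of 3 grants: success
```

Because two transactions can each collect partial grants on the same datum, this structure shows why deadlock is possible even over a single datum, as noted above.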
- 18.4.1.4 Biased Protocol
This scheme is like the majority protocol except that shared
and exclusive locks are handled differently.
To get a shared lock all you need is permission from the
manager of one of the replicas.
To get an exclusive lock you must get permission from the
managers of all the replicas.
Low overhead on reading but high on writing.
This approach makes handling deadlock more complicated. (See
section 18.5)
- 18.4.1.5 Primary Copy
Designate a primary copy of each datum.
To get a lock, just get permission from the manager of the
primary copy.
The design is simple.
If the manager at the primary site fails then we can't lock
the data -- even if replicas are still accessible.
- 18.4.2 Timestamping
- 18.4.2.1 Generation of Unique Timestamps
The discussion in this part of the text is a little confusing.
What it boils down to is that we can use the event-ordering
scheme that was developed in section 18.1.
- 18.4.2.2 Timestamp-Ordering Scheme
This section is basically an exercise in the text which I am
skipping this term in the interest of saving time :-).
- Section 18.5 -- Deadlock Handling
- 18.5.1 Deadlock Prevention
In a distributed system:
- We can perform the resource ordering deadlock prevention
scheme by defining a global ordering of the resources. This
algorithm is simple and has low overhead. However it requires
that we get all sites and processes to agree on the resource
ordering.
- If we designate one process to be the banker, we can perform
deadlock avoidance using the banker's algorithm, but apparently
the banker would inevitably be a severe bottleneck.
This section develops new methods of deadlock prevention based on
numbering schemes similar to the resource ordering scheme.
The difference is that the new methods do not depend on "compliance"
on the part of the participating processes.
The Wait-Die Scheme: Each process gets a unique timestamp
before it starts to execute. A younger process that attempts
to wait for a resource held by an older process is rolled
back. (The young process "dies.") This implies that age
decreases monotonically as we go forward along any chain in
the wait-for graph of the system. Therefore there can be no cycles
in the wait-for graph.
The Wound-Wait Scheme: This algorithm is similar to wait-die,
except it works like this: if an older process attempts to
wait for a resource held by a younger process the younger
process is rolled back and the older process gets the resource. (The
young process is "wounded.") In this scheme, age increases
monotonically as we go forward along any chain in the wait-for
graph.
When processes are rolled back and restarted, they keep their old
timestamps. This prevents starvation.
Wound-wait preempts resources, but wait-die does not.
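The two schemes reduce to small decision tables, sketched below. The function names are invented; a smaller timestamp means an older process, since rolled-back processes keep their original timestamps.

```python
def wait_die(req_ts, holder_ts):
    """Requester's fate when the resource is held: only an older
    process may wait on a younger one."""
    return "wait" if req_ts < holder_ts else "die"   # young requester dies

def wound_wait(req_ts, holder_ts):
    """An older requester preempts (wounds) a younger holder; a
    younger requester is allowed to wait."""
    return "wound holder" if req_ts < holder_ts else "wait"

assert wait_die(1, 5) == "wait"            # old waits on young
assert wait_die(5, 1) == "die"             # young requester rolled back
assert wound_wait(1, 5) == "wound holder"  # young holder rolled back
assert wound_wait(5, 1) == "wait"          # young requester waits
```

In both tables, chains of waiting processes are monotone in age, which is what rules out cycles in the wait-for graph.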
In the wait-die scheme, after a young process Y dies because it
tried to wait on an old process O, the operating system is likely to
quickly restart Y, whereupon Y is likely to try to wait on O again.
Thus, time and resources may be wasted by repeated roll-back and
restart.
In the wound-wait scheme, suppose a young process Y is rolled back
because an older process O wounds it. When Y is restarted it may
immediately attempt to acquire the resource O took from it. If O
still holds the resource then Y will be allowed to wait for it.
Thus, depending on other conditions, wound-wait may result in fewer
rollbacks than wait-die.
We may quibble over which scheme causes more rollbacks. However it
is a significant problem if there are any rollbacks.
Processing time is lost, and it is difficult to decide
algorithmically (in a program) what to do with stateful resources
held by a process that is rolled back.
- 18.5.2 Deadlock Detection
To eliminate unnecessary rollbacks and preemption of resources we
can utilize a deadlock detection algorithm.
(To keep details in this discussion simple, we assume there is just
one instance of each resource type, and we use a wait-for graph to
keep track of the system state.)
Since the processes and resources are scattered all over the
distributed system there doesn't seem to be an obvious answer to the
question of how the wait-for graph should be represented.
One way to handle the problem is to store parts of the graph on each
local system. Let G be the graph. Denote by Gs
that part of G which is stored at site S. Gs has a node
for each process (local or not) that either holds or is waiting for
a resource located at S.
G is the union of all the Gs. There can be a
cycle in G even if there is no cycle in any of the Gs's.
- 18.5.2.1 Centralized Approach
In the centralized approach we choose a
deadlock-detection coordinator to accept copies
of the local graphs and construct a union Gc.
The constructed graph Gc is seldom if ever the real
wait-for graph G of the system because the system is constantly
in flux and there is communication lag involved in creating
Gc.
Unfortunately, looking at examples we can see it is possible
that Gc will contain false cycles.
Using our distributed global event-ordering methodology (cf.
section 18.1)
we can use an algorithm based on the following ideas to
avoid detecting false cycles:
- When a process Pi requests a resource at site S
held by process Pj at the same site S then the
system at site S inserts an edge
[Pi-->Pj] into the local wait-for
graph.
- When process Pi, at site S1,
requests a resource held by process Pj at a
different site S2, Pi generates a
timestamp TS and sends the request and timestamp to
S2.
- The system at S1 inserts a labelled edge [TS,
Pi-->Pj] into the local wait-for
graph.
- S2 inserts a copy of [TS,
Pi-->Pj] into its local wait-for
graph if and only if S2 cannot
immediately grant the requested resource when the request
arrives.
- When the deadlock-detection coordinator decides to check
for cycles it sends a "let's do it" message to all the
sites in the system.
- When a site gets the "let's do it" message from the
coordinator, the site sends the coordinator its copy of
the local wait-for graph.
- After the coordinator receives the expected reply from
every site it constructs a "union" graph
like this:
- It makes one vertex for each process found in any
of the local graphs.
- Into the graph it puts all edges that have the form
Pi-->Pj, where Pi
and Pj reside at the same site.
- If an edge of the form [TS,
Pi-->Pj] is found in more than
one local wait-for graph, the coordinator puts that
edge in the constructed union.
The construction has the property that if the constructed
graph has a cycle then the (actual) system is deadlocked.
Also if the constructed graph has no cycle then the system
was not deadlocked when the "let's do it" message was sent
out.
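The coordinator's construction and cycle test can be sketched as follows. The graph representation is an invented one: same-site edges are (src, dst) pairs, and labelled cross-site edges are (ts, src, dst) triples kept only when reported by at least two sites, per the rules above.

```python
from collections import Counter

def build_union(local_edges, cross_edges_per_site):
    """local_edges: per-site sets of (src, dst) same-site edges.
    cross_edges_per_site: per-site lists of (ts, src, dst) edges."""
    edges = set()
    for es in local_edges:                  # same-site edges: take all
        edges |= set(es)
    counts = Counter(e for es in cross_edges_per_site for e in es)
    for (ts, src, dst), n in counts.items():
        if n >= 2:                          # edge seen in more than one graph
            edges.add((src, dst))
    return edges

def has_cycle(edges):
    """True if the directed graph given by `edges` contains a cycle."""
    graph = {}
    for s, d in edges:
        graph.setdefault(s, []).append(d)
    def reachable(start, goal, seen):
        if start == goal:
            return True
        if start in seen:
            return False
        seen.add(start)
        return any(reachable(n, goal, seen) for n in graph.get(start, []))
    # A cycle exists iff some edge's head can reach back to its tail.
    return any(reachable(d, s, set()) for s, d in edges)

u = build_union([{("P1", "P2")}], [[(7, "P2", "P1")], [(7, "P2", "P1")]])
assert has_cycle(u)         # P1 --> P2 --> P1: deadlock reported
```

Dropping cross-site edges seen at only one site is what filters out the false cycles discussed above.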
- 18.5.2.2 Fully Distributed Approach
Our text gives an overview of a fully distributed
deadlock-detection scheme published by Obermarck
(R. Obermarck, "Distributed Deadlock Detection Algorithm,"
ACM Transactions on Database Systems, Volume 7, Number 2
(1982), pp. 187-208).
In a "fully distributed approach" all sites in the
distributed system participate equally in the work of
determining if a deadlock has occurred. Instead of
designating one single process to play the role of the
deadlock-detection coordinator, there is a
deadlock-detection coordinator at every site.
In this scheme the local wait-for graph contains the usual
nodes and edges corresponding to local processes waiting for
local resources.
Additionally, a special node Pex may be in the local
wait-for graph at any site S. If a process P at site S is
waiting for a resource external to S that is held by a process
Q, then the local wait-for graph at S has an edge of the form
P-->Pex. Similarly if a process P' at some site
other than S is waiting for a resource held by a process Q' at
S, then the local wait-for graph at S has an edge of the form
Pex-->Q'.
Obviously the system is deadlocked if there is a cycle not
involving Pex in any of the local wait-for graphs.
Obviously the system is not deadlocked if there are no
cycles at all in any of the local wait-for graphs.
If a local wait-for graph has a cycle containing
Pex, but no cycles without Pex, then the
system may be deadlocked but further information is
required to find out.
Through the use of a numbering scheme, one particular site S
is "elected" to investigate further.
The wait-for graph at S contains a (simple) cycle in which one
of the edges is of the form P-->Pex. P is waiting
for a resource at a site S'. S sends the information about the
cycle to S' and S' forms the union of the cycle with its local
wait-for graph. S' examines the union. If S' finds a cycle
not containing Pex, then the system is deadlocked.
If S' only finds a cycle involving Pex, then it
ships it off to another site and the algorithm "recurs."
Eventually a site will either find no cycle -- in which case we
conclude there is no deadlock, or find a cycle not containing
Pex -- in which case we conclude that there
is a deadlock.
- Section 18.6 -- Election Algorithms
Many of the schemes described in chapter 18 depend on the existence of
a coordinator performing a service at one of the sites. Section 18.6
provides information about what can be done if the coordinator process
fails (e.g. its platform computer crashes.)
Specifically, section 18.6 presents election algorithms that can be used
to choose a unique site where a new coordinator will be started.
In this section we go back to assuming that there is one process at each
site. Also we assume that each site has a unique id number, and that the
coordinator is supposed to be the "living" process with the highest
number. Thus basically the problem is for the "surviving" processes to
collectively determine which of them has the highest number.
- 18.6.1 The Bully Algorithm
If the coordinator does not respond to a process Pi for a
sufficiently long time then Pi decides to run for the
office of the coordinator.
Pi "starts an election" by sending an "I am running"
election message to every Pj such that j>i. If there
are no replies within a timeout period then Pi assumes
the role of coordinator and sends "I am coordinator" messages to all
Pk where k<i.
However if Pi gets an "I'm bigger than you" reply from
some Pj with j>i then Pi waits for some
process to send it an "I am coordinator" message.
If Pi does not get that message before a timeout expires it will
have to start another election. (All higher processes may have
failed.)
Here is how a process must handle two kinds of messages:
- If a process Pi gets an "I am coordinator" message
from a process Pj with j>i then Pi
should record the information and try to use Pj as
the coordinator in the future.
- If a process Pi gets an "I am running" message from
Pk where k<i then Pi responds to Pk with an "I am
bigger than you" message. Next Pi starts an
election, unless it is already running one.
When a failed process restarts, it starts an election. It will
win and become the coordinator if it has the highest number.
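Two fragments of the algorithm can be sketched directly. The alive predicate is an illustrative stand-in for "responded within the timeout"; real message passing is elided.

```python
def bully_winner(ids, alive):
    """The process that ultimately becomes coordinator: the
    highest-numbered process that is still alive."""
    return max(i for i in ids if alive(i))

def on_election_message(my_id, sender_id):
    """Pi's response to an 'I am running' message from Pk: reply
    only if Pi outranks the sender (and then Pi starts its own
    election, not shown here)."""
    if sender_id < my_id:
        return "I am bigger than you"
    return None

assert bully_winner([1, 2, 3, 4], lambda i: i != 4) == 3
assert on_election_message(3, 1) == "I am bigger than you"
```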
- 18.6.2 Ring Algorithm
If a process Pi decides that the coordinator may be down,
it creates an initially empty active list and sends an
"elect(Pi)" message to its neighbor on the right. It then
adds itself ('i') to its active list.
If Pi gets an elect(j) message from the process on the left
- If this is the first elect message Pi has received or
sent, Pi creates a new active list, puts i and j in
it and sends an elect(i) message followed by an elect(j) message
to the process on the right;
- otherwise if i != j then Pi adds j to its active list
and passes the elect(j) message to the right;
- otherwise i == j and now Pi has the id numbers of all
active processes. Pi computes the max and in the
future tries to use that process as the coordinator.
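The three cases above can be exercised with a small simulation. This sketch is illustrative: it assumes a fixed ring of live processes and models message passing with a simple FIFO queue rather than a real network.

```python
from collections import deque

def ring_election(ring, starter):
    """ring: pids in ring order; messages travel to the right.
    Returns the elected coordinator (the maximum active pid)."""
    n = len(ring)
    right = {ring[k]: ring[(k + 1) % n] for k in range(n)}
    active = {p: None for p in ring}       # None: no elect msgs yet
    msgs = deque()
    active[starter] = [starter]            # starter adds itself
    msgs.append((right[starter], starter)) # and sends elect(starter)
    coordinator = None
    while msgs:
        at, j = msgs.popleft()             # process `at` gets elect(j)
        if active[at] is None:             # case 1: first elect message
            active[at] = [at, j]
            msgs.append((right[at], at))   # send elect(at), then elect(j)
            msgs.append((right[at], j))
        elif at != j:                      # case 2: add j, pass it on
            active[at].append(j)
            msgs.append((right[at], j))
        else:                              # case 3: elect(at) came home;
            coordinator = max(active[at])  # `at` knows every active pid
    return coordinator

assert ring_election([1, 2, 3], 1) == 3
```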
- Section 18.7 -- Reaching Agreement
- 18.7.1 Unreliable Communications
- 18.7.2 Faulty Processes
- Section 18.8 -- Summary