(Latest Revision: Sun Sep 19, 2005)
Chapter Seventeen
--
Distributed File Systems
--
Lecture Notes
- Introduction
- In a distributed file system (DFS) files that reside in
locations scattered all over a network are organized logically so
they appear to comprise an ordinary filesystem like those found
on personal computers and workstations.
- This chapter is about DFS design and implementation issues.
- Section 17.1 -- Background
- In the words of the text:
- "A DFS is a file system whose clients, servers, and storage
devices are dispersed among the machines of a distributed
system," and
- "The DFS is characterized by multiplicity and autonomy of
file servers and clients."
- Typically there exist multiple independent storage devices.
- Transparency is desired: From the client point of view there
should be no difference between interfaces to local or remote
files.
- The speed of all distributed file service should be comparable to
the speed of ordinary local file service. This is quite a
challenge. Networking can introduce many forms of delay.
- Definition: The component unit of a DFS is the
smallest set of files that can be stored on a single machine,
independently of other units. The exact nature of a component
unit varies among different distributed file systems.
- Section 17.2 -- Naming and Transparency
- 17.2.1 Naming Structures
- Definition: Location transparency: The name of
a file does not reveal any hint of the file's physical
storage location.
- Definition: Location independence: The name of
a file does not need to be changed when the file's physical
storage location changes.
- Location independence implies location transparency, but not
vice-versa.
- Most current DFS's have location transparency but not
location independence. Therefore files cannot automatically
migrate from one physical location to another in most DFS's.
- The Andrew File System (AFS) supports location independence
and file mobility.
- Location independence supports the principle that the name
of a file should denote its contents, not its location.
- Location independence makes a system more free to locate
files in a manner that utilizes available storage
efficiently and optimizes the speed of file service. For
example, the system is not obliged to store files in
accordance with the layout of the filesystem hierarchy (tree
of path names).
- 17.2.2 Naming Schemes
- The Ibis system allowed users to refer to remote files using
names with the form "host:local-path-name." The name
"alcyone:/etc/resolv.conf" would refer to the
/etc/resolv.conf file on host alcyone, for example.
- NFS allows a local host to attach remote directories to
arbitrary mount points in the local directory tree.
- NFS itself contains no mechanism for making the overall
"picture" of the file system coherent. The system
administrators are responsible for setting up the "logic" of
which directories are exported and which are imported, by
which clients, and where the mount points shall be.
- A "third approach" would be to make all files in the system
visible to all hosts as the same unified directory tree.
There are problems with doing this in a heterogeneous
distributed system. A host system is designed with certain
assumptions as to what files exist in the directory system,
and what the path names are. (For example a host may expect
the OS kernel to be /vmunix, may expect the local disk
device to be /dev/disk0s10, and may expect the ls command to
be in /bin/ls.)
- 17.2.3 Implementation Techniques
- Map component units to locations, but do not map files to
location at a finer granularity.
- Do the mapping in two levels.
- At level one translate textual filenames into location
independent (numerical) file identifiers that indicate
to which component unit the file belongs.
- At level two map component units to physical storage
locations.
- The level-one mappings may be replicated and cached widely
and freely for performance reasons. Cache consistency
problems will be minimal because the level-one mapping
information is location independent -- it does not need to
change when the physical location of a file or component
unit changes.
- The second level "needs a simple yet consistent update
mechanism."
- It is typical to implement the low-level
location-independent file identifiers as bit strings in
which some prefix denotes the component unit and the rest of
the bits denote the particular file within the unit. In
other words, they have the form:
component-number:file-number.
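The two-level scheme above can be sketched in a few lines. The table names and paths here are made up for illustration; the point is only that the level-one table is migration-proof while the level-two table is the single thing that changes when a component unit moves.

```python
# Level one: textual path -> location-independent fid
# (component-number, file-number). Safe to replicate and cache widely,
# because it never changes when a component unit migrates.
name_to_fid = {
    "/home/alice/notes.txt": (7, 42),   # component unit 7, file 42
}

# Level two: component unit -> current physical location.
# Only this small table must be updated on volume migration.
component_to_server = {
    7: "server-a.example.edu",
}

def locate(path):
    """Resolve a path to (server, file-number) via the two levels."""
    component, file_number = name_to_fid[path]
    server = component_to_server[component]
    return server, file_number

print(locate("/home/alice/notes.txt"))
```

Moving component unit 7 to another server requires touching only `component_to_server[7]`; every cached copy of `name_to_fid` stays valid.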
- Section 17.3 -- Remote File Access
- In a DFS, file blocks must be transferred between local and
remote hosts. We can implement the required data transfer
operations by using a client-server RPC scheme.
- We can do caching to cut down on the number of time-consuming
I/O and network accesses.
- 17.3.1 Basic Caching Scheme
- The basic scheme works much like virtual memory.
- A large unit of caching raises the hit ratio but commits the
system to transferring data in larger units.
- Implementors should consider the maximum amount of data the
network can transfer in one packet because larger sizes will
require assembly/disassembly overhead at some level of the
network protocol software.
- 17.3.2 Cache Location
- A file cache on disk has the advantage of being
non-volatile.
- To support diskless clients, it must be possible to cache
files in primary memory on the client side.
- The use of a primary memory cache helps cut down greatly on
time-consuming disk I/O.
- It is no longer very expensive to give computers lots of
main memory for caches.
- NFS and Sprite employ some caching in primary memory.
- NFS employs both server- and client-side primary memory
caching. Recent Solaris implementations have an option to
add client-side disk caching (cachefs).
- 17.3.3 Cache-Update Policy
- Writes cannot simply stop at the cache; eventually they
must be propagated to all permanent copies of the file.
- The simplest policy is write-through: it is reliable
but it performs sluggishly.
- Delayed-write with update "once in a while" is an
alternative. Primary memory caches perform better with
delayed-write but data can be lost in a crash.
- NFS practices write-through on metadata (directory and
file-attribute data) and delayed-write on ordinary file
blocks.
- Write-on-close is another possibility -- AFS uses it.
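The write-through/delayed-write trade-off above can be made concrete with a toy block cache. These classes are a hypothetical sketch, not code from NFS or any real client:

```python
class WriteThroughCache:
    """Every write goes to the server immediately: reliable but slow."""
    def __init__(self, server):
        self.server = server          # dict standing in for remote storage
        self.cache = {}

    def write(self, block, data):
        self.cache[block] = data
        self.server[block] = data     # synchronous remote write

class DelayedWriteCache:
    """Writes sit in the cache and are flushed "once in a while":
    fast, but dirty blocks are lost if the client crashes first."""
    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.dirty = set()

    def write(self, block, data):
        self.cache[block] = data
        self.dirty.add(block)         # server not contacted yet

    def flush(self):                  # e.g. on a timer, or at close
        for block in self.dirty:
            self.server[block] = self.cache[block]
        self.dirty.clear()

server = {}
wt = WriteThroughCache(server)
wt.write(0, b"hello")                 # server sees the data at once

dw = DelayedWriteCache(server)
dw.write(1, b"world")                 # server does not see it yet
dw.flush()                            # now it does
```

NFS's hybrid policy corresponds to using the first class for metadata and the second for ordinary file blocks; AFS's write-on-close is essentially `flush()` invoked from `close()`.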
- 17.3.4 Consistency
- When one user writes to a file, the write can invalidate the
contents of cached copies held by other users of the
file.
- One approach to maintaining consistency is for each client
to check with the server to ask if the cache is still valid
(the client-initiated approach). For this to be completely
reliable there would have to be a write-through cache-update
policy and clients would have to check with the server
before each access. This would negate much of the benefit
of caching.
- Another approach is for the server to notify the clients that
caching is disabled whenever a file is opened by two or more
clients in conflicting modes.
- 17.3.5 A Comparison of Caching and Remote Services
- Definition: Remote service refers to remote
file service performed simply by turning every local file
operation into a remote procedure call. There is no caching
involved in remote service.
- Caching has the obvious potential to cut down on
time-consuming disk I/O and transfer of data across the
network.
- If we use caching instead of remote service it may result in
"chunking" the data in a manner that conserves bandwidth on
the network and the server's disks.
- Caching schemes tend to be efficient when writes are
infrequent and inefficient when writes are frequent.
- It does not make good sense to perform caching on a diskless
machine with a small memory.
- If caching is done then extra software has to be added to
the system to implement the data "chunking" of the caching
scheme. This makes caching systems more complex to program
than systems which rely only on remote service for
distributed file service.
- Section 17.4 -- Stateful Versus Stateless Service
- With stateful file service the client must open the
file. In response to the open operation the server caches state
information about the file session in memory and returns a file
handle to the client. The client uses the handle each time it
makes a subsequent request for a read, write, or other operation
on the file. The cached information on the server side stores
certain information such as a pointer to the current position in
the file. The client does not have to send much information in
its requests. For example the client can ask the server to send
the "next" block instead of saying: "Now give me block 258921 of
the file named /var/mail/john."
- AFS is stateful.
- With stateless file service every request is
self-contained. The client specifies the filename and offset
desired with every request for a read or write. The server need
not do anything in response to open or close operations.
- NFS is stateless.
- The advantage of stateful service is that it makes for greater
efficiency -- e.g. cached index information. However after a
server crash how do we restore state? Also what is done about
"missing" close operations when clients crash with files open?
- Stateless service has few thorny implementation problems. If a
server crashes and fails to answer a client the client will just
ask again when the server has recovered.
- Stateless service is incompatible with server-initiated cache
validation.
- Section 17.5 -- File Replication
- In a DFS, files must be replicated in some fashion; without
replication there can be no fault tolerance.
- Striping and RAID solutions have their place in the hierarchy of
solutions. However methods such as striping and RAID cannot
practically meet all needs for replication of files in a DFS.
- There are reasons for locating separate physical copies of files in
different locations around the network:
- Generally access times for clients will be shorter if clients
can access files from servers that are not very far away.
- If links go down and the network is partitioned, then it is
useful if copies of files exist in all the partitions, so that
file service can continue.
- Even if file service is completely centralized, clients will
cache files or parts of files for performance reasons.
- For fault tolerance, different replicas should be kept on different
machines with independent failure modes.
- The system should not require users who have no need of such
knowledge to be aware of the multiplicity of replicas.
- Replicas have to be kept consistent. This problem is identical in
essence to the cache consistency problem.
- Locus sacrifices consistency for availability if there is a
partition of the network.
- Section 17.6 -- An Example: AFS
- The Andrew File System (AFS) began at Carnegie Mellon
University in Pittsburgh. (Both benefactors, Carnegie and
Mellon, were named Andrew, hence the name.)
- Implementations of AFS are available free for many platforms.
- Some features of AFS are:
- uniform name space
- location-independent file sharing
- client-side caching with cache consistency
- secure authentication with Kerberos
- server-side caching (with replicas)
- automatic switchover to replicas
- high scalability
- 17.6.1 Overview
- There are client machines and server machines
- Clients see a small local name space and a large
shared name space
- A group of servers called Vice makes the shared name
space available to the clients.
- All clients see the same shared name space in the same
location within their local name space.
- The client machines (workstations) run software called
Virtue to communicate with Vice.
- The WAN is organized into clusters. Each cluster contains
some workstations and a cluster server (one of the Vice file
servers). A client workstation is supposed to get most of its
file service from its local cluster server.
- The server CPU's are the system bottleneck. To help provide
relief, files are cached in 64-KB chunks.
- No client software is allowed to execute on servers. Servers
are considered secure. Servers and clients authenticate
each other's identities and their communications are
encrypted.
- AFS has access lists (ACL's) for protecting access to
directories, plus the standard unix style protection bits on
files. The ACL's provide a finer granularity of protection --
allow/deny lists for read, write, lookup, insert, administer,
lock, and delete.
- 17.6.2 The Shared Name Space
- AFS component units are called volumes. They are
smaller than the typical DFS component unit; several volumes
may reside within a single disk partition. Typically a volume
corresponds to the files of one client (workstation).
- Each directory entry maps to a low-level location-independent
identifier called a fid, which consists of three 32-bit
components: a volume number, a vnode number, and a uniquifier.
- Vnode numbers map via a table lookup to inode numbers.
- There is a volume-location database replicated on each
server that keeps track of the physical location of each
volume (component unit)
- For balancing disk utilization there is an automatic and
atomic volume migration operation. The implementation
involves a temporary forwarding mechanism at the source server
which allows the source server to handle updates while the
volume is in transit.
- Replication of read-only volumes is supported for some
volumes.
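The fid layout described above can be sketched as simple bit packing of the three 32-bit fields into one 96-bit integer. The field order chosen here is an assumption for illustration, not the actual AFS encoding:

```python
def pack_fid(volume, vnode, uniquifier):
    """Pack three 32-bit fields into one 96-bit fid."""
    for field in (volume, vnode, uniquifier):
        assert 0 <= field < 2**32        # each field must fit in 32 bits
    return (volume << 64) | (vnode << 32) | uniquifier

def unpack_fid(fid):
    """Recover (volume, vnode, uniquifier) from a packed fid."""
    mask = 0xFFFFFFFF
    return (fid >> 64) & mask, (fid >> 32) & mask, fid & mask

fid = pack_fid(7, 42, 1)
assert unpack_fid(fid) == (7, 42, 1)
```

The volume-number prefix is what lets a server route a request with nothing more than the volume-location database lookup; the vnode number and uniquifier matter only once the right server holds the volume.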
- 17.6.3 File Operations and Consistency Semantics
- Clients cache whole files. Sometimes a client has to interact
with Vice to open or close a plain file, but no other
operations on plain files require interaction with Vice.
- A special process called Venus takes care of managing the file
cache on a client.
- If a client modifies a file, Venus will send the file to the
server after the client closes it.
- If Venus has a cached copy of a file it will open the copy for
the client without contacting a server unless the server has
removed the callback on the file.
- A server will remove the callback when another client has
requested that the file be modified.
- The semantics of AFS are basically session semantics.
The changes a user makes to a file are not seen by other users
of the file until the file is closed. (However changes to such
things as directory protection ACL's are visible everywhere
immediately.)
- Venus also caches directory contents and symbolic links to
facilitate pathname translation.
- Modifications to directories are done directly on the server.
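The callback mechanism above can be sketched as a per-file set of clients whose cached copies are still trusted. This is a deliberately simplified model, not the actual Vice/Venus protocol:

```python
# For each file, the set of clients holding an intact callback
# (i.e. clients whose cached copy the server still vouches for).
callbacks = {"f": {"client1", "client2"}}

def open_cached(client, name):
    """Venus may serve the open from its cache only while the
    client's callback on the file is intact."""
    return client in callbacks.get(name, set())

def store(writer, name):
    """On close-after-write the server breaks every other client's
    callback; only the writer's copy remains fresh."""
    callbacks[name] = {writer}

assert open_cached("client1", "f")     # cache hit, no server contact
store("client2", "f")                  # client2 closes a modified copy
assert not open_cached("client1", "f") # callback removed: must refetch
assert open_cached("client2", "f")
```

Because the *server* initiates the invalidation, Venus avoids the per-open validation round trip of the client-initiated scheme in section 17.3.4.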
- 17.6.4 Implementation
- Venus can ask a server for the location of a volume. It will
cache such information.
- On at least some implementations both servers and clients use
a unix file system for low-level storage.
- Venus and servers access files directly by i-node number as an
optimization because the unix function that does pathname
translation (namei) has considerable overhead.
- The servers use lightweight processes to service multiple
client requests concurrently. There is a pool of persistent
service threads.
- Section 17.7 -- Summary