(Latest Revision: Sun Sep 19, 2005)
Chapter Seventeen
--
Distributed File Systems
--
Lecture Notes
- Introduction
- In a distributed file system (DFS) files that reside in
locations scattered all over a network are organized logically so
they appear to comprise an ordinary filesystem like those found
on personal computers and workstations.
- This chapter is about DFS design and implementation issues.
- Section 17.1 -- Background
- In the words of the text:
- "A DFS is a file system whose clients, servers, and storage
devices are dispersed among the machines of a distributed
system," and
- "The DFS is characterized by multiplicity and autonomy of
file servers and clients."
- Typically there exist multiple independent storage devices.
- Transparency is desired: From the client point of view there
should be no difference between interfaces to local or remote
files.
- The speed of all distributed file service should be comparable to
the speed of ordinary local file service. This is quite a
challenge. Networking can introduce many forms of delay.
- Definition: The component unit of a DFS is the
smallest set of files that can be stored on a single machine,
independently of other units. The exact nature of a component
unit varies among different distributed file systems.
- Section 17.2 -- Naming and Transparency
- 17.2.1 Naming Structures
- Definition: Location transparency: The name of
a file does not reveal any hint of the file's physical
storage location.
- Definition: Location independence: The name of
a file does not need to be changed when the file's physical
storage location changes.
- Location independence implies location transparency, but not
vice-versa.
- Most current DFS's have location transparency but not
location independence. Therefore files cannot automatically
migrate from one physical location to another in most DFS's.
- The Andrew File System (AFS) supports location independence
and file mobility.
- Location independence supports the principle that the name
of a file should denote its contents, not its location.
- Location independence makes a system more free to locate
files in a manner that utilizes available storage
efficiently and optimizes the speed of file service. For
example, the system is not obliged to store files in
accordance with the layout of the filesystem hierarchy (tree
of path names).
- 17.2.2 Naming Schemes
- The Ibis system allowed users to refer to remote files using
names with the form "host:local-path-name." The name
"alcyone:/etc/resolv.conf" would refer to the
/etc/resolv.conf file on host alcyone, for example.
- NFS allows a local host to attach remote directories to
arbitrary mount points in the local directory tree.
- NFS itself contains no mechanism for making the overall
"picture" of the file system coherent. The system
administrators are responsible for setting up the "logic" of
which directories are exported and which are imported, by
which clients, and where the mount points shall be.
- A "third approach" would be to make all files in the system
visible to all hosts as the same unified directory tree.
There are problems with doing this in a heterogeneous
distributed system. A host system is designed with certain
assumptions as to what files exist in the directory system,
and what the path names are. (For example a host may expect
the OS kernel to be /vmunix, may expect the local disk
device to be /dev/disk0s10, and may expect the ls command to
be in /bin/ls.)
- 17.2.3 Implementation Techniques
- Map component units to locations, but do not map files to
location at a finer granularity.
- Do the mapping in two levels.
- At level one translate textual filenames into location
independent (numerical) file identifiers that indicate
to which component unit the file belongs.
- At level two map component units to physical storage
locations.
- The level-one mappings may be replicated and cached widely
and freely for performance reasons. Cache consistency
problems will be minimal because the level-one mapping
information is location independent -- it does not need to
change when the physical location of a file or component
unit changes.
- The second level "needs a simple yet consistent update
mechanism."
- It is typical to implement the low-level
location-independent file identifiers as bit strings in
which some prefix denotes the component unit and the rest of
the bits denote the particular file within the unit. In
other words, they have the form:
component-number:file-number.
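The two-level scheme above can be sketched in a few lines. The table names and paths here are made up for illustration; the point is only that the level-one table is migration-proof while the level-two table is the single thing that changes when a component unit moves.

```python
# Level one: textual path -> location-independent fid
# (component-number, file-number). Safe to replicate and cache widely,
# because it never changes when a component unit migrates.
name_to_fid = {
    "/home/alice/notes.txt": (7, 42),   # component unit 7, file 42
}

# Level two: component unit -> current physical location.
# Only this small table must be updated on volume migration.
component_to_server = {
    7: "server-a.example.edu",
}

def locate(path):
    """Resolve a path to (server, file-number) via the two levels."""
    component, file_number = name_to_fid[path]
    server = component_to_server[component]
    return server, file_number

print(locate("/home/alice/notes.txt"))
```

Moving component unit 7 to another server requires touching only `component_to_server[7]`; every cached copy of `name_to_fid` stays valid.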
- Section 17.3 -- Remote File Access
- In a DFS, file blocks must be transferred between local and
remote hosts. We can implement the required data transfer
operations by using a client-server RPC scheme.
- We can do caching to cut down on the number of time-consuming
I/O and network accesses.
- 17.3.1 Basic Caching Scheme
- The basic scheme works much like virtual memory.
- A large unit of caching raises the hit ratio but commits the
system to transferring data in larger units.
- Implementors should consider the maximum amount of data the
network can transfer in one packet because larger sizes will
require assembly/disassembly overhead at some level of the
network protocol software.
- 17.3.2 Cache Location
- A file cache on disk has the advantage of being
non-volatile.
- To support diskless clients, it must be possible to cache
files in primary memory on the client side.
- The use of a primary memory cache helps cut down greatly on
time-consuming disk I/O.
- It is no longer very expensive to give computers lots of
main memory for caches.
- NFS and Sprite employ some caching in primary memory.
- NFS employs both server- and client-side primary memory
caching. Recent Solaris implementations have an option to
add client-side disk caching (cachefs).
- 17.3.3 Cache-Update Policy
- Writes cannot simply stop at the cache; eventually they
must be propagated to all permanent copies of the file.
- The simplest policy is write-through: it is reliable
but it performs sluggishly.
- Delayed-write with update "once in a while" is an
alternative. Primary memory caches perform better with
delayed-write but data can be lost in a crash.
- NFS practices write-through on metadata (directory and
file-attribute data) and delayed-write on ordinary file
blocks.
- Write-on-close is another possibility -- AFS uses it.
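The write-through/delayed-write trade-off above can be made concrete with a toy block cache. These classes are a hypothetical sketch, not code from NFS or any real client:

```python
class WriteThroughCache:
    """Every write goes to the server immediately: reliable but slow."""
    def __init__(self, server):
        self.server = server          # dict standing in for remote storage
        self.cache = {}

    def write(self, block, data):
        self.cache[block] = data
        self.server[block] = data     # synchronous remote write

class DelayedWriteCache:
    """Writes sit in the cache and are flushed "once in a while":
    fast, but dirty blocks are lost if the client crashes first."""
    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.dirty = set()

    def write(self, block, data):
        self.cache[block] = data
        self.dirty.add(block)         # server not contacted yet

    def flush(self):                  # e.g. on a timer, or at close
        for block in self.dirty:
            self.server[block] = self.cache[block]
        self.dirty.clear()

server = {}
wt = WriteThroughCache(server)
wt.write(0, b"hello")                 # server sees the data at once

dw = DelayedWriteCache(server)
dw.write(1, b"world")                 # server does not see it yet
dw.flush()                            # now it does
```

NFS's hybrid policy corresponds to using the first class for metadata and the second for ordinary file blocks; AFS's write-on-close is essentially `flush()` invoked from `close()`.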
- 17.3.4 Consistency
- When one user writes to a file, the write can invalidate the
contents of cached copies held by other users of the
file.
- One approach to maintaining consistency is for each client
to check with the server to ask if the cache is still valid
(the client-initiated approach). For this to be completely
reliable there would have to be a write-through cache-update
policy and clients would have to check with the server
before each access. This would negate much of the benefit
of caching.
- Another approach is for the server to notify the clients that
caching is disabled whenever a file is opened by two or more
clients in conflicting modes.
- 17.3.5 A Comparison of Caching and Remote Services
- Definition: Remote service refers to remote
file service performed simply by turning every local file
operation into a remote procedure call. There is no caching
involved in remote service.
- Caching has the obvious potential to cut down on
time-consuming disk I/O and transfer of data across the
network.
- If we use caching instead of remote service it may result in
"chunking" the data in a manner that conserves bandwidth on
the network and the server's disks.
- Caching schemes tend to be efficient when writes are
infrequent and inefficient when writes are frequent.
- It does not make good sense to perform caching on a diskless
machine with a small memory.
- If caching is done then extra software has to be added to
the system to implement the data "chunking" of the caching
scheme. This makes caching systems more complex to program
than systems which rely only on remote service for
distributed file service.
- Section 17.4 -- Stateful Versus Stateless Service
- With stateful file service the client must open the
file. In response to the open operation the server caches state
information about the file session in memory and returns a file
handle to the client. The client uses the handle each time it
makes a subsequent request for a read, write, or other operation
on the file. The cached information on the server side stores
certain information such as a pointer to the current position in
the file. The client does not have to send much information in
its requests. For example the client can ask the server to send
the "next" block instead of saying: "Now give me block 258921 of
the file named /var/mail/john."
- AFS is stateful.
- With stateless file service every request is
self-contained. The client specifies the filename and offset
desired with every request for a read or write. The server need
not do anything in response to open or close operations.
- NFS is stateless.
- The advantage of stateful service is that it makes for greater
efficiency -- e.g. cached index information. However after a
server crash how do we restore state? Also what is done about
"missing" close operations when clients crash with files open?
- Stateless service has few thorny implementation problems. If a
server crashes and fails to answer a client the client will just
ask again when the server has recovered.
- Stateless service is incompatible with server-initiated cache
validation.
- Section 17.5 -- File Replication
- In a DFS, files must be replicated in some fashion; without
replication there can be no fault tolerance.
- Striping and RAID solutions have their place in the hierarchy of
solutions. However methods such as striping and RAID cannot
practically meet all needs for replication of files in a DFS.
- There are reasons for locating separate physical copies of files in
different locations around the network:
- Generally access times for clients will be shorter if clients
can access files from servers that are not very far away.
- If links go down and the network is partitioned, then it is
useful if copies of files exist in all the partitions, so that
file service can continue.
- Even if file service is completely centralized, clients will
cache files or parts of files for performance reasons.
- For fault tolerance, different replicas should be kept on different
machines with independent failure modes.
- The system should not require users who have no need of such
knowledge to be aware of the multiplicity of replicas.
- Replicas have to be kept consistent. This problem is identical in
essence to the cache consistency problem.
- Locus sacrifices consistency for availability if there is a
partition of the network.
- Section 17.6 -- An Example: AFS
- The Andrew File System (AFS) began at Carnegie Mellon
University in Pittsburgh. (Both benefactors, Carnegie and
Mellon, were named Andrew, hence the name.)
- Implementations of AFS are available free for many platforms.
- Some features of AFS are:
- uniform name space
- location-independent file sharing
- client-side caching with cache consistency
- secure authentication with Kerberos
- server-side caching (with replicas)
- automatic switchover to replicas
- high scalability
- 17.6.1 Overview
- There are client machines and server machines
- Clients see a small local name space and a large
shared name space
- A group of servers called Vice makes the shared name
space available to the clients.
- All clients see the same shared name space in the same
location within their local name space.
- The client machines (workstations) run software called
Virtue to communicate with Vice.
- The WAN is organized into clusters. Each cluster contains
some workstations and a cluster server (one of the Vice file
servers). A client workstation is supposed to get most of its
file service from its local cluster server.
- The server CPU's are the system bottleneck. To help provide
relief, files are cached in 64-KB chunks.
- No client software is allowed to execute on servers. Servers
are considered secure. Servers and clients authenticate
each other's identities and their communications are
encrypted.
- AFS has access lists (ACL's) for protecting access to
directories, plus the standard unix style protection bits on
files. The ACL's provide a finer granularity of protection --
allow/deny lists for read, write, lookup, insert, administer,
lock, and delete.
- 17.6.2 The Shared Name Space
- AFS component units are called volumes. They are
smaller than the typical DFS component unit; several volumes
may reside within a single disk partition. Typically a volume
corresponds to the files of one client (workstation).
- Each directory entry maps to a low-level location-independent
identifier called a fid, which consists of three 32-bit
components: a volume number, a vnode number, and a uniquifier.
- Vnode numbers map via a table lookup to inode numbers.
- There is a volume-location database replicated on each
server that keeps track of the physical location of each
volume (component unit)
- For balancing disk utilization there is an automatic and
atomic volume migration operation. The implementation
involves a temporary forwarding mechanism at the source server
which allows the source server to handle updates while the
volume is in transit.
- Replication of read-only volumes is supported for some
volumes.
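The fid layout described above can be sketched as simple bit packing of the three 32-bit fields into one 96-bit integer. The field order chosen here is an assumption for illustration, not the actual AFS encoding:

```python
def pack_fid(volume, vnode, uniquifier):
    """Pack three 32-bit fields into one 96-bit fid."""
    for field in (volume, vnode, uniquifier):
        assert 0 <= field < 2**32        # each field must fit in 32 bits
    return (volume << 64) | (vnode << 32) | uniquifier

def unpack_fid(fid):
    """Recover (volume, vnode, uniquifier) from a packed fid."""
    mask = 0xFFFFFFFF
    return (fid >> 64) & mask, (fid >> 32) & mask, fid & mask

fid = pack_fid(7, 42, 1)
assert unpack_fid(fid) == (7, 42, 1)
```

The volume-number prefix is what lets a server route a request with nothing more than the volume-location database lookup; the vnode number and uniquifier matter only once the right server holds the volume.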
- 17.6.3 File Operations and Consistency Semantics
- Clients cache whole files. Sometimes a client has to interact
with Vice to open or close a plain file, but no other
operations on plain files require interaction with Vice.
- A special process called Venus takes care of managing the file
cache on a client.
- If a client modifies a file, Venus will send the file to the
server after the client closes it.
- If Venus has a cached copy of a file it will open the copy for
the client without contacting a server unless the server has
removed the callback on the file.
- A server will remove the callback when another client has
requested that the file be modified.
- The semantics of AFS are basically session semantics.
The changes a user makes to a file are not seen by other users
of the file until the file is closed. (However changes to such
things as directory protection ACL's are visible everywhere
immediately.)
- Venus also caches directory contents and symbolic links to
facilitate pathname translation.
- Modifications to directories are done directly on the server.
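The callback mechanism above can be sketched as a per-file set of clients whose cached copies are still trusted. This is a deliberately simplified model, not the actual Vice/Venus protocol:

```python
# For each file, the set of clients holding an intact callback
# (i.e. clients whose cached copy the server still vouches for).
callbacks = {"f": {"client1", "client2"}}

def open_cached(client, name):
    """Venus may serve the open from its cache only while the
    client's callback on the file is intact."""
    return client in callbacks.get(name, set())

def store(writer, name):
    """On close-after-write the server breaks every other client's
    callback; only the writer's copy remains fresh."""
    callbacks[name] = {writer}

assert open_cached("client1", "f")     # cache hit, no server contact
store("client2", "f")                  # client2 closes a modified copy
assert not open_cached("client1", "f") # callback removed: must refetch
assert open_cached("client2", "f")
```

Because the *server* initiates the invalidation, Venus avoids the per-open validation round trip of the client-initiated scheme in section 17.3.4.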
- 17.6.4 Implementation
- Venus can ask a server for the location of a volume. It will
cache such information.
- On at least some implementations both servers and clients use
a unix file system for low-level storage.
- Venus and servers access files directly by i-node number as an
optimization because the unix function that does pathname
translation (namei) has considerable overhead.
- The servers use lightweight processes to service multiple
client requests concurrently. There is a pool of persistent
service threads.
- Section 17.7 -- Summary