(Latest Revision: Thu Dec  8 01:15 PST 2016; previous revision May 14, 2016)  
Chapter Twelve -- File-System Implementation -- Lecture Notes
-  12.0 Objectives 
      
     -  Describe details of implementing local file systems 
          and directory structures.
     -  Describe the implementation of remote file systems.
     -  Discuss block allocation and free-block algorithms 
          and trade-offs.
 
 -  12.1 File-System Structure 
     
     -  For efficiency reasons, I/O operations between disk and 
          primary memory are performed in units of blocks.  A block
           is N sectors' worth of data, where N >= 1 is a (fixed) small
          positive integer.  
      -  Designers must decide on the logical view users will get of 
          a file system - what file types there will be, what their
          attributes will be, what the directory structure will be,
          and what means users will have for organizing files. 
          
      -  Designers also must decide on the algorithms and data structures
          that will be used to implement the desired logical view of the 
          file system.  
      -  At the lowest level (above the hardware) the file 
          system implementation consists of I/O control 
          - device drivers and interrupt handlers that implement 
          commands like "retrieve block 123."
          
      -  Next above the level of I/O control is the 
          basic file system, which issues commands
          to the I/O control level to access physical blocks 
          by giving drive, cylinder, track, and sector numbers 
          as arguments.  The basic file system also manages memory
          buffering and caching of file data and metadata.
          Typical metadata consists of directory information 
          and attributes.  
      -  The next level up is the file-organization module,
          which is concerned with free-space management and with
          translating logical block addresses (in some 1..N or 0..N-1 range) into 
          physical addresses on disk (in other words, drive, cylinder, track, 
          and sector numbers). 
      -  Next up comes the  logical file system 
          which manages the metadata, like directories, and 
          per-file file control block structures that contain file
          attributes.   
      -  An operating system can support many different file systems,
           and typically multiple file systems can utilize the same
           I/O control module.  Multiple file systems may also share
           large portions of their basic file system modules.
           Such sharing and avoidance of duplicated code is an advantage
           of a layered design.  However, the overhead introduced at each
           layer can affect file system efficiency negatively, which is
           a major concern because delays caused by access to 
           secondary storage are often a major bottleneck 
           in a computing system. 
     
 
 -  12.2 File-System Implementation 
     
     -  12.2.1 Overview 
           
          -   Various metadata structures on secondary memory 
                are utilized to implement file systems.  Here are
                examples: 
                
                -    per-volume boot control block  - 
                      typically the first block on the volume.  On a
                      bootable volume it has info the computer uses
                      to boot the OS from the volume. (examples: UFS boot
                      block and NTFS partition boot sector)
                
                -    volume control block  - info like 
                      the number of blocks on the volume, the block
                      size, the free-block count, pointer(s) to free
                      blocks, free-FCB count, and pointer(s) to free 
                      FCBs (examples: UFS superblock and NTFS master file
                      table)
                
                -    directory structure 
                
                -    per-file file control block (FCB) 
 
 
           -   Other metadata structures are utilized 
                in primary memory, for aspects of file system 
                management, and to facilitate file access
                and enhance efficiency.  Examples include: 
                
                
                -    mount table  - info about
                      each mounted volume
                
                -    directory-structure cache 
                      - info on recently-accessed directories
                
                -    system-wide open-file table 
                      - copies of the FCBs of open files, and
                      other info
                
                -    per-process open-file table 
                      - contains some per-process info and a 
                      pointer to an entry in the system-wide open-file
                      table
                
                -    buffers  to hold file blocks when
                      read from or written to secondary memory
                
 
 
           -   To create a file the OS must allocate an FCB and
                add an entry to the appropriate directory.
                
           -   When a process opens a file, the OS adds an entry 
                to the per-process open-file table.  If no process
                already has the file open, the OS also creates a
                new entry in the system-wide open-file table. 
                There is a counter in each entry of the system-wide 
                open-file table to keep track of how many processes 
                have the file open.  The system call that opens a file
                returns a pointer to the appropriate entry in the 
                per-process open-file table. (examples: unix file 
                descriptor and Windows file handle)
                
           -   When a process closes a file, the OS deletes the 
                entry in the per-process file table and decrements
                the counter in the corresponding entry in the system-wide 
                open-file table.  If the counter goes to zero, the OS
                deletes the entry in the system-wide table, after copying
                any modified metadata back to disk.
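The open/close bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not any real OS's code; `SystemWideEntry`, `open_file`, and `close_file` are hypothetical names.

```python
class SystemWideEntry:
    def __init__(self, fcb):
        self.fcb = fcb            # copy of the on-disk FCB
        self.open_count = 0       # how many processes have the file open

system_wide = {}                  # file name -> SystemWideEntry
per_process = {}                  # pid -> list of SystemWideEntry refs

def open_file(pid, name, fcb_store):
    # Create a system-wide entry only if no process has the file open.
    entry = system_wide.get(name)
    if entry is None:
        entry = system_wide[name] = SystemWideEntry(fcb_store[name])
    entry.open_count += 1
    table = per_process.setdefault(pid, [])
    table.append(entry)
    return len(table) - 1         # "file descriptor": index into per-process table

def close_file(pid, fd, name):
    entry = per_process[pid][fd]
    per_process[pid][fd] = None   # delete the per-process entry
    entry.open_count -= 1
    if entry.open_count == 0:
        # last close: here the OS would copy modified metadata back to disk
        del system_wide[name]
```

The integer returned by `open_file` plays the role of a unix file descriptor: an index into the per-process table, which in turn points at the shared system-wide entry.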
                
           -   The system-wide open-file table may also be used for
                managing objects with file-like interfaces, such as 
                devices and network connections. 
           -   The caching of file system metadata is very important, 
                since it helps greatly reduce the delays caused
                by file system interaction, which is a major bottleneck
                in most computing systems.  
          
 
 
      -  12.2.2 Partitions and Mounting 
           
           -   A disk drive or similar device may have multiple partitions,
                 each of which may be raw or contain a file system. 
           -   A boot loader may be utilized as part of the boot sequence,
                thus providing the capability to boot a choice of operating
                systems. 
           -   Typically a  root partition  containing
                the OS is mounted immediately at boot time, 
                and other partitions are mounted later.
          
 
 
      -  12.2.3 Virtual File Systems 
           
          -  A virtual file system design can be used to provide support
               for multiple types of file systems and integrate them into a 
               unified directory structure. 
           -  At a high level, there is a file-system interface that supports
               generic open(), read(), 
               write(), and close() calls on file
               descriptors.   
           -   Below that there is a  virtual file system (VFS)
                layer that examines the object to which the descriptor
                refers, and then chooses the appropriate operation (method)
                for implementing the generic operation requested.  The VFS
                 can distinguish among local file objects belonging to 
                different file system types, and between 
                local and remote file objects. 
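The dispatch idea behind a VFS can be sketched with a generic interface and per-file-system implementations. The class and method names here are illustrative, not taken from any real kernel.

```python
class VNode:
    """Generic file-object interface; concrete file systems override read_block."""
    def read_block(self, n):
        raise NotImplementedError

class LocalVNode(VNode):
    """A file on a local file system (blocks held here in a simple list)."""
    def __init__(self, blocks):
        self.blocks = blocks
    def read_block(self, n):
        return self.blocks[n]          # served from the local disk image

class RemoteVNode(VNode):
    """A file on a remote server; read_block would go over the network."""
    def __init__(self, server):
        self.server = server
    def read_block(self, n):
        return self.server.fetch(n)    # e.g. an NFS read request

def vfs_read(vnode, n):
    # The caller never needs to know which file system type this is:
    # the VFS layer picks the right method for the object.
    return vnode.read_block(n)
```

The generic `read()` call maps to whichever `read_block` implementation the descriptor's object supplies, which is the essence of the VFS layer.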
          
 
 
      
 -  12.3 Directory Implementation 
     
     -  12.3.1 Linear List 
           
          -  The simplest kind of directory implementation to 
               code would be a linear list of 
               (file name, pointer to disk location) 
               pairs.  
           -  However, it is well known that such a structure does
                not support the complete set of symbol table 
                operations well. One is forced to settle either for 
               sequential searching or large amounts of data 
               movement during insertions and deletions. A balanced 
               binary search tree with threading to support inorder
               and other common traversal orders would be good for
               larger directories, but not worth the overhead for 
               smaller directories.  
          
 
 
      -  12.3.2 Hash Table 
           
          -  One compromise is a hash table that uses chaining for 
               collision resolution. That gives good performance for
               search, insertion, and deletion.  The use of chaining
               mitigates problems stemming from the fixed size of
               the hash table.  
           -  Of course, the hash table structure does not support traversal 
                in key order for listing operations, except by brute-force
                methods.
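A chained hash table for directory entries can be sketched as below: a fixed-size bucket array of (file name, disk location) chains. This is a minimal sketch; `DirHash` is a hypothetical name, and a real directory implementation would store FCB pointers rather than plain integers.

```python
class DirHash:
    def __init__(self, nbuckets=64):
        # Fixed table size; chaining absorbs overflow within each bucket.
        self.buckets = [[] for _ in range(nbuckets)]

    def _chain(self, name):
        return self.buckets[hash(name) % len(self.buckets)]

    def insert(self, name, location):
        chain = self._chain(name)
        for pair in chain:
            if pair[0] == name:
                pair[1] = location       # update existing entry
                return
        chain.append([name, location])

    def lookup(self, name):
        for n, loc in self._chain(name):
            if n == name:
                return loc
        return None                      # not in this directory

    def delete(self, name):
        chain = self._chain(name)
        chain[:] = [p for p in chain if p[0] != name]
```

Search, insertion, and deletion each touch only one chain, so they run in expected constant time as long as the chains stay short.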
          
 
      
 -  12.4 Allocation Methods 
     
     -  The trick to disk space allocation for file systems is to
          get two things at once: good storage utilization
          and fast file access. Contiguous, linked, 
          and indexed allocation are the three main methods in use.
          
      -  12.4.1 Contiguous Allocation 
           
          -  Contiguous allocation requires that a set of contiguous
               blocks on secondary memory be allocated for each file. 
               
           -  A major advantage is that all blocks of the file can be
               accessed with a minimal number of seeks, and with 
               minimal seek time.  
           -  To keep track of which blocks are allocated to a file,
               the OS only has to store two numbers in the FCB, the 
               address of the first block, and the number of blocks
               in the file.  
           -  When accessing the file, it is simple to compute the physical
               address of a file block from its logical address.  Typically
               the logical block numbers are just a range 
               of the form 0 .. N, and
               the physical block number of a logical block 
               is the sum of the logical block
               number and the physical block number of the base block.
               
           -  Because there is a quick constant-time method to calculate
               the physical location corresponding to any logical block 
               address, contiguous allocation easily supports both sequential
               and direct file accesses.  
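The constant-time translation above amounts to a single addition plus a bounds check, as this small sketch shows (the function name is illustrative):

```python
def contiguous_physical(base, length, logical):
    """Map a logical block number to a physical one under contiguous allocation.

    base   - physical block number of logical block 0 (from the FCB)
    length - number of blocks in the file (from the FCB)
    """
    if not (0 <= logical < length):
        raise ValueError("logical block number out of range")
    return base + logical
```

For example, a file starting at physical block 100 with 5 blocks has its logical block 3 at physical block 103.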
           -  Unfortunately contiguous allocation has a very serious 
                drawback.  The problem of allocating contiguous storage 
                on secondary memory is an instance of the now-familiar 
                dynamic storage allocation problem, which 
                means that a very significant amount of the storage 
                could become unusable due to external fragmentation.
               
           -  Compaction of hard disks can take hours.  
           -  It is also a problem to decide how much space should be
               allocated for a file.  One often does not know how large
               a given file will need to grow.  Adding more contiguous
               space to an existing file may seldom be possible.
               
           -  If users routinely overestimate the storage requirements 
               for files there will be significant amounts of internal
               fragmentation.  
           -   The limitations of contiguous allocation motivate efforts
                to use forms of linked or indexed allocation. 
          
 
 
      -  12.4.2  Linked Allocation 
           
          -  With linked allocation, each file is a linked list
               of file blocks.    
           -  Linked allocation does not suffer from external fragmentation
               at all. Any block anywhere on the volume can be used in a
               file. It is easy to add more blocks to a file any time. 
               Compaction is never necessary. 
           -   The directory contains pointers to the first and 
                last blocks of the file. 
           -   One way of implementing the rest of the pointers
                is to place in each data block a pointer to the 
                next block. 
           -   The major problem with linked allocation is that it
                supports direct access extremely poorly.  Generally,
                in order to find the physical address of block #k, 
                it is necessary to first traverse all k-1 pointers 
                from the beginning of the file up to the pointer 
                to block #k-1.  
           -  Also, since each file block can be anywhere on the volume,
               the average seek time required per block access can be
               much greater than is the case with contiguous allocation.
               
           -  If we link contiguous clusters of blocks instead of blocks
               there will be fewer pointers to follow and the proportion
               of space on disk used by pointers will be smaller.  On the 
               other hand, the average amount of internal fragmentation
               will increase.   
           -  Reliability is a problem, since the consequences of a lost
               or corrupted pointer are potentially great. 
           -  The concept of a file allocation 
               table (FAT) is a useful variation on linked 
               allocation.  The FAT is a table stored at the beginning
               of the volume, having an entry for each physical block.
               It is in effect an array indexed by physical block number.
               The pointers used to link files together are stored
               in the FAT instead of in the data blocks of the file. 
               
           -  As an example of how the FAT is utilized, suppose we want to
               access logical block #2 of file X.  First we consult X's
               FCB to learn the physical block number of X's logical block
               0. Let's say that number is 123.  We then examine the contents 
               of entry 123 in the FAT.  Let's say 876 is stored there.
               That means that 876 is the physical block number of X's 
               logical block 1.  We then go to entry 876 of the FAT, and
               find there the physical address of X's logical block #2.
               Let's say that number is 546.  All that is required now is
               to request physical block 546.
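The worked example above can be expressed directly in code. The FAT is an array indexed by physical block number, where each entry holds the physical number of the next block of the same file (here -1 marks end-of-file; the function name is illustrative).

```python
def fat_lookup(fat, first_block, logical):
    """Return the physical block number of logical block #logical."""
    phys = first_block            # from the file's FCB: logical block 0
    for _ in range(logical):
        phys = fat[phys]          # follow one FAT link per logical block
        if phys == -1:
            raise IndexError("logical block past end of file")
    return phys

# Reproduce the example: file X's logical block 0 is at physical 123,
# fat[123] = 876, fat[876] = 546, so logical block 2 is physical 546.
fat = [-1] * 1000
fat[123], fat[876], fat[546] = 876, 546, -1
```

Note that the traversal touches only the FAT, not the data blocks themselves, which is why caching the FAT in primary memory avoids seeks entirely for the pointer-chasing part of the access.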
               
           -  The FAT may also be used to keep track of free blocks,
               by using a special value in its entry, instead of a
               physical block number.  
           -  The use of the FAT makes for less seeking during file
                access, because all the pointers are relatively close
                to each other. Also, parts of the FAT may be cached in
                primary memory, and so traversing those parts of the FAT
                 would not require any seeks.  It is also more straightforward 
                 to enhance the reliability of the pointers
                by making backup copies of the FAT.
          
 
      -  12.4.3  Indexed Allocation 
           
          -   Indexed allocation utilizes a per-file table
                where all the physical addresses of the data blocks of the 
                file are stored. 
           -   This table, like a FAT, is basically an array 
                of physical block numbers.  However, the indexes
                into the array are logical block numbers.
                (The indexes into a FAT are physical 
                block numbers.)  
 
          -   To find logical block #k of the file, we simply
               consult the kth entry in the table and read off the 
               physical address of the block. 
          -   The table is usually called "the index" 
          -   The directory contains the address of the index.
          -   The indexing scheme works pretty much the same way that
               paging works in primary memory management. 
          -   With indexed allocation, there is no external fragmentation
               and both sequential and direct access are supported 
               efficiently. 
          -   Internal fragmentation occurs in the last data block of files
               and in the unused portions of the indexes. 
          -   The entire index of a file can be cached in a relatively 
               small amount of memory and the physical address of any 
               block in the file can be found by looking at a single
               entry of the index.  In contrast, it is likely that 
               there will not be enough memory to cache an entire
               FAT, and the average number of FAT entries that must
               be probed to find the physical address of a file 
               block is proportionate to the size of the file.  
           -  When using indexed allocation, each file block can be 
               anywhere on the volume, so there can be a long
               average seek time required for accessing 
               blocks of the file one after another.  
               That same problem of long seeks between block accesses
               happens with sequential access too.  
               With contiguous allocation, the blocks of a file are close
               together, so when accessing one file block
               after another, the average seek time tends to be shorter 
                than is the case with linked allocation or indexed
               allocation.
               
          -  To accommodate large files, the system may resort to using
              a linked list of index blocks, or a multilevel index in which
              one master index points to multiple second-level indexes
              that point to file data blocks.   
          -  The unix inode utilizes a variation on multilevel indexing.
              The first few entries in the inode point to file data blocks.
              One entry of the inode points to a single indirect block,
              which is a block of pointers to file blocks. Another entry
              of the inode points to a double indirect block - a block of 
              pointers to single indirect blocks.  A third entry of an inode
              points to a triple indirect block, which is a block of pointers 
              to double indirect blocks.  "Under this method, the number of
              blocks that can be allocated to a file exceeds the amount of 
              space addressable by the 4-byte pointers used by many 
              operating systems" (4 GB).
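The quoted claim can be checked with a rough capacity calculation. The parameters below (4 KB blocks, 4-byte block pointers, 12 direct entries) are assumptions for illustration, not fixed by the inode design itself.

```python
block = 4096                 # assumed block size in bytes
ptrs = block // 4            # 4-byte pointers: 1024 fit in one indirect block

direct = 12                  # assumed number of direct entries in the inode
single = ptrs                # blocks reachable via the single indirect block
double = ptrs ** 2           # ... via the double indirect block
triple = ptrs ** 3           # ... via the triple indirect block

max_file_bytes = (direct + single + double + triple) * block
# This is far more than the 4 GB addressable by a single 4-byte pointer.
```

With these numbers the maximum file size is dominated by the triple indirect blocks: 1024^3 blocks of 4 KB each is already 4 TB.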
          
 
 
      -  12.4.4 Performance 
           
           -  Linked allocation is appropriate for files that will be 
                accessed sequentially, but not for files to be accessed 
                directly.
          
           -  Contiguous allocation performs well, supporting both 
                direct access and sequential access. 
           -  The performance of indexed allocation depends on 
                implementation details - usually it is somewhat better 
                than the performance of linked allocation, but not 
                as good as that of contiguous allocation.
           -  Some systems use more than one allocation method and try
                to match the allocation method to the size of files 
                and/or the ways that the files are used.
          
 
 
      
 -  12.5 Free-Space Management 
     
     -  Some sort of data structure is required to keep track of
          which file blocks are allocated, and which are free. 
          
      -  12.5.1  Bit Vector 
           
           -   A bit map or bit vector is a sequence of bits.  
                 The ith bit represents the ith physical block.  If
                 the ith physical block is free, the ith bit in the vector
                 is 1, else it is 0. 
           -   It is simple to implement bit vectors and devise algorithms
                for locating free blocks and runs of contiguous free blocks.
                Instructions that might be used: "ISZERO" and bit-shift.
                
           -   However, bit vectors are not efficient to use unless they
                are cached entirely in primary memory.  It is fairly common
                nowadays (the year 2016) for a laptop computer 
                to have a terabyte disk
                and 8GB of primary memory.  If the disk has 4KB 
                blocks or clusters, the bit vector would need about 32 MB
                of physical memory, which is about 0.4% of the 8GB.
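The arithmetic above, plus a first-fit scan, can be sketched as follows. The bit vector is modeled as a Python bytearray, with bit i representing block i and 1 meaning free, as in the convention above.

```python
disk_bytes = 2**40                        # 1 TB disk
block_size = 4096                         # 4 KB blocks
nblocks = disk_bytes // block_size        # 2**28 blocks
bitmap_bytes = nblocks // 8               # one bit per block: 32 MB

def first_free(bitmap):
    """Return the number of the first free block, or None if all allocated."""
    for i, byte in enumerate(bitmap):
        if byte != 0:                     # some free block within this byte
            for bit in range(8):
                if byte & (0x80 >> bit):  # scan from the high-order bit
                    return i * 8 + bit
    return None
```

The byte-at-a-time skip over zero bytes is the kind of shortcut the notes allude to: hardware "is this word zero?" tests and shifts let the scan skip long allocated runs quickly.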
          
 
 
      -  12.5.2 Linked List 
           
          -   If contiguous blocks are not needed, then simply 
                storing a link to another free block in each free 
                block is a reasonable approach. 
          
 
 
      -  12.5.3 Grouping  
           
          -  In this variant of linking, the first block in the free
               list structure contains n-1 pointers to free blocks, and 
               finally one pointer to another block like itself, which
               points to n-1 more free blocks and another block like
               itself, and so on. 
           -  This structure makes it possible to find a large number of 
               free blocks quickly. 
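A traversal of such a grouped free list can be sketched as below, with the disk modeled as a dict mapping block numbers to contents. Each index block holds n-1 free block numbers plus a link to the next index block (None at the end); the index block itself is also a free block. All names here are illustrative.

```python
def all_free_blocks(disk, head):
    """Collect every free block number by walking the grouped free list."""
    free = []
    block_no = head
    while block_no is not None:
        *free_ptrs, next_block = disk[block_no]
        free.extend(free_ptrs)     # the n-1 free blocks listed in this group
        free.append(block_no)      # the index block itself is free as well
        block_no = next_block
    return free
```

Reading one index block yields n-1 free blocks at once, which is why grouping finds large numbers of free blocks with few disk accesses.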
          
 
 
      -  12.5.4 Counting  
           
          -  Make a list of contiguous runs of free blocks by
               storing pairs of the form (base address, # of blocks) 
               
           -   The list will be compact if most runs are longer than
                1 block. 
           -  Store these records in a balanced tree for efficient
               search, insertion, and deletion. 
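Building the (base address, run length) pairs from a set of free block numbers can be sketched like this. A production system would keep the pairs in a balanced tree, as noted; a plain list suffices to illustrate the idea.

```python
def runs(free_blocks):
    """Turn free block numbers into (base, length) pairs for contiguous runs."""
    pairs = []
    for b in sorted(free_blocks):
        if pairs and b == pairs[-1][0] + pairs[-1][1]:
            # b extends the current run by one block
            pairs[-1] = (pairs[-1][0], pairs[-1][1] + 1)
        else:
            pairs.append((b, 1))   # start a new run at b
    return pairs
```

For a disk whose free space falls mostly in long runs, the pair list is far smaller than a bit vector or a per-block linked list.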
          
 
 
      -  12.5.5 Space Maps  
           
          -   ZFS uses a scheme that divides the volume into areas with
                separate free lists. 
           -   ZFS logs allocation and freeing activity and then uses
                the log to update in-memory copies of free lists with
                batches of changes.
          
 
 
      
 -  12.6 Efficiency and Performance 
     
      -  12.6.1  Efficiency 
           
          -  A unix optimization is to locate file data blocks close
               to the inode to reduce seek times for file accesses. 
           -  Also unix uses variably sized clusters.  Smaller clusters
               are for the last cluster of a file to lessen internal 
               fragmentation.  
           -  "Generally, every data item associated with a file needs to
               be considered for its effect on efficiency and performance."
          
 
 
      -  12.6.2  Performance 
           
          -  A unified buffer cache (process pages and file data) 
               enhances performance, using virtual memory techniques,
               and avoiding double-caching.
               
           -  Frame allocation must be balanced so that processes
               can't take away too many frames that are needed for 
               file caching, or vice-versa.  
           -  When writes to file blocks are synchronous, processes must
               wait for data to be flushed to secondary storage before
               continuing.  
           -  On the other hand, when processes write asynchronously,
               delays tend to be short - often significantly shorter 
               than delays for reads, since the OS can return control
               to the process immediately after caching the write data
               in primary memory.  
           -  When a file is accessed sequentially it is often a good
               idea to use free-behind 
               and read-ahead.   
           -  Free-behind: After page buffers are accessed, they 
               should be freed almost immediately
               because they will probably not be accessed again.  This is
               contrary to LRU, but makes sense in this context.
           -   Read-ahead: Similarly, when reading 
                a block into buffers from a file which is being 
                accessed sequentially, several
                of the blocks that follow in the file should also be brought
                in from disk because they will probably be accessed very soon
                and it saves on disk access time to batch them this way.
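A toy version of read-ahead: when the process asks for block k of a sequentially accessed file, the next few blocks are fetched into the cache too, so later requests hit memory. The window size and the `fetch` callback are illustrative choices, not from any real system.

```python
WINDOW = 4   # assumed read-ahead window: fetch this many blocks per miss

def read_block(k, cache, fetch):
    """Return block k, batching in the following blocks on a cache miss."""
    if k not in cache:
        for j in range(k, k + WINDOW):
            cache[j] = fetch(j)    # one batched disk request per window
    return cache[k]
```

With sequential access, only one request in every WINDOW goes to disk; a free-behind policy would additionally evict `cache[k]` shortly after it is returned, since it will probably not be touched again.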
          
 
 
      
 -  12.7 Recovery 
     
     -  A crash or other failure can interrupt the OS while it is making
          changes to meta-data, which can leave a file system in a damaged
          and/or inconsistent state. Such damage can also occur if the OS
          crashes before it is able to flush cached meta-data changes to
          non-volatile secondary storage.  Other things, such as bugs in 
          software and hardware, can result in damage to a filesystem.
          
          The OS must take measures to protect the system from loss of data
          and to recover from failures. 
      -  12.7.1  Consistency Checker 
           
          -   The OS may utilize a consistency checker 
                such as the unix fsck to resolve 
                inconsistencies between
                the directory structure and the data blocks.   
           -   Details of the file system implementation, such as
                the allocation algorithm and free-space management
                algorithm, determine what kinds of problems a consistency
                checker can detect and correct.  
           -  For example, if there is linked allocation, a file can be
               reconstructed from its data blocks.
          
 
 
      -  12.7.2 Log-Structured File System  
           
          -  The use of log-based recovery algorithms is becoming
               common, enhancing protection of file systems. 
           -  The basic idea is to write changes to metadata to a log 
               first, and then replay the log entries across the 
               actual file system metadata structures that need to be
               changed.  
           -  If the system crashes, the OS can use the log to resolve
               any inconsistencies during a recovery phase. 
          
 
 
      -  12.7.3 Other Solutions  
           
          -   WAFL and ZFS file systems never overwrite blocks with
                new data. 
           -   Changes are written to new blocks, and then the pointers
                 that referenced the old blocks are updated to point to 
                 the new blocks. 
                
           -   Old pointers and blocks may be saved in order to provide
                snapshots of previous states of the file system. 
           -   ZFS employs checksums for all metadata and file blocks,
                further reducing chances of inconsistency.
          
 
 
      -  12.7.4 Backup and Restore 
           
          -  Backup schedules involve some combination/rotation of
               full and incremental backups.  
           -  Some backups should be saved "forever," in case a user
               discovers a file was damaged or deleted long ago. 
               
           -  Protect backup media by storing it where it will be safe,
               and make sure to always have backups on media that
               is in good condition, and not worn out.  
           -  Verify the condition of backups on a careful schedule.
          
 
      
 -  12.8 NFS 
     
     -  12.8.1  Overview 
           
          -  NFS allows a client to mount an arbitrary directory on a
               remote server over a directory (a mount point) 
               in its local file system.  
           -  After the client system performs the mount operation, 
               users on the client can access the remote files 
               transparently (without needing to make any reference to
               the network, the client, or the server) with normal file
               accesses that utilize the pathname of the mount point.  
               
           -  If a collection of clients mounts home directories from a 
               server via NFS, users can access their homes by logging in
               to any of the client machines.
               
           -  The implementation of NFS uses RPC and XDR.
          
 
 
      -  12.8.2 The Mount Protocol  
           
          -  There is a separate protocol for mounting a remote
               directory. 
          
 
 
      -  12.8.3 The NFS Protocol  
           
          -  NFS is essentially a stateless protocol. 
               The server does not keep track of what the client is doing
               with the file - nothing like an open file table.  Files
               are not opened or closed by the NFS protocol.
               Each client request has to identify the file and
               the byte offset being accessed. 
           -  An advantage of the statelessness of NFS is that it is robust
               across server crashes.  The client need only repeat a request
               to a server that has rebooted. 
           -  For protection of data and metadata, NFS is required to
                flush writes to the server's secondary memory synchronously.
                The server must wait until a write is complete before 
                sending the return value of a client write request.
               
           -  The NFS protocol does not provide for the locking required for
                concurrent file accesses.  The OS may provide that. 
           -  NFS is integrated into the virtual file system of Solaris, as 
               one of the layers below the VFS interface. 
          
 
 
      -  12.8.4 Path-Name Translation 
           
          -  When a path name is given as an argument of an NFS operation,
               each component must be checked individually.  Any of them
               may be a mount point.  Once a mount point is crossed, the
               appropriate server must be consulted, via an NFS operation, 
               to look up the vnode of the directory or file. 
          
 
 
      -  12.8.5 Remote Operations 
           
          -  NFS utilizes client-side caching of inode-information and 
               file blocks.   
           -  Clients check with servers to determine whether cached contents
               remain valid.  
           -  NFS does not preserve the unix semantics that call for 
               writes to be visible immediately to all processes 
               that have a file open.  
           -  The semantics of NFS are not the session semantics 
               of the Andrew file system either.
          
 
 
      
 -  12.9 Example: The WAFL File System 
     
     -  WAFL is a file system optimized for random writes, and designed
          for servers exporting files under NFS or CIFS. 
       -  WAFL is similar to the Berkeley Fast File System, but with many
           differences. 
      -  All metadata is stored in ordinary files. 
       -  WAFL never writes over blocks, and can provide snapshots of 
           previous states using old root inodes.  Going forward,
          the changes in the file system are recorded in copy-on-write
          fashion. 
      -  Writes can always occur at the free block nearest the current
          head location.