allows processes to be larger than physical memory,
allows more programs to execute concurrently,
tends to reduce program load and swap time,
facilitates page sharing,
is not easy to implement, and
can decrease performance.
9.2 Demand Paging
The idea of demand paging is to load only the pages of the program
that are actually needed - a "lazy" form of swapping.
9.2.1 -- Basic Concepts
Support from the hardware is required to implement virtual
memory. A set "valid bit" in a page-table entry can signify
that the page number is a legal address for the process
and that the page is currently resident in main memory.
As an aid to the OS, the page table entry of a non-resident
page may contain the on-disk address of the page.
If the hardware encounters a cleared valid bit ("invalid") in a
page-table entry during address translation then a
page-fault is initiated. To service a page-fault
the OS must:
check to see if the logical address is in-range for the
process
if not in range terminate the process, else
find a free frame
schedule a disk read to load the missing page
after the page is loaded, mark it present in the page
table and other internal data structure(s)
restart the instruction that caused the page-fault
Pure demand paging would start the process with no resident
pages and load no page unless and until the process faults on
it. In practice an OS typically performs some pre-fetching in
an attempt to reduce the number of page-faults.
Because of locality of reference, page-fault rates are not
nearly as high as they would be if addresses were generated
"randomly" by the executing process.
Hardware Support Required for Demand Paging:
Page table
Swap space on high-speed backing store
Instruction architecture in which any instruction can
be restarted after a page-fault. (complexities are
discussed on pp. 322-323 of the seventh edition.)
9.2.2 -- Performance of Demand Paging
The page-fault rate of a process is the number of
page-faults the process gets during its execution divided by
the number of memory accesses it performs.
A normal memory access takes somewhere between 10 and 200
nanoseconds.
If a page-fault occurs during an attempted memory access, it
may require about 8 milliseconds to complete that memory
access, mainly because the page has to be loaded from secondary
storage.
There are 40,000 200-nanosecond periods in an 8-millisecond
period. This is like the difference in magnitude between one
second and eleven hours -- eleven hours is about 40,000
seconds.
Under the assumptions above, if we get just one page-fault
in every 40,000 page accesses then the effective memory
access time could be double the 200ns figure -- 400ns.
To assure an effective memory access time below 220ns we would
need to have fewer than one page-fault in every 400,000 memory
accesses.
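The arithmetic above can be sketched directly. A minimal Python illustration, using the text's figures of 200 ns per ordinary access and 8 ms per fault:

```python
# Effective memory access time under demand paging:
#   EAT = (1 - p) * t_mem + p * t_fault
# where p is the probability that an access faults.

def effective_access_ns(p, t_mem_ns=200, t_fault_ns=8_000_000):
    return (1 - p) * t_mem_ns + p * t_fault_ns

# One fault per 40,000 accesses roughly doubles the 200 ns figure.
print(effective_access_ns(1 / 40_000))   # ~400 ns
# To keep EAT below 220 ns, p must be below 1 in 400,000.
print(effective_access_ns(1 / 400_000))  # just under 220 ns
```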
The point here is that page-faults are very
detrimental to effective memory access time and so we must
keep the page-fault rate very, very low.
It is faster to page or swap to dedicated swap space than to
the file system, so an OS does not usually page *to* the file
system. Even so, it may be beneficial to perform a certain
amount of paging *from* the file system -- for example,
read-only code pages can be read directly from the executable
file.
9.3 Copy-on-Write
Windows 2000, Linux, and Solaris 2 use copy-on-write with fork() --
only pages actually written by the parent or child are copied.
Another variant used by UNIX is vfork() -- here the parent sleeps
while the child uses the address space of the parent. The child is
supposed to *not* write the parent's space. The child is expected
to soon use a form of exec that gives the child a new address space
of its own -- separate and distinct from that of the parent. When
the parent wakes it *will* see any changes the child made to the
parent's address space, so caution is critical. This is considered
dangerous and inelegant but efficient.
9.4 Page Replacement
Physical memory may become over-allocated -- there may be no free
frames when a page-fault occurs!
9.4.1 -- Basic Page Replacement
If there is no free frame to service a page-fault then take
one away from some process that does not appear to need it
(badly).
Write the contents of the "victim" frame to swap space first if
it has been modified (as recorded in the "dirty bit" of the
page table entry) otherwise *don't* write -- save time. (If
for some reason the page is missing from swap space then write
it anyway.)
Update entries in page tables and data structures (e.g. frame
allocation table) associated with both the "victim" and the
"aggressor" processes.
We need to solve the frame-allocation problem: how many
frames are allocated to each process?
We need to choose a page-replacement algorithm: which frame
should be the "victim"? We want the replacement algorithm
that gives the lowest page-fault rate.
We evaluate a page replacement algorithm by trying it out on
some reference strings.
Suppose you make a list L of all the logical memory
addresses referenced by a process P during its lifetime, in
the order the references are made. Suppose you make a list
L' from L by just using the page numbers of each address
from L. Suppose you then "consolidate" L' into a third list
L" by "collapsing" all runs of the same page number into
just one "copy" of that page number. The result would be
the reference string of the process P.
The reference string of a process is the sequence of
page numbers of "new" pages referenced by the process, in
order of reference. (Every time the process stops
referencing one page and starts referencing a different
page, a new page number is appended to the reference
string.)
The length of a reference string is just the number
of terms in the sequence.
We can evaluate a page-replacement algorithm by counting how
many page faults it gets on a reference string. The fewer page
faults, the lower will be the page fault rate.
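The L -> L' -> L" construction described above can be sketched in a few lines of Python (the 4 KB page size and the sample trace are assumptions for illustration):

```python
# Deriving a reference string (L") from an address trace (L):
# take page numbers (L'), then collapse each run of repeats.

def reference_string(addresses, page_size=4096):
    pages = [a // page_size for a in addresses]  # the list L'
    out = []
    for p in pages:                              # collapse runs -> L"
        if not out or out[-1] != p:
            out.append(p)
    return out

trace = [0, 100, 4096, 4200, 4300, 0, 8192]
print(reference_string(trace))  # [0, 1, 0, 2]
```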
9.4.2 -- FIFO Page Replacement
The operating system can link records
representing all the occupied frames into a FIFO queue. When a
page is loaded into a frame the corresponding record goes into
the back of the queue. When a victim for page replacement is
needed the frame at the front of the queue is selected. This
is called FIFO page replacement.
FIFO page replacement is easy to understand and implement but
it does not tend to do very well at keeping the page-fault rate
to a minimum.
FIFO replacement requires updates to data structures only when
a frame is loaded or a victim frame is selected.
FIFO page replacement is subject to Belady's anomaly: the page
fault rate for the same reference string can sometimes go *up*
when you increase the number of frames available!
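A minimal Python sketch of FIFO replacement, run on the classic reference string that exhibits Belady's anomaly:

```python
from collections import deque

# FIFO page replacement: count the faults on a reference string.
def fifo_faults(refs, nframes):
    frames = deque()   # front of the queue = oldest resident page
    resident = set()
    faults = 0
    for p in refs:
        if p not in resident:
            faults += 1
            if len(frames) == nframes:
                resident.discard(frames.popleft())  # evict the oldest
            frames.append(p)
            resident.add(p)
    return faults

# Belady's anomaly: adding a frame *increases* the fault count.
s = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_faults(s, 3))  # 9
print(fifo_faults(s, 4))  # 10
```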
9.4.3 -- Optimal Page Replacement
Interestingly, there is a page replacement policy that is
known to be "as good as possible."
It gives the lowest possible page-fault rate, guaranteed.
In essence, the algorithm is to "replace the page that will not
be used for the longest period of time."
The algorithm can be restated as: "replace the page with the
latest next use."
This algorithm is sometimes called "OPT" or "MIN." It might
also be called "LNU" for "latest next use."
Sometimes the algorithm is referred to as "the oracle method"
because the OS has to see into the future to apply the rule.
Usually we don't have prior knowledge of what the reference
string of a process will be, and so it is usually not
possible to implement the OPT algorithm -- too bad.
(Remember that the reference string of a process generally
depends on the outcomes of branch instructions, which in
turn generally depend on what data is input to the process.)
However we can test other page-replacement algorithms by
comparing their performance with that of OPT on pre-selected
reference strings.
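A sketch of OPT in Python, workable only because the simulator is handed the entire reference string in advance -- exactly the foreknowledge a real OS lacks:

```python
# OPT ("latest next use"): evict the resident page whose next use
# lies farthest in the future (or that is never used again).

def opt_faults(refs, nframes):
    resident = set()
    faults = 0
    for i, p in enumerate(refs):
        if p in resident:
            continue
        faults += 1
        if len(resident) == nframes:
            def next_use(q):
                for j in range(i + 1, len(refs)):
                    if refs[j] == q:
                        return j
                return float("inf")   # never used again
            resident.discard(max(resident, key=next_use))
        resident.add(p)
    return faults

s = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(opt_faults(s, 3))  # 7 -- fewer than FIFO or LRU on this string
```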
9.4.4 -- LRU Page Replacement
The idea of LRU is to replace the page that is "least recently
used" -- i.e. the page that has not been accessed for the
longest time.
Because LRU is similar to OPT -- it just looks "back" instead
of "ahead" -- it seems plausible that LRU may tend to have low
page fault rates. (LRU might be called ELU for "Earliest Last
Use." This underscores the fact that LRU is the "mirror image"
of the OPT algorithm: "Latest Next Use.")
LRU works well generally but it can fail miserably in some
situations.
LRU replacement requires very frequent updates to data
structures -- each time there is a memory reference.
In theory we could implement LRU by arranging for
hardware to increment a counter after every
memory reference, to write the value of the counter into the
page table entry (PTE) every time a page is accessed, and to
search page tables for the lowest value of the counter whenever
a page needs replacement.
Another method -- one that seems a little more practical --
would be for the hardware to perform updates to
an array of records indexed by page number or frame number.
The records would contain pointer fields which would be used to
implement a stack as a doubly-linked list. Each time a
reference is made to a page the hardware would
have to index into the array and place the referenced frame at
the top of the stack by manipulating pointers (no more than
six). The LRU frame for replacement would just be the frame on
the bottom of the stack.
OPT and LRU are stack algorithms -- the set of pages
in memory for an allocation of N frames is a subset of the
set of pages in memory for an allocation of N+1 frames.
To see that LRU is a stack algorithm, note that when doing
LRU with N frames, the pages in memory are the N most
recently referenced pages.
Stack algorithms do not suffer from Belady's anomaly.
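LRU can be sketched with an ordered dictionary playing the role of the doubly-linked stack described above: most recently used at the end, victim at the front.

```python
from collections import OrderedDict

def lru_faults(refs, nframes):
    stack = OrderedDict()   # front = least recently used
    faults = 0
    for p in refs:
        if p in stack:
            stack.move_to_end(p)           # re-reference: move to top
        else:
            faults += 1
            if len(stack) == nframes:
                stack.popitem(last=False)  # evict least recently used
            stack[p] = None
    return faults

s = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(lru_faults(s, 3))  # 10
print(lru_faults(s, 4))  # 8 -- no Belady's anomaly for a stack algorithm
```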
9.4.5 -- LRU Approximation Page Replacement
Most computers do not have hardware support for true LRU
page-replacement
Most systems do have a reference bit associated with
each page table entry.
When a reference is made to an address in a page the
hardware sets the corresponding reference bit
to indicate that the page has been accessed.
The operating system is able to clear reference bits.
We can implement page replacement algorithms that are
approximations of pure LRU by using manipulation of reference
bits.
9.4.5.1 --
Additional-Reference-Bits Algorithm
In this scheme the OS keeps a table containing
(say)
one byte of memory associated with each page.
Periodically an interrupt gives the OS control and
the OS rolls the value of each reference bit into
the most significant bit (msb) of its corresponding byte.
After rolling a reference bit, the OS clears it to
prepare for the next cycle. Examples:
If the byte is 0000 0000 then the page has not been
referenced for eight periods in a row.
If the byte is 1111 1111 then the page has been
referenced in all of the last eight periods.
If the byte is 0100 1000 then the page was referenced
two periods ago, and also five periods ago.
If the byte is 1000 0100 then the page was referenced
in the latest period and also six periods previously.
Note that if we just view these bytes as unsigned integers
then the smaller values correspond to the pages that are
less recently used.
When the OS needs to choose a victim for
page-replacement it picks a page with a minimal byte value.
This algorithm approximates LRU.
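A minimal sketch of the byte-rolling step (the page names and reference patterns here are invented for illustration):

```python
# Additional-reference-bits sketch: each page has one history byte.
# At each sampling interval the reference bit is shifted in at the
# msb, the old history shifts right, and the reference bit is cleared.

def roll(history, ref_bits):
    # history: dict page -> 8-bit int; ref_bits: dict page -> 0 or 1
    for page in history:
        history[page] = (ref_bits.get(page, 0) << 7) | (history[page] >> 1)
        ref_bits[page] = 0   # clear for the next interval
    return history

h = {"A": 0b00000000, "B": 0b00000000}
for refs in [{"A": 1}, {"B": 1}, {"A": 1}]:   # three sampling intervals
    roll(h, dict(refs))
print(f"{h['A']:08b} {h['B']:08b}")  # 10100000 01000000
# Viewed as unsigned integers, B's smaller value marks it as the
# less recently used page -- the better victim.
```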
9.4.5.2 -- Second-Chance Algorithm
In the typical implementation of the 2nd-chance algorithm
there is a circular linked list of records. Each record
represents one physical frame.
An external
pointer points to the current element of the list.
When the OS needs a victim
frame it executes the
following algorithm:
examine the current element.
If the reference bit of the frame is 0 (the frame
has not been referenced lately) then make the frame
the victim.
Copy the new page into the frame and advance the
pointer to the next frame.
Otherwise
if the reference bit of the frame is 1
(the frame was accessed recently) then do not make
the frame a victim
(give it a second chance). Clear the reference bit.
Advance the pointer
to the next frame. Go to step 1.
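The victim-selection loop above can be sketched as follows (the frame contents and reference bits are invented for illustration):

```python
# Second-chance ("clock") victim selection. frames is a circular list
# of [page, reference_bit] pairs; hand is an index into it.

def clock_victim(frames, hand):
    while True:
        page, ref = frames[hand]
        if ref == 0:
            # not referenced lately: this frame is the victim
            return hand, (hand + 1) % len(frames)
        frames[hand][1] = 0                  # second chance: clear the bit
        hand = (hand + 1) % len(frames)      # advance and keep scanning

frames = [["A", 1], ["B", 0], ["C", 1]]
victim, hand = clock_victim(frames, 0)
print(frames[victim][0], hand)  # B 2  (A's bit was cleared -- second chance)
```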
9.4.5.3 --
Enhanced Second-Chance Algorithm
The enhanced version is like straight 2nd chance except
we keep going around until we find an unreferenced,
clean victim (or until we come back around to
where we started).
If we do come back around without getting a victim then we go
around again until we find any unreferenced frame, clean or
dirty.
If there is other process activity going on concurrently
with this, the OS may have to go around again and choose a
referenced clean frame, or even go around a fourth time
and choose a referenced dirty frame.
It is said that the method above was used in a version of
the Macintosh OS. I don't know if this is still in use
with OS X.
9.4.6 -- Counting-Based Page Replacement
If we can arrange for a counter (or an approximation) to track
how many references have been made to each frame then we can
implement a least-frequently-used (LFU) or most-frequently-used
(MFU) page-replacement algorithm. These algorithms don't generally
tend to have very good performance, and the overhead of the
implementation is high.
9.4.7 -- Page-Buffering Algorithms
As an optimization the system may keep a pool of clean free
frames
-- when a dirty victim is selected the new page is
immediately copied into a clean free frame from the pool and
can be used right away. The dirty victim is then written back
to swap space. After that the now clean victim becomes part of
the pool. (As in any buffering scheme, there is always the
danger that the size of the pool will shrink to zero due to
excessive requests.)
Another optimization is for the OS to keep the pool of free
frames and also keep track of what is on the freed pages. If a
process faults on a page that happens to be in the pool then
the OS just gives the process back the frame from the pool and
thereby avoids loading that page from disk (this is sometimes
called
"reclaim from free.")
The OS can also keep track of which (allocated) pages are dirty
and "in its spare time" write dirty pages to swap space. This
way when a page has to be replaced it is more likely to be
clean, and hence more likely to be replaced more quickly. If
the OS is maintaining a pool of free frames, this idea of
writing back allocated frames is still worthwhile because it
tends to help keep excessive requests from draining the pool.
9.4.8 -- Applications and Page Replacement
This section seems kind of out-of-place here ...
9.5 Allocation of Frames
How many frames should be made available to a process? Should it be
allowed to take frames away from other processes?
Should frames for user pages and frames for file buffers and heap
storage all go into the same pool or should there be separate free
lists for the different usages?
Should the system allow free lists or pools of free memory to drain
completely, or should some minimum size be maintained at all costs?
9.5.1 -- Minimum Number of Frames
A process cannot execute an instruction unless the instruction
and all the data that the instruction accesses are entirely
resident in physical memory.
If an instruction straddles two pages and the data it acts on
straddles two or more pages then it may be necessary to have
four or more pages resident in memory in order for that
instruction to execute.
For every computer architecture there is some worst case
scenario. There is some largest number N such that a process
may need N resident pages in order to execute an instruction.
That being the case, the operating system must be set-up to
allow any process to have at least N frames. In the worst
case, the process would not be able to execute if it could not
get N frames allocated to it simultaneously.
9.5.2 -- Allocation Algorithms
It's possible to give nearly equal numbers of free frames to
all processes -- equal allocation
Probably it makes more sense to allocate free frames to
processes in proportion to their "need."
Need might be
measured in various ways. Larger processes or higher-priority
processes may be judged to be more needy.
9.5.3 -- Global Versus Local Allocation
When it replaces a page, the OS selects a frame used by process
X, and loads it with a page needed by process Y.
Under a global page-replacement policy, X and Y
don't have to be the same process.
Under a local page-replacement policy X and Y do
have to be the same process.
Global replacement policies are more common,
perhaps because
they allow the number of frames allocated to a process to
change according to changing need.
With global replacement ideally there is a "Robin Hood"
effect:
the operating system steals frames from "rich"
processes (that don't need them) and gives them to "poor"
processes (that do need them).
If everything works out perfectly then each process has the
frames it needs to maintain a low page-fault rate. This
should result in high average throughput.
On the other hand, there are other ways to dynamically alter
the number of frames allocated to each process. If we use
such an allocation strategy in conjunction with a local
page-replacement policy, might we allow processes finer
control over their own page-fault rates?
9.6 Thrashing
When the number of frames allocated to a process is low enough, it
will get page-faults so often that the paging activity of the
process will take up much more time than the execution of
instructions.
This process which spends more time paging than executing is said
to be thrashing.
9.6.1 -- Cause of Thrashing
Suppose the degree of multi-programming is very low -- say
there are only two active processes. In that case it
will quite often happen that both processes are waiting for
I/O at the same time. At those times the CPU will be
idle. CPU utilization is low in such a system. Assuming
there is adequate memory available, it will probably help
utilization if we increase the degree of multiprogramming.
On the other hand if the degree of multi-programming is very
high, and if physical memory is over-allocated then it is
likely that there will be a lot of thrashing going on. In
that case too the CPU utilization may be quite low because
often all processes will be blocked waiting for paging to
complete. In this case it will only make things worse to
increase the degree of multiprogramming.
According to the principle of locality at any given
time a process is accessing just a small subset of its text
and data. In comparison to memory access time, the contents
of the locality changes very slowly. According to this view
the process will not thrash as long as it has enough frames
to hold all or most of the current locality.
9.6.2 -- Working Set Model
We can pick a number Δ (delta) and arrange for the system
to count or approximate the number of (distinct) pages
referenced during the last Δ memory references. For
example, we may set Δ = 10,000 and estimate the number of
pages accessed in the last 10,000 memory references. (This will
require some combination of actions performed by hardware and
the OS -- e.g. keeping a history of values of reference bits.)
The set of pages referenced during the last Δ references is
called the working set. It is a representation of the locality
of the process. The aim would be to keep each process supplied
with enough frames to hold its working set.
If the OS uses this approach, and if there are sufficient
jobs available, it seems reasonable that the OS will be able
to keep the degree of multiprogramming high enough to attain
good CPU utilization and low enough to prevent thrashing.
When all processes have enough frames for the current size of
their working sets, and when there is enough additional
memory, the OS would bring another job into memory -- say
from swap space.
When memory is in short supply and processes can't get enough
for their working sets, the OS would swap a process out and
divide its allocation among the remaining processes that need
more frames.
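Computed from a reference trace after the fact, the working set is just the distinct pages in a trailing window of Δ references. A simplified sketch (a real OS must approximate this with reference-bit history rather than a full trace):

```python
# Working set W(t, delta): the distinct pages referenced in the
# window of the last delta references, ending at time t.

def working_set(refs, t, delta):
    return set(refs[max(0, t - delta + 1): t + 1])

refs = [1, 2, 1, 3, 4, 4, 4, 4, 2]
print(working_set(refs, t=8, delta=4))  # {2, 4}
# A tight locality (the run of 4s) keeps the working set small even
# though the process has touched four distinct pages overall.
```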
9.6.3 -- Page-Fault Frequency (PFF)
It is more direct and simple to work with PFF than a working set
model.
Establish an acceptable PFF: call it f.
If the PFF of a process exceeds f, give it more frames
until the PFF is less than f.
If the PFF of a process falls well below f then take
frames away from the process until the PFF is close to f.
When there is a need to give frames to processes and there are
not enough frames, swap a process out and free its frames.
When all processes have acceptable PFF's and there is memory to
spare, increase the level of multiprogramming -- swap a process
in.
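The PFF rules above can be sketched as a simple per-process control step. The target f, step size, and slack factor below are invented tuning parameters, not values from any real system:

```python
# PFF-style frame allocation: nudge a process's allocation so its
# page-fault frequency (pff, faults per access) stays near target f.

def adjust_frames(frames, pff, f=0.001, step=4, slack=0.5):
    if pff > f:
        return frames + step   # faulting too often: grant more frames
    if pff < slack * f:
        return frames - step   # well below target: reclaim frames
    return frames              # within the acceptable band: no change

print(adjust_frames(32, 0.01))    # 36 -- thrashing, so grow
print(adjust_frames(32, 0.0001))  # 28 -- over-provisioned, so shrink
print(adjust_frames(32, 0.0008))  # 32 -- acceptable, leave alone
```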
One possible drawback to the PFF approach is that it is not
sensitive to the difference between a change in size and a
transition of the working set. If the working set is not
changing in size, but merely transitioning from one locality to
another, it may be better for performance not to give more
frames to the process, but just to let page faults replace the
pages that are being "vacated" in favor of pages new to the
working set.
9.7 Memory-Mapped Files
VM techniques can be used to reduce the need for time-consuming
system calls and disk accesses for file I/O.
9.7.1 -- Basic Mechanism
The initial access to a file is handled like a page fault.
After a file block is mapped into VM, subsequent accesses are
routine memory accesses.
Writes through to disk can be done when the file is closed, or
as part of a periodic interrupt routine.
When memory-mapping of files is available, it can be used to
implement the sharing of a file by a group of processes, shared
memory, and copy-on-write functionality.
9.7.2 -- Shared Memory in the Win32 API
9.7.3 -- Memory-Mapped I/O
9.8 Allocating Kernel Memory
The OS has special needs. In order to conserve memory there may be
special memory allocation techniques and methods available to the
kernel.
9.8.1 -- Buddy System Allocation Method
This model facilitates allocation of memory in chunks in an
amount equal to any power of two - up to some limit.
Using the buddy system it's easy to coalesce some adjacent
deallocated chunks of equal size into single free chunks.
Problem: internal fragmentation can be up to 50% of total
memory.
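A sketch of the rounding that causes this internal fragmentation (the request sizes are invented for illustration):

```python
# Buddy-system sketch: every request is rounded up to the next power
# of two, so nearly half a block can be wasted in the worst case.

def buddy_block_size(request):
    size = 1
    while size < request:
        size *= 2
    return size

print(buddy_block_size(33 * 1024))  # 65536 -- a 33 KB request gets 64 KB
print(buddy_block_size(4096))       # 4096  -- exact powers of two waste nothing
```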
9.8.2 -- Slab Allocation
The idea of the slab allocator is to use a group of pages
as an array of one specific kind of data structure: say
process descriptors, file objects, semaphores, and so on.
Each slab can be managed as with "fixed-sized partitioning."
This is very simple and there is 'no fragmentation' -- in that
the allocated objects are always the exact size required.
The claim of 'no fragmentation' should be taken with a grain
of salt -- in some sense the unallocated portions of the slabs
are 'waste'. Slab sizes are multiples of the frame size.
9.9 Other Considerations
9.9.1 -- Prepaging
Under pure demand paging, we expect a large number of page
faults initially as a process begins execution, or as a process
which has been swapped out starts being swapped back in.
It may help overall performance if the OS performs prepaging.
The idea of prepaging is to load extra pages when servicing a
page fault. Instead of simply loading the page on which the
process faulted, load additional pages that the process is
likely to fault on soon.
When swapping in a process or servicing a page fault, if the OS
has some representation of the current working set of the
process, it can use that as a "hint" as to which additional
pages it should pre-fetch.
One simple tactic is to do clustering -- each page is a member
of a cluster of say 4-8 pages, and whenever we page in a member
of a cluster we always page in any other members of the cluster
that are not currently resident.
9.9.2 -- Page Size
Normally the architecture of the computer hardware determines
the page size (or some range of allowable page sizes.)
The trend in computer design has been to make page sizes larger
and larger over the past decade or so.
Current page sizes are typically 4K bytes per page or
larger.
There is no agreement on what is the "best" page size.
Arguments in support of BIG pages:
When pages are bigger, page tables can be smaller.
When pages are bigger it takes less time per byte to load
pages into primary memory.
When pages are bigger we have fewer page faults.
When pages are bigger we have more TLB cache hits.
Arguments in support of SMALL pages:
The internal fragmentation caused when a process does not
end on a page-boundary will be less if page sizes are
smaller
With a smaller page size we have better resolution.
Less unneeded material is paged into memory. Consequently
there is less total I/O and less waste in the allocation
of physical memory.
9.9.3 -- TLB Reach
TLB hardware is expensive and it uses a lot of power.
TLB reach is the number of addresses in memory that can be
accessed through the TLB.
TLB reach == (number of TLB entries) * (page size).
Greater TLB reach tends to increase the cache hit ratio, which
improves effective memory access time, turnaround time, and
throughput.
To improve reach, one can make the TLB larger, but
increasing
the page size will also improve reach.
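A worked example of the reach formula (the entry count and page sizes are illustrative, not taken from any particular CPU):

```python
# TLB reach = (number of TLB entries) * (page size).

def tlb_reach(entries, page_size):
    return entries * page_size

print(tlb_reach(64, 4096))       # 262144    -- 256 KB with 4 KB pages
print(tlb_reach(64, 4 * 2**20))  # 268435456 -- 256 MB with 4 MB pages
```

Going from 4 KB to 4 MB pages multiplies reach by 1024 without adding a single (expensive) TLB entry, which is part of the appeal of large pages.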
Given a fixed average process size, internal fragmentation
increases with the page size -- because the fraction of memory
wasted by the average process is about

    (page size / 2) / process size
Many modern systems are able to use more than one page size --
for example Solaris uses 8KB pages for most processes, and 4MB
pages for very large processes. A field in the TLB entry
indicates the size of the page.
9.9.4 -- Inverted Page Table
In a system with virtual memory and/or swapping, the OS must
maintain information telling it where all the non-resident
pages are located on disk.
If the system uses an inverted page table for ordinary
logical-to-physical address translation, typically there are
per-process external page tables where that location
information is kept.
Under these conditions, the per-process external page table is
needed only when the process gets a page-fault.
The external tables may be left paged-out to swap space most of
the time.
Therefore,
through the use of an inverted page table in a
system with virtual memory, it is possible to reduce the total
amount of physical memory allocated to page tables.
Bear in mind, however, that when a page-fault occurs the OS may
need to page in part of the external page table in addition to
the page on which the process faulted.
9.9.5 -- Program Structure
The text cites an example where an array is laid out in
row-major form, each row on a separate page. A program that
initializes each array entry may incur many more page faults
if it accesses the array a column at a time instead of a row
at a time.
This illustrates that
details of program layout and
address-referencing patterns can affect process performance in
a system with virtual memory.
9.9.6 -- I/O Interlock
Hardware may provide a lock bit for each frame. When the
lock bit is set it means that this frame should not be
replaced.
It is useful to lock kernel pages, user process pages with
pending I/O, and pages newly brought in but as yet unused.
9.10 Operating-System Examples
9.10.1 -- Windows XP
Demand paging with clustering - whole clusters are fetched
together following a page fault.
A process is assigned a working set minimum and maximum.
If a process faults when at its maximum the OS performs a local
page replacement.
When free memory is scarce, the OS will take frames away from
processes that have more than their working set minimum.
9.10.2 -- Solaris
Global replacement
A faulting thread receives a new frame.
If free memory falls below lotsfree then the pageout
process starts running a two-handed clock algorithm.
The front hand 'forgives' - clears the reference bit. The back
hand 'reaps' - frees the frames that have not been referenced,
and writes them if dirty.
"Reclaim from free" is possible.
The scanrate and handspread can vary.
If free memory drops below a certain level, swapping is
initiated.
Recent releases of Solaris do priority paging. For
example, frames belonging to shared libraries may be 'excused'
from becoming victims.