(Latest Revision: Sun Sep 19, 2005)
Chapter Nine
--
Virtual Memory
--
Lecture Notes
- Intro -- Virtual Memory:
- allows a partially-loaded process to execute,
- allows programs to be larger than physical memory,
- is not easy to implement, and
- can decrease performance.
- Section 9.1 -- Background
- With virtual memory, the level of multiprogramming can be higher,
and hence CPU utilization and throughput can be higher
- With virtual memory, parts of the program that are not used are
not loaded, so there is a potential to save time on loading and
swapping I/O.
- One may think of virtual memory as "automated overlaying."
- Virtual memory may be implemented by demand-paging or
demand-segmentation.
- Section 9.2 -- Demand Paging
- One may think of the pager daemon as a "lazy swapper."
- Section 9.2.1 -- Basic Concepts
- Support from the hardware is required to implement virtual
memory. A set "valid bit" in a page-table entry can signify
that the page number is a legal address for the process
and that the page is currently resident in main
memory.
- As an aid to the OS, the page table entry of a non-resident
page may contain the on-disk address of the page.
- If, during address translation, the hardware finds a
page-table entry marked invalid (valid bit clear), a
page-fault is initiated. To service a page-fault the OS
must:
- check to see if the logical address is in-range for the
process
- if not in range terminate the process, else
- find a free frame
- schedule a disk read to load the missing page
- when the page has loaded mark it present in the page
table and other internal data structure(s)
- restart the instruction that caused the page-fault
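The fault-service steps above can be sketched as a toy Python model. All of the names here (PageTableEntry, service_page_fault, read_from_disk) are illustrative, not taken from any real kernel:

```python
# A toy model of the page-fault service steps above.
class PageTableEntry:
    def __init__(self, disk_addr):
        self.valid = False          # clear until the page is resident
        self.frame = None           # frame number once loaded
        self.disk_addr = disk_addr  # on-disk address of the page

def service_page_fault(page_table, page_num, free_frames, read_from_disk):
    """Service a fault on page_num; return the frame that was used."""
    if page_num >= len(page_table):       # in-range check
        raise MemoryError("illegal address: terminate the process")
    frame = free_frames.pop()             # find a free frame
    pte = page_table[page_num]
    read_from_disk(pte.disk_addr, frame)  # "schedule" the disk read
    pte.frame = frame                     # mark the page present
    pte.valid = True
    return frame                          # caller restarts the instruction

# tiny demo: a 4-page process, 3 free frames, fake disk reads
table = [PageTableEntry(disk_addr=i * 4096) for i in range(4)]
free_frames = [7, 8, 9]
reads = []
service_page_fault(table, 2, free_frames, lambda d, f: reads.append((d, f)))
```

A real handler would also block the faulting process while the read is in flight; the sketch skips that.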
- Pure demand paging would start the process with no resident
pages and all pages would be loaded as a result of a
page-fault. In practice an OS performs pre-fetching in an
attempt to reduce the number of page-faults.
- Because of locality of reference, page-fault rates are not
nearly as high as they would be if addresses were generated
"randomly" by the executing process.
- Hardware Support Required for Demand Paging:
- Page table
- Swap space in secondary memory
- Instruction architecture in which any instruction can
be restarted after a page-fault. (complexities are
discussed on pp. 324-325 of the sixth edition.)
- Section 9.2.2 -- Performance of Demand Paging
- The page-fault rate of a process is the number of
page-faults the process gets during its execution divided by
the number of memory accesses it performs.
- A normal memory access takes somewhere between 10 and 200
nanoseconds -- say the average is about 100 nanoseconds.
- If a page-fault occurs during an attempted memory access, it
will require about 25 milliseconds to complete that memory
access since the page has to be loaded from secondary
storage.
- There are 250,000 100-nanosecond periods in a 25-millisecond
period. This is like the difference between a second and 3
days -- 3 days is about 250,000 seconds.
- Under the assumptions above, if we get just one page-fault
in every 250,000 page accesses then the effective memory
access time could be double the 100ns figure -- 200ns.
- To assure the effective memory access time will be below
110ns we would need to have fewer than one page-fault in
every 2,500,000 memory accesses.
- The point here is that page-faults are very
detrimental to effective memory access time and so we must
keep the page-fault rate very, very low.
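The arithmetic above can be checked with a one-line effective-access-time formula, using the assumed 100 ns access and 25 ms fault-service times:

```python
# Effective access time: a weighted average of the normal access time
# and the page-fault service time (here 100 ns and 25 ms = 25,000,000 ns).
def effective_access_time(fault_rate, access_ns=100, fault_ns=25_000_000):
    return (1 - fault_rate) * access_ns + fault_rate * fault_ns

# one fault per 250,000 accesses roughly doubles the access time
print(effective_access_time(1 / 250_000))    # ~200 ns
# one fault per 2,500,000 accesses keeps it under 110 ns
print(effective_access_time(1 / 2_500_000))  # ~110 ns
```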
- It is faster to page or swap to the swap space,
rather than the file system, so an OS does not usually page
to the file system.
- Section 9.3 -- Process Creation
- Section 9.3.1 -- Copy-on-Write
- Windows 2000, Linux, and Solaris 2 use copy-on-write with
fork() -- only pages actually written by the parent or child
are copied.
- Another variant used by UNIX is vfork() -- here the parent
sleeps while the child uses the address space of the parent.
The child is supposed to *not* write the parent's space.
The child is expected to soon use a form of exec that gives
the child a new address space of its own -- separate and
distinct from that of the parent. When the parent wakes it
*will* see any changes the child made to the parent's
address space, so caution is critical. This is considered
dangerous and inelegant but efficient.
- Section 9.3.2 -- Memory-Mapped Files
- The OS may map files into memory and use demand paging to
load incrementally.
- "File semantics" may not be identical to disk-based file I/O
-- for example writes to the memory-resident file may not
"go through" to the disk immediately.
- Solaris 2 memory maps all file I/O -- to user memory if
mmap() is used, otherwise to kernel "buffer" memory.
- The same basic scheme will support file sharing by multiple
processes and copy-on-write sharing.
- Note that processes sharing memory may use synchronization
techniques of chapter 7 to solve critical section problems.
- Section 9.4 -- Page Replacement
- Physical memory may become over-allocated -- there may be no free
frames when a page-fault occurs!
- Section 9.4.1 -- Basic Scheme
- If there is no free frame to service a page-fault then take
one away from some process that does not appear to need it
(badly).
- Write the contents of the "victim" frame to swap space first
if it has been modified (as recorded in the "dirty bit" of
the page table entry) otherwise *don't* write -- save time.
- Update page table and data structures associated with both
the "victim" and the "aggressor" processes.
- We need to solve the frame-allocation problem: how many
frames are allocated to each process?
- We need to choose a page-replacement algorithm: which frame
should be the "victim"? We want the replacement algorithm
that gives the lowest page-fault rate.
- We evaluate a page replacement algorithm by trying it out on
some reference strings.
- Suppose you make a list L of all the logical memory
addresses referenced by a process P during its lifetime, in
the order the references are made. Suppose you make a list
L' from L by just using the page numbers of each address
from L. Suppose you then "consolidate" L' into a third list
L" by "collapsing" all runs of the same page number into
just one "copy" of that page number. The result would be
the reference string of the process P.
The reference string of a process is the sequence of
page numbers of "new" pages referenced by the process, in
order of reference. (Every time the process stops
referencing one page and starts referencing a different
page, a new page number is appended to the reference
string.)
- The length of a reference string is just the number
of terms in the sequence.
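The L -> L' -> L" construction above is easy to express in Python. The page size of 100 bytes is chosen only to make the example addresses readable:

```python
def reference_string(addresses, page_size):
    """Collapse a list of logical addresses (L) into a reference string (L")."""
    pages = [a // page_size for a in addresses]  # L -> L' (page numbers)
    ref = []
    for p in pages:                              # L' -> L"
        if not ref or ref[-1] != p:              # collapse runs of one page
            ref.append(p)
    return ref

# with a (toy) page size of 100 bytes:
print(reference_string([100, 432, 101, 612, 102, 103, 104, 101, 611, 102], 100))
# -> [1, 4, 1, 6, 1, 6, 1]
```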
- Section 9.4.2 -- FIFO Page Replacement
- The OS can link records representing all the occupied
frames into a FIFO queue. When a page is loaded into a
frame the corresponding record goes into the back of the
queue. When a victim for page replacement is needed the
frame at the front of the queue is selected. This is called
FIFO page replacement.
- FIFO page replacement is easy to understand and implement
but it does not tend to do very well at keeping the
page-fault rate to a minimum.
- FIFO page replacement is subject to Belady's anomaly: the
page fault rate for the same reference string can sometimes
go *up* when you increase the number of frames available!
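A short simulation shows both FIFO replacement and Belady's anomaly on the much-used reference string 1,2,3,4,1,2,5,1,2,3,4,5 -- 9 faults with 3 frames but 10 with 4:

```python
from collections import deque

def fifo_faults(ref, nframes):
    """Count page faults under FIFO replacement."""
    resident, queue, faults = set(), deque(), 0
    for p in ref:
        if p not in resident:
            faults += 1
            if len(resident) == nframes:
                resident.discard(queue.popleft())  # evict the oldest page
            resident.add(p)
            queue.append(p)
    return faults

belady = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_faults(belady, 3))  # 9 faults
print(fifo_faults(belady, 4))  # 10 faults -- more frames, more faults!
```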
- Section 9.4.3 -- Optimal Page Replacement
- Interestingly, there is a page replacement policy that is
known to be "as good as possible."
- It gives the lowest possible page-fault rate, guaranteed.
- In essence, the algorithm is to "replace the page that will
not be used for the longest period of time."
- The algorithm can be restated as: "Replace the page with the
latest next use."
- This algorithm is sometimes called "OPT" or "MIN." It might
also be called "LNU" for "latest next use."
- Sometimes the algorithm is referred to as "the Oracle
method" because the OS has to see into the future to apply
the rule.
- Usually we don't have prior knowledge of what the reference
string of a process will be, and so it is usually not
possible to implement the OPT algorithm -- too bad.
(Remember that the reference string of a process generally
depends on the outcomes of branch instructions, which in
turn generally depend on what data is input to the process.)
- However we can test other page-replacement algorithms by
comparing their performance with that of OPT on pre-selected
reference strings.
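OPT can be simulated after the fact, since a recorded reference string lets us "see into the future." A sketch (with an arbitrary tie-break when several pages are never used again):

```python
def opt_faults(ref, nframes):
    """Count page faults under OPT: evict the page with the latest next use."""
    resident, faults = set(), 0
    for i, p in enumerate(ref):
        if p in resident:
            continue
        faults += 1
        if len(resident) == nframes:
            future = ref[i + 1:]
            def next_use(q):
                # position of q's next use, or infinity if never used again
                return future.index(q) if q in future else float('inf')
            resident.discard(max(resident, key=next_use))
        resident.add(p)
    return faults

print(opt_faults([1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5], 3))  # 7 faults
```

On the same string that gave FIFO 9 faults with 3 frames, OPT needs only 7.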
- Section 9.4.4 -- LRU Page Replacement
- The idea of LRU is to replace the page that is "least
recently used" -- i.e. the page that has not been accessed
for the longest time.
- Because LRU is similar to OPT -- it just looks "back"
instead of "ahead" -- it seems plausible that LRU may tend
to have low page fault rates.
- LRU is considered to be "good" although it can fail
miserably in some situations.
- In theory we could implement LRU by arranging for
hardware to increment a counter after every memory
reference, to write the value of the counter into the page
table entry (PTE) every time a page is accessed, and to
search page tables for the lowest value of the counter
whenever a page needs replacement.
- Another method -- one that seems a little more practical --
would be for the hardware to keep records
representing the frames (doubly) linked into a stack. Each
time a reference is made to a page the hardware would
have to place the referenced frame at the top of the stack
by manipulating pointers. (This can be done quickly if the
underlying list is doubly linked.) The LRU frame for
replacement would just be the frame on the bottom of the
stack.
- OPT and LRU are stack algorithms -- at every point in the
reference string, the set of pages in memory with an
allocation of N frames is a subset of the set of pages in
memory with an allocation of N+1 frames.
- To see that LRU is a stack algorithm, note that when doing
LRU with N frames, the pages in memory are the N most
recently referenced pages.
- Stack algorithms do not suffer from Belady's anomaly.
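The stack scheme above maps naturally onto Python's OrderedDict, which keeps insertion order and can move an entry to the end in O(1). On the reference string used earlier for FIFO, LRU shows no anomaly:

```python
from collections import OrderedDict

def lru_faults(ref, nframes):
    """Count page faults under LRU, using an OrderedDict as the 'stack'."""
    stack, faults = OrderedDict(), 0
    for p in ref:
        if p in stack:
            stack.move_to_end(p)           # referenced page goes to the top
        else:
            faults += 1
            if len(stack) == nframes:
                stack.popitem(last=False)  # bottom of stack = LRU victim
            stack[p] = True
    return faults

belady = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(lru_faults(belady, 3))  # 10 faults
print(lru_faults(belady, 4))  # 8 faults -- no anomaly for a stack algorithm
```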
- Section 9.4.5 -- LRU Approximation Page Replacement
- Most computers do not have hardware support for true LRU
page-replacement
- Most systems do have a reference bit associated with each
physical frame.
- When a reference is made to an address in a frame the hardware
sets the corresponding reference bit to indicate that the
frame has been accessed.
- The operating system is able to clear reference bits if it
wants to.
- We can implement page replacement algorithms that are
approximations of pure LRU by using manipulation of reference
bits.
- Section 9.4.5.1 -- Additional-Reference-Bits Algorithm
- In this scheme the OS keeps (say) one byte of memory
associated with each frame. Periodically an interrupt
gives the OS control and the OS rolls the value of each
reference bit into the most significant bit (msb) of the
corresponding byte. After rolling each reference bit, the
OS clears it. Examples:
- If the byte is 0000 0000 then the frame has not been
referenced for eight periods in a row.
- If the byte is 1111 1111 then the frame has been
referenced in all of the last eight periods.
- If the byte is 0100 1000 then the frame was
referenced two periods ago, and also five periods
ago.
- If the byte is 1000 0100 then the frame was
referenced in the latest period and also six periods
previously.
- Note that if we just view these bytes as unsigned
integers then the smallest values correspond to the
frames that are least recently used.
- When the OS needs to choose a victim for page-replacement
it picks frames with minimal byte-value. This algorithm
approximates LRU.
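One aging tick of this scheme is a single shift-and-or per frame. A minimal sketch (the frame values and reference bits below are made up for the example):

```python
def age_histories(histories, ref_bits):
    """One aging tick: shift each frame's reference bit into the msb of
    its 8-bit history byte.  (The OS would then clear the ref bits.)"""
    return [((h >> 1) | (r << 7)) & 0xFF for h, r in zip(histories, ref_bits)]

hist = age_histories([0b00000000, 0b11111111, 0b01001000], [1, 0, 0])
print([format(h, '08b') for h in hist])
# victim: the frame whose history byte is smallest as an unsigned integer
print(min(range(len(hist)), key=lambda i: hist[i]))
```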
- Section 9.4.5.2 -- Second-Chance Algorithm
- The idea of the 2nd-chance algorithm is to link
representations of all physical frames into a circular
list.
- An external pointer points to the current element of the
list.
- When the OS needs a victim frame it executes the
following algorithm:
- examine the current element.
- If the reference bit of the frame is 0 (the frame
has not been referenced lately) then make the frame
the victim. Copy the new page into the frame and
advance the pointer to the next frame.
- Otherwise if the reference bit of the frame is 1
(the frame was accessed recently) then do not make
the frame a victim (give it a second chance). Clear
the reference bit. Advance the pointer to the next
frame. Go to step 1.
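The loop above is often drawn as a clock hand sweeping over the frames; a minimal sketch of one victim selection:

```python
def second_chance_victim(ref_bits, hand):
    """Advance the clock hand until a frame with reference bit 0 is found;
    referenced frames get their bit cleared (a 'second chance') instead."""
    n = len(ref_bits)
    while ref_bits[hand]:
        ref_bits[hand] = 0            # second chance: clear and move on
        hand = (hand + 1) % n
    return hand, (hand + 1) % n       # victim frame, next hand position

bits = [1, 1, 0, 1]
victim, hand = second_chance_victim(bits, 0)
print(victim, hand, bits)  # 2 3 [0, 0, 0, 1]
```

Note that if every frame has its bit set, the hand clears them all and comes back to where it started, which then becomes the victim -- degenerating to FIFO.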
- Section 9.4.5.3 -- Enhanced Second-Chance Algorithm
- The enhanced version is like straight 2nd chance except
we keep going around until we find an un-referenced
clean victim. (or until we come back around to
where we started.)
- If we do come back around without getting a victim then
go around again until finding any unreferenced
frame.
- If there is other process activity going on concurrently
with this, the OS may have to go around again and choose
a referenced clean frame, or even go around a fourth time
and choose a referenced dirty frame.
- The method above was used in a Mac OS. I don't know if
this is still in use with OS X.
- Section 9.4.6 -- Counting-Based Page Replacement
- If we can arrange for a counter (or an approximation) to track
how many references have been made to each frame then we can
implement least frequently used or most frequently used
page-replacement algorithms.
- Section 9.4.7 -- Page-Buffering Algorithm
- As an optimization the system may keep a pool of clean free
frames -- when a dirty victim is selected the new page is
immediately copied into a clean free frame from the pool and
can be used right away. The dirty victim is then written back
to swap space. After that the now clean victim becomes part of
the pool. (There is always the danger that the size of the
pool will shrink to zero due to excessive requests.)
- Another optimization is for the OS to keep the pool of free
frames and also keep track of what is on the freed pages. If
a process faults on a page that happens to be in the pool then
the OS just gives the process back the frame from the pool and
thereby avoids loading that page from disk (this is sometimes
called "reclaim from free.")
- The OS can also keep track of which (allocated) pages are
dirty and "in its spare time" write dirty pages to swap
space. This way when a page has to be replaced it is more
likely to be clean, and hence more likely to be replaced
more quickly. If the OS is maintaining a pool of free
frames, this idea of writing back allocated frames is still
worthwhile because it tends to help keep excessive requests
from draining the pool.
- Section 9.5 -- Allocation of Frames
- How many frames should be made available to a process? Should it
be allowed to take frames away from other processes?
- Should frames for user pages and frames for file buffers and heap
storage all go into the same pool or should there be separate free
lists for the different usages?
- Should the system allow free lists or pools of free memory to drain
completely, or should some minimum size be maintained at all costs?
- Section 9.5.1 -- Minimum Number of Frames
- A process cannot execute an instruction unless the
instruction and all the data that the instruction accesses
are entirely resident in physical memory.
- If an instruction straddles two pages and the data it acts on
straddles two or more pages then it may be necessary to have
four or more pages resident in memory in order for that
instruction to execute.
- For every computer architecture there is some worst case
scenario. There is some largest number N such that a process
may need N resident pages in order to execute an instruction.
- That being the case, the operating system must be set-up to
allow any process to have at least N frames. In the worst
case, the process would not be able to execute if it could not
get N frames allocated to it simultaneously.
- Section 9.5.2 -- Allocation Algorithms
- It's possible to give equal numbers of free frames to all
processes -- equal allocation
- Probably it makes more sense to allocate free frames to
processes in proportion to their "need." Need might be
measured in various ways. larger processes or higher priority
processes may be judged to be "needier."
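Proportional allocation by process size (one crude measure of "need") is a one-liner. The sizes and frame count below are just example numbers:

```python
def proportional_allocation(sizes, total_frames):
    """Allocate frames to processes in proportion to their sizes in pages."""
    s = sum(sizes)
    return [size * total_frames // s for size in sizes]

# two processes of 10 and 127 pages sharing 62 frames
print(proportional_allocation([10, 127], 62))  # [4, 57], one frame left over
```

The integer division leaves a remainder of frames; a real allocator would hand those out by some tie-break rule and would also enforce the architectural minimum from section 9.5.1.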
- Section 9.5.3 -- Global Versus Local Allocation
- When it replaces a page, the OS selects a frame used by
process X, and loads it with a page needed by process Y.
- Under a global page-replacement policy, X and Y don't have
to be the same process.
- Under a local page-replacement policy X and Y do have to be
the same process.
- Global replacement policies are more common, perhaps because
they allow the number of frames allocated to a process to
change according to changing need.
- With global replacement ideally there is a "Robin Hood"
effect: the operating system steals frames from "rich"
processes (that don't need them) and gives them to "poor"
processes (that do need them).
- If everything works out perfectly then each process has the
frames it needs to maintain a low page-fault rate. This
should result in high average throughput.
- On the other hand, there are other ways to dynamically alter
the number of frames allocated to each process. If we use
such an allocation strategy in conjunction with a local
page-replacement policy, might we allow processes finer
control over their own page-fault rates?
- Section 9.6 -- Thrashing
- When the number of frames allocated to a process is low enough, it
will get page-faults so often that the paging activity of the
process will take up much more time than the execution of
instructions.
- A process that spends more time paging than executing is said
to be thrashing.
- Section 9.6.1 -- Cause of Thrashing
- Suppose the degree of multi-programming is very low -- say
there are only one or two active processes. In that case it
is likely that all the active processes will often be waiting
for I/O at the same time. At those times the CPU will be
idle. CPU utilization is low in such a system. Assuming
there is adequate memory available, it will probably help
utilization if we increase the degree of multiprogramming.
- On the other hand if the degree of multi-programming is very
high, and if physical memory is over-allocated then it is
likely that there will be a lot of thrashing going on. In
that case too the CPU utilization may be quite low because
often all processes will be blocked waiting for paging to
complete. In this case it will only make things worse to
increase the degree of multiprogramming.
- According to the principle of locality at any given
time a process is accessing just a small subset of its text
and data. In comparison to memory access time, the contents
of the locality change very slowly. According to this view
the process will not thrash as long as it has enough frames
to hold all or most of the current locality.
- Section 9.6.2 -- Working Set Model
- We can pick a number D (delta) and arrange for the system to
count or approximate the number of (distinct) pages referenced
during the last D memory references. (This will require some
combination of actions performed by hardware and the OS -- e.g.
keeping a history of values of reference bits.)
- The set of pages referenced during the last D references is
called the working set. It is a representation of the locality
of the process. The aim would be to keep each process supplied
with enough frames to hold its working set.
- If the OS uses this approach, and if there are sufficient
jobs available, it seems reasonable that the OS will be able
to keep the degree of multiprogramming high enough to attain
good CPU utilization and low enough to prevent thrashing.
- When all processes have enough frames for the current size of
their working sets, and when there is enough additional
memory, the OS would bring another job into memory -- say
from swap space.
- When memory is in short supply and processes can't get enough
for their working sets, the OS would swap a process out and
divide its allocation among the remaining processes that need
more frames.
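Given a recorded reference string, the working set at time t is just the distinct pages in a sliding window; a minimal sketch with made-up numbers:

```python
def working_set(ref, t, delta):
    """The set of distinct pages referenced in the window of the last
    `delta` references ending at time t (0-indexed)."""
    return set(ref[max(0, t - delta + 1): t + 1])

ref = [1, 2, 1, 3, 4, 4, 3, 3, 2, 2]
print(working_set(ref, 9, 4))  # pages touched at times 6..9
print(working_set(ref, 4, 5))  # pages touched at times 0..4
```

The choice of delta matters: too small and the window misses part of the locality; too large and it spans several localities at once.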
- Section 9.6.3 -- Page-Fault Frequency (PFF)
- It is more direct and simple to work with page-fault
frequency than a working set model.
- Establish an acceptable page fault rate R.
- If a process exceeds R, give it more frames until the
rate is less than R.
- If the page-fault rate of a process falls well below R then
take frames away from the process until the page-fault rate
is close to R.
- When there is a need to give frames to processes and there
are not enough frames, swap a process out and free its
frames.
- When all processes have acceptable page-fault rates and there
is memory to spare, increase the level of multiprogramming --
swap a process in.
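One step of such a PFF controller might look like the sketch below. The thresholds low and high are made-up numbers for illustration; a real OS would tune them:

```python
def pff_adjust(fault_rate, frames, low=0.02, high=0.10):
    """One step of a page-fault-frequency controller (illustrative bounds)."""
    if fault_rate > high:
        return frames + 1              # too many faults: add a frame
    if fault_rate < low and frames > 1:
        return frames - 1              # well below R: reclaim a frame
    return frames                      # in the acceptable band: no change

print(pff_adjust(0.20, 10))   # 11
print(pff_adjust(0.001, 10))  # 9
print(pff_adjust(0.05, 10))   # 10
```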
- Section 9.7 -- Operating Systems Examples
- Section 9.7.1 -- Windows NT
- NT uses demand paging with clustering (whole clusters of
several pages are pre-paged).
- A process is assigned a min and max "working set" -- number
of frames.
- If a process faults and it currently has less than the max
number of frames, then NT will allocate it a new frame, else
it uses a local replacement algorithm.
- When free frames are scarce NT will trim processes back to
their minimum numbers of frames.
- Versions of NT running on a uniprocessor use a
page-replacement algorithm that is a variant of the clock
algorithm.
- On a multiprocessor NT uses FIFO page replacement because
there are cache-consistency problems involved with clearing
reference bits.
- Section 9.7.2 -- Solaris 2
- Solaris always gives a faulting thread a free frame.
- If the supply of free frames falls below "lotsfree" then the
pageout daemon starts running the two-handed-clock algorithm.
This takes frames away from processes that have not used them
recently and places them on the free list.
- As free frames become more scarce Solaris increases the
number of pages scanned per second by the clock algorithm.
- Solaris starts swapping processes if the amount of free
memory falls below a certain level.
- Section 9.8 -- Other Considerations
- Section 9.8.1 -- Prepaging
- Under pure demand paging, we expect a large number of page
faults initially as a process begins execution, or as a
process which has been swapped out starts being swapped back
in.
- It may help overall performance if the OS
performs prepaging.
- The idea of prepaging is to load extra pages when servicing a
page fault. Instead of simply loading the page on which the
process faulted, load additional pages that the process is
likely to fault on soon.
- If the OS has some representation of the current working set
of the process, it can use that as a "hint" as to which
additional pages it should pre-fetch.
- One simple tactic is to do clustering -- each page is a
member of a cluster of say 4-8 pages, and whenever we page in
a member of a cluster we always page in any other members of
the cluster that are not currently resident.
- Section 9.8.2 -- Page Size
- Normally the architecture of the computer hardware determines
the page size (or some range of allowable page sizes.)
- The trend in computer design has been to make page sizes
larger and larger over the past decade or so. Current page
sizes range upwards of 4K bytes per page.
- Arguments in support of BIG pages:
- When pages are bigger, page tables can be
smaller.
- When pages are bigger it takes less time per byte to
load pages into primary memory.
- When pages are bigger we have fewer page faults.
- Arguments in support of SMALL pages:
- The internal fragmentation caused when a process does
not end on a page-boundary will be less if page sizes
are smaller
- With a smaller page size we have better
resolution. Less unneeded material is paged into
memory. Consequently there is less total I/O and less
waste in the allocation of physical memory.
- Section 9.8.3 -- TLB Reach
- TLB hardware is expensive and it uses a lot of power.
- TLB reach is the number of addresses in memory that can be
accessed through the TLB.
- A larger page size implies a larger TLB reach.
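The relationship is simple multiplication; for example, a 64-entry TLB covers 256 KB with 4 KB pages but 256 MB with 4 MB pages:

```python
def tlb_reach(entries, page_size):
    """TLB reach = number of entries x page size (bytes)."""
    return entries * page_size

print(tlb_reach(64, 4096))       # 262144 bytes = 256 KB with 4 KB pages
print(tlb_reach(64, 4 * 2**20))  # 268435456 bytes = 256 MB with 4 MB pages
```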
- Section 9.8.4 -- Inverted Page Table
- In a system with virtual memory, the OS must maintain
information telling it where all the non-resident pages are
located on disk.
- Typically there is a per-process external page table where
that information is kept.
- If the system employs an inverted page table for ordinary
logical-to-physical address translation then the per-process
external page table will be needed only when the process gets
a page-fault.
In that case the external tables may stay paged-out to swap
space most of the time.
Therefore, through the use of an inverted page table in a
system with virtual memory, it is possible to lessen the
total amount of physical memory allocated to page tables.
- It must be borne in mind however that when a page-fault
occurs the OS may need to page in part of the external page
table in addition to the page on which the process faulted.
- Section 9.8.5 -- Program Structure
- The text cites an example where an array is laid out in
row-major form, each row on a separate page. A program that
initializes each array entry will incur many more page faults
if it accesses the array a column at a time instead of a row
at a time.
- This illustrates that details of program layout and
address-referencing patterns can affect process performance
in a system with virtual memory.
- Section 9.8.6 -- I/O Interlock
- Hardware may provide a lock bit for each frame. When
the lock bit is set it means that this frame should not be
replaced.
- It is useful to lock kernel pages, user process pages with
pending I/O, and pages newly brought in but as yet unused.
- Section 9.8.7 -- Real-Time Processing
- A real time process must not be subjected to delays caused by
paging activity.
- Solaris 2 allows privileged users to require pages to be
locked into memory. This is support for real-time computing.
- Section 9.9 -- Summary
- Virtual memory automates what programmers used to do with
overlays -- allowing processes that are bigger than physical
memory to execute.
- In theory the degree of multiprogramming can be higher in a
system with virtual memory, resulting in a system with higher
CPU utilization.