allows processes to be larger than physical memory,
allows more programs to execute concurrently,
tends to make processes faster
by reducing time spent on loading and swapping,
facilitates page sharing,
is not easy to implement, and
can decrease performance.
9.2 Demand Paging
The idea of demand paging is to
load only the pages of the program that are actually needed
- a "lazy" form of swapping.
9.2.1 -- Basic Concepts
Support from the hardware is required to implement virtual
memory. A set valid bit
in a page-table entry can signify
that the page number is a legal address for the process
and that the page is currently resident in main memory.
As an aid to the OS, the page table entry of a non-resident
page may contain the on-disk address of the page.
What if the hardware encounters an invalid bit
in a page-table entry during address translation? That
triggers a trap to the OS.
The OS then handles
(aka services) this page-fault. To service
a page-fault, the OS must:
check an internal table to see if the logical
address is valid for the process.
if not valid, terminate the process,
if the page is valid but not resident in primary memory,
find a free frame,
schedule a disk read to
load the missing page
into the frame,
after the page is loaded, mark it valid in the page
table and the internal table, then
restart the instruction
that caused the page-fault.
In a pure demand paging scheme, the OS
would start the process with no resident
pages and would not load any page unless and until
the process faults on that page. In practice, an OS
typically performs some pre-paging in
an attempt to reduce the number of page-faults.
Because of locality of reference, page-fault rates are not
nearly as high as they would be if addresses were referenced
"randomly" by the executing process.
Hardware Support Required for Demand Paging:
Swap space on high-speed backing store
Instruction architecture in which any instruction can
be restarted after a page-fault. (complexities are
discussed on pp. 404-405 of the ninth edition.)
9.2.2 -- Performance of Demand Paging
The page-fault rate of a process is the number of
page-faults the process gets during its execution divided by
the number of memory accesses it performs.
A normal memory access takes somewhere between 10 and 200
nanoseconds.
If a page-fault occurs during an attempted memory access, it
may require about 8 milliseconds to complete that memory
access, mainly because the page has to be loaded from secondary
storage. It may require additional increments of 8 milliseconds
due to time waiting to reach the 'front' of the device queue.
Note that there are 40,000 200-nanosecond periods in an 8 millisecond
period. This is like the difference in magnitude between one
second and eleven hours -- eleven hours is about 40,000 seconds.
Under the assumptions above,
if we get just one page-fault
in every 40,000 memory accesses then the effective memory
access time could be double the 200ns figure -- 400ns.
To assure an effective memory access time below 220ns we would
need to have fewer than one page-fault in every 400,000 memory
accesses.
The point here is that page-faults are very
detrimental to effective memory access time and so
it is important to keep the page-fault rate very, very low.
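As a rough check on the arithmetic above, here is a minimal Python sketch.
It assumes the figures used in these notes (a 200ns memory access and an 8ms
page-fault service time) and the usual weighted-average formula for effective
access time. The function names are made up for illustration.

```python
def effective_access_time_ns(p, mem_ns=200, fault_ns=8_000_000):
    """Effective access time for page-fault rate p (a probability per access).

    A fraction (1 - p) of accesses cost mem_ns; a fraction p cost fault_ns.
    """
    return (1 - p) * mem_ns + p * fault_ns


def max_fault_rate(target_ns, mem_ns=200, fault_ns=8_000_000):
    """Largest fault rate that keeps the effective access time at target_ns."""
    return (target_ns - mem_ns) / (fault_ns - mem_ns)
```

With these assumptions, effective_access_time_ns(1 / 40_000) comes out very
close to 400ns, and 1 / max_fault_rate(220) comes out very close to 400,000.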
It is faster to page or swap to the swap space, rather
than the file system, so an OS does not usually page to the
file system.
It may be beneficial to perform a certain amount of paging
from the file system.
9.3 Copy-on-Write
Windows XP, Linux, and Solaris use copy-on-write with fork() --
only pages actually written by the parent or child are copied.
Another variant used by Unix is vfork() -- here the parent sleeps
while the child uses the address space of the parent. The child is
supposed to *not* write to the parent's space. The child is expected
to soon use a form of exec that gives the child a new address space
of its own -- separate and distinct from that of the parent. When
the parent wakes it *will* see any changes the child made to the
parent's address space, so caution is critical. This is considered
dangerous and inelegant but efficient.
9.4 Page Replacement
Physical memory may become over-allocated --
there may be no free
frames when a page-fault occurs!
9.4.1 -- Basic Page Replacement
If there is no free frame to service a page-fault then
one way to handle the problem is to
use a frame that is NOT free,
but which does not appear to be needed
by the process that is currently using it.
The OS writes the contents of the "victim" frame to swap space first,
if it has been modified
(as recorded in the "dirty bit" of the
page table entry) otherwise the OS *does not* write the frame
-- so as to save time. (If for some reason the page is missing
from swap space and needs to be there, then write it anyway.)
Update entries in page tables and data structures (e.g. frame
allocation table) associated with both the "victim" process,
and the process that takes the frame. (It's possible that one
process is both the victim and the taker.)
We need to solve the frame-allocation problem:
how many frames are allocated to each process?
We need to choose a page-replacement algorithm:
which frame should be the "victim"?
We want the replacement algorithm
that gives the lowest page-fault rate.
We evaluate a page replacement algorithm by trying it out on
some reference strings.
Suppose you make a list L of all the logical memory
addresses referenced by a process P during its lifetime, in
the order the references are made. Suppose you make a list
L' from L by just using the page numbers of each address
from L. Suppose you then "consolidate" L' into a third list
L" by "collapsing" all runs of the same page number into
just one "copy" of that page number. The result would be
the reference string of the process P.
The reference string of a process is the sequence of
transitions it makes from referencing one page
to referencing another page. A page fault can happen only
when the process makes such a transition.
The length of a reference string is just the number
of terms in the sequence.
We evaluate a page-replacement algorithm by counting how
many page faults it gets on a reference string.
The fewer page
faults it has on the reference string of a process,
the lower will be the page fault rate it has on the
original list L of address references of the process.
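The construction of a reference string described above (L, then L', then L")
can be sketched in a few lines of Python. This is only an illustration: the
function name and the choice of a fixed page size are assumptions, not
something taken from the text.

```python
from itertools import groupby


def reference_string(addresses, page_size):
    """Build the reference string from a list of logical addresses (list L)."""
    # L': keep only the page number of each address.
    pages = (addr // page_size for addr in addresses)
    # L": collapse each run of identical page numbers into a single entry.
    return [page for page, _run in groupby(pages)]
```

For example, with a page size of 100, the addresses
0, 4, 100, 104, 8, 250, 260, 12 lie on pages 0, 0, 1, 1, 0, 2, 2, 0, so the
reference string is 0, 1, 0, 2, 0.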
9.4.2 -- FIFO Page Replacement
The operating system can link records
representing all the occupied frames into a FIFO queue. When a
page is loaded into a frame the corresponding record goes into
the back of the queue. When a victim for page replacement is
needed the frame at the front of the queue is selected. This
is called FIFO page replacement.
FIFO page replacement is easy to understand and implement but
it does not tend to do very well at keeping the page-fault rate
to a minimum.
FIFO replacement requires updates to data structures only when
a frame is loaded or a victim frame is selected.
FIFO page replacement is subject to Belady's anomaly: the page
fault rate for the same reference string can sometimes go *up*
when you increase the number of frames available!
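Here is a small Python sketch of FIFO replacement that just counts page faults
on a reference string. The function name is hypothetical, and the queue holds
page numbers rather than frame records to keep the sketch short.

```python
from collections import deque


def fifo_faults(refs, num_frames):
    """Count page faults for FIFO replacement on the reference string refs."""
    resident = set()     # pages currently in memory
    queue = deque()      # load order: front = oldest page
    faults = 0
    for page in refs:
        if page in resident:
            continue                      # hit: FIFO updates nothing
        faults += 1
        if len(resident) == num_frames:
            victim = queue.popleft()      # evict the page loaded earliest
            resident.remove(victim)
        resident.add(page)
        queue.append(page)                # newly loaded page goes to the back
    return faults
```

The reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 is a well-known
illustration of Belady's anomaly: fifo_faults gives 9 faults with 3 frames
but 10 faults with 4 frames.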
9.4.3 -- Optimal Page Replacement
There is a page replacement policy that is
known to be "as good as possible."
It gives the lowest possible page-fault rate, guaranteed.
In essence, the algorithm is to
"replace the page that will not
be used for the longest period of time."
We can think of it as the LNU algorithm:
"replace the page with the
latest next use."
This algorithm is also called "OPT" or "MIN."
Sometimes this algorithm is referred to as
the oracle method
because the OS has to know the future of the process
in order to execute the algorithm.
Usually we don't have prior knowledge of what the reference
string of a process will be, and so
it is usually not
possible to implement the OPT algorithm -- too bad.
(Remember that the reference string of a process generally
depends on the outcomes of branch instructions, which in
turn generally depend on what data is input to the process.)
However, we can evaluate other page-replacement algorithms by
comparing their performance with that of OPT on pre-selected
reference strings.
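Although OPT cannot be used in a running system, it is easy to simulate when
the whole reference string is known in advance. The Python sketch below is a
straightforward (not efficient) simulation of the "latest next use" rule. The
function name is an assumption made for this example.

```python
def opt_faults(refs, num_frames):
    """Count page faults for OPT (replace the page with the latest next use)."""
    resident = set()
    faults = 0
    for i, page in enumerate(refs):
        if page in resident:
            continue
        faults += 1
        if len(resident) == num_frames:
            future = refs[i + 1:]

            def next_use(p):
                # Distance to the next reference; never used again = infinity.
                return future.index(p) if p in future else float("inf")

            victim = max(resident, key=next_use)
            resident.remove(victim)
        resident.add(page)
    return faults
```

On the reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 this gives 7 faults
with 3 frames and 6 faults with 4 frames, which can serve as a lower bound
when judging other algorithms on the same string.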
9.4.4 -- LRU Page Replacement
The idea of LRU is to
replace the page that is "least recently
used" -- i.e. the page that has not been accessed for the
longest period of time.
Because LRU is similar to OPT -- it just looks "back" instead
of "ahead" -- it seems plausible that LRU may tend to have low
page fault rates. (LRU might be called EPU for "Earliest Previous
Use." This underscores the fact that
LRU is the "mirror image"
of the OPT algorithm: "Latest Next Use.")
LRU works well generally, but it can fail miserably in some cases.
LRU replacement requires very frequent updates to data
structures -- each time there is a memory reference.
In theory we could implement LRU by arranging for
hardware to perform a kind of time-stamping.
Hardware would increment a counter after every
memory reference, and write the value of the counter into the
page table entry (PTE) every time a page is accessed. The OS
would search page tables for the lowest value of the counter
to find the LRU page for replacement.
Another method -- one that seems a little more practical --
would be for the hardware to perform updates to
an array of records indexed by page number or frame number.
The records would contain pointer fields which would be used to
implement a stack as a doubly-linked list.
Each time a
reference is made to a page the hardware would
have to index into the array and place the referenced frame at
the top of the stack by manipulating pointers (no more than
six). The LRU frame for replacement would just be the frame on
the bottom of the stack.
OPT and LRU are stack algorithms -- the set of pages
in memory for an allocation of N frames is a subset of the
set of pages in memory for an allocation of N+1 frames.
To see that LRU is a stack algorithm, note that when doing
LRU with N frames, the pages in memory are the N most
recently referenced pages.
Stack algorithms do not suffer from Belady's anomaly.
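The stack idea can be simulated in software. The Python sketch below uses an
OrderedDict as the "stack" (most recently used page at the end) and also
records the set of resident pages after every reference, so the inclusion
property of stack algorithms can be checked. The function name is hypothetical,
and a real system would need hardware help to do this on every reference.

```python
from collections import OrderedDict


def lru_run(refs, num_frames):
    """Return (fault count, resident set after each reference) under LRU."""
    stack = OrderedDict()    # last key = top of stack = most recently used
    faults = 0
    history = []
    for page in refs:
        if page in stack:
            stack.move_to_end(page)          # hit: move the page to the top
        else:
            faults += 1
            if len(stack) == num_frames:
                stack.popitem(last=False)    # evict the bottom (the LRU page)
            stack[page] = True
        history.append(frozenset(stack))
    return faults, history
```

On the reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5, LRU gets 10 faults
with 3 frames and 8 faults with 4 frames, and at every step the 3-frame
resident set is a subset of the 4-frame resident set.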
9.4.5 -- LRU Approximation Page Replacement
Most computers do not have hardware support for true LRU
page replacement.
Most systems do have a reference bit associated with
each page table entry.
When a reference is made to an address in a page the
hardware sets the corresponding reference bit
to indicate that the page has been accessed.
The operating system is able to clear reference bits.
We can implement page replacement algorithms that are
approximations of pure LRU by manipulating reference
bits.
9.4.5.1 -- Additional-Reference-Bits Algorithm
In this scheme the OS keeps a table containing
one byte of memory associated with each page.
Periodically an interrupt gives the OS control and
the OS rolls the value of each reference bit into
the most significant bit (msb) of its corresponding byte.
After rolling a reference bit, the OS clears it to
prepare for the next cycle. Examples:
If the byte is 0000 0000 then the page has not been
referenced for eight periods in a row.
If the byte is 1111 1111 then the page has been
referenced in all of the last eight periods.
If the byte is 0100 1000 then the page was referenced
two periods ago, and also five periods ago.
If the byte is 1000 0100 then the page was referenced
in the latest period and also six periods ago.
Note that if we just view these bytes as unsigned integers
then the smaller values correspond to the pages that are
less recently used.
When the OS needs to choose a victim for
page-replacement it picks a page with minimal byte-value.
This algorithm approximates LRU.
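The byte-shifting scheme described above can be sketched in Python as follows.
The dictionaries history and ref_bits are stand-ins (assumptions for this
example) for the OS table of bytes and the hardware reference bits.

```python
def age(history, ref_bits):
    """One timer interrupt: roll each reference bit into the msb of its byte."""
    for page in history:
        # Shift the old history right by one and put the reference bit on top.
        history[page] = (history[page] >> 1) | (ref_bits[page] << 7)
        ref_bits[page] = 0            # clear the bit for the next period


def pick_victim(history):
    """Choose a page with minimal byte value (approximately the LRU page)."""
    return min(history, key=history.get)
```

For instance, suppose pages A and C are referenced in the first period, B in
the second, and C in the third. After three calls to age the bytes are
A = 0010 0000, B = 0100 0000, C = 1010 0000, and pick_victim chooses A.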
9.4.5.2 -- Second-Chance Algorithm
In the typical implementation of the 2nd-chance algorithm
there is a circular linked list of records. Each record
represents one physical frame.
A pointer points to the current element of the list.
When the OS needs a victim
frame it executes the following steps:
1. Examine the current element.
2. If the reference bit of the frame is 0 (the frame
has not been referenced lately) then make the frame
the victim. Copy the new page into the frame and advance the
pointer to the next frame.
3. Otherwise (the reference bit of the frame is 1 because it
was accessed recently) do not make
the frame a victim
(give it a second chance). Clear the reference bit.
Advance the pointer
to the next frame. Go to step 1.
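A minimal Python sketch of the second-chance (clock) procedure follows. The
class name and the use of two parallel lists instead of a circular linked list
are choices made for this example. It also assumes that the reference bit of a
newly loaded page is set, since the faulting instruction is restarted and
touches the page right away.

```python
class Clock:
    """Second-chance page replacement over a fixed number of frames."""

    def __init__(self, num_frames):
        self.pages = [None] * num_frames   # page held by each frame
        self.ref = [0] * num_frames        # reference bit of each frame
        self.hand = 0                      # the pointer into the circular list

    def access(self, page):
        """Reference a page; return True if the reference caused a page fault."""
        if page in self.pages:
            self.ref[self.pages.index(page)] = 1   # "hardware" sets the bit
            return False
        # Page fault: skip frames whose bit is 1, clearing the bit as we pass.
        while self.ref[self.hand] == 1:
            self.ref[self.hand] = 0                # second chance
            self.hand = (self.hand + 1) % len(self.pages)
        self.pages[self.hand] = page               # victim found: load the page
        self.ref[self.hand] = 1                    # assumption: set on load
        self.hand = (self.hand + 1) % len(self.pages)
        return True
```

With 3 frames and the references 1, 2, 3, 4, 2, 5 there are 5 faults. Page 2 is
re-referenced before the pointer reaches it again, so it survives the last
fault and the frames end up holding pages 4, 2, 5.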
9.4.5.3 -- Enhanced Second-Chance Algorithm
The enhanced version is like straight 2nd chance except
we keep going around until we find an un-referenced clean frame
(or until we come back around to
where we started.)
If we do come back around without getting a victim then go
around again until finding any unreferenced frame.
If there is other process activity going on concurrently
with this, the OS may have to go around again and choose a
referenced clean frame, or even go around a fourth time
and choose a referenced dirty frame.
It is said that the method above was used in a version of
the Macintosh OS. I don't know if this is still in use
with OS X.
9.4.6 -- Counting-Based Page Replacement
If we can arrange for a counter (or an approximation) to track
how many references have been made to each frame then we can
implement least-frequently-used (LFU) or most-frequently-used (MFU)
page-replacement algorithms. These algorithms don't generally
tend to have very good performance, and the overhead of the
implementation is high.
9.4.7 -- Page-Buffering Algorithms
As an optimization the system may keep a pool of clean free
frames -- when a dirty victim is selected the new page is
immediately copied into a clean free frame from the pool and
can be used right away. The dirty victim is then written back
to swap space. After that, the now clean victim becomes part of
the pool. (Note that it will still contain its original content,
unless something special, like zero-filling, is done.)
Another optimization is for the OS to keep the pool of free
frames, and also to remember the page for which each frame was
previously used. If a process faults on a page whose old frame
happens to still be in the pool, then the OS just gives back
the frame to the process. That way the OS avoids the work of reloading
the page from disk (this is sometimes called
"reclaim from free.")
The OS can also keep track of which (allocated) pages are dirty
and "in its spare time" write dirty pages to swap space. This
way, when a page has to be replaced it is more likely to be
clean, and hence more likely to be replaced quickly. If
the OS is maintaining a pool of free frames, this idea of
writing back allocated frames is still worthwhile because it
tends to help keep excessive requests from draining the pool.
9.4.8 -- Applications and Page Replacement
This section points
out that some applications, because they have special
characteristics and needs, should manage their own
primary and secondary memory to the greatest extent possible
and NOT rely on operating system algorithms and structures.
By doing so, those applications will function much
more efficiently. Databases and data warehouses
are examples of these kinds of applications.
9.5 Allocation of Frames
How many frames should be made available to a process? Should it be
allowed to take frames away from other processes?
Should frames for user pages and frames for file buffers and heap
storage all go into the same pool or should there be separate free
lists for the different usages?
Should the system allow free lists or pools of free memory to drain
completely, or should some minimum size be maintained at all costs?
9.5.1 -- Minimum Number of Frames
A process cannot execute an instruction unless the instruction
and all the data that the instruction accesses are entirely
resident in physical memory.
If an instruction straddles two pages and the data it acts on
straddles two or more pages then it may be necessary to have
four or more pages resident in memory in order for that
instruction to execute.
For every computer architecture there is some worst case
scenario. There is some largest number N such that a process
may need N resident pages in order to execute an instruction.
That being the case, the operating system must be set up to
allow any process to have at least N frames. In the worst
case, the process would not be able to execute if it could not
get N frames allocated to it simultaneously.
9.5.2 -- Allocation Algorithms
It's possible to give nearly equal numbers of free frames to
all processes -- equal allocation.
Probably it makes more sense to allocate free frames to
processes in proportion to their "need."
Need might be
measured in various ways. Larger processes or higher priority
processes may be judged to be more needy.
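One simple way to measure "need" is by process size. The Python sketch below
divides the available frames in proportion to size, rounding down. The
function name, the minimum of one frame, and the example sizes are all
assumptions made for illustration. Frames left over after rounding are simply
not handed out here, and the minimum could push the total over the limit when
frames are very scarce.

```python
def proportional_allocation(sizes, total_frames, minimum=1):
    """Give each process a share of total_frames proportional to its size.

    sizes maps a process name to its size in pages.
    """
    total_size = sum(sizes.values())
    return {
        name: max(minimum, size * total_frames // total_size)
        for name, size in sizes.items()
    }
```

For example, with 62 free frames and two processes of 10 and 127 pages, the
shares are 4 and 57 frames, with one frame left unallocated.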
9.5.3 -- Global Versus Local Allocation
When it replaces a page, the OS selects a frame used by process
X, and loads it with a page needed by process Y.
Under a global page-replacement policy, X and Y
don't have to be the same process.
Under a local page-replacement policy X and Y do
have to be the same process.
Global replacement policies are more common;
they allow the number of frames allocated to a process to
change according to changing need.
With global replacement, ideally there is a "Robin Hood" effect:
the operating system steals frames from "rich"
processes (that don't need them) and gives them to "poor"
processes (that do need them).
If everything works out perfectly then each process has the
frames it needs to maintain a low page-fault rate. This
should result in high average throughput.
On the other hand, there are other ways to dynamically alter
the number of frames allocated to each process. If we use
such an allocation strategy in conjunction with a local
page-replacement policy, might we allow processes finer
control over their own page-fault rates?
9.5.4 -- Non-Uniform Memory Access
Many multiprocessor systems have multiple system boards,
each with multiple CPUs and some primary memory. This
usually means that a CPU can access the memory on its own
system board more quickly than the memory on other boards.
This is non-uniform memory access (NUMA).
On NUMA systems, it is desirable for the OS to allocate primary
memory for a process on one particular system board, and to
ensure that the process executes on a CPU on that same board.
The result is likely to be high cache hit rates and short memory
access times.
If the process is multithreaded, this manner of scheduling is more
challenging. Solaris is one example of an OS that has an approach
to solving the problem. Solaris tries to schedule all threads of
a process and allocate all its memory within one lgroup,
which is a group of CPUs and memory areas that are mutually close,
meaning that the latency between each pair of elements is low.
9.6 Thrashing
When the number of frames allocated to a process is low enough, it
will get page-faults so often that the paging activity of the
process will take up much more time than the execution of
the process itself.
A process that spends more time paging than executing is said
to be thrashing.
9.6.1 -- Cause of Thrashing
Suppose the degree of multi-programming is very low -- say
there are only one or two active processes. In that case it
will be likely that both processes will be waiting for I/O at
the same time quite often. At those times the CPU will be
idle. CPU utilization is low in such a system. Assuming
there is adequate memory available, it will probably help
utilization if we increase the degree of multiprogramming.
On the other hand if the degree of multi-programming is very
high, and if physical memory is over-allocated then it is
likely that there will be a lot of thrashing going on. In
that case too the CPU utilization may be quite low because
often all processes will be blocked waiting for paging to
complete. In this case it will only make things worse to
increase the degree of multiprogramming.
According to the principle of locality the
typical process usually spends relatively long periods
of time accessing a relatively small locality.
A locality is a subset of its text and data that
the process has been accessing recently.
Relative to memory access time, the locality of
a process tends to change very slowly.
According to this view
the process will not thrash as long as it has enough frames
to hold all or most of its current locality.
As an example of a process remaining in one locality for a long
time, consider how a process acts while it executes a loop. The
only instructions the process accesses are the instructions of
the loop body and the instructions of the loop control. Quite
possibly the process will only access a small portion of its data
while executing the loop. It is a very common thing for a process
to spend a long time executing in a loop. So that's an example of
a process staying in a small locality for a long time.
9.6.2 -- Working Set Model
We can pick a number Δ (delta) and arrange for the system
to count or approximate the number of (distinct) pages
referenced during the last Δ memory references. For
example, we may set Δ = 10,000 and estimate the number of
pages accessed in the last 10,000 memory references. (This will
require some combination of actions performed by hardware and
the OS -- e.g. keeping a history of values of reference bits.)
The set of pages referenced during the last Δ references is
called the working set. It is an approximation of the locality
of the process. The goal of the OS would be to use this
approximation to help keep each process supplied with enough
frames to hold its working set.
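The definition of the working set can be made concrete with a short Python
sketch. It computes the exact working set from a recorded list of page
references, which a real OS cannot do cheaply. An actual system would only
approximate this with reference bits. The function name and the indexing
convention are assumptions for this example.

```python
def working_set(refs, t, delta):
    """Pages referenced in the last delta references, ending at index t.

    refs is a list of page numbers, one per memory reference.
    """
    start = max(0, t - delta + 1)
    return set(refs[start:t + 1])
```

For the page references 1, 2, 1, 3, 4, 4, 4, 3, 4, 4 and Δ = 4, the working
set at index 3 is {1, 2, 3} (three frames needed), while at index 9 it has
shrunk to {3, 4}.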
If the OS uses this approach, and if there are sufficient
jobs available, it seems reasonable that the OS will be able
to keep the degree of multiprogramming high enough to attain
good CPU utilization and low enough to prevent thrashing.
When all processes have enough frames for the current size of
their working sets, and when there is enough additional
memory, the OS would bring another job into memory -- say
from swap space.
When memory is in short supply and processes can't get enough
for their working sets, the OS would swap a process out and
divide its allocation among the remaining processes that need
more frames.
9.6.3 -- Page-Fault Frequency (PFF)
To control thrashing,
it is more direct and simple to work with PFF than with a working-set
model.
Establish an acceptable PFF: call it f.
If the PFF of a process exceeds f, give it more frames
until the PFF is less than f.
If the PFF of a process falls well below f then take
frames away from the process until the PFF is close to f.
When there is a need to give frames to processes and there are
not enough frames, swap a process out and free its frames.
When all processes have acceptable PFF's and there is memory to
spare, increase the level of multiprogramming -- swap a process
in.
One possible drawback to the PFF approach is that it is not
sensitive to the difference between a change in size and a
transition of the working set. If the working set is not
changing in size, but merely transitioning from one locality to
another, it may be better for performance not to give more
frames to the process, but just to let page faults replace the
pages that are being "vacated" in favor of pages new to the
working set.
Also, it would be good if the system keeps track of the
working set of each process, to help with
pre-paging.
9.6.4 -- Concluding Remarks
Virtual memory has a lot of advantages. It's helpful and useful.
However, its ability to compensate for a lack of physical memory
is quite limited. In order to get good performance, it is
very important to provision systems with generous
amounts of physical memory.
9.7 Memory-Mapped Files
VM techniques can be used to reduce the need for time-consuming
system calls and disk accesses for file I/O.
9.7.1 -- Basic Mechanism
The initial access to a file is handled like a page fault.
After a file block is mapped into VM, subsequent accesses are
routine memory accesses.
Writes through to disk can be done when the file is closed, or
as part of a periodic interrupt routine.
When memory-mapping of files is available, it can be used to
implement the sharing of a file by a group of processes, shared
memory, and copy-on-write functionality.
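As an illustration of the basic mechanism, here is a small sketch using the
mmap module from the Python standard library. Once the file is mapped, it is
read and modified with ordinary slice operations instead of read and write
calls. The helper names and the temporary demo file are made up for this
example.

```python
import mmap
import os
import tempfile


def uppercase_prefix_in_place(path, n):
    """Upper-case the first n bytes of a file through a memory mapping."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0) as mm:
            mm[:n] = mm[:n].upper()   # plain memory accesses on the mapping
            mm.flush()                # request write-back of the changes


def demo():
    """Create a small temporary file, modify it via the mapping, return it."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello world")
        os.close(fd)
        uppercase_prefix_in_place(path, 5)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)
```

Calling demo() returns the file contents after the in-memory update, with the
first five bytes upper-cased.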
9.7.2 -- Shared Memory in the Win32 API
9.7.3 -- Memory-Mapped I/O
9.8 Allocating Kernel Memory
An OS kernel typically has special memory allocation needs.
The OS may require kernel code and data to be entirely
resident in physical memory at all times, even though
virtual memory is implemented for user processes.
The OS may need large numbers of data structures that have sizes
that are not multiples of a page size.
Some hardware devices may be configured so that they have to
interact with a set of contiguous frames of physical memory.
To facilitate conservation of memory, usually there are special
memory-allocation techniques and methods
available to the kernel.
9.8.1 -- Buddy System Allocation Method
This model facilitates allocation of memory in contiguous
chunks of size equal to any power of two
- up to some limit.
Using the buddy system it's easy to coalesce some adjacent
deallocated chunks of equal size into single free chunks.
The coalescence can 'cascade' easily - leading to the
reformation of larger and larger chunks, which then become
available for reallocation.
Problem: internal fragmentation can be up to 50% of the total
memory allocated.
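The power-of-two rounding, the fragmentation it causes, and the ease of
finding a chunk's buddy can all be sketched in Python. The function names are
hypothetical. The XOR trick assumes chunk addresses are offsets from the start
of the managed region and that each chunk is aligned to its own size.

```python
def buddy_block_size(request, min_block=1):
    """Smallest power-of-two multiple of min_block that can hold the request."""
    size = min_block
    while size < request:
        size *= 2
    return size


def internal_fragmentation(request):
    """Fraction of the allocated chunk that the request leaves unused."""
    block = buddy_block_size(request)
    return (block - request) / block


def buddy_address(offset, size):
    """Offset of the buddy of the chunk at the given offset and size."""
    return offset ^ size      # flip the single bit that distinguishes buddies
```

A request for 33 units gets a 64-unit chunk, so 31/64 (just under half) of the
chunk is wasted. The buddy of the 64-unit chunk at offset 192 is the chunk at
offset 128. When both are free they can be coalesced into one 128-unit chunk.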
9.8.2 -- Slab Allocation
The idea of the slab allocator is to use groups of pages
as an array of one specific kind of data structure: say
process descriptors, file objects, semaphores, and so on.
Each slab can be managed as with "fixed-sized partitioning."
This is very simple and there is 'no fragmentation' -- in that
the allocated objects are always the exact size required.
The claim of 'no fragmentation' should be taken with a grain
of salt -- in some sense the unallocated portions of the slabs
are 'waste'. Slab size is a multiple of the frame size.
9.9 Other Considerations
9.9.1 -- Pre-paging
Under pure demand paging, we expect a large number of page
faults initially as a process begins execution, or as a process
which has been swapped out starts being swapped back in.
It may help overall performance if the OS performs pre-paging.
The idea of pre-paging is to load extra pages when servicing a
page fault. Instead of simply loading the page on which the
process faulted, load additional pages that the process is
likely to fault on soon.
When swapping in a process or servicing a page fault, if the OS
has some representation of the current working set of the
process, it can use that as a "hint" as to which additional
pages it should pre-fetch.
One simple tactic is to do clustering -- each page is a member
of a cluster of say 4-8 pages, and whenever we page in a member
of a cluster we always page in any other members of the cluster
that are not currently resident.
9.9.2 -- Page Size
Normally the architecture of the computer hardware determines
the page size (or some range of allowable page sizes.)
The trend in computer design has been to make page sizes larger
and larger over the past three decades or so.
Current page sizes
range upwards of 4K bytes per page, even on mobile
devices.
There is no agreement on what is the "best" page size.
Arguments in support of BIG pages:
When pages are bigger, page tables can be smaller.
When pages are bigger it takes less time per byte to load
pages into primary memory.
When pages are bigger we have fewer page faults.
When pages are bigger we have more TLB cache hits.
Arguments in support of SMALL pages:
The internal fragmentation caused when a process does not
end on a page-boundary will be less if page sizes are
smaller.
With a smaller page size we have better resolution.
Less unneeded material is paged into memory. Consequently
there is less total I/O and less waste in the allocation
of physical memory.
9.9.3 -- TLB Reach
TLB hardware is expensive and it uses a lot of power.
TLB reach is the number of addresses in memory that can be
accessed through the TLB.
TLB reach == (number of TLB entries) * (page size).
Greater TLB reach tends to increase the cache hit ratio, which
improves effective memory access time, turnaround time, and
throughput.
To improve reach, one can make the TLB larger, but
increasing the page size will also improve reach.
Given a fixed average process size, internal fragmentation
increases with the page size - because the percentage of memory
wasted by the average process is greater.
Many modern systems are able to use more than one page size --
for example Solaris uses 8KB pages for most processes, and 4MB
pages for very large processes. A field in the TLB entry
indicates the size of the page.
9.9.4 -- Inverted Page Table
In a system with virtual memory and/or swapping, the OS must
maintain information telling it where all the non-resident
pages are located on disk.
If the system uses an inverted page table for ordinary
logical-to-physical address translation, typically there are
per-process external page tables where that location
information is kept.
Under these conditions, the per-process external page table is
needed only when the process gets a page-fault.
The external tables may be left paged-out to swap space most of
the time.
Thus, through the use of an inverted page table in a
system with virtual memory, it is possible to reduce the total
amount of physical memory allocated to page tables.
Bear in mind, however, that when a page-fault occurs the OS may
need to page in part of the external page table in addition to
the page on which the process faulted.
9.9.5 -- Program Structure
The text cites an example where an array is laid out in
row-major form, each row on a separate page. A program that
initializes each array entry may incur many more page faults
if it accesses the array a column at a time instead of a row
at a time.
This illustrates that
details of program layout and
address-referencing patterns can affect process performance in
a system with virtual memory.
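The effect of traversal order can be seen in a small simulation. The Python
sketch below assumes each array row occupies exactly one page, that the
process has fewer frames than the array has rows, and that LRU replacement is
used. These assumptions and the function names are for illustration only.

```python
from collections import OrderedDict


def count_faults(page_refs, num_frames):
    """Count page faults under LRU for a sequence of page references."""
    resident = OrderedDict()              # last key = most recently used
    faults = 0
    for page in page_refs:
        if page in resident:
            resident.move_to_end(page)
            continue
        faults += 1
        if len(resident) == num_frames:
            resident.popitem(last=False)  # evict the least recently used page
        resident[page] = True
    return faults


def row_order_refs(rows, cols):
    """Pages touched when visiting the array a row at a time (row r = page r)."""
    return [r for r in range(rows) for _ in range(cols)]


def column_order_refs(rows, cols):
    """Pages touched when visiting the array a column at a time."""
    return [r for _ in range(cols) for r in range(rows)]
```

For an 8 x 8 array and 4 frames, the row-at-a-time traversal faults once per
row (8 faults), while the column-at-a-time traversal faults on every access
(64 faults).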
9.9.6 -- I/O Interlock
Hardware may provide a lock bit for each frame. When the
lock bit is set it means that this frame should not be
selected for replacement.
It is useful to lock kernel pages; user process pages with
pending I/O; and pages newly loaded, but as yet unused.
Commonly, all or a part of the OS kernel is locked into memory.
As an example, consider what might happen if a page X of
the kernel memory-management module was not resident
in memory. What if a kernel thread faults on page X,
and X contains some of the code for handling the page fault?
The OS may lock a newly-replaced frame F to make sure that the
process P that owns F is allowed to use F at least once.
If F is not locked, the OS might select F as a victim
for page replacement again before P has a chance to use F.
9.10 Operating-System Examples
9.10.1 -- Windows
Demand paging with clustering - whole clusters are fetched
together following a page fault.
A process is assigned a working set minimum and maximum.
If a process faults when at its maximum the OS performs a
local page replacement.
When free memory is scarce, the OS will take frames away from
processes that have more than their working set minimum.
9.10.2 -- Solaris
A faulting thread receives a new frame.
If free memory falls below lotsfree (about 1/64th of
physical memory) then the pageout process (page daemon)
starts running a two-handed clock algorithm.
The front hand 'forgives' - clears the reference bit. The back
hand 'reaps' - frees the frames that have not been referenced,
and writes them if dirty.
Initially frames "freed" by the pageout process are placed in
a cache, from which they can be "reclaimed from free" if the
process from which they were taken faults on them.
After freed frames are moved from the cache to the free
list, "reclaim from free" is no longer possible.
The scanrate (# pages scanned per second)
and handspread (# pages between hands) can vary.
If free memory drops below a certain level, swapping is
initiated.
Pages belonging to libraries being shared by several processes
are skipped during the pageout process.