(Latest Revision: Thu Nov 3, 2016)
Chapter Eight -- Main Memory -- Lecture Notes
- 8.0 Objectives
- Describe ways of organizing memory hardware
- Discuss techniques for allocating memory to processes
- Discuss how paging works in contemporary computing systems
- 8.1 Background
- We assume the primary memory is an array of individually-addressable bytes.
- The instruction-execution cycle generates a stream of memory addresses.
The MMU has no ability to detect the purpose of an address.
- 8.1.1 -- Basic Hardware
- The only general-purpose storage that a CPU can access
  directly is primary (main) memory and its registers. In
  particular, a CPU cannot directly access disk or other peripheral
  storage.
- CPU access to primary memory is many times slower than access to
registers, but caches help speed things up.
- Hardware must provide a mechanism to protect the memory
allocated to each process.
The performance penalty would be extreme if designers assigned
this task to the operating system - software checks on every
memory access simply would not be practical.
- A simple example of a way to protect memory - with base and limit
registers:
- The OS allocates one contiguous span of primary memory
to a process P.
- The base register contains the lowest address allocated
to P.
- The limit register contains the number of bytes in
the allocation.
- Using the values in the base and limit registers,
hardware checks every address generated
in user mode.
- Any attempt in user mode to access memory out of bounds
  results in a trap. (A code sketch of this check appears at
  the end of this subsection.)
- The instructions that change the base and limit registers are privileged.
- The kernel gets unrestricted access to all memory -
a necessity for performing system tasks such as loading
jobs and fetching parameters of system calls.
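- A minimal sketch in C of the bounds check described above, using
  hypothetical base/limit variables (in reality the check is done by
  comparators in the MMU hardware, not by software):

      #include <stdint.h>
      #include <stdbool.h>

      uint32_t base;   /* lowest address allocated to process P */
      uint32_t limit;  /* number of bytes in P's allocation     */

      /* Returns true if a user-mode access to addr may proceed;
         otherwise the hardware would trap to the operating system. */
      bool address_ok(uint32_t addr)
      {
          return addr >= base && addr < base + limit;
      }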
- 8.1.2 -- Address Binding
- Addresses in the source program are generally symbolic --
e.g. count
- Typically the compiler binds symbolic addresses to relocatable,
relative addresses, given as offsets from the base address of
the program or the containing module.
- The relative addresses may be converted to absolute
addresses by the linkage editor or loader.
- To allow the relocation at run time of programs
from one area of memory to another, contemporary computing
systems utilize special techniques that require
hardware and operating system support.
- 8.1.3 -- Logical- versus Physical-Address Space
- The addresses seen by the CPU are logical addresses, aka
virtual addresses.
- The addresses seen by the memory-address register (MAR) are
  physical addresses, aka hardware addresses.
- These two address spaces can be identical, but under
  execution-time binding (the dominant paradigm), they are
  separate. While a process is executing, the memory-management
  unit (MMU) hardware performs the required mapping from logical
  addresses to physical addresses during the
  fetch/decode/execute cycle.
- The user program deals with the logical addresses exclusively.
The MMU hardware translates a logical address only
when a memory access is performed.
- In a simple example situation, the MMU hardware might
translate logical addresses in the range 0 ... max
to the range R ... R+max, where R is the value
stored in a relocation register, which is similar to
a base register.
- 8.1.4 -- Dynamic Loading
- Under dynamic loading, a routine is not loaded until it is
called.
- Each routine has a disk image represented in a relocatable
load format
- Routines that are never called are never loaded. This may
result in considerable savings in memory usage.
- Dynamic loading can be implemented just with user processes.
There is no need for any special assistance from hardware or
OS. However the OS may provide library routines that implement
dynamic loading.
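- As a concrete illustration of such OS-provided library support (an
  assumption beyond the notes above - this is the POSIX interface,
  not necessarily the one any particular system uses), dlopen/dlsym
  let a program load and call a routine only when it is first needed:

      #include <dlfcn.h>   /* POSIX dynamic-loading routines */
      #include <stdio.h>

      int main(void)
      {
          /* Load the math library now, at run time ("libm.so.6" is
             the Linux name; link with -ldl on older glibc). */
          void *handle = dlopen("libm.so.6", RTLD_LAZY);
          if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

          /* Look the routine up by name, then call it. */
          double (*cosine)(double) = (double (*)(double)) dlsym(handle, "cos");
          if (cosine)
              printf("cos(0.0) = %f\n", cosine(0.0));

          dlclose(handle);
          return 0;
      }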
- 8.1.5 -- Dynamic Linking and Shared Libraries
- The in-memory program text originally contains a stub for each
reference that the program has to a library routine. The stub is a
piece of code that tells where in memory or on disk to locate
the library routine.
- When the program first executes the stub, the stub
  replaces itself with the address of the routine
  and executes it. (If need be, it first loads the routine.)
- All processes share the same copy of each library routine.
- Because memory locations are managed by the OS, user processes
  need help from the OS to determine the memory locations of
  library routines. Sharing library routines also requires
  support from the hardware and the OS in the form of shared memory.
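- A rough sketch of the stub idea in C, using a function pointer to
  stand in for the address that gets patched (real dynamic linkers
  patch entries in a jump table; all names here are hypothetical):

      static void real_lib_routine(void) { /* the actual library code */ }

      static void stub(void);
      static void (*lib_routine)(void) = stub;  /* initial binding: the stub */

      /* The stub runs only on the first call through lib_routine. */
      static void stub(void)
      {
          /* ...load the routine into memory here, if need be... */
          lib_routine = real_lib_routine;  /* replace itself with the address */
          lib_routine();                   /* and execute the routine         */
      }

  After the first call, lib_routine() goes straight to the real
  routine with no further overhead.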
- 8.2 Swapping
- 8.2.1 -- "Standard Swapping"
- Some older multiprogramming systems performed swapping
as part of a context switch. The operating system would
swap out the current process, swap in another process, and then
put that process in the CPU to execute. This was a useful approach
on some systems at a time when memories were too small to
contain very many processes.
- (When you read about "standard swapping" don't let it get you
confused about the difference between swapping and context
switching. They are different, and they are independent of
one another - operating systems can, and do, perform
context switches without performing swapping, and an OS
can swap processes in or out without
involving them in a context switch.)
- It isn't practical to do "standard swapping" in a modern interactive
system. Unless the time slice is much larger than the total swap
time - on the order of several seconds - it won't be possible
to swap processes in and out fast enough to give them all
their turns in the CPU. A quantum of several seconds would
make the system very sluggish in its response to
interactive users.
- It is common for an OS to swap out one or more processes when the
system has begun to run out of physical memory.
- Windows 3.1 used a form of swapping. When a user clicked on a
window, the associated process would be swapped in,
if it was not already in memory.
- 8.2.2 -- Swapping on Mobile Systems
- Mobile systems don't typically perform swapping, because
- Swapping requires large amounts of secondary storage, and
mobile systems lack that.
- Throughput between main memory and secondary (flash) memory
is slow.
- Swapping would tend to quickly use up the limited number
of write operations that secondary flash memory can support.
- Instead of swapping, iOS asks applications to relinquish unneeded
memory, and iOS may terminate processes that don't free up enough
memory.
- Android acts in a manner similar to iOS, except before killing a
process for overuse of memory, it will save its state on secondary
memory so the application can be restarted quickly.
- Because of the conditions described above, programmers
for mobile environments have to incorporate conservation-minded
memory allocation/deallocation procedures in the applications
they write.
- 8.3 Contiguous Memory Allocation
- Contiguous memory allocation was a common memory allocation
scheme used during an earlier time in the evolution of operating
systems. It's a good idea for today's students to learn about contiguous
memory allocation - to get an introduction to the design issues that are
important, and to help the student appreciate the advantages of more
recently developed methods of memory allocation, like paging and
segmentation.
- In a contiguous memory allocation set-up, each process resides in
some contiguous address range in memory (e.g. in the L addresses
from base address B to address B+L-1). The interrupt vector and
OS would typically reside in low memory, and user processes in
higher-memory locations.
- 8.3.1 -- Memory Protection
- A scheme similar to the base-limit registers idea discussed
  in section 8.1.1 will suffice to keep track of and enforce
memory allocations.
- In this scheme, there are both logical and physical address
spaces. A user process works with, say, L legal addresses: the
contiguous range from 0 to L-1. The MMU hardware
checks every logical address generated by the user process,
to make sure it is within the legal range. The MMU maps
each legal (aka valid) logical address to a corresponding
physical address by adding the value of the
relocation (aka base) register.
- By changing the values of the relocation and limit registers,
the OS can keep track of processes as it relocates and/or
resizes them. The OS can change its own size too.
- 8.3.2 -- Memory Allocation
- Fixed-size partitioning is a very simple
memory allocation methodology. The OS partitions user
memory into M subsets (partitions) of equal size. Each
partition is a contiguous range of memory. If a process
needs to run, and a partition is available, the OS
allocates one partition to the process. If the
process is larger than the partition size, it will be
impossible to run the process. When a process exits,
it releases its partition. The OS puts the partition
on a list of free partitions, to be allocated to another
process later.
- Variable-sized partitioning
is more flexible than fixed-size partitioning.
- The OS maintains a free-list of available
"holes" in memory.
- When a process needs to be loaded into memory, the OS
finds, if possible, a hole in the free-list that
is big enough, removes it from the free-list,
and places the process into an initial contiguous section
of the hole. Any unused remainder of the hole
is a new hole that the OS puts on the free-list.
- When a process terminates, it releases its memory
allocation. The OS checks to see if the freed memory can
be merged with adjacent free holes to form a larger free
hole. The resulting hole is inserted into the list.
(Note: holes in the list that are merged with the new hole
have to be deleted from the list.)
- The job of allocating the memory under these conditions is
known as the dynamic storage allocation problem:
"... how to satisfy a request of size N from a list of
free holes"
- The strategy of searching for a hole may affect
  performance. First fit, best fit, and worst fit are
  possible strategies. (A first-fit sketch appears at the end
  of this subsection.)
- Definitions:
- First fit: Choose the first hole
found that is big enough.
- Best fit: Choose the smallest hole
that is big enough - the one that leaves the smallest
left-over hole.
- Worst fit: Choose the biggest hole
- the one that leaves the biggest left-over hole.
- If we order the list of holes by size, we can decrease
the time required to find a suitable hole for a
process, but keeping the list in order requires extra
time.
- Simulations show:
- First fit is generally faster than best fit.
- Both first fit and best fit are better than worst
fit in terms of storage utilization and speed.
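- A minimal first-fit sketch in C over a singly linked free-list
  (all names are hypothetical, and coalescing on release is omitted):

      #include <stddef.h>

      struct hole {
          size_t       start;   /* first byte of the hole    */
          size_t       length;  /* size of the hole in bytes */
          struct hole *next;
      };

      struct hole *free_list;   /* maintained elsewhere */

      /* First fit: take the first hole that is big enough.  The request
         is carved from the front of the hole; any remainder stays on the
         free-list as a smaller hole.  Returns the start address of the
         allocation, or (size_t)-1 if no hole is big enough. */
      size_t first_fit(size_t n)
      {
          for (struct hole **pp = &free_list; *pp != NULL; pp = &(*pp)->next) {
              struct hole *h = *pp;
              if (h->length >= n) {
                  size_t addr = h->start;
                  h->start  += n;
                  h->length -= n;
                  if (h->length == 0)   /* hole used up: unlink it        */
                      *pp = h->next;    /* (node storage leaks; a sketch) */
                  return addr;
              }
          }
          return (size_t)-1;
      }

  Best fit would instead scan the whole list remembering the smallest
  adequate hole; worst fit would remember the largest.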
- 8.3.3 -- Fragmentation
- Fragmentation can be external or internal.
- External fragmentation is memory that is available but
  unusable - typically a collection of holes, each of which is
  too small to use for anything by itself, but which together
  would be enough to accommodate a process if it were possible
  to combine them into one hole.
- Severe external fragmentation typically occurs when
  contiguous memory is allocated using first-fit, best-fit, or
  worst-fit algorithms. Statistical analysis of first fit, for
  example, shows that about 1/3 of the memory may be wasted
  (unusable) after a large number of allocations and
  deallocations have happened - the so-called 50-percent rule.
- Internal fragmentation is memory that
is allocated but not
used. (The allocation method may require that processes
sometimes get more memory than they need. For example,
there may be a
minimum allocation, or allocations may be made in chunks of a
specific size.)
- If processes are dynamically relocatable then the OS can move
them around to compact external fragmentation into
usable holes. PROBLEM WITH THIS: it can take a long time if
done all at once, and if tried 'piecemeal' becomes difficult
to do correctly and efficiently.
- We will see further along in this chapter that
it is possible to do an "end run"
around the external fragmentation problem by allowing
the memory allocation of a process to consist of
fixed-size, non-contiguous chunks of physical memory
called page frames.
- 8.4 Segmentation
- 8.4.1 -- Basic Method
- Programmers tend to think of their programs as a collection of
named functions, modules, and data structures -- not arranged
in any particular order.
- Maybe it is not "natural" to think of a process as
occupying a linear array of bytes starting at address 0
and running to some upper limit.
- The segmentation memory management scheme views the
memory allocation of the process as an unordered collection
of variably-sized units called segments.
- A logical address consists of a segment 'name'
(actually, for convenience, a segment number) followed
by an offset within the segment.
- 8.4.2 -- Segmentation Hardware
- A segmentation addressing system requires support for
translating the logical addresses
seen by the CPU into the physical addresses used
for transfers between the CPU and main memory.
- The hardware performs this routine address translation
with the help of a segment table for each process.
- A segment table is indexed by segment number (name). The i-th
entry of the table contains the base address and the limit
(length) of the i-th segment of the process.
- To translate a logical address (s, d),
the hardware compares d with the limit
value in the s-th segment table entry.
If d is non-negative and less than the limit,
the hardware computes the physical address as the sum of
the segment base plus the offset d. Otherwise the
hardware traps to the OS, which then handles the illegal-address
error. (A code sketch of this check appears at the end of this
subsection.)
- Since segments are variable in size, segmentation memory allocation
schemes, like contiguous allocation with variable-sized partitions,
are instances of the dynamic storage allocation
problem, and therefore severe problems with external
fragmentation are common.
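- A sketch in C of the translation of (s, d) described above, with a
  hypothetical in-memory segment table (the offset d is unsigned, so
  the "non-negative" half of the check is implicit):

      #include <stdint.h>

      struct seg_entry { uint32_t base; uint32_t limit; };

      #define NUM_SEGS 64                 /* hypothetical table size */
      struct seg_entry seg_table[NUM_SEGS];

      /* Translate (s, d) to a physical address; returning -1 stands in
         for the trap the hardware would generate on an illegal address. */
      int64_t seg_translate(uint32_t s, uint32_t d)
      {
          if (s >= NUM_SEGS || d >= seg_table[s].limit)
              return -1;                           /* trap to the OS */
          return (int64_t)seg_table[s].base + d;   /* base + offset  */
      }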
- 8.5 Paging
- When swapping is done in conjunction with variable-size
partitioning or (pure, variable-size) segmentation, there is
typically a dynamic storage allocation
problem to solve on the swap space device in addition to the
problem in main memory. Backing stores are very slow compared to
main memories so compaction is not a realistic option.
- Paging is a popular alternative memory allocation scheme that eliminates
external fragmentation.
- 8.5.1 -- Basic Method
- For purposes of this discussion, let's assume that the
smallest addressable unit of primary memory is a
byte. It should be obvious how to apply the
concepts developed here to situations in which there is a
different word size.
- The hardware has a given page size such as 4Kbytes (in other
words, 4096 bytes). We divide primary memory and backing store
into page-sized contiguous chunks (called frames). For
example frame #0 runs from byte #0 through byte #4095; and frame
#1 runs from byte #4096 through byte (4096+4095)=8191.
- To put a process into the primary memory, the operating
system writes code and data structures into a set of (physical
memory) frames. The frames don't have to be contiguous
with each other. For example the frame
used for the first 4096 bytes of the process (page #0 of the
process) could be frame #17, which has base address
in physical memory of 17*(4096)=69632, and runs up through
byte 69632+4095=73727. The second 4096 bytes of the process (page #1
of the process) could be in frame #3, which runs from
byte 3*4096=12288 to byte 12288+4095=16383.
- With paging, the logical address space is contiguous.
In effect it is just an array of bytes, ranging from byte #0
to some upper limit.
- As a program runs, the MMU hardware does all the
routine translation of logical addresses to physical addresses
by using a page table. The operating system does
not perform this routine address translation -- that
would require an interrupt for every memory access, and would
be prohibitively slow!
- The OS creates a page table entry for each page, when it first
loads the page into a frame. For logical page number i, the OS
puts the number of the frame allocated for page i into entry i
of the process page table. When the process attempts a memory
access, hardware uses some of the most significant bits of
the logical address (known as the page number) as an index into
the page table. We can visualize the logical address as
( p | d ), where p is the bits of the page
number and d is the remaining bits of the logical address,
called the offset. The hardware finds a
frame number f at location p in the
page table. To form the physical address, the hardware
constructs ( f | d ) by replacing p in the
logical address with the frame number f.
(The lengths of p and f in bits can be
different.) The hardware then continues with the
memory access. (A code sketch of this translation appears at
the end of this subsection.)
- Suppose that the page size is 2^n bytes.
  Then each page offset and each frame offset must
  consist of n bits.
- If the number of frames of physical memory is
  2^k, then there must be at least
  k bits in each frame number.
- Similarly, if the number of pages in the logical address space is
  2^h, then there must be at least h bits in each
  page number.
- There is no external fragmentation with paging. However,
typically a process does not need all of the memory in its
"last frame." The remainder is internal fragmentation - about
half a page, on average.
- A small page size reduces internal fragmentation. A large
page size keeps the page table smaller and reduces the total
amount of I/O overhead for copying pages to and from the
backing store.
- Memory protection with paging is pretty straightforward. The
  OS creates the page table and uses it to protect memory, much
  as base and limit registers are used under contiguous allocation.
(The 'bases' are the frame numbers, rather than physical
addresses of memory cells, and the 'limits' are not explicitly
stored, because they're all just equal to the page size.)
- The OS has to keep track of all the allocations of the
physical frames.
- The OS keeps a copy of the page table of each process.
- Suppose a user process gives an address as a parameter when
communicating with the OS. For example the address could be the
base address of an array that the process wants to use as an
I/O buffer. The process gives the OS a logical address.
(The process only knows about logical addresses.) The
operating system needs to know the physical address.
The OS will use the page table of the process to translate.
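- A sketch in C of the ( p | d ) to ( f | d ) translation described
  above, assuming 4 KB pages (a 12-bit offset) and a hypothetical
  single-level page table array:

      #include <stdint.h>

      #define OFFSET_BITS 12                     /* page size 2^12 = 4 KB */
      #define OFFSET_MASK ((1u << OFFSET_BITS) - 1)

      uint32_t page_table[1u << 20];             /* frame number per page */

      uint64_t translate(uint32_t logical)
      {
          uint32_t p = logical >> OFFSET_BITS;   /* page number p  */
          uint32_t d = logical &  OFFSET_MASK;   /* offset d       */
          uint64_t f = page_table[p];            /* frame number f */
          return (f << OFFSET_BITS) | d;         /* ( f | d )      */
      }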
- 8.5.2 -- Hardware Support
- Page tables can be implemented in a variety of ways.
In an extremely simple case, each process might have its
own page table, and the page table might be implemented using
a bank of dedicated registers.
- In many contemporary systems, the CPU/MMU architecture
contains a page-table base register (PTBR) pointing to a
large page table that is resident in the main memory.
- Such a contemporary system also uses a fast associative memory
address cache (Translation Lookaside Buffer - TLB) so that the
MMU does not usually have to take the time to access the page
table when performing an address translation.
- When it is necessary to access the page table in memory, depending
on the particulars of the system design, it could be either the
hardware or an OS interrupt routine that performs that access.
- Address Space Identifier (ASID) technology allows the TLB to
contain address translation information for several different
processes.
- ASID technology also cuts down on the necessity to do
time-consuming cache flushes during a context switch.
- Effective memory access time is a function of the hit ratio,
memory access time, and TLB search time.
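- A worked example (the numbers are assumed for illustration): if a
  memory access takes 100 ns, TLB search time is negligible, and the
  hit ratio is 80 percent, then a hit costs one memory access and a
  miss costs two (one to read the page table entry, one for the data):

      /* Effective access time (EAT), with TLB search time taken as ~0:
         a hit costs one memory access, a miss costs two. */
      double eat(double hit_ratio, double mem_ns)
      {
          return hit_ratio * mem_ns + (1.0 - hit_ratio) * (2.0 * mem_ns);
      }
      /* eat(0.80, 100.0) = 0.80*100 + 0.20*200 = 120 ns */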
- 8.5.3 -- Protection
- Some bits in page table entries (PTEs) can be used to
make access restrictions on pages - a read-only bit,
for example. The hardware can be designed to generate
a trap if a user process attempts to write to a
read-only page. It is also common for a PTE to contain
a valid bit.
- Some systems make the page table only as long as is necessary
for the size of the process. Such a system would typically
have a page-table length register (PTLR). A process attempting
to access an address "past the end of the table" would generate
a trap to the OS.
- In any case, the valid bit in "extra" page table entries can be
cleared by the OS so that the process will trap if it tries to
use one of those entries.
- Unfortunately a process generally can access the
internal fragment in its last page.
- 8.5.4 -- Shared Pages
- The paging paradigm easily supports shared memory (at least
when "traditional" hierarchical page tables are used.)
- If two processes have the same frame number in both their
page tables then they are able to share that frame.
- The OS can use this idea to allow many processes to share
the same read-only program text.
- Writeable memory may be shared as a means of interprocess
communication.
- 8.6 Structure of the Page Table
- 8.6.1 -- Hierarchical Paging
- A common size for page table entries (PTEs) is 4 bytes.
- 2^12 bytes = 4 KB is a typical page size.
- Assuming the sizes above, a page of 4 KB has room for
  2^10 = 1024 page table entries.
- Assuming a logical address space that uses 32-bit addresses,
  there are 2^32 addressable bytes. If the page size
  is 2^12 bytes, then 20 = 32 - 12 of the bits in an
  address comprise the page number, which implies
  there can be as many as 2^20 pages in the logical
  address space of a process. That is about a million pages.
- Again, under the assumptions above, the page table for a
  process with 2^20 pages would contain
  2^22 bytes, which is 2^10 * 2^12 bytes.
  Therefore the page table itself would span 2^10 = 1024 pages.
- To avoid having to solve an instance of the dynamic storage
allocation problem for page table allocation,
it may be workable to page the page tables themselves -
at least when they are significantly larger than one page
in size, but not too large overall.
- In one scheme, the logical address is partitioned as
(P1 | P2 | d). P1 is used as an index
into an outer page table. The entry in the outer
page table is the frame number of one of the pages of the
page table. P2 and d are then used in the
"normal way" to complete the address translation: P2
is used as an index into the specific page of the page table.
The frame number found in the PTE is combined with d
in the usual way to form the physical address.
- For still larger page tables, some architectures have supported
more levels of paging, where, for example, not only is the page
table paged, but so is the outer page table.
One of the SPARCs produced by Sun Microsystems
supported three-level paging, and the Motorola 68030 had
support for four-level paging.
- Generally it is not considered appropriate to map a 64-bit
paged address space with this type of 'traditional' hierarchical page
table. It requires what is considered an excessive number of
levels of page tables -- e.g. seven levels.
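- A sketch in C of the two-level walk for the (P1 | P2 | d) scheme
  described above, assuming 32-bit addresses, 4 KB pages, and 10-bit
  P1 and P2 fields (the frames pointer, standing for physical memory
  viewed as frames of 1024 PTEs, is hypothetical):

      #include <stdint.h>

      uint32_t outer_table[1u << 10];  /* frame # of each page of the page table */
      uint32_t (*frames)[1u << 10];    /* physical memory as frames of 1024 PTEs */

      uint64_t translate2(uint32_t logical)
      {
          uint32_t p1 = logical >> 22;             /* index into outer table     */
          uint32_t p2 = (logical >> 12) & 0x3FF;   /* index into page-table page */
          uint32_t d  = logical & 0xFFF;           /* offset within the frame    */
          uint32_t pt = outer_table[p1];           /* frame holding that page    */
          uint64_t f  = frames[pt][p2];            /* the PTE: a frame number    */
          return (f << 12) | d;
      }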
- 8.6.2 -- Hashed Page Tables
- Per-process hashed page tables are an alternative to
hierarchical page tables. A hash function is applied to the
virtual address. Collisions are resolved with external
chaining. Each entry on a chain contains a virtual address,
frame number, and pointer for the next item on the chain.
- Clustered page tables are a variant in which each entry in the
page table refers to several pages.
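- A sketch in C of a hashed page-table lookup with external chaining
  (the structure and hash function are simplified assumptions):

      #include <stdint.h>
      #include <stddef.h>

      struct hpt_entry {
          uint64_t vpn;             /* virtual page number     */
          uint64_t frame;           /* frame it is mapped to   */
          struct hpt_entry *next;   /* next entry on the chain */
      };

      #define TABLE_SIZE 1024
      struct hpt_entry *hashed_table[TABLE_SIZE];

      /* Walk the chain at hash(vpn); returning -1 stands in for
         the fault raised when no matching entry exists. */
      int64_t hpt_lookup(uint64_t vpn)
      {
          for (struct hpt_entry *e = hashed_table[vpn % TABLE_SIZE];
               e != NULL; e = e->next)
              if (e->vpn == vpn)
                  return (int64_t)e->frame;
          return -1;
      }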
- 8.6.3 -- Inverted Page Table
- Some systems, including the UltraSPARC and PowerPC, utilize
an innovation called an inverted page table. An
inverted page table has one entry for each frame.
The entry identifies which "address space" (e.g. which process)
is using the frame, and for which virtual page number the frame
is being used.
- There is just one inverted page table for the whole system,
not one page table per process. A distinct advantage of the
methodology is that the amount of memory used by the
inverted page table is bounded by a constant times
the amount of physical memory, as opposed to
being bounded only by the number of processes and the
max size of a virtual address space.
- A distinct disadvantage of using inverted page tables is
that the hardware and/or the OS cannot directly index into
the table using the page number, so it could take a long
time to search this table to find the information
needed for a forward address translation. (On the other hand
this structure easily supports efficient reverse
address translation.)
- The idea of a hashed page table may be used in conjunction with
the inverted page table to speed the search for the correct
table entry. A hash function can be applied to the address
space identifier and virtual address to determine the location
to perform an initial probe. External chains can provide
subsequent locations to probe until the matching entry is
found. (A code sketch appears at the end of this subsection.)
- Of course if there is a cache hit in the TLB, the page table is
not consulted and effective memory access time is nearly
equal to memory access time. If there is a TLB miss and the
page table is consulted, then (forward) address translation
requires additional memory accesses
for operations on the page table and hash structure.
- If entries in the inverted page table are allowed to contain
only one virtual page number, it becomes difficult to implement
shared memory. If we provide information for more than one
process and virtual page number in an inverted page table entry,
then the amount of memory used by the page table can no longer
be said to be
big-O of the size of physical memory.
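- A sketch in C of a search of an inverted page table (shown as a
  slow linear scan; the hashed probing described above is what makes
  this practical, and all names here are hypothetical):

      #include <stdint.h>

      struct ipt_entry { uint32_t pid; uint64_t vpn; };

      #define NUM_FRAMES 4096
      struct ipt_entry inverted_table[NUM_FRAMES];  /* one entry per frame */

      /* The index of the matching entry *is* the frame number. */
      int64_t find_frame(uint32_t pid, uint64_t vpn)
      {
          for (int64_t i = 0; i < NUM_FRAMES; i++)
              if (inverted_table[i].pid == pid && inverted_table[i].vpn == vpn)
                  return i;        /* frame number             */
          return -1;               /* not resident: page fault */
      }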
- 8.7 Example: Intel 32-bit and 64-bit Architectures
- Popular PCs run on Intel chips, though Linux also runs on other
  architectures besides Intel.
- Advanced RISC Machine (ARM) architecture is popular for mobile
devices.
- 8.7.1 IA-32 Architecture
- The Intel IA-32 system has a combined segmentation and paging
scheme.
- 8.7.1.1 IA-32 Segmentation
- Logical addresses consist of a (selector, offset) pair
that is very similar in purpose to the kinds of addresses
used in pure segmentation, which consist of a segment
name (number) and an offset within the segment.
- There is a segmentation unit, in effect a part of the
MMU, that uses a data structure much like a segment
table to translate a logical address into an
intermediary form called a linear address.
- 8.7.1.2 IA-32 Paging
- In effect, the IA-32 segments are paged, so the
next step in finding the physical address is to
determine which (logical) page of the segment is referenced
by the linear address, and to locate the physical
frame to which that logical page is mapped.
- There is a paging unit, also in effect part of the
MMU, which translates the linear address into a
physical address.
- There are two page sizes supported by the IA-32, 4KB and
4MB. A Page Size Flag in the outer page table
is set if the page size is 4MB.
- If the flag is not set, then the paging unit carries out
a standard two-level forward address translation procedure
to form the physical address.
- If the flag is set, then the paging unit carries out
a standard single-level forward address translation process,
using the outer page
table alone, and bypassing the step of going to the inner
page table.
- The IA-32 system employs virtual memory techniques, allowing
parts of page tables to not be resident in primary memory.
The OS can bring them into memory from disk when they
are needed.
- 8.7.2 x86-64
- The x86-64 architecture, developed by Advanced Micro Devices (AMD)
and adopted by Intel, can potentially support 64-bit address spaces.
- Current systems are using the architecture to support up to
48-bit virtual addresses and up to 52-bit physical addresses.
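- (A detail not in the notes above, included for orientation: with
  4 KB pages, x86-64 implementations consume the 48 bits of a virtual
  address as four 9-bit page-table indices plus a 12-bit offset,
  since 4*9 + 12 = 48.) Extracting the fields might look like:

      #include <stdint.h>

      /* level 0..3: the four 9-bit indices above the 12-bit offset */
      #define PT_INDEX(va, level) (((va) >> (12 + 9 * (level))) & 0x1FF)
      #define PAGE_OFFSET(va)     ((va) & 0xFFF)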
- 8.8 Example: ARM Architecture
- ARM architecture is used in many mobile devices.
- ARM architecture can support 1 MB or 16 MB pages
with standard single level paging.
- ARM architecture can support 4 KB or 16 KB pages
with standard two level paging.
- The ARM architecture supports two levels of TLB, with separate outer
(micro) TLBs for instructions and data.
- The outer (micro) TLBs support ASIDs
- If there is a miss in the outer TLB, an inner TLB is consulted.
- If page table "walks" are required because of misses at both
levels of TLB, the ARM hardware performs them.