(Latest Revision: Fri, Feb 23, 2023)
[2023/02/23: added more boldface and some minor rewording]
[2023/02/21: added some boldface emphasis]
[2023/02/17: added a couple of lines to section 4.5.1]
[2022/02/25: format and info addition to section 4.7.1]
[2021/01/20: updated chapter title]
[2019/06/03: added captions to figures]
[2019/05/30: inserted more figures]
[2019/03/21: format changes]
[2019/02/21: current 2019 updates]
Chapter Four -- Threads & Concurrency -- Lecture Notes
4.0 Objectives
Identify the basic components of a thread, and contrast threads and processes.
Describe the major benefits and significant challenges of designing multithreaded processes.
Illustrate different approaches to implicit threading, including thread pools, fork-join, and Grand Central Dispatch.
Describe how the Windows and Linux operating systems represent threads.
Design multithreaded applications using the Pthreads, Java, and Windows threading APIs.
Figure 4.1: Single-threaded and multithreaded processes
4.1 Overview See figure 4.1 at right. We can think of a thread as a part of a process that actually performs an execution of the program - an execution sequence of a program.
There is only one thread in a "traditional process," but there can be multiple threads - multiple execution sequences - within a single process. The threads can share many parts of the context of the process, such as the program code (aka the text), variables and other data, open files, and signals and other messages.
On the other hand, each thread is a separate execution sequence, and so there are certain parts of its context that it must NOT share. Each thread needs to have its own separate program counter, CPU register values, thread id number, and run-time stack for supporting its own function calls.
4.1.1 Motivation It's common now for computer applications to work for us on multiple separate concurrent activities - for example, a word processor that concurrently renders the display, checks the spelling in the document, and reads the characters the user types at the keyboard.
Figure 4.2: Multithreaded server architecture
When a multi-threaded process runs on a multiprocessor, it's possible
for two or more of the threads to execute simultaneously, which, of
course, can be a great efficiency win. One example would be a busy
Internet server process that can serve multiple clients
simultaneously.
Most operating system kernels are multithreaded!
4.1.2 Benefits These are the main categories of benefits of multithreaded programming.
Responsiveness: Work is divided among threads, and some threads can continue working while others are blocked (for example, waiting for I/O to complete). [Note this is a type of concurrent processing that applies to a uniprocessor.]
Resource Sharing: Unlike separate processes, threads can share memory and other resources by default, which makes it easier for them to communicate and cooperate. Also, primary memory is used efficiently when threads share code and data.
Economy: Not much needs to be done to add a new thread to an existing process, or to perform a context switch between two threads of the same process. Therefore these operations usually require less new memory allocation and less time than creation of a new process or a context switch between different processes.
Scalability: On a multiprocessor, multiple threads can work on a problem in parallel - truly simultaneously. A single-threaded process can only run on one CPU at a time, no matter how many CPUs are available in the computer.
4.2 Multicore Programming Threads or processes are
concurrent when they execute at approximately the
same time. This can happen on a computer with a single CPU (aka a
uniprocessor), when, for example, the operating system performs
multitasking. However, parallel threads or processes
truly execute simultaneously. This can only happen on a multiprocessor.
A definition: A multiprocessor with multiple computing cores (CPUs)
on a single chip is called a multicore system.
Figure 4.3: Concurrent execution on a single-core system
Figure 4.4: Parallel execution on a multicore system
4.2.1 Programming Challenges These are five challenges in programming for multicore systems.
Identifying tasks: How do we divide up the work of a process among the threads so that we get lots of work going on in parallel?
Balance: There's a cost to adding a new thread to a process. How do we make sure the new thread contributes enough to justify that cost?
Data Splitting: How do we divide up the data among threads to facilitate lots of work in parallel?
Data Dependency: When thread X needs to get some data from thread Y before thread X can continue, how do we synchronize the actions of X and Y so the data is passed from one to the other correctly?
Testing and debugging: The actions of two threads executing in parallel can be interleaved in a huge number of ways, so there can be a very large number of possible orders of instruction execution for a multithreaded application. How can we effectively test and debug applications with such unpredictable execution paths?
Figure 4.5: Data and task parallelism
4.2.2 Types of Parallelism There are two main kinds of parallelism: data parallelism and task parallelism.
With data parallelism, the data is divided up,
and each thread is
given one part of the data on which to work. Each thread does the same
operations, but on different data. If we have
two lists to sort and
two threads, and if each thread sorts one of the lists,
that's data parallelism.
With task parallelism, the threads get different kinds
of jobs to do,
and each thread uses whatever part(s) of
the data it needs.
If one thread computes the average of an array while
another thread
finds the maximum value in the array, that's task parallelism.
Programs often use some combination of the two strategies.
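To make the distinction concrete, here is a minimal sketch of the two-list sorting example using the Pthreads API (covered in Section 4.4.1). The function sort_list and the sample data are invented for illustration; each thread performs the same operation (sorting) on different data, which is data parallelism.

    /* Data parallelism sketch: same operation, different data. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp(const void *a, const void *b) {
        return *(const int *)a - *(const int *)b;
    }

    /* Each thread sorts whichever list it is handed. */
    static void *sort_list(void *arg) {
        qsort(arg, 5, sizeof(int), cmp);
        return NULL;
    }

    int main(void) {
        int list1[5] = {4, 2, 9, 1, 7};
        int list2[5] = {8, 3, 6, 0, 5};
        pthread_t t1, t2;

        pthread_create(&t1, NULL, sort_list, list1);
        pthread_create(&t2, NULL, sort_list, list2);
        pthread_join(t1, NULL);    /* wait for both sorts to finish */
        pthread_join(t2, NULL);
        printf("smallest: %d and %d\n", list1[0], list2[0]);
        return 0;
    }

If instead one thread computed the average of a list while another found its maximum, the same create/join skeleton would express task parallelism.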
Figure 4.6: User and kernel threads
4.3 Multithreading Models
So far the threads we have studied are kernel-level
threads. There is also something called user-level
threads. Basically, user-level threads are
simulations of threads. User threads have some
nice advantages, but also some disadvantages.
Think about a single (kernel-level) thread that executes a time slice
in a CPU. Suppose we make that single thread execute software that
simulates multiple threads.
The resulting simulation can switch between
multiple different
activities many times during the single time slice of the kernel-level
thread in the CPU. In other words, it can very quickly
simulate multiple
context switches of multiple (simulated) threads, all during
one time slice of one kernel-level thread. That is
a very nice
advantage - very low overhead for context switching.
Other advantages are that the
simulator can quickly create, execute,
and terminate simulated threads at will without
being slowed by
requesting help from the kernel and waiting for
system calls to
execute.
Such simulated threads, user-level
threads, are quite popular for use
in many applications.
There must be kernel-level threads
to support user-level threads.
We discuss three design models for this support: many-to-one, one-to-one, and many-to-many.
Figure 4.7: Many-to-one model
4.3.1 Many-to-One Model
In this model, a single kernel-level thread supports many user-level
threads. As explained above, context switching
is extremely fast among the user-level threads. This model
supports programmers that want to organize software as a group
of concurrent threads. However the kernel-level thread can only
execute in one CPU at a time. Therefore the user-level threads
can only execute one at a time. True parallelism
is not possible with this model. Also, if any of the
user-level threads
makes a blocking system call, the
kernel-level thread will have to block,
and so it will not be able to run code for any of the user-level
threads until it gets out of its wait queue.
The net effect is that all user threads are blocked if
any one of them makes a blocking system call.
Figure 4.8: One-to-one model
4.3.2 One-to-One Model
In the one-to-one model, the number of kernel-level
threads is equal to the number of
user-level threads.
Each user-level thread has exactly one supporting
kernel-level thread.
Each kernel-level thread supports exactly one user-level thread.
If we use the one-to-one model, we "cure" the problems of the many-to-one model.
Now, if one user-level thread blocks, the others don't have to block.
Also user-level threads can operate in parallel
on a multiprocessor.
However,
because each kernel-level thread supports only one user-level
thread, we lose the advantage of quick context switches
between user-level threads.
Also each time we create a new user-level thread, we need to create
a new kernel-level thread to support it, so
we tend to use up more time creating threads.
There is less flexibility to schedule the user-level threads, because the scheduler in the kernel decides when to run the supporting kernel-level threads.
This model may lead to excessive numbers
of kernel-level threads,
which could adversely affect performance.
Despite the disadvantages listed above, many operating systems use the
one-to-one model. It's relatively easy to implement,
and the disadvantages are viewed as acceptable because
many systems
have a large number of processing cores that will support large
numbers of kernel-level threads.
Figure 4.9: Many-to-many model
4.3.3 Many-to-Many Model
In the many-to-many model, a set of user-level threads
is supported by
a smaller or equal number of kernel-level threads. A user-level
thread is not necessarily bound to any particular kernel-level
thread. The thread library can re-assign it. For example if
a kernel-level thread X has to block, some of the user-level threads
that X was supporting can migrate for support to different kernel-level
threads.
The many-to-many model can be seen as combining advantages of
the many-to-one and one-to-one models.
The many-to-many model allows the quick creation of a large number of user-level threads that can do very fast context switches.
The application can have greater control
over the number of kernel-level threads.
The application can control the scheduling of the user-level threads.
Much of the advantage of the one-to-one model remains: the ability to exploit parallelism on a multiprocessor, and independent blocking - the ability of one user-level thread to block without forcing the other user-level threads to block.
Figure 4.10: Two-level model
The two-level model is a variation on the many-to-many model,
in which a user-level thread may be bound to a kernel-level thread.
It is difficult to implement the many-to-many model,
and few systems support it.
4.4 Thread Libraries
POSIX thread (Pthreads) implementations vary from system to system - they may be user-level or kernel-level.
Windows threads are kernel-level.
The Java thread API is typically implemented using a native thread package on the host system (e.g. Pthreads or Windows).
DEFINITION: In asynchronous threading, the parent creates
one or more child threads and then executes concurrently with them.
DEFINITION: In synchronous threading, the parent creates one or more child threads and waits for all the child threads to exit before resuming execution.
Section 4.4 contains three examples of synchronous threading in which a parent thread
creates a child thread to execute a function. The parent
blocks until the child has exited, and then the parent
resumes execution.
4.4.1 Pthreads
If a variable is declared in a program outside of any function, then it is automatically in memory shared by all threads of the process. So that makes it very easy to set up an area of shared memory.
A new thread is created with pthread_create(&tid, &attr, fname, paramPtr), where:
&tid is the address of a variable to store the id number of the new thread.
&attr is the address of a data structure containing attributes for the new thread to have.
fname is the name of the function where the new thread will begin execution.
paramPtr is a pointer to the parameter that will be passed to fname when the new thread executes fname.
Figure 4.11: Multithreaded C program using the Pthreads API
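Here is a minimal sketch in the spirit of Figure 4.11, assuming a POSIX system; the function name runner and the shared variable sum are illustrative choices, not required names. It also shows synchronous threading: the parent blocks in pthread_join() until the child exits.

    /* Run as: ./a.out 10  (sums the integers 1..10) */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    int sum;   /* declared outside any function: shared by all threads */

    void *runner(void *param) {       /* the fname the new thread starts in */
        int upper = atoi(param);
        sum = 0;
        for (int i = 1; i <= upper; i++)
            sum += i;
        pthread_exit(0);
    }

    int main(int argc, char *argv[]) {
        pthread_t tid;                 /* will hold the new thread's id */
        pthread_attr_t attr;           /* attributes for the new thread */

        if (argc != 2) return 1;
        pthread_attr_init(&attr);      /* use the default attributes */
        pthread_create(&tid, &attr, runner, argv[1]);  /* paramPtr = argv[1] */
        pthread_join(tid, NULL);       /* synchronous: parent waits for child */
        printf("sum = %d\n", sum);
        return 0;
    }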
4.4.2 Windows Threads
As with POSIX, in the Windows API,
if a variable is declared in a program, and it's outside
of any function, then it is automatically in memory
shared by all threads of a process. So that makes it
very easy to set up an area of shared memory.
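For comparison, here is a minimal sketch of the same summing task using the Windows API's CreateThread() and WaitForSingleObject(); the names Summer and Sum are illustrative, not required.

    #include <windows.h>
    #include <stdio.h>

    DWORD Sum;   /* global: shared by all threads in the process */

    DWORD WINAPI Summer(LPVOID param) {   /* thread start routine */
        DWORD upper = *(DWORD *)param;
        for (DWORD i = 1; i <= upper; i++)
            Sum += i;
        return 0;
    }

    int main(void) {
        DWORD threadId, upper = 10;
        HANDLE h = CreateThread(NULL, 0, Summer, &upper, 0, &threadId);
        WaitForSingleObject(h, INFINITE);  /* parent blocks until child exits */
        CloseHandle(h);
        printf("Sum = %lu\n", (unsigned long)Sum);
        return 0;
    }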
4.5 Implicit Threading
The basic idea of implicit threading is automation.
Developers and programmers identify tasks that can run in parallel,
and compilers and run-time libraries, or other software,
create and manage threads to perform those tasks.
4.5.1 Thread Pools
The main thread of a busy Internet server does not have time to
'personally' perform the service for each client,
because it needs to return immediately to the job of accepting
the incoming request from the next
client.
One way to deal with that problem is for the main thread
to spawn a child, a service thread, for each client.
The service thread then handles the
client request and terminates.
However it can take excessive time for the creation and termination
of the service threads, and if too many client requests come in too
fast, there may be too many service threads taxing system resources.
The idea of a thread pool is for the server to create a fixed number of service threads at process startup, and to allow them to 'stay alive' as long as the server operates. The service threads
that don't have anything to do are kept suspended. When a client
needs service, and if a service thread is available,
the main thread assigns one to the client. Otherwise the main
thread puts the client in a queue where it waits for a service
thread to become available.
When a service thread finishes with a client, it does not terminate.
It just 'goes back to the pool' and waits to be assigned to another
client.
Generally it takes less time for a server process to use an existing
thread to service a request than to create a brand
new service thread. Also,
no matter how much the server is flooded with client requests, the
number of service threads never exceeds the fixed
size of the pool.
"Thread pool architectures" are a form of implicit threading that make it
relatively easy for programmers to implement servers that utilize thread pools.
For example, there are Android, Windows, and Java APIs that provide high-level
commands for creating and managing thread pools.
Some thread pool architectures can monitor the frequency of client
requests and
dynamically adjust the size of the thread pool to match demand.
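The notes don't prescribe an implementation, but here is a minimal sketch of the thread-pool idea using Pthreads; POOL_SIZE, the request queue, and submit() are all invented for illustration. The pool threads are created once, sleep on a condition variable when idle, and 'go back to the pool' after each request.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define POOL_SIZE 4
    #define QUEUE_MAX 64

    typedef void (*task_fn)(int);

    static task_fn queue_fn[QUEUE_MAX];
    static int     queue_arg[QUEUE_MAX];
    static int     head, tail, count;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

    /* Each pool thread loops forever: take a task, run it, go back. */
    static void *worker(void *arg) {
        for (;;) {
            pthread_mutex_lock(&lock);
            while (count == 0)                 /* sleep until work arrives */
                pthread_cond_wait(&nonempty, &lock);
            task_fn fn = queue_fn[head];
            int a = queue_arg[head];
            head = (head + 1) % QUEUE_MAX;
            count--;
            pthread_mutex_unlock(&lock);
            fn(a);                             /* serve the request */
        }
        return NULL;
    }

    /* The main (acceptor) thread just enqueues a request and returns. */
    static void submit(task_fn fn, int arg) {
        pthread_mutex_lock(&lock);
        if (count < QUEUE_MAX) {               /* else: client must wait */
            queue_fn[tail] = fn;
            queue_arg[tail] = arg;
            tail = (tail + 1) % QUEUE_MAX;
            count++;
            pthread_cond_signal(&nonempty);    /* wake one pool thread */
        }
        pthread_mutex_unlock(&lock);
    }

    static void handle_client(int id) { printf("served client %d\n", id); }

    int main(void) {
        pthread_t pool[POOL_SIZE];
        for (int i = 0; i < POOL_SIZE; i++)    /* fixed pool, created once */
            pthread_create(&pool[i], NULL, worker, NULL);
        for (int id = 1; id <= 10; id++)
            submit(handle_client, id);
        sleep(1);      /* let the pool drain; exiting kills the pool */
        return 0;
    }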
4.5.1.1 Java Thread Pools
Java supports thread pools through the java.util.concurrent package (the Executors framework).
4.5.2 Fork Join
If programmers leave notations in the programs designating work
that can be performed in parallel, then implicit threading library
code can create (fork) child threads to do the work, and arrange
for the children to return results to the parent (join the parent)
when they terminate.
Figure 4.16: Fork-join parallelism
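Here is a minimal fork-join sketch with Pthreads (the Java library described in the next subsection automates this pattern); struct range and psum are illustrative names. The parent forks a child to sum half of an array, handles the other half itself, and then joins to combine the results.

    #include <pthread.h>
    #include <stdio.h>

    struct range { int *a; int n; long sum; };

    static void *psum(void *arg) {             /* the forked task */
        struct range *r = arg;
        r->sum = 0;
        for (int i = 0; i < r->n; i++)
            r->sum += r->a[i];
        return NULL;
    }

    int main(void) {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        struct range left = { a, 4, 0 };
        pthread_t t;

        pthread_create(&t, NULL, psum, &left); /* fork */
        long right = 0;
        for (int i = 4; i < 8; i++)            /* parent does the other half */
            right += a[i];
        pthread_join(t, NULL);                 /* join: collect child's result */
        printf("total = %ld\n", left.sum + right);
        return 0;
    }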
4.5.2.1 Fork Join in Java Java (since version 7) has a fork-join library that spawns threads to perform recursive calls in algorithms like Mergesort. Programmers "mark" sections of the source code where parallel work is possible, and the library code takes care of creating and managing threads to do the work.
Figure 4.17: Fork-join in Java
Figure 4.19: UML Class diagram for Java's fork-join
4.5.3 OpenMP (compiler add-on and API for C, C++, and FORTRAN)
A programmer can insert labels (compiler directives) in the code that identify certain sections that should be executed by parallel threads. The compiler responds to the labels by generating code that creates threads to execute those sections of code in parallel.
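A minimal OpenMP sketch in C: the #pragma line is the label, and the compiler generates all the thread-management code. It assumes an OpenMP-capable compiler, e.g. cc -fopenmp.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        /* Iterations may run on different threads, in parallel. */
        #pragma omp parallel for
        for (int i = 0; i < 8; i++)
            printf("iteration %d on thread %d\n", i, omp_get_thread_num());
        return 0;
    }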
4.5.4 Grand Central Dispatch comprises extensions to C, an API, and a run-time library. Like OpenMP, it provides parallel processing, although details of the implementation differ.
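A minimal GCD sketch, assuming Apple's toolchain (the ^{ ... } block syntax is Clang's blocks extension); using a dispatch group here is just one way to wait for the task to finish.

    #include <dispatch/dispatch.h>
    #include <stdio.h>

    int main(void) {
        dispatch_queue_t q =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
        dispatch_group_t g = dispatch_group_create();

        /* The block is a task that GCD may run on any of its pool threads. */
        dispatch_group_async(g, q, ^{ printf("hello from a GCD task\n"); });

        /* Wait for the task to complete before exiting. */
        dispatch_group_wait(g, DISPATCH_TIME_FOREVER);
        return 0;
    }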
4.5.5 Intel Thread Building Blocks is another
approach to implicit threading that relies on templates for
parallel structures and task scheduling, rather than special
compilers or language features.
4.6 Threading Issues This section summarizes "issues"
that have to be resolved by people who implement operating systems
that support multi-threading.
4.6.1 The fork() and exec() System Calls
When an application is multi-threaded,
should the fork()
system call duplicate all threads, or just the calling thread?
Some APIs provide both options.
Implementations of exec() typically replace the entire process
of the calling thread, including all threads in that process.
Therefore, if the child created by a
fork() is going to call exec() immediately, there's no point
in having the fork() duplicate all the threads in the process.
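A minimal sketch of the usual fork()-then-exec() pattern on UNIX, which is why duplicating every thread would be wasted work here; "ls" is just an example program to exec.

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();            /* the child needs only one thread */
        if (pid == 0) {
            /* exec replaces the entire process image, threads and all */
            execlp("ls", "ls", "-l", (char *)NULL);
            perror("execlp");          /* reached only if exec fails */
            return 1;
        }
        waitpid(pid, NULL, 0);         /* parent waits for the child */
        return 0;
    }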
4.6.2 Signal Handling
Signals are a simple form of interprocess communication in some operating systems, primarily versions of UNIX. Signals were designed to behave something like interrupts, but they are not interrupts.
The OS delivers and handles signals.
Delivering signals and
handling (responding to) signals are routine
tasks the OS performs as opportunities arise. Sometimes delivery
of a signal to a process (or thread) is required as part of the
OS performance of interrupt service, or a system call.
The OS delivers signals to a process (thread) by setting a
bit in a context variable of the process (thread). Just
before scheduling a process (thread) to execute, the OS checks
to see if any signals have been delivered to the process (thread)
that have not been handled yet. If so, the OS will cause the
signal to be handled properly. Sometimes it does this by
executing code in kernel mode, and sometimes it handles a
signal by jumping into the user process at the start address
of a special handler routine the process has for responding
to the signal.
The exact appropriate way of handling a signal depends on the
nature of the signal.
Multithreading complicates the problem of implementing signal
delivery.
Should a signal be delivered to all the threads in a
process or just some? Just the threads to which
the signal applies? There are many different kinds
of signals, and any of those possibilities may apply, depending on
the nature of the signal.
Often the handler for a signal should run only once. A signal sent
to a process may be delivered only to the first thread that is not
blocking it.
The OS may provide a function to send a signal to one particular
thread.
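In POSIX that function is pthread_kill(). A minimal sketch, with handler and work as illustrative names:

    #include <pthread.h>
    #include <signal.h>
    #include <unistd.h>

    static void handler(int sig) {
        /* printf is not async-signal-safe; write() is */
        write(1, "got SIGUSR1\n", 12);
    }

    static void *work(void *arg) {
        for (;;)
            pause();                     /* sleep until a signal arrives */
        return NULL;
    }

    int main(void) {
        struct sigaction sa;
        sa.sa_handler = handler;         /* the process's handler routine */
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGUSR1, &sa, NULL);

        pthread_t t;
        pthread_create(&t, NULL, work, NULL);
        sleep(1);
        pthread_kill(t, SIGUSR1);        /* deliver to this one thread only */
        sleep(1);
        return 0;
    }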
4.6.3 Thread Cancellation
Sometimes a thread starts work but it should be cancelled
before it finishes - for example if two threads are searching
a database for a record and one of them finds it, the other
thread should be cancelled.
Thread cancellation can be implemented in a manner similar to how
signals work. In fact it may be implemented using signals.
Since
problems could be caused by instantly
cancelling a thread in a task that is in the midst of doing some
work, the implementation of cancellation typically includes ways
for threads to defer their cancellation so that they have time
to 'clean up' first - for example to deallocate resources they
are holding, or to finish updating shared data.
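A minimal sketch of deferred cancellation using the Pthreads API; searcher and cleanup are illustrative names. The cancellation request takes effect only at a cancellation point, and the pushed cleanup handler runs before the thread dies.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void cleanup(void *arg) {
        printf("releasing resources before dying\n");
    }

    static void *searcher(void *arg) {
        /* PTHREAD_CANCEL_DEFERRED is the default cancelability type */
        pthread_cleanup_push(cleanup, NULL);
        for (;;) {
            /* ... examine one record ... */
            pthread_testcancel();     /* explicit cancellation point */
        }
        pthread_cleanup_pop(0);       /* never reached; pairs the push */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, searcher, NULL);
        sleep(1);                     /* pretend another thread found it */
        pthread_cancel(t);            /* a request, not instant termination */
        pthread_join(t, NULL);
        return 0;
    }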
4.6.4 Thread Local Storage
We have seen that in many systems, variables declared outside any function
are by default shared by all threads.
A thread may need private variables that are globally visible
(from any function). This is called thread local storage.
Most thread APIs provide
support for such thread local storage.
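A minimal sketch using C11's _Thread_local keyword (Pthreads offers the same facility via pthread_key_create(), and many compilers spell it __thread); counter is an illustrative name. Each thread sees its own copy of the variable, yet the variable is visible from any function.

    #include <pthread.h>
    #include <stdio.h>

    _Thread_local int counter;   /* global visibility, one copy per thread */

    static void *work(void *arg) {
        counter += (int)(long)arg;        /* touches only this thread's copy */
        printf("my counter = %d\n", counter);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, (void *)1L);
        pthread_create(&t2, NULL, work, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("main's counter = %d\n", counter);   /* still 0 */
        return 0;
    }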
4.6.5 Scheduler Activations
This section describes some rather arcane details of how
the relationship between user-level threads and kernel-level
threads may be implemented.
The "Scheduler Activation" scheme for simulating threads at the
user level relies to a great extent on the kernel communicating
about certain events to the application that is executing
the user-level threads. These communications come from the
kernel and go "up" to the application, and they are named
upcalls. Read the information, but I won't ask you for
any details.
Figure 4.20: Lightweight process (LWP)
4.7 Operating-System Examples
4.7.1 Windows Threads
Figure 4.21: Data structures of a Windows thread
In Windows, applications run as separate processes, which may have multiple
threads.
Windows uses the one-to-one model for mapping user-level threads to kernel-level threads.
Per-thread resources include an ID number, a register set, a user stack and a kernel stack (for the use of the OS when executing on behalf of the process, for example when executing a system call for the process), and private storage used by library code.
There are three primary data structures for holding the context
of threads: ETHREAD (executive thread block), KTHREAD
(kernel thread block), and TEB (thread environment block). The first
two reside in kernel memory, and the TEB is in user-space. It's interesting
to see this aspect of how the abstraction of a "Thread Control Block (TCB)" is
implemented in an actual operating system.
4.7.2 Linux Threads
Linux has a traditional fork() system call that creates an
exact duplicate of the parent.
Linux also has a clone() system call with flag parameters that determine what resources will be shared between parent and child (clone). If a large amount of context is shared between parent and clone, then the clone is about the same thing as what we have been calling a new thread inside the parent process.
On the other hand, if little or nothing is shared, then the clone is about the same as the child of a traditional fork() operation.
The clone can also be something "in between" the two extremes.
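A minimal Linux-only sketch of clone(); the flag combination below makes the clone share the parent's memory, open files, filesystem info, and signal handlers, so it behaves like a thread. The stack size and the child function are arbitrary illustrative choices.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    #define CHILD_STACK (1024 * 1024)

    static int child(void *arg) {
        printf("clone sharing the parent's address space\n");
        return 0;
    }

    int main(void) {
        char *stack = malloc(CHILD_STACK);
        /* With no CLONE_* sharing flags, this would resemble fork(). */
        int flags = CLONE_VM | CLONE_FILES | CLONE_FS | CLONE_SIGHAND | SIGCHLD;
        pid_t pid = clone(child, stack + CHILD_STACK, flags, NULL);
        waitpid(pid, NULL, 0);     /* wait for the clone to finish */
        free(stack);
        return 0;
    }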