(Latest Revision: Fri, Feb 23, 2023)
[2023/02/23: added more boldface and some minor rewording]
[2023/02/21: added some boldface emphasis]
[2023/02/17: added a couple of lines to section 4.5.1]
[2022/02/25: format and info addition to section 4.7.1]
[2021/01/20: updated chapter title]
[2019/06/03: added captions to figures]
[2019/05/30: inserted more figures]
[2019/03/21: format changes]
[2019/02/21: current 2019 updates]
Chapter Four -- Threads & Concurrency -- Lecture Notes
4.0 Objectives
Identify the basic components of a thread, and contrast threads and processes.
Describe the major benefits and significant challenges of designing multithreaded processes.
Illustrate different approaches to implicit threading, including thread pools, fork-join, and Grand Central Dispatch.
Describe how the Windows and Linux operating systems represent threads.
Design multithreaded applications using the Pthreads, Java, and Windows threading APIs.
Figure 4.1: Single-threaded and multithreaded processes
4.1 Overview See figure 4.1 at right. We can think of a thread as a part of a process that actually performs an execution of the program - an execution sequence of a program.
There is only one thread in a "traditional process," but there can be multiple threads - multiple execution sequences - within a single process. The threads can share many parts of the context of the process, such as the program code (aka the text), variables and other data, open files, and signals and other messages.
On the other hand, each thread is a separate execution sequence, and so there are certain parts of its context that it must NOT share. Each thread needs to have its own separate program counter, CPU register values, thread id number, and run-time stack for supporting its own function calls.
4.1.1 Motivation It's common now for computer applications to work for us on multiple separate concurrent activities - for example, a word processor that concurrently renders the display, checks the spelling in the document, and reads the characters the user types at the keyboard.
Figure 4.2: Multithreaded server architecture
When a multi-threaded process runs on a multiprocessor, it's possible
for two or more of the threads to execute simultaneously, which, of
course, can be a great efficiency win. One example would be a busy
Internet server process that can serve multiple clients
simultaneously.
Most operating system kernels are multithreaded!
4.1.2 Benefits These are the main categories of benefits of multithreaded programming.
Responsiveness: Work is divided among threads, and some threads can continue working while others are blocked (for example, waiting for I/O to complete). [Note this is a type of concurrent processing that applies to a uniprocessor.]
Resource Sharing: Unlike separate processes, threads can share memory and other resources by default, which makes it easier for them to communicate and cooperate. Also, primary memory is used efficiently when threads share code and data.
Economy: Not much needs to be done to add a new thread to an existing process, or to perform a context switch between two threads of the same process. Therefore these operations usually require less new memory allocation and less time than creation of a new process or a context switch between different processes.
Scalability: On a multiprocessor, multiple threads can work on a problem in parallel - truly simultaneously. A single-threaded process can only run on one CPU at a time, no matter how many CPUs are available in the computer.
4.2 Multicore Programming Threads or processes are
concurrent when they execute at approximately the
same time. This can happen on a computer with a single CPU (aka a
uniprocessor), when, for example, the operating system performs
multitasking. However, parallel threads or processes
truly execute simultaneously. This can only happen on a multiprocessor.
A definition: A multiprocessor with multiple computing cores (CPUs)
on a single chip is called a multicore system.
Figure 4.3: Concurrent execution on a single-core system
Figure 4.4: Parallel execution on a multicore system
4.2.1 Programming Challenges These are five challenges in programming for multicore systems.
Identifying tasks: How do we divide up the work of a process among the threads so that we get lots of work going on in parallel?
Balance: There's a cost to adding a new thread to a process. How do we make sure the new thread contributes enough to justify that cost?
Data Splitting: How do we divide up the data among threads to facilitate lots of work in parallel?
Data Dependency: When thread X needs to get some data from thread Y before thread X can continue, how do we synchronize the actions of X and Y so the data is passed from one to the other correctly?
Testing and debugging: The actions of two threads executing in parallel can be interleaved in a huge number of ways, so there can be a very large number of possible orders of instruction execution for a multithreaded application. How can we effectively test and debug applications with such unpredictable execution paths?
Figure 4.5: Data and task parallelism
4.2.2 Types of Parallelism There are two main kinds of parallelism: data parallelism and task parallelism.
With data parallelism, the data is divided up,
and each thread is
given one part of the data on which to work. Each thread does the same
operations, but on different data. If we have
two lists to sort and
two threads, and if each thread sorts one of the lists,
that's data parallelism.
With task parallelism, the threads get different kinds
of jobs to do,
and each thread uses whatever part(s) of
the data it needs.
If one thread computes the average of an array while
another thread
finds the maximum value in the array, that's task parallelism.
Programs often use some combination of the two strategies.
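To make the distinction concrete, here is a minimal sketch of the two-list sorting example using the Pthreads API (covered in Section 4.4.1). The function sort_list and the sample data are invented for illustration; each thread performs the same operation (sorting) on different data, which is data parallelism.

    /* Data parallelism sketch: same operation, different data. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp(const void *a, const void *b) {
        return *(const int *)a - *(const int *)b;
    }

    /* Each thread sorts whichever list it is handed. */
    static void *sort_list(void *arg) {
        qsort(arg, 5, sizeof(int), cmp);
        return NULL;
    }

    int main(void) {
        int list1[5] = {4, 2, 9, 1, 7};
        int list2[5] = {8, 3, 6, 0, 5};
        pthread_t t1, t2;

        pthread_create(&t1, NULL, sort_list, list1);
        pthread_create(&t2, NULL, sort_list, list2);
        pthread_join(t1, NULL);    /* wait for both sorts to finish */
        pthread_join(t2, NULL);
        printf("smallest: %d and %d\n", list1[0], list2[0]);
        return 0;
    }

If instead one thread computed the average of a list while another found its maximum, the same create/join skeleton would express task parallelism.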
Figure 4.6: User and kernel threads
4.3 Multithreading Models
So far the threads we have studied are kernel-level
threads. There is also something called user-level
threads. Basically, user-level threads are
simulations of threads. User threads have some
nice advantages, but also some disadvantages.
Think about a single (kernel-level) thread that executes a time slice
in a CPU. Suppose we make that single thread execute software that
simulates multiple threads.
The resulting simulation can switch between
multiple different
activities many times during the single time slice of the kernel-level
thread in the CPU. In other words, it can very quickly
simulate multiple
context switches of multiple (simulated) threads, all during
one time slice of one kernel-level thread. That is
a very nice
advantage - very low overhead for context switching.
Other advantages are that the
simulator can quickly create, execute,
and terminate simulated threads at will without
being slowed by
requesting help from the kernel and waiting for
system calls to
execute.
Such simulated threads, user-level
threads, are quite popular for use
in many applications.
There must be kernel-level threads
to support user-level threads.
We discuss three design models for this support: many-to-one, one-to-one, and many-to-many.
Figure 4.7: Many-to-one model
4.3.1 Many-to-One Model
In this model, a single kernel-level thread supports many user-level
threads. As explained above, context switching
is extremely fast among the user-level threads. This model
supports programmers that want to organize software as a group
of concurrent threads. However the kernel-level thread can only
execute in one CPU at a time. Therefore the user-level threads
can only execute one at a time. True parallelism
is not possible with this model. Also, if any of the
user-level threads
makes a blocking system call, the
kernel-level thread will have to block,
and so it will not be able to run code for any of the user-level
threads until it gets out of its wait queue.
The net effect is that all user threads are blocked if
any one of them makes a blocking system call.
Figure 4.8: One-to-one model
4.3.2 One-to-One Model
In the one-to-one model, the number of kernel-level
threads is equal to the number of
user-level threads.
Each user-level thread has exactly one supporting
kernel-level thread.
Each kernel-level thread supports exactly one user-level thread.
If we use the one-to-one model, we "cure" the problems of the many-to-one model.
Now, if one user-level thread blocks, the others don't have to block.
Also user-level threads can operate in parallel
on a multiprocessor.
However,
because each kernel-level thread supports only one user-level
thread, we lose the advantage of quick context switches
between user-level threads.
Also each time we create a new user-level thread, we need to create
a new kernel-level thread to support it, so
we tend to use up more time creating threads.
There is less flexibility to schedule the user-level threads, because the scheduler in the kernel decides when to run the supporting kernel-level threads.
This model may lead to excessive numbers
of kernel-level threads,
which could adversely affect performance.
Despite the disadvantages listed above, many operating systems use the
one-to-one model. It's relatively easy to implement,
and the disadvantages are viewed as acceptable because
many systems
have a large number of processing cores that will support large
numbers of kernel-level threads.
Figure 4.9: Many-to-many model
4.3.3 Many-to-Many Model
In the many-to-many model, a set of user-level threads
is supported by
a smaller or equal number of kernel-level threads. A user-level
thread is not necessarily bound to any particular kernel-level
thread. The thread library can re-assign it. For example if
a kernel-level thread X has to block, some of the user-level threads
that X was supporting can migrate for support to different kernel-level
threads.
The many-to-many model can be seen as combining advantages of
the many-to-one and one-to-one models.
The many-to-many model allows the quick creation of a large number of user-level threads that can do very fast context switches.
The application can have greater control
over the number of kernel-level threads.
The application can control the scheduling of the user-level threads.
Much of the advantage of the one-to-one model remains: the ability to exploit parallelism on a multiprocessor, and independent blocking - the ability of one user-level thread to block without forcing the other user-level threads to block.
Figure 4.10: Two-level model
The two-level model is a variation on the many-to-many model,
in which a user-level thread may be bound to a kernel-level thread.
It is difficult to implement the many-to-many model,
and few systems support it.
4.4 Thread Libraries
POSIX thread (Pthreads) implementations vary from system to system - they may be user-level or kernel-level.
Windows threads are kernel-level.
The Java thread API is typically implemented using a native thread package on the host system (e.g. Pthreads or Windows).
DEFINITION: In asynchronous threading, the parent creates
one or more child threads and then executes concurrently with them.
DEFINITION: In synchronous threading, the parent creates one or more child threads and waits for all the child threads to exit before resuming execution.
Section 4.4 contains three examples of synchronous threading in which a parent thread
creates a child thread to execute a function. The parent
blocks until the child has exited, and then the parent
resumes execution.
4.4.1 Pthreads
If a variable is declared in a program outside of any function, then it is automatically in memory shared by all threads of the process. So that makes it very easy to set up an area of shared memory.
A new thread is created with pthread_create(&tid, &attr, fname, paramPtr), where:
&tid is the address of a variable to store the id number of the new thread.
&attr is the address of a data structure containing attributes for the new thread to have.
fname is the name of the function where the new thread will begin execution.
paramPtr is a pointer to the parameter that will be passed to fname when the new thread executes fname.
Figure 4.11: Multithreaded C program using the Pthreads API
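Here is a minimal sketch in the spirit of Figure 4.11, assuming a POSIX system; the function name runner and the shared variable sum are illustrative choices, not required names. It also shows synchronous threading: the parent blocks in pthread_join() until the child exits.

    /* Run as: ./a.out 10  (sums the integers 1..10) */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    int sum;   /* declared outside any function: shared by all threads */

    void *runner(void *param) {       /* the fname the new thread starts in */
        int upper = atoi(param);
        sum = 0;
        for (int i = 1; i <= upper; i++)
            sum += i;
        pthread_exit(0);
    }

    int main(int argc, char *argv[]) {
        pthread_t tid;                 /* will hold the new thread's id */
        pthread_attr_t attr;           /* attributes for the new thread */

        if (argc != 2) return 1;
        pthread_attr_init(&attr);      /* use the default attributes */
        pthread_create(&tid, &attr, runner, argv[1]);  /* paramPtr = argv[1] */
        pthread_join(tid, NULL);       /* synchronous: parent waits for child */
        printf("sum = %d\n", sum);
        return 0;
    }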
4.4.2 Windows Threads
As with POSIX, in the Windows API,
if a variable is declared in a program, and it's outside
of any function, then it is automatically in memory
shared by all threads of a process. So that makes it
very easy to set up an area of shared memory.
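For comparison, here is a minimal sketch of the same summing task using the Windows API's CreateThread() and WaitForSingleObject(); the names Summer and Sum are illustrative, not required.

    #include <windows.h>
    #include <stdio.h>

    DWORD Sum;   /* global: shared by all threads in the process */

    DWORD WINAPI Summer(LPVOID param) {   /* thread start routine */
        DWORD upper = *(DWORD *)param;
        for (DWORD i = 1; i <= upper; i++)
            Sum += i;
        return 0;
    }

    int main(void) {
        DWORD threadId, upper = 10;
        HANDLE h = CreateThread(NULL, 0, Summer, &upper, 0, &threadId);
        WaitForSingleObject(h, INFINITE);  /* parent blocks until child exits */
        CloseHandle(h);
        printf("Sum = %lu\n", (unsigned long)Sum);
        return 0;
    }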
4.5 Implicit Threading
The basic idea of implicit threading is automation.
Developers and programmers identify tasks that can run in parallel,
and compilers and run-time libraries, or other software,
create and manage threads to perform those tasks.
4.5.1 Thread Pools
The main thread of a busy Internet server does not have time to
'personally' perform the service for each client,
because it needs to return immediately to the job of accepting
the incoming request from the next
client.
One way to deal with that problem is for the main thread
to spawn a child, a service thread, for each client.
The service thread then handles the
client request and terminates.
However it can take excessive time for the creation and termination
of the service threads, and if too many client requests come in too
fast, there may be too many service threads taxing system resources.
The idea of a thread pool is for the server to create a fixed number of service threads at process startup, and to allow them to 'stay alive' as long as the server operates. The service threads
that don't have anything to do are kept suspended. When a client
needs service, and if a service thread is available,
the main thread assigns one to the client. Otherwise the main
thread puts the client in a queue where it waits for a service
thread to become available.
When a service thread finishes with a client, it does not terminate.
It just 'goes back to the pool' and waits to be assigned to another
client.
Generally it takes less time for a server process to use an existing
thread to service a request than to create a brand
new service thread. Also,
no matter how much the server is flooded with client requests, the
number of service threads never exceeds the fixed
size of the pool.
"Thread pool architectures" are a form of implicit threading that make it
relatively easy for programmers to implement servers that utilize thread pools.
For example, there are Android, Windows, and Java APIs that provide high-level
commands for creating and managing thread pools.
Some thread pool architectures can monitor the frequency of client
requests and
dynamically adjust the size of the thread pool to match demand.
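The notes don't prescribe an implementation, but here is a minimal sketch of the thread-pool idea using Pthreads; POOL_SIZE, the request queue, and submit() are all invented for illustration. The pool threads are created once, sleep on a condition variable when idle, and 'go back to the pool' after each request.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define POOL_SIZE 4
    #define QUEUE_MAX 64

    typedef void (*task_fn)(int);

    static task_fn queue_fn[QUEUE_MAX];
    static int     queue_arg[QUEUE_MAX];
    static int     head, tail, count;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

    /* Each pool thread loops forever: take a task, run it, go back. */
    static void *worker(void *arg) {
        for (;;) {
            pthread_mutex_lock(&lock);
            while (count == 0)                 /* sleep until work arrives */
                pthread_cond_wait(&nonempty, &lock);
            task_fn fn = queue_fn[head];
            int a = queue_arg[head];
            head = (head + 1) % QUEUE_MAX;
            count--;
            pthread_mutex_unlock(&lock);
            fn(a);                             /* serve the request */
        }
        return NULL;
    }

    /* The main (acceptor) thread just enqueues a request and returns. */
    static void submit(task_fn fn, int arg) {
        pthread_mutex_lock(&lock);
        if (count < QUEUE_MAX) {               /* else: client must wait */
            queue_fn[tail] = fn;
            queue_arg[tail] = arg;
            tail = (tail + 1) % QUEUE_MAX;
            count++;
            pthread_cond_signal(&nonempty);    /* wake one pool thread */
        }
        pthread_mutex_unlock(&lock);
    }

    static void handle_client(int id) { printf("served client %d\n", id); }

    int main(void) {
        pthread_t pool[POOL_SIZE];
        for (int i = 0; i < POOL_SIZE; i++)    /* fixed pool, created once */
            pthread_create(&pool[i], NULL, worker, NULL);
        for (int id = 1; id <= 10; id++)
            submit(handle_client, id);
        sleep(1);      /* let the pool drain; exiting kills the pool */
        return 0;
    }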
4.5.1.1 Java Thread Pools
Java supports thread pools through the java.util.concurrent package (the Executors framework).
4.5.2 Fork Join
If programmers leave notations in the programs designating work
that can be performed in parallel, then implicit threading library
code can create (fork) child threads to do the work, and arrange
for the children to return results to the parent (join the parent)
when they terminate.
Figure 4.16: Fork-join parallelism
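Here is a minimal fork-join sketch with Pthreads (the Java library described in the next subsection automates this pattern); struct range and psum are illustrative names. The parent forks a child to sum half of an array, handles the other half itself, and then joins to combine the results.

    #include <pthread.h>
    #include <stdio.h>

    struct range { int *a; int n; long sum; };

    static void *psum(void *arg) {             /* the forked task */
        struct range *r = arg;
        r->sum = 0;
        for (int i = 0; i < r->n; i++)
            r->sum += r->a[i];
        return NULL;
    }

    int main(void) {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        struct range left = { a, 4, 0 };
        pthread_t t;

        pthread_create(&t, NULL, psum, &left); /* fork */
        long right = 0;
        for (int i = 4; i < 8; i++)            /* parent does the other half */
            right += a[i];
        pthread_join(t, NULL);                 /* join: collect child's result */
        printf("total = %ld\n", left.sum + right);
        return 0;
    }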
4.5.2.1 Fork Join in Java Java (since version 7) has a fork-join library that spawns threads to perform recursive calls in algorithms like Mergesort. Programmers "mark" sections of the source code where parallel work is possible, and the library code takes care of creating and managing threads to do the work.
Figure 4.17: Fork-join in Java
Figure 4.19: UML Class diagram for Java's fork-join
4.5.3 OpenMP (compiler add-on and API for C, C++, and FORTRAN)
A programmer can insert labels (compiler directives) in the code that identify certain sections that should be executed by parallel threads. The compiler responds to the labels by generating code that creates threads to execute those sections of code in parallel.
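A minimal OpenMP sketch in C: the #pragma line is the label, and the compiler generates all the thread-management code. It assumes an OpenMP-capable compiler, e.g. cc -fopenmp.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        /* Iterations may run on different threads, in parallel. */
        #pragma omp parallel for
        for (int i = 0; i < 8; i++)
            printf("iteration %d on thread %d\n", i, omp_get_thread_num());
        return 0;
    }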
4.5.4 Grand Central Dispatch comprises extensions to C, an API, and a run-time library. Like OpenMP, it provides parallel processing, although details of the implementation differ.
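A minimal GCD sketch, assuming Apple's toolchain (the ^{ ... } block syntax is Clang's blocks extension); using a dispatch group here is just one way to wait for the task to finish.

    #include <dispatch/dispatch.h>
    #include <stdio.h>

    int main(void) {
        dispatch_queue_t q =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
        dispatch_group_t g = dispatch_group_create();

        /* The block is a task that GCD may run on any of its pool threads. */
        dispatch_group_async(g, q, ^{ printf("hello from a GCD task\n"); });

        /* Wait for the task to complete before exiting. */
        dispatch_group_wait(g, DISPATCH_TIME_FOREVER);
        return 0;
    }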
4.5.5 Intel Thread Building Blocks is another
approach to implicit threading that relies on templates for
parallel structures and task scheduling, rather than special
compilers or language features.
4.6 Threading Issues This section summarizes "issues"
that have to be resolved by people who implement operating systems
that support multi-threading.
4.6.1 The fork() and exec() System Calls
When an application is multi-threaded,
should the fork()
system call duplicate all threads, or just the calling thread?
Some APIs provide both options.
Implementations of exec() typically replace the entire process
of the calling thread, including all threads in that process.
Therefore, if the child created by a
fork() is going to call exec() immediately, there's no point
in having the fork() duplicate all the threads in the process.
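A minimal sketch of the usual fork()-then-exec() pattern on UNIX, which is why duplicating every thread would be wasted work here; "ls" is just an example program to exec.

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();            /* the child needs only one thread */
        if (pid == 0) {
            /* exec replaces the entire process image, threads and all */
            execlp("ls", "ls", "-l", (char *)NULL);
            perror("execlp");          /* reached only if exec fails */
            return 1;
        }
        waitpid(pid, NULL, 0);         /* parent waits for the child */
        return 0;
    }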
4.6.2 Signal Handling
Signals are a simple form of interprocess communication in some operating systems, primarily versions of UNIX. Signals were designed to behave something like interrupts, but they are not interrupts.
The OS delivers and handles signals.
Delivering signals and
handling (responding to) signals are routine
tasks the OS performs as opportunities arise. Sometimes delivery
of a signal to a process (or thread) is required as part of the
OS performance of interrupt service, or a system call.
The OS delivers signals to a process (thread) by setting a
bit in a context variable of the process (thread). Just
before scheduling a process (thread) to execute, the OS checks
to see if any signals have been delivered to the process (thread)
that have not been handled yet. If so, the OS will cause the
signal to be handled properly. Sometimes it does this by
executing code in kernel mode, and sometimes it handles a
signal by jumping into the user process at the start address
of a special handler routine the process has for responding
to the signal.
The exact appropriate way of handling a signal depends on the
nature of the signal.
Multithreading complicates the problem of implementing signal
delivery.
Should a signal be delivered to all the threads in a
process or just some? Just the threads to which
the signal applies? There are many different kinds
of signals, and any of those possibilities may apply, depending on
the nature of the signal.
Often the handler for a signal should run only once. A signal sent
to a process may be delivered only to the first thread that is not
blocking it.
The OS may provide a function to send a signal to one particular
thread.
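In POSIX that function is pthread_kill(). A minimal sketch, with handler and work as illustrative names:

    #include <pthread.h>
    #include <signal.h>
    #include <unistd.h>

    static void handler(int sig) {
        /* printf is not async-signal-safe; write() is */
        write(1, "got SIGUSR1\n", 12);
    }

    static void *work(void *arg) {
        for (;;)
            pause();                     /* sleep until a signal arrives */
        return NULL;
    }

    int main(void) {
        struct sigaction sa;
        sa.sa_handler = handler;         /* the process's handler routine */
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGUSR1, &sa, NULL);

        pthread_t t;
        pthread_create(&t, NULL, work, NULL);
        sleep(1);
        pthread_kill(t, SIGUSR1);        /* deliver to this one thread only */
        sleep(1);
        return 0;
    }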
4.6.3 Thread Cancellation
Sometimes a thread starts work but it should be cancelled
before it finishes - for example if two threads are searching
a database for a record and one of them finds it, the other
thread should be cancelled.
Thread cancellation can be implemented in a manner similar to how
signals work. In fact it may be implemented using signals.
Since
problems could be caused by instantly
cancelling a thread in a task that is in the midst of doing some
work, the implementation of cancellation typically includes ways
for threads to defer their cancellation so that they have time
to 'clean up' first - for example to deallocate resources they
are holding, or to finish updating shared data.
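A minimal sketch of deferred cancellation using the Pthreads API; searcher and cleanup are illustrative names. The cancellation request takes effect only at a cancellation point, and the pushed cleanup handler runs before the thread dies.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void cleanup(void *arg) {
        printf("releasing resources before dying\n");
    }

    static void *searcher(void *arg) {
        /* PTHREAD_CANCEL_DEFERRED is the default cancelability type */
        pthread_cleanup_push(cleanup, NULL);
        for (;;) {
            /* ... examine one record ... */
            pthread_testcancel();     /* explicit cancellation point */
        }
        pthread_cleanup_pop(0);       /* never reached; pairs the push */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, searcher, NULL);
        sleep(1);                     /* pretend another thread found it */
        pthread_cancel(t);            /* a request, not instant termination */
        pthread_join(t, NULL);
        return 0;
    }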
4.6.4 Thread Local Storage
We have seen that in many systems, variables declared outside any function
are by default shared by all threads.
A thread may need private variables that are globally visible
(from any function). This is called thread local storage.
Most thread APIs provide
support for such thread local storage.
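A minimal sketch using C11's _Thread_local keyword (Pthreads offers the same facility via pthread_key_create(), and many compilers spell it __thread); counter is an illustrative name. Each thread sees its own copy of the variable, yet the variable is visible from any function.

    #include <pthread.h>
    #include <stdio.h>

    _Thread_local int counter;   /* global visibility, one copy per thread */

    static void *work(void *arg) {
        counter += (int)(long)arg;        /* touches only this thread's copy */
        printf("my counter = %d\n", counter);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, (void *)1L);
        pthread_create(&t2, NULL, work, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("main's counter = %d\n", counter);   /* still 0 */
        return 0;
    }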
4.6.5 Scheduler Activations
This section describes some rather arcane details of how
the relationship between user-level threads and kernel-level
threads may be implemented.
The "Scheduler Activation" scheme for simulating threads at the
user level relies to a great extent on the kernel communicating
about certain events to the application that is executing
the user-level threads. These communications come from the
kernel and go "up" to the application, and they are named
upcalls. Read the information, but I won't ask you for
any details.
Figure 4.20: Lightweight process (LWP)
4.7 Operating-System Examples
4.7.1 Windows Threads
Figure 4.21: Data structures of a Windows thread
In Windows, applications run as separate processes, which may have multiple
threads.
Windows uses the one-to-one model for mapping user-level threads to kernel-level threads.
Per-thread resources include an ID number, a register set, a user stack and a kernel stack (for the use of the OS when executing on behalf of the process, for example when executing a system call for the process), and private storage used by library code.
There are three primary data structures for holding the context
of threads: ETHREAD (executive thread block), KTHREAD
(kernel thread block), and TEB (thread environment block). The first
two reside in kernel memory, and the TEB is in user-space. It's interesting
to see this aspect of how the abstraction of a "Thread Control Block (TCB)" is
implemented in an actual operating system.
4.7.2 Linux Threads
Linux has a traditional fork() system call that creates an
exact duplicate of the parent.
Linux also has a clone() system call with flag parameters that determine what resources will be shared between parent and child (clone). If a large amount of context is shared between parent and clone, then the clone is about the same thing as what we have been calling a new thread inside the parent process.
On the other hand, if little or nothing is shared, then the clone is about the same as the child of a traditional fork() operation.
The clone can also be something "in between" the two extremes.
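A minimal Linux-only sketch of clone(); the flag combination below makes the clone share the parent's memory, open files, filesystem info, and signal handlers, so it behaves like a thread. The stack size and the child function are arbitrary illustrative choices.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    #define CHILD_STACK (1024 * 1024)

    static int child(void *arg) {
        printf("clone sharing the parent's address space\n");
        return 0;
    }

    int main(void) {
        char *stack = malloc(CHILD_STACK);
        /* With no CLONE_* sharing flags, this would resemble fork(). */
        int flags = CLONE_VM | CLONE_FILES | CLONE_FS | CLONE_SIGHAND | SIGCHLD;
        pid_t pid = clone(child, stack + CHILD_STACK, flags, NULL);
        waitpid(pid, NULL, 0);     /* wait for the clone to finish */
        free(stack);
        return 0;
    }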