Thread Pools

When Intel released the Pentium 4, the plan, as far as anyone could tell, was to use ultra-deep pipelining and high-bandwidth RAM to reach about 10-20 GHz. As it was, the P4 reached 3 GHz quickly, but the GHz race came to a halt at 3.8 GHz. At that point, the Pentium 4 had turned from a CPU into a space heater. The chip consumed 120 W, required insane amounts of cooling, and yet stalled its pipeline so often that it could not operate continuously at its rated speed.

The inability of the P4 to reach the expected speeds signalled that we had, at least for the foreseeable future, reached the end of getting more performance by ramping up CPU clock frequency. Well, if we can't have faster chips, then we'll use more of them. This was a simple solution, but the simplicity of the statement hid the incredible difficulty of its real-world implementation.

Up until about 2005, programmers could count on the computer manufacturers to deliver faster and faster chips. If your program was slow, just get a faster computer to run it on. Newer computers did the same thing as older ones, but faster. From 2005 onwards, this was no longer true. We had to re-structure our programs. Instead of a new computer being a "better old" computer, the new computer was akin to a bunch of old computers, wrapped up in a single metal box. Instead of having our program be a sequential list of instructions that would be executed on one of them, we had to re-structure it so it could be divided into several lists, to be executed on all of them. Herb Sutter detailed this fundamental change in his article The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software[a].

As I write this in 2010, we still haven't quite figured out how to do it well.

This article describes how I designed the thread pools used in my video processing toolkit.

1. Use Cases

Parallel processing requires that we can split our work into discrete, independent units. The use cases I found are:

  1. Independent: An independent task is a task that must not run in the calling context, but need not run immediately. An example would be a task that handles a request to a web server. The request must be handled, but it shouldn't be handled by the same thread that listens on the server socket.

  2. Opportunistic: This is a work unit that should be executed if there are idle resources in the system, but otherwise should run in the caller's context. An example is a CPU-intensive task that should be offloaded onto another processor core if one is idle.

  3. Scheduled: A scheduled task is simply a task that is set to run at a specified time in the future.

2. Resource Management

Multicore programming is tightly related to management of system-wide resources. When executing multiple work units, we are limited by the most scarce of resources. If we have multiple tasks that each require one gigabyte of RAM and one gigabyte of disk, then we will typically be limited by the size of the installed RAM, even though we may have disk to spare.

Whenever a parallel task stalls waiting for a resource it is essentially clogging up the work pipeline - it doesn't free its own resources, and it's not progressing either.

Therefore, we will classify the tasks according to the resource they are dependent on, to the extent that we can do so. Each class then gets a number of permits - this is the maximum number of tasks that can execute simultaneously in a given class. The classes I used are:

  CPU: Tasks limited by CPU power. This is for tasks that operate on in-memory data and perform no I/O. Permits: the number of processor cores in the system.

  IO: Tasks limited by local I/O bandwidth. This is for disk-intensive tasks. Permits: the number of I/O channels in the system.

  EMPTY: Pseudo-class with no permits. Used to force sequential execution through the completion mechanism. Permits: zero.

  GENERAL: A catch-all class for tasks that do a little bit of everything. Permits: two times the number of cores in the system.
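
These permit classes can be sketched with one counting semaphore per class. The enum below is illustrative rather than the toolkit's actual code; in particular, the number of I/O channels is assumed to be two:

```java
import java.util.concurrent.Semaphore;

// Illustrative sketch: one pool of permits per task class.
enum ThreadClass {
    CPU (Runtime.getRuntime ().availableProcessors ()),
    IO (2),      // assumption: two I/O channels
    EMPTY (0),   // no permits; forces sequential execution
    GENERAL (2 * Runtime.getRuntime ().availableProcessors ());

    private final Semaphore permits;

    ThreadClass (int permits) {
        this.permits = new Semaphore (permits);
    }

    // Claim a permit without blocking; false means "no capacity".
    boolean tryAcquire () {
        return permits.tryAcquire ();
    }

    void release () {
        permits.release ();
    }
}
```

Note that EMPTY never hands out a permit, so tasks in that class can only ever run through the completion mechanism.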

3. Operation

The thread pool has four operations: execute, executeImmediately, schedule and complete.

3.1. Future<T> execute (ThreadClass, Callable<T>)

Executes the task in the given thread class, or queues it for later execution.

3.2. Future<T> executeImmediately (ThreadClass, Callable<T>)

Executes the task in the given thread class, or in the calling thread, if no permits are available. This is used for opportunistic multithreading - execute in parallel if possible, but otherwise just execute sequentially. The big difference between this one and execute is that we know that the task is either executing or completed when the method returns, as opposed to execute where the task might be queued as well.
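
The fallback logic can be sketched as follows. The ImmediateExecutor class, its pool size and its single permit counter are all hypothetical simplifications; the real implementation tracks permits per thread class:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;
import java.util.concurrent.Semaphore;

class ImmediateExecutor {
    private final ExecutorService pool = Executors.newFixedThreadPool (4);
    private final Semaphore permits = new Semaphore (4);  // one class, four permits

    // Run in the pool if a permit is free, otherwise in the calling thread.
    <T> Future<T> executeImmediately (final Callable<T> task) {
        if (permits.tryAcquire ()) {
            return pool.submit (new Callable<T> () {
                public T call () throws Exception {
                    try {
                        return task.call ();
                    } finally {
                        permits.release ();
                    }
                }
            });
        }
        // No permit available: execute synchronously.
        // The task is guaranteed done when this method returns.
        FutureTask<T> ft = new FutureTask<T> (task);
        ft.run ();
        return ft;
    }
}
```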

3.3. void complete (Future<T>...)

Executes, in the calling thread, any of the given tasks that are not already completed and not currently executing elsewhere. After that, waits for all of them to complete. When this method returns, each future's value or exception has been set.
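
java.util.concurrent.FutureTask already supports this "steal the work" behaviour: its run method is a no-op if the task is already running or done. Assuming tasks are represented as FutureTasks, a sketch of complete could be:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

class Completion {
    // Sketch: run any not-yet-started task in the calling thread,
    // then wait for every task to finish.
    static void complete (FutureTask<?>... futures)
        throws InterruptedException, ExecutionException {
        for (FutureTask<?> future : futures) {
            future.run ();  // no-op if already running or done
            future.get ();  // blocks until done; rethrows the task's exception
        }
    }
}
```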

3.4. RepeatFuture<T> schedule (ThreadClass, Callable<T>, when, repeat)

Schedules a callable for execution at a given time. A separate scheduler thread in the thread pool will execute the callable as soon after the given start time as possible.
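
The standard library's ScheduledExecutorService offers the same one-shot and repeating behaviour. A minimal sketch, ignoring thread classes and using a plain ScheduledFuture in place of RepeatFuture (the Scheduler class and its method names are illustrative):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class Scheduler {
    // The separate scheduler thread mentioned above.
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor ();

    // One-shot: run the callable after the given delay.
    <T> ScheduledFuture<T> schedule (Callable<T> task, long delayMs) {
        return timer.schedule (task, delayMs, TimeUnit.MILLISECONDS);
    }

    // Repeating: a Runnable, since each run's result is discarded.
    ScheduledFuture<?> scheduleRepeating (Runnable task,
                                          long delayMs, long periodMs) {
        return timer.scheduleAtFixedRate (task, delayMs, periodMs,
                                          TimeUnit.MILLISECONDS);
    }
}
```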

4. Common Patterns

A couple of common patterns crystallized as development proceeded.

4.1. Submit and Complete / Fork-and-Join

This pattern consists of three steps:

  1. Create the work units in a for-loop and submit them as they are created.

  2. Put the resulting Futures in a collection.

  3. Complete all futures.

In Java pseudo-code, this would be:

List<Future<?>> futures = ...;
for (...) {
    Callable<?> callable = ...;
    Future<?> future = execute (callable);
    futures.add (future);
}

// Jump in and complete any future
// not executing in the thread pool
complete (futures);

This pattern appears when a big body of work can be split into lots of small parts, such as in any situation calling for a "parallel for"-loop. Since it is so useful, I've built a WorkSet class to encapsulate this pattern.
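
A WorkSet in that spirit might look like the following sketch. The class name comes from the article, but this particular API is my guess, not the toolkit's actual interface:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical sketch of a WorkSet: submit tasks, then join them all.
class WorkSet<T> {
    private final ExecutorService pool;
    private final List<Future<T>> futures = new ArrayList<Future<T>> ();

    WorkSet (ExecutorService pool) {
        this.pool = pool;
    }

    void submit (Callable<T> task) {
        futures.add (pool.submit (task));
    }

    // Wait for every submitted task; results come back in submission order.
    List<T> complete () throws InterruptedException, ExecutionException {
        List<T> results = new ArrayList<T> ();
        for (Future<T> future : futures) {
            results.add (future.get ());
        }
        return results;
    }
}
```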

4.2. Opportunistic

Opportunistic multithreading is when we attempt to submit a work unit to the thread pool, but execute it in the calling thread if the pool can't immediately accept it.

Callable<?> callable = ...;
Future<?> future = executeImmediately (callable);

... do other stuff ...

complete (future);

4.3. Thread Renaming

It is convenient to rename the thread-pool thread, for the duration of the work unit, to the name of the thread that submitted it. This helps benchmarking and profiling, as one can more easily see the degree of parallelism achieved. The Callables therefore tend to have this structure:

public abstract class ChildCallable<T>
    implements Callable<T> {

    private final String creatingThreadName;

    public ChildCallable () {
        this.creatingThreadName =
            Thread.currentThread ().getName ();
    }

    public T call () throws Exception {
        Thread thread = Thread.currentThread ();
        String oldThreadName = thread.getName ();
        thread.setName (creatingThreadName);
        try {
            return execute ();
        } finally {
            thread.setName (oldThreadName);
        }
    }

    protected abstract T execute () throws Exception;
}

The ChildCallable class can then be used almost as a drop-in replacement for Callable - the only difference is that one overrides the execute method instead of the call method.

5. Further Reading

A couple of sources of inspiration and knowledge:

  [a] Herb Sutter: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb's Journal, March 2005.

2010-11-17, updated 2014-11-11