TBB

Comparison

parallel_for

container

concurrent_vector

concurrent_queue

task_scheduler_init

Memory

thread num limit

#include "tbb/parallel_for.h"
#include "tbb/task_scheduler_init.h"
#define TBB_PREVIEW_GLOBAL_CONTROL 1
#include "tbb/global_control.h"

using namespace tbb;

void foo()
{
    // The following code could use up to 16 threads.
    task_scheduler_init tsi(16);
    parallel_for( . . . );
}

void bar()
{
    // The following code could use up to 8 threads.
    task_scheduler_init tsi(8);
    parallel_for( . . . );
}

int main()
{
    {
        const size_t parallelism = task_scheduler_init::default_num_threads();
        // total parallelism that TBB can utilize is cut in half for the dynamic extension
        // of the given scope, including calls to foo() and bar()
        global_control c(global_control::max_allowed_parallelism, parallelism/2);
        foo();
        bar();
    } // restore previous parallelism limitation, if one existed
}

Intel Threading Building Blocks

Chap.1 Why Threading Building Blocks?

Comparison with Raw Threads and MPI

Raw threads and MPI expose the control of parallelism at its lowest level. They represent the assembly languages of parallelism.

Comparison with OpenMP

Threading Building Blocks does away with OpenMP's explicit choice of loop-scheduling policy in favor of a single, automatic, divide-and-conquer approach to scheduling. Implemented with work stealing (a technique for moving tasks from loaded processors to idle ones), it compares favorably to dynamic or guided scheduling, but without the problems of a centralized dealer.

Chap.2 Thinking Parallel

Decomposition

Data Parallelism

  • embarrassingly parallel

Task Parallelism

  • can be embarrassingly parallel

Pipelining (Task and Data Parallelism Together)

Mixed Solutions

  • coarse-grained parallelism: the interactions between the tasks are infrequent
  • fine-grained parallelism: frequent interactions

Intel® Threading Building Blocks Tutorial

Chap.3 Parallelizing Simple Loops

3.2 parallel_for

3.2.3 Controlling Chunking

A rule of thumb is that grainsize iterations of operator() should take at least 100,000 clock cycles to execute.

  1. Set the grainsize parameter higher than necessary. The grainsize is specified in units of loop iterations. If you have no idea how many clock cycles an iteration might take, start with grainsize=100,000. The rationale is that each iteration normally requires at least one clock cycle. In most cases, step 3 will guide you to a much smaller value.
  2. Run your algorithm.
  3. Iteratively halve the grainsize parameter and see how much the algorithm slows down or speeds up as the value decreases.
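
A minimal sketch of this tuning procedure, with ApplyFoo standing in as a hypothetical per-element work function: the grainsize value is passed as the third argument of blocked_range, and simple_partitioner splits the range until each chunk holds at most that many iterations, so halving the value between runs changes the chunk size accordingly.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"

// Hypothetical per-element work; stands in for the real loop body.
void ApplyFoo(float* a, size_t i) { a[i] = a[i] * a[i] + 1.0f; }

void ParallelApplyFoo(float a[], size_t n, size_t grainsize) {
    // simple_partitioner never creates chunks larger than `grainsize` iterations,
    // so halving `grainsize` between runs halves the largest chunk size.
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, n, grainsize),
        [=](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                ApplyFoo(a, i);
        },
        tbb::simple_partitioner());
}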

Tip: A general rule of thumb for parallelizing loop nests is to parallelize the outermost one possible. The reason is that each iteration of an outer loop is likely to provide a bigger grain of work than an iteration of an inner loop.

3.2.4 Bandwidth and Cache Affinity

Using affinity_partitioner can significantly improve performance when:

  • The computation does a few operations per data access.
  • The data acted upon by the loop fits in cache.
  • The loop, or a similar loop, is re-executed over the same data.
  • There are more than two hardware threads available. If only two threads are available, the default scheduling in Intel® TBB usually provides sufficient cache affinity.

affinity_partitioner should be considered a tool, not a cure-all, when there is a low ratio of computations to memory accesses.
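
A hedged sketch of the reuse pattern that affinity_partitioner relies on; the sweep count and the per-element update are illustrative. The important point is that the same affinity_partitioner object must survive across the repeated executions of the loop, since it carries the iteration-to-thread mapping.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"
#include <vector>

void RepeatedSweeps(std::vector<float>& data, int sweeps) {
    tbb::affinity_partitioner ap;          // keeps the iteration-to-thread mapping
    for (int s = 0; s < sweeps; ++s) {
        tbb::parallel_for(
            tbb::blocked_range<size_t>(0, data.size()),
            [&](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    data[i] += 1.0f;       // few operations per data access
            },
            ap);                           // pass the same partitioner every sweep
    }
}

Declaring ap inside the sweep loop would compile, but it would discard the affinity information after every sweep, which defeats the purpose.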

3.2.5 Partitioner Summary

simple_partitioner can be useful in the following situations:

  • The subrange size for operator() must not exceed a limit. That might be advantageous, for example, if your operator() needs a temporary array proportional to the size of the range. With a limited subrange size, you can use an automatic variable for the array instead of having to use dynamic memory allocation.
  • A large subrange might use cache inefficiently. For example, suppose the processing of a subrange involves repeated sweeps over the same memory locations. Keeping the subrange below a limit might enable the repeated referenced memory locations to fit in cache.
  • You want to tune to a specific machine.

3.3 parallel_reduce

In general, the splitting constructor does two things:

  • Copy read-only information necessary to run the loop body.
  • Initialize the reduction variable(s) to the identity element of the operation(s).
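
A minimal sum reduction illustrating both points, assuming a plain float array as input: the splitting constructor copies only the read-only pointer and resets the running sum to 0, the identity of +, and join folds the partial result of the split-off body back in.

#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"

struct SumBody {
    const float* a;   // read-only input, copied by the splitting constructor
    float sum;        // reduction variable

    SumBody(const float* input) : a(input), sum(0) {}

    // Splitting constructor: copy read-only state, initialize sum to the identity of +.
    SumBody(SumBody& other, tbb::split) : a(other.a), sum(0) {}

    void operator()(const tbb::blocked_range<size_t>& r) {
        for (size_t i = r.begin(); i != r.end(); ++i)
            sum += a[i];
    }

    // Combine the partial sum computed by a body that was split off.
    void join(const SumBody& rhs) { sum += rhs.sum; }
};

float ParallelSum(const float* a, size_t n) {
    SumBody body(a);
    tbb::parallel_reduce(tbb::blocked_range<size_t>(0, n), body);
    return body.sum;
}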

Chap.4 Parallelizing Complex Loops

4.1 Cook Until Done: parallel_do

There are two ways that parallel_do can acquire work scalably.

  • The iterators can be random-access iterators.
  • The body argument to parallel_do, if it takes a second argument feeder of type parallel_do_feeder<Item>&, can add more work by calling feeder.add(item). For example, suppose processing a node in a tree is a prerequisite to processing its descendants. With parallel_do, after processing a node, you could use feeder.add to add the descendant nodes, as sketched below. The instance of parallel_do does not terminate until all items have been processed.
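
A sketch of that tree example, under the assumption of an illustrative Node type that owns pointers to its children and a stand-in Process() function:

#include "tbb/parallel_do.h"
#include <vector>

struct Node {                              // illustrative node type
    int value;
    std::vector<Node*> children;
};

void Process(Node* n) { n->value *= 2; }   // stand-in for the real per-node work

struct ProcessNode {
    // The second parameter lets the body hand new work to parallel_do.
    void operator()(Node* n, tbb::parallel_do_feeder<Node*>& feeder) const {
        Process(n);                        // the parent is processed first
        for (size_t i = 0; i < n->children.size(); ++i)
            feeder.add(n->children[i]);    // descendants become pending items
    }
};

void ProcessTree(Node* root) {
    std::vector<Node*> roots(1, root);
    tbb::parallel_do(roots.begin(), roots.end(), ProcessNode());
}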

Chap.6 Containers

Chap.7 Mutual Exclusion

7.1 Mutex Flavors

  • Scalable. Some mutexes are called scalable. In a strict sense, this is not an accurate name, because a mutex limits execution to one thread at a time. A scalable mutex is one that does not do worse than this. A mutex can do worse than serialize execution if the waiting threads consume excessive processor cycles and memory bandwidth, reducing the speed of threads trying to do real work. Scalable mutexes are often slower than non-scalable mutexes under light contention, so a non-scalable mutex may be better when contention is known to be low. When in doubt, use a scalable mutex.
  • Fair. Mutexes can be fair or unfair. A fair mutex lets threads through in the order they arrived. Fair mutexes avoid starving threads. Each thread gets its turn. However, unfair mutexes can be faster, because they let threads that are running go through first, instead of the thread that is next in line which may be sleeping on account of an interrupt.
  • Recursive. Mutexes can be recursive or non-recursive. A recursive mutex allows a thread that is already holding a lock on the mutex to acquire another lock on the mutex. This is useful in some recursive algorithms, but typically adds overhead to the lock implementation.
  • Yield or Block. This is an implementation detail that impacts performance. On long waits, an Intel® TBB mutex either yields or blocks. Here yielding means to repeatedly poll whether progress can be made and, if not, temporarily yield the processor (the yielding is implemented via SwitchToThread() on Microsoft Windows operating systems and by sched_yield() on other systems). To block means to yield the processor until the mutex permits progress. Use yielding mutexes if waits are typically short, and blocking mutexes if waits are typically long.
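
All of these flavors are used through the same scoped-lock idiom. A minimal sketch follows; the shared container and the choice of spin_mutex are just an example, and swapping in queuing_mutex or mutex changes the fairness and yield/block behavior without changing the structure of the code.

#include "tbb/spin_mutex.h"
#include <vector>

std::vector<int> shared_data;
tbb::spin_mutex data_mutex;   // unfair, non-recursive, yields on long waits

void Append(int x) {
    // scoped_lock acquires in its constructor and releases in its destructor,
    // so the critical section is exactly the lifetime of `lock`.
    tbb::spin_mutex::scoped_lock lock(data_mutex);
    shared_data.push_back(x);
}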

7.4 Lock Pathologies

7.4.1 Deadlock

deadlock happens when:

  • There is a cycle of threads
  • Each thread holds at least one lock on a mutex, and is waiting on a mutex for which the next thread in the cycle already has a lock.
  • No thread is willing to give up its lock.

Common ways to avoid deadlock are:

  • Avoid needing to hold two locks at the same time. Break your program into small actions in which each can be accomplished while holding a single lock.
  • Always acquire locks in the same order.
  • Use atomic operations instead of locks.

7.4.2 Convoying

Convoying occurs when the operating system interrupts a thread that is holding a lock. All other threads must wait until the interrupted thread resumes and releases the lock. Fair mutexes can make the situation even worse, because if a waiting thread is interrupted, all the threads behind it must wait for it to resume.

To minimize convoying, try to hold the lock as briefly as possible. Precompute whatever you can before acquiring the lock.

To avoid convoying, use atomic operations instead of locks where possible.

Chap.8 Atomic Operations

  • The advantage of atomic operations is that they are relatively quick compared to locks, and do not suffer from deadlock and convoying.
  • The disadvantage is that they support only a limited set of operations, and these are often not enough to synthesize more complicated operations efficiently.
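
A minimal sketch of replacing a lock with an atomic counter (the counter itself is illustrative); tbb::atomic supports only a small fixed set of operations such as fetch_and_add, fetch_and_increment, and compare_and_swap.

#include "tbb/atomic.h"

tbb::atomic<int> counter;                  // zero-initialized at namespace scope

void Record() {
    counter.fetch_and_increment();         // lock-free: no deadlock, no convoying
}

int AddBatch(int n) {
    // fetch_and_add returns the value the counter held before the addition.
    return counter.fetch_and_add(n);
}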

ABA problem

8.2 Memory Consistency

weak memory consistency

Chap.9 Timing

When measuring the performance of parallel programs, it is usually wall clock time, not CPU time, that matters.

#include "tbb/tick_count.h"
using namespace tbb;

tick_count t0 = tick_count::now();
... do some work ...
tick_count t1 = tick_count::now();
printf("work took %g seconds\n", (t1-t0).seconds());

Chap.10 Memory Allocation

  • Scalability. Problems of scalability arise when using memory allocators originally designed for serial programs, on threads that might have to compete for a single shared pool in a way that allows only one thread to allocate at a time. Use the memory allocator template scalable_allocator<T> to avoid such scalability bottlenecks. This template can improve the performance of programs that rapidly allocate and free memory.

  • False sharing. Problems of sharing arise when two threads access different words that share the same cache line. The problem is that a cache line is the unit of information interchange between processor caches. If one processor modifies a cache line and another processor reads (or writes) the same cache line, the cache line must be moved from one processor to the other, even if the two processors are dealing with different words within the line. False sharing can hurt performance because cache lines can take hundreds of clocks to move.

  • https://en.wikipedia.org/wiki/False_sharing

  • Avoiding and Identifying False Sharing Among Threads

  • Understanding False Sharing

  • False Sharing hits again!

  • Eliminate False Sharing

  • [!]Effective Concurrency: Eliminate False Sharing
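
A hedged sketch tying the two bullets above together: a container type that draws its storage from tbb::scalable_allocator, and per-thread counters padded to an assumed 64-byte cache line so that neighbouring counters never share a line.

#include "tbb/scalable_allocator.h"
#include <vector>

// Scalable allocation: threads allocate from per-thread pools instead of
// contending for a single shared heap lock.
typedef std::vector<double, tbb::scalable_allocator<double> > FastVector;

// Pad each per-thread counter out to a full cache line (64 bytes assumed here)
// so that two threads updating adjacent counters never touch the same line.
struct PaddedCounter {
    long value;
    char pad[64 - sizeof(long)];
};

PaddedCounter per_thread_hits[8];          // e.g. one slot per worker thread

void Hit(int thread_index) {
    ++per_thread_hits[thread_index].value; // no false sharing with neighbours
}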

10.2 Automatically Replacing malloc and Other C/C++ Functions for Dynamic Memory Allocation

# Set LD_PRELOAD so that loader loads release version of proxy
LD_PRELOAD=libtbbmalloc_proxy.so.2
# Link with release version of proxy and scalable allocator
g++ foo.o bar.o -ltbbmalloc_proxy -ltbbmalloc -o a.out

Chap.11 The Task Scheduler

11.1 Task-Based Programming

The threads you create with a threading package are logical threads, which map onto the physical threads of the hardware. For computations that do not wait on external devices, highest efficiency usually occurs when there is exactly one running logical thread per physical thread.

  • Undersubscription occurs when there are not enough running logical threads to keep the physical threads working.
  • Oversubscription occurs when there are more running logical threads than physical threads. Oversubscription usually leads to time sliced execution of logical threads.

The key advantage of tasks versus logical threads is that tasks are much lighter weight than logical threads. A task in Intel® Threading Building Blocks is typically a small routine and, unlike a logical thread, cannot be preempted at the task level (though its logical thread can be preempted).

Tasks in Intel® Threading Building Blocks are efficient too because the scheduler is unfair.

The scheduler does load balancing.
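
A brief sketch of task-based decomposition using tbb::task_group, one of the interfaces layered on the task scheduler; the Fibonacci recursion is the usual toy example. Each recursive call becomes a lightweight task rather than a logical thread, and work stealing keeps the physical threads busy without oversubscription.

#include "tbb/task_group.h"

long Fib(long n) {
    if (n < 2) return n;
    long x, y;
    tbb::task_group g;
    g.run([&] { x = Fib(n - 1); });        // spawned as a task, not a thread
    g.run([&] { y = Fib(n - 2); });
    g.wait();                              // blocks until both tasks finish
    return x + y;
}

A real program would add a sequential cutoff for small n, since each task still carries some scheduling overhead.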

11.2 When Task-Based Programming Is Inappropriate

However, if threads block frequently, there is a performance loss when using the task scheduler because while the thread is blocked, it is not working on any tasks. Blocking typically occurs while waiting for I/O or mutexes for long periods.

11.4 How Task Scheduling Works

Appendix A Costs of Time Slicing

  • The most obvious is the time for context switching between logical threads.
  • A more subtle cost is cache cooling.
  • Another cost is lock preemption. The effect is called convoying, because the threads end up “bumper to bumper” waiting for the preempted thread in front to resume driving.