C++ Concurrency in Action Notes

CHAPTER 4 Synchronizing concurrent operations

4.2 Waiting for one-off events with futures

4.2.1 Returning values from background tasks

4.2.2 Associating a task with a future

4.2.3 Making (std::)promises

4.2.4 Saving an exception for the future

If the function call invoked as part of std::async throws an exception, that exception is stored in the future in place of a stored value, the future becomes ready, and a call to get() rethrows that stored exception.
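
A minimal sketch of this behaviour (divide and the error message are illustrative):

#include <future>
#include <iostream>
#include <stdexcept>

double divide(double a, double b)
{
  if (b == 0.0)
    throw std::invalid_argument("division by zero");   // stored in the future
  return a / b;
}

int main()
{
  std::future<double> f = std::async(divide, 1.0, 0.0);
  try
  {
    f.get();                                            // rethrows the stored exception
  }
  catch (std::invalid_argument const& e)
  {
    std::cout << "caught: " << e.what() << '\n';
  }
}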

4.2.5 Waiting from multiple threads

4.3 Waiting with a time limit

4.4 Using synchronization of operations to simplify code

CHAPTER 5 The C++ memory model and operations on atomic types

5.1.3 Modification orders

https://en.cppreference.com/w/cpp/atomic/memory_order

5.2 Atomic operations and types in C++

5.2.1 The standard atomic types

The standard atomic types themselves might be emulated internally with a lock: they (almost) all have an is_lock_free() member function, which lets the user determine whether operations on a given type are done directly with atomic instructions (x.is_lock_free() returns true) or by using a lock internal to the compiler and library (x.is_lock_free() returns false).

The only type that doesn’t provide an is_lock_free() member function is std::atomic_flag. This type is a really simple Boolean flag, and operations on it are required to be lock-free.
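
For example (big is a purely illustrative type that is usually too large for native atomic instructions):

#include <atomic>
#include <cstdint>
#include <iostream>

struct big { std::uint64_t a, b, c; };

int main()
{
  std::atomic<int> ai(0);
  std::atomic<big> ab(big{});
  std::cout << std::boolalpha
            << ai.is_lock_free() << '\n'    // typically true: native atomic instructions
            << ab.is_lock_free() << '\n';   // typically false: emulated with an internal lock
}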

5.2.2 Operations on std::atomic_flag

Objects of type std::atomic_flag must be initialized with ATOMIC_FLAG_INIT .

#include <atomic>

class spinlock_mutex
{
  std::atomic_flag flag;

  public:
  spinlock_mutex(): flag(ATOMIC_FLAG_INIT) {}

  void lock()
  {
    // Spin until the previous value was false, i.e. this thread set the flag itself.
    while(flag.test_and_set(std::memory_order_acquire));
  }

  void unlock()
  {
    flag.clear(std::memory_order_release);
  }
};
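
Since spinlock_mutex provides lock() and unlock(), it meets the BasicLockable requirements and works with std::lock_guard; a small usage sketch (counter and worker are illustrative):

#include <mutex>     // std::lock_guard
#include <thread>
#include <vector>

spinlock_mutex m;    // the class defined above
long counter = 0;

void worker()
{
  for (int i = 0; i < 100000; ++i)
  {
    std::lock_guard<spinlock_mutex> guard(m);   // lock() on entry, unlock() on exit
    ++counter;
  }
}

int main()
{
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i)
    threads.emplace_back(worker);
  for (auto& t : threads)
    t.join();
}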

5.2.3 Operations on std::atomic<bool>

A common pattern with the atomic types: the assignment operators they support return values (of the corresponding non-atomic type) rather than references.

  • compare_exchange_weak() may fail spuriously: it can return false even though the stored value equals the expected value (see the retry-loop sketch below).
  • compare_exchange_strong() is guaranteed to return false only if the stored value wasn’t equal to the expected value.
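
The usual idiom is to call compare_exchange_weak() in a loop, since a spurious failure just updates expected and the loop retries; a sketch for std::atomic<bool>:

#include <atomic>

std::atomic<bool> b(false);

void set_flag()
{
  bool expected = false;
  // Retry on spurious failure; stop once the exchange succeeds or another
  // thread has already set the flag (expected becomes true in that case).
  while (!b.compare_exchange_weak(expected, true) && !expected)
    ;
}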

5.2.4 Operations on std::atomic<T*>: pointer arithmetic

  • fetch_add()
  • fetch_sub()
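
A short sketch of the semantics on std::atomic<T*> (fetch_add() and fetch_sub() return the old value; the += and -= forms return the new value):

#include <atomic>
#include <cassert>

int some_array[5];
std::atomic<int*> ptr(some_array);

void pointer_arithmetic()
{
  int* x = ptr.fetch_add(2);          // ptr now points at some_array[2]; x is the old value
  assert(x == some_array);
  x = (ptr -= 1);                     // ptr now points at some_array[1]; x is the new value
  assert(x == some_array + 1);
}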

5.2.5 Operations on standard atomic integral types

  • load() , store() , exchange() , compare_exchange_weak() , and compare_exchange_strong()
  • fetch_add() , fetch_sub() , fetch_and() , fetch_or() , fetch_xor() , compound-assignment forms of these operations ( += , -= , &= , |= , and ^= ), and pre- and post-increment and decrement ( ++x , x++ , -- x , and x-- )
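
A small sketch of the integral read-modify-write semantics (the fetch_ forms return the value before the operation; the compound-assignment forms return the new value):

#include <atomic>
#include <cassert>

void integral_ops()
{
  std::atomic<unsigned> flags(0);

  unsigned const old_flags = flags.fetch_or(0x4);  // returns the previous value
  assert(old_flags == 0 && flags.load() == 0x4);

  ++flags;                                         // atomic increment, yields the new value
  assert(flags.load() == 0x5);

  flags &= 0x4;                                    // compound assignment, also atomic
  assert(flags.load() == 0x4);
}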

5.2.6 The std::atomic<> primary class template

In order to use std::atomic<UDT> for some user-defined type UDT , this type must have a trivial copy-assignment operator. This means that the type must not have any virtual functions or virtual base classes and must use the compiler-generated copy-assignment operator. Not only that, but every base class and non-static data member of a user-defined type must also have a trivial copy-assignment operator. This essentially permits the compiler to use memcpy() or an equivalent operation for assignment operations, because there’s no user-written code to run.

Finally, the type must be bitwise equality comparable. This goes alongside the assignment requirements; not only must you be able to copy an object of type UDT using memcpy() , but you must be able to compare instances for equality using memcmp() . This guarantee is required in order for compare/exchange operations to work.

  • double-word-compare-and-swap ( DWCAS )
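
A sketch with an illustrative user-defined type that meets these requirements (trivially copy-assignable, no padding, bitwise comparable):

#include <atomic>

struct point      // trivially copyable, no virtual functions, no user-defined assignment
{
  int x;
  int y;
};

std::atomic<point> p(point{0, 0});   // allowed; may or may not be lock-free

void move_right()
{
  point expected = p.load();
  point desired;
  do
  {
    desired = expected;
    ++desired.x;
  } while (!p.compare_exchange_weak(expected, desired));  // compares bitwise, like memcmp
}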

5.2.7 Free functions for atomic operations

5.3 Synchronizing operations and enforcing ordering

5.3.1 The synchronizes-with relationship

5.3.2 The happens-before relationship

5.3.3 Memory ordering for atomic operations

  • sequentially consistent ordering ( memory_order_seq_cst )
  • acquire-release ordering ( memory_order_consume , memory_order_acquire , memory_order_release , and memory_order_acq_rel )
  • relaxed ordering ( memory_order_relaxed )

If a sequentially consistent load in one thread sees the value written by a sequentially consistent store in another thread, then any sequentially consistent atomic operations done after that load must also appear after the store to all other threads in the system that use sequentially consistent atomic operations.

Sequential consistency is the most straightforward and intuitive ordering, but it’s also the most expensive memory ordering because it requires global synchronization between all threads. On a multiprocessor system this may require quite extensive and time-consuming communication between processors.
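
A sketch in the spirit of the book's sequential-consistency example: because all operations share a single total order, at least one of the two readers must see both stores, so the final assert cannot fire.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> x(false), y(false);
std::atomic<int> z(0);

void write_x() { x.store(true, std::memory_order_seq_cst); }
void write_y() { y.store(true, std::memory_order_seq_cst); }

void read_x_then_y()
{
  while (!x.load(std::memory_order_seq_cst));
  if (y.load(std::memory_order_seq_cst)) ++z;
}

void read_y_then_x()
{
  while (!y.load(std::memory_order_seq_cst));
  if (x.load(std::memory_order_seq_cst)) ++z;
}

int main()
{
  std::thread a(write_x), b(write_y), c(read_x_then_y), d(read_y_then_x);
  a.join(); b.join(); c.join(); d.join();
  assert(z.load() != 0);   // never fires under sequential consistency
}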

NON-SEQUENTIALLY CONSISTENT MEMORY ORDERINGS

RELAXED ORDERING

ACQUIRE-RELEASE ORDERING

Under this ordering model, atomic loads are acquire operations ( memory_order_acquire ), atomic stores are release operations ( memory_order_release ), and atomic read-modify-write operations (such as fetch_add() or exchange() ) are either acquire, release, or both ( memory_order_acq_rel ).

Synchronization is pairwise, between the thread that does the release and the thread that does the acquire. A release operation synchronizes-with an acquire operation that reads the value written.
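
A sketch of the pairwise synchronization: the release store to y synchronizes-with the acquire load that sees it, so the earlier relaxed store to x is guaranteed to be visible afterwards.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> x(false), y(false);
std::atomic<int> z(0);

void write_x_then_y()
{
  x.store(true, std::memory_order_relaxed);
  y.store(true, std::memory_order_release);    // release: publishes the store to x
}

void read_y_then_x()
{
  while (!y.load(std::memory_order_acquire));  // acquire: pairs with the release above
  if (x.load(std::memory_order_relaxed)) ++z;  // guaranteed to see x == true
}

int main()
{
  std::thread a(write_x_then_y), b(read_y_then_x);
  a.join(); b.join();
  assert(z.load() == 1);   // never fires
}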

DATA DEPENDENCY WITH ACQUIRE-RELEASE ORDERING AND MEMORY_ORDER_CONSUME

There are two new relations that deal with data dependencies: dependency-ordered-before and carries-a-dependency-to.

One important use for this kind of memory ordering is where the atomic operation loads a pointer to some data. By using memory_order_consume on the load and memory_order_release on the prior store, you ensure that the pointed-to data is correctly synchronized, without imposing any synchronization requirements on any other nondependent data.
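
A sketch of pointer publication with memory_order_consume (most implementations strengthen consume to acquire, and its use is discouraged since C++17, but the dependency-ordering intent is shown): loads through the published pointer carry a dependency and are ordered; the unrelated atomic a is not.

#include <atomic>
#include <cassert>
#include <string>
#include <thread>

struct X
{
  int i;
  std::string s;
};

std::atomic<X*> p(nullptr);
std::atomic<int> a(0);

void create_x()
{
  X* x = new X;
  x->i = 42;
  x->s = "hello";
  a.store(99, std::memory_order_relaxed);
  p.store(x, std::memory_order_release);
}

void use_x()
{
  X* x;
  while (!(x = p.load(std::memory_order_consume)))
    std::this_thread::yield();
  assert(x->i == 42);       // ordered: carries a dependency from the load of p
  assert(x->s == "hello");  // ordered for the same reason
  // a.load(std::memory_order_relaxed) might still read 0: no dependency on p
}

int main()
{
  std::thread t1(create_x), t2(use_x);
  t1.join(); t2.join();
  delete p.load();
}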

5.3.4 Release sequences and synchronizes-with

5.3.5 Fences / memory barriers

The general idea with fences: if an acquire operation sees the result of a store that takes place after a release fence, the fence synchronizes-with that acquire operation; and if a load that takes place before an acquire fence sees the result of a release operation, the release operation synchronizes-with the acquire fence.
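
A sketch of the release/acquire fence pairing: the stores and loads themselves are relaxed, but the fences order them, so a reader that exits the loop is guaranteed to see x == true.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> x(false), y(false);
std::atomic<int> z(0);

void write_x_then_y()
{
  x.store(true, std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_release);   // release fence before the store to y
  y.store(true, std::memory_order_relaxed);
}

void read_y_then_x()
{
  while (!y.load(std::memory_order_relaxed));
  std::atomic_thread_fence(std::memory_order_acquire);   // acquire fence after the load of y
  if (x.load(std::memory_order_relaxed)) ++z;
}

int main()
{
  std::thread a(write_x_then_y), b(read_y_then_x);
  a.join(); b.join();
  assert(z.load() == 1);   // never fires: the fences provide the synchronizes-with edge
}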

5.3.6 Ordering nonatomic operations with atomics

CHAPTER 6 Designing lock-based concurrent data structures

6.1 What does it mean to design for concurrency?

  • thread-safe

6.1.1 Guidelines for designing data structures for concurrency

6.2 Lock-based concurrent data structures

6.2.1 A thread-safe stack using locks

6.2.2 A thread-safe queue using locks and condition variables

6.2.3 A thread-safe queue using fine-grained locks and condition variables

CHAPTER 7 Designing lock-free concurrent data structures

7.1 Definitions and consequences

7.1.1 Types of nonblocking data structures

A spin lock is nonblocking but not lock-free.

7.1.2 Lock-free data structures

For a data structure to qualify as lock-free, more than one thread must be able to access the data structure concurrently. If one of the threads accessing the data structure is suspended by the scheduler midway through its operation, the other threads must still be able to complete their operations without waiting for the suspended thread.

7.1.3 Wait-free data structures

A wait-free data structure is a lock-free data structure with the additional property that every thread accessing the data structure can complete its operation within a bounded number of steps, regardless of the behavior of other threads. Algorithms that can involve an unbounded number of retries because of clashes with other threads are thus not wait-free.
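
A sketch of the distinction (the helpers are hypothetical; it assumes fetch_add maps to a single atomic instruction on the target, as it does on most mainstream hardware):

#include <atomic>

std::atomic<int> counter(0);

// Wait-free (under the assumption above): every call completes in a bounded
// number of steps regardless of what other threads are doing.
void wait_free_increment()
{
  counter.fetch_add(1, std::memory_order_relaxed);
}

// Lock-free but not wait-free: a thread may lose the compare_exchange race
// an unbounded number of times, but some thread always makes progress.
void lock_free_double()
{
  int old_value = counter.load();
  while (!counter.compare_exchange_weak(old_value, old_value * 2))
    ;   // old_value was refreshed by the failed exchange; retry
}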

7.1.4 The pros and cons of lock-free data structures

  • The primary reason for using lock-free data structures is to enable maximum concurrency.

  • A second reason to use lock-free data structures is robustness. If a thread dies while holding a lock, that data structure is broken forever. But if a thread dies partway through an operation on a lock-free data structure, nothing is lost except that thread’s data; other threads can proceed normally.

  • The flip side here is that if you can’t exclude threads from accessing the data structure, then you must be careful to ensure that the invariants are upheld or choose alternative invariants that can be upheld. Also, you must pay attention to the ordering constraints you impose on the operations.

  • deadlocks are impossible with lock-free data structures, although there is the possibility of live locks instead.

    • A live lock occurs when two threads each try to change the data structure, but for each thread the changes made by the other require the operation to be restarted, so both threads loop and try again.
  • Although a lock-free data structure can increase the potential for concurrency of operations and reduce the time an individual thread spends waiting, it may well decrease overall performance:

    • First, the atomic operations used for lock-free code can be much slower than nonatomic operations, and there’ll likely be more of them in a lock-free data structure than in the mutex locking code for a lock-based data structure.
    • The hardware must synchronize data between threads that access the same atomic variables.

7.2 Examples of lock-free data structures

7.2.1 Writing a thread-safe stack without locks

7.2.3 Detecting nodes that can’t be reclaimed using hazard pointers

#include <atomic>
#include <stdexcept>
#include <thread>

unsigned const max_hazard_pointers=100;

// One slot per owning thread: which thread owns it, and which object it protects.
struct hazard_pointer
{
  std::atomic<std::thread::id> id;
  std::atomic<void*> pointer;
};

hazard_pointer hazard_pointers[max_hazard_pointers];

class hp_owner
{
  hazard_pointer* hp;

  public:
  hp_owner(hp_owner const&)=delete;
  hp_owner& operator=(hp_owner const&)=delete;

  hp_owner():
    hp(nullptr)
  {
    for(unsigned i=0;i<max_hazard_pointers;++i)
    {
      std::thread::id old_id;
      // Try to claim ownership of a hazard pointer
      if(hazard_pointers[i].id.compare_exchange_strong(
            old_id,std::this_thread::get_id()))
      {
        hp=&hazard_pointers[i];
        break;
      }
    }
    if(!hp)
    {
      throw std::runtime_error("No hazard pointers available");
    }
  }

  std::atomic<void*>& get_pointer()
  {
    return hp->pointer;
  }

  ~hp_owner()
  {
    hp->pointer.store(nullptr);
    hp->id.store(std::thread::id());
  }
};

std::atomic<void*>& get_hazard_pointer_for_current_thread()
{
  // Each thread has its own hazard pointer
  thread_local static hp_owner hazard;
  return hazard.get_pointer();
}
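
A minimal sketch of how a reader might use this facility to dereference a shared pointer safely (shared_data and read_protected are illustrative, not from the book): publish the pointer in your hazard-pointer slot, re-check that it is still current, and only then dereference; a reclaiming thread must first scan the hazard_pointers array and skip anything it finds there.

std::atomic<int*> shared_data(new int(42));

int read_protected()
{
  std::atomic<void*>& hp = get_hazard_pointer_for_current_thread();
  int* p;
  do
  {
    p = shared_data.load();
    hp.store(p);                       // announce: "I am about to dereference p"
  } while (p != shared_data.load());   // re-check in case p changed before the announcement
  int value = p ? *p : 0;              // safe: p cannot be reclaimed while it is hazarded
  hp.store(nullptr);                   // done: the node may be reclaimed again
  return value;
}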

7.2.4 Detecting nodes in use with reference counting

#include <atomic>
#include <memory>

template<typename T>
class lock_free_stack
{
  private:
    struct node;

    struct counted_node_ptr
    {
      int external_count;
      node* ptr;
    };

    struct node
    {
      std::shared_ptr<T> data;
      std::atomic<int> internal_count;
      counted_node_ptr next;
      node(T const& data_):
        data(std::make_shared<T>(data_)),
        internal_count(0)
      {}
    };

    std::atomic<counted_node_ptr> head;

    void increase_head_count(counted_node_ptr& old_counter)
    {
      counted_node_ptr new_counter;
      do
      {
        new_counter = old_counter;
        ++new_counter.external_count;
      }
      while(!head.compare_exchange_strong(old_counter, new_counter));
      old_counter.external_count = new_counter.external_count;
    }

  public:
    ~lock_free_stack()
    {
      while(pop());
    }

    void push(T const& data)
    {
      counted_node_ptr new_node;
      new_node.ptr = new node(data);
      new_node.external_count = 1;
      new_node.ptr->next = head.load();
      while (!head.compare_exchange_weak(new_node.ptr->next, new_node));
    }

    std::shared_ptr<T> pop()
    {
      counted_node_ptr old_head = head.load();
      for (;;)
      {
        increase_head_count(old_head);
        node* const ptr = old_head.ptr;
        if (!ptr)
        {
          return std::shared_ptr<T>();
        }

        if (head.compare_exchange_strong(old_head, ptr->next))
        {
          std::shared_ptr<T> res;
          res.swap(ptr->data);
          // -2 because the node is now off the list (one reference gone) and
          // this thread no longer accesses it through old_head (another gone).
          int const count_increase = old_head.external_count - 2;
          if(ptr->internal_count.fetch_add(count_increase) == -count_increase)
          {
            delete ptr;
          }
          return res;
        }
        else if(ptr->internal_count.fetch_sub(1) == 1)
        {
          delete ptr;
        }
      }
    }
};

7.2.5 Applying the memory model to the lock-free stack

#include <atomic>
#include <memory>

template<typename T>
class lock_free_stack
{
  private:
    struct node;
    struct counted_node_ptr
    {
      int external_count;
      node* ptr;
    };
    struct node
    {
      std::shared_ptr<T> data;
      std::atomic<int> internal_count;
      counted_node_ptr next;
      node(T const& data_):
        data(std::make_shared<T>(data_)),
        internal_count(0)
      {}
    };

    std::atomic<counted_node_ptr> head;

    void increase_head_count(counted_node_ptr& old_counter)
    {
      counted_node_ptr new_counter;
      do
      {
        new_counter = old_counter;
        ++new_counter.external_count;
      }
      while (!head.compare_exchange_strong(old_counter, new_counter,
            std::memory_order_acquire,
            std::memory_order_relaxed));
      old_counter.external_count = new_counter.external_count;
    }

  public:

    ~lock_free_stack()
    {
      while (pop());
    }

    void push(T const& data)
    {
      counted_node_ptr new_node;
      new_node.ptr = new node(data);
      new_node.external_count = 1;
      new_node.ptr->next = head.load(std::memory_order_relaxed);
      while (!head.compare_exchange_weak(new_node.ptr->next, new_node,
            std::memory_order_release,
            std::memory_order_relaxed));
    }

    std::shared_ptr<T> pop()
    {
      counted_node_ptr old_head = head.load(std::memory_order_relaxed);
      for (;;)
      {
        increase_head_count(old_head);
        node* const ptr = old_head.ptr;
        if (!ptr)
        {
          return std::shared_ptr<T>();
        }
        if (head.compare_exchange_strong(old_head, ptr->next, std::memory_order_relaxed))
        {
          std::shared_ptr<T> res;
          res.swap(ptr->data);
          int const count_increase = old_head.external_count - 2;
          if (ptr->internal_count.fetch_add(count_increase, std::memory_order_release) == -count_increase)
          {
            delete ptr;
          }
          return res;
        }
        else if (ptr->internal_count.fetch_add(-1, std::memory_order_relaxed) == 1)
        {
          // This thread holds the last reference: the acquire load synchronizes
          // with the release fetch_add in the successful branch before deleting.
          ptr->internal_count.load(std::memory_order_acquire);
          delete ptr;
        }
      }
    }
};

7.2.6 Writing a thread-safe queue without locks

// TODO

7.3 Guidelines for writing lock-free data structures

7.3.1 Guideline: use std::memory_order_seq_cst for prototyping

7.3.2 Guideline: use a lock-free memory reclamation scheme

  • Waiting until no threads are accessing the data structure and deleting all objects that are pending deletion
  • Using hazard pointers to identify that a thread is accessing a particular object
  • Reference counting the objects so that they aren’t deleted until there are no outstanding references

7.3.3 Guideline: watch out for the ABA problem

7.3.4 Guideline: identify busy-wait loops and help the other thread

CHAPTER 8 Designing concurrent code

8.1 Techniques for dividing work between threads

8.1.1 Dividing data between threads before processing begins

8.1.2 Dividing data recursively

8.1.3 Dividing work by task type

DIVIDING WORK BY TASK TYPE TO SEPARATE CONCERNS

DIVIDING A SEQUENCE OF TASKS BETWEEN THREADS

8.2 Factors affecting the performance of concurrent code

8.2.1 How many processors?

  • oversubscription
  • std::thread::hardware_concurrency()
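
A small illustrative helper (choose_thread_count is not from the book) for picking a thread count that avoids oversubscription:

#include <algorithm>
#include <thread>

unsigned choose_thread_count(unsigned max_useful_threads)
{
  // hardware_concurrency() may return 0 if the value is not computable.
  unsigned const hw = std::thread::hardware_concurrency();
  return std::min(hw != 0 ? hw : 2u, max_useful_threads);
}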

8.2.2 Data contention and cache ping-pong

  • high/low contention
  • cache ping-pong

The effects of contention with mutexes are usually different from the effects of contention with atomic operations for the simple reason that the use of a mutex naturally serializes threads at the operating system level rather than at the processor level. If you have enough threads ready to run, the operating system can schedule another thread to run while one thread is waiting for the mutex, whereas a processor stall prevents any threads from running on that processor. However, it will still impact the performance of those threads that are competing for the mutex; they can only run one at a time, after all.

Cache ping-pong effects can nullify the benefits of a single-writer, multiple-reader mutex if the workload is unfavorable, because all threads accessing the data (even reader threads) still have to modify the mutex itself. As the number of processors accessing the data goes up, the contention on the mutex itself increases, and the cache line holding the mutex must be transferred between cores, thus potentially increasing the time taken to acquire and release locks to undesirable levels.

8.2.3 False sharing

Processor caches deal in blocks of memory called cache lines rather than in individual bytes. If the data items in one cache line are unrelated but need to be accessed by different threads, this can be a major cause of performance problems: ownership of the line ping-pongs between cores even though the threads never touch the same data item, hence the term false sharing.
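
A common mitigation (a sketch; 64 bytes is a typical cache-line size, and C++17's std::hardware_destructive_interference_size can be used instead where available): pad or align per-thread data so that items updated by different threads never share a cache line.

#include <atomic>
#include <cstddef>

constexpr std::size_t cache_line_size = 64;   // assumption; see lead-in

// Each counter gets its own cache line, so updates from different threads
// don't invalidate each other's cached copies (no false sharing).
struct alignas(cache_line_size) padded_counter
{
  std::atomic<unsigned long> value{0};
};

padded_counter per_thread_counters[8];   // one per worker thread (size illustrative)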

8.2.4 How close is your data?

Data proximity: if the data accessed by a single thread is spread out in memory, it likely lies on separate cache lines; if it is close together, it is more likely to lie on the same cache line. Consequently, if data is spread out, more cache lines must be loaded from memory into the processor cache, which can increase memory access latency and reduce performance compared to data that is located close together.

8.2.5 Oversubscription and excessive task switching

8.3 Designing data structures for multithreaded performance

  • contention, false sharing, and data proximity.

8.3.1 Dividing array elements for complex operations

8.3.2 Data access patterns in other data structures

CHAPTER 9 Advanced thread management

9.1 Thread pools
