Cplusplus Concurrency In Action Notes - yszheda/wiki GitHub Wiki
If the function call invoked as part of `std::async` throws an exception, that exception is stored in the future in place of a stored value, the future becomes ready, and a call to `get()` rethrows that stored exception.
https://en.cppreference.com/w/cpp/atomic/memory_order
In actual fact, the standard atomic types themselves might use such emulation: they (almost) all have an `is_lock_free()` member function, which allows the user to determine whether operations on a given type are done directly with atomic instructions (`x.is_lock_free()` returns `true`) or done by using a lock internal to the compiler and library (`x.is_lock_free()` returns `false`).

The only type that doesn't provide an `is_lock_free()` member function is `std::atomic_flag`. This type is a really simple Boolean flag, and operations on this type are required to be lock-free.

Objects of type `std::atomic_flag` must be initialized with `ATOMIC_FLAG_INIT`.
```cpp
class spinlock_mutex
{
    std::atomic_flag flag;
public:
    spinlock_mutex() : flag(ATOMIC_FLAG_INIT) {}
    void lock()
    {
        // spin until test_and_set() returns false, i.e. we were the one to set the flag
        while (flag.test_and_set(std::memory_order_acquire));
    }
    void unlock()
    {
        flag.clear(std::memory_order_release);
    }
};
```
A common pattern with the atomic types: the assignment operators they support return values (of the corresponding nonatomic type) rather than references.
- `compare_exchange_weak()`: may fail spuriously, i.e. the exchange can fail even when the stored value was equal to the expected value.
- `compare_exchange_strong()`: guaranteed to return `false` only if the actual value wasn't equal to the expected value.
- pointer arithmetic on atomic pointer types: `fetch_add()`, `fetch_sub()`
- operations available on all atomic types: `load()`, `store()`, `exchange()`, `compare_exchange_weak()`, and `compare_exchange_strong()`
- additional operations on atomic integral types: `fetch_add()`, `fetch_sub()`, `fetch_and()`, `fetch_or()`, `fetch_xor()`, the compound-assignment forms of these operations (`+=`, `-=`, `&=`, `|=`, and `^=`), and pre- and post-increment and decrement (`++x`, `x++`, `--x`, and `x--`)
In order to use `std::atomic<UDT>` for some user-defined type `UDT`, this type must have a trivial copy-assignment operator. This means that the type must not have any virtual functions or virtual base classes and must use the compiler-generated copy-assignment operator. Not only that, but every base class and non-static data member of a user-defined type must also have a trivial copy-assignment operator. This essentially permits the compiler to use `memcpy()` or an equivalent operation for assignment operations, because there's no user-written code to run.

Finally, the type must be bitwise equality comparable. This goes alongside the assignment requirements; not only must you be able to copy an object of type `UDT` using `memcpy()`, but you must be able to compare instances for equality using `memcmp()`. This guarantee is required in order for compare/exchange operations to work.
- double-word-compare-and-swap (DWCAS)
- sequentially consistent ordering (`memory_order_seq_cst`)
- acquire-release ordering (`memory_order_consume`, `memory_order_acquire`, `memory_order_release`, and `memory_order_acq_rel`)
- relaxed ordering (`memory_order_relaxed`)
Under sequentially consistent ordering, if one thread's load sees the value written by a store in another thread, then any sequentially consistent atomic operations done after that load must also appear after the store to other threads in the system using sequentially consistent atomic operations.
Sequential consistency is the most straightforward and intuitive ordering, but it’s also the most expensive memory ordering because it requires global synchronization between all threads. On a multiprocessor system this may require quite extensive and time-consuming communication between processors.
Under this ordering model, atomic loads are acquire operations (`memory_order_acquire`), atomic stores are release operations (`memory_order_release`), and atomic read-modify-write operations (such as `fetch_add()` or `exchange()`) are either acquire, release, or both (`memory_order_acq_rel`).
Synchronization is pairwise, between the thread that does the release and the thread that does the acquire. A release operation synchronizes-with an acquire operation that reads the value written.
There are two new relations that deal with data dependencies: dependency-ordered-before and carries-a-dependency-to.
One important use for this kind of memory ordering is where the atomic operation loads a pointer to some data. By using `memory_order_consume` on the load and `memory_order_release` on the prior store, you ensure that the pointed-to data is correctly synchronized, without imposing any synchronization requirements on any other nondependent data.
- `std::kill_dependency`: https://en.cppreference.com/w/cpp/atomic/kill_dependency
- `std::atomic_thread_fence`: https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence
The general idea with fences: if an acquire operation sees the result of a store that takes place after a release fence, the fence synchronizes-with that acquire operation; and if a load that takes place before an acquire fence sees the result of a release operation, the release operation synchronizes-with the acquire fence.
- thread-safe

A spin lock is nonblocking but not lock-free.
For a data structure to qualify as lock-free, more than one thread must be able to access the data structure concurrently. If one of the threads accessing the data structure is suspended by the scheduler midway through its operation, the other threads must still be able to complete their operations without waiting for the suspended thread.
A wait-free data structure is a lock-free data structure with the additional property that every thread accessing the data structure can complete its operation within a bounded number of steps, regardless of the behavior of other threads. Algorithms that can involve an unbounded number of retries because of clashes with other threads are thus not wait-free.
- The primary reason for using lock-free data structures is to enable maximum concurrency.
- A second reason to use lock-free data structures is robustness. If a thread dies while holding a lock, that data structure is broken forever. But if a thread dies partway through an operation on a lock-free data structure, nothing is lost except that thread's data; other threads can proceed normally.
- The flip side here is that if you can't exclude threads from accessing the data structure, then you must be careful to ensure that the invariants are upheld or choose alternative invariants that can be upheld. Also, you must pay attention to the ordering constraints you impose on the operations.
- Deadlocks are impossible with lock-free data structures, although there is the possibility of live locks instead.
  - A live lock occurs when two threads each try to change the data structure, but for each thread the changes made by the other require the operation to be restarted, so both threads loop and try again.
- Although lock-free code can increase the potential for concurrency of operations on a data structure and reduce the time an individual thread spends waiting, it may well decrease overall performance.
  - First, the atomic operations used for lock-free code can be much slower than nonatomic operations, and there'll likely be more of them in a lock-free data structure than in the mutex locking code for a lock-based data structure.
  - Second, the hardware must synchronize data between threads that access the same atomic variables.
Hazard pointer allocation: each thread claims one entry from a fixed-size global array.

```cpp
#include <atomic>
#include <stdexcept>
#include <thread>

unsigned const max_hazard_pointers = 100;

struct hazard_pointer
{
    std::atomic<std::thread::id> id;
    std::atomic<void*> pointer;
};
hazard_pointer hazard_pointers[max_hazard_pointers];

class hp_owner
{
    hazard_pointer* hp;
public:
    hp_owner(hp_owner const&) = delete;
    hp_owner& operator=(hp_owner const&) = delete;
    hp_owner():
        hp(nullptr)
    {
        for (unsigned i = 0; i < max_hazard_pointers; ++i)
        {
            std::thread::id old_id;
            // Try to claim ownership of a hazard pointer:
            // succeeds only if the entry's id is still default-constructed (unowned)
            if (hazard_pointers[i].id.compare_exchange_strong(
                    old_id, std::this_thread::get_id()))
            {
                hp = &hazard_pointers[i];
                break;
            }
        }
        if (!hp)
        {
            throw std::runtime_error("No hazard pointers available");
        }
    }
    std::atomic<void*>& get_pointer()
    {
        return hp->pointer;
    }
    ~hp_owner()
    {
        // Release the entry so another thread can claim it
        hp->pointer.store(nullptr);
        hp->id.store(std::thread::id());
    }
};

std::atomic<void*>& get_hazard_pointer_for_current_thread()
{
    // Each thread has its own hazard pointer
    thread_local static hp_owner hazard;
    return hazard.get_pointer();
}
```
Lock-free stack using split reference counts (an external count in the `counted_node_ptr` and an internal count in the node), with the default sequentially consistent ordering:

```cpp
#include <atomic>
#include <memory>

template<typename T>
class lock_free_stack
{
private:
    struct node;
    struct counted_node_ptr
    {
        int external_count;
        node* ptr;
    };
    struct node
    {
        std::shared_ptr<T> data;
        std::atomic<int> internal_count;
        counted_node_ptr next;
        node(T const& data_):
            data(std::make_shared<T>(data_)),
            internal_count(0)
        {}
    };
    std::atomic<counted_node_ptr> head;
    void increase_head_count(counted_node_ptr& old_counter)
    {
        counted_node_ptr new_counter;
        do
        {
            new_counter = old_counter;
            ++new_counter.external_count;
        }
        while (!head.compare_exchange_strong(old_counter, new_counter));
        old_counter.external_count = new_counter.external_count;
    }
public:
    lock_free_stack():
        head(counted_node_ptr{0, nullptr})  // initialize head to an empty state
    {}
    ~lock_free_stack()
    {
        while (pop());
    }
    void push(T const& data)
    {
        counted_node_ptr new_node;
        new_node.ptr = new node(data);
        new_node.external_count = 1;
        new_node.ptr->next = head.load();
        while (!head.compare_exchange_weak(new_node.ptr->next, new_node));
    }
    std::shared_ptr<T> pop()
    {
        counted_node_ptr old_head = head.load();
        for (;;)
        {
            increase_head_count(old_head);
            node* const ptr = old_head.ptr;
            if (!ptr)
            {
                return std::shared_ptr<T>();
            }
            if (head.compare_exchange_strong(old_head, ptr->next))
            {
                std::shared_ptr<T> res;
                res.swap(ptr->data);
                int const count_increase = old_head.external_count - 2;
                if (ptr->internal_count.fetch_add(count_increase) == -count_increase)
                {
                    delete ptr;
                }
                return res;
            }
            else if (ptr->internal_count.fetch_sub(1) == 1)
            {
                delete ptr;
            }
        }
    }
};
```
The same reference-counted stack, with relaxed memory orderings applied where full sequential consistency isn't needed:

```cpp
#include <atomic>
#include <memory>

template<typename T>
class lock_free_stack
{
private:
    struct node;
    struct counted_node_ptr
    {
        int external_count;
        node* ptr;
    };
    struct node
    {
        std::shared_ptr<T> data;
        std::atomic<int> internal_count;
        counted_node_ptr next;
        node(T const& data_):
            data(std::make_shared<T>(data_)),
            internal_count(0)
        {}
    };
    std::atomic<counted_node_ptr> head;
    void increase_head_count(counted_node_ptr& old_counter)
    {
        counted_node_ptr new_counter;
        do
        {
            new_counter = old_counter;
            ++new_counter.external_count;
        }
        while (!head.compare_exchange_strong(old_counter, new_counter,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed));
        old_counter.external_count = new_counter.external_count;
    }
public:
    lock_free_stack():
        head(counted_node_ptr{0, nullptr})  // initialize head to an empty state
    {}
    ~lock_free_stack()
    {
        while (pop());
    }
    void push(T const& data)
    {
        counted_node_ptr new_node;
        new_node.ptr = new node(data);
        new_node.external_count = 1;
        new_node.ptr->next = head.load(std::memory_order_relaxed);
        while (!head.compare_exchange_weak(new_node.ptr->next, new_node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed));
    }
    std::shared_ptr<T> pop()
    {
        counted_node_ptr old_head = head.load(std::memory_order_relaxed);
        for (;;)
        {
            increase_head_count(old_head);
            node* const ptr = old_head.ptr;
            if (!ptr)
            {
                return std::shared_ptr<T>();
            }
            if (head.compare_exchange_strong(old_head, ptr->next,
                                             std::memory_order_relaxed))
            {
                std::shared_ptr<T> res;
                res.swap(ptr->data);
                int const count_increase = old_head.external_count - 2;
                if (ptr->internal_count.fetch_add(count_increase,
                                                  std::memory_order_release) == -count_increase)
                {
                    delete ptr;
                }
                return res;
            }
            else if (ptr->internal_count.fetch_add(-1, std::memory_order_relaxed) == 1)
            {
                // acquire load pairs with the release fetch_add above
                // before the node is deleted
                ptr->internal_count.load(std::memory_order_acquire);
                delete ptr;
            }
        }
    }
};
```
// TODO
- Waiting until no threads are accessing the data structure and deleting all objects that are pending deletion
- Using hazard pointers to identify that a thread is accessing a particular object
- Reference counting the objects so that they aren’t deleted until there are no outstanding references
- oversubscription: `std::thread::hardware_concurrency()`
- high/low contention
- cache ping-pong
The effects of contention with mutexes are usually different from the effects of contention with atomic operations for the simple reason that the use of a mutex naturally serializes threads at the operating system level rather than at the processor level. If you have enough threads ready to run, the operating system can schedule another thread to run while one thread is waiting for the mutex, whereas a processor stall prevents any threads from running on that processor. However, it will still impact the performance of those threads that are competing for the mutex; they can only run one at a time, after all.
Cache ping-pong effects can nullify the benefits of a single-writer, multiple-reader mutex if the workload is unfavorable, because all threads accessing the data (even reader threads) still have to modify the mutex itself. As the number of processors accessing the data goes up, the contention on the mutex itself increases, and the cache line holding the mutex must be transferred between cores, thus potentially increasing the time taken to acquire and release locks to undesirable levels.
Processor caches deal in blocks of memory called cache lines; this works in your favor when adjacent data is used together, but if the data items in a cache line are unrelated and need to be accessed by different threads, this can be a major cause of performance problems (false sharing).
data proximity: if the data accessed by a single thread is spread out in memory, it’s likely that it lies on separate cache lines. On the flip side, if the data accessed by a single thread is close together in memory, it’s more likely to lie on the same cache line. Consequently, if data is spread out, more cache lines must be loaded from memory onto the processor cache, which can increase memory access latency and reduce performance compared to data that’s located close together.
- contention, false sharing, and data proximity.