Is Parallel Programming Hard

Chapter 4. Tools of the Trade

4.2.5 Atomic Operations (gcc Classic)

  • compare-and-swap operation: __sync_bool_compare_and_swap() __sync_val_compare_and_swap()
  • The __sync_synchronize() primitive issues a “memory barrier”
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
#define READ_ONCE(x) ACCESS_ONCE(x)
#define WRITE_ONCE(x, val) ({ ACCESS_ONCE(x) = (val); })
#define barrier() __asm__ __volatile__("": : :"memory")
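
A minimal sketch of how the compare-and-swap builtin above can be used to build an atomic increment (the atomic_inc() helper, example_update(), and the counter variable are illustrative assumptions, not from the book):

static unsigned long counter;

/* CAS loop: retry until the swap from the observed old value succeeds. */
static void atomic_inc(unsigned long *p)
{
    unsigned long old;

    do {
        old = READ_ONCE(*p);    /* snapshot the current value */
    } while (!__sync_bool_compare_and_swap(p, old, old + 1));
}

/* Increment, then issue a full memory barrier. */
void example_update(void)
{
    atomic_inc(&counter);
    __sync_synchronize();
}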

4.2.6 Atomic Operations (C11)

4.2.8 Per-Thread Variables

  • pthread_key_create() pthread_key_delete() pthread_setspecific() pthread_getspecific()
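
A minimal compilable sketch of these POSIX per-thread-storage primitives (the key name, thread function, and thread count are illustrative):

#include <pthread.h>
#include <stdio.h>

static pthread_key_t my_key;

static void *thread_fn(void *arg)
{
    pthread_setspecific(my_key, arg);   /* value is private to this thread */
    printf("thread sees %ld\n", (long)pthread_getspecific(my_key));
    return NULL;
}

int main(void)
{
    pthread_t tid[2];

    pthread_key_create(&my_key, NULL);  /* NULL: no destructor at thread exit */
    for (long i = 0; i < 2; i++)
        pthread_create(&tid[i], NULL, thread_fn, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);
    pthread_key_delete(my_key);
    return 0;
}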

4.3 Alternatives to POSIX Operations

4.3.4 Accessing Shared Variables

4.3.4.1 Shared-Variable Shenanigans

  • Load tearing: the compiler uses multiple load instructions for a single access. (See the sketch after this list for how READ_ONCE() and WRITE_ONCE() help.)
  • Store tearing: the compiler uses multiple store instructions for a single access. But for properly aligned machine-sized stores, WRITE_ONCE() will prevent store tearing.
  • Load fusing: the compiler uses the result of a prior load from a given variable instead of repeating the load.
  • Store fusing: the compiler notices a pair of successive stores to a given variable with no intervening loads from that variable.
  • Code reordering
  • Invented loads
  • Invented stores
  • Store-to-load transformations: the compiler notices that a plain store might not actually change the value in memory.
  • Dead-code elimination: the compiler notices that the value from a load is never used, or when a variable is stored to, but never loaded from.
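
A minimal sketch of how READ_ONCE() and WRITE_ONCE() (defined in Section 4.2.5 above) guard against several of these shenanigans; the need_to_stop flag and the two functions are illustrative assumptions:

int need_to_stop;

/* Reader: without READ_ONCE(), the compiler may fuse the repeated loads
 * into one, so this loop could spin forever on a stale value. */
void wait_for_stop(void)
{
    while (!READ_ONCE(need_to_stop))
        barrier();      /* compiler barrier between iterations */
}

/* Writer: for this aligned, machine-sized store, WRITE_ONCE() prevents
 * store tearing and invented stores. */
void request_stop(void)
{
    WRITE_ONCE(need_to_stop, 1);
}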

4.3.4.2 A Volatile Solution

  1. Implementations are forbidden from tearing an aligned volatile access when machine instructions of that access’s size and type are available. -> avoid unnecessary load and store tearing.
  2. Implementations must not assume anything about the semantics of a volatile access, nor, for any volatile access that returns a value, about the possible set of values that might be returned. -> avoid optimizations that are inapplicable given that other processors might be concurrently accessing the location in question.
  3. Aligned machine-sized non-mixed-size volatile accesses interact naturally with volatile assembly-code sequences before and after. This is necessary because some devices must be accessed using a combination of volatile MMIO accesses and special-purpose assembly-language instructions.

To summarize, the volatile keyword:

  • can prevent load tearing and store tearing in cases where the loads and stores are machine-sized and properly aligned.
  • can also prevent load fusing, store fusing, invented loads, and invented stores.
  • However, although it does prevent the compiler from reordering volatile accesses with each other, it does nothing to prevent the CPU from reordering these accesses.
  • it does nothing to prevent either compiler or CPU from reordering nonvolatile accesses with each other or with volatile accesses.

4.3.4.3 Assembling the Rest of a Solution

Primitive   Prevents compiler reordering?   Prevents CPU reordering?
barrier()   yes                             no
smp_mb()    yes                             yes
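
A sketch of why the difference matters, using the message-passing pattern: barrier() alone would constrain only the compiler, whereas smp_mb() also keeps the CPU from reordering the accesses. The data/ready variables and do_something_with() are illustrative; smp_mb() is the Linux-kernel-style primitive from the table above:

int data;
int ready;

void producer(void)
{
    WRITE_ONCE(data, 42);
    smp_mb();                 /* order the store to data before the store to ready */
    WRITE_ONCE(ready, 1);
}

void consumer(void)
{
    while (!READ_ONCE(ready))
        continue;             /* spin until the flag is observed */
    smp_mb();                 /* order the load of ready before the load of data */
    do_something_with(READ_ONCE(data));   /* guaranteed to observe 42 */
}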

4.3.4.4 Avoiding Data Races

  • access shared variables only when holding a particular lock
  • access a given “shared” variable only from a given CPU or thread

4.3.6 Per-CPU Variables

DEFINE_PER_THREAD(type, name)
DECLARE_PER_THREAD(type, name)
per_thread(name, thread)
__get_thread_var(name)
init_per_thread(name, v)

Chapter 5. Counting


Quick Quiz 5.11: But why can’t CPU designers simply ship the addition operation to the data, avoiding the need to circulate the cache line containing the global variable being incremented?

  1. If the value of the variable is required, then the thread will be forced to wait for the operation to be shipped to the data, and then for the result to be shipped back.
  2. If the atomic increment must be ordered with respect to prior and/or subsequent operations, then the thread will be forced to wait for the operation to be shipped to the data, and for an indication that the operation completed to be shipped back.
  3. Shipping operations among CPUs will likely require more lines in the system interconnect, which will consume more die area and more electrical power.

5.2.3 Per-Thread-Variable-Based Implementation

Remember, when a thread exits, its per-thread variables disappear.


Quick Quiz 5.18: Doesn’t that explicit counterp array in Listing 5.4 reimpose an arbitrary limit on the number of threads? Why doesn’t the C language provide a per_thread() interface, similar to the Linux kernel’s per_cpu() primitive, to allow threads to more easily access each others’ per-thread variables?

To be fair, user-mode thread-local storage faces some challenges that the Linux kernel gets to ignore.

  • When a user-level thread exits, its per-thread variables all disappear, which complicates the problem of per-thread variable access, particularly before the advent of user-level RCU. In contrast, in the Linux kernel, when a CPU goes offline, that CPU’s per-CPU variables remain mapped and accessible.
  • When a new user-level thread is created, its per-thread variables suddenly come into existence. In contrast, in the Linux kernel, all per-CPU variables are mapped and initialized at boot time, regardless of whether the corresponding CPU exists yet, or indeed, whether the corresponding CPU will ever exist.
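
A rough sketch of the per-thread-counter pattern behind this quiz, using gcc __thread storage plus an explicit registry array. NR_THREADS, counterp[], and the omission of thread-exit cleanup are simplifying assumptions:

#define NR_THREADS 128                       /* assumed arbitrary limit */

static __thread unsigned long counter;       /* one counter per thread */
static unsigned long *counterp[NR_THREADS];  /* registry used by readers */

void count_register_thread(int idx)
{
    WRITE_ONCE(counterp[idx], &counter);     /* publish this thread's counter */
}

void inc_count(void)
{
    WRITE_ONCE(counter, counter + 1);        /* update is thread-local */
}

unsigned long read_count(void)               /* approximate sum across threads */
{
    unsigned long sum = 0;

    for (int i = 0; i < NR_THREADS; i++) {
        unsigned long *p = READ_ONCE(counterp[i]);

        if (p != NULL)
            sum += READ_ONCE(*p);
    }
    return sum;
}

The counterp[] registry is exactly what imposes the thread-count limit the quiz complains about, and a real implementation must also clean up the registry entry when a thread exits.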

5.2.4 Eventually Consistent Implementation

One way to retain update-side scalability while greatly improving read-side performance is to weaken consistency requirements.

5.3 Approximate Limit Counters

5.4 Exact Limit Counters

5.4.3 Signal-Theft Limit Counter Design


5.5 Parallel Counting Discussion


Quick Quiz 5.65: But if we are going to have to partition everything, why bother with shared-memory multithreading? Why not just partition the problem completely and run as multiple processes, each in its own address space?

advantages to shared-memory parallelism:

  1. Only the most performance-critical portions of the application must be partitioned, and such portions are usually a small fraction of the application.
  2. Although cache misses are quite slow compared to individual register-to-register instructions, they are typically considerably faster than inter-process communication primitives, which in turn are considerably faster than things like TCP/IP networking.
  3. Shared-memory multiprocessors are readily available and quite inexpensive, so, in stark contrast to the 1990s, there is little cost penalty for use of shared memory parallelism.

performance and scalability of these counting algorithms come from:

  1. partitioning over CPUs or threads
  2. batching so that more work can be done by each expensive synchronization operation
  3. weakening synchronization operations where feasible.


Chapter 6. Partitioning and Synchronization Design

partition first, batch second, weaken third, and code fourth.

6.1 Partitioning Exercises

6.1.1 Dining Philosophers Problem

  • This is an example of “horizontal parallelism” or “data parallelism”, so named because there is no dependency among the pairs of philosophers. In a horizontally parallel data-processing system, a given item of data would be processed by only one of a replicated set of software components.
  • “Horizontal parallelism” processes packets from different network connections in parallel, while “vertical parallelism” handles different protocol-processing steps for a given packet in parallel. “Vertical parallelism” is also called “pipelining”.

6.1.2 Double-Ended Queue

6.1.2.1 Left- and Right-Hand Locks


6.1.2.2 Compound Double-Ended Queue


The main complication arises when dequeuing from an empty queue, in which case it is necessary to:

  1. If holding the right-hand lock, release it and acquire the left-hand lock.
  2. Acquire the right-hand lock.
  3. Rebalance the elements across the two queues.
  4. Remove the required element if there is one.
  5. Release both locks.

6.1.2.3 Hashed Double-Ended Queue


6.1.2.4 Compound Double-Ended Queue Revisited

6.2 Design Criteria

  • Speedup
  • Contention
  • Work-to-Synchronization Ratio
  • Read-to-Write Ratio
  • Complexity
  1. The less time a program spends in critical sections, the greater the potential speedup.
  2. Contention effects will consume the excess CPU and/or wallclock time should the actual speedup be less than the number of available CPUs.
  3. If the available synchronization primitives have high overhead compared to the critical sections that they guard, the best way to improve speedup is to reduce the number of times that the primitives are invoked (perhaps by batching critical sections, using data ownership, using asymmetric primitives, or by moving toward a more coarse-grained design such as code locking).
  4. If the critical sections have high overhead compared to the primitives guarding them, the best way to improve speedup is to increase parallelism by moving to reader/writer locking, data locking, asymmetric, or data ownership.
  5. If the critical sections have high overhead compared to the primitives guarding them and the data structure being guarded is read much more often than modified, the best way to increase parallelism is to move to reader/writer locking or asymmetric primitives.
  6. Many changes that improve SMP performance, for example, reducing lock contention, also improve real-time latencies

6.3 Synchronization Granularity


6.3.1 Sequential Program

6.3.2 Code Locking

code locking is particularly prone to “lock contention”, where multiple CPUs need to acquire the lock concurrently.

6.3.3 Data Locking

What are some ways of preventing a structure from being freed while its lock is being acquired?

  1. Provide a statically allocated lock that is held while the per-structure lock is being acquired, which is an example of hierarchical locking. Of course, using a single global lock for this purpose can result in unacceptably high levels of lock contention, dramatically reducing performance and scalability.
  2. Provide an array of statically allocated locks, hashing the structure’s address to select the lock to be acquired. Given a hash function of sufficiently high quality, this avoids the scalability limitations of the single global lock, but in read-mostly situations, the lock-acquisition overhead can result in unacceptably degraded performance.
  3. Use a garbage collector, in software environments providing them, so that a structure cannot be deallocated while being referenced.
  4. As a special case of a garbage collector, use a global reference counter, or a global array of reference counters.
  5. Use hazard pointers, which can be thought of as an inside-out reference count.
  6. Use transactional memory (TM), so that each reference and modification to the data structure in question is performed atomically.
  7. Use RCU, which can be thought of as an extremely lightweight approximation to a garbage collector. Updaters are not permitted to free RCU-protected data structures that RCU readers might still be referencing.
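
A minimal sketch of approach 2 above, an array of statically allocated locks selected by hashing the structure’s address. LOCK_SHARDS, lock_for(), and the shift amount are illustrative assumptions:

#include <pthread.h>
#include <stdint.h>

#define LOCK_SHARDS 64

static pthread_mutex_t shard_lock[LOCK_SHARDS] = {
    [0 ... LOCK_SHARDS - 1] = PTHREAD_MUTEX_INITIALIZER   /* GCC range initializer */
};

static pthread_mutex_t *lock_for(void *addr)
{
    uintptr_t h = (uintptr_t)addr >> 6;   /* discard cache-line offset bits */

    return &shard_lock[h % LOCK_SHARDS];
}

The shard lock is held while acquiring the per-structure lock, so the structure cannot be freed out from under the acquirer; a reasonably good hash keeps contention on any one shard low.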

6.3.4 Data Ownership

Data ownership partitions a given data structure over the threads or CPUs, so that each thread/CPU accesses its subset of the data structure without any synchronization overhead whatsoever.

“embarrassingly parallel”

Another important instance of data ownership occurs when the data is read-only, in which case, all threads can “own” it via replication.

6.3.5 Locking Granularity and Performance


6.4 Parallel Fastpath


6.4.1 Reader/Writer Locking

Reader/writer locking is a simple instance of asymmetric locking.

6.4.2 Hierarchical Locking

6.4.3 Resource Allocator Caches

6.4.3.1 Parallel Resource Allocation Problem

6.4.3.2 Parallel Fastpath for Resource Allocation


6.4.3.8 Real-World Design


6.5 Beyond Partitioning

humiliatingly parallel: Adding threads significantly reduces the overall computational cost, resulting in large algorithmic superlinear speedups.

Chapter 7. Locking

7.1 Staying Alive

7.1.1 Deadlock

7.1.1.1 Locking Hierarchies

7.1.1.2 Local Locking Hierarchies

The golden rule in this case is “Release all locks before invoking unknown code.”

7.1.1.3 Layered Locking Hierarchies

7.1.1.4 Temporal Locking Hierarchies

defer acquisition of one of the conflicting locks.

  • Used by Linux kernel RCU

7.1.1.5 Locking Hierarchies and Pointers to Locks

7.1.1.6 Conditional Locking

The spin_trylock() primitive acquires the lock immediately if it is available (returning non-zero), and otherwise returns zero without acquiring the lock.
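
A sketch of the conditional-locking pattern built on spin_trylock() (kernel-style primitives; lock1, lock2, and the two helper functions are illustrative assumptions):

void do_both_things(void)
{
retry:
    spin_lock(&lock1);
    do_one_thing();
    if (!spin_trylock(&lock2)) {   /* acquiring blindly would violate the lock hierarchy */
        spin_unlock(&lock1);       /* back out completely... */
        goto retry;                /* ...and start over */
    }
    do_another_thing();
    spin_unlock(&lock2);
    spin_unlock(&lock1);
}

As noted next, unbounded retrying of this kind risks livelock or starvation.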


Unbounded retrying can result in “livelock” if no thread makes any forward progress, or in “starvation” if some threads make forward progress but others do not.


7.1.1.7 Acquire Needed Locks First

Acquiring all needed locks before doing any work is a variant of two-phase locking, which has seen long production use in transactional database systems. In the first phase of a two-phase locking transaction, locks are acquired but not released. Once all needed locks have been acquired, the transaction enters the second phase, where locks are released, but not acquired.

This locking approach allows databases to provide serializability guarantees for their transactions, in other words, to guarantee that all values seen and produced by the transactions are consistent with some global ordering of all the transactions.

7.1.1.8 Single-Lock-at-a-Time Designs

7.1.1.9 Signal/Interrupt Handlers

The trick is to block signals (or disable interrupts, as the case may be) when acquiring any lock that might be acquired within an interrupt handler. Furthermore, if holding such a lock, it is illegal to attempt to acquire any lock that is ever acquired outside of a signal handler without blocking signals.

Unfortunately, blocking and unblocking signals can be expensive in some operating systems, notably including Linux, so performance concerns often mean that locks acquired in signal handlers are only acquired in signal handlers, and that lockless synchronization mechanisms are used to communicate between application code and signal handlers.

7.1.2 Livelock and Starvation

Livelock and starvation are serious issues in software transactional memory implementations, and so the concept of contention manager has been introduced to encapsulate these issues. In the case of locking, simple exponential backoff can often address livelock and starvation. The idea is to introduce exponentially increasing delays before each retry.
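
A minimal sketch of exponential backoff applied to conditional locking, here with POSIX mutexes (the function name, the initial one-microsecond delay, and the one-millisecond cap are illustrative assumptions):

#include <pthread.h>
#include <unistd.h>

void acquire_both(pthread_mutex_t *l1, pthread_mutex_t *l2)
{
    useconds_t delay = 1;

    for (;;) {
        pthread_mutex_lock(l1);
        if (pthread_mutex_trylock(l2) == 0)
            return;                  /* success: both locks now held */
        pthread_mutex_unlock(l1);    /* back out to avoid deadlock */
        usleep(delay);               /* wait before retrying */
        if (delay < 1000)
            delay *= 2;              /* exponentially increasing delay */
    }
}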


Quick Quiz 7.16: How can the livelock shown in Listing 7.6 be avoided?

If a given critical section retries too many times, unconditionally acquire a global lock, then unconditionally acquire all the needed locks. This avoids both deadlock and livelock, and scales reasonably assuming that the global lock need not be acquired too often.


7.1.3 Unfairness


7.1.4 Inefficiency

Locks are implemented using atomic instructions and memory barriers, and often involve cache misses.


Quick Quiz 7.19: How might the lock holder be interfered with?

If the data protected by the lock is in the same cache line as the lock itself, then attempts by other CPUs to acquire the lock will result in expensive cache misses on the part of the CPU holding the lock. This is a special case of false sharing, which can also occur if a pair of variables protected by different locks happen to share a cache line. In contrast, if the lock is in a different cache line than the data that it protects, the CPU holding the lock will usually suffer a cache miss only on first access to a given variable. Of course, the downside of placing the lock and data into separate cache lines is that the code will incur two cache misses rather than only one in the uncontended case.


7.2 Types of Locks

7.2.1 Exclusive Locks


Quick Quiz 7.18: Does it ever make sense to have an exclusive lock acquisition immediately followed by a release of that same lock, that is, an empty critical section?

the semantics of exclusive locks have two components:

  • (1) the familiar data-protection semantic and
  • (2) a messaging semantic, where releasing a given lock notifies a waiting acquisition of that same lock. An empty critical section uses the messaging component without the data-protection component.

unconditionally acquiring an exclusive lock has two effects:

  1. Waiting for all prior holders of that lock to release it
  2. Blocking any other acquisition attempts until the lock is released.

strategies for partitioning lock acquisitions over time:

  1. Strict FIFO, with acquisitions starting earlier acquiring the lock earlier.
  2. Approximate FIFO, with acquisitions starting sufficiently earlier acquiring the lock earlier.
  3. FIFO within priority level, with higher-priority threads acquiring the lock earlier than any lower priority threads attempting to acquire the lock at about the same time, but so that some FIFO ordering applies for threads of the same priority.
  4. Random, so that the new lock holder is chosen randomly from all threads attempting acquisition, regardless of timing.
  5. Unfair, so that a given acquisition might never acquire the lock.

7.2.2 Reader-Writer Locks

Reader-writer locks permit any number of readers to hold the lock concurrently on the one hand or a single writer to hold the lock on the other. In theory, then, reader-writer locks should allow excellent scalability for data that is read often and written rarely. In practice, the scalability will depend on the reader-writer lock implementation.
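
A minimal compilable sketch with POSIX reader-writer locks (the config_value variable and function names are illustrative):

#include <pthread.h>

static pthread_rwlock_t rwl = PTHREAD_RWLOCK_INITIALIZER;
static long config_value;

long read_config(void)
{
    long v;

    pthread_rwlock_rdlock(&rwl);   /* many readers may hold the lock at once */
    v = config_value;
    pthread_rwlock_unlock(&rwl);
    return v;
}

void update_config(long v)
{
    pthread_rwlock_wrlock(&rwl);   /* excludes both readers and other writers */
    config_value = v;
    pthread_rwlock_unlock(&rwl);
}

If the rwlock implementation makes each reader atomically update a shared counter in the lock word, short read-side critical sections gain little scalability, which is why the implementation matters so much in practice.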

7.2.3 Beyond Reader-Writer Locks

example of a locking policy beyond reader-writer locks: the VAX/VMS distributed lock manager (DLM), which provides additional lock modes beyond simple read and write.

7.2.4 Scoped Locking

However, RAII locking also has a dark side. RAII makes it quite difficult to encapsulate lock acquisition and release, for example, in iterators.

Strict RAII locking also prohibits overlapping critical sections, due to the fact that scopes must nest.

7.3 Locking Implementation Issues

7.3.1 Sample Exclusive-Locking Implementation Based on Atomic Exchange

typedef int xchglock_t;
#define DEFINE_XCHG_LOCK(n) xchglock_t n = 0

void xchg_lock(xchglock_t *xp)
{
    /* Attempt to acquire by atomically exchanging in a 1; a return
     * value of 1 means some other thread already holds the lock. */
    while (xchg(xp, 1) == 1) {
        /* Spin on a plain load (hitting only the local cache)
         * until the lock appears free, then retry the xchg. */
        while (READ_ONCE(*xp) == 1)
            continue;
    }
}

void xchg_unlock(xchglock_t *xp)
{
    /* Release the lock by atomically storing a 0. */
    (void)xchg(xp, 0);
}


Quick Quiz 7.25: Why bother with the inner loop on lines 7–8 of Listing 7.9? Why not simply repeatedly do the atomic exchange operation on line 6?

Suppose that the lock is held and that several threads are attempting to acquire the lock. In this situation, if these threads all loop on the atomic exchange operation, they will ping-pong the cache line containing the lock among themselves, imposing load on the interconnect. In contrast, if these threads are spinning in the inner loop on lines 7–8, they will each spin within their own caches, placing negligible load on the interconnect.


7.3.2 Other Exclusive-Locking Implementations

  • based on atomic instructions:

    • Pros:
      • works well when contention is low
      • small memory footprint.
    • Cons: grants the lock to whichever thread grabs it first (never reserving it for a thread that cannot use it) -> unfairness or even starvation at high contention levels
  • ticket lock: avoids unfairness at high contention levels.

  • queued-lock: avoid high cache-invalidation overhead by assigning each thread a queue element.

    • The key point is that each thread spins on its own queue element, so that the lock holder need only invalidate the first element from the next thread’s CPU’s cache.
    • preferentially granting locks locally -> avoid starvation
    • Cons: increases overhead at low contention
  • priority inversion problem:

    • priority inheritance
    • prevent preemption while a lock is held
    • Linux: futexes
  • token-based mechanism

    • Pros: useful in cases where CPUs need periodic access to the critical section, but can tolerate variances in token-circulation rate.
    • Cons: a given CPU cannot necessarily acquire it immediately, even if no other CPU is using it at the moment.
  • Linux kernel now uses queued spinlocks.

7.4 Lock-Based Existence Guarantees

7.5 Locking: Hero or Villain?

Chapter 8. Data Ownership

8.1 Multiple Processes

8.2 Partial Data Ownership and pthreads

8.3 Function Shipping


Quick Quiz 8.5: What mechanisms other than POSIX signals may be used for function shipping?

  1. System V message queues.
  2. Shared-memory dequeue.
  3. Shared-memory mailboxes.
  4. UNIX-domain sockets.
  5. TCP/IP or UDP, possibly augmented by any number of higher-level protocols, including RPC, HTTP, XML, SOAP, and so on.

8.4 Designated Thread

8.5 Privatization

8.6 Other Uses of Data Ownership

Chapter 9. Deferred Processing

9.2 Reference Counting

9.3 Hazard Pointers

Rather than incrementing an integer stored in the data element, hazard pointers instead store a pointer to that data element in per-CPU (or per-thread) lists. Each element of these lists is called a hazard pointer.

Therefore, hazard-pointer readers must typically restart the full traversal in the face of a concurrent deletion. Often the restart must go back to some global (and thus immortal) pointer, but it is sometimes possible to restart at some intermediate location if that location is guaranteed to still be live, for example, due to the current thread holding a lock, a reference count, etc.

Because algorithms using hazard pointers might be restarted at any step of their traversal through the linked data structure, such algorithms must typically take care to avoid making any changes to the data structure until after they have acquired all the hazard pointers that are required for the update in question.
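
A sketch of the usual publish-then-recheck acquisition step implied above, using the Section 4.3 primitives. The struct node type, the hp_slot argument, and the helper name are assumptions:

struct node;   /* opaque element type for this sketch */

struct node *hp_acquire(struct node **pp, struct node **hp_slot)
{
    struct node *p;

    do {
        p = READ_ONCE(*pp);        /* candidate element */
        WRITE_ONCE(*hp_slot, p);   /* publish the hazard pointer */
        smp_mb();                  /* order publication before the re-check */
    } while (p != READ_ONCE(*pp)); /* changed underfoot: retry (or restart) */
    return p;
}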

These hazard-pointer restrictions result in great benefits to readers, courtesy of the fact that the hazard pointers are stored local to each CPU or thread, which in turn allows traversals to be carried out without any writes to the data structures being traversed.

hazard pointers enable the CPU caches to do resource replication, which in turn allows weakening of the parallel-access-control mechanism, thus boosting performance and scalability.

Another advantage of restarting hazard pointers traversals is a reduction in minimal memory footprint: Any object not currently referenced by some hazard pointer may be immediately freed.

9.4 Sequence Locks

The limitations of sequence locking are:

  1. Sequence locking restricts updates
  2. Sequence locking does not permit traversal of pointers to objects that might be freed by updaters.

Sequence locks allow writers to defer readers, but not vice versa. This can result in unfairness and even starvation in writer-heavy workloads
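
A sketch of the corresponding read-side retry loop, using the Linux-kernel-style seqlock API named read_seqbegin()/read_seqretry(); the seqlock and the two shared variables are illustrative:

seqlock_t my_seqlock;
unsigned long shared_a, shared_b;

void read_pair(unsigned long *a, unsigned long *b)
{
    unsigned int seq;

    do {
        seq = read_seqbegin(&my_seqlock);       /* snapshot the sequence number */
        *a = shared_a;                          /* speculative reads... */
        *b = shared_b;
    } while (read_seqretry(&my_seqlock, seq));  /* ...discarded if a writer intervened */
}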

9.5 Read-Copy Update (RCU)

Primitive              Purpose
rcu_read_lock()        Start an RCU read-side critical section.
rcu_read_unlock()      End an RCU read-side critical section.
rcu_dereference()      Safely load an RCU-protected pointer.
synchronize_rcu()      Wait for all pre-existing RCU read-side critical sections to complete.
call_rcu()             Invoke the specified function after all pre-existing RCU read-side critical sections complete.
rcu_assign_pointer()   Safely update an RCU-protected pointer.

9.5.1.3 Waiting for Readers

quiescent-state-based reclamation (QSBR)

9.5.1.5 RCU Properties

A key RCU property is that reads need not wait for updates. This property enables RCU implementations to provide low-cost or even no-cost readers, resulting in low overhead and excellent scalability.

9.5.2 RCU Fundamentals

9.5.2.1 Publish-Subscribe Mechanism

  • rcu_assign_pointer(): publishing pointers
  • rcu_dereference(): subscribing to pointers

rcu_assign_pointer() must be atomic in the sense that concurrent readers see either the old value of the pointer or the new value of the pointer, but not some mash-up of these two values. These requirements are met by the C11 store-release operation, and in fact in the Linux kernel, rcu_assign_pointer() is defined in terms of smp_store_release(), which is similar to C11 store-release.
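
A sketch of publish and subscribe with these primitives. The struct foo type, the gp pointer, and the use of malloc() are illustrative (kernel code would use kmalloc(); error checking is omitted):

#include <stdlib.h>

struct foo {
    int a;
};
struct foo *gp;                   /* RCU-protected global pointer */

void publish_foo(int a)
{
    struct foo *p = malloc(sizeof(*p));

    p->a = a;                     /* initialize before publication */
    rcu_assign_pointer(gp, p);    /* store-release: readers see an initialized p */
}

int read_foo_a(void)
{
    struct foo *p;
    int a = -1;

    rcu_read_lock();              /* begin read-side critical section */
    p = rcu_dereference(gp);      /* subscribe to the pointer */
    if (p != NULL)
        a = p->a;
    rcu_read_unlock();
    return a;
}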

9.5.2.2 Wait For Pre-Existing RCU Readers

RCU’s wait-for-readers guarantee therefore has two parts:

  1. If any part of a given RCU read-side critical section precedes the beginning of a given grace period, then the entirety of that critical section precedes the end of that grace period.
  2. If any part of a given RCU read-side critical section follows the end of a given grace period, then the entirety of that critical section follows the beginning of that grace period.
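
A sketch of how an updater relies on this guarantee (kernel-style list and lock primitives; foo_lock and the list linkage inside struct foo are illustrative assumptions):

void delete_foo(struct foo *p)
{
    spin_lock(&foo_lock);
    list_del_rcu(&p->list);   /* unlink: new readers can no longer find p */
    spin_unlock(&foo_lock);
    synchronize_rcu();        /* wait for all pre-existing readers to finish */
    free(p);                  /* no reader can still hold a reference */
}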

9.5.2.3 Maintain Multiple Versions of Recently Updated Objects

Because these synchronization-free readers provide very weak temporal synchronization, RCU users compensate via spatial synchronization.

  • One way to resolve this strange situation is via weaker semantics.
  • If this outcome is problematic, another way to resolve this situation is through use of stronger synchronization mechanisms, such as reader-writer locking, or clever use of timestamps and versioning.

9.5.2.4 Summary of RCU Fundamentals

  1. A publish-subscribe mechanism for adding new data featuring rcu_assign_pointer() for update-side publication and rcu_dereference() for read-side subscription.
  2. A way of waiting for pre-existing RCU readers to finish based on readers being delimited by rcu_read_lock() and rcu_read_unlock() on the one hand and updaters waiting via synchronize_rcu() or call_rcu() on the other.
  3. A discipline of maintaining multiple versions to permit change without harming or unduly delaying concurrent RCU readers.

9.5.3 RCU Linux-Kernel API

// TODO

9.5.4 RCU Usage

// TODO

Chapter 10. Data Structures


Quick Quiz 10.4: Given the negative scalability of the Schrödinger’s Zoo application across sockets, why not just run multiple copies of the application, with each copy having a subset of the animals and confined to run on a single socket?

"sharding"


10.4 Non-Partitionable Data Structures

10.4.1 Resizable Hash Table Design


10.4.4 Other Resizable Hash Tables


10.6 Micro-Optimization

10.6.3 Hardware Considerations

rearranging structures to accommodate cache geometry:

  1. Place read-mostly data far from frequently updated data. For example, place read-mostly data at the beginning of the structure and frequently updated data at the end. Place data that is rarely accessed in between.
  2. If the structure has groups of fields such that each group is updated by an independent code path, separate these groups from each other. Again, it can be helpful to place rarely accessed data between the groups. In some cases, it might also make sense to place each such group into a separate structure referenced by the original structure.
  3. Where possible, associate update-mostly data with a CPU, thread, or task.
  4. Partition your data on a per-CPU, per-thread, or per-task basis

rules of thumb deal with locks:

  1. Given a heavily contended lock protecting data that is frequently modified, take one of the following approaches: (a) Place the lock in a different cacheline than the data that it protects. (b) Use a lock that is adapted for high contention, such as a queued lock. (c) Redesign to reduce lock contention. (This approach is best, but is not always trivial.)
  2. Place uncontended locks into the same cache line as the data that they protect. This approach means that the cache miss that brings the lock to the current CPU also brings its data.
  3. Protect read-mostly data with hazard pointers, RCU, or, for long-duration critical sections, reader-writer locks.
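
A sketch combining rules 1 and 2 of the lock list with the structure-layout advice above: an uncontended lock lives with the frequently updated fields it protects, while read-mostly fields sit on a separate cache line. The struct, its fields, and the 64-byte line size are assumptions:

#include <pthread.h>

struct conn {
    /* read-mostly fields: replicate freely across CPUs' caches */
    int peer_id;
    int flags;
    char pad[64 - 2 * sizeof(int)];   /* keep the hot fields off this cache line */

    /* frequently updated fields, together with the (uncontended) lock
     * that protects them: one cache miss brings in both lock and data */
    pthread_mutex_t lock;
    unsigned long bytes_sent;
    unsigned long bytes_rcvd;
};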

Chapter 14. Advanced Synchronization

14.1 Avoiding Locks

14.2 Memory Barriers

Appendix B. “Toy” RCU Implementations

B.3 Simple Counter-Based RCU


B.4 Starvation-Free Counter-Based RCU


B.5 Scalable Counter-Based RCU


B.6 Scalable Counter-Based RCU With Shared Grace Periods


B.7 RCU Based on Free-Running Counter


B.8 Nestable RCU Based on Free-Running Counter


B.9 RCU Based on Quiescent States


B.10 Summary of Toy RCU Implementations

Appendix C. Why Memory Barriers?

C.1 Cache Structure

  • capacity miss
  • associativity miss
  • write miss
  • communication miss

C.2 Cache-Coherence Protocols

C.2.1 MESI States

  • MESI stands for “modified”, “exclusive”, “shared”, and “invalid”

  • A line in the “modified” state has been subject to a recent memory store from the corresponding CPU, and the corresponding memory is guaranteed not to appear in any other CPU’s cache.

  • "exclusive": the cache line has not yet been modified by the corresponding CPU, which in turn means that the copy of the cache line’s data that resides in memory is up-to-date.

  • A line in the “shared” state might be replicated in at least one other CPU’s cache, so that this CPU is not permitted to store to the line without first consulting with other CPUs.

  • A line in the “invalid” state is empty, in other words, it holds no data.

C.2.2 MESI Protocol Messages

  • Read: The “read” message contains the physical address of the cache line to be read.
  • Read Response: The “read response” message contains the data requested by an earlier “read” message. This “read response” message might be supplied either by memory or by one of the other caches.
  • Invalidate: The “invalidate” message contains the physical address of the cache line to be invalidated. All other caches must remove the corresponding data from their caches and respond.
  • Invalidate Acknowledge: A CPU receiving an “invalidate” message must respond with an “invalidate acknowledge” message after removing the specified data from its cache.
  • Read Invalidate: The “read invalidate” message contains the physical address of the cache line to be read, while at the same time directing other caches to remove the data. A “read invalidate” message requires both a “read response” and a set of “invalidate acknowledge” messages in reply.
  • Writeback: The “writeback” message contains both the address and the data to be written back to memory (and perhaps “snooped” into other CPUs’ caches along the way). This message permits caches to eject lines in the “modified” state as needed to make room for other data.

C.2.3 MESI State Diagram

C.2.4 MESI Protocol Example

C.3 Stores Result in Unnecessary Stalls

C.3.1 Store Buffers

With the addition of these store buffers, CPU 0 can simply record its write in its store buffer and continue executing. When the cache line does finally make its way from CPU 1 to CPU 0, the data will be moved from the store buffer to the cache line.


  • Quick Quiz B.7: But if the main purpose of store buffers is to hide acknowledgment latencies in multiprocessor cache-coherence protocols, why do uniprocessors also have store buffers?

Because memory is much slower than is cache on uniprocessors, store buffers on uniprocessors can help to hide write-miss latencies.


These store buffers are local to a given CPU or, on systems with hardware multithreading, local to a given core. Either way, a given CPU is permitted to access only the store buffer assigned to it.

C.3.2 Store Forwarding

“store forwarding”: each CPU refers to (or “snoops”) its store buffer as well as its cache when performing loads. In other words, a given CPU’s stores are directly forwarded to its subsequent loads, without having to pass through the cache.

C.3.3 Store Buffers and Memory Barriers

The memory barrier smp_mb() will cause the CPU to flush its store buffer before applying each subsequent store to its variable’s cache line. The CPU could either simply stall until the store buffer was empty before proceeding, or it could use the store buffer to hold subsequent stores until all of the prior entries in the store buffer had been applied.

C.4 Store Sequences Result in Unnecessary Stalls

Because each store buffer must be relatively small, a CPU executing even a modest sequence of stores, or stores following a memory barrier, can fill its store buffer, so the CPU must once again wait for invalidations to complete in order to drain its store buffer before it can continue executing.

C.4.1 Invalidate Queues

the CPU need not actually invalidate the cache line before sending the acknowledgement. It could instead queue the invalidate message with the understanding that the message will be processed before the CPU sends any further messages regarding that cache line.

C.4.2 Invalidate Queues and Invalidate Acknowledge

A CPU with an invalidate queue may acknowledge an invalidate message as soon as it is placed in the queue, instead of having to wait until the corresponding line is actually invalidated. Of course, the CPU must refer to its invalidate queue when preparing to transmit invalidation messages—if an entry for the corresponding cache line is in the invalidate queue, the CPU cannot immediately transmit the invalidate message; it must instead wait until the invalidate-queue entry has been processed.

C.4.3 Invalidate Queues and Memory Barriers

C.5 Read and Write Memory Barriers

a “read memory barrier” marks only the invalidate queue and a “write memory barrier” marks only the store buffer, while a full-fledged memory barrier does both.

C.6 Example Memory-Barrier Sequences

C.6.1 Ordering-Hostile Architecture

This hardware must obey the following ordering constraints:

  1. Each CPU will always perceive its own memory accesses as occurring in program order.
  2. CPUs will reorder a given operation with a store only if the two operations are referencing different locations.
  3. All of a given CPU’s loads preceding a read memory barrier (smp_rmb()) will be perceived by all CPUs to precede any loads following that read memory barrier.
  4. All of a given CPU’s stores preceding a write memory barrier (smp_wmb()) will be perceived by all CPUs to precede any stores following that write memory barrier.
  5. All of a given CPU’s accesses (loads and stores) preceding a full memory barrier (smp_mb()) will be perceived by all CPUs to precede any accesses following that memory barrier.

// TODO: examples

C.7 Memory-Barrier Instructions For Specific CPUs

C.7.3 ARMv7-A/R

  1. DMB (data memory barrier) causes the specified type of operations to appear to have completed before any subsequent operations of the same type. In addition, ARM allows cache coherence to have one of three scopes: single processor, a subset of the processors (“inner”) and global (“outer”).
  2. DSB (data synchronization barrier) causes the specified type of operations to actually complete before any subsequent operations (of any type) are executed.
  3. ISB (instruction synchronization barrier) flushes the CPU pipeline, so that all instructions following the ISB are fetched only after the ISB completes.

ARM also implements control dependencies, so that if a conditional branch depends on a load, then any store executed after that conditional branch will be ordered after the load. However, loads following the conditional branch will not be guaranteed to be ordered unless there is an ISB instruction between the branch and the load.

C.7.4 IA64

IA64 offers a weak consistency model, so that in absence of explicit memory-barrier instructions, IA64 is within its rights to arbitrarily reorder memory references.

  • memory-fence instruction: mf

C.7.9 x86

C.8 Are Memory Barriers Forever?

C.9 Advice to Hardware Designers

problems:

  1. I/O devices that ignore cache coherence.
  2. External busses that fail to transmit cache-coherence data.
  3. Device interrupts that ignore cache coherence.
  4. Inter-processor interrupts (IPIs) that ignore cache coherence.
  5. Context switches that get ahead of cache coherence.
  6. Overly kind simulators and emulators.