Parallel Programming - yszheda/wiki GitHub Wiki
Courses
“Big Ideas”
- Scale “out”, not “up”
-
- Scale out: add more computers
-
- Scale up: more powerful computer
- Move processing to the data
-
- Cluster have limited bandwidth
- Process data sequentially, avoid random access
-
- Seeks are expensive, disk throughput is reasonable
- Seamless scalability
-
- From the mythical man-month to the tradable machine-hour
Parallel Programming Model
-
Shared Memory
-
- Pros: easy for data sharing and communication
-
- Cons: synchronization problem
-
Distributed Memory
-
- Pros: easy for programming
-
- Cons: communication cost
-
SPMD: Single Program Multiple Data
-
- programs have the necessary logic programmed into them to allow different tasks to branch or conditionally execute only parts of the program they are designed to execute
-
MPMD: Multiple Program Multiple Data
-
- Separate programs written for each processors
-
- Processes spawn from the master process
-
Pthread: shared memory, MPMD
-
MPI: distributed memory, SPMD
MPI
Synchronous vs. blocking
- From the view of the pair of communicated processes
-
- Synchronous communication --- sending and receiving data occurs simultaneously
-
- Asynchronous communication --- sending and receiving data occurs non-simultaneously
- From the view of individual function call
-
- Blocking --- has been used to describe routines that do not return until the transfer is completed
-
- Non-blocking --- has been used to describe routines that return whether or not the message had been received
- Synchronous vs. blocking:
-
- Synchronous comm. commonly implemented by blocking call
-
- Synchronous comm. intrinsically performs two action: Transfer Data & Synchronize Processes
Asynchronous/Non-Blocking Message Passing
- How message-passing routines can return before the message transfer has been completed?
-
- Generally, a message buffer needed between source and destination to hold message
-
- Message buffer is a memory space at the sender and/or receiver side
-
- For send routine, once the local actions have been completed and the message is safely on its way, the process can continue with subsequent work
Pthread
Threads vs. Processes
- Process (heavyweight process): complete separate program with its own variables, stack, heap, and everything else.
- Thread (lightweight process): share the same memory space for global variables, resources.
Shared Segment
-
Shared segment memory space can also be created between processes
-
POSIX API:
-
- shmget(): allocate a shared memory segment
-
- shmat(): attach a shared memory segment to a process
-
- shmdt(): detach a shared memory segment from a process
-
- shmctl(): deallocate a shared memory segment
-
System calls or library routines are called “Thread-Safe” if they can be called from multiple threads simultaneously and always produce correct results.
Synchronization Tools
The Critical-Section Problem: ensure that when one process is executing in its critical section, no other process is allowed to execute in its critical section.
Solution:
- Mutual Exclusion - If process Pi is executing in its critical section, then no other processes can be executing in their critical sections
- Progress - If no process is executing in its critical section and there exist some processes that wish to enter their critical section, then the selection of the processes that will enter the critical section next cannot be postponed indefinitely
- Bounded Waiting - A bound must exist on the number of times that other processes are allowed to enter their critical sections after a process has made a request to enter its critical section and before that request is granted
Peterson’s Solution
while (true) {
flag[i] = TRUE;
turn = j;
while ( flag[j] && turn == j);
// CRITICAL SECTION
flag[i] = FALSE;
// REMAINDER SECTION
}
Bakery Algorithm
do {
choosing[i] = true;
number[i] = max(number[0], number[1], ..., number [n - 1]) + 1;
choosing[i] = false;
for (j = 0; j < n; j++) {
while (choosing[j]) ;
while ((number[j] != 0)&&((number[j], j) < (number[i], i))) ;
}
// critical section
number[i] = 0;
// remainder section
} while (1);
Synchronization Hardware
Modern machines provide special atomic hardware instructions:
Atomic = non-interruptible
Either test memory word and set value or swap contents of two memory words.
boolean TestAndSet (boolean *target)
{
boolean rv = *target;
*target = TRUE;
return rv:
}
while (true) {
while ( TestAndSet (&lock )) ; // do nothing
// critical section
lock = FALSE;
// remainder section
}
void Swap (boolean *a, boolean *b)
{
boolean temp = *a;
*a = *b;
*b = temp:
}
while (true) {
key = TRUE;
while (key == TRUE)
Swap(&lock, &key);
// critical section
lock = FALSE;
// remainder section
}
Semaphore
Semaphore Implementation
wait (S){
value--;
if (value < 0) {
add this process to waiting queue;
block();
}
}
Signal (S){
value++;
if (value <= 0) {
remove a process P from the waiting queue;
wakeup(P);
}
}
Semaphore S; // initialized to 1
do{
wait (S);
// Critical Section
signal (S);
// remainder section;
} while(1);
Classical Problems of Synchronization
Bounded-Buffer Problem
- N buffers, each can hold one item
- Semaphore
mutex
initialized to the value 1 - Semaphore
full
initialized to the value 0 - Semaphore
empty
initialized to the value N.
// producer
while (true) {
// produce an item
wait(empty);
wait(mutex);
// add the item to the buffer
signal(mutex);
signal(full);
}
// consumer
while (true) {
wait(full);
wait(mutex);
// remove an item from buffer
signal(mutex);
signal(empty);
// consume the removed item
}
- C++11 non-blocking producer/consumer
- A multi-threaded Producer Consumer with C++11
- Multi producer/consumer queue (without Boost) in C++11
- C++11 non-blocking producer/consumer
Readers and Writers Problem
Allow multiple readers to read at the same time. Only one single writer can access the shared data at the same time.
- Semaphore
mutex
initialized to 1. - Semaphore
wrt
initialized to 1. - Integer
readcount
initialized to 0.
// writer
while (true) {
wait (wrt) ;
// writing is performed
signal (wrt) ;
}
// reader
while (true) {
wait(mutex);
readcount ++;
if (readercount == 1) wait(wrt);
signal(mutex);
// reading is performed
wait(mutex);
readcount --;
if (redacount == 0) signal(wrt);
signal(mutex);
}
Dining-Philosophers Problem
- Critical Section is a piece of code that can only be accessed by one process/thread at a time.
- Mutual exclusion is the problem to insure only one process/thread can be in a critical section.
Lock/Mutex
#include “pthread.h”
pthread_mutex mutex;
pthread_mutex_init (&mutex, NULL);
pthread_mutex_lock(&mutex); // enter critical section
pthread_mutex_unlock(&mutex); // leave critical section
pthread_mutex_destory(&mutex);
Condition Variables (CV)
- CV represent some condition that a thread can:
-
- Wait on, until the condition occurs; or
-
- Notify other waiting threads that the condition has occurred
- Three operations on condition variables:
-
- wait() --- Block until another thread calls signal() or boardcast() on the CV
-
- signal() --- Wake up one thread waiting on the CV
-
- broadcast() --- Wake up all threads waiting on the CV
All condition variable operation MUST be performed while a mutex is locked!!!
pthread_cond_t cond; pthread_mutex_t mutex;
pthread_cond_init (cond, NULL); pthread_mutex_init (mutex, NULL);
action() {
pthread_mutex_lock (&mutex)
while (x != 0)
pthread_cond_wait (cond, mutex);
pthread_mutex_unlock (&mutex);
take_action();
}
counter() {
thread_mutex_lock (&mutex)
x--;
if (x==0)
pthread_cond_signal (cond);
pthread_mutex_unlock (&mutex);
}
Semaphore
[My Conclusion]Semaphore控制有限资源访问数不超过资源数目。Mutex/Lock需要Lock拥有者释放,Semaphore不需要。
- a record of how many units of a particular resource are available
-
- If #record = 1 -> binary semaphore, mutex lock
-
- If #record > 1 -> counting semaphore
- accessed only through 2 atomic ops: wait & signal
#include <semaphore.h>
sem_t sem;
sem_init(&sem);
sem_wait(&sem);
// critical section
sem_post(&sem);
sem_destory(&sem);
Monitor
high-level抽象
Deadlock
pthread_mutex_trylock()
Cache Coherence
OpenMP
Fork-Join model: the master thread forks a specified number of slave threads and divides task among them
Working-Sharing Construct
- DO/for Directive: shares iterations of a loop across the team. Represents a type of "data parallelism" (类似SPMD)
- SECTIONS Directive: breaks work into separate, discrete sections of code. Each section is executed by a thread.
- SINGLE Directive
Definition:
- A work-sharing construct divides the execution of the enclosed code region among the threads that encounter it
- Work-sharing constructs DO NOT launch new threads
- There is no implied barrier upon entry to a work-sharing construct, however there is an implied barrier at the end of a work sharing construct
Synchronization Construct
Partitioning & Divide-And-Conquer Strategies
Examples
Bucket Sort
N-Body
Barnes-Hut Algorithm
- http://en.wikipedia.org/wiki/Barnes-Hut_simulation
- http://arborjs.org/docs/barnes-hut
- http://beltoforion.de/barnes_hut/barnes_hut_en.html
Orthogonal Recursive Bisection Method
Pipelined Computation
Examples
Insertion Sort
Prime Number Generation
每个node比较是否是当前prime的倍数
Linear Equation Solver
Synchronous Computation
Barrier Implementations
- Counter Implementation: 同步节点数
- Tree Implementation
- Butterfly Implementation
Counter Barrier Implementation
Counter-based barrier often have two phases:
- Arrival phase: a process enters arrival phase and does not leave this phase until all processes have arrived in this phase
- Departure phase: Processes are released after moving to the departure phase
Deadlock
A set of blocked processes each holding some resources and waiting to acquire a resource held by another process in the set
Necessary Conditions for Deadlock
- Mutual exclusion: only 1 process at a time can use a resource
- Hold & Wait: a process holding some resources and is waiting for another resource
- No preemption: a resource can be only released by a process voluntarily
- Circular wait: there exists a set {P0, P1, ..., Pn} of waiting processes such that P0 -> P1 -> P2 -> ... -> Pn -> P0
All four conditions must hold for possible deadlock!
Handling Deadlocks
- Ensure the system will never enter a deadlock state
-
- deadlock prevention: ensure that at least one of the 4 necessary conditions cannot hold
-
- deadlock avoidance: dynamically examines the resource-allocation state before allocation
- Allow to enter a deadlock state and then recover
-
- deadlock detection
-
- deadlock recovery
- Ignore the problem and pretend that deadlocks never occur in the system
-
- used by most operating systems, including UNIX.
Examples
- Prefix Sum
- System of Linear Equations
- Heat Distribution
// TODO
Load-Balancing & Termination Detection
Load-Balancing
Static Load-Balancing
Dynamic Load Balancing
Approaches
- Centralized Work Pool
- Decentralized Work Pool
- Fully Distributed Work Pool
Distributed Termination Detection
Termination at time t, must satisfy:
- Application-specific local termination conditions exist throughout the collection of processes
- There are no messages in transit between processes
// TODO
- Single-Pass Ring Termination Algorithm
- Dual-Pass Ring Termination Algorithm
Sorting
Parallel Bubble Sort: pipeline
- Sequential: O(n^2)
- Parallel: O(2n)
Merge Sort
- Sequential: O(nlogn)
- Parallel: O(2n)
Quick Sort
- Sequential: O(nlogn)
- Parallel: O(2n)
Bitonic Mergesort
- 两个序列,一个升序,一个降序。
- O(logn * logn)
Characteristic of Botonic Sequence
对相应元素做compare-and-exchange,仍得到Botonic Sequence,且其中一个Botonic Sequence的元素都小于另一个Botonic Sequence的的元素。
Rank Sort
Counting Sort
- If the numbers are integers and the range is known, Coding the rank sort algorithm to reduce the sequential time from O(n^2) to O(n)
- Use array c[] to count the histogram of values
- Complexity:
-
- Sequential: O(n + m)
-
- Parallel with n processors: O(logn)
linux cross-process locking
Optimistic concurrency control (OCC)
Transactional memory
Lock-Free Programming
- https://en.cppreference.com/w/cpp/atomic/atomic/is_lock_free
- Genuinely test std::atomic is lock-free or not
- https://en.wikipedia.org/wiki/Load-link/store-conditional
- https://en.wikipedia.org/wiki/ABA_problem
Hazard Pointer
- C++ Lock-free Hazard Pointer
- 实现无锁的栈与队列(5):Hazard Pointer
- Lock-Free Data Structures with Hazard Pointers
Atomics
- https://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2
- 簡介 C++11 atomic 和 memory order
Compare-and-Swap: The Mother of All RMWs
shared.compare_exchange_weak(T& expected, T desired, ...);
In general, the C++11 standard does not guarantee that atomic operations will be lock-free.
In general, any time you have a small amount of data protected by a mutex, and you can pack that data entirely into a 32- or 64-bit integer type, you can always convert your mutex-based operations into lock-free RMW operations
memory model
-
x86/64:
mfence
xchg
-
Linux kernel:
smp_mb
Explicit Compiler Barriers
compiler barriers are sufficient to prevent memory reordering on a single-processor system.
// full compiler barrier in GCC. In Microsoft Visual C++, [_ReadWriteBarrier](http://msdn.microsoft.com/en-us/library/f20w0x5e.aspx) serves the same purpose.
#define COMPILER_BARRIER() asm volatile("" ::: "memory")
Implied Compiler Barriers
- CPU fence instructions
- C++11 atomic
- function calls (excludes inline functions, functions declared with the pure attribute, and cases where link-time code generation is used.)
volatile
data types in C are not usually necessary in correctly-written multithreaded code.
Out-Of-Thin-Air Stores
C++11 standard
Compiler transformations that introduce assignments to a potentially shared memory location that would not be modified by the abstract machine are generally precluded by this standard.
What’s cool is that neither acquire nor release semantics requires the use of a #StoreLoad barrier, which is often a more expensive memory barrier type. e.g. on PowerPC:
lwsync
(“lightweight sync”) = #LoadLoad, #LoadStore and #StoreStoresync
= #StoreLoad (more expensive)
Weak Memory Models
*weakly-ordered / weak ordering / relaxed memory model processors:
- DEC Alpha: doesn't maintain data dependency ordering
- ARM
- PowerPC
- Itanium
Strong Memory Models
TSO (total store order): there is always a single, global order of writes to shared memory from all cores.
Sequential Consistency
no memory reordering
sequential consistency only really becomes interesting as a software memory model:
- In Java 5 and higher:
volatile
- In C++:
memory_order_seq_cst
memory fence / barrier
-
:star2: https://www.kernel.org/doc/Documentation/memory-barriers.txt
-
:star2: https://www.mjmwired.net/kernel/Documentation/memory-barriers.txt#305
enforce correct memory ordering on the processor by issuing any instruction which acts as a memory barrier:
- Certain inline assembly directives in GCC, such as the PowerPC-specific
asm volatile("lwsync" ::: "memory")
- Any Win32 Interlocked operation, except on Xbox 360
- Many operations on C++11 atomic types, such as
load(std::memory_order_acquire)
- Operations on POSIX mutexes, such as
pthread_mutex_lock
LoadStore instruction reordering might happen on certain processors if, say, there is a cache miss on the load followed by a cache hit on the store.
A StoreLoad barrier ensures that all stores performed before the barrier are visible to other processors, and that all loads performed after the barrier receive the latest value that is visible at the time of the barrier.
tmp = new Singleton;
// only needs to prevent preceding memory operations from being reordered past itself
m_instance.store(tmp, std::memory_order_release);
tmp = new Singleton;
// a release fence must prevent preceding memory operations from being reordered past all subsequent writes.
std::atomic_thread_fence(std::memory_order_release);
m_instance.store(tmp, std::memory_order_relaxed);
Let A and B represent operations performed by a multithreaded process. If A happens-before B, then the memory effects of A effectively become visible to the thread performing B before B is performed.
synchronizes-with implies happens-before
- The payload is the set of data being propagated between threads
- the guard variable protects access to the payload.
It’s a Runtime Relationship
Go’s memory model simply sticks with the term “happens-before”
-
Data dependency ordering guarantees that all memory accesses performed along a single chain will be performed in-order.
-
On the other hand, no guarantees are made about independent chains!