Parallel Programming

Courses


“Big Ideas”

  • Scale “out”, not “up”
    • Scale out: add more computers
    • Scale up: more powerful computer
  • Move processing to the data
    • Clusters have limited bandwidth
  • Process data sequentially, avoid random access
    • Seeks are expensive, disk throughput is reasonable
  • Seamless scalability
    • From the mythical man-month to the tradable machine-hour

Parallel Programming Model

  • Shared Memory

    • Pros: easy data sharing and communication
    • Cons: synchronization problems
  • Distributed Memory

    • Pros: easy to program
    • Cons: communication cost
  • SPMD: Single Program Multiple Data

    • a single program contains the logic that allows different tasks to branch or conditionally execute only the parts of the program they are designed to execute
  • MPMD: Multiple Program Multiple Data

    • A separate program is written for each processor
    • Processes are spawned from the master process
  • Pthread: shared memory, MPMD

  • MPI: distributed memory, SPMD

MPI

Synchronous vs. blocking

  • From the view of the pair of communicating processes
    • Synchronous communication --- sending and receiving occur simultaneously
    • Asynchronous communication --- sending and receiving do not occur simultaneously
  • From the view of an individual function call
    • Blocking --- describes routines that do not return until the transfer is completed
    • Non-blocking --- describes routines that return whether or not the message has been received
  • Synchronous vs. blocking:
    • Synchronous communication is commonly implemented by blocking calls
    • Synchronous communication intrinsically performs two actions: transferring data and synchronizing the processes

Asynchronous/Non-Blocking Message Passing

  • How can message-passing routines return before the message transfer has completed?
    • Generally, a message buffer is needed between source and destination to hold the message
    • The message buffer is memory space on the sender and/or receiver side
    • For a send routine, once the local actions have completed and the message is safely on its way, the process can continue with subsequent work (see the sketch below)
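
A minimal sketch of non-blocking MPI point-to-point communication (assuming the standard MPI C API and an even number of ranks): MPI_Isend and MPI_Irecv return immediately, and MPI_Waitall later completes the transfers.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data = rank, recv = -1;
    int peer = rank ^ 1;   // pair ranks 0<->1, 2<->3, ... (assumes even world size)
    MPI_Request reqs[2];

    // Both calls return immediately; the transfers proceed in the background.
    MPI_Isend(&data, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recv, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    // ... computation can overlap with communication here ...

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   // block until both complete
    printf("rank %d received %d\n", rank, recv);

    MPI_Finalize();
    return 0;
}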

Pthread

Threads vs. Processes

  • Process (heavyweight process): a complete, separate program with its own variables, stack, heap, and everything else.
  • Thread (lightweight process): threads of the same process share one memory space for global variables and other resources.

Shared Segment

  • Shared segment memory space can also be created between processes

  • System V shared memory API (a usage sketch follows this list):

    • shmget(): allocate a shared memory segment
    • shmat(): attach a shared memory segment to a process
    • shmdt(): detach a shared memory segment from a process
    • shmctl(): deallocate a shared memory segment
  • System calls or library routines are called “Thread-Safe” if they can be called from multiple threads simultaneously and always produce correct results.
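
A minimal usage sketch of the four calls above (error handling omitted; the 4 KB size is an arbitrary illustration value):

#include <sys/ipc.h>
#include <sys/shm.h>
#include <string.h>

int main() {
    // Allocate a 4 KB segment; IPC_PRIVATE segments are typically
    // shared with child processes created by fork().
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);

    // Attach the segment into this process's address space.
    char *buf = (char *)shmat(shmid, NULL, 0);
    strcpy(buf, "hello");            // visible to every process attached to shmid

    shmdt(buf);                      // detach from this process
    shmctl(shmid, IPC_RMID, NULL);   // deallocate the segment
    return 0;
}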

Synchronization Tools


The Critical-Section Problem: ensure that when one process is executing in its critical section, no other process is allowed to execute in its critical section.

A solution must satisfy three requirements:

  • Mutual Exclusion - If process Pi is executing in its critical section, then no other processes can be executing in their critical sections
  • Progress - If no process is executing in its critical section and there exist some processes that wish to enter their critical section, then the selection of the processes that will enter the critical section next cannot be postponed indefinitely
  • Bounded Waiting - A bound must exist on the number of times that other processes are allowed to enter their critical sections after a process has made a request to enter its critical section and before that request is granted

Peterson’s Solution

// Peterson's solution, code for process i (the other process is j).
// Shared: boolean flag[2] = {FALSE, FALSE}; int turn;
while (true) {
    flag[i] = TRUE;                  // i wants to enter
    turn = j;                        // but yields the turn to j
    while (flag[j] && turn == j) ;   // spin while j wants in and holds the turn
    // CRITICAL SECTION
    flag[i] = FALSE;
    // REMAINDER SECTION
}

Bakery Algorithm

// Lamport's bakery algorithm, code for process i among n processes.
// Shared: boolean choosing[n] = {false}; int number[n] = {0};
do {
    choosing[i] = true;
    number[i] = max(number[0], number[1], ..., number[n - 1]) + 1;  // take a ticket
    choosing[i] = false;
    for (j = 0; j < n; j++) {
        while (choosing[j]) ;   // wait until j has finished taking its ticket
        // (number[j], j) < (number[i], i) compares lexicographically:
        // the lower ticket goes first, ties are broken by process id
        while ((number[j] != 0) && ((number[j], j) < (number[i], i))) ;
    }
    // critical section
    number[i] = 0;   // return the ticket
    // remainder section
} while (1);

Synchronization Hardware

Modern machines provide special atomic hardware instructions:

Atomic = non-interruptible

Either test memory word and set value or swap contents of two memory words.

boolean TestAndSet (boolean *target)
{
    // executed atomically by the hardware
    boolean rv = *target;
    *target = TRUE;
    return rv;
}

while (true) {
    while ( TestAndSet (&lock )) ;   // do nothing
    // critical section
    lock = FALSE;
    // remainder section 
} 

void Swap (boolean *a, boolean *b)
{
    // executed atomically by the hardware
    boolean temp = *a;
    *a = *b;
    *b = temp;
}

while (true)  {
    key = TRUE;
    while (key == TRUE)
        Swap(&lock, &key);
    // critical section
    lock = FALSE;
    // remainder section 
}
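
For comparison, a sketch of the same test-and-set spin lock written with C++11 std::atomic_flag, whose operations are guaranteed lock-free:

#include <atomic>

std::atomic_flag lock = ATOMIC_FLAG_INIT;

void critical_work() {
    // test_and_set atomically sets the flag and returns the previous value,
    // exactly like TestAndSet above.
    while (lock.test_and_set(std::memory_order_acquire))
        ;   // spin
    // critical section
    lock.clear(std::memory_order_release);
}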

Semaphore

Semaphore Implementation

wait (S) {
    S.value--;
    if (S.value < 0) {
        add this process to S's waiting queue;
        block();
    }
}

signal (S) {
    S.value++;
    if (S.value <= 0) {   // some process is still waiting
        remove a process P from S's waiting queue;
        wakeup(P);
    }
}

Semaphore S;    // initialized to 1
do{
    wait (S);
    // Critical Section
    signal (S);
    // remainder section;
} while(1);

Classical Problems of Synchronization

Bounded-Buffer Problem

  • N buffers, each can hold one item
  • Semaphore mutex initialized to the value 1
  • Semaphore full initialized to the value 0
  • Semaphore empty initialized to the value N.
// producer

while (true)  {
    // produce an item
    wait(empty);
    wait(mutex);
    // add the item to the buffer
    signal(mutex);
    signal(full);
}
// consumer

while (true) {
    wait(full);
    wait(mutex);
    // remove an item from the buffer
    signal(mutex);
    signal(empty);
    // consume the removed item
}
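
A runnable sketch of the same scheme with pthreads and POSIX semaphores (buffer size, item counts, and names are illustrative):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N 8   // buffer capacity

int buffer[N], in = 0, out = 0;
sem_t empty_slots, full_slots;   // counting semaphores
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void *producer(void *arg) {
    for (int item = 0; item < 100; item++) {
        sem_wait(&empty_slots);         // wait(empty)
        pthread_mutex_lock(&mutex);     // wait(mutex)
        buffer[in] = item;
        in = (in + 1) % N;
        pthread_mutex_unlock(&mutex);   // signal(mutex)
        sem_post(&full_slots);          // signal(full)
    }
    return NULL;
}

void *consumer(void *arg) {
    for (int i = 0; i < 100; i++) {
        sem_wait(&full_slots);          // wait(full)
        pthread_mutex_lock(&mutex);     // wait(mutex)
        int item = buffer[out];
        out = (out + 1) % N;
        pthread_mutex_unlock(&mutex);   // signal(mutex)
        sem_post(&empty_slots);         // signal(empty)
        printf("consumed %d\n", item);
    }
    return NULL;
}

int main() {
    sem_init(&empty_slots, 0, N);   // all N slots empty initially
    sem_init(&full_slots, 0, 0);
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}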

Readers and Writers Problem

Allow multiple readers to read at the same time; only a single writer can access the shared data at any time.

  • Semaphore mutex initialized to 1.
  • Semaphore wrt initialized to 1.
  • Integer readcount initialized to 0.
// writer

while (true) {
    wait (wrt) ;

    // writing is performed

    signal (wrt) ;
}
// reader

while (true) {
    wait(mutex);
    readcount++;
    if (readcount == 1)  wait(wrt);    // first reader locks out writers
    signal(mutex);

    // reading is performed

    wait(mutex);
    readcount--;
    if (readcount == 0)  signal(wrt);  // last reader lets writers in
    signal(mutex);
}

Dining-Philosophers Problem


  • A Critical Section is a piece of code that can be executed by only one process/thread at a time.
  • Mutual exclusion is the problem of ensuring that only one process/thread can be in a critical section.
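
The classic pitfall is that every philosopher picks up the left fork first and the system deadlocks. A common fix, sketched below with pthreads (names are illustrative), is to impose a global order on fork acquisition, which breaks the circular-wait condition:

#include <pthread.h>

#define N 5
pthread_mutex_t fork_mutex[N];   // one mutex per fork; pthread_mutex_init() each before use

void philosopher_eat(int i) {
    int first = i, second = (i + 1) % N;
    // Always take the lower-numbered fork first. A global acquisition
    // order breaks the circular-wait condition, so no deadlock is possible.
    if (first > second) { int t = first; first = second; second = t; }
    pthread_mutex_lock(&fork_mutex[first]);
    pthread_mutex_lock(&fork_mutex[second]);
    // eat
    pthread_mutex_unlock(&fork_mutex[second]);
    pthread_mutex_unlock(&fork_mutex[first]);
}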

Lock/Mutex

#include <pthread.h>
pthread_mutex_t mutex;
pthread_mutex_init(&mutex, NULL);
pthread_mutex_lock(&mutex);     // enter critical section

// ... critical section ...

pthread_mutex_unlock(&mutex);   // leave critical section
pthread_mutex_destroy(&mutex);

Condition Variables (CV)

  • A CV represents some condition that a thread can:
    • Wait on, until the condition occurs; or
    • Notify other waiting threads that the condition has occurred
  • Three operations on condition variables:
    • wait() --- block until another thread calls signal() or broadcast() on the CV
    • signal() --- wake up one thread waiting on the CV
    • broadcast() --- wake up all threads waiting on the CV

All condition variable operations MUST be performed while the associated mutex is locked!!!

pthread_cond_t cond;
pthread_mutex_t mutex;
pthread_cond_init(&cond, NULL);
pthread_mutex_init(&mutex, NULL);

action() {
    pthread_mutex_lock(&mutex);
    while (x != 0)                         // re-check the predicate after every wakeup
        pthread_cond_wait(&cond, &mutex);  // atomically releases mutex while waiting
    pthread_mutex_unlock(&mutex);
    take_action();
}

counter() {
    pthread_mutex_lock(&mutex);
    x--;
    if (x == 0)
        pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mutex);
}

Semaphore

[My Conclusion] A semaphore limits access to a finite resource so that the number of concurrent users never exceeds the number of resource units. A mutex/lock must be released by the owner that acquired it; a semaphore has no such ownership requirement.

  • a record of how many units of a particular resource are available
    • If #record = 1 -> binary semaphore, mutex lock
    • If #record > 1 -> counting semaphore
  • accessed only through 2 atomic ops: wait & signal
#include <semaphore.h>
sem_t sem;
sem_init(&sem, 0, 1);   // pshared = 0 (threads of one process), initial value 1
sem_wait(&sem);
// critical section
sem_post(&sem);
sem_destroy(&sem);

Monitor

A high-level abstraction for process synchronization.
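
Pthreads has no built-in monitor construct, but one can be sketched as a struct that pairs a mutex with a condition variable and acquires the mutex in every operation (a minimal sketch; names are illustrative):

#include <pthread.h>

// A monitor-style bounded counter: all state is accessed only
// while the monitor's mutex is held. Initialize lock and nonzero
// with pthread_mutex_init()/pthread_cond_init() before use.
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t nonzero;
    int value;
} counter_monitor;

void monitor_decrement(counter_monitor *m) {
    pthread_mutex_lock(&m->lock);        // enter monitor
    while (m->value == 0)                // wait on the monitor's condition
        pthread_cond_wait(&m->nonzero, &m->lock);
    m->value--;
    pthread_mutex_unlock(&m->lock);      // leave monitor
}

void monitor_increment(counter_monitor *m) {
    pthread_mutex_lock(&m->lock);
    m->value++;
    pthread_cond_signal(&m->nonzero);
    pthread_mutex_unlock(&m->lock);
}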

Deadlock

pthread_mutex_trylock()
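
pthread_mutex_trylock() returns immediately instead of blocking, which enables a back-off scheme that breaks the hold-and-wait condition (a sketch; lock names are illustrative):

#include <pthread.h>
#include <sched.h>

pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

void lock_both(void) {
    for (;;) {
        pthread_mutex_lock(&a);
        if (pthread_mutex_trylock(&b) == 0)
            return;                  // got both locks
        pthread_mutex_unlock(&a);    // back off: release everything, then retry
        sched_yield();               // give the other thread a chance to finish
    }
}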

Cache Coherence

OpenMP

Fork-Join model: the master thread forks a specified number of slave threads and divides the task among them

Work-Sharing Construct

  • DO/for Directive: shares iterations of a loop across the team. Represents a type of "data parallelism" (similar to SPMD)
  • SECTIONS Directive: breaks work into separate, discrete sections of code. Each section is executed by a thread.
  • SINGLE Directive

Definition:

  • A work-sharing construct divides the execution of the enclosed code region among the threads that encounter it
  • Work-sharing constructs DO NOT launch new threads
  • There is no implied barrier upon entry to a work-sharing construct; however, there is an implied barrier at the end of a work-sharing construct (illustrated below)
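
A short illustration of the for and sections constructs (assuming an OpenMP-enabled compiler, e.g. gcc -fopenmp):

#include <omp.h>
#include <stdio.h>

#define N 100

int main() {
    int a[N], b[N];

    #pragma omp parallel
    {
        // for: iterations are divided among the team's existing threads;
        // the work-sharing construct launches no new threads.
        #pragma omp for
        for (int i = 0; i < N; i++)
            a[i] = i * i;
        // implied barrier here: a[] is complete before any thread proceeds

        // sections: each section is executed once, by some thread
        #pragma omp sections
        {
            #pragma omp section
            {
                for (int i = 0; i < N; i++) b[i] = a[i] + 1;
            }
            #pragma omp section
            {
                printf("section run by thread %d\n", omp_get_thread_num());
            }
        }
    }
    return 0;
}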

Synchronization Construct
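
OpenMP's synchronization constructs include critical, atomic, and barrier; a minimal sketch of each:

#include <omp.h>
#include <stdio.h>

int main() {
    int sum = 0;
    #pragma omp parallel
    {
        int local = omp_get_thread_num();

        #pragma omp atomic        // atomic: protect a single memory update
        sum += local;

        #pragma omp barrier       // barrier: every thread waits here

        #pragma omp critical      // critical: at most one thread in this block
        { printf("thread %d sees sum = %d\n", local, sum); }
    }
    return 0;
}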

Partitioning & Divide-And-Conquer Strategies

Examples

Bucket Sort

N-Body

Barnes-Hut Algorithm

Orthogonal Recursive Bisection Method

Pipelined Computation

Examples

Insertion Sort

Prime Number Generation

Each node in the pipeline tests whether the incoming number is a multiple of its own prime and passes non-multiples on (a sketch follows).
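
A sketch of the pipeline with MPI (assuming at least two ranks; in this toy version, candidates that survive all available stages are simply dropped):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int stop = -1;   // sentinel that flushes the pipeline

    if (rank == 0) {
        // Stage 0 generates the candidate numbers.
        for (int n = 2; n <= 50; n++)
            MPI_Send(&n, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&stop, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else {
        int prime = 0, n;
        for (;;) {
            MPI_Recv(&n, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (n == stop) break;
            if (prime == 0) {                 // first number to arrive is prime
                prime = n;
                printf("stage %d holds prime %d\n", rank, prime);
            } else if (n % prime != 0 && rank + 1 < size) {
                MPI_Send(&n, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
            }
        }
        if (rank + 1 < size)                  // pass the sentinel down the line
            MPI_Send(&stop, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}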

Linear Equation Solver

Synchronous Computation

Barrier Implementations

  • Counter Implementation: a shared counter records how many processes have reached the barrier
  • Tree Implementation
  • Butterfly Implementation

Counter Barrier Implementation

Counter-based barriers often have two phases (a sketch follows the list):

  • Arrival phase: a process enters the arrival phase and does not leave it until all processes have arrived
  • Departure phase: processes are released after moving to the departure phase
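
A sketch of a counter barrier built from a pthread mutex and condition variable; the generation counter implements the departure phase and makes the barrier reusable:

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t all_arrived;
    int count;        // processes that have arrived so far
    int n;            // processes expected
    int generation;   // departure phase: distinguishes successive uses
} barrier_t;          // initialize lock/all_arrived before use

void barrier_wait(barrier_t *b) {
    pthread_mutex_lock(&b->lock);
    int gen = b->generation;
    if (++b->count == b->n) {            // last arriver releases everyone
        b->count = 0;
        b->generation++;
        pthread_cond_broadcast(&b->all_arrived);
    } else {
        while (gen == b->generation)     // arrival phase: wait for the rest
            pthread_cond_wait(&b->all_arrived, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}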

Deadlock

A set of blocked processes each holding some resources and waiting to acquire a resource held by another process in the set

Necessary Conditions for Deadlock

  • Mutual exclusion: only 1 process at a time can use a resource
  • Hold & Wait: a process holds some resources while waiting to acquire another resource
  • No preemption: a resource can be only released by a process voluntarily
  • Circular wait: there exists a set {P0, P1, ..., Pn} of waiting processes such that P0 -> P1 -> P2 -> ... -> Pn -> P0

All four conditions must hold for possible deadlock!

Handling Deadlocks

  • Ensure the system will never enter a deadlock state
    • deadlock prevention: ensure that at least one of the 4 necessary conditions cannot hold
    • deadlock avoidance: dynamically examines the resource-allocation state before allocation
  • Allow to enter a deadlock state and then recover
    • deadlock detection
    • deadlock recovery
  • Ignore the problem and pretend that deadlocks never occur in the system
    • used by most operating systems, including UNIX.

Examples

  • Prefix Sum
  • System of Linear Equations
  • Heat Distribution

// TODO

Load-Balancing & Termination Detection

Load-Balancing

Static Load-Balancing

Dynamic Load Balancing

Approaches

  • Centralized Work Pool
  • Decentralized Work Pool
  • Fully Distributed Work Pool

Distributed Termination Detection

Termination at time t must satisfy:

  • Application-specific local termination conditions exist throughout the collection of processes
  • There are no messages in transit between processes

// TODO

  • Single-Pass Ring Termination Algorithm
  • Dual-Pass Ring Termination Algorithm

Sorting

Parallel Bubble Sort: pipeline

  • Sequential: O(n^2)
  • Parallel: O(2n)

Merge Sort

  • Sequential: O(nlogn)
  • Parallel: O(2n)

Quick Sort

  • Sequential: O(nlogn)
  • Parallel: O(2n)

Bitonic Mergesort

  • A bitonic sequence consists of two subsequences, one increasing and one decreasing.
  • O(log n * log n)

Characteristics of Bitonic Sequences

Performing compare-and-exchange on corresponding elements of the two halves yields two bitonic sequences, and every element of one sequence is smaller than every element of the other.

Rank Sort

Counting Sort

  • If the numbers are integers and the range is known, the rank sort algorithm can be coded to reduce the sequential time from O(n^2) to O(n) (sketched below)
  • Use array c[] to count the histogram of values
  • Complexity:
    • Sequential: O(n + m)
    • Parallel with n processors: O(logn)
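
A sequential counting sort sketch for n integers in a known range [0, m):

#include <stdlib.h>

// Sort n integers, each in [0, m), in O(n + m) time.
void counting_sort(const int *a, int *out, int n, int m) {
    int *c = (int *)calloc(m, sizeof(int));
    for (int i = 0; i < n; i++)        // histogram of values
        c[a[i]]++;
    for (int v = 1; v < m; v++)        // prefix sums: final position of each value
        c[v] += c[v - 1];
    for (int i = n - 1; i >= 0; i--)   // stable placement from the back
        out[--c[a[i]]] = a[i];
    free(c);
}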

Linux cross-process locking

Optimistic concurrency control (OCC)

Transactional memory

Lock-Free Programming


Hazard Pointer

Atomics



Compare-and-Swap: The Mother of All RMWs

shared.compare_exchange_weak(T& expected, T desired, ...);


In general, the C++11 standard does not guarantee that atomic operations will be lock-free.

In general, any time you have a small amount of data protected by a mutex, and you can pack that data entirely into a 32- or 64-bit integer type, you can always convert your mutex-based operations into lock-free RMW operations.
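
For example, a sketch of an RMW operation that hardware does not provide directly (fetch-and-multiply), built from a compare_exchange_weak retry loop:

#include <atomic>
#include <cstdint>

// Atomically multiply `shared` by `multiplier` using a CAS retry loop.
// compare_exchange_weak refreshes oldValue with the current value on failure,
// so the loop simply retries until no other thread interferes.
uint32_t fetch_multiply(std::atomic<uint32_t>& shared, uint32_t multiplier) {
    uint32_t oldValue = shared.load();
    while (!shared.compare_exchange_weak(oldValue, oldValue * multiplier)) {
        // spurious failure or contention: oldValue was refreshed, retry
    }
    return oldValue;   // returns the previous value, like other fetch_* RMWs
}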


memory model



Explicit Compiler Barriers

Compiler barriers are sufficient to prevent memory reordering on a single-processor system.

// full compiler barrier in GCC. In Microsoft Visual C++, [_ReadWriteBarrier](http://msdn.microsoft.com/en-us/library/f20w0x5e.aspx) serves the same purpose.
#define COMPILER_BARRIER() asm volatile("" ::: "memory")

Implied Compiler Barriers

  • CPU fence instructions
  • C++11 atomic
  • function calls (excludes inline functions, functions declared with the pure attribute, and cases where link-time code generation is used.)

volatile data types in C are not usually necessary in correctly-written multithreaded code.

Out-Of-Thin-Air Stores

C++11 standard

Compiler transformations that introduce assignments to a potentially shared memory location that would not be modified by the abstract machine are generally precluded by this standard.


What’s cool is that neither acquire nor release semantics requires the use of a #StoreLoad barrier, which is often a more expensive memory barrier type. e.g. on PowerPC:

  • lwsync (“lightweight sync”) = #LoadLoad, #LoadStore and #StoreStore
  • sync = #StoreLoad (more expensive)



Weak Memory Models


Such processors are described as weakly-ordered, or as having weak ordering or a relaxed memory model.

Strong Memory Models


TSO (total store order): there is always a single, global order of writes to shared memory from all cores.

Sequential Consistency

No memory reordering at all.

Sequential consistency only really becomes interesting as a software memory model:

  • In Java 5 and higher: volatile
  • In C++11: memory_order_seq_cst



memory fence / barrier


Correct memory ordering is enforced on the processor by issuing any instruction that acts as a memory barrier:

LoadStore instruction reordering might happen on certain processors if, say, there is a cache miss on the load followed by a cache hit on the store.

A StoreLoad barrier ensures that all stores performed before the barrier are visible to other processors, and that all loads performed after the barrier receive the latest value that is visible at the time of the barrier.


// Variant 1: a release store on the guard; it only needs to prevent
// preceding memory operations from being reordered past itself.
tmp = new Singleton;
m_instance.store(tmp, std::memory_order_release);

// Variant 2: a standalone release fence; it must prevent preceding memory
// operations from being reordered past all subsequent writes.
tmp = new Singleton;
std::atomic_thread_fence(std::memory_order_release);
m_instance.store(tmp, std::memory_order_relaxed);

Let A and B represent operations performed by a multithreaded process. If A happens-before B, then the memory effects of A effectively become visible to the thread performing B before B is performed.


synchronizes-with implies happens-before

  • The payload is the set of data being propagated between threads
  • The guard variable protects access to the payload (see the sketch below).
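
A sketch of the pattern with C++11 atomics: the writer publishes the payload with a release store to the guard, and a reader's acquire load that sees the stored value synchronizes-with it:

#include <atomic>

struct Message { int x, y; };        // the payload
Message g_payload;
std::atomic<bool> g_guard(false);    // the guard variable

void producer() {
    g_payload = {1, 2};                              // write the payload
    g_guard.store(true, std::memory_order_release);  // publish: release store
}

void consumer() {
    if (g_guard.load(std::memory_order_acquire)) {   // acquire load synchronizes-with the release store
        // happens-before guarantees the payload writes are visible here
        int sum = g_payload.x + g_payload.y;
        (void)sum;
    }
}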

It’s a Runtime Relationship


Go’s memory model simply sticks with the term “happens-before”




  • The Purpose of memory_order_consume in C++11

  • Data dependency ordering guarantees that all memory accesses performed along a single chain will be performed in-order.

  • On the other hand, no guarantees are made about independent chains!
