Parallel Programming

Courses


“Big Ideas”

  • Scale “out”, not “up”
    • Scale out: add more computers
    • Scale up: more powerful computer
  • Move processing to the data
    • Clusters have limited bandwidth
  • Process data sequentially, avoid random access
    • Seeks are expensive, disk throughput is reasonable
  • Seamless scalability
    • From the mythical man-month to the tradable machine-hour

Parallel Programming Model

  • Shared Memory

    • Pros: easy data sharing and communication
    • Cons: synchronization problems
  • Distributed Memory

    • Pros: easy to program
    • Cons: communication cost
  • SPMD: Single Program Multiple Data

    • a single program contains the logic that allows different tasks to branch or conditionally execute only the parts of the program they are designed to execute
  • MPMD: Multiple Program Multiple Data

    • A separate program is written for each processor
    • Processes are spawned from the master process
  • Pthread: shared memory, MPMD

  • MPI: distributed memory, SPMD

MPI

Synchronous vs. blocking

  • From the view of the pair of communicating processes
    • Synchronous communication --- sending and receiving occur simultaneously
    • Asynchronous communication --- sending and receiving do not occur simultaneously
  • From the view of an individual function call
    • Blocking --- describes routines that do not return until the transfer is completed
    • Non-blocking --- describes routines that return whether or not the message has been received
  • Synchronous vs. blocking:
    • Synchronous communication is commonly implemented by blocking calls
    • Synchronous communication intrinsically performs two actions: transferring data and synchronizing the processes

Asynchronous/Non-Blocking Message Passing

  • How can message-passing routines return before the message transfer has completed?
    • Generally, a message buffer is needed between source and destination to hold the message
    • The message buffer is memory space on the sender and/or receiver side
    • For a send routine, once the local actions have completed and the message is safely on its way, the process can continue with subsequent work (see the sketch below)
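
A minimal sketch of non-blocking MPI point-to-point communication (assuming the standard MPI C API and an even number of ranks): MPI_Isend and MPI_Irecv return immediately, and MPI_Waitall later completes the transfers.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data = rank, recv = -1;
    int peer = rank ^ 1;   // pair ranks 0<->1, 2<->3, ... (assumes even world size)
    MPI_Request reqs[2];

    // Both calls return immediately; the transfers proceed in the background.
    MPI_Isend(&data, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recv, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    // ... computation can overlap with communication here ...

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   // block until both complete
    printf("rank %d received %d\n", rank, recv);

    MPI_Finalize();
    return 0;
}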

Pthread

Threads vs. Processes

  • Process (heavyweight process): a complete, separate program with its own variables, stack, heap, and everything else.
  • Thread (lightweight process): threads of the same process share one memory space for global variables and other resources.

Shared Segment

  • Shared segment memory space can also be created between processes

  • System V shared memory API (a usage sketch follows this list):

    • shmget(): allocate a shared memory segment
    • shmat(): attach a shared memory segment to a process
    • shmdt(): detach a shared memory segment from a process
    • shmctl(): deallocate a shared memory segment
  • System calls or library routines are called “Thread-Safe” if they can be called from multiple threads simultaneously and always produce correct results.
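
A minimal usage sketch of the four calls above (error handling omitted; the 4 KB size is an arbitrary illustration value):

#include <sys/ipc.h>
#include <sys/shm.h>
#include <string.h>

int main() {
    // Allocate a 4 KB segment; IPC_PRIVATE segments are typically
    // shared with child processes created by fork().
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);

    // Attach the segment into this process's address space.
    char *buf = (char *)shmat(shmid, NULL, 0);
    strcpy(buf, "hello");            // visible to every process attached to shmid

    shmdt(buf);                      // detach from this process
    shmctl(shmid, IPC_RMID, NULL);   // deallocate the segment
    return 0;
}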

Synchronization Tools


The Critical-Section Problem: ensure that when one process is executing in its critical section, no other process is allowed to execute in its critical section.

A solution must satisfy three requirements:

  • Mutual Exclusion - If process Pi is executing in its critical section, then no other processes can be executing in their critical sections
  • Progress - If no process is executing in its critical section and there exist some processes that wish to enter their critical section, then the selection of the processes that will enter the critical section next cannot be postponed indefinitely
  • Bounded Waiting - A bound must exist on the number of times that other processes are allowed to enter their critical sections after a process has made a request to enter its critical section and before that request is granted

Peterson’s Solution

// Peterson's solution, code for process i (the other process is j).
// Shared: boolean flag[2] = {FALSE, FALSE}; int turn;
while (true) {
    flag[i] = TRUE;                  // i wants to enter
    turn = j;                        // but yields the turn to j
    while (flag[j] && turn == j) ;   // spin while j wants in and holds the turn
    // CRITICAL SECTION
    flag[i] = FALSE;
    // REMAINDER SECTION
}

Bakery Algorithm

// Lamport's bakery algorithm, code for process i among n processes.
// Shared: boolean choosing[n] = {false}; int number[n] = {0};
do {
    choosing[i] = true;
    number[i] = max(number[0], number[1], ..., number[n - 1]) + 1;  // take a ticket
    choosing[i] = false;
    for (j = 0; j < n; j++) {
        while (choosing[j]) ;   // wait until j has finished taking its ticket
        // (number[j], j) < (number[i], i) compares lexicographically:
        // the lower ticket goes first, ties are broken by process id
        while ((number[j] != 0) && ((number[j], j) < (number[i], i))) ;
    }
    // critical section
    number[i] = 0;   // return the ticket
    // remainder section
} while (1);

Synchronization Hardware

Modern machines provide special atomic hardware instructions:

Atomic = non-interruptible

Either test memory word and set value or swap contents of two memory words.

boolean TestAndSet (boolean *target)
{
    // executed atomically by the hardware
    boolean rv = *target;
    *target = TRUE;
    return rv;
}

while (true) {
    while ( TestAndSet (&lock )) ;   // do nothing
    // critical section
    lock = FALSE;
    // remainder section 
} 

void Swap (boolean *a, boolean *b)
{
    // executed atomically by the hardware
    boolean temp = *a;
    *a = *b;
    *b = temp;
}

while (true)  {
    key = TRUE;
    while (key == TRUE)
        Swap(&lock, &key);
    // critical section
    lock = FALSE;
    // remainder section 
}
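
For comparison, a sketch of the same test-and-set spin lock written with C++11 std::atomic_flag, whose operations are guaranteed lock-free:

#include <atomic>

std::atomic_flag lock = ATOMIC_FLAG_INIT;

void critical_work() {
    // test_and_set atomically sets the flag and returns the previous value,
    // exactly like TestAndSet above.
    while (lock.test_and_set(std::memory_order_acquire))
        ;   // spin
    // critical section
    lock.clear(std::memory_order_release);
}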

Semaphore

Semaphore Implementation

wait (S) {
    S.value--;
    if (S.value < 0) {
        add this process to S's waiting queue;
        block();
    }
}

signal (S) {
    S.value++;
    if (S.value <= 0) {   // some process is still waiting
        remove a process P from S's waiting queue;
        wakeup(P);
    }
}

Semaphore S;    // initialized to 1
do{
    wait (S);
    // Critical Section
    signal (S);
    // remainder section;
} while(1);

Classical Problems of Synchronization

Bounded-Buffer Problem

  • N buffers, each can hold one item
  • Semaphore mutex initialized to the value 1
  • Semaphore full initialized to the value 0
  • Semaphore empty initialized to the value N.
// producer

while (true)  {
    // produce an item
    wait(empty);
    wait(mutex);
    // add the item to the buffer
    signal(mutex);
    signal(full);
}
// consumer

while (true) {
    wait(full);
    wait(mutex);
    // remove an item from the buffer
    signal(mutex);
    signal(empty);
    // consume the removed item
}
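
A runnable sketch of the same scheme with pthreads and POSIX semaphores (buffer size, item counts, and names are illustrative):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N 8   // buffer capacity

int buffer[N], in = 0, out = 0;
sem_t empty_slots, full_slots;   // counting semaphores
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void *producer(void *arg) {
    for (int item = 0; item < 100; item++) {
        sem_wait(&empty_slots);         // wait(empty)
        pthread_mutex_lock(&mutex);     // wait(mutex)
        buffer[in] = item;
        in = (in + 1) % N;
        pthread_mutex_unlock(&mutex);   // signal(mutex)
        sem_post(&full_slots);          // signal(full)
    }
    return NULL;
}

void *consumer(void *arg) {
    for (int i = 0; i < 100; i++) {
        sem_wait(&full_slots);          // wait(full)
        pthread_mutex_lock(&mutex);     // wait(mutex)
        int item = buffer[out];
        out = (out + 1) % N;
        pthread_mutex_unlock(&mutex);   // signal(mutex)
        sem_post(&empty_slots);         // signal(empty)
        printf("consumed %d\n", item);
    }
    return NULL;
}

int main() {
    sem_init(&empty_slots, 0, N);   // all N slots empty initially
    sem_init(&full_slots, 0, 0);
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}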

Readers and Writers Problem

Allow multiple readers to read at the same time; only a single writer can access the shared data at any time.

  • Semaphore mutex initialized to 1.
  • Semaphore wrt initialized to 1.
  • Integer readcount initialized to 0.
// writer

while (true) {
    wait (wrt) ;

    // writing is performed

    signal (wrt) ;
}
// reader

while (true) {
    wait(mutex);
    readcount++;
    if (readcount == 1)  wait(wrt);    // first reader locks out writers
    signal(mutex);

    // reading is performed

    wait(mutex);
    readcount--;
    if (readcount == 0)  signal(wrt);  // last reader lets writers in
    signal(mutex);
}

Dining-Philosophers Problem


  • A Critical Section is a piece of code that can be executed by only one process/thread at a time.
  • Mutual exclusion is the problem of ensuring that only one process/thread can be in a critical section.
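
The classic pitfall is that every philosopher picks up the left fork first and the system deadlocks. A common fix, sketched below with pthreads (names are illustrative), is to impose a global order on fork acquisition, which breaks the circular-wait condition:

#include <pthread.h>

#define N 5
pthread_mutex_t fork_mutex[N];   // one mutex per fork; pthread_mutex_init() each before use

void philosopher_eat(int i) {
    int first = i, second = (i + 1) % N;
    // Always take the lower-numbered fork first. A global acquisition
    // order breaks the circular-wait condition, so no deadlock is possible.
    if (first > second) { int t = first; first = second; second = t; }
    pthread_mutex_lock(&fork_mutex[first]);
    pthread_mutex_lock(&fork_mutex[second]);
    // eat
    pthread_mutex_unlock(&fork_mutex[second]);
    pthread_mutex_unlock(&fork_mutex[first]);
}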

Lock/Mutex

#include <pthread.h>
pthread_mutex_t mutex;
pthread_mutex_init(&mutex, NULL);
pthread_mutex_lock(&mutex);     // enter critical section

// ... critical section ...

pthread_mutex_unlock(&mutex);   // leave critical section
pthread_mutex_destroy(&mutex);

Condition Variables (CV)

  • A CV represents some condition that a thread can:
    • Wait on, until the condition occurs; or
    • Notify other waiting threads that the condition has occurred
  • Three operations on condition variables:
    • wait() --- block until another thread calls signal() or broadcast() on the CV
    • signal() --- wake up one thread waiting on the CV
    • broadcast() --- wake up all threads waiting on the CV

All condition variable operations MUST be performed while the associated mutex is locked!!!

pthread_cond_t cond;
pthread_mutex_t mutex;
pthread_cond_init(&cond, NULL);
pthread_mutex_init(&mutex, NULL);

action() {
    pthread_mutex_lock(&mutex);
    while (x != 0)                         // re-check the predicate after every wakeup
        pthread_cond_wait(&cond, &mutex);  // atomically releases mutex while waiting
    pthread_mutex_unlock(&mutex);
    take_action();
}

counter() {
    pthread_mutex_lock(&mutex);
    x--;
    if (x == 0)
        pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mutex);
}

Semaphore

[My Conclusion] A semaphore limits access to a finite resource so that the number of concurrent users never exceeds the number of resource units. A mutex/lock must be released by the owner that acquired it; a semaphore has no such ownership requirement.

  • a record of how many units of a particular resource are available
    • If #record = 1 -> binary semaphore, mutex lock
    • If #record > 1 -> counting semaphore
  • accessed only through 2 atomic ops: wait & signal
#include <semaphore.h>
sem_t sem;
sem_init(&sem, 0, 1);   // pshared = 0 (threads of one process), initial value 1
sem_wait(&sem);
// critical section
sem_post(&sem);
sem_destroy(&sem);

Monitor

A high-level abstraction for process synchronization.
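
Pthreads has no built-in monitor construct, but one can be sketched as a struct that pairs a mutex with a condition variable and acquires the mutex in every operation (a minimal sketch; names are illustrative):

#include <pthread.h>

// A monitor-style bounded counter: all state is accessed only
// while the monitor's mutex is held. Initialize lock and nonzero
// with pthread_mutex_init()/pthread_cond_init() before use.
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t nonzero;
    int value;
} counter_monitor;

void monitor_decrement(counter_monitor *m) {
    pthread_mutex_lock(&m->lock);        // enter monitor
    while (m->value == 0)                // wait on the monitor's condition
        pthread_cond_wait(&m->nonzero, &m->lock);
    m->value--;
    pthread_mutex_unlock(&m->lock);      // leave monitor
}

void monitor_increment(counter_monitor *m) {
    pthread_mutex_lock(&m->lock);
    m->value++;
    pthread_cond_signal(&m->nonzero);
    pthread_mutex_unlock(&m->lock);
}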

Deadlock

pthread_mutex_trylock()
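
pthread_mutex_trylock() returns immediately instead of blocking, which enables a back-off scheme that breaks the hold-and-wait condition (a sketch; lock names are illustrative):

#include <pthread.h>
#include <sched.h>

pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

void lock_both(void) {
    for (;;) {
        pthread_mutex_lock(&a);
        if (pthread_mutex_trylock(&b) == 0)
            return;                  // got both locks
        pthread_mutex_unlock(&a);    // back off: release everything, then retry
        sched_yield();               // give the other thread a chance to finish
    }
}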

Cache Coherence

OpenMP

Fork-Join model: the master thread forks a specified number of slave threads and divides the task among them

Work-Sharing Construct

  • DO/for Directive: shares iterations of a loop across the team. Represents a type of "data parallelism" (similar to SPMD)
  • SECTIONS Directive: breaks work into separate, discrete sections of code. Each section is executed by a thread.
  • SINGLE Directive

Definition:

  • A work-sharing construct divides the execution of the enclosed code region among the threads that encounter it
  • Work-sharing constructs DO NOT launch new threads
  • There is no implied barrier upon entry to a work-sharing construct; however, there is an implied barrier at the end of a work-sharing construct (illustrated below)
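
A short illustration of the for and sections constructs (assuming an OpenMP-enabled compiler, e.g. gcc -fopenmp):

#include <omp.h>
#include <stdio.h>

#define N 100

int main() {
    int a[N], b[N];

    #pragma omp parallel
    {
        // for: iterations are divided among the team's existing threads;
        // the work-sharing construct launches no new threads.
        #pragma omp for
        for (int i = 0; i < N; i++)
            a[i] = i * i;
        // implied barrier here: a[] is complete before any thread proceeds

        // sections: each section is executed once, by some thread
        #pragma omp sections
        {
            #pragma omp section
            {
                for (int i = 0; i < N; i++) b[i] = a[i] + 1;
            }
            #pragma omp section
            {
                printf("section run by thread %d\n", omp_get_thread_num());
            }
        }
    }
    return 0;
}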

Synchronization Construct
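
OpenMP's synchronization constructs include critical, atomic, and barrier; a minimal sketch of each:

#include <omp.h>
#include <stdio.h>

int main() {
    int sum = 0;
    #pragma omp parallel
    {
        int local = omp_get_thread_num();

        #pragma omp atomic        // atomic: protect a single memory update
        sum += local;

        #pragma omp barrier       // barrier: every thread waits here

        #pragma omp critical      // critical: at most one thread in this block
        { printf("thread %d sees sum = %d\n", local, sum); }
    }
    return 0;
}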

Partitioning & Divide-And-Conquer Strategies

Examples

Bucket Sort

N-Body

Barnes-Hut Algorithm

Orthogonal Recursive Bisection Method

Pipelined Computation

Examples

Insertion Sort

Prime Number Generation

Each node in the pipeline tests whether the incoming number is a multiple of its own prime and passes non-multiples on (a sketch follows).
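
A sketch of the pipeline with MPI (assuming at least two ranks; in this toy version, candidates that survive all available stages are simply dropped):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int stop = -1;   // sentinel that flushes the pipeline

    if (rank == 0) {
        // Stage 0 generates the candidate numbers.
        for (int n = 2; n <= 50; n++)
            MPI_Send(&n, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&stop, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else {
        int prime = 0, n;
        for (;;) {
            MPI_Recv(&n, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (n == stop) break;
            if (prime == 0) {                 // first number to arrive is prime
                prime = n;
                printf("stage %d holds prime %d\n", rank, prime);
            } else if (n % prime != 0 && rank + 1 < size) {
                MPI_Send(&n, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
            }
        }
        if (rank + 1 < size)                  // pass the sentinel down the line
            MPI_Send(&stop, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}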

Linear Equation Solver

Synchronous Computation

Barrier Implementations

  • Counter Implementation: a shared counter records how many processes have reached the barrier
  • Tree Implementation
  • Butterfly Implementation

Counter Barrier Implementation

Counter-based barriers often have two phases (a sketch follows the list):

  • Arrival phase: a process enters the arrival phase and does not leave it until all processes have arrived
  • Departure phase: processes are released after moving to the departure phase
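
A sketch of a counter barrier built from a pthread mutex and condition variable; the generation counter implements the departure phase and makes the barrier reusable:

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t all_arrived;
    int count;        // processes that have arrived so far
    int n;            // processes expected
    int generation;   // departure phase: distinguishes successive uses
} barrier_t;          // initialize lock/all_arrived before use

void barrier_wait(barrier_t *b) {
    pthread_mutex_lock(&b->lock);
    int gen = b->generation;
    if (++b->count == b->n) {            // last arriver releases everyone
        b->count = 0;
        b->generation++;
        pthread_cond_broadcast(&b->all_arrived);
    } else {
        while (gen == b->generation)     // arrival phase: wait for the rest
            pthread_cond_wait(&b->all_arrived, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}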

Deadlock

A set of blocked processes each holding some resources and waiting to acquire a resource held by another process in the set

Necessary Conditions for Deadlock

  • Mutual exclusion: only 1 process at a time can use a resource
  • Hold & Wait: a process holds some resources while waiting to acquire another resource
  • No preemption: a resource can be only released by a process voluntarily
  • Circular wait: there exists a set {P0, P1, ..., Pn} of waiting processes such that P0 -> P1 -> P2 -> ... -> Pn -> P0

All four conditions must hold for possible deadlock!

Handling Deadlocks

  • Ensure the system will never enter a deadlock state
    • deadlock prevention: ensure that at least one of the 4 necessary conditions cannot hold
    • deadlock avoidance: dynamically examines the resource-allocation state before allocation
  • Allow to enter a deadlock state and then recover
    • deadlock detection
    • deadlock recovery
  • Ignore the problem and pretend that deadlocks never occur in the system
    • used by most operating systems, including UNIX.

Examples

  • Prefix Sum
  • System of Linear Equations
  • Heat Distribution

// TODO

Load-Balancing & Termination Detection

Load-Balancing

Static Load-Balancing

Dynamic Load Balancing

Approaches

  • Centralized Work Pool
  • Decentralized Work Pool
  • Fully Distributed Work Pool

Distributed Termination Detection

Termination at time t must satisfy:

  • Application-specific local termination conditions exist throughout the collection of processes
  • There are no messages in transit between processes

// TODO

  • Single-Pass Ring Termination Algorithm
  • Dual-Pass Ring Termination Algorithm

Sorting

Parallel Bubble Sort: pipeline

  • Sequential: O(n^2)
  • Parallel: O(2n)

Merge Sort

  • Sequential: O(nlogn)
  • Parallel: O(2n)

Quick Sort

  • Sequential: O(nlogn)
  • Parallel: O(2n)

Bitonic Mergesort

  • A bitonic sequence consists of two subsequences, one increasing and one decreasing.
  • O(log n * log n)

Characteristics of Bitonic Sequences

Performing compare-and-exchange on corresponding elements of the two halves yields two bitonic sequences, and every element of one sequence is smaller than every element of the other.

Rank Sort

Counting Sort

  • If the numbers are integers and the range is known, the rank sort algorithm can be coded to reduce the sequential time from O(n^2) to O(n) (sketched below)
  • Use array c[] to count the histogram of values
  • Complexity:
    • Sequential: O(n + m)
    • Parallel with n processors: O(logn)
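
A sequential counting sort sketch for n integers in a known range [0, m):

#include <stdlib.h>

// Sort n integers, each in [0, m), in O(n + m) time.
void counting_sort(const int *a, int *out, int n, int m) {
    int *c = (int *)calloc(m, sizeof(int));
    for (int i = 0; i < n; i++)        // histogram of values
        c[a[i]]++;
    for (int v = 1; v < m; v++)        // prefix sums: final position of each value
        c[v] += c[v - 1];
    for (int i = n - 1; i >= 0; i--)   // stable placement from the back
        out[--c[a[i]]] = a[i];
    free(c);
}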

Linux cross-process locking

Optimistic concurrency control (OCC)

Transactional memory

Lock-Free Programming


Hazard Pointer

Atomics



Compare-and-Swap: The Mother of All RMWs

shared.compare_exchange_weak(T& expected, T desired, ...);


In general, the C++11 standard does not guarantee that atomic operations will be lock-free.

In general, any time you have a small amount of data protected by a mutex, and you can pack that data entirely into a 32- or 64-bit integer type, you can always convert your mutex-based operations into lock-free RMW operations.
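
For example, a sketch of an RMW operation that hardware does not provide directly (fetch-and-multiply), built from a compare_exchange_weak retry loop:

#include <atomic>
#include <cstdint>

// Atomically multiply `shared` by `multiplier` using a CAS retry loop.
// compare_exchange_weak refreshes oldValue with the current value on failure,
// so the loop simply retries until no other thread interferes.
uint32_t fetch_multiply(std::atomic<uint32_t>& shared, uint32_t multiplier) {
    uint32_t oldValue = shared.load();
    while (!shared.compare_exchange_weak(oldValue, oldValue * multiplier)) {
        // spurious failure or contention: oldValue was refreshed, retry
    }
    return oldValue;   // returns the previous value, like other fetch_* RMWs
}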


memory model



Explicit Compiler Barriers

Compiler barriers are sufficient to prevent memory reordering on a single-processor system.

// full compiler barrier in GCC. In Microsoft Visual C++, [_ReadWriteBarrier](http://msdn.microsoft.com/en-us/library/f20w0x5e.aspx) serves the same purpose.
#define COMPILER_BARRIER() asm volatile("" ::: "memory")

Implied Compiler Barriers

  • CPU fence instructions
  • C++11 atomic
  • function calls (excludes inline functions, functions declared with the pure attribute, and cases where link-time code generation is used.)

volatile data types in C are not usually necessary in correctly-written multithreaded code.

Out-Of-Thin-Air Stores

C++11 standard

Compiler transformations that introduce assignments to a potentially shared memory location that would not be modified by the abstract machine are generally precluded by this standard.


What’s cool is that neither acquire nor release semantics requires the use of a #StoreLoad barrier, which is often a more expensive memory barrier type. e.g. on PowerPC:

  • lwsync (“lightweight sync”) = #LoadLoad, #LoadStore and #StoreStore
  • sync = #StoreLoad (more expensive)



Weak Memory Models


Such processors are described as weakly-ordered, or as having weak ordering or a relaxed memory model.

Strong Memory Models


TSO (total store order): there is always a single, global order of writes to shared memory from all cores.

Sequential Consistency

No memory reordering at all.

Sequential consistency only really becomes interesting as a software memory model:

  • In Java 5 and higher: volatile
  • In C++11: memory_order_seq_cst



memory fence / barrier


Correct memory ordering is enforced on the processor by issuing any instruction that acts as a memory barrier:

LoadStore instruction reordering might happen on certain processors if, say, there is a cache miss on the load followed by a cache hit on the store.

A StoreLoad barrier ensures that all stores performed before the barrier are visible to other processors, and that all loads performed after the barrier receive the latest value that is visible at the time of the barrier.


// Variant 1: a release store on the guard; it only needs to prevent
// preceding memory operations from being reordered past itself.
tmp = new Singleton;
m_instance.store(tmp, std::memory_order_release);

// Variant 2: a standalone release fence; it must prevent preceding memory
// operations from being reordered past all subsequent writes.
tmp = new Singleton;
std::atomic_thread_fence(std::memory_order_release);
m_instance.store(tmp, std::memory_order_relaxed);

Let A and B represent operations performed by a multithreaded process. If A happens-before B, then the memory effects of A effectively become visible to the thread performing B before B is performed.


synchronizes-with implies happens-before

  • The payload is the set of data being propagated between threads
  • The guard variable protects access to the payload (see the sketch below).
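
A sketch of the pattern with C++11 atomics: the writer publishes the payload with a release store to the guard, and a reader's acquire load that sees the stored value synchronizes-with it:

#include <atomic>

struct Message { int x, y; };        // the payload
Message g_payload;
std::atomic<bool> g_guard(false);    // the guard variable

void producer() {
    g_payload = {1, 2};                              // write the payload
    g_guard.store(true, std::memory_order_release);  // publish: release store
}

void consumer() {
    if (g_guard.load(std::memory_order_acquire)) {   // acquire load synchronizes-with the release store
        // happens-before guarantees the payload writes are visible here
        int sum = g_payload.x + g_payload.y;
        (void)sum;
    }
}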

It’s a Runtime Relationship


Go’s memory model simply sticks with the term “happens-before”




  • The Purpose of memory_order_consume in C++11

  • Data dependency ordering guarantees that all memory accesses performed along a single chain will be performed in-order.

  • On the other hand, no guarantees are made about independent chains!
