ATPESC Minutes, 29 July, Day 2 Afternoon
David Keyes
Algorithmic Adaptations to Extreme Scale. Paraphrasing Benjamin Franklin:
An infrastructure, if you can keep it
Baton pass from the BSP generation to the energy-aware generation
- End of Bulk-Synchronous-Processing as we know it
No final answers, but suggestions on investments for the future.
A conclusion up front
- Hope for some of today's best algorithms to make the leap
- greater concurrency
- less synchrony
- built-in resilience ("algorithm-based fault tolerance" or ABFT)
- Programming models will have to be severely stretched
- Everything should be "on the table" for trades "over the threshold"
Motivation
- High performance with high productivity on all the "multis"
- Multi-scale, multi-physics problems in multi-dimensions
- Multi-models and/or multi-levels of refinement
- Exploiting polyalgorithms in multiple precisions and multi-protocol programming styles
- Multi-core, massively multiprocessor systems
- Requiring a multi-disciplinary approach
- Really just "highest" productivity
Purpose of the presentation
- Increase quality of "co-design" dialog between application user-developers and systems software and hardware developers
- Vendors are surprisingly willing
- No longer one type of vendor
- All vendors motivated by some mass market
- low-power
- high graphics
- high reliability
- Computational scientists want all three
For background on the talk, see the International Exascale Software Roadmap (IESR)
Extrapolating exponentials eventually fails
- Scientific computing at a crossroads wrt extreme scale
- Proceeded steadily for decades from giga- (1988) to tera- (1998) to peta- (2008)
- same SPMD programming model
- same assumptions about who is responsible for what
- same classes of algorithms (25 years of Gordon Bell Prizes)
- Exa- is qualitatively different and will be much harder
- Core numerical analysis and scientific computing will confront exascale to maintain sponsor relevance
- not a distraction, but an intellectual stimulus
Relevance of exascale to users today
- Modelers are on the front line
- without concurrent research, the passage to exascale hardware will yield little new scientific fruit
- Scientists will find the computational power to do things many have wanted
- exascale at the lab means petascale on the desk
- We suggest some mathematical opportunities, after reviewing the hardware challenges
Why exa- is different
- Double-precision FMADD flop: 11 pJ; cross-die per-word access (1.2 pJ/mm): 24 pJ; ~96 pJ overall
Today's power costs per operation (c/o John Shalf)
- DP FMADD flop: 100 pJ
- DP DRAM read-to-register: 4800 pJ
- DP word transmit-to-neighbor: 7500 pJ
- DP word transmit-across-system: 9000 pJ
- Moore's Law (1965) does not end but Dennard's MOSFET scaling (1972) does
What will first "general purpose" exaflop/s machines look like?
- Hardware: many potentially exciting paths beyond today's CMOS silicon-etched logic, but not commercially at scale within the decade
- Software: MPI+X
- X is OpenMP, CUDA, OpenACC, pthreads, etc.
Current Number 1: Tianhe-2 by Inspur / NUDT / Intel
Ratios to go
- Nodes: 60
- Concurrency: 5
- Memory: 1
- Node peak: 1
- Concurrency: 320
- Memory: 40
- Peak performance: 20
- Power: (1)
Some exascale themes
- Clock rates cease to increase, arithmetic capacity continues to increase w/ concurrency
- Store capacity diverges exponentially below arithmetic intensity ...
Implications of operating on the edge
- Draconian reduction required in power per flop and per byte will make computing and copying data less reliable
- More heterogeneity in performance
- Expanding the number of nodes beyond 10^6 would not pose a serious threat to algorithms that lend themselves to well-amortized precise load balancing
- A real challenge is usefully expanding the number of cores on a node to 10^3
- It is already orders of magnitude slower to retrieve an operand from main DRAM memory than to perform an arithmetic operation, and this will get worse by another order of magnitude
Missing mathematics
- New formulations with
- greater arithmetic intensity
- reduced communication
- reduced synchronization
- assured accuracy with adaptively less fp precision
- algorithmic resilience to many types of faults
- Quantification of trades between limiting resources
Arithmetic intensity illustration
- S. Williams' "Roofline Model" (see the formula below)
- Fast multipole method buys you considerably more flops per byte
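As a reminder (a standard statement of the roofline model, not verbatim from the slides): attainable performance is bounded by $P_{attainable} = \min(P_{peak}, I \times B)$, where $I$ is the arithmetic intensity in flops per byte moved and $B$ is the sustained memory bandwidth. High-intensity kernels such as FMM sit to the right of the ridge and are compute-bound rather than bandwidth-bound.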
Research in progress: FMM vs AMG preconditioning, strong scaling on Stampede
- FMM eventually strong scales beyond algebraic multigrid
Reduction of frequency of communication and synchronization for Ax=b
- Classical: amortize communication over many power/reduce steps
- s-step Krylov
- Block Krylov
- "tall skinny QR"
- Invade classical steps
- dynamic scheduling with DAGs
- NUMA-aware work-stealing
Reduction of domain of synchronization
- Nonlinear Schwarz replaces a Newton method for a global nonlinear system, F(u) = 0, with a set of local problems on subsets of the global nonlinear system
How are most workhorse simulations implemented at the infra-petascale today?
- Iterative methods based on data decomposition and message-passing
- Many names: SPMD/BSP/CSP
- PETSc, Trilinos, etc.
Estimating scalability
- Take complexity estimates and model of the architecture
- Estimate optimal concurrency and optimal execution time
- Weak scaling (see the note below)
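As a point of reference (a standard definition, not from the slide): weak scaling fixes the problem size per process, so efficiency can be written as $E_{weak}(P) = T(N, 1) / T(P \cdot N, P)$, where $T(n, p)$ is the time to solve a problem of size $n$ on $p$ processes; ideal weak scaling keeps $E_{weak}$ near 1.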
Workhorse innards: Krylov-Schwarz, a bulk synchronous implicit solver
- Idle time due to load imbalance becomes a challenge at one million cores
Our programming idiom is nested loops
Dataflow illustration - generalized eigensolver
(c/o H. Ltaief at KAUST)
These loops, with their artifactual orderings, can be expressed as a DAG
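A minimal sketch (my own illustration, not from the talk) of relaxing an artifactual loop ordering into a DAG using OpenMP 4.0 task dependences; the block structure, array names, and `update` kernel are made-up stand-ins for real work:

```c
/* DAG scheduling sketch: each block update becomes a task, and
 * "depend" clauses express only the true data dependences instead of
 * an artifactual serial loop ordering. Compile with: cc -fopenmp dag.c */
#include <stdio.h>

#define NB 4    /* number of blocks (arbitrary) */
#define BS 256  /* block size (arbitrary)       */

static double a[NB][BS], b[NB][BS];

static void update(double *dst, const double *src)
{
    for (int i = 0; i < BS; i++)
        dst[i] = 0.5 * (src[i] + dst[i]);   /* stand-in for real work */
}

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    {
        for (int k = 0; k < NB; k++) {
            /* stage 1 of block k: reads b[k], produces a[k] */
            #pragma omp task depend(in: b[k][0]) depend(out: a[k][0])
            update(a[k], b[k]);

            /* stage 2 of block k: reads a[k], produces b[k];
             * independent of every other block's tasks, so the
             * run-time is free to schedule blocks out of order */
            #pragma omp task depend(in: a[k][0]) depend(out: b[k][0])
            update(b[k], a[k]);
        }
        #pragma omp taskwait
    }
    printf("b[0][0] = %f\n", b[0][0]);
    return 0;
}
```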
Co-variance matrix inversion on hybrid CPU-GPU environment with DAG scheduling
Research in progress: locality preserving work-stealing on Cholesky solver gets 93% of DGEMM performance.
Multiphysics w/ legacy codes: an endangered species:
- The data transfer cost may be much higher than the computation cost for codes
Many codes have the algebraic and software structure of multiphysics
- Exascale is motivated by these
- These may carry auxiliary data structures to/from which blackbox model data is passed
Multiphysics layouts must invade blackboxes
- Shows good example via W. Gropp
IJHPCA Special issue: Multiphysics challenges and opportunities
Multiphysics modeling of CO2 sequestration by coupling PDEs and molecular dynamics
- MD model does not scale as well as reservoir simulator
Bad news/good news
- One may have to explicitly control data motion
- One finally will get the privilege of controlling the vertical data motion
- Optimal formulations and algorithms may lead to poorly proportioned computations for exascale hardware resource balances
- Architecture may lure users into more arithmetically intensive formulations (fast multipole, lattice Boltzmann, rather than mainly PDEs)
- Hardware nonuniformity may force abandonment of the BSP paradigm
- Hardware and algorithmic nonuniformity will be indistinguishable at the performance level
Gordon Bell prize-winners rebalance almost every iteration
- Fully deterministic algorithms may simply come to be regarded as too synchronization-vulnerable
- A rich numerical analysis of algorithms that make use of statistically inferred "missing" quantities may emerge
- Fully reliable executions may simply come to be regarded as too costly/synchronization-vulnerable
- Algorithm-based fault tolerance (ABFT) will be much cheaper than hardware and OS-mediated reliability
- Partition the computation into two sets: a small, reliable set, and a large, fast, unreliable set where errors can either be detected or rigorously bounded
- In 1956, Von Neumann gave 5 hours of lectures on "Synthesis of reliable organisms from unreliable components"
- Default use of (uniform) high precision may come to an end, as wasteful of storage and bandwidth
- Representation of a smooth function in a hierarchical basis requires fewer bits than storing its nodal values
- We may again have to compute and communicate "deltas" between states rather than the full state quantities, as we did when double precision was expensive
- A combining network node will have to remember not just the last address, but also the last values, and send just the deltas
- Equidistributing errors properly while minimizing resource use will lead to innovative error analyses in numerical analysis
WIP: Reducing precision in the new QDWH-eig eigensolver
How will PDE computations adapt?
- Programming model will still be message-passing (due to the large legacy code base), adapted to multicore or hybrid processors beneath a relaxed synchronization MPI-like interface
- Greater expressiveness in scientific programming
Evolution of Newton-Krylov-Schwarz
- Critical path of a nonlinear implicit PDE solve is essentially updating the step
- We often insert into this path things that could be done less synchronously
- Jacobian/preconditioner refresh
- convergence testing
- I/O and viz
Sources of nonuniformity
- System
- Algorithmic
- Effects of both types are similar when it comes to waiting at synchronization points ...
Programming practice
- Explore new programming models
- Attention to locality and reuse is valuable at all scales
- New algorithms and data structures can be explored under the reliable assumption that flop/s are cheap
Path for scaling up applications
- Weak scale applications up to distributed memory limits
- Strong scale applications beyond this
- Scale the workflow itself
- Algorithm-architecture co-design process is staged, with any of these types of scaling valuable by themselves
Many Required Software Enabling Technologies
- Model-related
- Development-related
- Production-related
SciDAC philosophy
- Many applications drive
- enabling technologies respond to all
DOE Exascale Mathematics Working Group
- 74 two-page whitepapers contributed by the international community to the EMWG
- Randomized algorithms
- On-the-fly data compression
- Mining massive data sets
- Algorithm-based fault tolerance
- Adaptive precision algorithms
- Concurrency from dimensions beyond space
Randomized algorithms in subspace correction methods
- Solve Ax=b by pointwise relaxation
- Gauss-Seidel (1823): deterministic and pre-ordered
- Southwell (1935): deterministic and dynamically ordered
- Griebel-Oswald (2011): random (and dynamically ordered)
- Excellent convergence w/ fault tolerance and synchronization reduction
Traditional highly-regimented and serial method can be decoupled in a concurrent way
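A minimal sketch (my own illustration, not from the talk) of the randomized pointwise-relaxation idea for Ax=b; it assumes a diagonally dominant test matrix so the sweep converges, and the sizes and values are arbitrary:

```c
/* Randomized pointwise relaxation (Gauss-Seidel-style) for Ax = b.
 * Rows are relaxed in random order rather than in a fixed sweep, which
 * removes the serial ordering and tolerates skipped or delayed updates. */
#include <stdio.h>
#include <stdlib.h>

#define N 100
#define SWEEPS 200

int main(void)
{
    static double A[N][N], b[N], x[N];

    /* Build a diagonally dominant test matrix (assumed for convergence). */
    for (int i = 0; i < N; i++) {
        b[i] = 1.0;
        for (int j = 0; j < N; j++)
            A[i][j] = (i == j) ? 2.0 * N : -1.0;
    }

    srand(12345);
    for (long it = 0; it < (long)SWEEPS * N; it++) {
        int i = rand() % N;                /* pick a random row        */
        double sigma = 0.0;
        for (int j = 0; j < N; j++)
            if (j != i) sigma += A[i][j] * x[j];
        x[i] = (b[i] - sigma) / A[i][i];   /* relax that unknown       */
    }

    /* Report the residual norm as a rough convergence check. */
    double res = 0.0;
    for (int i = 0; i < N; i++) {
        double r = b[i];
        for (int j = 0; j < N; j++) r -= A[i][j] * x[j];
        res += r * r;
    }
    printf("residual^2 = %g\n", res);
    return 0;
}
```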
Hierarchical representations for extreme data
- via Hans Bungartz
- Saving most simulation results to persistent storage will be impractical; instead hybrid in situ / in transit analysis
- Challenges: On-the-fly compression
- Storage complexity: O(n^d) -> O(n (logn)^(d-1))
- Representation accuracy: O(n^-p) -> ...
Q: What about testing and validation? A: It's harder, but if you look at uncertainty quantification, one option is to think of a computation as a "slightly unreliable map" and less of a direct operator.
Q: Would mathematicians be interested in a floating-point instruction that threw away the bottom 64 bits of a double-precision floating-point computation? A: Sure!
Programming Models for High Performance Computing, Marc Snir
- Establish a taxonomy of HPC programming models and systems
- Introduce the main programming models and systems in current use
- Understand there is no magic bullet, but trade-offs
Some taxonomy
- Programming model: set of operations and constructs used to express parallelism
- Message passing, shared memory, bulk-synchronous
- Programming system: library or language that embodies one or more programming models
- MPI, OpenMP, CUDA
- C++ and Java are systems that implement the same programming model
- MPI supports multiple programming models: message passing, one-sided communication, BSP, asynchronous
- Model: defines semantics of execution and defines performance of execution
- Usually, semantics are carefully defined, but the performance model is not
Implementation Stack
- Language & Libraries
- Executable
- High-Level execution model (defined by run-time)
- Run-time
- Low-Level execution model (defined by HW)
- Hardware
Possibly more layers
...
Communication & Synchronization
- Load/store (from memory to cache or cache to cache)
- Hardware and software cache prefetch
- Read-modify-write operations
- Mutual exclusion - non-ordering, symmetric synchronization
- Monitor/wait - ordering, asymmetric synchronization
- CPUs provide extensive support for the former; HPC software has a strong need for the latter
Node architecture -- evolution
- Increasing number of physical threads
- increasing cache levels
- lower level caches may be replaced with scratchpads
- Shared memory may not be coherent and will be NUMA
- Physical threads may not be all identical
- CPU-like vs. GPU-like physical threads
- May have same ISA and different performance, or different ISAs and different performance
Generic Hardware -- system
- Communication & Synchronization primitives:
- point-to-point: send/receive
- one-sided: put/get/accumulate
- collective: barrier, reduce, broadcast
- flat network vs. partially or completely exposed topology
Programming Models
Low-Level Programming Model
- Fixed number of threads
- Fixed number of nodes
- Performance model:
- All threads compute at same speed
- Memory is flat
- Communication time predicted by simple model (a+bm, where m is message size)
- Model can be implemented via MPI + pthreads, or MPI + restricted use of OpenMP
- Pros
- Very hardware specific, can achieve best performance on current systems
- Cons
- Very low-level, hard to get correct programs
- Very hardware specific: may only work on one machine
- Does not accommodate heterogeneous cores
- Performance model may not work in the future
Higher-level Programming Models
- Virtualize resource and have the run-time handle the mapping of virtual resources to physical resources
- "Virtual execution model" has well-defined semantics and a performance model
- Encourages/mandates use of programming patterns that reduce the likelihood of errors and facilitate virtualization
- Ensures that high-level performance model gives correct predictions
- Simplifies programming
- Provides more portability: program need not exactly match the physical resources - e.g., number of threads or nodes
- Can hide imperfections in the low-level model
(Cons)
- Can be inefficient
** You must understand the boundaries of where your model is valid **
Example: Dijkstra's Go-to Statement Considered Harmful -- Restrict to Facilitate Programming
- If a program uses gotos arbitrarily, then it is hard to understand the relation between the static program and the dynamic execution state
- If a program uses only well-nested loops and function calls, the relation is simple: execution state is defined by:
- Program counter
- Call stack
- Value of loop indices
- Much easier to understand semantics and performance
- Claim: not restrictive
Locality in Shared Memory: Virtualization and Performance model
- Caching = memory virtualization - Name (address) of variable does not determine its location; caching hardware does so
- Performance model: memory capacity is DRAM size; memory latency is L1 speed
- reasonable approximation if cache hit rate is high
- good cache hit rate is essential to performance, because of high memory latency
- 99% of power consumed in future chips is spent moving data
- For high cache hit rate, need good locality
- temporal locality, spatial locality, thread locality
- Performance model is valid only for programs with good locality
Dynamic Task Scheduling: Virtualization and Performance Model
- It is convenient to express parallel computations as a set of tasks and dependencies between tasks (task graph)
- New threads could be spawned dynamically during execution
Mapping Tasks to Physical Threads
- Work: total number of operations
- Depth: critical path length
- Desired performance model: $Time = \max(Work/P, Depth)$
- Can be achieved, with suitable restrictions on task graph
- Restrictions on task granularity: tasks need to be large enough to amortize the cost of scheduling them
- cost: overhead of scheduling + overhead of moving data to place where task is scheduled (1000s of instructions in shared memory)
- Restrictions on task graph
- static graph
- well-structured dynamic task graph
Well-Structured Program
- Task can spawn children tasks and wait for their completion
- spawn
- parallel loop
- No other dependencies exist, except parent to child
- Series-parallel (and/or) execution graph
- Parallel equivalent of a go-to less program
- Can be implemented in suitably restricted OpenMP (no synchronization between concurrent tasks, etc.); see the sketch after this list
- Scheduling can be done using work-stealing.
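A minimal sketch of the spawn/wait, series-parallel pattern in suitably restricted OpenMP (my own example; the recursive sum, array size, and grain size are arbitrary):

```c
/* Series-parallel "spawn and wait" sketch: a task spawns children and
 * waits for them; no dependences exist other than parent-to-child, so
 * the execution graph is well-structured and can be work-stolen.
 * Compile with: cc -fopenmp spawn.c */
#include <stdio.h>

static long sum(const long *v, long lo, long hi)
{
    if (hi - lo < 10000) {                 /* grain size: amortize scheduling cost */
        long s = 0;
        for (long i = lo; i < hi; i++) s += v[i];
        return s;
    }
    long mid = lo + (hi - lo) / 2, left, right;
    #pragma omp task shared(left)          /* spawn a child task            */
    left = sum(v, lo, mid);
    right = sum(v, mid, hi);               /* parent continues with the rest */
    #pragma omp taskwait                   /* wait only for its own children */
    return left + right;
}

int main(void)
{
    enum { N = 1000000 };
    static long v[N];
    for (long i = 0; i < N; i++) v[i] = 1;

    long total;
    #pragma omp parallel
    #pragma omp single
    total = sum(v, 0, N);

    printf("total = %ld (expected %d)\n", total, N);
    return 0;
}
```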
Work Stealing
- A thread appends a newly created task to local queue
- An idle thread picks a task from its local queue
- If local queue is empty, it steals a task from another queue
Pros
- Theoretically optimal and works well in practice
- Handles variance in thread speed well
Cons
- Needs many more tasks than physical threads (over-decomposition)
Can a Shared-Memory Programming Model Work on Distributed Memory?
- Naive mapping: Distributed virtual shared memory (DVSM):
- Software emulation of HW coherence
- cache line = page
- cache miss = page miss
- Abysmal performance - lines too large and coherence overhead too large
- Next step: Give up caching & coherence; give up dynamic task scheduling - Partitioned Global Address Space (PGAS)
- Languages: UPC, Co-Array Fortran
- Libraries: Global Arrays
PGAS model
- Fixed number of (single-threaded) locales
- Private memory at each locale
- Shared memory and shared arrays are partitioned across locales
- Library implementation: remote shared variables accessed via put/get operation
- Language implementation: distinct references between local and remote shared variables.
PGAS Performance Model
- Not the same as shared memory
- Temporal/spatial/thread locality do not cure the problem: there is no hardware caching and no efficient SW caching or aggregation of multiple remote accesses
- Performance model is the same as distributed memory
- In order to obtain good performance, one needs to explicitly get/put remote data in large chunks and copy it to local memory
- Open problem: can we define a "shared-memory-like" programming model for distributed memory that provides a shared-memory performance model?
How About the Vector Units?
- Low-level: vector instructions
- High-level: compiler handles
- High-level performance model: floating-point performance is a (large) fraction of vector unit performance
- Requires a suitable programming style to write vectorizable loops (see the sketch after this list)
- Direct addressing, aligned dense operands.
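A minimal sketch of a loop written in that style (my own example, not from the talk): unit stride, direct addressing, and `restrict`-qualified pointers so the compiler can prove non-aliasing and map the loop onto the vector units.

```c
/* Vectorizable loop: unit stride, direct addressing, no aliasing.
 * Compile with optimization (e.g. -O3) and check the compiler's
 * vectorization report to confirm the loop was vectorized. */
#include <stdio.h>
#include <stddef.h>

static void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];            /* dense, aligned-friendly body */
}

int main(void)
{
    enum { N = 1 << 20 };
    static float x[N], y[N];
    for (size_t i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(N, 3.0f, x, y);
    printf("y[0] = %f\n", y[0]);
    return 0;
}
```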
Distributed Memory Programming
- Shared memory - program does not explicitly control data location; focus is on control partitioning, and data follows control
- Performs well if code satisfies the restrictions listed for efficient dynamic task scheduling and locality
- Distributed memory - data location is mostly static and controlled by program; focus is on data partitioning and control follows data
- Performs well if (mostly) static control partition works and communication across data partitions is limited
Data Partitioning
- Locale = physical location (usually, a node)
- Global address = <locale id, local address>
- Global view of data: Aggregates (arrays) are partitioned across locales
- Partition is static (or slow changing)
- Predefined partition types and (possibly) user-defined partitions
- Data is accessed with "global name" (e.g. array indices), compiler translates to global address.
- Local view of data: Data is accessed using global address (if remote) or local address (if local)
- Syntactic sugar: co-arrays, with a locale index
This address computation is expensive (and even more so for fancier partitions)
Control Partitioning
- Local view of control - each locale runs its own code; executions in distinct locales synchronize explicitly
- Variant: support remote task invocation (code on one locale can spawn computation on another locale), e.g. Charm++, Chapel
- Global view of control - one global parallel loop; compiler & run-time maps each iterate to a locale chosen to minimize communication (e.g. owner compute rule)
- User can associate control with data using "on" statements
- No immediate view of which accesses are local and which are remote
This is not a settled issue.
Can a Distributed-Memory Programming Model Work on Shared Memory?
Sure - partition memory into distinct locales, and associate each locale with a fixed set of physical threads (one or more)
- loses some flexibility in dynamic scheduling
- uses shared memory inefficiently to emulate send/receive or put/get
- achieves good locality
- Reducing communication is essential in order to reduce power
- We may not have coherence
Non-Coherent, Shared-Memory-Like Programming Model
- Cache
- Local memory
- Address translation
- Transfer is implicit
- Coherence
- is essential
- could be done more efficiently
- can be done if many contexts are maintained
- too expensive and superfluous for scientific computing
Is HPC Converging to One Architecture?
- Punctuated equilibrium
Grand Dream: Not One Programming Model
A multitude of programming models, ranging from domain specific to architecture specific.
Aside by Aron: I wish he'd spent more time guiding the scientists along the programming models, instead of the taxonomy overview, which didn't do a great job of categorizing things like CUDA, OpenCL, or DSLs.
The Evolution of GPU Accelerated Computing, Steve Parker, Senior Director (HPC & Rendering)
How did your graphics card end up getting used in supercomputers?
Driving Factor: Power. An exponential increase in performance has been accompanied by an exponential increase in power.
- 200 kW CM-5 in 1993 replaced by the 8.2 MW Titan in 2012
Classic Dennard scaling: 2x more transistors by scaling chip features down 0.7x per process generation buys
- 1.4x faster transistors
- 0.7x capacitance
- 0.7x voltage
- net: ~2.5x increase in performance at constant power (vast oversimplifications ahead)
Still get more transistors and lower capacitance, but we no longer get faster transistors or decreased voltage.
Now we get 2x performance for 20-30% more power.
The High Cost of Data Movement
- 64-bit DP FMA: 20 pJ (28nm IC)
- 256-bit access to 8 kB SRAM: 50 pJ
- Moving the data halfway across the chip: 256 pJ
- One corner of the chip to the other: 1000 pJ
- Moving it off the chip (DRAM): 16000 pJ
- Efficient off-chip link: 500 pJ
If you want 50 GFLOP/s per W, your power budget is about 20 pJ per flop. Don't move data.
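That budget is just arithmetic on the stated target: 1 W / (50 × 10^9 flop/s) = 20 pJ per flop, which is roughly the cost of the DP FMA alone in the list above.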
What to Do?
- Stop making it worse
- Emergence of multicore CPUs
- HPC must go hybrid
- Do most work with cores optimized for extreme energy efficiency
Evolution of Graphics Processing (mostly skipped)
- The Pioneers: early GPGPU (2002) - Purcell, Strzodka, Rumpf, Larsen, McAllister, Thompson
- Ian Buck showed how hard it was to do GPGPU using the technologies of the day, and came up with Brook for GPUs: Stream Computing on Graphics Hardware
Ideas from Brook made it in to CUDA
GPU Computing Matures
- 2006: first CUDA-capable chips
- 2013: Kepler architecture
CUDA C Programming
- Parallel C code looks like serial C code
- CUDA parallelism and memory model: the fundamental unit of computation is a thread, which is extremely lightweight
- Threads are blocked together in a thread block and grouped together in a kernel
- Kepler SMX: power efficiency, Hyper-Q
7.1B transistors (largest processor-based chip ever built)
15 SMX units
1.3 TFLOPS FP64
1.5 MB L2 Cache
384-bit GDDR5
PCI Express Gen3
Kepler GK110 SMX vs Fermi SM
- 3x sustained perf/W
- 1.4x for free from the process
- 1.7x comes from a wider, slower chip
- Ground-up redesign for performance per Watt
Dynamic Parallelism
- Simpler code, more general, higher performance
- Capability for the GPU to launch kernels from itself
- Quicksort example: parallel recursion can be launched on the GPU; code size cut by 2x, with roughly 2x performance
Titan is running 6 flagship scientific applications
The Future of HPC Programming
- Computers are not getting faster... just wider
- Presents both a challenge and an opportunity for computational scientists
- Need to structure all HPC applications as throughput problems
- A 1 GHz clocked exascale computer implies 1-billion-way parallelism
- Locality within nodes is much more important
OpenACC Directives (Portable Parallelism)
- Programmer focuses on exposing parallelism
- Performance improvements in GPU code drive performance improvements in CPU code
Changing Computing Landscape
- Processor shipments are now dominated by ARMs: "attack of the killer cell phones"
- nVIDIA starts from Tegra devices and goes up to supercomputers
Tegra Roadmap
- Currently shipping Tegra 4
- The future is Logan: fully programmable GPU (CUDA, OpenGL 4.4, no double-precision units)
- Logan runs at 1 Watt and allegedly delivers 100 GFLOP/s per Watt
The Future of HPC is Green
- The power constraint is a Big Deal
- Must move to simpler cores for most work
- Must pay more attention to intra-node locality
- HPC is increasingly supported by consumer markets; both are driven fundamentally by power
- Unprecedented architectural convergence: GPUs are evolving into a tightly integrated, hybrid processor
- Hybrid cores, hybrid memory, integrated network
- Goal: efficient on any code with high parallelism - this is simply how computers will be built
3 Ways to Accelerate Applications on GPUs
- Libraries
- OpenACC directives
- Programming languages (CUDA, Fortran)
Conclusion: check out nVidia's programming course on Udacity and their SDK.
Q: How do GPUs do with RDMA? What about GPU Direct?
A: GPU Direct allows you to move data from a GPU to a peer device without going through CPU memory. Huge latency and convenience win.
MPI for Scalable Computing, Bill Gropp, Rusty Lusk, Rajeev Thakur
- We assume everybody has some MPI experience
- We will focus more on understanding MPI concepts than on coding details
- Emphasis will be on issues affecting scalability
- There will be some code walkthroughs and exercises
- We will use MPICH on your (Linux or MacOS) laptop for initial experiments; it supports a preliminary implementation of the new MPI-3 standard
- Vesta (BG/Q) will also be available for larger runs
What is MPI
- MPI is a message-passing library interface standard
- MPI was defined in 1994 by a broadly-based group of parallel computer vendors, computer scientists, and application developers.
- 2-year intensive process.
- Implementations appeared quickly and now MPI is taken for granted as vendor-supported software on any parallel machine
- Free, portable implementations exist for clusters and other environments (MPICH, Open MPI)
- Timeline: MPI-1 1994, MPI-2 1997, MPI-2.1 2008, MPI-2.2 2009, MPI-3 2012
Basic definitions
- process - an address space, a program, and one or more threads of control, each with its own subroutine call stack and program counter. The threads share the address space, which has advantages and disadvantages.
- an old-fashioned UNIX process is a single-threaded process
- MPI-1 - a parallel program was thought of as a collection of old-fashioned UNIX processes, each identified by its MPI rank.
- In MPI-2, semantics were defined that enable MPI processes to be multithreaded (see "hybrid programming", later this week).
Programming and Address Spaces
- Sequential programming = one single-threaded process
- Parallel programming =
- One process, multiple threads
- Multiple single-threaded processes
- Multiple multiple-threaded processes
- Shared-memory parallel programming is harder than it looks
- Yet, processes (or threads) need to communicate, or else one has just a collection of sequential programs rather than a parallel program.
- MPI is for communication among processes (with separate address spaces)
MPI Communication
- MPI limits in both time and space the exposure of one process's address space to action by (the threads of) another process
MPI Non-Blocking Communication
- MPI_Irecv exposes part of its address space to the "system" (OS + MPI implementation code + non-portable communication hardware/software)
- The "system" may utilize internal buffers, perhaps smaller than the application's buffers, requiring multiple data transfers by the system
- MPI_Isend tells the system where the data to be moved is located and into which process's receive buffer it is to be placed
- Both buffers at this point belong to the "system"
- MPI_Wait on both sides delays its caller until the system no longer needs to access the buffer.
- Receiver can now make use of the new data in the buffer
- Sender can now reuse the buffer (see the sketch below)
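A minimal sketch of the Irecv/Isend/Wait pattern just described, exchanging a buffer between two ranks (my own example; the message size and tag are arbitrary):

```c
/* Non-blocking exchange: each rank posts the receive first (exposing
 * its buffer to the "system"), starts the send, then waits until the
 * system no longer needs either buffer. Run with: mpiexec -n 2 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && rank < 2) {
        enum { N = 1 << 20 };
        static double sendbuf[N], recvbuf[N];
        for (int i = 0; i < N; i++) sendbuf[i] = rank;

        int peer = 1 - rank;
        MPI_Request req[2];

        MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);

        /* ... unrelated computation could overlap here ... */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);   /* buffers are ours again */
        printf("rank %d received %g from rank %d\n", rank, recvbuf[0], peer);
    }

    MPI_Finalize();
    return 0;
}
```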
The blocking operations (MPI_Send, MPI_Recv) can be dangerous
- The MPI Forum only included them because users of earlier systems would expect them
- Deadlock danger when exchanging large messages if both ranks send first: process 0 calls MPI_Send(to 1) then MPI_Recv(from 1), while process 1 calls MPI_Send(to 0) then MPI_Recv(from 0)
Some notes lost
Collective Operations
- MPI provides many collective communication patterns, some with computation included. Custom computation operations are possible.
- Multiple algorithms based on message sizes, machine topologies, machine capabilities.
- Scalable algorithms a research topic
- Common feature: called by all processes in a communicator
- Performance note: Measuring time taken by a collective operation can obscure what is really a load balancing problem.
- MPI-3 has non-blocking collective operations (see the sketch below)
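A minimal sketch of a collective called by all processes in the communicator, here a global sum with MPI_Allreduce; the MPI-3 non-blocking counterpart would be MPI_Iallreduce followed by a wait (my own example; the local values are arbitrary):

```c
/* Collective operation: every process in the communicator contributes
 * a local value, and every process receives the global sum. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank + 1.0, global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: global sum = %g\n", rank, global);

    MPI_Finalize();
    return 0;
}
```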
MPI-2
- Introduced dynamic process management, remote memory access, parallel I/O, thread safety, C++ (now removed), and Fortran-90 bindings.
- We won't discuss DPM here (it is not universally implemented, particularly on large systems, since it involves process management at the OS level)
- Thread safety will be covered under hybrid programming
- A little about RMA now, a lot later!
MPI-2 RMA: Remote Memory Access, or One-Sided Operations
- The RMA window object can be thought of as a generalization of the MPI-1 communication buffer
- Allocating a window object exposes a larger part of a process's address space for access by other processes, and (usually) for a longer time.
- room for multiple, simultaneously active communication buffers
- MPI window = union of all processes' window objects
- Separates "buffer" allocation, data movement initiation, and synchronization (checking for completion)
MPI_Win_create
MPI_Put
MPI_Get
MPI_Accumulate
MPI_Win_fence
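A minimal sketch of fence-synchronized one-sided access built from the calls listed above (my own example; the window size, displacement, and neighbor choice are arbitrary):

```c
/* One-sided communication: each rank exposes a small window and gets
 * its neighbor's value without the neighbor posting a matching call.
 * Run with: mpiexec -n 2 ./a.out (or more ranks). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double mine = 100.0 + rank, theirs = -1.0;
    MPI_Win win;

    /* Expose "mine" to the other processes for the lifetime of the window. */
    MPI_Win_create(&mine, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                       /* open access epoch          */
    int peer = (rank + 1) % size;
    MPI_Get(&theirs, 1, MPI_DOUBLE, peer, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                       /* close epoch: data is valid */

    printf("rank %d got %g from rank %d\n", rank, theirs, peer);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```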
MPI-2 Parallel I/O
- MPI I/O is based on an analogy: reading from and writing to files is "like" receiving and sending messages from/to the parallel file system
- Concepts from MPI-1 are reused