ATPESC Minutes, 29 July, Day 2 Afternoon
David Keyes
Algorithmic Adaptations to Extreme Scale. Paraphrasing Benjamin Franklin:
An infrastructure, if you can keep it
Baton pass from the BSP generation to the energy-aware generation
- End of Bulk-Synchronous-Processing as we know it
No final answers, but suggestions on investments for the future.
A conclusion up front
- Hope for some of today's best algorithms to make the leap
- greater concurrency
- less synchrony
- built-in resilience ("algorithm-based fault tolerance" or ABFT)
- Programming models will have to be severely stretched
- Everything should be "on the table" for trades "over the threshold"
Motivation
- High performance with high productivity on all the "multis"
- Multi-scale, multi-physics problems in multi-dimensions
- Multi-models and/or multi-levels of refinement
- Exploiting polyalgorithms in multiple precisions and multi-protocol programming styles
- Multi-core, massively multiprocessor systems
- Requiring a multi-disciplinary approach
- Really just "highest" productivity
Purpose of the presentation
- Increase quality of "co-design" dialog between application user-developers and systems software and hardware developers
- Vendors are surprisingly willing
- No longer one type of vendor
- All vendors motivated by some mass market
- low-power
- high graphics
- high reliability
- Computational scientists want all three
For background on the talk, see the International Exascale Software Roadmap (IESR)
Extrapolating exponentials eventually fails
- Scientific computing at a crossroads wrt extreme scale
- Proceeded steadily for decades from giga- (1988) to tera- (1998) to peta- (2008)
- same SPMD programming model
- same assumptions about who is responsible for what
- same classes of algorithms (25 years of Gordon Bell Prizes)
- Exa- is qualitatively different and will be much harder
- Core numerical analysis and scientific computing will confront exascale to maintain sponsor relevance
- not a distraction, but an intellectual stimulus
Relevance of exascale to users today
- Modelers are on the front line
- without concurrent research, the passage to exascale hardware will yield little new scientific fruit
- Scientists will find the computational power to do things many have wanted
- exascale at the lab means petascale on the desk
- We suggest some mathematical opportunities, after reviewing the hardware challenges
Why exa- is different
- Double-precision FMADD flop: 11 pJ; cross-die per-word access (1.2 pJ/mm): 24 pJ; ~96 pJ overall
Today's power costs per operation (c/o John Shalf)
- DP FMADD flop: 100 pJ
- DP DRAM read-to-register: 4800 pJ
- DP word transmit-to-neighbor: 7500 pJ
- DP word transmit-across-system: 9000 pJ
- Moore's Law (1965) does not end but Dennard's MOSFET scaling (1972) does
What will first "general purpose" exaflop/s machines look like?
- Hardware: many potentially exciting paths beyond today's CMOS silicon-etched logic, but not commercially at scale within the decade
- Software: MPI+X
- X is OpenMP, CUDA, OpenACC, pthreads, etc.
Current Number 1: Tianhe-2 by Inspur / NUDT / Intel
Ratios to go
- Nodes: 60
- Concurrency: 5
- Memory: 1
- Node peak: 1
- Concurrency: 320
- Memory: 40
- Peak performance: 20
- Power: (1)
Some exascale themes
- Clock rates cease to increase, arithmetic capacity continues to increase w/ concurrency
- Store capacity diverges exponentially below arithmetic intensity ...
Implications of operating on the edge
- Draconian reduction required in power per flop and per byte will make computing and copying data less reliable
- More heterogeneity in performance
- Expanding the number of nodes beyond 10^6 would not pose a serious threat to algorithms that lend themselves to well-amortized precise load balancing
- A real challenge is usefully expanding the number of cores on a node to 10^3
- It is already orders of magnitude slower to retrieve an operand from main DRAM memory than to perform an arithmetic operation, and this will get worse by another order of magnitude
Missing mathematics
- New formulations with
- greater arithmetic intensity
- reduced communication
- reduced synchronization
- assured accuracy with adaptively less fp precision
- algorithmic resilience to many types of faults
- Quantification of trades between limiting resources
Arithmetic intensity illustration
- S. Williams' "Roofline Model" (see the formula below)
- Fast multipole method buys you considerably more flops per byte
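As a reminder (a standard statement of the roofline model, not verbatim from the slides): attainable performance is bounded by $P_{attainable} = \min(P_{peak}, I \times B)$, where $I$ is the arithmetic intensity in flops per byte moved and $B$ is the sustained memory bandwidth. High-intensity kernels such as FMM sit to the right of the ridge and are compute-bound rather than bandwidth-bound.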
Research in progress: FMM vs AMG preconditioning, strong scaling on Stampede
- FMM eventually strong scales beyond algebraic multigrid
Reduction of frequency of communication and synchronization for Ax=b
- Classical: amortize communication over many power/reduce steps
- s-step Krylov
- Block Krylov
- "tall skinny QR"
- Invade classical steps
- dynamic scheduling with DAGs
- NUMA-aware work-stealing
Reduction of domain of synchronization
- Nonlinear Schwarz replaces a Newton method for a global nonlinear system, F(u) = 0, with a set of local problems on subsets of the global nonlinear system
How are most workhorse simulations implemented at the infra-petascale today?
- Iterative methods based on data decomposition and message-passing
- Many names: SPMD/BSP/CSP
- PETSc, Trilinos, etc.
Estimating scalability
- Take complexity estimates and model of the architecture
- Estimate optimal concurrency and optimal execution time
- Weak scaling (see the note below)
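As a point of reference (a standard definition, not from the slide): weak scaling fixes the problem size per process, so efficiency can be written as $E_{weak}(P) = T(N, 1) / T(P \cdot N, P)$, where $T(n, p)$ is the time to solve a problem of size $n$ on $p$ processes; ideal weak scaling keeps $E_{weak}$ near 1.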
Workhorse innards: Krylov-Schwarz, a bulk synchronous implicit solver
- Idle time due to load imbalance becomes a challenge at one million cores
Our programming idiom is nested loops
Dataflow illustration - generalized eigensolver
(c/o H. Ltaief at KAUST)
These loops, with their artifactual orderings, can be expressed as a DAG
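A minimal sketch (my own illustration, not from the talk) of relaxing an artifactual loop ordering into a DAG using OpenMP 4.0 task dependences; the block structure, array names, and `update` kernel are made-up stand-ins for real work:

```c
/* DAG scheduling sketch: each block update becomes a task, and
 * "depend" clauses express only the true data dependences instead of
 * an artifactual serial loop ordering. Compile with: cc -fopenmp dag.c */
#include <stdio.h>

#define NB 4    /* number of blocks (arbitrary) */
#define BS 256  /* block size (arbitrary)       */

static double a[NB][BS], b[NB][BS];

static void update(double *dst, const double *src)
{
    for (int i = 0; i < BS; i++)
        dst[i] = 0.5 * (src[i] + dst[i]);   /* stand-in for real work */
}

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    {
        for (int k = 0; k < NB; k++) {
            /* stage 1 of block k: reads b[k], produces a[k] */
            #pragma omp task depend(in: b[k][0]) depend(out: a[k][0])
            update(a[k], b[k]);

            /* stage 2 of block k: reads a[k], produces b[k];
             * independent of every other block's tasks, so the
             * run-time is free to schedule blocks out of order */
            #pragma omp task depend(in: a[k][0]) depend(out: b[k][0])
            update(b[k], a[k]);
        }
        #pragma omp taskwait
    }
    printf("b[0][0] = %f\n", b[0][0]);
    return 0;
}
```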
Co-variance matrix inversion on hybrid CPU-GPU environment with DAG scheduling
Research in progress: locality preserving work-stealing on Cholesky solver gets 93% of DGEMM performance.
Multiphysics w/ legacy codes: an endangered species:
- The data transfer cost may be much higher than the computation cost for codes
Many codes have the algebraic and software structure of multiphysics
- Exascale is motivated by these
- These may carry auxiliary data structures to/from which blackbox model data is passed
Multiphysics layouts must invade blackboxes
- Shows good example via W. Gropp
IJHPCA Special issue: Multiphysics challenges and opportunities
Multiphysics modeling of CO2 sequestration by coupling PDEs and molecular dynamics
- MD model does not scale as well as reservoir simulator
Bad news/good news
- One may have to explicitly control data motion
- One finally will get the privilege of controlling the vertical data motion
- Optimal formulations and algorithms may lead to poorly proportioned computations for exascale hardware resource balances
- Architecture may lure users into more arithmetically intensive formulations (fast multipole, lattice Boltzmann, rather than mainly PDEs)
- Hardware nonuniformity may force abandonment of the BSP paradigm
- Hardware and algorithmic nonuniformity will be indistinguishable at the performance level
Gordon Bell prize-winners rebalance almost every iteration
- Fully deterministic algorithms may simply come to be regarded as too synchronization-vulnerable
- A rich numerical analysis of algorithms that make use of statistically inferred "missing" quantities may emerge
- Fully reliable executions may simply come to be regarded as too costly/synchronization-vulnerable
- Algorithm-based fault tolerance (ABFT) will be much cheaper than hardware and OS-mediated reliability
- Partition the computation into two sets: a small, reliable set, and a large, fast, unreliable set where errors can either be detected or rigorously bounded
- In 1956, Von Neumann gave 5 hours of lectures on "Synthesis of reliable organisms from unreliable components"
- Default use of (uniform) high precision may come to an end, as wasteful of storage and bandwidth
- Representation of a smooth function in a hierarchical basis requires fewer bits than storing its nodal values
- We may again have to compute and communicate "deltas" between states rather than the full state quantities, as we did when double precision was expensive
- A combining network node will have to remember not just the last address, but also the last values, and send just the deltas
- Equidistributing errors properly while minimizing resource use will lead to innovative error analyses in numerical analysis
WIP: Reducing precision in the new QDWH-eig eigensolver
How will PDE computations adapt?
- Programming model will still be message-passing (due to the large legacy code base), adapted to multicore or hybrid processors beneath a relaxed synchronization MPI-like interface
- Greater expressiveness in scientific programming
Evolution of Newton-Krylov-Schwarz
- Critical path of a nonlinear implicit PDE solve is essentially updating the step
- We often insert into this path things that could be done less synchronously
- Jacobian/preconditioner refresh
- convergence testing
- I/O and viz
Sources of nonuniformity
- System
- Algorithmic
- Effects of both types are similar when it comes to waiting at synchronization points ...
Programming practice
- Explore new programming models
- Attention to locality and reuse is valuable at all scales
- New algorithms and data structures can be explored under the reliable assumption that flop/s are cheap
Path for scaling up applications
- Weak scale applications up to distributed memory limits
- Strong scale applications beyond this
- Scale the workflow itself
- Algorithm-architecture co-design process is staged, with any of these types of scaling valuable by themselves
Many Required Software Enabling Technologies
- Model-related
- Development-related
- Production-related
SciDAC philosophy
- Many applications drive
- enabling technologies respond to all
DOE Exascale Mathematics Working Group
- 74 two-page whitepapers contributed by the international community to the EMWG
- Randomized algorithms
- On-the-fly data compression
- Mining massive data sets
- Algorithm-based fault tolerance
- Adaptive precision algorithms
- Concurrency from dimensions beyond space
Randomized algorithms in subspace correction methods
- Solve Ax=b by pointwise relaxation
- Gauss-Seidel (1823): deterministic and pre-ordered
- Southwell (1935): deterministic and dynamically ordered
- Griebel-Oswald (2011): random (and dynamically ordered)
- Excellent convergence w/ fault tolerance and synchronization reduction
Traditional highly-regimented and serial method can be decoupled in a concurrent way
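A minimal sketch (my own illustration, not from the talk) of the randomized pointwise-relaxation idea for Ax=b; it assumes a diagonally dominant test matrix so the sweep converges, and the sizes and values are arbitrary:

```c
/* Randomized pointwise relaxation (Gauss-Seidel-style) for Ax = b.
 * Rows are relaxed in random order rather than in a fixed sweep, which
 * removes the serial ordering and tolerates skipped or delayed updates. */
#include <stdio.h>
#include <stdlib.h>

#define N 100
#define SWEEPS 200

int main(void)
{
    static double A[N][N], b[N], x[N];

    /* Build a diagonally dominant test matrix (assumed for convergence). */
    for (int i = 0; i < N; i++) {
        b[i] = 1.0;
        for (int j = 0; j < N; j++)
            A[i][j] = (i == j) ? 2.0 * N : -1.0;
    }

    srand(12345);
    for (long it = 0; it < (long)SWEEPS * N; it++) {
        int i = rand() % N;                /* pick a random row        */
        double sigma = 0.0;
        for (int j = 0; j < N; j++)
            if (j != i) sigma += A[i][j] * x[j];
        x[i] = (b[i] - sigma) / A[i][i];   /* relax that unknown       */
    }

    /* Report the residual norm as a rough convergence check. */
    double res = 0.0;
    for (int i = 0; i < N; i++) {
        double r = b[i];
        for (int j = 0; j < N; j++) r -= A[i][j] * x[j];
        res += r * r;
    }
    printf("residual^2 = %g\n", res);
    return 0;
}
```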
Hierarchical representations for extreme data
- via Hans Bungartz
- Saving most simulation results to persistent storage will be impractical; instead hybrid in situ / in transit analysis
- Challenges: On-the-fly compression
- Storage complexity: O(n^d) -> O(n (logn)^(d-1))
- Representation accuracy: O(n^-p) -> ...
Q: What about testing and validation? A: It's harder, but if you look at uncertainty quantification, one option is to think of a computation as a "slightly unreliable map" and less of a direct operator.
Q: Would mathematicians be interested in a floating-point instruction that threw away the bottom 64 bits of a double-precision floating-point computation? A: Sure!
Programming Models for High Performance Computing, Marc Snir
- Establish a taxonomy of HPC programming models and systems
- Introduce the main programming models and systems in current use
- Understand there is no magic bullet, but trade-offs
Some taxonomy
- Programming model: set of operations and constructs used to express parallelism
- Message passing, shared memory, bulk-synchronous
- Programming system: library or language that embodies one or more programming models
- MPI, OpenMP, CUDA
- C++ and Java are systems that implement the same programming model
- MPI supports multiple programming models: message passing, one-sided communication, BSP, asynchronous
- Model: defines semantics of execution and defines performance of execution
- Usually, semantics are carefully defined, but the performance model is not
Implementation Stack
- Language & Libraries
- Executable
- High-Level execution model (defined by run-time)
- Run-time
- Low-Level execution model (defined by HW)
- Hardware
Possibly more layers
...
Communication & Synchronization
- Load/store (from memory to cache or cache to cache)
- Hardware and software cache prefetch
- Read-modify-write operations
- Mutual exclusion - non-ordering, symmetric synchronization
- Monitor/wait - ordering, asymmetric synchronization
- CPUs provide extensive support for the former; HPC software has a strong need for the latter
Node architecture -- evolution
- Increasing number of physical threads
- increasing cache levels
- lower level caches may be replaced with scratchpads
- Shared memory may not be coherent and will be NUMA
- Physical threads may not be all identical
- CPU-like vs. GPU-like physical threads
- May have same ISA and different performance, or different ISAs and different performance
Generic Hardware -- system
- Communication & Synchronization primitives:
- point-to-point: send/receive
- one-sided: put/get/accumulate
- collective: barrier, reduce, broadcast
- flat network vs. partially or completely exposed topology
Programming Models
Low-Level Programming Model
- Fixed number of threads
- Fixed number of nodes
- Performance model:
- All threads compute at same speed
- Memory is flat
- Communication time predicted by simple model (a+bm, where m is message size)
- Model can be implemented via MPI + pthreads, or MPI + restricted use of OpenMP
- Pros
- Very hardware specific, can achieve best performance on current systems
- Cons
- Very low-level, hard to get correct programs
- Very hardware specific: may only work on one machine
- Does not accommodate heterogeneous cores
- Performance model may not work in the future
Higher-level Programming Models
- Virtualize resource and have the run-time handle the mapping of virtual resources to physical resources
- "Virtual execution model" has well-defined semantics and a performance model
- Encourages/mandates use of programming patterns that reduce the likelihood of errors and facilitate virtualization
- Ensures that high-level performance model gives correct predictions
- Simplifies programming
- Provides more portability: program need not exactly match the physical resources - e.g., number of threads or nodes
- Can hide imperfections in the low-level model
(Cons)
- Can be inefficient
** You must understand the boundaries of where your model is valid **
Example: Dijkstra's Go-to Statement Considered Harmful -- Restrict to Facilitate Programming
- If a program uses gotos arbitrarily, then it is hard to understand the relation between the static program and the dynamic execution state
- If a program uses only well-nested loops and function calls, the relation is simple: execution state is defined by:
- Program counter
- Call stack
- Value of loop indices
- Much easier to understand semantics and performance
- Claim: not restrictive
Locality in Shared Memory: Virtualization and Performance model
- Caching = memory virtualization - Name (address) of variable does not determine its location; caching hardware does so
- Performance model: memory capacity is DRAM size; memory latency is L1 speed
- reasonable approximation if cache hit rate is high
- good cache hit rate is essential to performance, because of high memory latency
- 99% of power consumed in future chips is spent moving data
- For high cache hit rate, need good locality
- temporal locality, spatial locality, thread locality
- Performance model is valid only for programs with good locality
Dynamic Task Scheduling: Virtualization and Performance Model
- It is convenient to express parallel computations as a set of tasks and dependencies between tasks (task graph)
- New threads could be spawned dynamically during execution
Mapping Tasks to Physical Threads
- Work: total number of operations
- Depth: critical path length
- Desired performance model: $Time = \max(Work/P, Depth)$
- Can be achieved, with suitable restrictions on task graph
- Restrictions on task granularity: tasks need to be large enough to amortize the cost of scheduling them
- cost: overhead of scheduling + overhead of moving data to place where task is scheduled (1000s of instructions in shared memory)
- Restrictions on task graph
- static graph
- well-structured dynamic task graph
Well-Structured Program
- Task can spawn children tasks and wait for their completion
- spawn
- parallel loop
- No other dependencies exist, except parent to child
- Series-parallel (and/or) execution graph
- Parallel equivalent of a go-to less program
- Can be implemented in suitably restricted OpenMP (no synchronization between concurrent tasks, etc.); see the sketch after this list
- Scheduling can be done using work-stealing.
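A minimal sketch of the spawn/wait, series-parallel pattern in suitably restricted OpenMP (my own example; the recursive sum, array size, and grain size are arbitrary):

```c
/* Series-parallel "spawn and wait" sketch: a task spawns children and
 * waits for them; no dependences exist other than parent-to-child, so
 * the execution graph is well-structured and can be work-stolen.
 * Compile with: cc -fopenmp spawn.c */
#include <stdio.h>

static long sum(const long *v, long lo, long hi)
{
    if (hi - lo < 10000) {                 /* grain size: amortize scheduling cost */
        long s = 0;
        for (long i = lo; i < hi; i++) s += v[i];
        return s;
    }
    long mid = lo + (hi - lo) / 2, left, right;
    #pragma omp task shared(left)          /* spawn a child task            */
    left = sum(v, lo, mid);
    right = sum(v, mid, hi);               /* parent continues with the rest */
    #pragma omp taskwait                   /* wait only for its own children */
    return left + right;
}

int main(void)
{
    enum { N = 1000000 };
    static long v[N];
    for (long i = 0; i < N; i++) v[i] = 1;

    long total;
    #pragma omp parallel
    #pragma omp single
    total = sum(v, 0, N);

    printf("total = %ld (expected %d)\n", total, N);
    return 0;
}
```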
Work Stealing
- A thread appends a newly created task to local queue
- An idle thread picks a task from its local queue
- If local queue is empty, it steals a task from another queue
Pros
- Theoretically optimal and works well in practice
- Handles variance in thread speed well
Cons
- Needs many more tasks than physical threads (over-decomposition)
Can a Shared-Memory Programming Model Work on Distributed Memory?
- Naive mapping: Distributed virtual shared memory (DVSM):
- Software emulation of HW coherence
- cache line = page
- cache miss = page miss
- Abysmal performance - lines too large and coherence overhead too large
- Next step: Give up caching & coherence; give up dynamic task scheduling - Partitioned Global Address Space (PGAS)
- Languages: UPC, Co-Array Fortran
- Libraries: Global Arrays
PGAS model
- Fixed number of (single-threaded) locales
- Private memory at each locale
- Shared memory and shared arrays are partitioned across locales
- Library implementation: remote shared variables accessed via put/get operation
- Language implementation: distinct references between local and remote shared variables.
PGAS Performance Model
- Not the same as shared memory
- Temporal/spatial/thread locality do not cure the problem: there is no hardware caching and no efficient SW caching or aggregation of multiple remote accesses
- Performance model is the same as distributed memory
- In order to obtain good performance, one needs to explicitly get/put remote data in large chunks and copy it to local memory
- Open problem: can we define a "shared-memory-like" programming model for distributed memory that provides a shared-memory performance model?
How About the Vector Units?
- Low-level: vector instructions
- High-level: compiler handles
- High-level performance model: floating-point performance is a (large) fraction of vector unit performance
- Requires a suitable programming style to write vectorizable loops (see the sketch after this list)
- Direct addressing, aligned dense operands.
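A minimal sketch of a loop written in that style (my own example, not from the talk): unit stride, direct addressing, and `restrict`-qualified pointers so the compiler can prove non-aliasing and map the loop onto the vector units.

```c
/* Vectorizable loop: unit stride, direct addressing, no aliasing.
 * Compile with optimization (e.g. -O3) and check the compiler's
 * vectorization report to confirm the loop was vectorized. */
#include <stdio.h>
#include <stddef.h>

static void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];            /* dense, aligned-friendly body */
}

int main(void)
{
    enum { N = 1 << 20 };
    static float x[N], y[N];
    for (size_t i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(N, 3.0f, x, y);
    printf("y[0] = %f\n", y[0]);
    return 0;
}
```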
Distributed Memory Programming
- Shared memory - program does not explicitly control data location; focus is on control partitioning, and data follows control
- Performs well if code satisfies the restrictions listed for efficient dynamic task scheduling and locality
- Distributed memory - data location is mostly static and controlled by program; focus is on data partitioning and control follows data
- Performs well if (mostly) static control partition works and communication across data partitions is limited
Data Partitioning
- Locale = physical location (usually, a node)
- Global address = <locale id, local address>
- Global view of data: Aggregates (arrays) are partitioned across locales
- Partition is static (or slow changing)
- Predefined partition types and (possibly) user-defined partitions
- Data is accessed with "global name" (e.g. array indices), compiler translates to global address.
- Local view of data: Data is accessed using global address (if remote) or local address (if local)
- Syntactic sugar: co-arrays, with a locale index
This address computation is expensive (and even more so for fancier partitions)
Control Partitioning
- Local view of control - each locale runs its own code; executions in distinct locales synchronize explicitly
- Variant: support remote task invocation (code on one locale can spawn computation on another locale), e.g. Charm++, Chapel
- Global view of control - one global parallel loop; compiler & run-time maps each iterate to a locale chosen to minimize communication (e.g. owner compute rule)
- User can associate control with data using "on" statements
- No immediate view of which accesses are local and which are remote
This is not a settled issue.
Can a Distributed-Memory Programming Model Work on Shared Memory?
Sure - partition memory into distinct locales, and associate each locale with a fixed set of physical threads (one or more)
- loses some flexibility in dynamic scheduling
- uses shared memory inefficiently to emulate send/receive or put/get
- achieves good locality
- Reducing communication is essential in order to reduce power
- We may not have coherence
Non-Coherent, Shared-Memory-Like Programming Model
- Cache
- Local memory
- Address translation
- Transfer is implicit
- Coherence
- is essential
- could be done more efficiently
- can be done if many contexts are maintained
- too expensive and superfluous for scientific computing
Is HPC Converging to One Architecture?
- Punctuated equilibrium
Grand Dream: Not One Programming Model
A multitude of programming models, ranging from domain specific to architecture specific.
Aside by Aron: I wish he'd spent more time guiding the scientists along the programming models, instead of the taxonomy overview, which didn't do a great job of categorizing things like CUDA, OpenCL, or DSLs.
The Evolution of GPU Accelerated Computing, Steve Parker, Senior Director (HPC & Rendering)
How did your graphics card end up getting used in supercomputers?
Driving Factor: Power. An exponential increase in performance has been accompanied by an exponential increase in power.
- 200 kW CM-5 in 1993 replaced by the 8.2 MW Titan in 2012
Classic Dennard scaling: 2x more transistors by scaling chip features down 0.7x per process generation buys
- 1.4x faster transistors
- 0.7x capacitance
- 0.7x voltage
- net: ~2.5x increase in performance at constant power (vast oversimplifications ahead)
Still get more transistors and lower capacitance, but we no longer get faster transistors or decreased voltage.
Now we get 2x performance for 20-30% more power.
The High Cost of Data Movement
- 64-bit DP FMA: 20 pJ (28nm IC)
- 256-bit access to 8 kB SRAM: 50 pJ
- Moving the data halfway across the chip: 256 pJ
- One corner of the chip to the other: 1000 pJ
- Moving it off the chip (DRAM): 16000 pJ
- Efficient off-chip link: 500 pJ
If you want 50 GFLOP/s per W, your power budget is about 20 pJ per flop. Don't move data.
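That budget is just arithmetic on the stated target: 1 W / (50 × 10^9 flop/s) = 20 pJ per flop, which is roughly the cost of the DP FMA alone in the list above.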
What to Do?
- Stop making it worse
- Emergence of multicore CPUs
- HPC must go hybrid
- Do most work with cores optimized for extreme energy efficiency
Evolution of Graphics Processing (mostly skipped)
- The Pioneers: early GPGPU (2002) - Purcell, Strzodka, Rumpf, Larsen, McAllister, Thompson
- Ian Buck showed how hard it was to do GPGPU using the technologies of the day, and came up with Brook for GPUs: Stream Computing on Graphics Hardware
Ideas from Brook made it in to CUDA
GPU Computing Matures
- 2006: first CUDA-capable chips
- 2013: Kepler architecture
CUDA C Programming
- Parallel C code looks like serial C code
- CUDA parallelism and memory model: the fundamental unit of computation is a thread, which is extremely lightweight
- Threads are blocked together in a thread block and grouped together in a kernel
- Kepler SMX: power efficiency, Hyper-Q
7.1B transistors (largest processor-based chip ever built)
15 SMX units
1.3 TFLOPS FP64
1.5 MB L2 Cache
384-bit GDDR5
PCI Express Gen3
Kepler GK110 SMX vs Fermi SM
- 3x sustained perf/W
- 1.4x for free from the process
- 1.7x comes from a wider, slower chip
- Ground-up redesign for performance per Watt
Dynamic Parallelism
- Simpler code, more general, higher performance
- Capability for the GPU to launch kernels from itself
- Quicksort example: parallel recursion can be launched on the GPU; code size cut by 2x, with roughly 2x performance
Titan is running 6 flagship scientific applications
The Future of HPC Programming
- Computers are not getting faster... just wider
- Presents both a challenge and an opportunity for computational scientists
- Need to structure all HPC applications as throughput problems
- A 1 GHz clocked exascale computer implies 1-billion-way parallelism
- Locality within nodes is much more important
OpenACC Directives (Portable Parallelism)
- Programmer focuses on exposing parallelism
- Performance improvements in GPU code drive performance improvements in CPU code
Changing Computing Landscape
- Processor shipments are now dominated by ARMs: "attack of the killer cell phones"
- nVIDIA starts from Tegra devices and goes up to supercomputers
Tegra Roadmap
- Currently shipping Tegra 4
- The future is Logan: fully programmable GPU (CUDA, OpenGL 4.4, no double-precision units)
- Logan runs at 1 Watt and allegedly delivers 100 GFLOP/s per Watt
The Future of HPC is Green
- The power constraint is a Big Deal
- Must move to simpler cores for most work
- Must pay more attention to intra-node locality
- HPC is increasingly supported by consumer markets; both are driven fundamentally by power
- Unprecedented architectural convergence: GPUs are evolving into a tightly integrated, hybrid processor
- Hybrid cores, hybrid memory, integrated network
- Goal: efficient on any code with high parallelism - this is simply how computers will be built
3 Ways to Accelerate Applications on GPUs
- Libraries
- OpenACC directives
- Programming languages (CUDA, Fortran)
Conclusion: check out nVidia's programming course on Udacity and their SDK.
Q: How do GPUs do with RDMA? What about GPU Direct?
A: GPU Direct allows you to move data from a GPU to a peer device without going through CPU memory. Huge latency and convenience win.
MPI for Scalable Computing, Bill Gropp, Rusty Lusk, Rajeev Thakur
- We assume everybody has some MPI experience
- We will focus more on understanding MPI concepts than on coding details
- Emphasis will be on issues affecting scalability
- There will be some code walkthroughs and exercises
- We will use MPICH on your (Linux or MacOS) laptop for initial experiments; it supports a preliminary implementation of the new MPI-3 standard
- Vesta (BG/Q) will also be available for larger runs
What is MPI
- MPI is a message-passing library interface standard
- MPI was defined in 1994 by a broadly-based group of parallel computer vendors, computer scientists, and application developers.
- 2-year intensive process.
- Implementations appeared quickly and now MPI is taken for granted as vendor-supported software on any parallel machine
- Free, portable implementations exist for clusters and other environments (MPICH, Open MPI)
- Timeline: MPI-1 1994, MPI-2 1997, MPI-2.1 2008, MPI-2.2 2009, MPI-3 2012
Basic definitions
- process - an address space, a program, and one or more threads of control, each with its own subroutine call stack and program counter. The threads share the address space, which has advantages and disadvantages.
- an old-fashioned UNIX process is a single-threaded process
- MPI-1 - a parallel program was thought of as a collection of old-fashioned UNIX processes, each identified by its MPI rank.
- In MPI-2, semantics were defined that enable MPI processes to be multithreaded (see "hybrid programming", later this week).
Programming and Address Spaces
- Sequential programming = one single-threaded process
- Parallel programming =
- One process, multiple threads
- Multiple single-threaded processes
- Multiple multiple-threaded processes
- Shared-memory parallel programming is harder than it looks
- Yet, processes (or threads) need to communicate, or else one has just a collection of sequential programs rather than a parallel program.
- MPI is for communication among processes (with separate address spaces)
MPI Communication
- MPI limits in both time and space the exposure of one process's address space to action by (the threads of) another process
MPI Non-Blocking Communication
- MPI_Irecv exposes part of its address space to the "system" (OS + MPI implementation code + non-portable communication hardware/software)
- The "system" may utilize internal buffers, perhaps smaller than the application's buffers, requiring multiple data transfers by the system
- MPI_Isend tells the system where the data to be moved is located and into which process's receive buffer it is to be placed
- Both buffers at this point belong to the "system"
- MPI_Wait on both sides delays its caller until the system no longer needs to access the buffer.
- Receiver can now make use of the new data in the buffer
- Sender can now reuse the buffer (see the sketch below)
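A minimal sketch of the Irecv/Isend/Wait pattern just described, exchanging a buffer between two ranks (my own example; the message size and tag are arbitrary):

```c
/* Non-blocking exchange: each rank posts the receive first (exposing
 * its buffer to the "system"), starts the send, then waits until the
 * system no longer needs either buffer. Run with: mpiexec -n 2 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && rank < 2) {
        enum { N = 1 << 20 };
        static double sendbuf[N], recvbuf[N];
        for (int i = 0; i < N; i++) sendbuf[i] = rank;

        int peer = 1 - rank;
        MPI_Request req[2];

        MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);

        /* ... unrelated computation could overlap here ... */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);   /* buffers are ours again */
        printf("rank %d received %g from rank %d\n", rank, recvbuf[0], peer);
    }

    MPI_Finalize();
    return 0;
}
```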
The blocking operations (MPI_Send, MPI_Recv) can be dangerous
- The MPI Forum only included them because users of earlier systems would expect them
- Deadlock danger when exchanging large messages if both ranks send first: process 0 calls MPI_Send(to 1) then MPI_Recv(from 1), while process 1 calls MPI_Send(to 0) then MPI_Recv(from 0)
Some notes lost
Collective Operations
- MPI provides many collective communication patterns, some with computation included. Custom computation operations are possible.
- Multiple algorithms based on message sizes, machine topologies, machine capabilities.
- Scalable algorithms a research topic
- Common feature: called by all processes in a communicator
- Performance note: Measuring time taken by a collective operation can obscure what is really a load balancing problem.
- MPI-3 has non-blocking collective operations (see the sketch below)
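A minimal sketch of a collective called by all processes in the communicator, here a global sum with MPI_Allreduce; the MPI-3 non-blocking counterpart would be MPI_Iallreduce followed by a wait (my own example; the local values are arbitrary):

```c
/* Collective operation: every process in the communicator contributes
 * a local value, and every process receives the global sum. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank + 1.0, global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: global sum = %g\n", rank, global);

    MPI_Finalize();
    return 0;
}
```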
MPI-2
- Introduced dynamic process management, remote memory access, parallel I/O, thread safety, C++ (now removed), and Fortran-90 bindings.
- We won't discuss DPM here (it is not universally implemented, particularly on large systems, since it involves process management at the OS level)
- Thread safety will be covered under hybrid programming
- A little about RMA now, a lot later!
MPI-2 RMA: Remote Memory Access, or One-Sided Operations
- The RMA window object can be thought of as a generalization of the MPI-1 communication buffer
- Allocating a window object exposes a larger part of a process's address space for access by other processes, and (usually) for a longer time.
- room for multiple, simultaneously active communication buffers
- MPI window = union of all processes' window objects
- Separates "buffer" allocation, data movement initiation, and synchronization (checking for completion)
MPI_Win_create
MPI_Put
MPI_Get
MPI_Accumulate
MPI_Win_fence
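A minimal sketch of fence-synchronized one-sided access built from the calls listed above (my own example; the window size, displacement, and neighbor choice are arbitrary):

```c
/* One-sided communication: each rank exposes a small window and gets
 * its neighbor's value without the neighbor posting a matching call.
 * Run with: mpiexec -n 2 ./a.out (or more ranks). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double mine = 100.0 + rank, theirs = -1.0;
    MPI_Win win;

    /* Expose "mine" to the other processes for the lifetime of the window. */
    MPI_Win_create(&mine, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                       /* open access epoch          */
    int peer = (rank + 1) % size;
    MPI_Get(&theirs, 1, MPI_DOUBLE, peer, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                       /* close epoch: data is valid */

    printf("rank %d got %g from rank %d\n", rank, theirs, peer);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```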
MPI-2 Parallel I/O
- MPI I/O is based on an analogy: reading from and writing to files is "like" receiving and sending messages from/to the parallel file system
- Concepts from MPI-1 are reused