MLSys - yszheda/wiki GitHub Wiki
- https://infrasys-ai.github.io/
- https://mlsys-learner-resources.github.io/Awesome-MLSys-Blogger/
- https://github.com/gpu-mode/awesomeMLSys
Parallelization
https://www.cs.cmu.edu/~zhihaoj2/15-779/slides/13-ML-parallelization-part2.pdf
Data parallelism
Model parallelism
Inter-Operator Parallelism
Synchronous Pipeline Parallel Schedules
- Pros:
  - Keeps the convergence semantics: training proceeds exactly as it would on a single device.
- Cons:
  - Pipeline bubbles.
  - Reducing pipeline bubbles typically requires splitting the input into smaller micro-batches, but inputs that are too small reduce hardware efficiency.
GPipe
Improving Pipeline Parallelism Efficiency
m: number of micro-batches in a mini-batch
- Increase mini-batch size or reduce micro-batch size
  - Caveats:
    - Large mini-batch sizes can lead to accuracy loss.
    - Small micro-batch sizes reduce GPU utilization.
p: number of pipeline stages
- Decrease pipeline depth
  - Caveat: increases the size of each stage.
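As a rough model of a synchronous GPipe-style schedule: with p equal-latency stages and m micro-batches, a forward (or backward) sweep occupies m + p − 1 time slots, of which p − 1 are idle, so the bubble fraction is (p − 1)/(m + p − 1). A minimal sketch, assuming equal stage latencies and ignoring communication:

```python
def gpipe_bubble_fraction(p: int, m: int) -> float:
    """Fraction of idle time in a synchronous GPipe-style pipeline,
    assuming all p stages take equal time and m micro-batches per
    mini-batch (communication overhead ignored)."""
    return (p - 1) / (m + p - 1)

# Increasing m (more micro-batches) shrinks the bubble; increasing p grows it.
print(gpipe_bubble_fraction(p=4, m=4))   # 3/7 ~= 0.43
print(gpipe_bubble_fraction(p=4, m=32))  # 3/35 ~= 0.086
```

This makes the trade-off above concrete: growing m reduces the bubble, but only by shrinking micro-batches (hurting utilization) or growing the mini-batch (risking accuracy loss).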
1F1B (1 Forward 1 Backward) Schedule
Interleaved 1F1B
- Pro: Higher pipeline efficiency with fewer pipeline bubbles.
- Con: More communication overhead between stages.
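In the Megatron-LM analysis, giving each device v interleaved model chunks shrinks the bubble time (relative to ideal compute time) to roughly (1/v)·(p − 1)/m, at the cost of v times more inter-stage communication. An illustrative sketch of that approximation:

```python
def interleaved_bubble_ratio(p: int, m: int, v: int) -> float:
    """Approximate ratio of bubble time to ideal compute time for an
    interleaved 1F1B schedule with v model chunks per device (following
    the Megatron-LM analysis); v=1 recovers the plain (p-1)/m estimate."""
    return (p - 1) / (v * m)

print(interleaved_bubble_ratio(p=4, m=8, v=1))  # 0.375
print(interleaved_bubble_ratio(p=4, m=8, v=2))  # 0.1875, half the bubble
```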
TeraPipe
Chimera
Asynchronous Pipeline Parallel Schedules
- Pros:
  - No pipeline bubbles.
- Cons:
  - Breaks the synchronous training semantics: training now involves stale gradients.
  - Algorithms may store multiple versions of the model weights for consistency.
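A minimal toy sketch of the weight-versioning idea (PipeDream-style weight stashing; the class and scalar "layer" below are hypothetical, not any library's API). Each micro-batch stashes the weight version used in its forward pass, so the matching backward pass is computed against those same, now stale, weights even after the live weights have been updated:

```python
class StageWithWeightStashing:
    """Toy pipeline stage (y = w * x) illustrating weight stashing:
    forward records the weight version it used; the matching backward
    reuses that (possibly stale) version, keeping forward and backward
    consistent even though newer weights already exist."""

    def __init__(self, weight: float):
        self.weight = weight
        self.stash = {}  # micro-batch id -> weight version used in forward

    def forward(self, mb_id: int, x: float) -> float:
        self.stash[mb_id] = self.weight
        return self.stash[mb_id] * x

    def backward(self, mb_id: int, x: float, grad_out: float,
                 lr: float = 0.1) -> float:
        w = self.stash.pop(mb_id)   # stale weight version from forward
        grad_w = grad_out * x       # dL/dw for y = w * x
        grad_in = grad_out * w      # dL/dx uses the stashed (stale) w
        self.weight -= lr * grad_w  # update the live weight
        return grad_in

stage = StageWithWeightStashing(weight=1.0)
stage.forward(0, x=2.0)                 # uses w = 1.0
stage.forward(1, x=3.0)                 # also uses w = 1.0 (no update yet)
stage.backward(0, x=2.0, grad_out=1.0)  # live weight becomes 0.8
g = stage.backward(1, x=3.0, grad_out=1.0)
print(g)  # 1.0: computed with stashed w = 1.0, not the live 0.8
```

The stash is what buys consistency; its cost is the extra memory for one weight copy per in-flight micro-batch.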
AMPNet
Pipedream
Pipedream-2BW
Imbalanced Pipeline Stages
Automatic Stage Partitioning
Goal: Minimize maximum stage latency & maximize parallelization
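For a chain of layers, minimizing the maximum stage latency is the classic linear-partition problem and can be solved exactly with dynamic programming. A toy sketch (real partitioners also model communication and memory, which this omits):

```python
from functools import lru_cache

def partition_stages(latencies, p):
    """Split a chain of per-layer latencies into p contiguous pipeline
    stages, minimizing the maximum stage latency.  Returns that minimum
    max-latency (toy DP; communication/memory costs omitted)."""
    n = len(latencies)
    prefix = [0.0]
    for t in latencies:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(i, k):
        # Min achievable max-stage-latency for layers i..n-1 with k stages.
        if k == 1:
            return prefix[n] - prefix[i]
        result = float("inf")
        for j in range(i + 1, n - k + 2):  # first stage = layers i..j-1
            stage = prefix[j] - prefix[i]
            result = min(result, max(stage, best(j, k - 1)))
        return result

    return best(0, p)

# Best 2-way split of [2,3,1,4,2] is [2,3,1 | 4,2]: max stage latency 6.
print(partition_stages([2, 3, 1, 4, 2], p=2))  # 6.0
```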
Intra-op Parallelism
Parallelize One Operator
Parallelize All Operators in a Graph
Minimize Node costs (computation + communication) + Edge costs (re-partition communication)
Solution:
- Manual design
- Randomized search
- Dynamic programming
- Integer linear programming
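For a chain-structured graph, the dynamic-programming variant is straightforward: choose one partition strategy per operator, paying its node cost plus a re-partition (resharding) cost on each edge, and sweep left to right Viterbi-style. An illustrative sketch with made-up cost tables:

```python
def min_total_cost(node_costs, edge_costs):
    """node_costs[i][s]: compute + communication cost of op i under
    strategy s; edge_costs[i][s][t]: re-partition cost between strategy
    s on op i and strategy t on op i+1.  Returns the minimum total cost
    over all strategy assignments (DP over the chain)."""
    dp = list(node_costs[0])  # best cost ending at op 0, per strategy
    for i in range(1, len(node_costs)):
        dp = [
            node_costs[i][t] + min(dp[s] + edge_costs[i - 1][s][t]
                                   for s in range(len(dp)))
            for t in range(len(node_costs[i]))
        ]
    return min(dp)

# Two ops, two strategies each (e.g. row- vs column-partitioned).
node_costs = [[1.0, 2.0], [2.0, 1.0]]
edge_costs = [[[0.0, 3.0],   # keeping the same layout is free;
               [3.0, 0.0]]]  # switching layouts pays a reshard cost
print(min_total_cost(node_costs, edge_costs))  # 3.0
```

General graphs are harder (hence the ILP and randomized-search approaches above), but the chain case shows how node and edge costs interact.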
Model-specific Intra-op Parallel Strategies
AlexNet: Assign a group convolution layer to 2 GPUs
Megatron-LM
GShard MoE
ZeRO Optimizer
Mesh-Tensorflow
GSPMD
Tofu
FlexFlow
Auto-parallelization
- Search-based methods
  - MCMC
  - Heuristics
- Learning-based methods
  - Reinforcement learning
  - ML-based cost model
  - Bayesian optimization
- Optimization-based methods
  - Dynamic programming
  - Integer linear programming
  - Hierarchical optimization