MLSys - yszheda/wiki GitHub Wiki
- https://infrasys-ai.github.io/
- https://mlsys-learner-resources.github.io/Awesome-MLSys-Blogger/
- https://github.com/gpu-mode/awesomeMLSys
Parallelization
https://www.cs.cmu.edu/~zhihaoj2/15-779/slides/13-ML-parallelization-part2.pdf
Data parallelism
Model parallelism
Inter-Operator Parallelism
Synchronous Pipeline Parallel Schedules
- Pros:
  - Keeps the convergence semantics: training proceeds exactly as it would on a single device.
- Cons:
  - Pipeline bubbles.
  - Reducing pipeline bubbles typically requires splitting the input into smaller micro-batches, but inputs that are too small reduce hardware efficiency.
GPipe
Improving Pipeline Parallelism Efficiency
m: number of micro-batches in a mini-batch
- Increase mini-batch size or reduce micro-batch size
  - Caveats:
    - Large mini-batch sizes can lead to accuracy loss.
    - Small micro-batch sizes reduce GPU utilization.
p: number of pipeline stages
- Decrease pipeline depth
  - Caveat: increases the size of each stage.
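As a rough model of a synchronous GPipe-style schedule: with p equal-latency stages and m micro-batches, a forward (or backward) sweep occupies m + p − 1 time slots, of which p − 1 are idle, so the bubble fraction is (p − 1)/(m + p − 1). A minimal sketch, assuming equal stage latencies and ignoring communication:

```python
def gpipe_bubble_fraction(p: int, m: int) -> float:
    """Fraction of idle time in a synchronous GPipe-style pipeline,
    assuming all p stages take equal time and m micro-batches per
    mini-batch (communication overhead ignored)."""
    return (p - 1) / (m + p - 1)

# Increasing m (more micro-batches) shrinks the bubble; increasing p grows it.
print(gpipe_bubble_fraction(p=4, m=4))   # 3/7 ~= 0.43
print(gpipe_bubble_fraction(p=4, m=32))  # 3/35 ~= 0.086
```

This makes the trade-off above concrete: growing m reduces the bubble, but only by shrinking micro-batches (hurting utilization) or growing the mini-batch (risking accuracy loss).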
1F1B (1 Forward 1 Backward) Schedule
Interleaved 1F1B
- Pro: Higher pipeline efficiency with fewer pipeline bubbles.
- Con: More communication overhead between stages.
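In the Megatron-LM analysis, giving each device v interleaved model chunks shrinks the bubble time (relative to ideal compute time) to roughly (1/v)·(p − 1)/m, at the cost of v times more inter-stage communication. An illustrative sketch of that approximation:

```python
def interleaved_bubble_ratio(p: int, m: int, v: int) -> float:
    """Approximate ratio of bubble time to ideal compute time for an
    interleaved 1F1B schedule with v model chunks per device (following
    the Megatron-LM analysis); v=1 recovers the plain (p-1)/m estimate."""
    return (p - 1) / (v * m)

print(interleaved_bubble_ratio(p=4, m=8, v=1))  # 0.375
print(interleaved_bubble_ratio(p=4, m=8, v=2))  # 0.1875, half the bubble
```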
TeraPipe
Chimera
Asynchronous Pipeline Parallel Schedules
- Pros:
  - No pipeline bubbles.
- Cons:
  - Breaks the synchronous training semantics: training now involves stale gradients.
  - Algorithms may store multiple versions of the model weights for consistency.
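A minimal toy sketch of the weight-versioning idea (PipeDream-style weight stashing; the class and scalar "layer" below are hypothetical, not any library's API). Each micro-batch stashes the weight version used in its forward pass, so the matching backward pass is computed against those same, now stale, weights even after the live weights have been updated:

```python
class StageWithWeightStashing:
    """Toy pipeline stage (y = w * x) illustrating weight stashing:
    forward records the weight version it used; the matching backward
    reuses that (possibly stale) version, keeping forward and backward
    consistent even though newer weights already exist."""

    def __init__(self, weight: float):
        self.weight = weight
        self.stash = {}  # micro-batch id -> weight version used in forward

    def forward(self, mb_id: int, x: float) -> float:
        self.stash[mb_id] = self.weight
        return self.stash[mb_id] * x

    def backward(self, mb_id: int, x: float, grad_out: float,
                 lr: float = 0.1) -> float:
        w = self.stash.pop(mb_id)   # stale weight version from forward
        grad_w = grad_out * x       # dL/dw for y = w * x
        grad_in = grad_out * w      # dL/dx uses the stashed (stale) w
        self.weight -= lr * grad_w  # update the live weight
        return grad_in

stage = StageWithWeightStashing(weight=1.0)
stage.forward(0, x=2.0)                 # uses w = 1.0
stage.forward(1, x=3.0)                 # also uses w = 1.0 (no update yet)
stage.backward(0, x=2.0, grad_out=1.0)  # live weight becomes 0.8
g = stage.backward(1, x=3.0, grad_out=1.0)
print(g)  # 1.0: computed with stashed w = 1.0, not the live 0.8
```

The stash is what buys consistency; its cost is the extra memory for one weight copy per in-flight micro-batch.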
AMPNet
Pipedream
Pipedream-2BW
Imbalanced Pipeline Stages
Automatic Stage Partitioning
Goal: Minimize maximum stage latency & maximize parallelization
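For a chain of layers, minimizing the maximum stage latency is the classic linear-partition problem and can be solved exactly with dynamic programming. A toy sketch (real partitioners also model communication and memory, which this omits):

```python
from functools import lru_cache

def partition_stages(latencies, p):
    """Split a chain of per-layer latencies into p contiguous pipeline
    stages, minimizing the maximum stage latency.  Returns that minimum
    max-latency (toy DP; communication/memory costs omitted)."""
    n = len(latencies)
    prefix = [0.0]
    for t in latencies:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(i, k):
        # Min achievable max-stage-latency for layers i..n-1 with k stages.
        if k == 1:
            return prefix[n] - prefix[i]
        result = float("inf")
        for j in range(i + 1, n - k + 2):  # first stage = layers i..j-1
            stage = prefix[j] - prefix[i]
            result = min(result, max(stage, best(j, k - 1)))
        return result

    return best(0, p)

# Best 2-way split of [2,3,1,4,2] is [2,3,1 | 4,2]: max stage latency 6.
print(partition_stages([2, 3, 1, 4, 2], p=2))  # 6.0
```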
Intra-op Parallelism
Parallelize One Operator
Parallelize All Operators in a Graph
Minimize Node costs (computation + communication) + Edge costs (re-partition communication)
Solution:
- Manual design
- Randomized search
- Dynamic programming
- Integer linear programming
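For a chain-structured graph, the dynamic-programming variant is straightforward: choose one partition strategy per operator, paying its node cost plus a re-partition (resharding) cost on each edge, and sweep left to right Viterbi-style. An illustrative sketch with made-up cost tables:

```python
def min_total_cost(node_costs, edge_costs):
    """node_costs[i][s]: compute + communication cost of op i under
    strategy s; edge_costs[i][s][t]: re-partition cost between strategy
    s on op i and strategy t on op i+1.  Returns the minimum total cost
    over all strategy assignments (DP over the chain)."""
    dp = list(node_costs[0])  # best cost ending at op 0, per strategy
    for i in range(1, len(node_costs)):
        dp = [
            node_costs[i][t] + min(dp[s] + edge_costs[i - 1][s][t]
                                   for s in range(len(dp)))
            for t in range(len(node_costs[i]))
        ]
    return min(dp)

# Two ops, two strategies each (e.g. row- vs column-partitioned).
node_costs = [[1.0, 2.0], [2.0, 1.0]]
edge_costs = [[[0.0, 3.0],   # keeping the same layout is free;
               [3.0, 0.0]]]  # switching layouts pays a reshard cost
print(min_total_cost(node_costs, edge_costs))  # 3.0
```

General graphs are harder (hence the ILP and randomized-search approaches above), but the chain case shows how node and edge costs interact.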
Model-specific Intra-op Parallel Strategies
AlexNet: Assign a group convolution layer to 2 GPUs
Megatron-LM
GShard MoE
ZeRO Optimizer
Mesh-Tensorflow
GSPMD
Tofu
FlexFlow
Auto-parallelization
- Search-based methods
  - MCMC
  - Heuristics
- Learning-based methods
  - Reinforcement learning
  - ML-based cost model
  - Bayesian optimization
- Optimization-based methods
  - Dynamic programming
  - Integer linear programming
  - Hierarchical optimization