MLSys - yszheda/wiki GitHub Wiki

Parallelization


https://www.cs.cmu.edu/~zhihaoj2/15-779/slides/13-ML-parallelization-part2.pdf


Data parallelism

Model parallelism

Inter-Operator Parallelism

Synchronous Pipeline Parallel Schedules

  • Pros:
    • Keeps the convergence semantics: training behaves exactly the same as training the neural network on a single device.
  • Cons:
    • Pipeline bubbles.
    • Reducing pipeline bubbles typically requires splitting the mini-batch into more, smaller micro-batches, but inputs that are too small reduce hardware efficiency.
GPipe
Improving Pipeline Parallelism Efficiency
  • m: number of micro-batches in a mini-batch
    • To increase m: increase the mini-batch size or reduce the micro-batch size
    • Caveat:
      • too large a mini-batch size can lead to accuracy loss
      • too small a micro-batch size reduces GPU utilization
  • p: number of pipeline stages
    • Decrease the pipeline depth
    • Caveat: each stage becomes larger (more computation and memory per device)
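The interaction between m and p can be made concrete with GPipe's bubble fraction: with p stages and m micro-batches, each device is busy for m of the (m + p − 1) pipeline steps. A minimal sketch (the function name is made up for illustration):

```python
def gpipe_bubble_fraction(p: int, m: int) -> float:
    """Fraction of device time spent idle in a GPipe schedule.

    With p pipeline stages and m micro-batches per mini-batch, each
    device does useful work for m steps out of (m + p - 1) total
    steps, so the bubble fraction is (p - 1) / (m + p - 1).
    """
    return (p - 1) / (m + p - 1)

# More micro-batches shrink the bubble; more stages grow it.
print(gpipe_bubble_fraction(p=4, m=4))   # 3/7  ~ 0.43
print(gpipe_bubble_fraction(p=4, m=32))  # 3/35 ~ 0.086
```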
1F1B (1 Forward 1 Backward) Schedule
Interleaved 1F1B
  • Pro: Higher pipeline efficiency with fewer pipeline bubbles.
  • Con: More communication overhead between stages.
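Assuming the bubble analysis published with the Megatron-LM interleaved schedule (relative bubble ≈ (p − 1) / (v · m), where v is the number of virtual stages per device), the pro/con trade-off can be sketched as:

```python
def bubble_ratio(p: int, m: int, v: int = 1) -> float:
    """Ratio of bubble time to ideal compute time for a 1F1B schedule.

    v = 1 is plain 1F1B; v > 1 models interleaving, where each device
    holds v smaller model chunks ("virtual stages"), shrinking the
    startup/drain bubble by a factor of v at the cost of roughly v
    times more inter-stage communication.
    """
    return (p - 1) / (v * m)

print(bubble_ratio(p=4, m=8))        # 0.375
print(bubble_ratio(p=4, m=8, v=2))   # 0.1875 -- half the bubble, ~2x comm
```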
TeraPipe
Chimera

Asynchronous Pipeline Parallel Schedule

  • Pros:
    • No pipeline bubbles.
  • Cons:
    • Breaks the synchronous training semantics: training now involves stale gradients.
    • Algorithms may need to store multiple versions of the model weights for consistency.
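One way the multiple-weight-version idea works (as in PipeDream's weight stashing) can be sketched as follows. This is an illustrative toy, not PipeDream's actual API: each in-flight micro-batch snapshots the weight version its forward pass used, so its backward pass computes the gradient against that same version, even though the update is applied to newer (stale-relative) weights.

```python
from collections import deque

class WeightStash:
    """Toy sketch of PipeDream-style weight stashing."""

    def __init__(self, scale):
        self.weights = {"scale": scale}  # latest weights
        self.stash = deque()             # one snapshot per in-flight batch

    def forward(self, x):
        snapshot = dict(self.weights)    # version this micro-batch will use
        self.stash.append(snapshot)
        return x * snapshot["scale"]     # stand-in for the real forward pass

    def backward(self, grad, lr=0.1):
        snapshot = self.stash.popleft()  # same version the forward used
        # The gradient is internally consistent (forward and backward share
        # one version), but it is applied to the *latest* weights, so the
        # update itself is stale.
        self.weights["scale"] -= lr * grad * snapshot["scale"]

stash = WeightStash(scale=2.0)
y0 = stash.forward(1.0)   # uses version scale=2.0
stash.backward(grad=1.0)  # scale becomes 2.0 - 0.1*1.0*2.0 = 1.8
y1 = stash.forward(1.0)   # next batch sees the updated version
```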
AMPNet
Pipedream
Pipedream-2BW

Imbalanced Pipeline Stages

Automatic Stage Partitioning

Goal: Minimize maximum stage latency & maximize parallelization
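The minimize-the-maximum-stage-latency goal can be sketched as a classic chain-partition dynamic program. This is a simplified sketch (per-layer compute costs only, inter-stage communication ignored; `partition_stages` is a hypothetical helper name):

```python
def partition_stages(costs, p):
    """Split a chain of per-layer costs into p contiguous stages,
    minimizing the maximum per-stage cost."""
    n = len(costs)
    prefix = [0]
    for c in costs:
        prefix.append(prefix[-1] + c)
    INF = float("inf")
    # dp[k][i]: best max-stage cost splitting the first i layers into k stages
    dp = [[INF] * (n + 1) for _ in range(p + 1)]
    cut = [[0] * (n + 1) for _ in range(p + 1)]
    dp[0][0] = 0
    for k in range(1, p + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):  # last stage covers layers j..i-1
                cand = max(dp[k - 1][j], prefix[i] - prefix[j])
                if cand < dp[k][i]:
                    dp[k][i], cut[k][i] = cand, j
    # Recover the stage boundaries (end index of each stage).
    bounds, i = [], n
    for k in range(p, 0, -1):
        bounds.append(i)
        i = cut[k][i]
    return dp[p][n], bounds[::-1]

best, bounds = partition_stages([1, 2, 3, 4, 5, 6], p=3)
print(best, bounds)  # -> 9 [3, 5, 6]: stages [1,2,3], [4,5], [6]
```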

Intra-op Parallelism

Parallelize One Operator

Parallelize All Operators in a Graph

Minimize node costs (computation + communication) plus edge costs (re-partition communication when consecutive operators use different layouts)

Solution:

  • Manual design
  • Randomized search
  • Dynamic programming
  • Integer linear programming

Model-specific Intra-op Parallel Strategies

AlexNet: split grouped convolution layers across 2 GPUs
Megatron-LM
GShard MoE
ZeRO Optimizer
Mesh-TensorFlow
GSPMD
Tofu
FlexFlow

Auto-parallelization

  • Search-based methods
    • MCMC
    • Heuristics
  • Learning-based methods
    • Reinforcement Learning
    • ML-based cost model
    • Bayesian optimization
  • Optimization-based methods
    • Dynamic programming
    • Integer linear programming
    • Hierarchical Optimization
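As a concrete illustration of the search-based family, here is a toy MCMC / simulated-annealing sketch over per-operator strategy assignments. The function name and the tiny cost model are made up for illustration; FlexFlow's actual MCMC search operates over much richer machine mappings.

```python
import math
import random

def mcmc_search(node_cost, edge_cost, iters=2000, temp=1.0, seed=0):
    """Toy MCMC search: repeatedly propose a random change to one
    operator's strategy and accept it with the Metropolis criterion,
    tracking the best assignment seen."""
    rng = random.Random(seed)
    n = len(node_cost)

    def cost(a):  # node costs plus re-partition costs along the chain
        return (sum(node_cost[i][a[i]] for i in range(n)) +
                sum(edge_cost[i][a[i]][a[i + 1]] for i in range(n - 1)))

    cur = [rng.randrange(len(node_cost[i])) for i in range(n)]
    cur_c = cost(cur)
    best, best_c = cur[:], cur_c
    for _ in range(iters):
        cand = cur[:]
        i = rng.randrange(n)
        cand[i] = rng.randrange(len(node_cost[i]))  # single-op mutation
        cand_c = cost(cand)
        delta = cand_c - cur_c
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur, cur_c = cand, cand_c  # accept (always downhill, sometimes up)
            if cur_c < best_c:
                best, best_c = cur[:], cur_c
    return best_c, best

# Same toy cost model as the DP sketch: two ops, two strategies each.
node = [[1, 3], [4, 1]]
edge = [[[0, 5], [5, 0]]]
best_c, best = mcmc_search(node, edge)
```

Unlike the exact DP, this search gives no optimality guarantee, but it extends directly to general graphs and non-additive cost models.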