Docs AI Engineering Knowledge Hub

Pipeline Parallelism: Comprehensive Technical Exposition


1. Motivation: Why Pipeline Parallelism?

1.1 The Inter-Node Communication Bottleneck

Tensor Parallelism (TP) partitions individual layer weight matrices across GPUs within a single node. However, scaling TP beyond the GPUs on a single node (typically Nintra=4N_{\text{intra}} = 4or88) forces communication across inter-node network links. These links operate at significantly lower bandwidth than intra-node interconnects (e.g., NVLink at $\sim$900 GB/s vs. InfiniBand at $\sim$50–400 GB/s), degrading performance during collective operations such as all-reduce, all-gather, and reduce-scatter.

Key empirical observation: When benchmarking all-reduce across multiple nodes (each with 8 GPUs), median bandwidth drops substantially as node count increases, and variance (5th–95th percentile spread) widens—confirming that inter-node communication is a primary scaling bottleneck for TP.

1.2 The Model Size Problem

For large models (70B\geq 70\text{B}parameters), the memory footprint of weights alone can exceed the aggregate GPU memory of a single node. Specifically, for a model withPP parameters stored in mixed precision (e.g., 2 bytes per parameter for BF16):

Weight Memory=P×2  bytes \text{Weight Memory} = P \times 2 \;\text{bytes}

For P=70×109P = 70 \times 10^9:

Weight Memory=70×109×2=140  GB \text{Weight Memory} = 70 \times 10^9 \times 2 = 140 \;\text{GB}

This exceeds the memory capacity of 4–8 GPUs (e.g., 4 × 80 GB = 320 GB leaves little room for optimizer states, gradients, and activations). Sequence parallelism and context parallelism address long-sequence memory pressure but do not address the fundamental weight memory constraint.

Pipeline Parallelism (PP) resolves this by partitioning the model along the depth (layer) dimension across multiple GPUs.


2. Core Concept: Partitioning Layers Across GPUs

2.1 Definition

Pipeline parallelism splits a model’s LLtransformer layers intoppcontiguous groups called stages, distributing each stage to a separate GPU (or device). If we haveppGPUs andLL layers:

Layers per stage=Lp \text{Layers per stage} = \frac{L}{p}

Example: For L=32L = 32layers andp=8p = 8 GPUs:

GPU Layers
GPU 0 Layers 1–4
GPU 1 Layers 5–8
GPU 2 Layers 9–12
GPU 3 Layers 13–16
GPU 4 Layers 17–20
GPU 5 Layers 21–24
GPU 6 Layers 25–28
GPU 7 Layers 29–32

2.2 Memory Reduction for Model Parameters

Each GPU stores only 1p\frac{1}{p}of the model’s parameters. For an 8B parameter model withp=4p = 4:

Parameters per GPU=8×1094=2×109 \text{Parameters per GPU} = \frac{8 \times 10^9}{4} = 2 \times 10^9
Weight memory per GPU=P×2p  bytes \text{Weight memory per GPU} = \frac{P \times 2}{p} \;\text{bytes}

2.3 Activation Memory: No Savings

A critical and initially counterintuitive observation: activation memory is NOT reduced by pipeline parallelism.

Explanation: Each GPU handles 1p\frac{1}{p}of the layers, so the activation memory per micro-batch per stage isactivsp\frac{\text{activs}}{p}. However, in any PP schedule, each GPU must perform pp (or more) forward passes on successive micro-batches before beginning the first backward pass. Therefore, the total activation memory stored simultaneously on each GPU is:

Activation memory per GPU=p×activsp=activs \text{Activation memory per GPU} = p \times \frac{\text{activs}}{p} = \text{activs}

where activs\text{activs} denotes the total activation memory for a single micro-batch across all layers. The activation memory per GPU thus remains approximately equal to the non-parallelized case.

2.4 Communication Pattern

Unlike data parallelism (which communicates gradients) or ZeRO-3 (which communicates parameters), pipeline parallelism communicates activation tensors sequentially between adjacent stages. Between stage ssand stages+1s+1, the output activation tensor of the last layer on GPU ssis sent as input to the first layer on GPUs+1s+1.

PP Communication=point-to-point send/recv of activation tensors at stage boundaries \text{PP Communication} = \text{point-to-point send/recv of activation tensors at stage boundaries}

Advantage over TP: Communication occurs only at p1p - 1stage boundaries (once per stage transition), rather than multiple times within every layer. The volume per communication is moderate (one hidden-state tensor of shape[Bμ,S,H][B_{\mu}, S, H]whereBμB_{\mu}is micro-batch size,SSis sequence length,HH is hidden dimension).


3. The Pipeline Bubble: Fundamental Inefficiency

3.1 Naive Scheduling (Single Micro-Batch)

When a single batch is processed through pp stages sequentially, only one GPU is active at any given time. All other GPUs are idle.

Timing definitions:

  • tft_f: Time for a forward pass through one stage for one micro-batch
  • tbt_b: Time for a backward pass through one stage for one micro-batch
  • Common approximation: tb2×tft_b \approx 2 \times t_f (backward involves computing gradients w.r.t. both inputs and weights)

Ideal time (if perfectly parallelized):

tideal=tf+tb t_{\text{ideal}} = t_f + t_b

Pipeline bubble time (additional idle time):

tbubble=(p1)×(tf+tb) t_{\text{bubble}} = (p - 1) \times (t_f + t_b)

This represents the cumulative idle time across all stages where GPUs wait while others compute.

Bubble ratio (fraction of wasted time relative to ideal):

rbubble=(p1)×(tf+tb)tf+tb=p1 \boxed{r_{\text{bubble}} = \frac{(p - 1) \times (t_f + t_b)}{t_f + t_b} = p - 1}

For p=8p = 8: rbubble=7r_{\text{bubble}} = 7, meaning the bubble time is 7×7\times the ideal compute time. This is catastrophically inefficient.


4. All Forward, All Backward (AFAB) Schedule

4.1 Concept

To mitigate the bubble, we split the global batch into mm micro-batches. The schedule proceeds as:

  1. All Forward Phase: Process all mm micro-batches through the forward pass sequentially across stages.
  2. All Backward Phase: Process all mm micro-batches through the backward pass sequentially.

When GPU s+1s+1begins processing micro-batchii, GPU sscan immediately start processing micro-batchi+1i+1. This creates a pipelined overlap.

4.2 Bubble Analysis

Ideal time for mm micro-batches:

tideal=m×(tf+tb) t_{\text{ideal}} = m \times (t_f + t_b)

Bubble time remains:

tbubble=(p1)×(tf+tb) t_{\text{bubble}} = (p - 1) \times (t_f + t_b)

Bubble ratio:

rbubble=(p1)×(tf+tb)m×(tf+tb)=p1m \boxed{r_{\text{bubble}} = \frac{(p - 1) \times (t_f + t_b)}{m \times (t_f + t_b)} = \frac{p - 1}{m}}

By increasing mm, the bubble ratio decreases inversely. For p=8p = 8, m=32m = 32:

rbubble=732=0.2187521.9% r_{\text{bubble}} = \frac{7}{32} = 0.21875 \approx 21.9\%

4.3 Memory Problem

AFAB requires storing activations for all mm micro-batches simultaneously, because no backward pass begins until all forward passes complete. The activation memory requirement on each GPU is:

Activation memoryAFAB=m×activsp \text{Activation memory}_{\text{AFAB}} = m \times \frac{\text{activs}}{p}

Since we want large mm to reduce the bubble, this creates a memory explosion—activations for many micro-batches must be retained until the backward phase begins.


5. One Forward, One Backward (1F1B) Schedule

5.1 Concept

The 1F1B schedule addresses the activation memory problem by starting backward passes as early as possible. The schedule has three phases:

  1. Warm-up phase: Each GPU performs successive forward passes to fill the pipeline (the number of warm-up forward passes depends on the stage position).
  2. Steady-state phase: Each GPU alternates between one forward pass and one backward pass (hence “1F1B”).
  3. Cool-down phase: Each GPU drains remaining backward passes.

5.2 Bubble Analysis

The bubble size in 1F1B is identical to AFAB:

rbubble=p1m \boxed{r_{\text{bubble}} = \frac{p - 1}{m}}

The bubble is not reduced because the total amount of idle time at the start (warm-up) and end (cool-down) remains the same—it is simply rearranged.

5.3 Memory Advantage

The critical improvement is in activation memory. In 1F1B, each GPU stores activations for at most ppmicro-batches (notmm micro-batches as in AFAB):

Activation memory1F1B=p×activsp=activs \text{Activation memory}_{\text{1F1B}} = p \times \frac{\text{activs}}{p} = \text{activs}

In contrast to AFAB where activation memory was m×activspm \times \frac{\text{activs}}{p}, 1F1B limits it to p×activspp \times \frac{\text{activs}}{p}. Since typically mpm \gg p in practical configurations, this is a substantial memory reduction.

Because 1F1B uses less activation memory, we can increase mmfurther (without running out of memory), which indirectly reduces the bubble ratiop1m\frac{p-1}{m}.

5.4 Practical Scaling Behavior

Empirical benchmarks reveal two regimes:

Configuration Behavior
mp1m \leq p - 1 Bubble dominates; performance degrades aspp increases
m=32p1m = 32 \gg p - 1 Performance improves at lowpp; still limited at very large pp

Cross-node scaling advantage: When scaling from one node (p=8p = 8) to two nodes (p=16p = 16), PP shows only $\sim$14% performance drop, compared to $\sim$43% for TP. This is because PP communicates only point-to-point activation tensors at stage boundaries, whereas TP requires bandwidth-intensive collective operations (all-reduce) within every layer.

5.5 Implementation Complexity

In 1F1B, forward and backward passes are no longer globally sequential. Different GPUs execute forward and backward passes for different micro-batches concurrently. This requires:

  • Per-device scheduling logic to decide when to switch between forward and backward execution.
  • Extensive modifications to both training loop code and model code.
  • Careful management of micro-batch indexing and gradient accumulation.

6. Interleaved Stages

6.1 Concept

Instead of assigning contiguous layer blocks to each GPU, interleaved PP assigns non-contiguous layer subsets. Each GPU hosts vvmodel chunks (also called virtual stages), wherevv is the number of chunks per GPU.

Example with L=8L = 8layers,p=2p = 2GPUs,v=2v = 2 chunks per GPU:

GPU Chunks Layers
GPU 0 Chunk 0, Chunk 2 Layers 1–2, Layers 5–6
GPU 1 Chunk 1, Chunk 3 Layers 3–4, Layers 7–8

A micro-batch now loops through the GPUs multiple times during a single forward pass: GPU 0 → GPU 1 → GPU 0 → GPU 1 → …

6.2 Bubble Reduction

Each forward and backward pass through a single chunk is vv times shorter than a full stage pass. The pipeline bubble time becomes:

tbubble=(p1)×(tf+tb)v t_{\text{bubble}} = \frac{(p - 1) \times (t_f + t_b)}{v}

The bubble ratio:

rbubble=1v(p1)×(tf+tb)m×(tf+tb)=p1vm \boxed{r_{\text{bubble}} = \frac{1}{v} \cdot \frac{(p - 1) \times (t_f + t_b)}{m \times (t_f + t_b)} = \frac{p - 1}{v \cdot m}}

where:

  • pp = pipeline parallelism degree (number of GPUs)
  • mm = number of micro-batches
  • vv = number of model chunks (virtual stages) per GPU

6.3 Communication Trade-off

The number of point-to-point communications increases by a factor of vv, since each micro-batch traverses each GPU vv times instead of once. This introduces a direct trade-off:

Communication volumev×(p1) \text{Communication volume} \propto v \times (p - 1)
Bubble sizep1vm \text{Bubble size} \propto \frac{p - 1}{v \cdot m}

The optimal vv balances reduced idle time against increased communication overhead.

6.4 Scheduling Policies: Depth-First vs. Breadth-First

With interleaved stages, a scheduling decision arises at each time step for each GPU: should it prioritize:

Policy Description Effect
Depth-first Advance earlier micro-batches through later layers first Minimizes per-micro-batch latency; completes individual micro-batches faster, freeing activation memory sooner
Breadth-first Advance later micro-batches through earlier layers first Maximizes pipeline filling; keeps all stages busy

The Llama 3.1 training pipeline uses a 1F1B schedule with interleaved stages, with a tunable priority parameter between depth-first and breadth-first policies.

6.5 Special Cases Summary

mm vv Schedule Type
11 11 Naive PP (single micro-batch, single chunk)
m>1m > 1 11 AFAB or 1F1B
m>1m > 1 v>1v > 1 Interleaved 1F1B

7. Zero Bubble Pipeline Parallelism

7.1 Key Insight: Decomposing the Backward Pass

The backward pass through a linear layer Y=XW\mathbf{Y} = \mathbf{X}\mathbf{W} involves two independent gradient computations:

  1. Input gradient (BB): Gradient w.r.t. the input activations X\mathbf{X}, needed to propagate gradients to earlier layers.
LX=LYW \frac{\partial \mathcal{L}}{\partial \mathbf{X}} = \frac{\partial \mathcal{L}}{\partial \mathbf{Y}} \cdot \mathbf{W}^{\top}
  1. Weight gradient (WW): Gradient w.r.t. the weight matrix W\mathbf{W}, needed for the optimizer update.
LW=XLY \frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \mathbf{X}^{\top} \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{Y}}

Critical observation: Operation BBmust complete before the backward pass of the preceding stage can begin (it is on the critical path). However, operationWW has no such dependency—it only needs to complete before the optimizer step. Therefore:

W can be scheduled anywhere after the corresponding B of the same stage W \text{ can be scheduled anywhere after the corresponding } B \text{ of the same stage}

7.2 Exploiting the Decomposition

By splitting the coarse-grained backward pass into fine-grained BBandWWoperations, we gain scheduling flexibility. TheWW operations can be strategically placed into bubble slots that would otherwise be idle.

Timing decomposition:

tb=tB+tW t_b = t_B + t_W

where tBt_Bis the time for the input gradient computation andtWt_W is the time for the weight gradient computation.

7.3 ZB-H1 and ZB-H2 Schedules

The Zero Bubble paper proposes two schedules:

Schedule Description Bubble
ZB-H1 Handcrafted schedule with BB/WW decomposition Significantly reduced
ZB-H2 Optimized schedule filling all bubbles with WW operations Theoretically zero

In ZB-H2, every idle slot is filled with a WW computation, achieving:

rbubble0 \boxed{r_{\text{bubble}} \approx 0}

7.4 Optimal Scheduling via Integer Linear Programming (ILP)

Finding the optimal placement of FF(forward),BB(input backward), andWW(weight backward) operations acrossppstages andmm micro-batches is formulated as an Integer Linear Programming problem:

Objective:

min  tbubble \min \; t_{\text{bubble}}

Subject to constraints:

  1. Data dependency: Fis+1F_i^{s+1}cannot start beforeFisF_i^{s} completes (forward propagation order).
  2. Data dependency: BisB_i^{s}cannot start beforeBis+1B_i^{s+1} completes (backward propagation order).
  3. Ordering: BisB_i^{s}cannot start beforeFisF_i^{s} completes.
  4. Flexible scheduling: WisW_i^{s}must occur afterBisB_i^{s} but can occur at any later time.
  5. Optimizer constraint: All WisW_i^{s}for all micro-batchesii must complete before the optimizer step.
  6. Non-overlap: No two operations on the same GPU can overlap in time.

Here, FisF_i^{s}denotes the forward pass for micro-batchiiat stagess, and similarly for BisB_i^{s}andWisW_i^{s}.


8. DualPipe (DeepSeek-V3/R1)

8.1 Concept

DualPipe extends the zero-bubble decomposition by introducing two concurrent pipeline streams propagating from both ends of the pipeline dimension simultaneously:

  • Stream 1: Micro-batches flow forward from stage 0 → stage p1p-1 (left to right)
  • Stream 2: Micro-batches flow forward from stage p1p-1 → stage 0 (right to left)

These bidirectional streams are interleaved on each GPU, ensuring that when one stream encounters a dependency stall, the other stream can utilize the idle compute cycles.

8.2 Fine-Grained Operation Decomposition

DualPipe further decomposes operations beyond FF, BB, WW to include communication operations (all-to-all for MoE expert routing in DeepSeek-V3). The scheduling interleaves:

  • Forward computation (FF)
  • Input backward computation (BB)
  • Weight backward computation (WW)
  • All-to-all communication (for expert parallelism)

By overlapping communication with computation from the opposing stream, DeepSeek-V3 achieved:

near-zero all-to-all communication overhead \text{near-zero all-to-all communication overhead}

8.3 Complexity

The DualPipe schedule is significantly more complex than 1F1B or interleaved schedules. Its design requires:

  1. Precise profiling of individual operation durations (tFt_F, tBt_B, tWt_W, tcommt_{\text{comm}}).
  2. Solving an ILP or heuristic optimization problem for operation placement.
  3. Bidirectional pipeline infrastructure with careful synchronization.

9. Comparative Summary of Pipeline Schedules

Schedule Bubble Ratio rbubbler_{\text{bubble}} Activation Memory per GPU Communication Volume Implementation Complexity
Naive (single micro-batch) p1p - 1 activs\text{activs} (p1)(p-1) sends Low
AFAB (mmmicro-batches) p1m\dfrac{p - 1}{m} mactivspm \cdot \dfrac{\text{activs}}{p} (p1)(p-1) sends per micro-batch Low
1F1B (mmmicro-batches) p1m\dfrac{p - 1}{m} pactivsp=activsp \cdot \dfrac{\text{activs}}{p} = \text{activs} (p1)(p-1) sends per micro-batch Medium
Interleaved 1F1B (vvchunks) p1vm\dfrac{p - 1}{v \cdot m} activs\text{activs}(reduced per chunk) v(p1)v \cdot (p-1) sends per micro-batch High
Zero Bubble (ZB-H2) 0\approx 0 Similar to 1F1B Similar to 1F1B Very High
DualPipe 0\approx 0 Similar to 1F1B Bidirectional + overlapped Very High

10. Key Mathematical Relations: Consolidated Reference

Pipeline Bubble (General)

rbubble=p1vm \boxed{r_{\text{bubble}} = \frac{p - 1}{v \cdot m}}

where:

  • pp = number of pipeline stages (GPUs allocated to PP)
  • mm = number of micro-batches
  • vv= number of interleaved chunks per GPU (v=1v = 1 for non-interleaved)

Backward Pass Decomposition

tb=tB+tW t_b = t_B + t_W
LX=LYW(B:on critical path) \frac{\partial \mathcal{L}}{\partial \mathbf{X}} = \frac{\partial \mathcal{L}}{\partial \mathbf{Y}} \cdot \mathbf{W}^{\top} \quad (B: \text{on critical path})
LW=XLY(W:flexibly schedulable) \frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \mathbf{X}^{\top} \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{Y}} \quad (W: \text{flexibly schedulable})

Memory per GPU (Parameters)

Param memory per GPU=P×bytes_per_paramp \text{Param memory per GPU} = \frac{P \times \text{bytes\_per\_param}}{p}

Communication Advantage over TP

PP: (p1)×v point-to-point transfers per micro-batch \text{PP: } (p - 1) \times v \text{ point-to-point transfers per micro-batch}
TP: Multiple all-reduce operations per layer per micro-batch \text{TP: Multiple all-reduce operations per layer per micro-batch}

11. Practical Design Principles

  1. Choose mp1m \gg p - 1 to minimize the bubble ratio, subject to the constraint that mm divides the global batch size.

  2. PP excels across nodes because point-to-point activation transfers tolerate lower inter-node bandwidth far better than TP’s collective all-reduce operations ($\sim$14% degradation vs. $\sim$43% for TP when crossing node boundaries).

  3. Interleaving (v>1v > 1) trades communication for compute efficiency. Increase vvonly when intra-stage communication bandwidth is sufficient to absorb thevv-fold increase in transfers.

  4. 1F1B is strictly preferred over AFAB when activation memory is the binding constraint, as it reduces peak activation storage from O(m)\mathcal{O}(m)toO(p)\mathcal{O}(p) micro-batches.

  5. Zero-bubble methods require fine-grained profiling and ILP-based scheduling, making them implementation-intensive but near-optimal for large-scale deployments (e.g., DeepSeek-V3/R1).

  6. PP does not reduce activation memory per se—it reduces parameter memory. Activation recomputation (gradient checkpointing) remains the primary tool for activation memory reduction and is orthogonal to PP.

PreviousPipeline Parallelism: A Comprehensive Technical Exposition NextScaling Distributed Training: Foundations and First Principles

Generated from llm_training_at_scale at .