Docs AI Engineering Knowledge Hub

Pipeline Parallelism: A Comprehensive Technical Exposition


1. Motivation and Context

1.1 The Inter-Node Communication Bottleneck

In tensor parallelism (TP), every transformer layer requires collective communication operations (all-reduce, all-gather, reduce-scatter) within each layer’s computation. When TP is confined to a single node—typically housing 4 or 8 GPUs connected via high-bandwidth interconnects such as NVLink (up to ~900 GB/s bidirectional on modern hardware)—the overhead remains tolerable. However, scaling TP across nodes forces communication over lower-bandwidth fabrics (InfiniBand, typically 25–100 GB/s per link), producing severe performance degradation.

Empirical measurements on multi-node clusters reveal a characteristic pattern: as the number of nodes increases, the effective bandwidth for collective operations (all-reduce, all-gather, reduce-scatter) drops substantially due to:

  • Network topology constraints: inter-node links have lower bandwidth and higher latency than intra-node NVLink/NVSwitch.
  • Congestion and contention: multiple simultaneous all-reduce operations compete for shared network resources.
  • Protocol overhead: TCP/IP or RDMA stack latency accumulates across hops.

1.2 Why Not Simply Use Sequence/Context Parallelism or ZeRO?

Parallelism Strategy What It Partitions Limitation
Sequence/Context Parallelism Activation tensors along the sequence dimension Helps only when sequence length is the memory bottleneck, not model size
ZeRO-3 (Data Parallelism) Model parameters, gradients, optimizer states across DP ranks Requires all-gather of full parameters before every forward/backward computation; high communication volume
Pipeline Parallelism Model layers across pipeline stages Introduces pipeline bubbles; requires careful scheduling

For models with 70B+ parameters, the weight memory alone can exceed the aggregate capacity of 4–8 GPUs on a single node. Pipeline parallelism addresses this by distributing layers (not shards of every layer) across devices.


2. Fundamental Concept of Pipeline Parallelism

2.1 Layer Partitioning

Given a model with LLtransformer layers andpppipeline stages (each mapped to one GPU or a group of GPUs), we assign approximatelyL/p\lceil L/p \rceil consecutive layers to each stage:

Stage k holds layers {:(k1)Lp+1kLp},k=1,2,,p \text{Stage } k \text{ holds layers } \left\{ \ell : \left\lfloor \frac{(k-1) \cdot L}{p} \right\rfloor + 1 \leq \ell \leq \left\lfloor \frac{k \cdot L}{p} \right\rfloor \right\}, \quad k = 1, 2, \ldots, p

Example: For L=32L = 32layers andp=4p = 4 GPUs:

  • GPU 0: Layers 1–8
  • GPU 1: Layers 9–16
  • GPU 2: Layers 17–24
  • GPU 3: Layers 25–32

2.2 Memory Decomposition Under Pipeline Parallelism

For a model with total parameter count Φ\Phi, the memory per GPU for model parameters (in mixed precision with optimizer states) scales as:

Mparams(per GPU)=Φpκ M_{\text{params}}^{(\text{per GPU})} = \frac{\Phi}{p} \cdot \kappa

where κ\kappais the per-parameter memory multiplier (e.g.,κ=18\kappa = 18bytes for AdamW with fp32 master weights + fp32 momentum + fp32 variance + fp16 parameters + fp16 gradients, orκ=2\kappa = 2 bytes for inference-only fp16 weights).

Critical observation: While parameter memory is divided by pp, activation memory is NOT reduced. The reason is subtle and requires careful analysis.

2.3 Why Activation Memory Remains Constant

Each pipeline stage processes 1/p1/p of the model’s layers per micro-batch, so the activation memory for a single micro-batch through one stage is:

Asingle=Atotalp A_{\text{single}} = \frac{A_{\text{total}}}{p}

where AtotalA_{\text{total}} is the total activation memory for all layers on a single micro-batch.

However, in standard pipeline schedules, each GPU must complete forward passes on pp micro-batches before any backward pass begins (as we will see in the AFAB schedule). Therefore, the total activation memory stored on each GPU is:

Aper GPU=p×Atotalp=Atotal A_{\text{per GPU}} = p \times \frac{A_{\text{total}}}{p} = A_{\text{total}}

This cancellation is a fundamental property of naive pipeline parallelism: the memory savings from partitioning layers are exactly offset by the need to buffer multiple micro-batches’ activations.

2.4 Communication Pattern

Unlike tensor parallelism (which communicates within layers via all-reduce/all-gather) or ZeRO-3 (which communicates parameters), pipeline parallelism communicates activation tensors sequentially between adjacent stages:

Stage ksend h(k)Stage k+1 \text{Stage } k \xrightarrow{\text{send } \mathbf{h}^{(k)}} \text{Stage } k+1

where h(k)Rb×s×d\mathbf{h}^{(k)} \in \mathbb{R}^{b \times s \times d}is the hidden state tensor (batch sizebb, sequence length ss, hidden dimension dd). During the backward pass, gradients flow in the reverse direction:

Stage k+1send Lh(k)Stage k \text{Stage } k+1 \xrightarrow{\text{send } \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(k)}}} \text{Stage } k

Key advantage: These are point-to-point (P2P) communications occurring only at p1p - 1 stage boundaries, not collective operations at every layer. The communication volume per boundary is:

VP2P=b×s×d×bytes_per_element V_{\text{P2P}} = b \times s \times d \times \text{bytes\_per\_element}

This is dramatically less frequent than TP’s per-layer collectives, making PP particularly well-suited for inter-node scaling.


3. The Pipeline Bubble Problem

3.1 Naive Sequential Execution

In the simplest implementation, a single batch passes through all stages sequentially: Stage 1 computes the forward pass, sends activations to Stage 2, which computes, sends to Stage 3, and so on. Then the backward pass reverses direction.

During forward computation on Stage kk, all stages {1,,k1,k+1,,p}\{1, \ldots, k-1, k+1, \ldots, p\} are idle. This idle time constitutes the pipeline bubble.

3.2 Bubble Quantification (Single Micro-Batch)

Let:

  • tft_f = time for one forward pass through one pipeline stage for one micro-batch
  • tbt_b = time for one backward pass through one pipeline stage for one micro-batch
  • A common empirical approximation: tb2tft_b \approx 2 \cdot t_f (backward requires computing gradients w.r.t. both inputs and weights)

Ideal time (perfect parallelization, no bubble):

tideal=tf+tb t_{\text{ideal}} = t_f + t_b

Actual pipeline bubble time (naive schedule):

tbubble=(p1)(tf+tb) t_{\text{bubble}} = (p - 1) \cdot (t_f + t_b)

Bubble ratio (bubble time relative to ideal time):

rbubble=tbubbletideal=(p1)(tf+tb)tf+tb=p1 r_{\text{bubble}} = \frac{t_{\text{bubble}}}{t_{\text{ideal}}} = \frac{(p - 1)(t_f + t_b)}{t_f + t_b} = p - 1

This is devastating: with p=8p = 8pipeline stages, we waste7×7\times the useful computation time. The pipeline efficiency is:

ηpipeline=tidealtideal+tbubble=11+(p1)=1p \eta_{\text{pipeline}} = \frac{t_{\text{ideal}}}{t_{\text{ideal}} + t_{\text{bubble}}} = \frac{1}{1 + (p-1)} = \frac{1}{p}

With p=8p = 8, efficiency is only 12.5%12.5\%.


4. All Forward All Backward (AFAB) Schedule

4.1 Micro-Batching

The first mitigation strategy borrows from data parallelism: split the global batch into mm micro-batches. When Stage 2 processes micro-batch 1, Stage 1 can begin processing micro-batch 2. This creates a “pipeline fill” phase, a “steady state,” and a “pipeline drain” phase.

4.2 AFAB Schedule Description

In AFAB:

  1. Forward phase: All mmmicro-batches execute their forward passes through all stages. As soon as Stagekkfinishes the forward pass for micro-batchii, it begins micro-batch i+1i+1.
  2. Backward phase: After all forward passes complete, all mm micro-batches execute their backward passes in reverse order.

4.3 Bubble Analysis with Micro-Batching

Ideal time for mm micro-batches:

tideal=m(tf+tb) t_{\text{ideal}} = m \cdot (t_f + t_b)

Bubble time remains (p1)(tf+tb)(p-1)(t_f + t_b) since the pipeline fill and drain phases are unchanged:

tbubble=(p1)(tf+tb) t_{\text{bubble}} = (p - 1)(t_f + t_b)

Bubble ratio:

rbubbleAFAB=(p1)(tf+tb)m(tf+tb)=p1m \boxed{r_{\text{bubble}}^{\text{AFAB}} = \frac{(p-1)(t_f + t_b)}{m(t_f + t_b)} = \frac{p - 1}{m}}

By increasing mpm \gg p, we can make the bubble arbitrarily small. For example, with p=8p = 8andm=32m = 32:

rbubble=732=21.875% r_{\text{bubble}} = \frac{7}{32} = 21.875\%

Pipeline efficiency:

ηAFAB=mm+p1 \eta_{\text{AFAB}} = \frac{m}{m + p - 1}

4.4 Memory Problem in AFAB

AFAB requires storing activations for all mm micro-batches simultaneously during the forward phase, since no backward pass begins until all forwards complete:

MactivationsAFAB=m×Atotalp M_{\text{activations}}^{\text{AFAB}} = m \times \frac{A_{\text{total}}}{p}

As mm increases to reduce the bubble, activation memory grows linearly—a direct and painful trade-off.


5. One Forward One Backward (1F1B) Schedule

5.1 Core Idea

The 1F1B schedule addresses AFAB’s memory explosion by starting backward passes as soon as possible. Instead of completing all mm forward passes before any backward, 1F1B enters a steady state where each GPU alternates between one forward pass and one backward pass.

5.2 Schedule Phases

For a pipeline with ppstages andmmmicro-batches (mpm \geq p):

  1. Warmup phase: Stage kk(0-indexed) performspk1p - k - 1 forward passes to fill the pipeline.
  2. Steady state: Each stage alternates: one forward pass, one backward pass (1F1B).
  3. Cooldown phase: Remaining backward passes drain the pipeline.

5.3 Bubble Analysis

The bubble size in 1F1B is identical to AFAB:

rbubble1F1B=p1m \boxed{r_{\text{bubble}}^{\text{1F1B}} = \frac{p - 1}{m}}

The bubble is not reduced. The improvement is purely in memory.

5.4 Activation Memory Improvement

In 1F1B, the maximum number of in-flight micro-batches (those whose activations must be stored) at any given stage is at most pp, not mm:

Mactivations1F1B=p×Atotalp=Atotal \boxed{M_{\text{activations}}^{\text{1F1B}} = p \times \frac{A_{\text{total}}}{p} = A_{\text{total}}}

Compare to AFAB:

MactivationsAFAB=m×Atotalp M_{\text{activations}}^{\text{AFAB}} = m \times \frac{A_{\text{total}}}{p}

Since typically mpm \gg p, 1F1B achieves a memory reduction factor of m/pm/p. This is significant: with m=32m = 32andp=4p = 4, AFAB stores 8×8\times more activations than 1F1B per stage.

Crucially, because 1F1B decouples memory from mm, we can freely increase mm to reduce the bubble without proportionally increasing memory.

5.5 Implementation Complexity

1F1B breaks the clean separation between forward and backward phases. Each device independently schedules forward and backward operations according to its position in the pipeline. This requires:

  • Asynchronous scheduling logic: Each stage must track which micro-batches have completed forward passes and which are ready for backward passes.
  • Modified training loops: The standard forward() → loss() → backward() → step() paradigm must be replaced with a state-machine-like scheduler.
  • Model code modifications: The model must be sliceable into stages, each independently callable.

5.6 Empirical Scaling Behavior

Benchmark results reveal two regimes:

Configuration Behavior
mp1m \leq p - 1 Performance degrades with increasingpp; bubble dominates
mp1m \gg p - 1(e.g.,m=32m = 32) Reasonable scaling at low pp; still limited at very large pp

Inter-node scaling advantage: When crossing node boundaries (e.g., p=8p = 8on one node top=16p = 16 across two nodes), PP shows only ~14% throughput degradation versus ~43% for TP. This is because PP communicates point-to-point activation tensors at stage boundaries rather than all-reduce across all GPUs within every layer.


6. Interleaved Stages (Interleaved 1F1B)

6.1 Concept: Non-Contiguous Layer Assignment

Instead of assigning contiguous blocks of layers to each GPU, we assign vvchunks (also called virtual stages) per GPU. Each GPU holdsvv non-contiguous groups of layers, and a micro-batch “loops” through GPUs multiple times during a single forward pass.

Example with L=16L = 16, p=4p = 4, v=2v = 2:

GPU Chunk 1 (layers) Chunk 2 (layers)
GPU 0 1–2 9–10
GPU 1 3–4 11–12
GPU 2 5–6 13–14
GPU 3 7–8 15–16

A micro-batch traverses: GPU 0 → GPU 1 → GPU 2 → GPU 3 → GPU 0 → GPU 1 → GPU 2 → GPU 3 during the forward pass.

6.2 Bubble Reduction

Each forward and backward pass through a single chunk takes tf/vt_f/vandtb/vt_b/vrespectively (since each chunk has1/v1/v-th of the layers). The bubble time becomes:

tbubbleinterleaved=(p1)(tf+tb)v t_{\text{bubble}}^{\text{interleaved}} = \frac{(p-1)(t_f + t_b)}{v}

Bubble ratio:

rbubbleinterleaved=p1vm \boxed{r_{\text{bubble}}^{\text{interleaved}} = \frac{p - 1}{v \cdot m}}

6.3 Communication Trade-Off

Interleaving increases the number of P2P communications by a factor of vv. Previously, each micro-batch required p1p - 1send/receive pairs per direction. With interleaving, each micro-batch requiresvp1v \cdot p - 1communication steps (since it traverses the full pipelinevv times).

Total communication volume per micro-batch:

Vcomminterleaved=v(p1)VP2P=v(p1)bsdbytes_per_element V_{\text{comm}}^{\text{interleaved}} = v \cdot (p - 1) \cdot V_{\text{P2P}} = v \cdot (p - 1) \cdot b \cdot s \cdot d \cdot \text{bytes\_per\_element}

This is a factor of vv increase over non-interleaved PP.

6.4 Depth-First vs. Breadth-First Scheduling

With interleaved stages, a scheduling ambiguity arises. At any given moment on a GPU, we must choose between:

  • Depth-first: Prioritize advancing earlier micro-batches through later chunks. This minimizes the end-to-end latency for individual micro-batches, releasing their activations sooner (lower memory, lower time-to-first-backward).

  • Breadth-first: Prioritize advancing later micro-batches through earlier chunks. This fills the pipeline more uniformly, potentially improving steady-state utilization.

The optimal choice depends on the specific pp, vv, mm configuration and the relative costs of computation versus communication.

6.5 Configuration Space

The design space for pipeline parallelism can be parameterized as (p,m,v)(p, m, v):

mm vv Schedule Type
1 1 Naive PP (single batch, single chunk)
m>1m > 1 1 AFAB or 1F1B
m>1m > 1 v>1v > 1 Interleaved 1F1B

Llama 3.1’s training infrastructure uses an interleaved 1F1B schedule with a tunable depth-first/breadth-first priority.


7. Zero Bubble and DualPipe Schedules

7.1 Decomposing the Backward Pass

The breakthrough enabling near-zero bubble schedules is the observation that the backward pass through a linear (matrix multiplication) layer decomposes into two independent operations:

Given a linear layer computing Y=XW\mathbf{Y} = \mathbf{X}\mathbf{W}with incoming gradientLY\frac{\partial \mathcal{L}}{\partial \mathbf{Y}}:

  1. Input gradient (BB):
LX=LYW \frac{\partial \mathcal{L}}{\partial \mathbf{X}} = \frac{\partial \mathcal{L}}{\partial \mathbf{Y}} \cdot \mathbf{W}^{\top}

This is required by the preceding pipeline stage to continue its backward pass.

  1. Weight gradient (WW):
LW=XLY \frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \mathbf{X}^{\top} \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{Y}}

This is required only for the optimizer step at the end of the iteration. It has no downstream dependency in the pipeline.

7.2 Dependency Analysis

The critical insight is the asymmetry in dependencies:

Operation Depends On Required By
Forward (FF) of stage kk FFof stagek1k-1 FFof stagek+1k+1, BBof stagekk
Input backward (BB) of stage kk BBof stagek+1k+1 BBof stagek1k-1
Weight backward (WW) of stage kk BBof stagekk (same stage) Only the optimizer step

Since WWhas no inter-stage dependency, it can be scheduled anywhere after the correspondingBBof the same stage and before the optimizer step. This flexibility allowsWW operations to be placed into what would otherwise be bubble slots.

7.3 Zero Bubble Schedule (ZB-H2)

By decomposing backward passes into BBandWW and solving a scheduling optimization problem, it is possible to construct schedules where the bubble is theoretically zero:

rbubbleZB-H20 \boxed{r_{\text{bubble}}^{\text{ZB-H2}} \approx 0}

The total time per iteration approaches the ideal:

ttotalm(tf+tB+tW) t_{\text{total}} \approx m \cdot (t_f + t_B + t_W)

where tBt_BandtWt_Ware the times for the input-backward and weight-backward operations respectively, withtB+tW=tbt_B + t_W = t_b.

7.4 Scheduling as Integer Linear Programming

Finding the optimal placement of FF, BB, and WWoperations acrossppstages andmm micro-batches is formulated as an Integer Linear Programming (ILP) problem:

Decision variables: Start time si,k(op)s_{i,k}^{(\text{op})}for each operation typeop{F,B,W}\text{op} \in \{F, B, W\}, micro-batch i{1,,m}i \in \{1, \ldots, m\}, stage k{1,,p}k \in \{1, \ldots, p\}.

Objective:

mink=1p(tend(k)tstart(k)i=1m(tf+tB+tW)) \min \sum_{k=1}^{p} \left( t_{\text{end}}^{(k)} - t_{\text{start}}^{(k)} - \sum_{i=1}^{m} (t_f + t_B + t_W) \right)

This minimizes total idle (bubble) time across all stages.

Constraints:

  1. Dependency constraints: BBof stagekkcannot start beforeBBof stagek+1k+1 finishes (for the same micro-batch).
  2. Non-overlap: No two operations on the same GPU overlap in time.
  3. Ordering: Wi,kW_{i,k}must followBi,kB_{i,k}.
  4. All WW complete before optimizer step.

7.5 DualPipe (DeepSeek-V3/R1)

DualPipe extends the zero-bubble concept with bidirectional pipeline streams:

  • Stream A: Micro-batches propagate forward from Stage 1 → Stage pp.
  • Stream B: Micro-batches propagate forward from Stage pp → Stage 1.

Both streams are interleaved on each GPU, with the FF, BB, WW decomposition applied to both. The key benefits:

  1. Doubled pipeline utilization: Two independent forward-backward chains fill each other’s bubbles.
  2. Near-zero all-to-all communication overhead: As reported in the DeepSeek-V3 technical report, the overlap between computation and communication is nearly perfect.

The resulting schedule is substantially more complex—each GPU may be simultaneously handling:

  • Forward of micro-batch iifrom Stream A (chunkjj)
  • Input-backward of micro-batch ii'from Stream B (chunkjj')
  • Weight-backward of micro-batch ii''from Stream A (chunkjj'')

8. Summary of Pipeline Parallelism Schedules

Schedule Bubble Ratio rbubbler_{\text{bubble}} Peak Activation Memory per Stage Communication Overhead Implementation Complexity
Naive (1 micro-batch) p1p - 1 AtotalA_{\text{total}} Minimal Trivial
AFAB (mmmicro-batches) p1m\dfrac{p-1}{m} mAtotalp\dfrac{m \cdot A_{\text{total}}}{p} Minimal Low
1F1B (mmmicro-batches) p1m\dfrac{p-1}{m} AtotalA_{\text{total}} Minimal Moderate
Interleaved 1F1B (vvchunks) p1vm\dfrac{p-1}{v \cdot m} Atotal\leq A_{\text{total}} v×v \times baseline High
Zero Bubble (ZB-H2) 0\approx 0 Atotal\leq A_{\text{total}} Baseline + scheduling Very High
DualPipe 0\approx 0 Atotal\leq A_{\text{total}} Bidirectional streams Extremely High

Pipeline Efficiency Summary

ηpipeline=11+rbubble=vmvm+p1 \boxed{\eta_{\text{pipeline}} = \frac{1}{1 + r_{\text{bubble}}} = \frac{v \cdot m}{v \cdot m + p - 1}}

For the zero-bubble case:

ηpipelineZB1 \eta_{\text{pipeline}}^{\text{ZB}} \approx 1

9. Pipeline Parallelism vs. Other Parallelism Strategies

9.1 PP vs. TP for Inter-Node Scaling

Property Tensor Parallelism Pipeline Parallelism
Communication type All-reduce / all-gather (collective) Point-to-point (P2P)
Communication frequency Multiple times per layer Once per stage boundary
Communication volume per operation O(bsd)O(b \cdot s \cdot d)per layer O(bsd)O(b \cdot s \cdot d) per stage boundary
Sensitivity to bandwidth Very high (many operations) Low (few operations)
Empirical inter-node degradation ~43% throughput loss ~14% throughput loss

9.2 PP vs. ZeRO-3

Property ZeRO-3 Pipeline Parallelism
What is partitioned Parameters, gradients, optimizer states Model layers
Communication pattern All-gather parameters before each forward/backward P2P activations at stage boundaries
Activation memory Unchanged Unchanged (for 1F1B); reduced (for interleaved)
Compute efficiency No bubble (but high communication) Bubble exists (but low communication)

10. Practical Design Considerations

10.1 Choosing pp, mm, and vv

The practitioner must balance:

  1. Bubble minimization: Requires mp1m \gg p - 1and largevv.
  2. Memory constraints: 1F1B caps activation memory at Atotal\sim A_{\text{total}}; AFAB grows with mm.
  3. Communication overhead: Interleaving (v>1v > 1) multiplies P2P volume by vv.
  4. Global batch size constraint: mmis bounded above byBglobal/(b×DP_degree)B_{\text{global}} / (b \times \text{DP\_degree}), where bb is the micro-batch size.

10.2 Optimal Stage Partitioning

Layers may have heterogeneous computation costs (e.g., the embedding layer, the final language modeling head). Optimal partitioning minimizes the maximum per-stage computation time:

minPmaxk=1,,pPkc \min_{P} \max_{k=1,\ldots,p} \sum_{\ell \in P_k} c_\ell

where P={P1,,Pp}P = \{P_1, \ldots, P_p\}is a partition of layers andcc_\ellis the computation cost of layer\ell. Imbalanced stages exacerbate the bubble, as the slowest stage becomes the bottleneck.


This completes the technical exposition of pipeline parallelism, from its fundamental motivation through naive schedules, micro-batch–based AFAB and 1F1B schedules, interleaved stages, and the frontier zero-bubble / DualPipe methods that approach theoretically optimal GPU utilization through fine-grained backward-pass decomposition and ILP-based schedule optimization.

PreviousHigh-Level Overview of Distributed Training: Foundations, Memory Analysis, and First-Step Techniques NextPipeline Parallelism: Comprehensive Technical Exposition

Generated from llm_training_at_scale at .