
Expert Parallelism and 5D Parallelism: A Comprehensive Technical Treatment


1. Expert Parallelism (EP)

1.1 Prerequisite: Mixture of Experts (MoE) Architecture

In a standard Transformer layer, every token passes through a single feedforward network (FFN). The Mixture of Experts paradigm replaces this monolithic FFN with $N_E$ parallel expert sub-networks $\{E_1, E_2, \ldots, E_{N_E}\}$, each structurally identical but with independent learned parameters. A gating (router) network $G$ determines which experts process each token.

For a given input hidden state $\mathbf{h} \in \mathbb{R}^{d_{\text{model}}}$, the router computes gating scores:

$$g(\mathbf{h}) = \text{Softmax}\!\bigl(W_g \cdot \mathbf{h}\bigr) \in \mathbb{R}^{N_E}$$

where $W_g \in \mathbb{R}^{N_E \times d_{\text{model}}}$ is the learnable gating weight matrix.

Under a Top-$k$ routing strategy (introduced in Shazeer et al., 2017, and refined in Switch Transformers), only the $k$ experts with the highest gating scores are selected. Let $\mathcal{T}_k(\mathbf{h})$ denote the index set of the top-$k$ experts. The MoE layer output is:

$$\text{MoE}(\mathbf{h}) = \sum_{i \in \mathcal{T}_k(\mathbf{h})} g_i(\mathbf{h}) \cdot E_i(\mathbf{h})$$

where $g_i(\mathbf{h})$ is the (renormalized) gating weight for expert $i$, and $E_i(\mathbf{h})$ is the output of expert $i$ applied to $\mathbf{h}$.

Key property: Each token activates only $k \ll N_E$ experts, so total FLOPs per token remain comparable to a dense model while the total parameter count scales as $\mathcal{O}(N_E \cdot d_{\text{ffn}} \cdot d_{\text{model}})$.
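The gating, Top-$k$ selection, renormalization, and weighted combine above can be sketched for a single token in plain Python. This is a minimal illustration (all names are hypothetical; a real implementation operates on batched tensors):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(h, W_g, experts, k):
    """Top-k MoE layer for a single token.

    h       : hidden state, list of floats (length d_model)
    W_g     : gating matrix, N_E rows of d_model floats
    experts : list of callables, each mapping h -> output vector
    k       : number of experts activated per token
    """
    # Router scores g(h) = Softmax(W_g h)
    logits = [sum(w * x for w, x in zip(row, h)) for row in W_g]
    gates = softmax(logits)
    # Select the top-k experts and renormalize their gate weights.
    topk = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:k]
    norm = sum(gates[i] for i in topk)
    # Weighted combine of the selected experts' outputs.
    out = [0.0] * len(h)
    for i in topk:
        e_out = experts[i](h)
        w = gates[i] / norm
        out = [o + w * e for o, e in zip(out, e_out)]
    return out, topk
```

Note that only the $k$ selected experts are evaluated, which is exactly the sparse-activation property described above.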


1.2 Definition of Expert Parallelism

Expert Parallelism (EP) is a distributed training and inference strategy that partitions the $N_E$ expert sub-networks across $W_{\text{EP}}$ workers (GPUs), such that each worker holds a disjoint subset of experts.

Formally, if we have $N_E$ experts and $W_{\text{EP}}$ workers in the expert-parallel group, worker $w$ hosts experts:

$$\mathcal{E}_w = \left\{ E_i \;\middle|\; i \in \left[\frac{w \cdot N_E}{W_{\text{EP}}},\; \frac{(w+1) \cdot N_E}{W_{\text{EP}}} - 1\right] \right\}$$

Each worker stores the parameters of $N_E / W_{\text{EP}}$ experts in its local memory.
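The contiguous-block assignment above, and its inverse (which worker hosts a given expert), can be sketched as two small helper functions (names are illustrative):

```python
def experts_on_worker(w, n_experts, ep_size):
    """Contiguous block of expert indices hosted by EP worker w."""
    assert n_experts % ep_size == 0, "N_E must divide evenly across EP workers"
    per_worker = n_experts // ep_size
    return list(range(w * per_worker, (w + 1) * per_worker))

def worker_of_expert(i, n_experts, ep_size):
    """Inverse map: which EP worker hosts expert i (used during dispatch)."""
    return i // (n_experts // ep_size)
```

The inverse map is what the dispatch step in Section 1.3 uses to decide where each token must be sent.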


1.3 Communication Pattern: All-to-All Dispatch and Combine

Since tokens on any given worker may be routed to experts residing on any other worker, EP requires All-to-All collective communication operations:

Step 1 — Dispatch (All-to-All scatter): Each worker determines, for every local token, which remote worker hosts the assigned expert(s). The token hidden states $\mathbf{h}$ are sent to the appropriate workers.

If worker $w$ has $B_w$ tokens and token $j$ is routed to expert $E_i$ on worker $w'$, then $\mathbf{h}_j \in \mathbb{R}^{d_{\text{model}}}$ is transmitted from worker $w$ to worker $w'$.

Step 2 — Expert Computation: Each worker processes all tokens routed to its local experts. Worker $w$ computes:

$$\mathbf{o}_j = E_i(\mathbf{h}_j), \quad \forall\; \mathbf{h}_j \text{ routed to } E_i \in \mathcal{E}_w$$

Step 3 — Combine (All-to-All gather): The expert outputs $\mathbf{o}_j$ are sent back to the originating workers. The originating worker aggregates:

$$\text{MoE}(\mathbf{h}_j) = \sum_{i \in \mathcal{T}_k(\mathbf{h}_j)} g_i(\mathbf{h}_j) \cdot \mathbf{o}_j^{(i)}$$

The total communication volume per MoE layer for EP is:

$$V_{\text{EP}} = 2 \times B_{\text{local}} \times k \times d_{\text{model}} \times \left(1 - \frac{1}{W_{\text{EP}}}\right)$$

The factor of 2 accounts for both dispatch and combine phases. The term $(1 - 1/W_{\text{EP}})$ reflects that tokens routed to local experts require no communication, assuming routing is uniform across workers.
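Plugging numbers into $V_{\text{EP}}$ is straightforward; a small calculator (converting element counts to bytes under an assumed fp16/bf16 activation width) might look like:

```python
def ep_comm_volume(b_local, k, d_model, ep_size, bytes_per_elem=2):
    """All-to-All bytes per MoE layer (dispatch + combine), per worker.

    Implements V_EP = 2 * B_local * k * d_model * (1 - 1/W_EP), assuming
    fp16/bf16 activations (bytes_per_elem=2) and perfectly uniform routing.
    """
    elems = 2 * b_local * k * d_model * (1 - 1 / ep_size)
    return elems * bytes_per_elem
```

For example, 1024 local tokens with Top-2 routing, $d_{\text{model}} = 4096$, and $W_{\text{EP}} = 8$ yield roughly 29 MB of All-to-All traffic per MoE layer per worker.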


1.4 Contrast with Tensor Parallelism

A critical distinction: EP does not shard individual matrix multiplications. In Tensor Parallelism (TP), a single linear layer’s weight matrix $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ is partitioned across workers, requiring AllReduce or ReduceScatter/AllGather operations to reconstruct the full output. In EP, each expert’s weight matrices remain intact on a single worker — the parallelism arises from placing different complete experts on different workers.

| Aspect | Tensor Parallelism | Expert Parallelism |
| --- | --- | --- |
| What is sharded | A single weight matrix $W$ | Distinct expert networks $E_i$ |
| Communication primitive | AllReduce / ReduceScatter + AllGather | All-to-All |
| Communication content | Partial matrix products (activations) | Full token hidden states |
| Requires model-specific logic | Yes (column/row split patterns) | Minimal (only routing logic) |
| Weight integrity per worker | Partial weights | Complete expert weights |

1.5 Why EP Alone Is Insufficient: Combination with Data Parallelism

EP only partitions the MoE (FFN) layers. All non-MoE components — embedding layers, attention layers, LayerNorm, output heads — remain fully replicated across EP workers. Without additional parallelism, every EP worker processes the same input batch through these shared components, resulting in redundant computation.

This is resolved by combining EP with Data Parallelism (DP). With $W_{\text{DP}}$ data-parallel replicas and $W_{\text{EP}}$ expert-parallel workers, the total worker count for these two dimensions is:

$$W_{\text{total}} = W_{\text{DP}} \times W_{\text{EP}}$$

Under this hybrid scheme:

  • Non-MoE layers: Each worker processes a distinct micro-batch shard (standard DP behavior). Gradients are synchronized via AllReduce across $W_{\text{DP}}$ replicas.
  • MoE layers: Tokens are routed across $W_{\text{EP}}$ workers hosting different experts via All-to-All communication.

This eliminates redundancy: each GPU processes a unique data shard through the shared layers, and expert computation is distributed without replication.


1.6 Practical Engineering: Communication-Aware Routing Constraints

Naive expert routing can create prohibitive All-to-All traffic across nodes. DeepSeek-V3 (with $N_E = 256$ experts and Top-$k = 8$ routing) introduces a node-bounded routing constraint: each token may be sent to at most $M$ nodes (in their case, $M = 4$).

Formally, let $\mathcal{N}(i)$ denote the node hosting expert $E_i$. The constrained routing enforces:

$$\left|\left\{\mathcal{N}(i) \;\middle|\; i \in \mathcal{T}_k(\mathbf{h})\right\}\right| \leq M$$

This bound caps cross-node communication at roughly $M / N_{\text{nodes}}$ of the unconstrained volume, keeping most All-to-All traffic within the high-bandwidth intra-node interconnect (e.g., NVLink at 900 GB/s) rather than the lower-bandwidth inter-node fabric (e.g., InfiniBand at 400 Gb/s).
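One simple way to enforce the node-count constraint (a sketch, not DeepSeek-V3's exact selection rule) is to first rank nodes by their best expert score, keep the top $M$ nodes, and then take the top-$k$ experts among those nodes only:

```python
def node_limited_topk(scores, node_of, k, m):
    """Top-k expert selection restricted to at most m nodes.

    scores  : per-expert router scores for one token
    node_of : node_of[i] = node hosting expert i
    Greedy scheme: keep the m nodes whose best expert scores highest,
    then select the top-k experts among those nodes only.
    """
    best_per_node = {}
    for i, s in enumerate(scores):
        n = node_of[i]
        best_per_node[n] = max(best_per_node.get(n, float("-inf")), s)
    kept = set(sorted(best_per_node, key=best_per_node.get, reverse=True)[:m])
    candidates = [i for i in range(len(scores)) if node_of[i] in kept]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]
```

By construction, the selected experts span at most $m$ nodes, which is exactly the constraint $|\{\mathcal{N}(i)\}| \leq M$ above.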


1.7 Memory Impact of Expert Parallelism

Per-worker parameter memory for the MoE layers reduces linearly:

$$\text{Mem}_{\text{EP}}^{\text{experts}} = \frac{N_E \cdot \text{Params}(E_i)}{W_{\text{EP}}}$$

where $\text{Params}(E_i) = 2 \times d_{\text{model}} \times d_{\text{ffn}}$ for a standard two-layer FFN expert (ignoring biases). For DeepSeek-V3 with 256 experts and $W_{\text{EP}} = 64$, each worker holds only 4 experts.
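As a quick sanity check on this formula, the per-worker expert memory can be computed directly (the dimensions in the test below are illustrative round numbers, not DeepSeek-V3's actual sizes):

```python
def expert_param_memory_gib(n_experts, d_model, d_ffn, ep_size, bytes_per_param=2):
    """Per-worker expert parameter memory (GiB), assuming fp16/bf16 weights
    and two-matrix FFN experts (up-projection + down-projection, no biases)."""
    params_per_expert = 2 * d_model * d_ffn
    local_experts = n_experts // ep_size
    return local_experts * params_per_expert * bytes_per_param / 2**30
```

With 256 experts, $d_{\text{model}} = 4096$, $d_{\text{ffn}} = 2048$, and $W_{\text{EP}} = 64$, each worker's 4 local experts occupy 0.125 GiB.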

Activation memory per worker depends on the number of tokens routed to local experts, which is governed by the load balancing properties of the router.


2. 5D Parallelism: Unified Framework

2.1 The Five Parallelism Dimensions

Modern large-scale training decomposes the computation along five orthogonal dimensions, each addressing a distinct axis of the training tensor:

| Strategy | Abbreviation | Parallel/Sharding Dimension | What Is Partitioned |
| --- | --- | --- | --- |
| Data Parallelism | DP | Batch dimension $B$ | Input samples |
| Tensor Parallelism | TP | Hidden dimension $d_{\text{model}}$ | Weight matrices and activations |
| Sequence/Context Parallelism | SP/CP | Sequence dimension $s$ | Token sequences |
| Pipeline Parallelism | PP | Model depth (layers) $L$ | Transformer layers |
| Expert Parallelism | EP | Expert dimension $N_E$ | Expert sub-networks |

The total number of GPUs $W$ satisfies:

$$W = W_{\text{DP}} \times W_{\text{TP}} \times W_{\text{PP}} \times W_{\text{CP}} \times W_{\text{EP}}$$

2.2 ZeRO Strategies (Orthogonal Memory Optimizations on DP)

ZeRO (Zero Redundancy Optimizer) is not a separate parallelism dimension but a set of memory optimization stages applied within the DP group of size $W_{\text{DP}}$:

| Stage | What Is Sharded Across DP Replicas | Memory Reduction Factor (Approx.) |
| --- | --- | --- |
| ZeRO-1 | Optimizer states $O$ | Optimizer memory $\div\, W_{\text{DP}}$ |
| ZeRO-2 | Optimizer states $O$ + gradients $G$ | $(O + G) \div W_{\text{DP}}$ |
| ZeRO-3 | Optimizer states $O$ + gradients $G$ + parameters $\Theta$ | $(O + G + \Theta) \div W_{\text{DP}}$ |

For a model with $\Phi$ parameters in mixed precision (fp16 params + fp32 optimizer), the per-GPU memory without ZeRO is:

$$M_{\text{base}} = 2\Phi + 2\Phi + (4\Phi + 4\Phi + 4\Phi) = 16\Phi \text{ bytes}$$

where $2\Phi$ is for fp16 parameters, $2\Phi$ for fp16 gradients, and $12\Phi$ for Adam optimizer states (fp32 parameter copy, first moment, second moment).

With ZeRO-3 across $W_{\text{DP}}$ workers:

$$M_{\text{ZeRO-3}} = \frac{16\Phi}{W_{\text{DP}}} + M_{\text{activations}}$$
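The $16\Phi$ breakdown and the per-stage sharding can be folded into one estimator (parameter states only, excluding activations):

```python
def zero_memory_gib(n_params, dp_size, stage):
    """Per-GPU parameter-state memory (GiB) under mixed-precision Adam.

    Bytes per parameter: fp16 params = 2, fp16 grads = 2,
    optimizer states = 12 (fp32 master copy + two Adam moments).
    Each ZeRO stage shards one more component across the DP group.
    """
    p, g, o = 2, 2, 12          # the 2 + 2 + 12 = 16 bytes/param baseline
    if stage >= 1:
        o /= dp_size            # ZeRO-1: shard optimizer states
    if stage >= 2:
        g /= dp_size            # ZeRO-2: also shard gradients
    if stage >= 3:
        p /= dp_size            # ZeRO-3: also shard parameters
    return n_params * (p + g + o) / 2**30
```

For a 7B-parameter model on 8 DP replicas, the baseline is about 104 GiB per GPU, dropping to about 13 GiB under ZeRO-3 (activations excluded in both cases).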

3. Comparative Analysis: Pipeline Parallelism vs. ZeRO-3

Both PP and ZeRO-3 distribute model parameters across GPUs along the model depth axis, but they differ fundamentally in mechanism:

3.1 Side-by-Side Comparison

| Property | ZeRO-3 | Pipeline Parallelism |
| --- | --- | --- |
| Per-worker storage | A fraction of each layer’s parameters | Complete layers (one or more full layers) |
| Communication transfers | Weights (AllGather before forward, ReduceScatter after backward) | Activations (point-to-point between pipeline stages) |
| Orchestration complexity | Model-agnostic (automatic parameter gathering) | Model-agnostic but requires schedule design (1F1B, interleaved, etc.) |
| Implementation challenge | Managing parameter partitioning, prefetching, and communication overlap | Managing micro-batch scheduling to minimize the pipeline bubble |
| Scaling preference | Large $\text{mbs}$ and $\text{seq\_len}$ to amortize weight communication | Large $\text{grad\_acc}$ (many micro-batches) to minimize the bubble ratio |

3.2 Why Combining PP + ZeRO-3 Is Rare

When combining PP and ZeRO-3, both weight communication (ZeRO-3) and activation communication (PP) occur simultaneously. The total communication cost becomes:

$$C_{\text{combined}} = C_{\text{ZeRO-3}}^{\text{weights}} + C_{\text{PP}}^{\text{activations}}$$

To amortize both costs, the global batch size $B_{\text{global}}$ must be increased significantly:

$$B_{\text{global}} = \text{mbs} \times W_{\text{DP}} \times \text{grad\_acc}$$

This creates a multi-dimensional trade-off between global batch size, model size, network bandwidth, and training convergence (since excessively large batch sizes can degrade final model quality).

Practical guidance: If combining them, ZeRO-3 should be configured to retain parameters in memory during the sequence of PP micro-batches, avoiding repeated AllGather operations for the same parameters across micro-batches.

3.3 PP + ZeRO-1/ZeRO-2: Natural Combination

ZeRO-1 and ZeRO-2 shard only optimizer states (and gradients), which are only needed during the optimizer step — not during forward/backward computation. This means they introduce no additional communication during the PP micro-batch processing loop, making them naturally complementary.

Real-world example: DeepSeek-V3 training uses PP + ZeRO-1.


4. Tensor Parallelism + Sequence Parallelism: Interaction with Other Strategies

4.1 Natural Complementarity with PP and ZeRO-3

TP exploits the distributive property of matrix multiplication. For a linear layer $Y = XW$, with column-parallel sharding of $W$ into $[W_1, W_2]$ across 2 workers:

$$Y = X[W_1, W_2] = [XW_1, XW_2]$$

Each partial computation $XW_i$ is independent, and results are combined via AllGather or ReduceScatter. TP thus operates at sub-layer granularity, making it orthogonal to PP (which operates at layer granularity) and ZeRO-3 (which also gathers parameters at layer granularity).
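The identity $X[W_1, W_2] = [XW_1, XW_2]$ can be verified directly. The sketch below simulates column-parallel sharding on nested lists, with the final concatenation standing in for the AllGather (function names are illustrative):

```python
def matmul(X, W):
    # Plain matrix product on nested lists: (rows of X) x (columns of W).
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)]
            for row in X]

def column_parallel(X, W, tp_size):
    """Compute Y = XW by splitting W column-wise across tp_size 'workers'
    and concatenating the partial outputs (the AllGather step)."""
    n_cols = len(W[0])
    assert n_cols % tp_size == 0
    chunk = n_cols // tp_size
    # Each worker r holds only its column shard of W.
    shards = [[row[r * chunk:(r + 1) * chunk] for row in W]
              for r in range(tp_size)]
    partials = [matmul(X, Wr) for Wr in shards]                # local matmuls
    return [sum((p[i] for p in partials), []) for i in range(len(X))]
```

The sharded result matches the unsharded product exactly, which is why no numerical approximation is involved in TP.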

4.2 Two Fundamental Limitations of TP

Limitation 1 — Communication on Critical Path:

TP communication (AllReduce or equivalent) lies on the critical path of computation. For each Transformer layer, the communication cost scales as:

$$T_{\text{comm}}^{\text{TP}} = 4 \times \frac{2 \cdot (W_{\text{TP}} - 1)}{W_{\text{TP}}} \times \frac{b \times s \times d_{\text{model}}}{\text{BW}_{\text{intra}}}$$

where the factor 4 accounts for two linear layers × (forward + backward), $b$ is the micro-batch size, $s$ is the sequence length, and $\text{BW}_{\text{intra}}$ is the intra-node bandwidth. As $W_{\text{TP}}$ grows, the compute per worker shrinks as $\mathcal{O}(1/W_{\text{TP}})$ while communication remains $\mathcal{O}(1)$, leading to diminishing returns.
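This cost model is easy to evaluate numerically. The sketch below adds an explicit bytes-per-element factor (assuming fp16/bf16 activations), which the formula above leaves implicit:

```python
def tp_comm_seconds(b, s, d_model, tp_size, bw_bytes_per_s, bytes_per_elem=2):
    """Per-layer TP communication time from the ring-AllReduce cost model:
    4 collectives x 2(W-1)/W x (b * s * d_model) elements over the link."""
    elems = b * s * d_model
    ring_factor = 2 * (tp_size - 1) / tp_size   # ring-AllReduce traffic term
    return 4 * ring_factor * elems * bytes_per_elem / bw_bytes_per_s
```

For $b = 1$, $s = 4096$, $d_{\text{model}} = 8192$, $W_{\text{TP}} = 8$ over NVLink at 900 GB/s, this gives roughly half a millisecond per layer, and the time only grows as $W_{\text{TP}}$ increases (the $2(W-1)/W$ term approaches 2), illustrating the diminishing returns.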

Limitation 2 — Model-Specific Implementation:

TP requires explicit knowledge of where to shard along the hidden dimension (TP regions) vs. where to shard along the sequence dimension (SP regions). Attention projections, FFN layers, LayerNorm, and dropout each require different sharding strategies, making TP non-trivially model-specific.

4.3 Consequence: TP Confined to Intra-Node

Given these limitations, TP is kept within high-bandwidth intra-node interconnects (e.g., 8 GPUs connected via NVLink at 900 GB/s per GPU), while PP or ZeRO-3 handles inter-node distribution over lower-bandwidth fabrics (e.g., InfiniBand at 400 Gb/s per link).


5. Context Parallelism (CP): Complementary to TP

5.1 What CP Targets

CP shards activations along the sequence length dimension $s$ across $W_{\text{CP}}$ workers. Each worker processes a contiguous subsequence of length $s / W_{\text{CP}}$.

  • MLP, LayerNorm: These are point-wise or token-independent operations and process sharded sequences without any communication.
  • Attention layers: Each token’s query must attend to keys/values from the full sequence, requiring communication.

5.2 Ring Attention for CP

Ring Attention organizes the $W_{\text{CP}}$ workers in a logical ring. Each worker holds a local KV shard and iteratively passes KV blocks to the next worker while computing partial attention on the current KV block. After $W_{\text{CP}} - 1$ communication steps, every worker has attended to all KV blocks.

The communication is overlapped with computation, so the effective overhead is:

$$T_{\text{CP}} \approx \max\left(T_{\text{compute}}^{\text{attn}},\; T_{\text{comm}}^{\text{KV}}\right) \quad \text{(per ring step)}$$

where:

$$T_{\text{comm}}^{\text{KV}} = \frac{2 \times b \times (s / W_{\text{CP}}) \times d_{\text{model}}}{\text{BW}}$$

CP is specifically valuable for extreme sequence lengths ($s \geq 128\text{k}$ tokens), where even with full activation recomputation, the attention activation memory $\mathcal{O}(b \times n_h \times s^2)$ (where $n_h$ is the number of attention heads) would exceed single-GPU memory.
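The rotation pattern in Section 5.2 can be written down explicitly: at ring step $t$, worker $w$ computes attention against the KV shard originally owned by worker $(w - t) \bmod W_{\text{CP}}$, while forwarding its current shard to the next worker. A small sketch of that schedule:

```python
def ring_schedule(cp_size):
    """schedule[t][w] = origin of the KV block worker w processes at step t.

    Step 0 is the local shard; each subsequent step rotates blocks one
    position around the ring, so after cp_size steps every worker has
    seen every KV block exactly once.
    """
    return [[(w - t) % cp_size for w in range(cp_size)]
            for t in range(cp_size)]
```

Checking the schedule confirms the claim above: after $W_{\text{CP}} - 1$ passes (plus the local step), each worker has attended to all $W_{\text{CP}}$ KV blocks.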


6. Expert Parallelism (EP): Complementary to TP

EP targets the MoE FFN layers exclusively. Attention layers, LayerNorm, embeddings, and output heads are completely unaffected by EP. This makes EP orthogonal to:

  • TP/SP (which shards attention and FFN weight matrices along hidden/sequence dims)
  • CP (which shards attention KV along sequence dim)
  • PP (which shards entire layers along depth)

6.1 EP vs. DP: Structural Similarity

There is a notable structural similarity between EP and DP regarding input handling:

  • In DP, each worker processes different data through identical model copies.
  • In EP (without additional DP), each worker processes the same data through different experts.

This duality is why some frameworks treat EP as a specialized variant of DP, where the “replication” is replaced by “expert routing.” The critical difference is that EP workers hold non-identical model components (different experts), while DP workers hold identical model copies.


7. Scope of Each Parallelism Strategy Within a Transformer Layer

| Strategy | Attention Layers | FFN / MoE Layers | LayerNorm | Embeddings |
| --- | --- | --- | --- | --- |
| TP + SP | ✅ Shards $W_Q, W_K, W_V, W_O$ along heads/hidden dim | ✅ Shards FFN weights along hidden dim | ✅ (SP: sharded along seq dim) | ✅ Shards embedding matrix |
| CP | ✅ Primary impact — requires KV communication | ⚪ Independent processing | ⚪ Independent | ⚪ Independent |
| EP | ⚪ Unchanged | ✅ Primary impact — experts distributed | ⚪ Unchanged | ⚪ Unchanged |
| PP | Entire layers assigned to stages | Entire layers assigned to stages | Part of the assigned stage | Often first/last stage (special handling) |
| ZeRO | Parameters sharded across DP replicas | Parameters sharded across DP replicas | Parameters sharded | Parameters sharded |

Legend: ✅ = directly affected, ⚪ = unaffected/independent operation.


8. Comprehensive Comparison Table

| Method | Memory Savings Target | Parallel Dimension | Primary Disadvantage |
| --- | --- | --- | --- |
| DP | Activations (reduced local batch) | Batch $B$ | Limited by maximum effective batch size |
| PP | Model parameters | Model layers $L$ | Pipeline bubble and complex scheduling |
| TP + SP | Parameters and activations | Hidden $d$ / sequence $s$ | Requires high-bandwidth intra-node interconnect |
| CP | Activations | Sequence length $s$ | Communication overhead in attention |
| EP | Expert parameters | Expert dimension $N_E$ | Requires MoE architecture; routing communication |
| ZeRO-1 | Optimizer states | Sharded across DP replicas | Extra collectives at optimizer step |
| ZeRO-2 | Optimizer states + gradients | Sharded across DP replicas | Extra collectives at optimizer step |
| ZeRO-3 | Optimizer states + gradients + parameters | Sharded across DP replicas | Parameter AllGather on every forward/backward |

9. Interaction and Combination Rules: Practical Summary

9.1 Naturally Complementary Combinations

| Combination | Why It Works |
| --- | --- |
| TP + PP | TP shards within layers (intra-node); PP shards across layers (inter-node). Orthogonal axes. |
| TP + DP | TP within node, DP across nodes. Standard combination. |
| PP + ZeRO-1/2 | ZeRO-1/2 shard optimizer states/gradients (used only at the optimizer step), not interfering with PP micro-batch processing. |
| TP + CP | TP shards the hidden dim; CP shards the sequence dim. Orthogonal. |
| EP + DP | EP distributes experts; DP distributes input batches. Eliminates redundant computation on shared layers. |
| EP + CP | EP targets MoE layers; CP targets attention. No interference. |

9.2 Combinations Requiring Caution

| Combination | Issue |
| --- | --- |
| PP + ZeRO-3 | Both introduce communication on the depth axis. Requires very large batch sizes to amortize the dual communication costs. Rarely used in practice. |
| TP at large scale ($W_{\text{TP}} > 8$) | Communication dominates compute. Typically restricted to $\leq 8$ GPUs within a single node. |

9.3 Typical Hierarchy in Practice

For training a model on a large GPU cluster with $H$ nodes of $G$ GPUs each:

$$\underbrace{W_{\text{TP}} = G}_{\text{intra-node}} \times \underbrace{W_{\text{PP}} \times W_{\text{DP}} \times W_{\text{CP}} \times W_{\text{EP}}}_{\text{inter-node}} = H \times G$$
  • Innermost (fastest interconnect): TP (NVLink, ~900 GB/s)
  • Middle tier: PP, CP, EP (InfiniBand, ~50–100 GB/s effective)
  • Outermost (most tolerant of latency): DP with ZeRO (communication only at gradient sync / optimizer step)
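This hierarchy maps directly onto a rank decomposition: placing TP innermost means consecutive global ranks share a node and hence a TP group. A sketch of the mapping (the ordering of the middle dimensions is one reasonable choice, not a fixed convention):

```python
def rank_to_5d(rank, sizes):
    """Map a global rank to its (tp, ep, cp, pp, dp) coordinates.

    sizes : dict of group sizes whose product equals the world size.
    TP is innermost so that each TP group occupies consecutive ranks
    (i.e., stays within one node on typical launchers).
    """
    coords = {}
    for dim in ("tp", "ep", "cp", "pp", "dp"):   # innermost to outermost
        coords[dim] = rank % sizes[dim]
        rank //= sizes[dim]
    return coords
```

With $W_{\text{TP}} = 8$, ranks 0 through 7 all share tp-group coordinates except for their `tp` index, matching the intra-node placement described above.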

10. Unified 5D Parallelism Diagram: Mathematical Formulation

For a single MoE Transformer layer, the computation on worker $w$, identified by its 5D coordinate $(w_{\text{DP}},\, w_{\text{TP}},\, w_{\text{PP}},\, w_{\text{CP}},\, w_{\text{EP}})$, proceeds as:

Input: A micro-batch shard $\mathbf{X}^{(w_{\text{DP}})} \in \mathbb{R}^{\text{mbs} \times (s/W_{\text{CP}}) \times d_{\text{model}}}$

Attention block (TP + CP active):

$$\mathbf{Q}_w = \mathbf{X}^{(w_{\text{DP}})} \cdot W_Q^{(w_{\text{TP}})}, \quad \mathbf{K}_w = \mathbf{X}^{(w_{\text{DP}})} \cdot W_K^{(w_{\text{TP}})}, \quad \mathbf{V}_w = \mathbf{X}^{(w_{\text{DP}})} \cdot W_V^{(w_{\text{TP}})}$$

CP Ring Attention circulates the full $\mathbf{K}, \mathbf{V}$ across the $W_{\text{CP}}$ workers for attention computation. TP AllReduce (or ReduceScatter) combines partial attention outputs across the $W_{\text{TP}}$ workers.

MoE FFN block (EP + TP active):

$$\mathbf{h}_{\text{post-attn}} \xrightarrow{\text{Router}} \text{All-to-All dispatch to } W_{\text{EP}} \text{ workers} \xrightarrow{E_{i}^{(w_{\text{EP}})}} \text{All-to-All combine} \rightarrow \mathbf{h}_{\text{out}}$$

Pipeline dimension: The above describes a single stage $w_{\text{PP}}$ in the pipeline. Activations flow from stage $w_{\text{PP}}$ to stage $w_{\text{PP}} + 1$ via point-to-point communication.

Gradient synchronization (DP + ZeRO): After the backward pass through all micro-batches, gradients are synchronized across $W_{\text{DP}}$ replicas via AllReduce (DP) or ReduceScatter (ZeRO-2/3).


11. Key Takeaway

No single parallelism strategy is a universal solution. Each addresses a specific dimension of the training tensor and introduces its own communication overhead. The art of large-scale training lies in composing these strategies such that:

  1. Communication-intensive strategies (TP) use the fastest interconnects.
  2. Computation-tolerant strategies (DP, ZeRO) span slower interconnects.
  3. The global batch size remains within the convergence-optimal range.
  4. Memory is balanced across all workers to avoid stragglers.

The optimal configuration is determined by the interplay of model architecture ($d_{\text{model}}, L, N_E, s$), hardware topology (intra-node vs. inter-node bandwidth), and training hyperparameters ($B_{\text{global}}, \text{mbs}, \text{grad\_acc}$).


Generated from llm_training_at_scale at .