Expert Parallelism and 5D Parallelism: A Comprehensive Technical Treatment
1. Expert Parallelism (EP)
1.1 Prerequisite: Mixture of Experts (MoE) Architecture
In a standard Transformer layer, every token passes through a single feedforward network (FFN). The Mixture of Experts paradigm replaces this monolithic FFN with $N$ independent expert networks $E_1, \dots, E_N$, each itself an FFN with its own parameters.

For a given input hidden state $x \in \mathbb{R}^{h}$, a learned router (gating network) assigns a score to each expert:

$$g(x) = \mathrm{softmax}(W_g x), \qquad W_g \in \mathbb{R}^{N \times h}$$

where $g_i(x)$ denotes the routing weight assigned to expert $i$. Under a Top-$k$ routing scheme, only the $k$ highest-scoring experts process the token:

$$y = \sum_{i \in \mathrm{TopK}_k(g(x))} g_i(x)\, E_i(x)$$

where the selected gate weights are typically renormalized to sum to 1 over the chosen set.

Key property: Each token activates only $k \ll N$ experts, so per-token compute stays close to that of a dense FFN while the total parameter count grows with $N$.
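The routing computation described above can be sketched in a few lines of plain Python. This is an illustrative toy (the experts are stand-in scaling functions, and the gate scores are hand-picked), not any framework's actual implementation:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_scores, k):
    """Return (expert index, gate weight) pairs for the k highest-scoring
    experts, with weights renormalized to sum to 1 over the selected set."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Toy example: 4 experts, top-2 routing. Each "expert" just scales its input.
experts = [lambda x, c=c: [c * v for v in x] for c in (1.0, 2.0, 3.0, 4.0)]
x = [0.5, -0.5]
gate_scores = [0.1, 2.0, 0.3, 1.5]         # router logits for this token

selected = top_k_route(gate_scores, k=2)   # only experts 1 and 3 run
y = [0.0] * len(x)
for idx, w in selected:
    out = experts[idx](x)                  # expert forward pass
    y = [a + w * b for a, b in zip(y, out)]  # gate-weighted combine
```

Note that experts 0 and 2 are never evaluated for this token, which is exactly the sparsity that EP later exploits.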
1.2 Definition of Expert Parallelism
Expert Parallelism (EP) is a distributed training and inference strategy that partitions the $N$ experts of each MoE layer across a group of workers, so that no single device must hold every expert's parameters.

Formally, if we have $N$ experts and an EP group of $W_{EP}$ workers, each worker hosts $N / W_{EP}$ experts (assuming even divisibility).

Each worker stores only the parameters of its local experts; the router and all non-expert components of the layer remain replicated on every worker.
1.3 Communication Pattern: All-to-All Dispatch and Combine
Since tokens on any given worker may be routed to experts residing on any other worker, EP requires All-to-All collective communication operations:
Step 1 — Dispatch (All-to-All scatter): Each worker determines, for every local token, which remote worker hosts the assigned expert(s). The token hidden states are then exchanged: if worker $i$ has $T_{i \to j}$ tokens routed to experts on worker $j$, it sends a buffer of shape $(T_{i \to j}, h)$ to worker $j$.

Step 2 — Expert Computation: Each worker processes all tokens routed to its local experts. Worker $j$ applies each local expert to the tokens it received for that expert, with no further communication.

Step 3 — Combine (All-to-All gather): The expert outputs are returned to the tokens' originating workers via a second All-to-All, where they are summed with their gate weights $g_i(x)$ to form the final layer output.

The total communication volume per MoE layer for EP is approximately:

$$V_{EP} \approx 2 \cdot k \cdot T \cdot h$$

The factor of 2 accounts for both dispatch and combine phases. The term $k \cdot T \cdot h$ reflects that each of the $T$ local tokens is sent to $k$ experts, each transfer carrying an $h$-dimensional hidden state.
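As a sanity check on this accounting, a tiny helper can compute the per-worker All-to-All traffic; the configuration below (4096 local tokens, hidden size 7168, top-8, bf16) is illustrative:

```python
def ep_comm_volume(tokens, hidden_dim, top_k, bytes_per_elem=2):
    """Approximate per-worker All-to-All bytes for one MoE layer:
    2 (dispatch + combine) * k * T * h elements, times element size."""
    return 2 * top_k * tokens * hidden_dim * bytes_per_elem

# Example: 4096 local tokens, hidden size 7168, top-8 routing, bf16 (2 bytes).
vol = ep_comm_volume(tokens=4096, hidden_dim=7168, top_k=8)
print(f"{vol / 2**20:.0f} MiB per MoE layer")  # ~896 MiB per worker
```

Nearly a gigabyte of traffic per layer per step is why the routing constraints of Section 1.6 matter so much on inter-node links.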
1.4 Contrast with Tensor Parallelism
A critical distinction: EP does not shard individual matrix multiplications. In Tensor Parallelism (TP), a single linear layer's weight matrix $W$ is split across workers (by rows or columns), and each worker computes a partial product that must later be combined. In EP, every expert's weight matrices live intact on one worker; what moves between workers is whole token hidden states, not partial products.
| Aspect | Tensor Parallelism | Expert Parallelism |
|---|---|---|
| What is sharded | A single weight matrix | Distinct expert networks |
| Communication primitive | AllReduce / ReduceScatter + AllGather | All-to-All |
| Communication content | Partial matrix products (activations) | Full token hidden states |
| Requires model-specific logic | Yes (column/row split patterns) | Minimal (only routing logic) |
| Weight integrity per worker | Partial weights | Complete expert weights |
1.5 Why EP Alone Is Insufficient: Combination with Data Parallelism
EP only partitions the MoE (FFN) layers. All non-MoE components — embedding layers, attention layers, LayerNorm, output heads — remain fully replicated across EP workers. Without additional parallelism, every EP worker processes the same input batch through these shared components, resulting in redundant computation.
This is resolved by combining EP with Data Parallelism (DP). With the EP workers simultaneously acting as DP replicas, the global batch is split so that each worker receives a distinct micro-batch shard.
Under this hybrid scheme:
- Non-MoE layers: Each worker processes a distinct micro-batch shard (standard DP behavior). Gradients are synchronized via AllReduce across the replicas.
- MoE layers: Tokens are routed across the workers hosting different experts via All-to-All communication.
This eliminates redundancy: each GPU processes a unique data shard through the shared layers, and expert computation is distributed without replication.
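One way to picture the hybrid grouping is to map each global rank to a (DP shard, EP slot) pair. The layout below is a hypothetical convention (EP as the innermost, fastest-varying dimension), not a fixed framework standard:

```python
def ep_dp_layout(world_size, ep_size, num_experts):
    """Map each global rank to its data shard and local expert IDs,
    assuming EP is the innermost (fastest-varying) dimension."""
    assert world_size % ep_size == 0 and num_experts % ep_size == 0
    experts_per_rank = num_experts // ep_size
    layout = {}
    for rank in range(world_size):
        dp_rank, ep_rank = divmod(rank, ep_size)
        layout[rank] = {
            "data_shard": rank,  # every GPU gets a unique micro-batch shard
            "experts": list(range(ep_rank * experts_per_rank,
                                  (ep_rank + 1) * experts_per_rank)),
            "ep_group": dp_rank,  # ranks sharing dp_rank form one EP group
        }
    return layout

layout = ep_dp_layout(world_size=8, ep_size=4, num_experts=16)
# Rank 5 -> EP group 1, hosts experts 4..7, processes data shard 5.
```

Each EP group of 4 ranks collectively holds all 16 experts, while all 8 ranks process distinct data through the replicated non-MoE layers.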
1.6 Practical Engineering: Communication-Aware Routing Constraints
Naive expert routing can create prohibitive All-to-All traffic across nodes. DeepSeek-V3 (with 256 routed experts, of which 8 are activated per token) addresses this with a node-limited routing constraint: each token may be dispatched to experts on at most $M$ nodes ($M = 4$ in DeepSeek-V3).

Formally, let $M$ denote the maximum number of nodes whose experts a single token may use. The router first selects the $M$ nodes with the highest aggregate affinity scores, then chooses the token's top-$k$ experts only from experts on those nodes.

This bound caps each token's cross-node fan-out at $M$ destination nodes instead of up to $k$, substantially reducing inter-node All-to-All volume while preserving most of the routing freedom.
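A node-limited selection can be sketched as a two-stage top-k. This is a simplified stand-in for DeepSeek-V3's actual scoring (which aggregates per-node affinities differently); the sizes here are toy values:

```python
def node_limited_topk(scores, experts_per_node, k, max_nodes):
    """Pick top-k experts, restricted to the max_nodes nodes with the
    highest aggregate scores. scores[i] is the gate score of expert i."""
    num_nodes = len(scores) // experts_per_node
    # Stage 1: rank nodes by the sum of their experts' scores (simplified).
    node_score = lambda n: sum(
        scores[n * experts_per_node:(n + 1) * experts_per_node])
    allowed = sorted(range(num_nodes), key=node_score, reverse=True)[:max_nodes]
    # Stage 2: top-k experts restricted to the allowed nodes.
    candidates = [i for i in range(len(scores))
                  if i // experts_per_node in allowed]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]

# 8 experts on 4 nodes (2 per node); a token may touch at most 2 nodes.
scores = [0.9, 0.1, 0.8, 0.7, 0.85, 0.05, 0.6, 0.5]
chosen = node_limited_topk(scores, experts_per_node=2, k=3, max_nodes=2)
# chosen spans at most 2 nodes; unconstrained top-3 would span 3 nodes.
```

Note the trade-off: expert 0 (score 0.9) is skipped because its node's aggregate score is low, trading a slightly worse routing decision for bounded fan-out.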
1.7 Memory Impact of Expert Parallelism
Per-worker parameter memory for the MoE layers reduces linearly:

$$M_{\text{experts}} = \frac{N \cdot P_{\text{expert}}}{W_{EP}}$$

where $P_{\text{expert}}$ is the parameter count of a single expert and $W_{EP}$ is the EP group size.
Activation memory per worker depends on the number of tokens routed to local experts, which is governed by the load balancing properties of the router.
2. 5D Parallelism: Unified Framework
2.1 The Five Parallelism Dimensions
Modern large-scale training decomposes the computation along five orthogonal dimensions, each addressing a distinct axis of the training tensor:
| Strategy | Abbreviation | Parallel/Sharding Dimension | What Is Partitioned |
|---|---|---|---|
| Data Parallelism | DP | Batch dimension | Input samples |
| Tensor Parallelism | TP | Hidden dimension | Weight matrices and activations |
| Sequence/Context Parallelism | SP/CP | Sequence dimension | Token sequences |
| Pipeline Parallelism | PP | Model depth (layers) | Transformer layers |
| Expert Parallelism | EP | Expert dimension | Expert sub-networks |
The total number of GPUs is the product of the parallelism degrees along these dimensions: $W = W_{DP} \times W_{TP} \times W_{PP} \times W_{CP} \times W_{EP}$ when the dimensions are fully independent (in practice, EP groups are often carved out of the DP dimension rather than multiplying it).
2.2 ZeRO Strategies (Orthogonal Memory Optimizations on DP)
ZeRO (Zero Redundancy Optimizer) is not a separate parallelism dimension but a set of memory optimization stages applied within the DP group of size $W_{DP}$:
| Stage | What Is Sharded Across DP Replicas | Memory Reduction Factor (Approx.) |
|---|---|---|
| ZeRO-1 | Optimizer states | Optimizer memory ÷ $W_{DP}$ |
| ZeRO-2 | Optimizer states + gradients | Optimizer and gradient memory ÷ $W_{DP}$ |
| ZeRO-3 | Optimizer states + gradients + parameters | All three ÷ $W_{DP}$ |
For a model with $\Psi$ parameters trained with mixed-precision Adam, per-replica memory is approximately:

$$M \approx \underbrace{2\Psi}_{\text{fp16 weights}} + \underbrace{2\Psi}_{\text{fp16 gradients}} + \underbrace{12\Psi}_{\text{fp32 optimizer states}} = 16\Psi \;\text{bytes}$$

where the optimizer states comprise fp32 master weights, momentum, and variance (4 bytes each per parameter).

With ZeRO-3 across $W_{DP}$ replicas, this shrinks to approximately $16\Psi / W_{DP}$ bytes per replica (plus communication buffers).
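The staged accounting is easy to tabulate. The sketch below assumes fp16 weights/gradients and fp32 Adam states as above, and ignores activations and buffers:

```python
def zero_memory_gb(num_params, dp_size, stage):
    """Approximate per-GPU memory (GB) for weights + grads + optimizer
    states under a given ZeRO stage (0 = plain DP)."""
    weights, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:
        optim /= dp_size    # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= dp_size    # ZeRO-2: also shard gradients
    if stage >= 3:
        weights /= dp_size  # ZeRO-3: also shard parameters
    return (weights + grads + optim) / 1e9

# 7B parameters on 64 DP replicas:
for stage in (0, 1, 2, 3):
    print(f"ZeRO-{stage}: {zero_memory_gb(7e9, 64, stage):.1f} GB")
```

For this configuration the per-GPU footprint drops from 112 GB (plain DP, infeasible on a single accelerator) to 1.75 GB under ZeRO-3.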
3. Comparative Analysis: Pipeline Parallelism vs. ZeRO-3
Both PP and ZeRO-3 reduce per-GPU parameter memory by distributing model parameters across GPUs, but they differ fundamentally in mechanism:
3.1 Side-by-Side Comparison
| Property | ZeRO-3 | Pipeline Parallelism |
|---|---|---|
| Per-worker storage | A fraction of each layer’s parameters | Complete layers (one or more full layers) |
| Communication transfers | Weights (AllGather before forward, ReduceScatter after backward) | Activations (point-to-point between pipeline stages) |
| Orchestration complexity | Model-agnostic (automatic parameter gathering) | Model-agnostic but requires schedule design (1F1B, interleaved, etc.) |
| Implementation challenge | Managing parameter partitioning, prefetching, and communication overlap | Managing micro-batch scheduling to minimize pipeline bubble |
| Scaling preference | Large per-GPU batch to amortize weight AllGather traffic | Large number of micro-batches to amortize the pipeline bubble |
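To make the "schedule design" and "pipeline bubble" entries concrete: for an ideal 1F1B-style schedule with $p$ stages and $m$ micro-batches, the idle fraction is $(p-1)/(m+p-1)$, which this quick calculation (illustrative numbers) demonstrates:

```python
def bubble_fraction(stages, microbatches):
    """Idle fraction of an ideal 1F1B pipeline schedule:
    (p - 1) / (m + p - 1)."""
    return (stages - 1) / (microbatches + stages - 1)

# With 8 stages, 8 micro-batches leave ~47% of the pipeline idle;
# 64 micro-batches cut that to under 10%.
few = bubble_fraction(8, 8)     # 7/15
many = bubble_fraction(8, 64)   # 7/71
```

This is why PP pushes toward many micro-batches per step, the same pressure on global batch size that Section 3.2 discusses.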
3.2 Why Combining PP + ZeRO-3 Is Rare
When combining PP and ZeRO-3, both weight communication (ZeRO-3) and activation communication (PP) occur simultaneously. The total communication cost is the sum of the two: parameter AllGather/ReduceScatter traffic from ZeRO-3 plus point-to-point activation transfers between pipeline stages.

To amortize both costs, the global batch size must be large: large enough per micro-batch to hide ZeRO-3's weight communication behind compute, and spread over enough micro-batches to keep the pipeline bubble small.
This creates a multi-dimensional trade-off between global batch size, model size, network bandwidth, and training convergence (since excessively large batch sizes can degrade final model quality).
Practical guidance: If combining them, ZeRO-3 should be configured to retain parameters in memory during the sequence of PP micro-batches, avoiding repeated AllGather operations for the same parameters across micro-batches.
3.3 PP + ZeRO-1/ZeRO-2: Natural Combination
ZeRO-1 and ZeRO-2 shard only optimizer states (and gradients), which are only needed during the optimizer step — not during forward/backward computation. This means they introduce no additional communication during the PP micro-batch processing loop, making them naturally complementary.
Real-world example: DeepSeek-V3 training uses PP + ZeRO-1.
4. Tensor Parallelism + Sequence Parallelism: Interaction with Other Strategies
4.1 Natural Complementarity with PP and ZeRO-3
TP exploits the distributive property of matrix multiplication. For a linear layer $Y = XW$, the weight can be split column-wise, $W = [W_1 \mid W_2 \mid \dots \mid W_{W_{TP}}]$, so that worker $i$ computes $Y_i = XW_i$ independently.

Each partial computation $XW_i$ proceeds without communication; results are combined only at sharding boundaries (concatenation after a column split, or an AllReduce after a subsequent row split). Because TP operates within each layer while PP splits across layers and ZeRO-3 shards within the DP group, the three compose on entirely different axes.
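The column-split identity is easy to verify numerically with a plain nested-list matmul (no framework needed; the matrices are toy values):

```python
def matmul(A, B):
    """Plain nested-list matrix multiply: (m x k) @ (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

X = [[1.0, 2.0],
     [3.0, 4.0]]            # activations, shape (2, 2)
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 3.0]]       # weight, shape (2, 3)

# Split W column-wise across two "workers": W1 gets cols 0-1, W2 gets col 2.
W1 = [row[:2] for row in W]
W2 = [row[2:] for row in W]

Y_full = matmul(X, W)
# Concatenating each worker's partial output row-wise reproduces X @ W.
Y_shard = [r1 + r2 for r1, r2 in zip(matmul(X, W1), matmul(X, W2))]
assert Y_full == Y_shard
```

No values had to be exchanged during the multiply itself; communication is only needed to reassemble (or reduce) the outputs, which is exactly what lands on the critical path in Section 4.2.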
4.2 Two Fundamental Limitations of TP
Limitation 1 — Communication on Critical Path:
TP communication (AllReduce or equivalent) lies on the critical path of computation: each layer's forward and backward passes block on these collectives. Per Transformer layer, the communicated volume scales as:

$$V_{TP} \approx 4 \cdot b \cdot s \cdot h$$

where the factor 4 accounts for two collectives per layer (after the attention block and after the FFN) × (forward + backward), $b$ is the micro-batch size, $s$ the sequence length, and $h$ the hidden dimension.
Limitation 2 — Model-Specific Implementation:
TP requires explicit knowledge of where to shard along the hidden dimension (TP regions) vs. where to shard along the sequence dimension (SP regions). Attention projections, FFN layers, LayerNorm, and dropout each require different sharding strategies, making TP non-trivially model-specific.
4.3 Consequence: TP Confined to Intra-Node
Given these limitations, TP is kept within high-bandwidth intra-node interconnects (e.g., 8 GPUs connected via NVLink at 900 GB/s per GPU), while PP or ZeRO-3 handles inter-node distribution over lower-bandwidth fabrics (e.g., InfiniBand at 400 Gb/s per link).
5. Context Parallelism (CP): Complementary to TP
5.1 What CP Targets
CP shards activations along the sequence length dimension $s$: each of the $W_{CP}$ workers holds a contiguous chunk of $s / W_{CP}$ tokens. The impact differs by module:
- MLP, LayerNorm: These are point-wise or token-independent operations and process sharded sequences without any communication.
- Attention layers: Each token’s query must attend to keys/values from the full sequence, requiring communication.
5.2 Ring Attention for CP
Ring Attention organizes the $W_{CP}$ workers in a logical ring. Each worker computes attention between its local queries and the KV block it currently holds, while simultaneously sending that block to the next worker and receiving one from the previous worker. After $W_{CP}$ steps, every query has attended to the full sequence.

The communication is overlapped with computation, so the effective overhead per step is:

$$\max(0,\; T_{\text{comm}} - T_{\text{compute}})$$

where $T_{\text{comm}}$ is the time to transfer one KV block and $T_{\text{compute}}$ is the time to compute attention against one KV block.

CP is specifically valuable for extreme sequence lengths (hundreds of thousands of tokens), where the activations of a full sequence would not fit in a single GPU's memory.
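A minimal sketch of the ring schedule (which KV block each worker holds at each step) plus the overlap-overhead formula, with illustrative timings:

```python
def ring_schedule(cp_size):
    """For each step, map worker rank -> index of the KV block it holds,
    assuming blocks rotate one hop per step around the ring."""
    return [{rank: (rank - step) % cp_size for rank in range(cp_size)}
            for step in range(cp_size)]

def effective_overhead(t_comm, t_compute):
    """Per-step overhead when the KV transfer overlaps attention compute."""
    return max(0.0, t_comm - t_compute)

sched = ring_schedule(4)
# Step 0: every worker holds its own KV block; over 4 steps, each worker
# sees all 4 blocks exactly once.
assert sched[0] == {0: 0, 1: 1, 2: 2, 3: 3}
assert {sched[s][0] for s in range(4)} == {0, 1, 2, 3}

# If transferring a block (2 ms) is faster than attending over it (3 ms),
# the communication is fully hidden.
hidden = effective_overhead(t_comm=2.0, t_compute=3.0)
```

The assertion that each rank sees every block exactly once is the correctness condition of the rotation; the overhead formula captures why CP wants compute-heavy (long) blocks relative to link bandwidth.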
6. Expert Parallelism (EP): Complementary to TP
EP targets the MoE FFN layers exclusively. Attention layers, LayerNorm, embeddings, and output heads are completely unaffected by EP. This makes EP orthogonal to:
- TP/SP (which shards attention and FFN weight matrices along hidden/sequence dims)
- CP (which shards attention KV along sequence dim)
- PP (which shards entire layers along depth)
6.1 EP vs. DP: Structural Similarity
There is a notable structural similarity between EP and DP regarding input handling:
- In DP, each worker processes different data through identical model copies.
- In EP (without additional DP), each worker processes the same data through different experts.
This duality is why some frameworks treat EP as a specialized variant of DP, where the “replication” is replaced by “expert routing.” The critical difference is that EP workers hold non-identical model components (different experts), while DP workers hold identical model copies.
7. Scope of Each Parallelism Strategy Within a Transformer Layer
| Strategy | Attention Layers | FFN / MoE Layers | LayerNorm | Embeddings |
|---|---|---|---|---|
| TP + SP | ✅ Shards QKV/output projection weights along hidden dim | ✅ Shards FFN weights along hidden dim | ✅ (SP: sharded along seq dim) | ✅ Shards embedding matrix |
| CP | ✅ Primary impact — requires KV communication | ⚪ Independent processing | ⚪ Independent | ⚪ Independent |
| EP | ⚪ Unchanged | ✅ Primary impact — experts distributed | ⚪ Unchanged | ⚪ Unchanged |
| PP | Entire layers assigned to stages | Entire layers assigned to stages | Part of the assigned stage | Often first/last stage (special handling) |
| ZeRO | Parameters sharded across DP replicas | Parameters sharded across DP replicas | Parameters sharded | Parameters sharded |
Legend: ✅ = directly affected, ⚪ = unaffected/independent operation.
8. Comprehensive Comparison Table
| Method | Memory Savings Target | Parallel Dimension | Primary Disadvantage |
|---|---|---|---|
| DP | Activations (reduced local batch) | Batch | Limited by maximum effective batch size |
| PP | Model parameters | Model layers | Pipeline bubble and complex scheduling |
| TP + SP | Parameters and activations | Hidden | Requires high-bandwidth intra-node interconnect |
| CP | Activations | Sequence length | Communication overhead in attention |
| EP | Expert parameters | Expert dimension | Requires MoE architecture; routing communication |
| ZeRO-1 | Optimizer states | Sharded across DP replicas | Communication at optimizer step |
| ZeRO-2 | Optimizer states + gradients | Sharded across DP replicas | Gradient ReduceScatter overhead |
| ZeRO-3 | Optimizer states + gradients + parameters | Sharded across DP replicas | Parameter communication overhead (AllGather each forward/backward) |
9. Interaction and Combination Rules: Practical Summary
9.1 Naturally Complementary Combinations
| Combination | Why It Works |
|---|---|
| TP + PP | TP shards within layers (intra-node); PP shards across layers (inter-node). Orthogonal axes. |
| TP + DP | TP within node, DP across nodes. Standard combination. |
| PP + ZeRO-1/2 | ZeRO-1/2 shard optimizer/gradients (used only at optimizer step), not interfering with PP micro-batch processing. |
| TP + CP | TP shards hidden dim; CP shards sequence dim. Orthogonal. |
| EP + DP | EP distributes experts; DP distributes input batches. Eliminates redundant computation on shared layers. |
| EP + CP | EP targets MoE layers; CP targets attention. No interference. |
9.2 Combinations Requiring Caution
| Combination | Issue |
|---|---|
| PP + ZeRO-3 | Both introduce communication on the depth axis. Requires very large batch sizes to amortize dual communication costs. Rarely used in practice. |
| TP at large scale ($W_{TP} > 8$) | Communication dominates compute. Typically restricted to the $\leq 8$ GPUs within a single node. |
9.3 Typical Hierarchy in Practice
For training a model on a large GPU cluster with heterogeneous interconnects, the typical hierarchy is:
- Innermost (fastest interconnect): TP (NVLink, ~900 GB/s)
- Middle tier: PP, CP, EP (InfiniBand, ~50–100 GB/s effective)
- Outermost (most tolerant of latency): DP with ZeRO (communication only at gradient sync / optimizer step)
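This hierarchy can be expressed as a nested device mesh. The sketch below uses a hypothetical rank convention (TP innermost so TP groups are consecutive, NVLink-connected ranks; DP outermost) and illustrative degrees:

```python
def build_mesh(dp, pp, tp):
    """Nested rank assignment: TP varies fastest, then PP, then DP.
    rank = (dp_i * pp + pp_i) * tp + tp_i."""
    mesh = {}
    rank = 0
    for dp_i in range(dp):
        for pp_i in range(pp):
            for tp_i in range(tp):
                mesh[rank] = (dp_i, pp_i, tp_i)
                rank += 1
    return mesh

mesh = build_mesh(dp=2, pp=2, tp=8)   # 32 GPUs total
# The TP group containing rank 0 is ranks 0..7: same (dp, pp) coordinates,
# i.e. the 8 NVLink-connected GPUs of one node.
tp_group_0 = [r for r, (d, p, t) in mesh.items() if (d, p) == (0, 0)]
assert tp_group_0 == list(range(8))
```

Placing TP innermost guarantees that the most communication-intensive dimension never crosses the node boundary, while DP groups (which only talk at gradient sync) span the slowest links.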
10. Unified 5D Parallelism Diagram: Mathematical Formulation
For a single MoE Transformer layer, the computation on a given worker proceeds as follows:

Input: A micro-batch shard of shape $(b / W_{DP},\; s / W_{CP},\; h)$: the batch dimension is split by DP and the sequence dimension by CP.

Attention block (TP + CP active): Q, K, V, and output projections are sharded along the hidden dimension across the $W_{TP}$ workers; CP Ring Attention circulates KV blocks so that each local query attends to the full sequence.

MoE FFN block (EP + TP active): The router selects top-$k$ experts per token; tokens are dispatched via All-to-All to the EP workers hosting those experts, processed, and combined back. Individual experts may additionally be TP-sharded within a node.

Pipeline dimension: The above describes a single pipeline stage; the model's layers are partitioned across $W_{PP}$ stages, with activations passed point-to-point between stages.

Gradient synchronization (DP + ZeRO): After the backward pass through all micro-batches, gradients are synchronized across the DP replicas (AllReduce, or ReduceScatter under ZeRO-2/3), and the optimizer step updates the sharded optimizer states.
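Putting the dimensions together, a small helper computes the per-worker input shard shape described above; the degrees and sizes are illustrative:

```python
def input_shard_shape(batch, seq_len, hidden, dp, cp):
    """Shape of one worker's micro-batch shard at a layer boundary:
    batch split by DP, sequence split by CP, hidden dimension full."""
    assert batch % dp == 0 and seq_len % cp == 0
    return (batch // dp, seq_len // cp, hidden)

# Global batch 512, 131072-token context, hidden 4096, DP=64, CP=8:
shape = input_shard_shape(512, 131072, 4096, dp=64, cp=8)
# Each worker sees an (8, 16384, 4096) activation tensor.
```

Inside the attention and FFN blocks, TP further divides the hidden dimension by $W_{TP}$, but at the layer boundary the hidden dimension is whole again, which is why only DP and CP appear in this shape.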
11. Key Takeaway
No single parallelism strategy is a universal solution. Each addresses a specific dimension of the training tensor and introduces its own communication overhead. The art of large-scale training lies in composing these strategies such that:
- Communication-intensive strategies (TP) use the fastest interconnects.
- Computation-tolerant strategies (DP, ZeRO) span slower interconnects.
- The global batch size remains within the convergence-optimal range.
- Memory is balanced across all workers to avoid stragglers.
The optimal configuration is determined by the interplay of model architecture (hidden size, depth, expert count), cluster topology (intra- vs. inter-node bandwidth), and the batch-size budget set by convergence constraints.