Expert Parallelism and 5D Parallelism: A Comprehensive Technical Treatment
1. Expert Parallelism (EP)
1.1 Prerequisite: Mixture of Experts (MoE) Architecture
In a standard Transformer layer, every token passes through a single feedforward network (FFN). In a Mixture of Experts (MoE) layer, the single FFN is replaced by N_E parallel expert FFNs {E_1, E_2, …, E_{N_E}} and a gating (router) network G(·) that determines which expert(s) each token should be sent to.
Given an input hidden state h∈Rd for a single token, the MoE layer output is:
MoE(h) = Σ_{i=1}^{N_E} g_i(h) · E_i(h)
where g_i(h) is the gating weight for expert i, computed by the router:

g(h) = TopK(softmax(W_g h), K)

Here W_g ∈ R^{N_E×d} is the learnable router weight matrix, and TopK selects the K experts with the highest router logits (all other gating weights are set to zero). Typically K ≪ N_E (e.g., K = 8 out of N_E = 256 routed experts in DeepSeek-V3), meaning each token activates only a sparse subset of experts.
Key property: Each expert Ei is a standard FFN:
E_i(h) = W_i^(2) · σ(W_i^(1) h + b_i^(1)) + b_i^(2)
where W_i^(1) ∈ R^{d_ff×d}, W_i^(2) ∈ R^{d×d_ff}, and σ(·) is a nonlinearity (e.g., SiLU, GeLU). Since each expert is fully independent of every other expert, this creates a natural axis for parallelism.
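To make the routing concrete, here is a minimal pure-Python sketch of the per-token MoE computation above (toy shapes; real implementations batch tokens, and some renormalize the gate weights over only the selected top-K logits):

```python
import math

def moe_forward(h, experts, W_g, K=2):
    """Single-token MoE forward pass (illustrative sketch).

    h:       hidden state, a list of d floats
    experts: list of N_E callables, each mapping R^d -> R^d
    W_g:     router weight matrix as N_E rows of d floats
    """
    # Router logits: W_g @ h
    logits = [sum(w * x for w, x in zip(row, h)) for row in W_g]
    # Softmax over all N_E logits
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # TopK: keep only the K highest-logit experts; others get zero gate weight
    topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:K]
    # MoE(h) = sum_i g_i(h) * E_i(h), restricted to the top-K experts
    out = [0.0] * len(h)
    for i in topk:
        e_out = experts[i](h)
        out = [o + probs[i] * e for o, e in zip(out, e_out)]
    return out, topk
```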
1.2 Definition of Expert Parallelism
Expert Parallelism (EP) distributes the N_E experts across W_EP workers (GPUs), where each worker holds a disjoint subset of experts:

Worker w holds experts: {E_i | i ∈ [(w − 1) · N_E/W_EP + 1, w · N_E/W_EP]}

For N_E experts distributed across W_EP workers, each worker stores:

N_E / W_EP experts
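Under a contiguous block layout (one common convention; frameworks may also place experts strided or arbitrarily), the expert-to-worker mapping is just integer division:

```python
def expert_owner(expert_idx, n_experts, ep_world_size):
    """Worker that owns expert `expert_idx` under a contiguous block layout
    (0-indexed; illustrative convention only)."""
    assert n_experts % ep_world_size == 0, "N_E must divide evenly across W_EP"
    experts_per_worker = n_experts // ep_world_size
    return expert_idx // experts_per_worker

def local_experts(worker, n_experts, ep_world_size):
    """The disjoint block of expert indices stored on `worker`."""
    per = n_experts // ep_world_size
    return list(range(worker * per, (worker + 1) * per))
```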
Contrast with Tensor Parallelism (TP): In TP, a single weight matrix W∈Rm×n is split (column-wise or row-wise) across workers, requiring synchronized partial matrix multiplications followed by collective operations (AllReduce or AllGather). In EP, each expert’s weight matrices are kept intact on a single worker — no matrix splitting is needed. The only communication required is routing tokens to the correct worker.
1.3 Communication Pattern: All-to-All
The fundamental communication primitive in EP is the All-to-All operation, which occurs twice per MoE layer:
Forward Pass
Step 1 — Dispatch (All-to-All): After the router computes gating decisions, each worker sends tokens destined for remote experts to the appropriate worker. Formally, if worker w has a local batch of tokens {h_1, …, h_B} and the router assigns token h_j to expert E_i residing on worker w′, then h_j must be communicated from worker w to worker w′.
Step 2 — Compute: Each worker processes received tokens through its local experts.
Step 3 — Combine (All-to-All): The expert outputs are sent back to the originating workers, weighted by gating scores, and summed.
The communication volume per MoE layer for a single token routed to K experts is:
V_comm = 2 · K · d  (dispatch + combine, per token)
For a local batch of B tokens across W_EP workers:

V_total = 2 · B · K · d · (1 − 1/W_EP)

The factor (1 − 1/W_EP) accounts for the fact that tokens assigned to local experts do not require inter-worker communication.
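The volume formula can be evaluated directly; the uniform-routing assumption and 2-byte (fp16/bf16) elements are illustrative:

```python
def ep_alltoall_volume(batch, k, d, ep_world_size, bytes_per_elem=2):
    """Approximate All-to-All traffic per MoE layer (dispatch + combine), in bytes.

    Assumes uniform routing, so a fraction (1 - 1/W_EP) of the B*K expert
    assignments land on remote workers. bytes_per_elem=2 assumes fp16/bf16.
    """
    remote_fraction = 1.0 - 1.0 / ep_world_size
    return 2 * batch * k * d * remote_fraction * bytes_per_elem
```

For example, B = 1024, K = 2, d = 4096, W_EP = 8 gives roughly 29 MB of dispatch-plus-combine traffic per MoE layer.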
1.4 EP Combined with Data Parallelism (DP)
EP alone only parallelizes the MoE layers. All non-MoE components — self-attention, layer normalization, embeddings — remain replicated across all EP workers. This means:
Without DP, every worker processes the same input batch through non-MoE layers → redundant computation.
With DP, the input batch is sharded across workers, and each worker processes a different micro-batch through non-MoE layers → no redundancy.
Given W_total GPUs, W_EP GPUs for expert parallelism, and W_DP GPUs for data parallelism:

W_total = W_EP × W_DP

The effective batch size per GPU for non-MoE layers becomes:

B_local = B_global / W_DP

while each GPU holds N_E / W_EP experts for MoE layers.
Gradient synchronization: After the backward pass, gradients for non-MoE parameters are synchronized via AllReduce across the WDP data-parallel group, while expert gradients are not synchronized across the EP group (since each expert exists on exactly one worker).
1.5 Router Constraints for Communication Efficiency
DeepSeek-V3 node-bounded routing constraint: To minimize inter-node communication, the router enforces that each token is sent to at most M nodes (DeepSeek-V3 uses M = 4). Formally, define the set of nodes as {N_1, …, N_{N_nodes}}, where each node contains a subset of experts. The router first selects the top-M nodes by aggregate affinity:

M_selected = TopM({score(N_n)}_{n=1}^{N_nodes}, M)

Then the top-K experts are selected only from experts within M_selected. This ensures that the All-to-All communication is bounded to at most M nodes, dramatically reducing cross-node bandwidth requirements:

V_inter-node ≤ 2 · B · K · d · (M − 1)/N_nodes ≪ V_unbounded
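The two-stage selection can be sketched in pure Python. The per-node aggregate used here is a plain sum of that node's logits, which is an assumption for simplicity (DeepSeek-V3 aggregates the highest per-expert affinity scores on each node):

```python
def node_limited_topk(logits, experts_per_node, m_nodes, k):
    """Node-limited top-K routing sketch.

    logits: router logits for all N_E experts; node n owns the contiguous
    block [n*experts_per_node, (n+1)*experts_per_node).
    """
    n_nodes = len(logits) // experts_per_node
    # Stage 1: score each node (sum of its expert logits -- an assumption)
    node_score = [
        sum(logits[n * experts_per_node:(n + 1) * experts_per_node])
        for n in range(n_nodes)
    ]
    selected = sorted(range(n_nodes), key=lambda n: node_score[n],
                      reverse=True)[:m_nodes]
    # Stage 2: global top-K restricted to experts on the selected nodes
    candidates = [i for n in selected
                  for i in range(n * experts_per_node, (n + 1) * experts_per_node)]
    return sorted(candidates, key=lambda i: logits[i], reverse=True)[:k]
```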
2. 5D Parallelism: Complete Taxonomy
2.1 The Five Axes of Parallelism
Modern large-scale training combines five orthogonal parallelism strategies, each sharding along a different dimension of the computation:
| Strategy | Symbol | Sharding Dimension | What is Distributed |
|---|---|---|---|
| Data Parallelism | DP | Batch | Input micro-batches |
| Tensor Parallelism | TP | Hidden dimension d | Weight matrices & activations |
| Sequence/Context Parallelism | SP/CP | Sequence length L | Activations along sequence |
| Pipeline Parallelism | PP | Model depth (layers) | Consecutive layer groups |
| Expert Parallelism | EP | Expert index | MoE expert FFNs |
The total number of GPUs satisfies:
W_total = W_DP × W_TP × W_PP × W_CP × W_EP
2.2 The Three ZeRO Stages (Complementary to DP)
ZeRO (Zero Redundancy Optimizer) progressively eliminates memory redundancy within the DP group:
| Stage | Sharded Among DP Replicas | Memory per GPU |
|---|---|---|
| ZeRO-1 | Optimizer states O | ∣O∣/W_DP + ∣Θ∣ + ∣G∣ |
| ZeRO-2 | Optimizer states O + gradients G | (∣O∣ + ∣G∣)/W_DP + ∣Θ∣ |
| ZeRO-3 | Optimizer states O + gradients G + parameters Θ | (∣O∣ + ∣G∣ + ∣Θ∣)/W_DP |
For a model with Φ parameters in mixed-precision training (fp16 params + fp32 optimizer states with Adam), the unsharded per-replica footprint is 2Φ bytes of fp16 parameters + 2Φ bytes of fp16 gradients + 12Φ bytes of optimizer state (fp32 master parameters, momentum, and variance), i.e. 16Φ bytes in total, so ∣Θ∣ = 2Φ, ∣G∣ = 2Φ, and ∣O∣ = 12Φ above.
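Using the standard ZeRO accounting (2Φ fp16 params, 2Φ fp16 grads, 12Φ fp32 optimizer state; activations excluded), per-GPU memory under each stage can be sketched as:

```python
def zero_memory_per_gpu(n_params, dp_world_size, stage):
    """Per-GPU memory in bytes for fp16/bf16 Adam training, by ZeRO stage.

    Standard mixed-precision accounting: 2*Phi fp16 params, 2*Phi fp16
    grads, 12*Phi optimizer state (fp32 master copy + two Adam moments).
    """
    params, grads, optim = 2 * n_params, 2 * n_params, 12 * n_params
    w = dp_world_size
    if stage == 0:
        return params + grads + optim          # plain DP: 16 * Phi
    if stage == 1:
        return params + grads + optim / w      # shard optimizer states
    if stage == 2:
        return params + (grads + optim) / w    # + shard gradients
    if stage == 3:
        return (params + grads + optim) / w    # + shard parameters
    raise ValueError("stage must be 0..3")
```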
3. Pipeline Parallelism vs. ZeRO-3: Detailed Comparison
Both PP and ZeRO-3 distribute model parameters across GPUs along the depth axis, but they differ fundamentally in what is communicated and how computation is organized.
3.1 Side-by-Side Analysis
| Aspect | ZeRO-3 | Pipeline Parallelism |
|---|---|---|
| Storage per device | A fraction of every layer's parameters: ∣Θ_ℓ∣/W_DP per layer ℓ | Full parameters of assigned layers: ∣Θ_ℓ∣ for ℓ ∈ L_w |
| Communication transfers | Weights (AllGather before forward, ReduceScatter after backward) | Activations (point-to-point between pipeline stages) |
| Batch-size preference | Prefers large batches to amortize weight-transfer overhead | Prefers large grad_acc steps (many micro-batches) to amortize the pipeline bubble |
3.2 Communication Volume Comparison
ZeRO-3 per layer (forward + backward): Each layer’s full parameters must be AllGathered before computation and optionally ReduceScattered during backward:
V_ZeRO-3^total = 2 · Σ_{ℓ=1}^{L} ∣Θ_ℓ∣ · (W_DP − 1)/W_DP ≈ 2∣Θ∣  for large W_DP
Pipeline Parallelism per micro-batch boundary: Only the activation tensor at stage boundaries is communicated (point-to-point):
V_PP^(boundary) = B_μ · L_seq · d  (per boundary, per micro-batch)

Total PP communication across S − 1 stage boundaries and n_μ micro-batches:

V_PP^total = n_μ · (S − 1) · B_μ · L_seq · d

where S = W_PP is the number of pipeline stages, B_μ is the micro-batch size, and d is the hidden dimension.
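The two volumes above are easy to compare numerically; a sketch (element counts, not bytes):

```python
def zero3_comm_volume(total_params, dp_world_size):
    """Weight traffic per step for ZeRO-3, using the document's
    approximation 2 * |Theta| * (W_DP - 1) / W_DP elements."""
    return 2 * total_params * (dp_world_size - 1) / dp_world_size

def pp_comm_volume(n_microbatches, n_stages, micro_batch, seq_len, d):
    """Activation traffic per step for PP: one boundary tensor per
    micro-batch per stage boundary. Forward activations only; the
    backward pass adds a comparable volume of activation gradients."""
    return n_microbatches * (n_stages - 1) * micro_batch * seq_len * d
```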
3.3 Combinability
ZeRO-3 + PP: Possible but rarely practical. Combining them requires inflating the global batch size to amortize both weight-transfer overhead (ZeRO-3) and bubble overhead (PP). If combined, ZeRO-3 should cache (keep in memory) the gathered parameters across all PP micro-batches to avoid re-gathering per micro-batch:
Redundant gathers avoided = (n_μ − 1) × V_ZeRO-3^total
ZeRO-1/ZeRO-2 + PP: Naturally complementary. ZeRO-1/2 shard optimizer states and gradients (which are only needed at the update step), while PP shards model layers. No conflicting communication patterns. DeepSeek-V3 uses this combination (PP + ZeRO-1).
4. Tensor, Sequence, and Context Parallelism
4.1 Tensor Parallelism (TP) and Sequence Parallelism (SP)
TP shards weight matrices along the hidden dimension d:
W = [W^(1) ∣ W^(2) ∣ ⋯ ∣ W^(W_TP)],  W^(k) ∈ R^{m×(n/W_TP)}
The matrix multiplication y = Wx is computed as:

y = Σ_{k=1}^{W_TP} W^(k) x^(k)  (row-parallel, with x sharded along the input dimension n)

or

y^(k) = W^(k) x  (column-parallel, where W is instead split along the output dimension: W^(k) ∈ R^{(m/W_TP)×n})
requiring AllReduce or AllGather/ReduceScatter operations.
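A pure-Python toy (no actual collectives; the final concatenation or summation stands in for AllGather/AllReduce) showing that both sharding schemes reproduce the full matmul:

```python
def matmul(W, x):
    """y = W @ x for W as a list of rows, x as a list of floats."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def column_parallel(W, x, tp):
    """Split W along its output (row) dimension; each of the tp shards
    computes a slice of y independently; concatenation stands in for
    AllGather. (Megatron naming: 'column-parallel' splits the output dim.)"""
    rows_per = len(W) // tp
    shards = [W[k * rows_per:(k + 1) * rows_per] for k in range(tp)]
    return [y for shard in shards for y in matmul(shard, x)]

def row_parallel(W, x, tp):
    """Split W along its input (column) dimension; each shard produces a
    partial result y^(k); the elementwise sum stands in for AllReduce."""
    cols_per = len(W[0]) // tp
    partials = []
    for k in range(tp):
        Wk = [row[k * cols_per:(k + 1) * cols_per] for row in W]
        xk = x[k * cols_per:(k + 1) * cols_per]
        partials.append(matmul(Wk, xk))
    return [sum(p[i] for p in partials) for i in range(len(W))]
```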
SP complements TP by sharding activations along the sequence dimension L in regions where TP is not active (e.g., LayerNorm, Dropout):

X ∈ R^{L×d} → X^(k) ∈ R^{(L/W_TP)×d},  k = 1, …, W_TP
Why TP should be intra-node: TP operations (AllReduce, AllGather) lie on the critical computation path — the forward pass cannot proceed until the collective is complete. Thus TP demands the highest bandwidth interconnect (NVLink/NVSwitch at 900 GB/s), making it suitable only for intra-node communication:
T_TP-comm = (2 · (W_TP − 1)/W_TP) · ∣a∣ / BW_intra-node
where ∣a∣ is the activation size per collective operation.
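The ring-AllReduce latency estimate is a one-liner to evaluate (bandwidth-only model; it ignores per-hop latency and overlap):

```python
def tp_allreduce_time(activation_bytes, tp, bw_bytes_per_s):
    """Ring AllReduce time estimate: (2 * (W_TP - 1) / W_TP) * |a| / BW.
    Bandwidth-only model; ignores launch latency and compute overlap."""
    return 2 * (tp - 1) / tp * activation_bytes / bw_bytes_per_s
```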
4.2 Context Parallelism (CP)
CP shards the full sequence across W_CP workers:

X ∈ R^{L×d} → X^(k) ∈ R^{(L/W_CP)×d},  k = 1, …, W_CP
MLP, LayerNorm: Process sharded chunks independently — no communication needed.
Self-Attention: Each token’s query must attend to all keys/values across the full sequence. This requires the Ring Attention pattern:
At each step t of the ring (t = 0, 1, …, W_CP − 1), worker k:
1. Computes partial attention with the locally available K^(j), V^(j) block.
2. Sends its KV block to neighbor (k + 1) mod W_CP and receives from (k − 1) mod W_CP.
3. Accumulates partial attention using the online softmax correction.
The online softmax update at step t for query q on worker k is:

m^(t) = max(m^(t−1), max_j s_j^(t))
ℓ^(t) = e^{m^(t−1) − m^(t)} · ℓ^(t−1) + Σ_j e^{s_j^(t) − m^(t)}
o^(t) = e^{m^(t−1) − m^(t)} · o^(t−1) + Σ_j e^{s_j^(t) − m^(t)} · v_j^(t)

where s_j^(t) = q · k_j^(t) are the attention logits against the KV block received at step t, and the final output is o^(W_CP − 1) / ℓ^(W_CP − 1).
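A pure-Python sketch of this accumulation for a single query, processing KV blocks one ring step at a time (no actual ring communication; the chunks simply arrive as a list). It produces exactly the result of softmax attention over the full sequence:

```python
import math

def attention_online(q, kv_chunks):
    """Single-query attention via online (running-max) softmax over chunks,
    as used in Ring Attention / FlashAttention-style accumulation.

    q: list of d floats; kv_chunks: list of (K, V) pairs, where K and V
    are lists of d-dim vectors arriving one ring step at a time.
    """
    m = float("-inf")   # running max of attention logits
    l = 0.0             # running softmax denominator
    o = None            # running unnormalized output
    for K, V in kv_chunks:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        m_new = max(m, max(scores))
        # Rescale previous accumulators to the new running max
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        l = l * scale + sum(math.exp(s - m_new) for s in scores)
        o = [oi * scale for oi in o] if o is not None else [0.0] * len(V[0])
        for s, v in zip(scores, V):
            w = math.exp(s - m_new)
            o = [oi + w * vi for oi, vi in zip(o, v)]
        m = m_new
    return [oi / l for oi in o]
```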
4.3 Expert Parallelism vs. Data Parallelism
Both EP and DP involve multiple workers processing different data through different (or identical) parameters:
| | DP | EP |
|---|---|---|
| Each worker processes | Different micro-batch | Different tokens (routed by expert assignment) |
| Model weights per worker | Identical full copy | Unique subset of experts |
| Gradient sync | AllReduce across DP group | None across EP group (each expert is unique) |
Some frameworks treat EP as a specialized form of DP where, instead of identical model replicas processing different batches, each replica holds different experts and receives dynamically routed tokens.
5. Scope and Focus of Each Strategy
5.1 Per-Component Impact
| Strategy | Attention Layers | FFN / MoE Layers | LayerNorm / Embeddings |
|---|---|---|---|
| TP + SP | Shards W_Q, W_K, W_V, W_O along d | Shards W_1, W_2 along d_ff or d | SP shards along L |
| CP | Requires Ring Attention communication | Independent on sharded sequences | Independent on sharded sequences |
| EP | Unchanged | Shards experts across workers | Unchanged |
| PP | Assigned to a pipeline stage | Assigned to a pipeline stage | Assigned (often treated specially due to embedding tying) |
| ZeRO | Shards params/grads/optim states uniformly | Shards params/grads/optim states uniformly | Shards params/grads/optim states uniformly |
5.2 Detailed Comparison Table
| | TP + SP | CP | EP |
|---|---|---|---|
| What is sharded | Weights & activations along d / L | Activations along L | Expert weights & activations |
| Communication ops | AllReduce / AllGather / ReduceScatter for matmul | P2P ring for attention KV | All-to-All for token dispatch/combine |
| Implementation | Model-specific (sharding patterns vary per layer type) | Localized to attention (ring exchange of KV blocks) | Localized to MoE layers (router + dispatch/combine) |
This exemplifies the core advantage of MoE + EP: model capacity scales with N_E · Φ_per_expert (memory distributed via EP), while compute cost scales only with K · Φ_per_expert (sparse activation).
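In toy numbers (the sizes below are hypothetical, purely for illustration), the stored-vs-active gap is:

```python
def moe_scaling(n_experts, k, params_per_expert):
    """Total expert capacity grows with N_E; per-token expert compute
    grows only with K (both in expert-parameter units)."""
    total = n_experts * params_per_expert   # stored, distributed via EP
    active = k * params_per_expert          # touched per token
    return total, active
```

With 64 experts of 10 units each and top-2 routing, the model stores 640 units of expert capacity but spends only 20 units of expert compute per token, a 32× gap.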
This completes the technical treatment of Expert Parallelism and 5D Parallelism, covering the mathematical foundations, communication patterns, memory models, inter-strategy interactions, and real-world deployment configurations.