Mixture of Experts (MoE)

From llmref.wiki
Mixture of Experts (MoE) — Architecture that routes each token through a learned subset of specialized sub-networks, reducing computational cost while maintaining capacity.

Overview

Mixture of Experts (MoE) is a foundation model architecture in which a large language model delegates token processing to a sparse subset of specialized sub-networks called "experts." Every token still passes through every layer, but within each MoE layer a learned routing mechanism (typically a gating network) selects which experts process that token. This design lets a model keep a high total parameter count while keeping inference latency and throughput practical, because only a fraction of those parameters is activated per forward pass.

The core motivation for MoE is computational efficiency without a proportional loss of model capacity. In a traditional dense network, per-token compute grows roughly in proportion to parameter count; MoE decouples the two quantities. A model with 200 billion total parameters might activate only 20 billion per token, reducing batch inference cost. All parameters must still be held in memory, so the savings apply to compute rather than to memory footprint.

MoE architectures have been adopted in large-scale LLMs including Google's Switch Transformer and Mistral AI's Mixtral models. The approach introduces new bias detection considerations around expert load balancing and can affect knowledge cutoff artifacts if training data is unevenly distributed across expert specializations.

How it works

In a typical MoE layer:

  1. A gating network (often a single linear layer followed by softmax) receives the token's hidden representation at that layer and outputs a probability distribution over experts.
  2. The top-k experts by probability are selected (commonly k=1 or k=2, sometimes more). Unselected experts are not evaluated for that token.
  3. The token is passed through each selected expert independently. Experts are typically feedforward networks with identical or similar structure but distinct learned parameters.
  4. Outputs from the selected experts are combined via a weighted sum, using the gating probabilities (often renormalized over the selected experts) as weights.
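
The four steps above can be sketched for a single token in plain Python. This is a minimal illustration, not a reference implementation: the names `moe_layer`, `make_expert`, and `gate_weights` are hypothetical, the experts are toy scaling functions rather than real feedforward networks, and renormalizing the selected probabilities is an assumption that not every implementation shares.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def make_expert(scale):
    """Toy stand-in for an expert network: scales its input (hypothetical)."""
    return lambda vec: [scale * x for x in vec]

def moe_layer(token, gate_weights, experts, k=2):
    """Route one token through its top-k experts and mix their outputs."""
    # Step 1: the gating network yields a distribution over experts.
    probs = softmax(matvec(gate_weights, token))
    # Step 2: keep the k most probable experts; the rest are never called.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalize so the mixing weights sum to 1.
    norm = sum(probs[i] for i in top)
    # Steps 3 and 4: run each selected expert and take the weighted sum.
    # The toy experts preserve the input dimension.
    output = [0.0] * len(token)
    for i in top:
        weight = probs[i] / norm
        for j, value in enumerate(experts[i](token)):
            output[j] += weight * value
    return output, top

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
output, chosen = moe_layer([2.0, 1.0], gate_weights, experts, k=2)
print(chosen, output)
```

With these toy values the gate logits are 2, 1, -2, and -1, so experts 0 and 1 are selected and the other two are skipped entirely.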

To prevent load imbalance—where a few experts consistently receive high routing probability—auxiliary loss terms are added during training. These losses penalize unequal expert utilization, encouraging the gating network to distribute tokens more evenly.
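
One common form of such an auxiliary loss, in the style described for Switch Transformers, multiplies the fraction of tokens sent to each expert by the mean routing probability for that expert and sums over experts. The sketch below assumes top-1 routing; the function name `load_balancing_loss` and the default coefficient `alpha=0.01` are illustrative choices, and other MoE models use different formulations.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """Auxiliary loss alpha * N * sum_i(f_i * P_i), assuming top-1 routing.

    router_probs: one probability distribution over the N experts per token.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # f_i: fraction of tokens whose most probable expert is expert i.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=lambda i: probs[i])] += 1
    fraction = [c / num_tokens for c in counts]
    # P_i: mean routing probability assigned to expert i over the batch.
    mean_prob = [
        sum(probs[i] for probs in router_probs) / num_tokens
        for i in range(num_experts)
    ]
    return alpha * num_experts * sum(f * p for f, p in zip(fraction, mean_prob))

balanced = load_balancing_loss([[0.9, 0.1], [0.1, 0.9]])
collapsed = load_balancing_loss([[0.9, 0.1], [0.9, 0.1]])
print(balanced, collapsed)
```

In this toy batch the loss is lower when the two tokens go to different experts than when both go to the same one, which is the pressure toward even utilization that the paragraph describes.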

A variant used in Switch Transformers, often called "switch" routing, sends each token to exactly one expert (k=1), further reducing compute. Another variant, "hard MoE," uses discrete routing without continuous probability weighting, trading differentiability for speed.

The sparsity pattern changes per token, so an MoE layer cannot be computed as a single dense matrix multiplication over the whole batch without wasting work on inactive experts. Efficient MoE requires specialized inference infrastructure or custom kernels that gather the inputs routed to each active expert, process them in parallel, and scatter the outputs back to their original positions.
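
The gather, process, and scatter pattern can be illustrated in plain Python for top-1 routing. This sketch only shows the bookkeeping; production systems do the same thing with batched tensor operations or custom kernels. The function `dispatch_top1` and the toy experts are hypothetical.

```python
from collections import defaultdict

def dispatch_top1(tokens, assignments, experts):
    """Group tokens by assigned expert, run each expert once, restore order."""
    # Gather: record the positions of the tokens routed to each expert.
    positions = defaultdict(list)
    for pos, expert_id in enumerate(assignments):
        positions[expert_id].append(pos)
    outputs = [None] * len(tokens)
    for expert_id, idxs in positions.items():
        batch = [tokens[p] for p in idxs]
        # Process: each active expert sees one batch; idle experts are skipped.
        results = experts[expert_id](batch)
        # Scatter: write results back to the original token positions.
        for p, result in zip(idxs, results):
            outputs[p] = result
    return outputs

def make_batch_expert(scale):
    """Toy expert that scales every vector in a batch (hypothetical)."""
    return lambda batch: [[scale * x for x in vec] for vec in batch]

batch_experts = [make_batch_expert(10.0), make_batch_expert(100.0)]
routed = dispatch_top1([[1.0], [2.0], [3.0]], [1, 0, 1], batch_experts)
print(routed)  # [[100.0], [20.0], [300.0]]
```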

Distinction from related terms

Term | Distinction
--- | ---
Quantization | Quantization reduces parameter precision (e.g., from float32 to int8), typically across all parameters. MoE selectively skips entire sub-networks per token. Quantization reduces memory and compute; MoE reduces only compute per forward pass while keeping the full parameter count in memory.
Fine-tuning | Fine-tuning updates model parameters (all of them or a chosen subset) on task-specific data. MoE is an architectural design choice that determines which parameters are active for each token. An MoE model can be fine-tuned, but fine-tuning does not create MoE sparsity.
Dense inference | Dense models activate all parameters for every token. MoE activates a learned subset, reducing FLOPs per token. Dense models are simpler to deploy; MoE requires routing and expert-selection logic but scales to larger total parameter counts.
Ensemble models | Ensemble models train multiple independent models and aggregate their outputs. MoE trains a single model in which multiple experts share a gating mechanism and are optimized jointly. MoE is more integrated; ensembles are typically post-hoc combinations.
In-context learning | In-context learning adapts a model to new tasks by conditioning on examples in the context window. MoE is an internal architectural mechanism that does not depend on context length. The two are orthogonal; an MoE model can perform in-context learning.

Examples

  • Google Switch Transformer (2021) — Early large-scale MoE model using top-1 routing (each token goes to one expert). Switch-C had 1.6 trillion parameters, but because only one expert runs per token in each MoE layer, its per-token compute is far below that of a dense model of the same size. The authors report faster pre-training than compute-matched dense T5 baselines.
  • Mixtral 8x7B (2023) — Open-weight model from Mistral AI with 8 experts per layer, routing each token to the top-2 experts. Despite the name, the experts are only the feedforward blocks; attention and other parameters are shared, so the total is 46.7 billion parameters rather than 56 billion, with approximately 12.9 billion activated per token. Widely used for prompt engineering and fine-tuning experiments due to availability and moderate cost.
  • Google Gemini (partial MoE) — Large multimodal model reportedly using MoE components, combining dense and sparse layers. Demonstrates MoE in foundation models supporting vision and text across batched and online serving.
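
As a worked check on the Mixtral figures above, the total and active parameter counts can be split into a shared part and a per-expert part. The sketch assumes a simplified model in which total = shared + 8 × expert and active = shared + 2 × expert, with "expert" meaning one expert's parameters summed over all layers; the function `split_parameters` is hypothetical and the result is an estimate, not a published breakdown.

```python
def split_parameters(total, active, num_experts, top_k):
    """Solve total = shared + num_experts * e and active = shared + top_k * e."""
    per_expert = (total - active) / (num_experts - top_k)
    shared = active - top_k * per_expert
    return shared, per_expert

# Figures in billions of parameters, taken from the Mixtral entry above.
shared, per_expert = split_parameters(46.7, 12.9, num_experts=8, top_k=2)
print(f"shared ~ {shared:.1f}B, per expert ~ {per_expert:.1f}B")
# prints: shared ~ 1.6B, per expert ~ 5.6B
```

Under this assumption each expert accounts for roughly 5.6 billion parameters across all layers.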
