Why Choose MoE
Here's a counterintuitive fact: the most powerful language models today use less than 10% of their parameters for any given token (Yang et al. 2025, DeepSeek-AI et al. 2024). This isn't a bug - it's the feature that makes trillion-parameter models possible.
Why does this matter? We've hit a scaling wall. The progression from GPT-3's parameter scaling (Brown et al. 2020) to Chinchilla's compute-optimal ratios (Hoffmann et al. 2022) has driven AI training compute up by many orders of magnitude since AlexNet (Krizhevsky et al. 2012). But we can't just keep making models bigger forever: in a dense model, compute per token grows in proportion to parameter count, so at some point training becomes prohibitively expensive and impossible to sustain.
Mixture-of-Experts (MoE) is the path forward. The key mechanism is conditional activation of parameters. While dense transformers activate every parameter for every token - often multiplying weights by activations that are zero anyway (Mirzadeh et al. 2023) - MoE models route each token to only a subset of the parameters, which lets us grow the total parameter count without a proportional increase in compute per token.
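To put numbers on "conditional activation", a quick back-of-the-envelope check helps. The sketch below uses DeepSeek-V3's publicly reported totals (671B parameters, roughly 37B activated per token) and the 256-expert / top-8 layout discussed later; treat it as an illustration of the counting, not an exact accounting of any one model.

```python
# Back-of-the-envelope: what fraction of an MoE model's parameters touch a given token?
# DeepSeek-V3's reported configuration: 671B total parameters, ~37B activated per token.
total_params, active_params = 671e9, 37e9
print(f"active fraction of the full model: {active_params / total_params:.1%}")  # ~5.5%

# The same effect at the layer level: with 256 experts and 8 routed per token,
# only 8/256 of the expert (feedforward) parameters in that layer do any work.
num_experts, top_k = 256, 8
print(f"expert parameters used per MoE layer: {top_k / num_experts:.1%}")        # 3.1%
```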
The results speak for themselves: Figure 1 shows that MoE models outperform Chinchilla-optimal dense models, and performance improves roughly logarithmically with the number of experts at constant training FLOPs. This scaling behavior is fundamentally different from what's possible with dense models - you simply cannot achieve comparable efficiency gains by making them larger.
To reach trillion-parameter models that are still trainable and deployable, sparsity through MoE is becoming the only viable approach. As Figure 1 shows, even with just 32 experts (a modest number compared to recent models like DeepSeek's 256-expert architecture), the efficiency gains are non-trivial: roughly a 5% relative performance improvement over dense models in an iso-FLOP setup, or, equivalently, roughly a 62% reduction in FLOPs to reach the same loss - about a 2.6x training speedup.
But how do MoE models achieve these gains? Let's walk through exactly how they work.
The Fundamentals of MoE
The MoE concept originated with Jacobs et al. 1991, was adapted for LSTM networks by Shazeer et al. 2017, and was eventually applied to transformer architectures (Vaswani et al. 2017) with the Switch Transformer (Fedus et al. 2021). This guide focuses specifically on MoE in the context of transformers, which have become the dominant architecture behind today's largest language models (Brown et al. 2020; Grattafiori et al. 2024; DeepSeek-AI et al. 2024).
The core insight is simple: different types of inputs benefit from different types of processing. Code tokens need different transformations than natural language tokens. Math expressions need different handling than creative writing. A model processing English text shouldn't need to activate the same transformations as when it's handling Russian or Arabic.
In a dense transformer, the feedforward layers do the heavy lifting - they decide which neurons to activate for a given input token - so they're where specialization can have the biggest impact. With MoE, instead of forcing every token through the same feedforward block, we create multiple copies of it, and each copy becomes an expert network. A router then decides which experts should handle each token, as illustrated in Figure 2.
Now, how do we implement this? Here is what happens when a token hits an MoE layer:
Figure 3: Pseudocode for MoE layer implementation. With N=256 experts but only top_k=8 activated per token, you get 256x the feedforward capacity for just 8x the feedforward compute - this is where the efficiency gains come from.
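If the pseudocode in Figure 3 feels abstract, here is a minimal PyTorch sketch of the same idea, assuming a standard top-k softmax router in the spirit of Shazeer et al. 2017. The class names (`Expert`, `MoELayer`) are illustrative, and the per-expert Python loop is written for readability - real implementations typically gather tokens per expert and run batched matrix multiplications instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A standard transformer feedforward block - one 'copy' per expert."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.gelu(self.up(x)))


class MoELayer(nn.Module):
    """Sparse MoE layer: a router picks top_k of num_experts FFNs per token."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 256, top_k: int = 8):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) - flatten batch and sequence dims beforehand.
        logits = self.router(x)                                   # (tokens, num_experts)
        weights, expert_ids = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                      # normalize over the chosen experts

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in expert_ids[:, slot].unique():
                mask = expert_ids[:, slot] == e                   # tokens routed to expert e in this slot
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out


if __name__ == "__main__":
    # Toy usage: 4 tokens of width 16, 8 experts with 2 active per token.
    layer = MoELayer(d_model=16, d_ff=32, num_experts=8, top_k=2)
    tokens = torch.randn(4, 16)
    print(layer(tokens).shape)  # torch.Size([4, 16])
```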
The implementation in Figure 3 is just one way to do MoE routing - the top-k softmax approach from Shazeer et al. 2017 that is still widely adopted today. In practice, you have options: swap softmax for other normalization functions (DeepSeek-AI et al. 2024; Anthony et al. 2024), replace top_k with top_p sampling (Huang et al. 2024), or even go fully dense and use all experts.
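To make that design space concrete, here are two hedged sketches of alternative scoring rules: a sigmoid-scored top-k gate (in the spirit of DeepSeek-V3's gating) and a top-p router (in the spirit of Huang et al. 2024). The exact normalizations and extra terms in those papers differ, so treat these as illustrations rather than reproductions.

```python
import torch
import torch.nn.functional as F


def route_top_k_sigmoid(logits: torch.Tensor, top_k: int):
    """Sigmoid scoring instead of softmax: score each expert independently,
    keep the top_k, and renormalize the kept scores to sum to 1 per token."""
    scores = torch.sigmoid(logits)
    weights, expert_ids = torch.topk(scores, top_k, dim=-1)
    return weights / weights.sum(dim=-1, keepdim=True), expert_ids


def route_top_p(logits: torch.Tensor, p: float = 0.5):
    """Top-p routing: each token activates the smallest set of experts whose
    cumulative softmax probability reaches p, so 'harder' tokens can recruit
    more experts. Returns per-expert weights and a boolean keep-mask."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, order = torch.sort(probs, dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep the minimal prefix of experts whose probability mass reaches p.
    keep_sorted = (cumulative - sorted_probs) < p
    # Map the sorted mask back to the original expert order.
    keep = torch.zeros_like(probs).scatter(-1, order, keep_sorted.float()).bool()
    weights = probs * keep
    return weights / weights.sum(dim=-1, keepdim=True), keep


if __name__ == "__main__":
    logits = torch.randn(4, 8)                      # 4 tokens, 8 experts
    w, ids = route_top_k_sigmoid(logits, top_k=2)
    w_p, mask = route_top_p(logits, p=0.5)
    print(w.shape, ids.shape, mask.sum(dim=-1))     # top-p uses a variable expert count per token
```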
The router component is critical - it’s what makes or breaks your MoE model. The scaling performance shown in Figure 1 is for a specific routing strategy, but your choice of routing can dramatically impact both performance and scaling behavior. In Router Wars: Which MoE Routing Strategy Actually Works, we will dive deep into which routing approaches work best in practice and how to choose the right one for your use case.
References
Anthony, Quentin, Yury Tokpanov, Paolo Glorioso, and Beren Millidge. 2024. “BlackMamba: Mixture of Experts for State-Space Models.”
Brown, Tom B., Benjamin Mann, Nick Ryder, et al. 2020. “Language Models Are Few-Shot Learners.”
DeepSeek-AI, Aixin Liu, Bei Feng, et al. 2024. “DeepSeek-V3 Technical Report.”
Fedus, William, Barret Zoph, and Noam Shazeer. 2021. “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” arXiv Preprint arXiv:2101.03961.
Grattafiori, Aaron, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024. “The Llama 3 Herd of Models.”
Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, et al. 2022. “Training Compute-Optimal Large Language Models.”
Huang, Quzhe, Zhenwei An, Nan Zhuang, et al. 2024. “Harder Tasks Need More Experts: Dynamic Routing in MoE Models.”
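Jacobs, Robert A., Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. “Adaptive Mixtures of Local Experts.” Neural Computation 3 (1): 79–87.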
Jiang, Albert Q, Alexandre Sablayrolles, Antoine Roux, et al. 2024. “Mixtral of Experts.” arXiv Preprint arXiv:2401.04088.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.” In Advances in Neural Information Processing Systems 25, edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
Mirzadeh, Iman, Keivan Alizadeh, Sachin Mehta, et al. 2023. “ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models.” arXiv:2310.04564. Preprint, arXiv, October 6. https://doi.org/10.48550/arXiv.2310.04564.
Muennighoff, Niklas, Luca Soldaini, Dirk Groeneveld, et al. 2024. “OLMoE: Open Mixture-of-Experts Language Models.”
OpenAI, Josh Achiam, Steven Adler, et al. 2023. “GPT-4 Technical Report.”
Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, et al. 2017. “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.”
Gemini Team, Petko Georgiev, Ving Ian Lei, et al. 2024. “Gemini 1.5: Unlocking Multimodal Understanding across Millions of Tokens of Context.”
Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. “Attention Is All You Need.”
Yang, An, Anfeng Li, Baosong Yang, et al. 2025. “Qwen3 Technical Report.”