
Jul 18 2025

MoE 101

Motivation

You’ve probably heard that GPT-4, Claude, Gemini, and most other frontier models use Mixture-of-Experts (MoE) architectures. MoE is everywhere now – but good luck implementing one.

Here is the problem: despite widespread adoption and numerous MoE publications, most resources are either outdated research papers or high-level overviews that skip the messy details. You're left with critical questions: how many experts should you use? Which routing strategy is right for your use case? How do you debug a diverged MoE model?

This series gives you the practical knowledge to build MoE systems that work. Part 1 covers the fundamentals and why MoE is inevitable. Parts 2-5 dive into routing strategies, training gotchas, scaling math, and hardware considerations. By the end, you will know how to implement production MoE models – not just understand them in theory.

Why Choose MoE

Here's a counterintuitive fact: the most powerful language models today use less than 10% of their parameters for any given token (Yang et al. 2025, DeepSeek-AI et al. 2024). This isn't a bug - it's the feature that makes trillion-parameter models possible. 

Why does this matter? We've hit a scaling wall. The progression from GPT-3's parameter scaling (Brown et al. 2020) to Chinchilla's compute-optimal ratios (Hoffmann et al. 2022) drove AI training compute up by a factor of 10^21 since AlexNet (Krizhevsky et al. 2012). But we can't just keep making models bigger forever - the costs become prohibitive, and the efficiency gains disappear.

Mixture-of-Experts (MoE) represents the path forward. The key mechanism is conditional activation of parameters. While dense transformers activate all parameters for every token (often multiplying many weights by zero activations (Mirzadeh et al. 2023)), MoE models selectively route tokens to a subset of parameters, increasing the total parameter count without a proportional increase in computation per token.
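To make that arithmetic concrete, here is a quick back-of-the-envelope sketch in Python. The layer sizes and expert counts are hypothetical, chosen only to illustrate the point: adding experts multiplies the stored FFN parameters while the per-token active parameters stay roughly flat.

```python
# Back-of-the-envelope FFN parameter count: dense vs. MoE.
# All sizes below are hypothetical and chosen only to illustrate the scaling argument.

d_model = 4096        # hidden size
d_ff = 4 * d_model    # FFN inner size
n_layers = 32

def dense_ffn_params(d_model, d_ff):
    # Two projection matrices per FFN block (biases ignored): d_model->d_ff and d_ff->d_model.
    return 2 * d_model * d_ff

def moe_ffn_params(d_model, d_ff, n_experts, top_k):
    expert = dense_ffn_params(d_model, d_ff)
    router = d_model * n_experts          # linear router: token -> expert scores
    total = n_experts * expert + router   # parameters stored
    active = top_k * expert + router      # parameters touched per token
    return total, active

dense = n_layers * dense_ffn_params(d_model, d_ff)
total, active = moe_ffn_params(d_model, d_ff, n_experts=32, top_k=2)
total, active = n_layers * total, n_layers * active

print(f"dense FFN params:          {dense / 1e9:.1f}B (all active for every token)")
print(f"MoE FFN params, total:     {total / 1e9:.1f}B")
print(f"MoE FFN params, per token: {active / 1e9:.1f}B active")
```

With 32 experts and top-2 routing in this toy configuration, the FFN stack stores roughly 32x the parameters of its dense counterpart, while the per-token compute only roughly doubles.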

The results speak for themselves: Figure 1 shows that MoE models outperform compute-optimal dense models from Chinchilla (Hoffmann et al. 2022). Performance improves logarithmically with the number of experts while training FLOPs are held constant. This scaling behavior is fundamentally different from what's possible with dense models - you simply cannot achieve comparable efficiency gains by making them larger.

This isn't just an optimization - it's the path forward for the field. To reach trillion-parameter models that are trainable and deployable, sparsity through MoE is becoming the only viable approach. As we can see in Figure 1, even with just 32 experts (a modest number compared to recent models like DeepSeek's 256-expert architecture), the efficiency gains are non-trivial: approximately 5% relative performance improvement in an iso-FLOP setup compared to dense models, or conversely, approximately 62% fewer FLOPs to reach the same loss (3x faster training!).

But how does MoE achieve these gains? Let's examine exactly how they work.

The Fundamentals of MoE

The MoE concept originated with Jacobs et al. 1991, was later adapted to LSTM networks by Shazeer et al. 2017, and was eventually applied to transformer architectures (Vaswani et al. 2017) with the Switch Transformer (Fedus et al. 2021). This guide focuses specifically on MoE in the context of transformer architectures, which has become the dominant approach powering today's largest language models (Brown et al. 2020; Grattafiori et al. 2024; DeepSeek-AI et al. 2024).

The core insight is simple: different types of inputs benefit from different types of processing. Code tokens need different transformations than natural language tokens. Math expressions need different handling than creative writing. A model processing English text shouldn't need to activate the same transformations as when it's handling Russian or Arabic. 

In a dense transformer, the feedforward layers do the heavy lifting – they decide which neurons to activate for a given input token, and they're where specialization can have the biggest impact. With MoE, instead of forcing every token through the same feedforward layer, we create multiple copies of it, which we call experts. These experts collectively solve complex tasks, with each token sent to the most relevant expert as illustrated in Figure 2.

Now, how do we implement this? Here is what happens when a token hits an MoE layer:
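As a concrete companion to Figure 3, here is a minimal PyTorch sketch of the standard softmax + top-k routing pattern. The class name, sizes, and expert definition are illustrative choices of ours, not code from any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal top-k MoE feedforward layer (illustrative sketch, not production code)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # token -> expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])                 # (n_tokens, d_model)
        probs = F.softmax(self.router(tokens), dim=-1)      # routing probabilities
        weights, indices = probs.topk(self.top_k, dim=-1)   # keep top_k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Find every (token, slot) pair that was routed to expert e.
            token_idx, slot_idx = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, slot_idx].unsqueeze(-1) * expert(tokens[token_idx])
        return out.reshape_as(x)

# Usage: 8 experts, 2 active per token.
layer = MoELayer(d_model=512, d_ff=2048, n_experts=8, top_k=2)
y = layer(torch.randn(4, 16, 512))  # (batch, seq, d_model)
```

Real implementations replace the Python loop over experts with batched gather/scatter dispatch (and expert parallelism across devices), but the routing logic is the same.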

The implementation in Figure 3 is just one way to do MoE routing - the approach from Shazeer et al. 2017 that is widely adopted today. In practice, you have options: swap softmax for other normalization functions (DeepSeek-AI et al. 2024; Anthony et al. 2024), replace top_k with top_p sampling (Huang et al. 2024), or even go fully dense and use all experts.
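To illustrate one of those variants, here is a rough sketch of a top_p-style selection rule in the spirit of the dynamic routing idea from Huang et al. 2024: instead of a fixed k, each token keeps the smallest set of experts whose cumulative routing probability reaches a threshold p. The helper name and implementation details are ours, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def top_p_expert_mask(router_logits: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Boolean mask of selected experts per token: the smallest set whose
    cumulative routing probability reaches p. A sketch in the spirit of
    dynamic routing (Huang et al. 2024); details differ from the paper."""
    probs = F.softmax(router_logits, dim=-1)                    # (n_tokens, n_experts)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep experts up to and including the first one that crosses the threshold.
    keep_sorted = (cumulative - sorted_probs) < p
    mask = torch.zeros_like(probs).scatter(-1, sorted_idx, keep_sorted.to(probs.dtype))
    return mask.bool()  # the number of active experts now varies per token
```

Dropped into the forward pass sketched above, this mask would replace the fixed topk call; the selected probabilities are renormalized and each token visits however many experts it needs.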

The router component is critical - it's what makes or breaks your MoE model. The scaling performance shown in Figure 1 is specifically for Switch routing, but different routing strategies can dramatically affect both performance and scaling behavior. In Part 2, we will dive deep into which routing approaches work best in practice and how to choose the right one for your use case.

References

Anthony, Quentin, Yury Tokpanov, Paolo Glorioso, and Beren Millidge. 2024. “BlackMamba: Mixture of Experts for State-Space Models.”

Brown, Tom B., Benjamin Mann, Nick Ryder, et al. 2020. “Language Models Are Few-Shot Learners.”

DeepSeek-AI, Aixin Liu, Bei Feng, et al. 2024. “DeepSeek-V3 Technical Report.”

Fedus, William, Barret Zoph, and Noam Shazeer. 2021. “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” arXiv Preprint arXiv:2101.03961.

Grattafiori, Aaron, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024. “The Llama 3 Herd of Models.”

Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, et al. 2022. “Training Compute-Optimal Large Language Models.”

Huang, Quzhe, Zhenwei An, Nan Zhuang, et al. 2024. “Harder Tasks Need More Experts: Dynamic Routing in MoE Models.”

Jacobs, Robert A., Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. “Adaptive Mixtures of Local Experts.” Neural Computation 3 (1): 79–87.

Jiang, Albert Q, Alexandre Sablayrolles, Antoine Roux, et al. 2024. “Mixtral of Experts.” arXiv Preprint arXiv:2401.04088.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.” In Advances in Neural Information Processing Systems 25, edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.

Mirzadeh, Iman, Keivan Alizadeh, Sachin Mehta, et al. 2023. “ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models.” arXiv:2310.04564. Preprint, arXiv, October 6. https://doi.org/10.48550/arXiv.2310.04564.

Muennighoff, Niklas, Luca Soldaini, Dirk Groeneveld, et al. 2024. “OLMoE: Open Mixture-of-Experts Language Models.”

OpenAI, Josh Achiam, Steven Adler, et al. 2023. “GPT-4 Technical Report.”

Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, et al. 2017. “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.”

Gemini Team, Petko Georgiev, Ving Ian Lei, et al. 2024. “Gemini 1.5: Unlocking Multimodal Understanding across Millions of Tokens of Context.”

Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. “Attention Is All You Need.”

Yang, An, Anfeng Li, Baosong Yang, et al. 2025. “Qwen3 Technical Report.”