The MoE 101 Guide:
From Theory to Production
This series teaches you how to build MoE systems that work. We'll cover the fundamentals, routing strategies, training gotchas, scaling math, and hardware. By the end, you'll be able to implement production MoE models – not just read about them in papers.

Impact of MoE
You've probably heard that GPT-4, Claude, Gemini, and most other frontier models use Mixture-of-Experts (MoE) architectures. MoE is everywhere now – but good luck implementing one.
Here is the problem: despite widespread adoption and plenty of MoE publications, most resources are either outdated research papers or high-level overviews that skip the messy details. You're left with critical questions: How many experts should you use? Which routing strategy is right for you? How do you debug a diverged MoE model? This series fixes that. Just keep scrolling.

July 22, 2025 – Understanding the theory behind trillion-parameter models you can train.
Aug 4, 2025 – Which routing strategy to pick and how to implement it correctly.
Coming soon – A survival guide for bringing broken MoE models back to life.
Coming soon – Understanding model sizes, memory usage, and what you can realistically run.
Coming soon – How Cerebras and other accelerators handle the challenges of MoE deployment.
