The MoE 101 Guide:
From Theory to Production
This series teaches you how to build MoE systems that work. We'll cover the fundamentals, routing strategies, training gotchas, scaling math, and hardware. By the end, you'll be able to implement production MoE models – not just read about them in papers.

Impact of MoE
You've probably heard that GPT-4, Claude, Gemini, and most other frontier models use Mixture-of-Experts (MoE) architectures. MoE is everywhere now – but good luck implementing one.
Here is the problem: despite widespread adoption and plenty of MoE publications, most resources are either outdated research papers or high-level overviews that skip the messy details. You're left with critical questions: How many experts should you use? Which routing strategy is right for you? How do you debug a diverged MoE model? This series fixes that. Just keep scrolling.

July 22, 2025 – Understanding the theory behind trillion-parameter models you can train.
Aug 4, 2025 – Which routing strategy to pick and how to implement it correctly.
Coming soon – A survival guide for bringing broken MoE models back to life.
Coming soon – Understanding model sizes, memory usage, and what you can realistically run.
Coming soon – How Cerebras and other accelerators handle the challenges of MoE deployment.
