In this video we discuss scaling MoE models on modern hardware and address key optimization challenges. If you can’t open the video displayed above, please use this link to open it on YouTube: https://youtu.be/MXo9LEYzwkg
Mixture-of-Experts (MoE) models allow you to increase the total parameter count without a proportional increase in compute, letting you train bigger and better models efficiently (Soboleva, 2025a). You might wonder whether extracting these theoretical benefits requires significant engineering work. After all, your part 3 implementation (Soboleva and Tiwari, 2025) trained perfectly fine on a small accelerator node (and even on your laptop). The catch is that it only used 4 experts and a 124M-parameter backbone, while production systems like DeepSeek-V3, Qwen3, etc., use hundreds of experts and far larger backbones. Try scaling the previous implementation to their sizes on a GPU, and you will quickly hit your device's memory limit.
Let's understand why this happens. Remember the code in Figure 1 that we implemented in part 3? There we warned you that it isn't efficient and that we used it only for pedagogy. Why? In Figure 1, you can see a sequential loop over all experts, even though we only need the experts that were selected for a given token. Why so wasteful? You already know that the industry-standard routing is not deterministic (Soboleva, 2025b), so it is impossible to predict in advance which expert will be activated for a given token. This means we have to load all experts into memory just in case we need them later. As a result, memory requirements grow linearly as you add experts, and it very quickly becomes impossible to train on a single GPU (even a dozen experts won't fit with our previous implementation).
The rest of this guide covers solutions to this scaling challenge. First, we will talk about the GPU solution and its limitations; second, we'll cover how the Cerebras Wafer Scale Engine (WSE) solves this problem; and finally, how to combine both to maximize the benefits.
Figure 1: Expert mixing for batch processing (see train_gpt_moe.py, L208-L225).
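For readers who can't see the listing in Figure 1, the shape of that loop is roughly the following. This is our own paraphrase of the part-3 code, not the exact listing; names, shapes, and the top_k = 2 usage example are illustrative.

```python
import torch
import torch.nn as nn

def mix_experts_naive(x, router_probs, top_k_mask, experts):
    """Naive expert mixing: loops over every expert for every token."""
    out = torch.zeros_like(x)
    # Sequential loop over *all* experts: each expert's weights must stay
    # resident in memory, so memory grows linearly with the expert count.
    for e, expert in enumerate(experts):
        gate = (router_probs[:, e] * top_k_mask[:, e]).unsqueeze(-1)  # (tokens, 1)
        out = out + gate * expert(x)   # every expert also runs on every token
    return out

experts = nn.ModuleList(nn.Linear(64, 64) for _ in range(4))
x = torch.randn(10, 64)                          # 10 tokens, d_model = 64
probs = torch.softmax(torch.randn(10, 4), -1)    # router probabilities
mask = torch.zeros(10, 4).scatter_(1, probs.topk(2).indices, 1.0)  # top_k = 2
print(mix_experts_naive(x, probs, mask, experts).shape)  # torch.Size([10, 64])
```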
GPU Solution and Its Problems
When your MoE exceeds a single GPU's memory capacity, the most popular solution is some form of model parallelism, an example of which is expert parallelism, or EP for short (Lepikhin et al., 2020; DeepSeek-AI et al., 2024). With EP, we typically assign an equal number of experts to each GPU, while all other layers (attention, router, etc.) are replicated on every device. In this workflow, the router predicts which tokens should be routed to which experts, an all-to-all communication operation shuffles tokens to the correct devices based on the router assignments, each device executes its experts in parallel, and the results are shuffled back to the tokens' original devices to perform expert mixing. Given that we shard only the expert layers across devices, the GPUs do repetitive work for all other layers.
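To make the data flow concrete, here is a minimal single-process simulation of EP dispatch (top_k = 1 for simplicity). It is a sketch under our own assumptions: the loop over devices stands in for real collectives such as all-to-all, and all names are illustrative rather than taken from any EP library.

```python
import torch
import torch.nn as nn

def expert_parallel_step(tokens, expert_ids, experts, n_devices):
    """Single-process simulation of expert-parallel (EP) dispatch (top_k = 1)."""
    out = torch.empty_like(tokens)
    for dev in range(n_devices):
        # Experts are assigned round-robin to devices; all other layers would be replicated.
        local_experts = [e for e in range(len(experts)) if e % n_devices == dev]
        for e in local_experts:
            idx = (expert_ids == e).nonzero(as_tuple=True)[0]   # tokens routed to expert e
            if idx.numel() == 0:
                continue                      # an idle expert: load imbalance in miniature
            # In a real cluster, tokens[idx] would arrive via an all-to-all collective,
            # be processed locally, and be returned by a second all-to-all.
            out[idx] = experts[e](tokens[idx])
    return out

experts = nn.ModuleList(nn.Linear(64, 64) for _ in range(8))
tokens = torch.randn(32, 64)
expert_ids = torch.randint(0, 8, (32,))        # router assignments (top_k = 1)
print(expert_parallel_step(tokens, expert_ids, experts, n_devices=4).shape)
```

Note how an expert that receives no tokens simply sits idle: in the distributed version, that is an entire slice of a GPU doing nothing while other devices are overloaded.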
To make EP efficient, your experts have to be load-balanced across all layers in the network. This creates a tension between model quality and infrastructure efficiency. ML researchers optimizing for the best possible model quality often prefer routing strategies that allow experts to specialize heavily (Soboleva, 2025b), even if this creates load imbalance. Meanwhile, infrastructure teams need predictable, balanced workloads to achieve optimal hardware utilization and minimize training costs. Both goals are critical, but they pull in different directions. Aggressive load balancing can hurt model quality by forcing tokens to suboptimal experts (as we saw with hash routing). But imbalanced experts create expensive infrastructure bottlenecks, where some GPUs sit idle while others are overloaded, significantly decreasing overall hardware utilization.
Additionally, EP introduces communication overheads that worsen as we increase the number of experts. Modern MoE architectures activate many small experts per token rather than a few large ones, for better compute efficiency (Krajewski et al., 2024). However, this setup requires intensive load balancing to parallelize with EP, which impacts model quality. If you can't trade quality for speed with aggressive load balancing, EP's communication-to-computation ratio worsens: we spend most of the time moving tokens around rather than having the experts perform useful operations on them. Overall, training MoE models on GPUs remains challenging, even with parallelization techniques like EP.
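A back-of-envelope ratio shows why fine-grained experts make this worse. The numbers below (d_model, d_ff, bytes per element, top_k = 1, a plain two-matmul FFN expert) are illustrative assumptions, not measurements from any particular system.

```python
def flops_per_byte(d_model, d_ff, bytes_per_elem=2):
    # Communication per token routed to a remote expert:
    # send the activation there and the result back.
    comm_bytes = 2 * d_model * bytes_per_elem
    # Computation per token: two matmuls of a d_model x d_ff feed-forward expert
    # (2 FLOPs per multiply-accumulate).
    flops = 2 * 2 * d_model * d_ff
    return flops / comm_bytes

# One large expert vs. a fine-grained expert a quarter of the width:
print(flops_per_byte(d_model=4096, d_ff=16384))  # 16384.0 FLOPs per byte moved
print(flops_per_byte(d_model=4096, d_ff=4096))   # 4096.0 -> 4x less compute per byte
```

The bytes moved per token stay the same while the useful work per token shrinks, so the all-to-all traffic takes up an ever larger share of each step.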
Cerebras WSE vs GPU (Architectural Differences)
GPUs running large MoE models typically need multiple parallelism strategies in addition to expert parallelism. For example, DeepSeek-V3 used a combination of pipeline, expert, and data parallelism (DeepSeek-AI et al., 2024). This creates a complex 3D-parallel implementation that you must carefully tune, and re-tune whenever your model or your cluster changes. With the Cerebras WSE, distributed computing is not required: we use data parallelism alone to train and scale our models. What makes this possible? The WSE has roughly 1000x more on-chip memory (SRAM) than a GPU, which allows us to store much bigger models directly on the chip (roughly up to 1B total parameters). When scaling to larger models, we employ a technique called weight streaming (Hall et al., 2023) that disaggregates memory and compute on the WSE. With weight streaming, we remove the model parameters (those heavy tensors) from the wafer entirely. They live in external memory units, and we stream them to the wafer during training to compute gradients; the wafer streams gradients back to the memory units to update the weights. This technique allows us to train today's trillion-parameter MoE models (Kimi Team, 2025) on a single device.
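The dataflow is easier to see in a toy, single-process analogue. This is emphatically not the Cerebras API: ExternalMemory, load, and apply_update are placeholder names we made up to illustrate weights streaming in for the forward/backward passes and weight gradients streaming back out.

```python
import torch

class ExternalMemory:
    """Toy stand-in for the off-wafer memory units: owns the weights, applies updates."""
    def __init__(self, weights):
        self.weights = weights                               # layer index -> weight tensor
    def load(self, i):
        return self.weights[i].clone().requires_grad_(True)  # "stream" weights to the wafer
    def apply_update(self, i, grad, lr=1e-3):
        self.weights[i] -= lr * grad                         # the update happens off-wafer

def weight_streaming_step(memory, n_layers, x, target):
    streamed, act = [], x
    # Forward: stream each layer's weights in, compute, keep only activations "on chip".
    for i in range(n_layers):
        w = memory.load(i)
        streamed.append(w)
        act = torch.relu(act @ w)
    loss = ((act - target) ** 2).mean()
    # Backward: weight gradients are streamed back out to the external memory units.
    for i, g in enumerate(torch.autograd.grad(loss, streamed)):
        memory.apply_update(i, g)
    return loss.item()

mem = ExternalMemory({i: 0.1 * torch.randn(16, 16) for i in range(4)})
print(weight_streaming_step(mem, 4, torch.randn(8, 16), torch.randn(8, 16)))
```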
Memory Problem Solved, Now Compute
So we have established that the Cerebras WSE can fit large MoE models on the chip directly, with weight streaming and no model parallelism required. But just because you can train a trillion-parameter MoE doesn't mean it is efficient. As models get sparser (more experts, lower top_k), we hit a different issue: compute utilization. Remember the promised 62% reduction in FLOPs (floating point operations) with 32 experts in part 1 of this series? There we noted that it should let the MoE model train roughly 3x faster than the dense baseline to reach the same loss. But this is not what we see in practice: we fail to translate these FLOP reductions into actual wall-clock speedups, and training MoE models at scale becomes inefficient.
Why does this happen? With a sparse MoE model, routing subdivides the batch across many experts, so most experts see only a tiny portion of the original batch. Small batches mean most of your experts sit idle rather than doing useful work: you are loading large expert weight matrices but barely using them (these layers become I/O bound). At the same time, attention layers can't handle larger batches because they are activation-memory-bound (Anthony et al., 2023; Korthikanti et al., 2022): they must store large intermediate tensors, such as attention scores, that scale quadratically with sequence length [1]. Thus we have a persistent utilization gap between expert and attention layers. The expert networks starve, waiting for more data to process, while the attention layers struggle to process larger batches of tokens efficiently, preventing us from feeding data to the experts any faster.
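The arithmetic behind this starvation is simple. The batch size below is an illustrative assumption; the expert counts mirror the Qwen3 configuration discussed later.

```python
def avg_tokens_per_expert(tokens_in_batch, n_experts, top_k):
    # Each token is sent to top_k of the n_experts, so on average
    # an expert sees this fraction of the batch.
    return tokens_in_batch * top_k / n_experts

# A batch of 4096 tokens spread over 128 experts with top_k = 8:
print(avg_tokens_per_expert(4096, n_experts=128, top_k=8))  # 256.0 tokens per expert
# Sparser routing (top_k = 1) starves the experts even further:
print(avg_tokens_per_expert(4096, n_experts=128, top_k=1))  # 32.0 tokens per expert
```

A batch of 32 tokens is nowhere near enough work to amortize loading an expert's weight matrices, which is exactly why these layers become I/O bound.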
Batch Tiling on Attention
To solve this problem, on the Cerebras WSE we decouple the batch size requirements of the attention and expert networks. We call this technique Batch Tiling on Attention (BTA). With BTA, we split attention's input batch into G tiles of size B, process each tile independently, and then concatenate the results back into a large batch of size G × B that we feed into the expert networks (Figure 2).
Figure 2: Batch Tiling on Attention (BTA) architecture. The input batch is split into G tiles, each processed through attention with small batch size B. Outputs are concatenated to form a large batch G × B for expert processing, decoupling batch size constraints between attention and expert layers.
Just as FlashAttention (Dao et al., 2022) tiles along the sequence dimension to reduce memory usage, BTA tiles along the batch dimension to close the utilization gap between expert and attention networks. BTA's dual-batch strategy directly targets MoE's compute requirements: we use smaller batches for the attention layers, reducing their activation memory, and larger batches for the expert networks, improving their compute utilization.
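A minimal PyTorch sketch of the idea is below. This is our own illustration, not the kernel that runs on the wafer: the MoE block is stubbed out as a single dense feed-forward layer, and the shapes and tile count are arbitrary.

```python
import torch
import torch.nn as nn

class BTABlock(nn.Module):
    """Batch Tiling on Attention: small batches for attention, one large batch for experts."""
    def __init__(self, d_model=256, n_heads=4, n_tiles=8):
        super().__init__()
        self.n_tiles = n_tiles
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.expert_ffn = nn.Sequential(          # stand-in for the MoE expert block
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

    def forward(self, x):                         # x: (G * B, seq_len, d_model)
        # Attention runs tile by tile with batch size B, keeping its quadratic
        # activations (the attention scores) small.
        tiles = x.chunk(self.n_tiles, dim=0)
        attn_out = [self.attn(t, t, t, need_weights=False)[0] for t in tiles]
        # The expert block sees the re-concatenated batch of G * B sequences at once.
        return self.expert_ffn(torch.cat(attn_out, dim=0))

block = BTABlock()
y = block(torch.randn(8 * 4, 128, 256))           # G = 8 tiles of batch size B = 4
print(y.shape)                                    # torch.Size([32, 128, 256])
```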
Let's stress-test BTA on one of the recent MoE models, Qwen3 with 3B active parameters (Yang et al., 2025), on a single CS-3 machine. Originally, this model has 128 experts with top_k = 8 activated per token. We are going to vary top_k and the number of experts, thereby varying the sparsity of the model, and observe the effect of sparsity on throughput with and without BTA.
First, we keep top_k fixed at 8 and vary the number of experts from 16 to 128, measuring throughput degradation relative to a dense model with the same number of active parameters. Figure 3a shows that the "red" line (conventional batching without BTA) degrades severely, losing up to 53% of throughput (a 2x slowdown!) with 128 experts. The "blue" line (BTA, with G tuned for best performance) shows that throughput stays high, close to the dense model, across all expert counts. In the second experiment, we fix the number of experts at 128 and vary the number of activated experts, top_k, from 1 to 8. Here we also keep the number of active parameters fixed by adjusting the expert network sizes as top_k changes. Figure 3b shows that conventional batching degrades even more as the network gets sparser (lower top_k), reaching up to 86% worse throughput (a 7x slowdown!) than the dense model. BTA again maintains high throughput regardless of top_k.
Figure 3: BTA prevents throughput degradation as MoE models become sparser. Conventional batching ("red" line) shows severe performance drops (up to 53% with more experts, 86% with smaller top_k), while BTA ("blue" line) maintains stable, high performance, close to the dense model, across all configurations.
Bringing It All Together
MoE models are hard to train efficiently, especially at higher levels of sparsity. Highly sparse MoE models have many more parameters than dense models: with only a few dozen experts they can already hit memory limits on GPUs (a tiny number, given that production MoE models use hundreds of experts nowadays). With larger expert counts, GPUs require model-parallelization techniques just to fit all the experts in memory: you need expert parallelism to distribute the expert networks across devices, and you need to keep those experts load-balanced.
Given that the memory constraint is less critical on Cerebras hardware, we looked at the compute inefficiencies that sparser networks experience, and showed how a technique like BTA can significantly improve the compute utilization of MoE models on the wafer. It's worth mentioning that BTA is not a silver bullet; it is one technique in a broader optimization toolkit. The expert parallelism we discussed for GPU clusters applies equally well to Cerebras hardware: you can distribute experts across multiple WSE devices while using BTA on each device to maintain high compute utilization.
That's it. You now know the complete picture of MoE training at scale. You understand the fundamental concepts from part 1, know how to choose the right routing method from part 2, can debug MoE training issues with part 3, and have seen how to solve memory management and compute utilization problems on modern hardware (whether you're training on GPUs with EP or on the Cerebras WSE with BTA). In the next part, "MoE Math Demystified: What Does 8x7B Actually Mean?", we'll shift focus from training to deployment. We will give you practical tools to make an informed decision about whether to deploy an MoE model or stick with a dense architecture, depending on your constraints.
Questions? Find me at: https://soboleva-daria.github.io/
Footnotes
[1] On Cerebras hardware, we don’t use FlashAttention and offload activation tensors into the memory units whenever they are not being used.
References
Anthony, Q., Biderman, S., & Schoelkopf, H. (2023). Transformer Math 101. Eleuther AI Blog. https://blog.eleuther.ai/
Cerebras Systems (2023). Training giant neural networks using weight streaming. White Paper. https://www.cerebras.ai
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and memory-efficient exact attention with IO-awareness. arXiv preprint arXiv:2205.14135. https://doi.org/10.48550/arXiv.2205.14135
DeepSeek-AI (2024). DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437. https://doi.org/10.48550/arXiv.2412.19437
Kimi Team (2025). Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534. https://doi.org/10.48550/arXiv.2507.20534
Korthikanti, V., Casper, J., Lym, S., McAfee, L., Andersch, M., Shoeybi, M., & Catanzaro, B. (2022). Reducing activation recomputation in large transformer models. arXiv preprint arXiv:2205.05198. https://doi.org/10.48550/arXiv.2205.05198
Krajewski, J., Ludziejewski, J., Adamczewski, K., et al. (2024). Scaling laws for fine-grained mixture of experts. arXiv preprint arXiv:2402.07871. https://doi.org/10.48550/arXiv.2402.07871
Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., & Chen, Z. (2020). GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668. https://doi.org/10.48550/arXiv.2006.16668
Qwen Team (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388. https://doi.org/10.48550/arXiv.2505.09388
Soboleva, D. (2025a). MoE fundamentals: Sparse models are the future. Cerebras Blog. https://www.cerebras.ai/blog/moe-guide-why-moe
Soboleva, D. (2025b). Router wars: Which MoE routing strategy actually works. Cerebras Blog. https://www.cerebras.ai/blog/moe-guide-router
Soboleva, D. (2025). Debugging dead MoE models: A step-by-step guide. Cerebras Blog. https://www.cerebras.ai/blog/moe-guide-debug