
Oct 14 2025

MoE Math Demystified: What Does 8x7B Actually Mean?

This video breaks down MoE inference arithmetic and deployment bottlenecks across different hardware setups. If you can’t open the video displayed above, please use this link to open it on YouTube: https://youtu.be/gHpDBoyCOrE

What does 8x7B actually mean? You probably thought it meant 8 experts with 7B active parameters per token. We did too. Turns out it is actually 13B active parameters. But wait, where does 13B come from? This is exactly the kind of confusion this post clears up.

We'll explain what those numbers actually mean for inference by answering how much memory you need, how many GPUs, and which bottlenecks you'll commonly hit in production deployment. We'll show that single-GPU deployment is memory-bound, multi-GPU setups are communication-bound, and specialized hardware like the Cerebras WSE is compute-bound.

Originally, we set out to write a simple post on MoE arithmetic. Then we kept digging. And digging. What started as basic math turned into a full explanation of inference bottlenecks, hardware architectures, and deployment strategies. Welcome to MoE inference 101. The title stayed, but the scope didn't.

The MoE series so far has focused on training. In part 3, you trained your own MoE model, and in part 4 you scaled it to production size. Now what? We shift to inference. During inference, model weights are frozen: no gradients, no optimizer states. This sounds much simpler than training. But! Despite doing less work, MoE inference has its own unique challenges.

Fun fact: people who train models rarely think about inference costs, and vice versa. Your authors are no exception, but we are trying to be better. So, if you're deploying MoE models, or you trained an MoE model and want to know what your design choices have led to, or you simply want to understand how to run MoE inference efficiently, keep scrolling.

Table 1: Notation.

How much storage do I need?

Want to avoid hitting an OOM error during inference deployment? Let's examine the two components that dominate storage: model weights and the kv-cache. To simplify our calculations, we are going to use a standard modern transformer setup. We use RoPE positional embeddings (Su et al., 2023), SwiGLU nonlinearity (Shazeer, 2020; Dauphin et al., 2017), layer norms, multi-head attention, untied embeddings, and industry-standard learned routing (Soboleva, 2025a).

Model weights

Let's walk through the math. First, we'll calculate the storage requirement of a single decoder block, then account for all decoder blocks, and finally add the remaining network parameters not included in the decoder blocks (bottom to top). Bias terms are omitted from the following equations as they are negligible. Our MoE model consists of an embedding layer, followed by a stack of decoder blocks, and an unembedding layer. Each decoder block contains two layer norms, an attention layer, a router, and an MoE layer with its expert networks (Figure 1).

Figure 1: Visual breakdown of the MoE model decoder architecture. The MoE model consists of an embedding layer, followed by a stack of decoder blocks, and an unembedding layer. Each decoder block includes two layer norms, an attention layer, a router, and an MoE layer with its expert networks.

We start with the embedding weights.

The embedding layer consists of an input embedding matrix of size vocabulary by hidden dimension. Following the embedding layer, next come the layer norms.

Here we pick up a factor of 2: there are two layer norms per decoder block, and each layer norm stores gains and biases of hidden-dimension size. Next, let's do the math for the attention layer.

In attention, we account for four weight matrices (query, key, value, and output), each of size hidden dimension by hidden dimension. Perfect, now comes the first MoE component, the router.

In the case of learned routing, we need to store a matrix of learnable router weights of size hidden dimension by number of experts. The last thing we need to account for is the MoE layer weights.

Each expert network uses the SwiGLU nonlinearity and thus requires three linear transformations (gate projection, up-projection, and down-projection). With dynamic routing, we can't predict which experts will be activated, so you must provision enough device memory to store all experts. Thus, MoE models require more memory capacity than dense networks with equivalent active parameters. Finally, we use untied embeddings, so we need to account for the additional storage of an unembedding matrix of size hidden dimension by vocabulary.

Combining all of the weights in a single decoder block gives us the per-block total.

Accounting for all decoder blocks in the network, and adding the embedding and unembedding weights, we get the total amount of storage in bytes needed to hold the model weights.
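As a compact sketch (using one possible notation: $d$ for the hidden dimension, $d_{\text{ff}}$ for the expert feed-forward dimension, $V$ for the vocabulary size, $E$ for the number of experts per MoE layer, $L$ for the number of decoder blocks, and $b$ for bytes per parameter), the counts above work out to roughly:

$$
\begin{aligned}
\text{Embedding} &\approx V d, \qquad \text{Unembedding} \approx d V,\\
\text{Layer norms (per block)} &\approx 2 \cdot 2d,\\
\text{Attention (per block)} &\approx 4 d^2,\\
\text{Router (per block)} &\approx d E,\\
\text{MoE experts (per block)} &\approx 3\, d\, d_{\text{ff}}\, E,\\[4pt]
\text{Model Weights (bytes)} &\approx b \left[\, 2 V d + L \left( 4 d + 4 d^2 + d E + 3\, d\, d_{\text{ff}}\, E \right) \right].
\end{aligned}
$$

We reuse the same symbols in the FLOPs sketches below.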

Next, let's answer the question of how much compute we need to run inference.

How much compute do I need?

Similar to how we estimated the storage required to run inference with our MoE model, here we will estimate the number of FLOPs (floating point operations) it needs. Later, we will convert our FLOPs into throughput and latency metrics. These metrics will help us answer how fast our MoE model will run. There are two important stages in inference. The first one is called prefill. During prefill, the model processes the entire prompt all at once. During the second stage, we start generating our answer. With autoregressive transformers, generation happens one token at a time, and each new token depends on the prompt plus all previously generated tokens. We call this second stage decode. Let's focus on calculating the FLOPs we spend during the prefill stage and then do a separate analysis for decode.

Most of the operations in the transformer consist of matrix-vector, matrix-matrix, and Hadamard computations. So let's establish how many FLOPs are needed for these base operations.
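As a rough reference (counting one multiply plus one add as 2 FLOPs), the base costs are approximately:

$$
\begin{aligned}
\text{matrix-vector, } (m \times n) \cdot (n \times 1) &: \; \approx 2mn \ \text{FLOPs},\\
\text{matrix-matrix, } (m \times n) \cdot (n \times p) &: \; \approx 2mnp \ \text{FLOPs},\\
\text{Hadamard product, } (m \times n) \odot (m \times n) &: \; \approx mn \ \text{FLOPs}.
\end{aligned}
$$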

Prefill

Cool, now we are ready to dive into the actual computations. Note that we will ignore the embedding layer. It is usually implemented as a lookup table and doesn't require arithmetic operations. We start with the layer norms.

We need roughly 7 FLOPs per element, across every token in the sequence and every hidden dimension, to calculate the layer norm FLOPs. This comes from computing the mean and variance across the embedding dimension, then performing normalization and scaling. Feel free to do the math on your own and verify us! The factor of 2 accounts for both layer norms in the block. Let's move on to the attention FLOPs.

Here we account for multiplications between the input matrix (sequence length by hidden dimension) and each of the query, key, and value projection matrices (each hidden dimension by hidden dimension). After the attention computation, we multiply the attention output by the output projection matrix (also hidden dimension by hidden dimension). These four matrix multiplications give us the first term.

The second term covers the attention computation itself. We calculate attention logits by multiplying the query and key projections, then multiply the resulting attention weights by the value projection. Both of these matrix-matrix computations scale with the square of the sequence length, and together they give us the quadratic second term.

The final term accounts for the softmax, which requires a handful of FLOPs per element (exponentiation, normalization, and scaling). Since we perform a separate softmax for each attention head over every pair of positions, this contributes the third term. Next, let's account for the RoPE positional embeddings.

With RoPE, we add an additional multiplication in the attention layer between the query projection and the rotation matrix. Due to the sparsity of the rotation matrix, this product can be decomposed into the sum of two Hadamard products (Su et al., 2023), which keeps the cost linear in the sequence length and hidden dimension. Following Black et al. (2022), we perform rotations on only a fraction (25%) of the embedding vector dimensions, which shrinks the total RoPE FLOPs accordingly. Next come the FLOPs we spend in the router network.

With industry-standard learned routing, we multiply the input matrix (sequence length by hidden dimension) by the router matrix (hidden dimension by number of experts), which gives us the first term. The second term accounts for the softmax FLOPs we spend for each token in the sequence, across all experts. The next layer to account for in our FLOPs calculation is the MoE layer.

In the MoE layer, each expert contains two consecutive feed-forward stages (up- and down-projections). The up-projection with SwiGLU performs two matrix-matrix operations, multiplying the input matrix (sequence length by hidden dimension) by weight matrices of size hidden dimension by feed-forward dimension. The down-projection performs one matrix-matrix operation, multiplying the intermediate result (sequence length by feed-forward dimension) by a weight matrix of size feed-forward dimension by hidden dimension. Together, these three multiplications give us the first term. For the SwiGLU nonlinearity, we also need a Hadamard product between two matrices of size sequence length by feed-forward dimension, and the Swish function itself requires 5 FLOPs per activation (negation, exponentiation, addition, division, and multiplication); these make up the second term. Note that we only activate a few experts out of the total per token, thus MoE models require significantly fewer FLOPs than dense networks with the same total parameter count. The reduced compute means faster training and inference. The last layer we need to account for is the unembedding layer.

It transforms each token's hidden representation back into the vocabulary space, multiplying a hidden vector (of hidden-dimension size) by the unembedding matrix (hidden dimension by vocabulary size).

Overall, summing these components across all decoder blocks and adding the unembedding, we get the total number of FLOPs we spend per sequence during the prefill stage.
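Here is a sketch of the per-block prefill terms and their sum, using the same assumed notation plus $s$ for the sequence length, $h$ for the number of attention heads, $k$ for the number of active experts per token, $r_{\text{rot}}$ for the fraction of rotated dimensions, and a small constant $c_{\text{sm}}$ for the per-element softmax cost. The prefactors follow the reasoning above; the exact constants in the original equations may differ slightly.

$$
\begin{aligned}
\text{Layer norms} &\approx 2 \cdot 7\, s\, d,\\
\text{Attention projections} &\approx 8\, s\, d^2,\\
\text{Attention scores and values} &\approx 4\, s^2 d,\\
\text{Attention softmax} &\approx c_{\text{sm}}\, h\, s^2,\\
\text{RoPE} &\approx 3\, s\, d\, r_{\text{rot}},\\
\text{Router} &\approx 2\, s\, d\, E + c_{\text{sm}}\, s\, E,\\
\text{MoE experts (top-}k\text{)} &\approx k \left( 6\, s\, d\, d_{\text{ff}} + 6\, s\, d_{\text{ff}} \right),\\[4pt]
\text{FLOPs}_{\text{prefill}} &\approx L \cdot \big(\text{sum of the per-block terms}\big) + 2\, s\, d\, V.
\end{aligned}
$$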

Decode

During the decode phase we operate at the token level, not on full sequences. This means that all the computations from the prefill stage get reduced by a factor of the sequence length. Except attention. Here we have a problem. In attention, we need to maintain the context of all previous tokens in our window. So even though we are processing a single token, we still need to compute pairwise relationships between all tokens in the sequence that came before it. Pretty soon this blows up (we're talking quadratic growth as the context length increases). Look back at the attention terms from the prefill stage: that is how many FLOPs you would spend on just a single new token. This is clearly unsustainable.

To avoid this, we'll trade memory for compute and introduce a kv-cache. With the kv-cache, we cache the key and value vectors from all previous tokens in the sequence, so we never have to recompute them. Now the attention compute will look like this:
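Per decoder block, for one new token with $s$ tokens already in the cache, a rough sketch under the assumed notation gives:

$$
\text{Attention}_{\text{decode}} \approx \underbrace{8 d^2}_{\text{Q, K, V, O projections}} + \underbrace{4\, s\, d}_{\text{scores and values against the cache}} + \underbrace{c_{\text{sm}}\, h\, s}_{\text{softmax}}.
$$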

Notice what happened. The quadratic term is gone. We cached all the past token-token relationships needed for attention. We only compute the new token's relationship to all the tokens already present in the sequence. Now compute grows linearly with the sequence length. Much better. Thus, the overall FLOPs we spend during decode per token:
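In other words, once the kv-cache removes the quadratic term, every remaining component is linear in the number of tokens processed, so a reasonable sketch is:

$$
\text{FLOPs}_{\text{decode, per token}} \approx \frac{\text{FLOPs}_{\text{prefill}}}{s},
$$

with the $4 s^2 d$ and $c_{\text{sm}} h s^2$ attention terms replaced by their linear counterparts $4 s d$ and $c_{\text{sm}} h s$.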

For MoE models, kv-cache storage requirements are identical to those of dense networks. Following the analysis from Chen (2022), the additional storage that we need to provision can be estimated with:
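In the assumed notation (with $d_{\text{head}}$ the per-head dimension and batch size 1; multiply by the batch size when serving multiple sequences), a sketch of the estimate is:

$$
\text{kv-cache (bytes)} \approx 2 \cdot b \cdot L \cdot s \cdot h \cdot d_{\text{head}}.
$$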

We account for 2 vectors (keys and values) that we store across all layers, across the full sequence, with separate vectors for each of the attention heads, where each vector has the per-head dimension.

Why my GPU doesn’t go brrr?

We've calculated how much storage we need in section 2 and how many FLOPs our model requires in section 3. Our theory looks solid, but if you've run experiments, you know the math doesn't always check out in practice. Super frustrating. So let's verify everything together and figure out why GPUs often sit idle instead of going brrr. We'll load Qwen3 (an MoE with 3B active and 30B total parameters; Yang et al., 2025) onto an H100 GPU and measure how fast it generates tokens at the decode stage:

Figure 2: Measuring how fast Qwen3 generates tokens at the decode stage on H100 GPU.
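If you want to reproduce a rough version of this measurement, here is a minimal sketch using the Hugging Face transformers API. The checkpoint id, prompt, and timing approach are our assumptions, not the exact notebook code.

```python
# Rough sketch of the Figure 2 measurement (assumptions: the Hugging Face
# checkpoint id "Qwen/Qwen3-30B-A3B", a single H100, greedy decoding).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-30B-A3B"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(model_id)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

def measure_decode_ms(batch_size: int, new_tokens: int = 12) -> float:
    """Average per-token decode time in milliseconds (prefill cost is negligible here)."""
    prompts = ["The quick brown fox"] * batch_size   # short, ~4-token prompt
    inputs = tok(prompts, return_tensors="pt", padding=True).to("cuda")
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)  # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / new_tokens * 1e3

print(f"{measure_decode_ms(batch_size=1):.1f} ms per generated token at batch size 1")
```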

You run it and get roughly 3 ms. Is this good or bad? Let's figure it out. Grab the decode formulas from section 3 and plug in Qwen3's parameters. We get roughly 7.7 GFLOPs of compute per token. Your H100 has a peak of 800 TFLOP/s (Bekman, 2023-2024), so we'd expect to process one token in about 9.6 microseconds. This is a tiny fraction of the 3 ms we measured. Where does the rest of the time go?

During inference, two things happen in parallel. We stream weights from HBM memory into the GPU's on-chip SRAM, while compute executes on those weights. Whichever is slower becomes the bottleneck. Let's estimate the weight-movement time. Plug Qwen3's numbers into the storage formulas from section 2 and calculate how many parameters you activate per token (simply replace the total number of experts with the number of active experts per token in the MoE layer weights calculation), since we only need to load the experts we are actually using. With bf16 precision (2 bytes per parameter), we need to move roughly 6 GB of weights per token. At the H100's realistic memory bandwidth of 2.8 TB/s, loading these weights takes around 2.1 ms. We are spending 99.5% of our time moving weights and only 0.5% on actual compute. That's bad.
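Here is the same arithmetic spelled out, with all inputs as rounded assumptions:

```python
# Back-of-the-envelope check of the numbers above, using rounded Qwen3-30B-A3B
# and H100 figures (all values are approximate assumptions).
active_params   = 3.0e9    # ~3B parameters activated per token
bytes_per_param = 2        # bf16
decode_flops    = 7.7e9    # per-token decode FLOPs from the formulas above
peak_flops      = 800e12   # achievable H100 peak, FLOP/s
bandwidth       = 2.8e12   # realistic H100 HBM bandwidth, bytes/s

compute_time = decode_flops / peak_flops                    # ~9.6 microseconds
load_time = active_params * bytes_per_param / bandwidth     # ~2.1 milliseconds
print(f"compute: {compute_time * 1e6:.1f} us, weight loading: {load_time * 1e3:.1f} ms")
print(f"share of time spent on compute: {compute_time / (compute_time + load_time):.1%}")
```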

So why doesn't my GPU go brrr? It's because currently we are memory-bound. In the memory-bound regime, most of the time is spent on data movement, whereas GPU utilization stays low. You're wasting compute and $$$. The regime that we want to be in is called compute-bound. In the compute-bound regime, computations dominate and GPU utilization is high. So how do we shift from the memory-bound to the compute-bound regime? Let's find out. But first, let's introduce some useful metrics that will help with our analysis.

Am I measuring the right thing?

The indicator that you're switching from the memory-bound to the compute-bound regime is that you make your GPU go brrr. But how do you quantify this effect? We need a metric. The metric is called throughput. Throughput measures tokens generated per second. Low throughput means your GPU is sitting idle; high throughput indicates your GPU is busy doing useful work.

In the previous section we measured that it takes roughly 3 ms to generate one token. Thus, in this setup where the batch size is 1, your throughput is approximately 333 tokens per second. Here is how you can obtain the same number from theory. Since at batch size 1 we are in the memory-bound regime, as we observed in the previous section, loading model weights dominates the total time spent. Thus, your throughput can be defined as:

$$
\text{Throughput}_{\text{memory-bound}} \approx \frac{\text{Bandwidth}}{\text{Model Weights}}
$$

where Bandwidth is the achievable memory bandwidth on your particular device (for the H100 we take 2.8 TB/s) and Model Weights is how many bytes you need to move to process a single token (roughly 6 GB of active weights for the Qwen3 MoE model). Thus, we should expect throughput to be around 466 tokens per second. Our measured 333 tokens per second gets about 70% of the way there, another confirmation that we are in the memory-bound regime. Now, when we switch to the compute-bound regime, most of the time is spent on useful compute, so the throughput formulation changes to:

$$
\text{Throughput}_{\text{compute-bound}} \approx \frac{\text{Peak}}{\text{Decode}}
$$

where Peak is the achievable peak FLOP rate of your device (for the H100 we take 800 TFLOP/s) and Decode is how many algorithmic FLOPs a single decode step requires (in the previous section we got 7.7 GFLOPs of compute). So you can see that we have huge potential to go from our 333 tokens per second up to roughly 100,000!
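The two ceilings, spelled out with the same rounded numbers:

```python
# Throughput ceilings for Qwen3-30B-A3B decode on a single H100 (rounded assumptions).
bandwidth    = 2.8e12   # bytes/s, realistic HBM bandwidth
peak_flops   = 800e12   # FLOP/s, achievable peak
active_bytes = 6e9      # ~6 GB of active weights moved per token (bf16)
decode_flops = 7.7e9    # algorithmic FLOPs per decode step

memory_bound_tps  = bandwidth / active_bytes    # ~466 tokens/s
compute_bound_tps = peak_flops / decode_flops   # ~104,000 tokens/s
print(f"memory-bound ceiling:  {memory_bound_tps:,.0f} tokens/s")
print(f"compute-bound ceiling: {compute_bound_tps:,.0f} tokens/s")
```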

We know you probably can't wait to see what it takes to speed up our Qwen3 model to realize its full potential on the H100 GPU. But! Throughput is only half of the story. We also need to track inter-token latency, which measures the time each user waits between consecutive tokens (and it directly correlates with user happiness).

This is exactly what we measured in the previous section (at batch size 1). You might wonder why track both metrics if one derives from the other? Good question! In the memory-bound regime, as you increase the batch size, throughput scales up because you're amortizing weight loads across more tokens. Latency stays flat, because weight loading still dominates the time spent generating each token. But eventually, at large batch sizes, you hit the compute-bound regime. Now throughput plateaus at your GPU's peak FLOPs per second (you've juiced out everything you can). Meanwhile, latency starts climbing linearly because processing the large batch now dominates the overall time. Each additional token you add to the batch directly increases everyone's wait time. Now that we are armed with metrics, let's answer what it takes to transition into the compute-bound regime.
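A rough roofline-style way to see both regimes at once (our own approximation, not an exact formula, with $B$ the batch size):

$$
\text{Inter-token Latency} \approx \max\!\left( \frac{\text{Model Weights}}{\text{Bandwidth}},\ \frac{B \cdot \text{Decode}}{\text{Peak}} \right).
$$

The first argument dominates at small batch sizes, which is why latency stays flat; the second takes over past the crossover point, which is why latency then grows linearly with the batch size.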

How do I switch to compute-bound regime?

One way to shift from the memory-bound to the compute-bound regime is to increase the amount of compute while keeping memory movement constant. This can be done by batching multiple prompts together. In fact, this is how production-level inference works. In production, you serve multiple users simultaneously (for example, batching multiple users' requests into a single batch of sequences). This way, we reuse the same model weights across all sequences in the batch. Cool, so I should just increase my batch size to a very large number? Well, not quite. Let's run a sweep of different batch sizes for our Qwen3 model and measure both the latency and throughput that we defined in section 5. Our goal is to find the batch size that maximizes throughput without an excessive increase in latency. When you finish your sweep, you should see something similar to Figure 3.
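A sketch of such a sweep, reusing the hypothetical measure_decode_ms helper from the earlier snippet:

```python
# Sweep batch sizes and record inter-token latency and throughput
# (sketch; assumes the measure_decode_ms helper defined in the earlier snippet).
# Note: Figure 3 may plot total generation latency instead of per-token latency;
# multiply by the number of generated tokens if you want that view.
for batch_size in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
    latency_ms = measure_decode_ms(batch_size)        # ms per generated token
    throughput = batch_size / (latency_ms / 1e3)      # tokens per second across the batch
    print(f"B={batch_size:5d}  latency={latency_ms:7.2f} ms  throughput={throughput:9.0f} tok/s")
```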

Figure 3: Latency and throughput vs. batch size for Qwen3-30B-A3B with 4 tokens in the prompt, generating 12 new tokens. Results are averaged across 20 runs. To reproduce this plot, get yourself an H100 GPU and run plot.ipynb

Below a batch size of around 256, you can see that our latency stays relatively flat, below 29 ms. You're still waiting for the weights. But throughput grows linearly. As you keep increasing the batch size in this stage, you amortize that 25.2 ms weight-loading cost across more sequences (in our example, we generate 12 tokens during the decode phase, with each token requiring 2.1 ms to load the model weights). Around batch size 256, the regime shifts. Latency starts climbing linearly, and throughput growth levels off at around 6,000 tokens per second. You've maxed out compute. Adding more sequences doesn't improve throughput substantially; it just makes each sequence (or user) wait longer. This 256 is your crossover point. Below it, you are memory-bound; above it, you're compute-bound. We basically just found the best batch size for Qwen3 inference by empirical analysis. There is also a way to do this analysis analytically. Let's see if the math checks out.

Consider a single matrix multiplication in the attention layer with a weight matrix of size $d \times d$. To process a batch of $B$ sequences (one token each during decode), the compute takes roughly $2 B d^2 / \text{Peak}$ seconds, while loading the weights takes roughly $b\, d^2 / \text{Bandwidth}$ seconds, where $b$ is the number of bytes per parameter. We're compute-bound when the compute time exceeds the weight-loading time. Solving this inequality, the $d^2$ terms cancel and you find that:

$$
B \;\gtrsim\; \frac{\text{Peak} \cdot b}{2 \cdot \text{Bandwidth}}.
$$

You can see that the optimal batch size is pretty universal and does not depend on the particular model architecture. It only depends on the precision and hardware you use. Plugging in the H100's peak of 800 TFLOP/s, bandwidth of 2.8 TB/s, and 2 bytes per parameter with bf16 precision, we get roughly 286 sequences. Pretty close to our 256 number in Figure 3! Theory matches practice!
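And the crossover number itself, for the record:

```python
# Crossover batch size from the roofline argument above (H100, bf16 assumptions).
peak_flops      = 800e12   # FLOP/s
bandwidth       = 2.8e12   # bytes/s
bytes_per_param = 2        # bf16

b_star = peak_flops * bytes_per_param / (2 * bandwidth)
print(f"compute-bound above roughly B = {b_star:.0f} sequences")   # ~286
```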

The formula applies to every matrix multiplication in the network (those are the dominant operations). Different operations cross over at slightly different points, and the visible system-wide crossover happens when enough operations have shifted to compute-bound. It is useful to note that MoE layers are particularly problematic: they subdivide the batch across experts, making it difficult to find a universally good batch size for the entire network. MoE layers can remain memory-bound even when other layers are compute-bound. This is why, even though we improved throughput from 333 to 6,000 tokens/s, we did not come close to the peak throughput of 100,000 tokens/s. To continue improving throughput, we would need to add more GPUs.

What happens if I add more GPUs?

In the previous section, we found the sweet spot for a single GPU. At a batch size of 256, we transitioned from the memory-bound to the compute-bound regime. Now, let's consider what happens with multiple GPUs. When you split a model across multiple GPUs, you need some form of model parallelism. For MoEs, let's use Expert Parallelism (EP). With EP, we shard Qwen3's 128 experts across multiple GPUs. Every MoE layer then performs an expensive all-to-all shuffle. It dispatches tokens to expert networks, and gathers results back after compute is finished (see Figure 4).

Figure 4: MoE model with 6 experts and expert parallelism across 2 GPUs. Experts are equally assigned across accelerator devices. All other parameters (attention, layer norms, router, etc.) are duplicated on each device.

We showed in part 4 of the MoE 101 series (Soboleva, 2025b) that this all-to-all communication becomes the bottleneck for MoE models. And you can't simply batch your way out of it. Unlike the single-GPU case, increasing the batch size scales both communication and computation proportionally. So we've hit two walls. MoE models are memory-bound on a single GPU, and communication-bound in multi-GPU setups. Both come from the same architectural constraint: compute and memory are separate, connected by bandwidth-limited wires. But what if your hardware was compute-bound from the start? Let's look at what happens when we run inference on Cerebras hardware.

What if we start compute-bound with Cerebras WSE?

To run inference on the Cerebras WSE, you can lay out the entire model on the wafer in pipeline mode (Figure 5). Each layer is mapped to a physical region with its weights and kv-cache stored locally in on-chip SRAM. Regions are placed adjacently so that token flow between layers happens with almost no latency overhead. It's all on the same chip. The output from the last layer cycles back to generate the next token.

Figure 5: MoE model with 6 experts laid out on Cerebras WSE device. Each layer (including experts) is mapped to a physical wafer region with weights stored locally. Tokens flow through regions in a pipeline.

Remember in the single GPU analysis (section 4), we spent 99.5% of total time loading Qwen3-30B-A3B's weights from HBM. With the Cerebras WSE, instead of storing weights in off-chip HBM like a GPU does, the entire model lives in on-chip SRAM. This way model weights never need to be moved on/off-chip, so you’re never memory-bound. Even for MoEs! You're compute-bound immediately at batch size 1. Thus, you can skip the experiments we did for the single GPU in section 6.

In the multi-GPU case (section 7), we showed that expert parallelism introduces a communication bottleneck. Every forward pass requires expensive all-to-all operations to route tokens to the correct experts across different GPUs. With the Cerebras WSE, we don't have this communication bottleneck. Since all experts live on the same chip with weights stored locally across processing elements (PEs), there is no need for all-to-all operations.

And this is where we'll stop. As much as we enjoyed writing it, our blog has gotten way longer than we planned, so let’s summarize what you’ve learned so far.

What you've learned so far

Okay, so MoEs seemed impossibly complex at the start, right? Trillion-parameter models that only big labs can touch. But you just solved it, piece by piece. You learned why MoE works and built the fundamentals. You picked the right routing strategy and understood the tradeoffs. You debugged a broken MoE from scratch, fixing the gradient flow bug that kills most implementations. You scaled to production sizes on both GPUs and Cerebras WSE. And you cracked inference math and the bottlenecks. You went from theory to production.

More importantly, you learned the methodology. How to ask the right questions. Which routing strategy and why? What's actually broken when your router is collapsing? How do you scale without hitting bottlenecks? Hypothesis-driven debugging and measuring the right things is really what matters.

Alright, go build something cool.

Calculator

Your authors couldn’t find a good tool that estimates both compute and storage requirements for MoE models. Here is a simple calculator that does both. You’ve seen the methodology in the above sections, now just plug in your numbers and let the calculator do the work.

Acknowledgments

Before we close, I want to say thank you. This MoE 101 series has been quite a journey, and I couldn't have done it alone.

Big thanks to my co-authors Aman Tiwari, Quentin Anthony, and Etienne Goffinet for their collaboration and expertise.

To our reviewers Rob Schreiber, Mostafa Elhoushi, Golara Azar, Filipp Nikitin, Aaron Gokaslan, Dylan Finch, Vijay Thiruvengadam, Connor Anderson, Landon Noll, Sangamesh Ragate, Gavia Gray, Eric Sather, and Mikhail Yurochkin. Thank you for pushing us to explain things better.

To our marketing team Tin Hoang, Sneha Khanvilkar, Chris Kim, Daniel Kim, Pearl Hulbert, Isaac Tai, and Sarah Chieng. Thanks for helping these posts reach the community.

And to everyone who has been following along, reading, asking questions. Your engagement made this journey really fun. Thank you.

Citation

Questions? Find me at: https://soboleva-daria.github.io/

References

Bekman, S. (2023-2024). Machine Learning Engineering Open Book. GitHub repository. https://github.com/stas00/ml-engineering

Black, S., Biderman, S., Hallahan, E., et al. (2022). GPT-NeoX-20B: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745. https://doi.org/10.48550/arXiv.2204.06745

Chen, C. (2022). Transformer inference arithmetic. https://kipp.ly/blog/transformer-inference-arithmetic/

Dauphin, Y. N., Fan, A., Auli, M., et al. (2017). Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083. https://doi.org/10.48550/arXiv.1612.08083

Shazeer, N. (2020). GLU variants improve transformer. arXiv preprint arXiv:2002.05202. https://doi.org/10.48550/arXiv.2002.05202

Soboleva, D. (2025a). Router wars: Which MoE routing strategy actually works. Cerebras Blog. https://www.cerebras.ai/blog/moe-guide-router

Soboleva, D., & Anthony, Q. (2025b). MoE at scale: Making sparse models fast on real hardware. Cerebras Blog. https://www.cerebras.ai/blog/moe-guide-scale

Su, J., Lu, Y., Pan, S., et al. (2023). RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864. https://doi.org/10.48550/arXiv.2104.09864

Yang, A., Li, A., Yang, B., et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388. https://doi.org/10.48550/arXiv.2505.09388
