The entire way we run AI inference is being rearchitected right now. AWS and Cerebras just announced a partnership around it. NVIDIA spent $20 billion acquiring Groq to catch up. Jensen Huang stood on stage at GTC 2026 and effectively validated what companies like Cerebras have been saying for years: general-purpose GPUs aren't enough for inference at scale.
The thing they're all converging on is called disaggregated inference. And if you're a developer building anything on top of LLMs, this is going to change how fast your products feel, how much they cost to run, and what's even possible to build.
Your GPU Is Doing Two Very Different Jobs
When you send a prompt to an LLM, the model doesn't just "think" and return text. It runs two completely separate operations, back to back, on the same hardware.
Phase 1: Prefill
The model processes your entire input prompt in parallel, building up the KV cache, which is effectively its working memory. This is compute-heavy work dominated by large matrix multiplications, and it keeps the GPU's cores fully saturated.
Phase 2: Decode
Now the model generates the output one token at a time. Because each new token depends on the previous one, this step can't be parallelized. Unlike prefill, this phase is not compute-bound; it is memory-bandwidth-bound. The bottleneck becomes how quickly you can read from memory, not how many FLOPs you have available.
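The two phases can be sketched with a toy single-head attention layer. Everything here is illustrative (random weights, tiny dimensions, no real tokenizer): the point is that prefill is one batched pass that builds the KV cache, while decode is a sequential loop that re-reads the whole cache every step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                           # toy model dimension
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    # Single-head scaled dot-product attention over the cached keys/values.
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Phase 1: prefill -- the ENTIRE prompt in one batched matmul (compute-bound).
prompt = rng.normal(size=(16, d))               # 16 prompt-token embeddings
K_cache = prompt @ Wk
V_cache = prompt @ Wv

# Phase 2: decode -- one token at a time, streaming the whole cache each step
# (memory-bandwidth-bound; no parallelism across output tokens).
x = prompt[-1]
for _ in range(4):
    q = x @ Wq
    out = attend(q, K_cache, V_cache)           # reads every cached row
    K_cache = np.vstack([K_cache, (x @ Wk)[None, :]])
    V_cache = np.vstack([V_cache, (x @ Wv)[None, :]])
    x = out                                     # stand-in for the next token

print(K_cache.shape)                            # cache grew one row per token
```

Note how the decode loop touches every row of `K_cache` and `V_cache` on every iteration while doing comparatively little math per byte read. That ratio is the whole story of the next section.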
The reality is that these two phases have very different hardware requirements. Prefill benefits from massive parallel compute, while decode is memory-bandwidth-bound and depends on extremely fast, low-latency memory access.
GPUs perform well in compute-heavy regimes like prefill. But because they’re designed as general-purpose accelerators, they aren’t optimized for the memory-bound nature of decode when you need peak performance at scale. By separating prefill and decode, disaggregated architectures can deliver significantly higher throughput per kilowatt while maintaining consistent interactivity.
Instead of trading off speed for scale, you can serve more tokens, to more users, without degrading responsiveness.
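A back-of-envelope way to see the split is arithmetic intensity: FLOPs performed per byte of weights moved from memory. The hidden size and token counts below are illustrative assumptions, not measurements from any real model.

```python
# Arithmetic intensity of one d_model x d_model weight matmul, comparing a
# prefill-sized batch against single-token decode. Illustrative numbers only.
d_model = 8192                                  # assumed hidden size

def intensity(batch_tokens: int) -> float:
    flops = 2 * batch_tokens * d_model * d_model    # multiply-accumulate count
    bytes_moved = 2 * d_model * d_model             # fp16 weights read once
    return flops / bytes_moved

print(f"prefill (2048 tokens/pass): {intensity(2048):,.0f} FLOPs/byte")
print(f"decode  (1 token/pass):     {intensity(1):,.0f} FLOPs/byte")
```

Modern accelerators need hundreds of FLOPs per byte to keep their compute units busy. Prefill clears that bar easily; single-stream decode, at roughly 1 FLOP per byte, is pinned to memory bandwidth no matter how many FLOPs the chip has.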
What Disaggregated Inference Actually Means
The fix is almost obvious once you see the problem: stop running both phases on the same chip.
Disaggregated inference splits prefill and decode onto different machines. One pool handles the heavy parallel computation of prefill. Another handles the fast, sequential memory reads of decode. Even when both pools use the same type of hardware, this separation alone reduces interference between phases and improves latency.
But the real step change comes from heterogeneous disaggregation: pairing each phase with hardware that matches its workload, with compute-optimized systems handling prefill and memory-bandwidth-optimized systems handling decode. This is where the biggest gains show up.
Think of it like a restaurant kitchen. Before disaggregation, the same chef is doing prep (prefill) and cooking dishes to order (decode), and every time a big new order comes in for prep, all the plates-in-progress sit waiting. With disaggregation, you have a prep team and a plating team. They work in parallel, passing the prepped ingredients (the KV cache) from one station to the next.
The result: prefill never interrupts decode. Each system does what it's best at. You get higher throughput, lower latency, and dramatically better tail latency.
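The serving loop described above can be sketched in a few lines. All of the names and the toy "model" here are invented for illustration; this is not any real framework's API, just the shape of the handoff: prefill pool builds the KV cache, the cache crosses the interconnect, and the decode pool generates from it.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: list[int]
    max_new: int = 4

class PrefillWorker:
    """Compute-optimized pool: one parallel pass over the whole prompt."""
    def run(self, req: Request) -> list[int]:
        # Stand-in for building the real KV cache from the prompt.
        return list(req.prompt_tokens)

class DecodeWorker:
    """Bandwidth-optimized pool: strictly sequential, one token per step."""
    def run(self, kv_cache: list[int], max_new: int) -> list[int]:
        out = []
        for _ in range(max_new):
            tok = (sum(kv_cache) + len(out)) % 50_000   # toy "next token"
            out.append(tok)
            kv_cache.append(tok)                        # cache grows each step
        return out

def serve(req: Request) -> list[int]:
    kv = PrefillWorker().run(req)                # pool A: prefill
    # ...KV cache shipped over the interconnect (EFA in the AWS design)...
    return DecodeWorker().run(kv, req.max_new)   # pool B: decode

print(serve(Request([1, 2, 3])))
```

The key design property is visible in `serve`: once the KV cache is handed off, a new prefill request never shares hardware with an in-flight decode, so a big incoming prompt can't stall tokens already being generated.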
A New Generation of Foundational Hardware
On March 13th, AWS and Cerebras announced a disaggregated inference partnership that makes this new architecture real and available at cloud scale.
The architecture: AWS Trainium chips handle prefill. Cerebras CS-3 systems handle decode. Connected via Amazon's Elastic Fabric Adapter (EFA) networking and available through Amazon Bedrock.
The Wafer-Scale Engine 3 (WSE-3) inside the CS-3 is the world's largest chip: 4 trillion transistors and 900,000 AI-optimized cores on a single processor. Instead of relying on off-chip HBM like GPUs do, the WSE-3 packs SRAM directly onto the chip: 44GB of on-chip SRAM with 21 petabytes per second of memory bandwidth. That's roughly 1,000-2,000x higher effective memory bandwidth than an NVIDIA B200, which still depends on off-chip HBM. And the entire chip is one massive die, eliminating the packaging and interconnect bottlenecks that limit every other architecture.
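To see what that bandwidth gap means for decode, here is a hedged back-of-envelope calculation. The model size and precision are assumptions (a 70B-parameter model in fp16), the B200 figure is its published ~8 TB/s HBM3e bandwidth, and the simplification that a single decode stream must move roughly all weights per token ignores batching, quantization, and MoE sparsity.

```python
# Decode throughput ceiling implied by memory bandwidth alone:
# a single stream must read ~all weights from memory for every token.
weight_bytes = 70e9 * 2            # assumed: 70B params, fp16 (2 bytes each)

b200_hbm_bw = 8e12                 # ~8 TB/s off-chip HBM3e (NVIDIA B200)
wse3_sram_bw = 21e15               # 21 PB/s on-chip SRAM (WSE-3, from above)

print(f"B200-class ceiling:  {b200_hbm_bw / weight_bytes:>9,.0f} tok/s")
print(f"WSE-3-class ceiling: {wse3_sram_bw / weight_bytes:>9,.0f} tok/s")
```

Under these assumptions the HBM-bound ceiling lands in the tens of tokens per second for a single stream, while the on-chip-SRAM ceiling is orders of magnitude higher, which is the gap the decode-side specialization is chasing.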
"By splitting the inference workload across Trainium and CS-3, each system does what it's best at. The result will be inference that's an order of magnitude faster than what's available today." - AWS VP David Brown
Why This Matters So Much for Developers
Agents become actually responsive. A 10-step agent chain at 50 tok/s takes 30+ seconds. At 1200 tok/s (GPT-Codex-5.3-Spark powered by Cerebras), the same chain finishes in under 3 seconds.
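The arithmetic behind those chain timings, assuming roughly 150 output tokens per agent step (an assumption; the text doesn't state a per-step token count):

```python
# Sequential agent chain: each step waits for the previous step's output.
steps, tokens_per_step = 10, 150    # assumed workload shape

for rate in (50, 1200):             # tok/s figures from the text
    total = steps * tokens_per_step / rate
    print(f"{rate:>5} tok/s -> {total:.1f} s for a {steps}-step chain")
```

Because the steps are serial, per-token latency multiplies straight through the chain, which is why a 24x speedup in token rate turns an unusable half-minute interaction into a near-instant one.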
Throughput scales dramatically. Specialized silicon reduces wasted work, allowing each system to serve far more tokens at high speed. We expect a disaggregated Trainium-Cerebras solution to increase token throughput by 5x over an aggregated one. Instead of constraining usage, that headroom goes toward more users, richer applications, and increasingly complex agent workflows.
Latency tail gets fixed. Your P95 is 3 seconds while your median is 500ms? That's prefill stall. Disaggregation eliminates it. Research shows up to 4.5x improvement in P95 latency on agentic workloads.
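A toy simulation makes the prefill-stall mechanism concrete. Every number here is invented for illustration: a ~500ms median decode latency, and on shared hardware a 10% chance that a request queues behind a long prefill.

```python
import random

def latencies(disaggregated: bool, n: int = 1000) -> list[float]:
    """Simulated request latencies, sorted ascending."""
    out = []
    for _ in range(n):
        base = random.gauss(0.5, 0.05)          # normal decode latency
        # On shared hardware, some requests stall behind a big prefill;
        # disaggregation removes that interference entirely.
        stall = 0.0 if disaggregated or random.random() > 0.10 else 2.5
        out.append(base + stall)
    return sorted(out)

random.seed(0)
for mode, label in ((False, "shared"), (True, "disagg")):
    xs = latencies(mode)
    p50, p95 = xs[len(xs) // 2], xs[int(len(xs) * 0.95)]
    print(f"{label}: p50={p50:.2f}s  p95={p95:.2f}s")
```

The median barely moves between the two modes; only the tail does. That matches the symptom described above: a healthy-looking P50 with a P95 several multiples worse.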
Every cloud provider is moving here: AWS and Cerebras, NVIDIA and Groq, Oracle, Azure. Every major LLM serving framework (Dynamo, SGLang, vLLM, llm-d) already supports disaggregation.
This shift toward heterogeneous, disaggregated systems is also reshaping the competitive landscape. Instead of a single vendor controlling the entire stack, different chips can compete at each stage of inference. That starts to chip away at NVIDIA’s end-to-end dominance and opens the door for more specialized hardware to win where it’s strongest.
For developers, that competition is a tailwind. More players means faster iteration and a constant push toward higher performance across the stack you’re building on.
The world YOU should build for
Two years ago it was all about training and who has the most H100s. That's over. Models are trained once but queried billions of times. The economics have flipped.
Computing demand for inference has grown by a factor of 1,000,000 in the last two years, and the industry now projects over $1 trillion in inference-driven infrastructure spending through 2027.
If you're a developer, the practical takeaway is simple: tokens are about to get faster within the next 18 months. Let’s build for that world.