I typed a single sentence into one of the world's most advanced language models: "Write a function to parse JSON out of markdown code blocks." Then I waited. The cursor blinked. I shifted in my chair. "Thinking..." I checked Instagram stories. By the time the model was done, I'd already gotten pulled into a meeting.
The response was beautiful. The experience was far from ideal. And if you've been building with frontier AI models, you've probably felt this too. This is the best technology humans have ever built, and using it often feels like watching paint dry.
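The task itself is tiny. Here is a minimal sketch of the kind of function that prompt asks for, in plain Python with only the standard library (my sketch, not the model's actual output):

```python
import json
import re

def parse_json_from_markdown(text: str):
    """Extract and parse the first valid JSON payload found inside a
    fenced markdown code block (tagged as json or left bare)."""
    pattern = r"```(?:json)?\s*\n(.*?)\n```"
    for block in re.findall(pattern, text, flags=re.DOTALL):
        try:
            return json.loads(block)
        except json.JSONDecodeError:
            continue  # not valid JSON, try the next fenced block
    raise ValueError("no JSON code block found")
```

A few lines of regex and json.loads. The point isn't the code; it's the wait.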
What is ‘Latency Debt’?
In software engineering, "technical debt" refers to the accumulated cost of shortcuts and sloppy code that work today but create problems tomorrow. Engineers move fast, auto-accept AI suggestions, and defer the cleanup.
Latency debt works the same way. Over the past several years, we've spent enormous resources making AI models more capable: bigger, smarter, and more contextually aware. What we haven't done is update the hardware to support this new generation of software.
Latency debt is the hidden cost we accumulated while optimizing models faster than infrastructure. We optimized for intelligence, but each advance, from bigger parameter counts to reasoning models, demands more compute per response.
The bill comes due as sluggish user experiences, real-time applications that aren't actually real-time, and AI agents that are theoretically capable but practically too slow to be useful.
The Ingredients of Latency Debt
To understand how we got here, let's look at the last few years of AI advancements. While each trend represents AI progress, every one of them also adds to our latency debt.
Models are getting bigger.
In 2018, BERT-Large had 340M parameters. Today, frontier models like Kimi-k2 exceed 1T parameters, a 3,000x increase. Bigger models are more capable, but they also require more computation for every token they generate.
Reasoning tokens have taken over.
Modern AI models don't just generate responses; they think first.
These intermediate "reasoning tokens" are the model working through a problem step-by-step before producing its final answer. A year ago, reasoning tokens made up less than 10% of total token generation. Today, they account for more than 50%.
Context windows are getting longer.
2019’s GPT-2 could handle about 1,000 tokens of context. Today's models can process 10 million tokens, roughly thirty 500-page books. This is transformative for applications that need to reason over large documents or entire codebases.
It also means the model has vastly more information to process before it can respond.
Developers are using that context.
In the past year alone, average input token lengths have quadrupled. Users feed models entire repositories, complete documentation sets, and full conversation histories.
Output lengths have exploded.
On the output side, average token counts have also tripled in the past year, largely thanks to test-time compute.
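Taken together, these trends compound. A rough way to see it: response time is roughly the time to process the prompt plus every generated token, reasoning included, divided by the decode speed. Here's a back-of-envelope sketch in Python; all throughput numbers are illustrative assumptions, not benchmarks of any particular system:

```python
def response_latency_seconds(
    input_tokens: int,
    reasoning_tokens: int,
    output_tokens: int,
    prefill_tokens_per_sec: float = 10_000,  # assumed prompt-processing speed
    decode_tokens_per_sec: float = 60,       # assumed generation speed
) -> float:
    """Back-of-envelope model: process the whole prompt, then generate
    every reasoning and output token one at a time."""
    prefill = input_tokens / prefill_tokens_per_sec
    decode = (reasoning_tokens + output_tokens) / decode_tokens_per_sec
    return prefill + decode

# A long prompt plus heavy reasoning: ~5 s of prefill + ~42 s of decoding.
print(response_latency_seconds(50_000, 2_000, 500))  # ≈ 46.7 seconds
```

Every trend above grows one of those inputs, and every token in the decode term is generated sequentially. That's why the debt compounds.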
What Latency Debt Feels Like
Latency debt is the coding copilot that breaks your flow because you're waiting five minutes for a one-line suggestion. It's the AI agent that theoretically could book your flight, check your calendar, and draft your emails, but takes so long on each step that you'd be faster doing it yourself.
Humans have always been extremely latency-sensitive. A 1968 IBM study of computer response times found that delays longer than two seconds cause significant drops in engagement, and that delays above four seconds "become embarrassing because they imply a breaking of the thread of communication." Beyond 15 seconds, conversational interaction breaks down and users mentally check out.
More recently, a 2025 University of Central Florida study on LLM-powered agents found the same thresholds. Latency above four seconds degraded every measured dimension of user experience, and 74% of users said they wouldn't use a slow system at all.
A Pattern That Keeps Repeating
This pattern has happened before. In computing, workloads define hardware. Each time the workload changes, the hardware that served the old workload becomes a bottleneck for the new one.
1985 (Custom hardware → CPUs): Computing shifted from specialized scientific simulation to programmable, general-purpose software.
2003 (CPU → GPU): New workloads like graphics rendering, and later machine learning, demanded massive parallelism. CPUs became the bottleneck; GPUs were the solution.
2026 (GPU → Cerebras WSE): Today, our workloads are once again shifting as we build multi-agent systems, coding assistants, voice interfaces, and human-in-the-loop workflows. These workloads are sequential, memory-bound, and brutally latency-sensitive. GPUs weren't designed for this.
Luckily, the hardware transition is already underway. Cerebras's Wafer-Scale Engine (WSE), a single chip the size of a dinner plate, is a newer-generation architecture optimized for low-latency AI workloads. Unlike GPUs, which shuttle data back and forth between the chip and off-chip memory, the WSE keeps everything on-chip, eliminating the memory-bandwidth constraint that throttles GPU inference.
The Smart Money Is Already Moving
Here's how you know latency debt is real: follow the money.
In the past six months, the four most important companies in AI have all made massive, billion-dollar bets on faster inference, and none of those bets were on the NVIDIA hardware that dominates AI today.
Google, despite being one of NVIDIA's largest customers, built its own Ironwood TPU, 4x faster than NVIDIA's GPUs for inference workloads.
Anthropic committed tens of billions of dollars to Google's TPU infrastructure rather than expanding its NVIDIA capacity.
NVIDIA itself paid $20 billion to acquire the IP and top talent of Groq, a company whose entire value proposition was inference speed. It is NVIDIA's largest purchase to date.
And OpenAI just purchased 750 MW worth of compute from Cerebras. Cerebras builds wafer-scale chips designed specifically for the kind of sequential, memory-bound workloads required for AI inference.
Four companies that compete on everything just made the same bet. That is not a coincidence.
Paying Down the Debt
Latency debt isn't a permanent condition. The acquisitions and investments described above are the beginning of a transition to newer, state-of-the-art hardware designs optimized for AI inference.
Even better, the payoff is immediate and obvious. When you run a frontier model on hardware designed for inference, the experience is visceral. The same model that made you wait thirty seconds responds in under two. AI evolves from being a bottleneck to a collaborator.
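If you want to check this yourself, the simplest measurement is time-to-first-token on a streaming request. Here is a minimal sketch using the OpenAI-compatible Python client; the endpoint URL, model name, and environment variable are assumptions, so substitute whichever provider and model you're comparing:

```python
import os
import time

from openai import OpenAI

# Assumed OpenAI-compatible endpoint and model; swap in your own.
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=os.environ["CEREBRAS_API_KEY"],
)

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Write a function to parse JSON out of markdown code blocks"}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        chunks += 1

total = time.perf_counter() - start
print(f"time to first token: {first_token_at:.2f}s, total: {total:.2f}s, ~{chunks / total:.0f} chunks/s")
```

Run the same prompt against the hardware you use today, and the difference shows up in the very first number.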
We've made measurable strides in teaching machines to think. But to realistically integrate them into our workflows and lives, we require hardware designed for speed-of-thought interaction. That's how we pay down latency debt.
Ready to stop waiting on AI? You can experience the Cerebras inference API for free here :)
Note:
A thank you to Sarah Su, Zhenwei Gao, Swyx, and Bill Chen for their feedback and contributions.