
Mar 20 2026

Why the AI Race Shifted to Speed

For most of 2025, the AI race was about model intelligence. In the past three months, the race has shifted. Model intelligence is still critical, but across every major frontier lab, inference speed has become a new and urgent focus:

  • Google unveiled Gemini 3 Flash. Built for agentic coding, it runs 3x faster than Gemini 3 Pro.
  • Anthropic released a 2.5x-faster edition of Claude Opus 4.6 for speed-critical coding use cases.
  • OpenAI announced a partnership with Cerebras to release GPT-5.3-Codex-Spark, running at over 1,200 tokens/s, making it the fastest OpenAI coding model to date.

Why has inference speed suddenly become so important? Because the rate at which a model generates tokens now directly affects the rate of model iteration for the major labs and the rate of building software for the broader economy.
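
To make that arithmetic concrete, here is a minimal back-of-envelope sketch. Only the 1,200 tokens/s figure comes from the announcements above; the 120 tokens/s baseline and the 50-million-token budget are illustrative assumptions, not published numbers.

```python
# Back-of-envelope: wall-clock time to generate a fixed token budget
# at different inference speeds. 1,200 tokens/s is the announced
# GPT-5.3-Codex-Spark figure; the baseline speed and the token budget
# are illustrative assumptions, not published numbers.

TOKEN_BUDGET = 50_000_000  # assumed tokens generated in one dev cycle

for label, tokens_per_sec in [("baseline model", 120), ("Codex-Spark", 1_200)]:
    seconds = TOKEN_BUDGET / tokens_per_sec
    print(f"{label:>14}: {seconds / 3600:7.1f} hours ({seconds / 86400:.1f} days)")

# baseline model:   115.7 hours (4.8 days)
#    Codex-Spark:    11.6 hours (0.5 days)
```

At a fixed token budget, a 10x throughput gain turns nearly five days of agent runtime into half a day, and that delta compounds across every iteration loop.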

In February, both OpenAI and Anthropic disclosed that they are now using their own coding models to build the next versions of their AI models. This is an extraordinary disclosure. In OpenAI's blog post "Harness Engineering," they write:

GPT‑5.3‑Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training, manage its own deployment, and diagnose test results and evaluations — our team was blown away by how much Codex was able to accelerate its own development.

The post describes a team of three engineers who used Codex to produce a million lines of production code in five months — building the product in roughly one-tenth the time it would have taken by hand. Humans never manually wrote a single line of code. They prompted the agent, reviewed its pull requests, and removed obstacles. As OpenAI put it: "Humans steer. Agents execute."

Anthropic's story is similar. When they released the 2.5x-faster edition of Claude Opus 4.6, they acknowledged it was the speed at which they had already been running the model internally. Boris Cherny, head of Claude Code at Anthropic, has publicly stated that 100% of his code has been AI-written for over two months, and that roughly 90% of Claude Code's own codebase is written by Claude Code itself. In short, Anthropic has been using their coding tools to build their next products, and until recently they had reserved the fastest versions of their models for themselves.

The implications are profound and clear — the recursive moment in software development has arrived, and when it comes to inference, the faster your token output, the faster you can ship your next product. Every lab is racing to build a more capable model. It used to be that whoever had the largest training cluster got there first. Now, all else being equal, whoever uses the fastest inference during model development crosses the finish line first. Inference speed is now on the critical path to developing the next frontier model, and by extension, AGI.

If fast inference is truly this important, it ought to be very valuable. One way to verify this is to look at how Anthropic prices its models relative to their intelligence and speed.

Claude API Token Pricing

Anthropic’s flagship model Opus 4.6 is priced at a 66% premium over its mid-tier Sonnet 4.6 model. Meanwhile Opus 4.6 Fast, which runs 2.5x faster, is priced at 6x the base model. Anthropic’s pricing affirms that speed is now important enough to warrant its own category, and judging by the premium, it is perhaps even more valuable than a step up in model intelligence.
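
The arithmetic behind that claim is worth making explicit. The sketch below uses only the ratios stated above, with prices normalized to Sonnet; no absolute dollar figures are assumed.

```python
# Relative Claude pricing from the ratios above, normalized to
# Sonnet 4.6 = 1.0. No absolute per-token prices are assumed.

sonnet    = 1.0
opus      = sonnet * 1.66   # intelligence premium: +66% over Sonnet
opus_fast = opus * 6.0      # speed premium: 6x the base Opus price

print(f"Intelligence step-up (Sonnet -> Opus): {opus / sonnet:.2f}x")
print(f"Speed step-up (Opus -> Opus Fast):     {opus_fast / opus:.2f}x")

# Even after adjusting for the 2.5x throughput gain, Opus Fast tokens
# cost 6 / 2.5 = 2.4x more per unit of speed delivered, a steeper
# premium than the 1.66x intelligence step-up.
print(f"Speed premium per unit of throughput:  {6.0 / 2.5:.2f}x")
```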

Fast Inference Accelerates Time to Revenue

Inference speed is not just strategic for OpenAI and Anthropic. It is strategic for any company building and shipping software products.

Consider two companies — Company A and Company B — both building a new AI-powered CRM. Company A uses a leading frontier model and finishes development in six weeks.

Company B has the same idea, team talent, and funding, but it uses a frontier model running on fast inference, shipping the first version in just three weeks. In the subsequent weeks, Company B rapidly iterates based on user feedback. Version 3 of the product goes viral, reaching $10M in ARR by week 8. Meanwhile, Company A is still learning from its first product release. Fast inference in this case directly accelerated product iteration and time to revenue.
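
A minimal sketch of the cadence math behind this hypothetical: the six-week and three-week build times come from the story above, while Company B's post-launch iteration pace is an assumption chosen to match version 3 shipping by week 8.

```python
# Illustrative shipping-cadence model for the Company A / B story.
# Cycle lengths are hypothetical, not measured data.

def versions_by(week: float, build_weeks: float, iter_weeks: float) -> int:
    """Versions shipped by a given week: one initial build,
    then fixed-length iteration cycles."""
    if week < build_weeks:
        return 0
    return 1 + int((week - build_weeks) // iter_weeks)

# Company A: 6-week build, same pace for each follow-up release.
print("Company A:", versions_by(8, build_weeks=6, iter_weeks=6))    # -> 1
# Company B: 3-week build, then ~2.5-week iterations (assumed).
print("Company B:", versions_by(8, build_weeks=3, iter_weeks=2.5))  # -> 3
```

By week 8, Company B has shipped three versions and absorbed two rounds of user feedback; Company A has shipped one.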

The example above might sound a little fantastical, but it is already happening in the real economy. In Stripe’s 2025 Annual Letter, the payments company revealed that the number of companies reaching $10M ARR within 3 months of launch doubled versus 2024. This is almost certainly driven by the growing adoption of agentic coding. We expect 2026 to see even more dramatic acceleration as developers adopt more powerful coding agents running at speeds an order of magnitude higher than in 2025.

The pattern holds not just for startups but for enterprises as well. January 2026 was a reckoning for SaaS companies of every stage and size. Teams are scrambling to rebuild their product stacks and revenue models. Those with the fastest and most capable coding agents are the most likely to find their footing in the post-agent economy.

Speed has always been the driver of the digital economy. In the 1990s, companies bought the fastest computers they could afford. In the 2000s, they raced to secure the fastest internet connections. In the AI era, high-speed inference is the critical infrastructure. Cerebras has been focused on speed since day one. The trajectory of the industry — models building models, coding agents replacing manual development, go-to-market velocity becoming a function of token throughput — makes one thing clear: speed will only grow in importance from here.

Performance comparisons are based on third-party benchmarking or internal testing. Observed inference speed improvements versus GPU-based systems may vary depending on workload, configuration, date and models being tested.

