A year ago, Cerebras launched its inference API, setting a new benchmark for AI performance. While GPU-based providers were generating 50 to 100 tokens per second, Cerebras delivered 1,000 to 3,000 tokens per second across a range of open-weight models such as Llama, Qwen, and GPT-OSS. At the time, skeptics argued that beating NVIDIA's Hopper-generation GPUs was one thing, but the real test would come with the next-generation Blackwell GPUs. Now that cloud providers are finally rolling out GB200 Blackwell systems in late 2025, it's time to revisit the question: who is faster at AI inference, NVIDIA or Cerebras?
The Open-Weight Showdown: GPT-OSS 120B
OpenAI’s GPT-OSS-120B is today’s leading open-weight model developed by a U.S. company, widely used for its strong reasoning and coding capabilities. Based on benchmarks by Artificial Analysis, most vendors today run GPT-OSS-120B in the 100 to 300 tokens per second range, reflecting the performance of the widely deployed NVIDIA H100 GPUs.
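Numbers like these are straightforward to sanity-check yourself. The sketch below times a single request against an OpenAI-compatible endpoint and divides output tokens by elapsed time; the base URL and model identifier are assumptions, and a rigorous benchmark such as Artificial Analysis's would stream tokens and exclude time to first token.

```python
# Crude end-to-end throughput check against an OpenAI-compatible API.
# The base_url and model id below are assumptions, not confirmed values.
import os
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",  # assumed OpenAI-compatible endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],
)

start = time.perf_counter()
completion = client.chat.completions.create(
    model="gpt-oss-120b",  # assumed model identifier
    messages=[{"role": "user", "content": "Write a 500-word story about a lighthouse."}],
    max_tokens=1024,
)
elapsed = time.perf_counter() - start

# Includes time to first token, so it understates steady-state generation speed.
tokens = completion.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:,.0f} tokens/sec")
```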
Last month, Baseten published the first results of GPT-OSS-120B running on NVIDIA's latest Blackwell GPUs, reaching 650 tokens per second, the fastest result ever achieved on GPUs at the time. To reach that milestone, Baseten ran the model on eight GB200 GPUs connected over NVLink, using eight-way tensor parallelism (TP8) to distribute the model, the TensorRT-LLM inference engine served through NVIDIA Dynamo, and EAGLE-3 speculative decoding to accelerate token generation. It was an impressive result showcasing the peak performance of Blackwell in a production-ready inference cloud.
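Baseten's exact serving stack isn't public beyond the components named above, but for readers curious what such a deployment looks like in code, here is a minimal sketch using the TensorRT-LLM LLM API with eight-way tensor parallelism. The model path and sampling settings are illustrative, and the EAGLE-3 and Dynamo layers, which sit at the serving tier, are omitted.

```python
# Minimal TP8 deployment sketch with the TensorRT-LLM LLM API.
# Model path and sampling settings are illustrative placeholders.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",  # assumed checkpoint location
    tensor_parallel_size=8,       # TP8: shard the model across eight GPUs over NVLink
)

outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```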
With this result, Baseten not only outperformed every other GPU vendor in the market but also surpassed companies like Groq that rely on speed as their main differentiator. Blackwell showed that a 2-3x speed advantage over GPUs is quickly erased when NVIDIA refreshes its hardware every year.
To Baseten's credit, it included Cerebras in its results, showing our hardware running the GPT-OSS-120B model at over 3,000 tokens per second. This is made possible by our wafer-scale architecture, which stores the entire model in on-chip memory and thereby eliminates the memory-bandwidth bottleneck of GPUs. Remarkably, the Cerebras Wafer Scale Engine 3, launched in 2024, still outperforms NVIDIA's latest Blackwell generation by almost 5x, underscoring the enduring advantage of a compute architecture purpose-built for large-scale AI inference.
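To see why on-chip memory matters, consider a simple decode roofline: generating each token requires streaming the model's active weights through memory, so single-stream speed is bounded above by memory bandwidth divided by bytes read per token. The sketch below illustrates the gap using rough, assumed numbers; the weight footprint and bandwidth figures are illustrative, not vendor specifications.

```python
# Illustrative decode roofline: tokens/sec <= memory bandwidth / bytes read per token.
# All figures below are rough assumptions for illustration, not measured specs.

def roofline_tokens_per_sec(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    return bandwidth_gb_s / weights_read_gb

# Assume a mixture-of-experts model that touches roughly 5 GB of weights per token.
weights_read_gb = 5.0

hbm_gb_s = 8_000        # assumed HBM bandwidth for a Blackwell-class GPU
on_chip_gb_s = 100_000  # conservative placeholder; wafer-scale SRAM bandwidth is far higher

print(f"HBM roofline:     ~{roofline_tokens_per_sec(hbm_gb_s, weights_read_gb):,.0f} tokens/sec")
print(f"On-chip roofline: ~{roofline_tokens_per_sec(on_chip_gb_s, weights_read_gb):,.0f} tokens/sec")
```

Batching, speculative decoding, and parallelism shift these bounds in practice, but the ordering holds: the architecture with orders of magnitude more weight bandwidth has far more headroom per stream.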
Not Just Fast, Great Price–Performance
Cerebras has always been known for raw speed, but we are often asked: is it worth the cost? In most high-performance products, speed follows the law of diminishing returns. A Ferrari costs 10x more than a Camry but is barely 3x faster. So where does Cerebras fall on that curve?
Cerebras delivers 3,000 tokens per second at $0.75 per million tokens, while Baseten delivers 650 tokens per second at $0.50 per million, a price-performance ratio of 4,000 versus 1,300 (tokens per second per dollar per million tokens). In other words, Cerebras is only modestly more expensive yet several times faster, the inverse of the Ferrari and Camry example. You're not paying 10x more for a marginal gain; you're paying slightly more for a huge leap in performance.
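The arithmetic behind that ratio is simple enough to check in a few lines, using only the figures quoted above.

```python
# Price-performance from the figures above: throughput divided by price.
# Higher is better: tokens per second delivered per dollar per million tokens.

def price_performance(tokens_per_sec: float, usd_per_m_tokens: float) -> float:
    return tokens_per_sec / usd_per_m_tokens

cerebras = price_performance(3_000, 0.75)  # -> 4,000
baseten = price_performance(650, 0.50)     # -> 1,300

print(f"Cerebras: {cerebras:,.0f}   Baseten GB200: {baseten:,.0f}")
print(f"Speed ratio: {3_000 / 650:.1f}x   Price ratio: {0.75 / 0.50:.1f}x")  # ~4.6x faster, 1.5x pricier
```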
Cerebras – Still the Fastest Inference in 2025
NVIDIA Blackwell is a substantial upgrade over Hopper, improving the top speed of GPU inference by 2-3x and leapfrogging small-chip AI competitors like Groq. Cerebras remains the only architecture that outperforms NVIDIA, with a commanding 5x lead on OpenAI's flagship open-weight model. We look forward to revisiting the leaderboards again in 2026.