For decades, AI hardware companies focused on increasing speed while ML researchers focused on improving model quality, and the two were treated as separate pursuits. When we started Cerebras, we shared that view. We believed that if we could build the world’s largest chip, ML applications would run much faster than they do on GPUs. We realized that goal in 2019 when we built the world’s first wafer-scale AI chip, and today we hold the inference speed record on Llama, Qwen, and Mistral models, often beating GPUs by 20x. What we didn’t anticipate, however, was that by building the fastest chip, we would fundamentally unlock model intelligence. That is to say, in today’s best models, higher speed directly translates to higher intelligence.
The growth of AI from GPT-1 to GPT-4 showed the power of scaling laws—namely, larger models trained on more data and compute produce predictably better results. But as the industry exhausted its internet-scale datasets, this once-reliable recipe began to plateau. OpenAI’s introduction of o1 in September 2024 put the industry on a fundamentally new trajectory: model intelligence shifted from growing parameters and training data to increasing the computation a model is allowed to perform during inference. The trend has since been replicated in open source as well, with models like DeepSeek R1 and Qwen3 showing clear upward curves in performance as a function of inference-time token budgets.
The Qwen3 evaluations released by Alibaba provide some of the most compelling data to date. On benchmarks in science, coding, and mathematics, Qwen3 shows performance that scales monotonically with the number of inference-time tokens—sometimes by margins as large as 40 percentage points. Every state-of-the-art model today is a reasoning model. The dumbest reasoning model is still smarter than the best non-reasoning model. This is no longer just a research trend—it’s become the default path forward for the entire industry.
If reasoning models are more capable, why do we not use them by default? Why do we still use anything else? The answer is speed. Reasoning models generate thousands of tokens as ‘internal monologue’ before outputting the final answer to the user. These extra tokens take a lot of time—DeepSeek R1 often takes minutes to produce a final answer. As a result, reasoning models today are used sparingly and real-time applications still default to faster non-reasoning models.
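To make the cost concrete, here is a back-of-the-envelope sketch. The token counts and per-user throughput figures below are illustrative assumptions, not measurements; the point is simply that latency is the number of tokens generated divided by decode speed.

```python
# Rough latency of a reasoning model that "thinks" before answering.
# All numbers below are illustrative assumptions, not measured values.

def decode_latency_s(reasoning_tokens: int, answer_tokens: int, tokens_per_s: float) -> float:
    """Time to stream the hidden reasoning trace plus the visible answer."""
    return (reasoning_tokens + answer_tokens) / tokens_per_s

# A hypothetical 3,000-token internal monologue plus a 300-token answer:
print(decode_latency_s(3_000, 300, 60))    # ~55 s at 60 tokens/s per user
print(decode_latency_s(3_000, 300, 100))   # ~33 s at 100 tokens/s per user
```

At those speeds, a single reasoning-heavy request keeps the user waiting for the better part of a minute, which is why interactive products fall back to non-reasoning models.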
GPUs cannot make reasoning models go much faster. A GPU system uses two separate chips: the GPU itself for computing tokens and HBM (high-bandwidth memory) for storing model weights. To run an LLM, the GPU must shuttle the entire model—often hundreds of billions of parameters—from memory to compute. To generate one token, the model weights must be streamed across the memory bus; to generate the next token, they are streamed again. Because the full model cannot fit on the GPU itself, this process repeats for every token. This makes GPU inference persistently bandwidth-bound, typically maxing out at a few hundred tokens per second.
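A rough sketch of that bound, under assumed figures for model size, precision, and HBM bandwidth: if every generated token requires reading all the weights once, decode speed cannot exceed memory bandwidth divided by model size.

```python
# Why GPU decode is bandwidth-bound: each token requires streaming the full
# set of weights from HBM, so speed is capped by memory bandwidth, not FLOPs.
# All figures below are illustrative assumptions.

def bandwidth_bound_tokens_per_s(param_count: float, bytes_per_param: float,
                                 bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-sequence decode speed: bandwidth / bytes read per token."""
    bytes_per_token = param_count * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token

QWEN3_32B = 32e9    # parameters (approximate)
FP16 = 2.0          # bytes per parameter (assumed serving precision)
HBM = 3.35e12       # ~3.35 TB/s per accelerator (assumed)

print(bandwidth_bound_tokens_per_s(QWEN3_32B, FP16, HBM))      # ~52 tokens/s on one device
print(bandwidth_bound_tokens_per_s(QWEN3_32B, FP16, 8 * HBM))  # ~420 tokens/s with 8-way tensor parallelism
```

Batching raises aggregate throughput across many users, but the decode speed any single user sees stays pinned near this memory-bandwidth ceiling.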
Cerebras solves this problem at the architectural level—our wafer-scale engine is so large that it stores the entire model on-chip. Qwen3-32B fits on a single chip, and larger models can be partitioned across multiple chips with no performance penalty. There is no external memory, no paging, and no bandwidth bottleneck. From the developer’s perspective, the hardware looks like one gigantic GPU with all the model weights on-chip. As a result, inference runs unreasonably fast: over 2,500 tokens/s, and reasoning models can return a final answer in about a second.
Single-second reasoning makes it possible to use reasoning models in just about any application. For example, you can replace GPT-4.1 with Qwen3-32B and get better results, higher speed, and lower cost. According to Artificial Analysis, Qwen3-32B with reasoning is already more intelligent than GPT-4.1. On Cerebras, it is 16× faster and costs one-tenth as much to run. For the first time, an open-source reasoning model can beat a closed-source frontier model and be deployed at real-time speed.
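As a minimal sketch of what that swap looks like in practice, the snippet below calls a Qwen3-32B deployment through an OpenAI-compatible client. The endpoint URL and model identifier are assumptions for illustration; check the current Cerebras inference documentation for the exact values.

```python
# Minimal sketch: calling Qwen3-32B through an OpenAI-compatible client.
# The base_url and model name are assumptions; consult the provider docs.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen-3-32b",  # assumed model identifier
    messages=[{"role": "user", "content": "Plan a 3-step migration from GPT-4.1 to an open-source reasoning model."}],
)
print(response.choices[0].message.content)
```

Because the interface is OpenAI-compatible, swapping an existing GPT-4.1 integration is largely a matter of changing the base URL and model name.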
We started the company to build the fastest chip. And we succeeded. But the deeper insight—the one that defines our direction now—is that speed has become the enabler of intelligence. OpenAI formalized the test-time compute scaling law: the more tokens a model uses at inference, the more intelligent it is. We observed that latency requirements in real-world applications constrain the number of tokens that can be generated. Thus the only way to use more tokens is to increase speed, leading to the Cerebras scaling law: faster inference results in higher model intelligence.
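A minimal sketch of that relationship, using assumed speeds and an assumed two-second interactive latency budget: the token budget available at inference time is simply speed multiplied by the latency the application can tolerate.

```python
# Illustration of the argument above: under a fixed latency budget, the
# reasoning-token budget is speed * budget, so faster inference directly buys
# more test-time compute. Speeds and the 2-second budget are assumptions.

LATENCY_BUDGET_S = 2.0  # what a real-time application can tolerate (assumed)

for label, tokens_per_s in [("typical GPU serving", 100), ("wafer-scale inference", 2_500)]:
    token_budget = tokens_per_s * LATENCY_BUDGET_S
    print(f"{label}: ~{token_budget:,.0f} reasoning tokens within {LATENCY_BUDGET_S:.0f} s")

# typical GPU serving: ~200 reasoning tokens within 2 s
# wafer-scale inference: ~5,000 reasoning tokens within 2 s
```

Under the same latency constraint, the faster system can afford an order of magnitude more reasoning tokens, and with them the accuracy gains the test-time scaling curves describe.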
This has become our new North Star. We will continue to push the limits of inference speed, but not just for speed’s sake. We do it to move the Pareto frontier of model capability—so that the most intelligent models in the world aren’t just reserved for the most difficult offline problems, but can be deployed in real-time applications, everywhere.