The world’s fastest inference is coming to the world’s leading cloud. Today we're announcing that Amazon Web Services is deploying Cerebras CS-3 systems in AWS data centers. Available via Amazon Bedrock, the new service will offer leading open-source LLMs and Amazon’s Nova models running at the industry’s highest inference speed. In addition, AWS and Cerebras are collaborating on a new disaggregated architecture that pairs AWS Trainium with Cerebras WSE to deliver 5x more high-speed token capacity in the same hardware footprint.
The Need for Fast Inference
AI is reshaping software development. Code is increasingly written by AI agents rather than by human developers. Unlike conversational chat, agentic coding generates approximately 15x more tokens per query and demands high-speed token output to keep developers productive. The result is an urgent and growing need for fast inference capacity across the industry.
Cerebras is the market leader in high-speed AI inference, powering models from OpenAI, Cognition, and Meta at up to 3,000 tokens per second. Today's announcement brings that speed to AWS's global customer base. And through our unique collaboration on disaggregated inference, we'll be able to provide 5x more capacity for high-speed inference.
Disaggregated Inference
Every time you ask AI a question, two distinct kinds of computation happen: prefill and decode. Prefill processes the question, while decode generates the answer. Prefill is compute-bound and needs relatively little memory bandwidth. Decode must fetch the entire model from memory for every token it generates, making it extremely bandwidth-intensive. Today's AI accelerators run both phases on the same chip. That approach is simple and flexible, but it leaves performance on the table: each phase is better served by hardware specialized for it.
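To make the two phases concrete, here is a toy single-layer attention sketch in NumPy; the shapes, names, and model are illustrative assumptions, not Cerebras or AWS code. Prefill runs one batched pass over the whole prompt and emits the KV cache; decode then produces one token per step, touching the full weight set to generate a single row of output each time.

```python
import numpy as np

d_model, vocab = 64, 1000
rng = np.random.default_rng(0)
W_qkv = rng.standard_normal((d_model, 3 * d_model)) * 0.02   # toy model weights
W_out = rng.standard_normal((d_model, vocab)) * 0.02

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def prefill(prompt_embeds):
    """Prefill: one batched pass over every prompt token, so the cost of
    reading the weights is amortized across the whole prompt (compute-bound)."""
    q, k, v = np.split(prompt_embeds @ W_qkv, 3, axis=-1)
    ctx = softmax(q @ k.T / np.sqrt(d_model)) @ v
    return (k, v), ctx[-1]                       # KV cache + last hidden state

def decode_step(kv_cache, hidden):
    """Decode: one token at a time, so every weight read produces only a
    single row of output (bandwidth-bound)."""
    k_cache, v_cache = kv_cache
    q, k, v = np.split(hidden[None, :] @ W_qkv, 3, axis=-1)
    k_cache, v_cache = np.vstack([k_cache, k]), np.vstack([v_cache, v])
    ctx = softmax(q @ k_cache.T / np.sqrt(d_model)) @ v_cache
    next_token = int((ctx @ W_out).argmax())
    return (k_cache, v_cache), ctx[0], next_token

prompt = rng.standard_normal((16, d_model))      # stand-in for an embedded prompt
kv, hidden = prefill(prompt)                     # prefill once...
for _ in range(4):                               # ...then decode token by token
    kv, hidden, token = decode_step(kv, hidden)
```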
Trainium is Amazon's purpose-built AI chip, designed for scalable performance and cost efficiency across a broad range of generative AI workloads. Its dense compute cores are especially suited for the prefill phase. The Cerebras CS-3 is the world's fastest AI inference system. Storing all model weights on-chip in SRAM, it delivers thousands of times greater memory bandwidth than the fastest GPU, making it the fastest processor for decode.
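A rough back-of-envelope calculation shows why decode hinges on memory bandwidth. The model size and target speed below are illustrative assumptions, but the arithmetic is generic: every generated token has to stream the full weight set past the compute.

```python
params = 70e9               # assumed 70B-parameter model
bytes_per_weight = 2        # FP16
tokens_per_second = 1_000   # assumed target decode speed

weight_bytes = params * bytes_per_weight                 # ~140 GB of weights
required_bandwidth = weight_bytes * tokens_per_second    # weight reads per second
print(f"~{required_bandwidth / 1e12:.0f} TB/s of weight bandwidth")   # ~140 TB/s
```

A requirement well over a hundred terabytes per second is far beyond the few TB/s that a GPU's HBM provides, which is why keeping the weights resident in on-chip SRAM matters so much for decode speed.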
Through our collaboration, we are building a disaggregated configuration that draws on the best of both companies. In disaggregated mode, Trainium focuses exclusively on prefill: it computes the KV cache and sends it to the WSE over Amazon's high-speed EFA interconnect. The Cerebras WSE then performs decode exclusively, generating thousands of output tokens per second versus hundreds on GPUs. This architecture plays each processor to its strengths and gives AWS customers a 5x boost in high-speed token volume.
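Structurally, the handoff looks something like the sketch below: a prefill worker builds the KV cache and ships it across an interconnect to a decode worker, which generates tokens from it without ever re-running prefill. The transport here is a plain in-process queue and the model work is stubbed with random tensors; the worker names, shapes, and helpers are illustrative assumptions, not AWS or Cerebras APIs.

```python
import queue
import numpy as np

kv_link = queue.Queue()      # stand-in for the EFA path between the two tiers

def prefill_worker(prompt_tokens: np.ndarray, request_id: int) -> None:
    # Prefill tier (Trainium in the announced design): a compute-heavy pass
    # over the whole prompt that produces the per-layer KV cache.
    d_head, n_layers = 64, 2
    kv_cache = [(np.random.randn(len(prompt_tokens), d_head),    # keys (stub)
                 np.random.randn(len(prompt_tokens), d_head))    # values (stub)
                for _ in range(n_layers)]
    kv_link.put((request_id, kv_cache))

def decode_worker(max_new_tokens: int = 4) -> tuple[int, list[int]]:
    # Decode tier (Cerebras WSE in the announced design): receives the KV
    # cache and generates tokens one at a time; prefill is never repeated here.
    request_id, kv_cache = kv_link.get()
    generated = []
    for _ in range(max_new_tokens):
        for i, (k, v) in enumerate(kv_cache):   # append this step's K/V rows
            kv_cache[i] = (np.vstack([k, np.random.randn(1, k.shape[1])]),
                           np.vstack([v, np.random.randn(1, v.shape[1])]))
        generated.append(int(np.random.randint(0, 32_000)))   # sampled token (stub)
    return request_id, generated

prefill_worker(np.arange(16), request_id=1)
print(decode_worker())
```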
“Inference is where AI delivers real value to customers, but speed remains a critical bottleneck for demanding workloads like real-time coding assistance and interactive applications,” said David Brown, Vice President, Compute & ML Services, AWS. “What we're building with Cerebras solves that: by splitting the inference workload across Trainium and CS-3, and connecting them with Amazon’s Elastic Fabric Adapter, each system does what it's best at. The result will be inference that's an order of magnitude faster and higher performance than what's available today.”
AWS and Cerebras will support both aggregated and disaggregated configurations. Disaggregated mode is ideal for large, stable workloads. Most customers, however, run a mix of workloads with different prefill/decode ratios, where the traditional aggregated approach remains the better fit. We expect most customers will want access to both, with the ability to route each workload to whichever configuration serves it best.
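One way to picture that routing choice is a small per-request policy like the sketch below; the pool names, request fields, and thresholds are assumptions for illustration, not a published AWS or Cerebras policy.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int            # drives prefill work
    expected_output_tokens: int   # drives decode work

def choose_pool(req: Request, steady_traffic: bool) -> str:
    # Disaggregated serving pays off when decode dominates and traffic is
    # steady enough to keep both tiers busy; otherwise route to the
    # traditional aggregated pool.
    decode_heavy = req.expected_output_tokens > req.prompt_tokens
    return "disaggregated" if steady_traffic and decode_heavy else "aggregated"

print(choose_pool(Request(2_000, 8_000), steady_traffic=True))    # disaggregated
print(choose_pool(Request(500, 200), steady_traffic=False))       # aggregated
```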
We are very excited to be working with the AWS team on this unique collaboration. Disaggregated inference is a deep technical undertaking spanning AI hardware, models, and infrastructure. AWS brings world-class expertise in custom silicon, networking, and distributed computing. Cerebras brings a decade of innovation in wafer-scale system architecture, model expertise, and inference serving. Together, we are combining two exceptional and complementary engineering teams to build the world's fastest AI inference, in the #1 cloud, at unprecedented scale. We look forward to bringing this to our customers in the coming months.