Dec 04 2025
Cerebras at NeurIPS 2025: Nine Papers From Pretraining to Inference
Cerebras is excited to be at NeurIPS 2025—and what a year it's been. We launched our inference API last August, opened a new data center in Oklahoma City, and watched demand for Cerebras Inference explode with the latest state-of-the-art open-weight models.
Our research team has been hard at work too, and this year they're presenting nine papers probing the foundational questions of modern AI practice: Where does compute get wasted during training? How should reasoning models allocate tokens at inference? When do smaller models beat bigger ones? The work spans pretraining to inference, with new findings on scaling laws, training efficiency, and smarter orchestration of test-time compute.
Below is an overview of each paper, what we found, and who should care. Links to the full arXiv papers are included. If you're at the conference, stop by our booth and take a selfie with the wafer. If you're not, we'd still love to hear your thoughts via X or email.
The Conductor and the Engine: A Path Towards Co-Designed Reasoning
CODA (Conductor-Driven Architecture) optimizes test-time compute across planning, execution, self-refinement, and verification based on problem difficulty—helping 32B models outperform 235B+ competitors in our benchmarks. The insight is that most test-time compute gets wasted: external orchestration and internal model reasoning both try to handle high-level thinking, duplicating effort and burning tokens. CODA separates the roles cleanly—the Conductor plans, reflects, and verifies; the Engine executes—unlocking frontier-level performance without frontier-level model sizes.
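To make the division of labor concrete, here's a minimal Python sketch of a Conductor/Engine loop. The function names, prompts, and the `llm_call` stub are hypothetical illustrations of the idea, not CODA's actual interface.

```python
# Minimal sketch of a conductor/engine split for test-time reasoning.
# All names (llm_call, prompts, the two solution paths) are hypothetical
# illustrations of the idea, not CODA's actual interface.

def llm_call(model: str, prompt: str) -> str:
    """Stub for an LLM API call; replace with a real inference client."""
    return f"[{model} response to: {prompt[:40]}...]"

def conductor_plan(problem: str) -> dict:
    # The Conductor decides how much compute the problem deserves and which
    # solution path (direct reasoning vs. code generation) to try first.
    plan = llm_call("conductor-32b", f"Assess difficulty and pick a path for: {problem}")
    return {"path": "code" if "code" in plan.lower() else "direct", "plan": plan}

def engine_solve(problem: str, path: str) -> str:
    # The Engine only executes; it is never asked to plan or self-verify.
    prompt = (f"Write and run code to solve: {problem}" if path == "code"
              else f"Solve step by step: {problem}")
    return llm_call("engine-32b", prompt)

def conductor_verify(problem: str, answer: str) -> bool:
    verdict = llm_call("conductor-32b", f"Verify this answer to '{problem}': {answer}")
    return "correct" in verdict.lower()

def coda_loop(problem: str, max_rounds: int = 3) -> str:
    state = conductor_plan(problem)
    answer = ""
    for _ in range(max_rounds):
        answer = engine_solve(problem, state["path"])
        if conductor_verify(problem, answer):
            break
        # Reflection: the Conductor may switch paths before the next attempt.
        state["path"] = "direct" if state["path"] == "code" else "code"
    return answer

if __name__ == "__main__":
    print(coda_loop("How many positive integers n < 100 make n^2 + n divisible by 6?"))
```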
Key findings
- 32B models beat 235B+ competitors. Qwen3 32B with CODA outperforms Qwen3 235B, DeepSeek R1, and OpenAI o3-mini on AIME and LiveCodeBench. The 120B GPT-OSS model hits 87.5% on LiveCodeBench, topping Grok 4.
- Adaptive path selection matters. Some math problems solve better through direct reasoning; others need code generation. CODA dynamically picks the right approach per problem—on 15 hard Numina Math questions, neither path alone reaches 100%, but adaptive selection does.
- Verification is the bottleneck. There's a consistent gap between CODA's final accuracy and the theoretical best-of-N ceiling, suggesting that better verifiers (potentially trained via RL) could unlock substantial additional gains.
Who this is useful for
Teams deploying reasoning systems on memory-constrained hardware. Researchers exploring test-time compute scaling. Anyone wanting frontier-level performance without frontier-level model sizes.
📄 [Paper Link]
Calibrated Reasoning: An Explanatory Verifier for Dynamic and Efficient Problem-Solving
When scaling reasoning models at test time, practitioners face a frustrating bottleneck: these models are terrible at evaluating their own solutions. Multi-path exploration strategies like best-of-n sampling rely on the model knowing when it's right or wrong, but current reasoning models are heavily biased toward giving everything high scores regardless of correctness. This paper introduces an Explanatory Verifier trained via reinforcement learning (GRPO) that produces calibrated confidence scores along with natural language reasoning for why solutions are correct or incorrect. The key insight is comparing pairs of candidate solutions rather than evaluating them in isolation, which helps the verifier catch subtle errors.
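Here's a rough Python sketch of how a pairwise verifier can drive an adaptive best-of-n loop, sampling more candidates only when everything so far looks wrong. The stubbed model calls, the 0-1 confidence scale, and the pairwise loop are illustrative assumptions; the paper's verifier is an RL-trained model, not a heuristic.

```python
# Sketch of pairwise verification driving adaptive best-of-n sampling.
# The verifier/model calls and the 0-1 confidence scale are illustrative
# assumptions; the paper trains a dedicated Explanatory Verifier with GRPO.
import itertools
import random

def generate_candidate(problem: str) -> str:
    """Stub: sample one chain-of-thought solution from the reasoning model."""
    return f"candidate answer {random.randint(0, 3)}"

def verify_pair(problem: str, a: str, b: str) -> tuple[float, float, str]:
    """Stub: return calibrated confidences for each candidate plus a rationale.

    The key idea is comparing two solutions side by side instead of scoring
    each one in isolation.
    """
    conf_a, conf_b = random.random(), random.random()
    rationale = "explains which steps disagree and which (if either) is sound"
    return conf_a, conf_b, rationale

def adaptive_best_of_n(problem: str, max_candidates: int = 8, threshold: float = 0.5) -> str:
    candidates = [generate_candidate(problem) for _ in range(2)]
    while True:
        best, best_conf, all_low = candidates[0], 0.0, True
        for a, b in itertools.combinations(candidates, 2):
            conf_a, conf_b, _rationale = verify_pair(problem, a, b)
            all_low = all_low and max(conf_a, conf_b) < threshold
            if conf_a > best_conf:
                best, best_conf = a, conf_a
            if conf_b > best_conf:
                best, best_conf = b, conf_b
        # Stop if some candidate looks right or the sampling budget is spent;
        # otherwise pay for one more candidate and re-verify.
        if not all_low or len(candidates) >= max_candidates:
            return best
        candidates.append(generate_candidate(problem))
```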
Key findings
- 1-3× token savings in best-of-n. The verifier achieves higher accuracy at low k values compared to self-consistency, and matches accuracy at higher k while using substantially fewer tokens—because it only generates additional candidates when both current options look wrong.
- Catches failure modes voting can't. When both candidate solutions are identically incorrect (a common failure mode on hard problems where models collapse into narrow, biased answers), majority voting fails completely. The verifier reliably detects these cases.
- Transfers across model scales. Though trained on Qwen3-8B outputs, the 8B verifier effectively evaluates generations from the larger Qwen3-32B, reaching 0.77 accuracy on AIME 2025 while using only 75% of the tokens that self-consistency requires.
Who this is useful for
Teams running best-of-n or majority voting at scale and burning tokens on redundant generations. Researchers working on reward models and verifiers for reasoning. Anyone who's noticed their reasoning model is confidently wrong and wants a better filter.
📄 [Paper Link]
DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding
Speculative decoding has become a go-to technique for accelerating text-only LLMs, but applying it to vision-language models is harder—visual and textual information need to stay tightly integrated throughout generation. DREAM is a speculative decoding framework built specifically for VLMs, combining three innovations: a cross-attention mechanism that injects intermediate features from the target model into the draft model, entropy-adaptive feature selection to guide draft training, and visual token compression to reduce latency without losing critical information.
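Here's a minimal PyTorch-style sketch of the two ideas that differ most from text-only speculative decoding: a draft block that cross-attends to intermediate target-model features, and attention-guided compression of visual tokens. Dimensions, module choices, and the compression heuristic are illustrative assumptions, not DREAM's exact architecture.

```python
# Sketch of a draft-model block that cross-attends to intermediate features
# from the target VLM, plus attention-guided visual token compression.
# Shapes and module choices are illustrative assumptions.
import torch
import torch.nn as nn

class DraftBlockWithTargetFusion(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention lets draft tokens read refined target-model features
        # (including visual tokens) instead of concatenating them.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, draft_tokens: torch.Tensor, target_feats: torch.Tensor) -> torch.Tensor:
        x = draft_tokens
        h, _ = self.self_attn(self.norm1(x), self.norm1(x), self.norm1(x))
        x = x + h
        # Inject intermediate target features (text + compressed visual tokens).
        h, _ = self.cross_attn(self.norm2(x), target_feats, target_feats)
        x = x + h
        return x + self.mlp(self.norm3(x))

def compress_visual_tokens(visual_feats: torch.Tensor, attn_scores: torch.Tensor,
                           keep_ratio: float = 0.75) -> torch.Tensor:
    """Keep the top-k visual tokens ranked by an attention-based importance score."""
    k = max(1, int(visual_feats.shape[1] * keep_ratio))
    idx = attn_scores.topk(k, dim=1).indices                       # (batch, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, visual_feats.shape[-1])  # (batch, k, d)
    return visual_feats.gather(1, idx)

# Example: fuse 16 draft tokens with 64 target features (48 visual + 16 text).
block = DraftBlockWithTargetFusion()
draft = torch.randn(2, 16, 512)
visual = compress_visual_tokens(torch.randn(2, 48, 512), torch.rand(2, 48))
target_feats = torch.cat([visual, torch.randn(2, 16, 512)], dim=1)
print(block(draft, target_feats).shape)  # torch.Size([2, 16, 512])
```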
Key findings
- Up to 3.6× speedup over standard decoding. Across LLaVA, Pixtral, SmolVLM, and Gemma3, DREAM consistently outperforms prior speculative decoding baselines in both throughput and acceptance length.
- Cross-attention is the key architectural choice. Unlike text-only methods that concatenate features, DREAM's cross-attention preserves structured visual representations—removing it causes the largest performance drop in ablations.
- Visual tokens can be aggressively compressed. Retaining just 75% of visual tokens guided by attention-based importance scores yields a 7% speedup with minimal accuracy loss.
Who this is useful for
Teams deploying vision-language models in latency-sensitive applications. Researchers working on efficient multimodal inference. Anyone running VLMs at scale and looking to cut decoding time without sacrificing output quality.
📄 [Paper Link]
Don't be lazy: CompleteP enables compute-efficient deep transformers
When scaling up transformer models, hyperparameters like learning rate that work well for smaller models often need expensive retuning as you increase depth. This paper investigates parameterizations—rules for how to adjust hyperparameters as model architecture changes—and identifies one called CompleteP that enables hyperparameters to transfer reliably across both width and depth. The core change is scaling each residual block's output by 1/L (where L is the number of layers) before adding it to the residual stream, which maintains stability while ensuring every layer fully exploits its nonlinear capacity.
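The residual-scaling piece is tiny in code. Here's a hedged sketch of a pre-norm transformer block with the 1/L multiplier; the block layout is a generic assumption, and CompleteP's accompanying learning-rate and initialization rules are not shown.

```python
# Sketch of the residual-scaling piece of CompleteP: each block's output is
# multiplied by 1/L before being added to the residual stream. The surrounding
# structure is a generic pre-norm transformer block, not the paper's exact
# implementation, and the other CompleteP rules (lr/init scaling) are omitted.
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_layers: int):
        super().__init__()
        self.block_scale = 1.0 / n_layers   # the depth-dependent factor
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.attn(self.norm1(x), self.norm1(x), self.norm1(x))
        x = x + self.block_scale * h                      # scale before the residual add
        return x + self.block_scale * self.mlp(self.norm2(x))

# The same hyperparameters are intended to keep working as n_layers grows
# (e.g., tuned at 2 layers, reused at 128).
n_layers = 8
model = nn.Sequential(*[ScaledResidualBlock(256, 4, n_layers) for _ in range(n_layers)])
print(model(torch.randn(1, 16, 256)).shape)  # torch.Size([1, 16, 256])
```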
Key findings
- Depth-wise HP transfer works. Hyperparameters tuned on a 2-layer model transfer directly to 128-layer models without retuning—but only with CompleteP.
- 12-34% compute savings over µP. Gains increase for deeper models, and narrow-deep architectures that were previously suboptimal become viable.
- Theory explains why. Prior parameterizations cause layers to effectively linearize as depth increases ("lazy learning"), wasting the representational power of depth. CompleteP is the unique parameterization ensuring complete feature learning.
Who this is useful for
ML engineers training large transformers. Researchers studying neural scaling laws. Teams with hardware constraints favoring narrow-deep architectures.
📄 [Paper Link]
Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training
Hyperparameters like weight decay and batch size that work well at small scale need expensive retuning as model size and dataset size change. This paper investigates scaling laws for these hyperparameters and identifies precise power-law relationships that enable accurate prediction of optimal settings before large-scale training begins.
The core insight is modeling training through what the authors call the AdamW timescale—a ratio that captures how weight updates are integrated over training, combining batch size, learning rate, weight decay, and dataset size into a single quantity. Rather than tuning the learning rate as batch size changes (which hits stability limits), tuning weight decay maintains the optimal timescale across a much wider range of batch sizes.
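As a rough illustration, the sketch below computes a timescale proxy at a small-scale configuration and then solves for the weight decay that preserves it when the batch size grows. The specific formula, tau = B / (lr * weight_decay * dataset_tokens), is an assumption used here for illustration; see the paper for the exact definition and the fitted power laws for its optimal value.

```python
# Illustrative sketch of working with an AdamW "timescale" that combines batch
# size, learning rate, weight decay, and dataset size into one quantity.
# The formula tau = B / (lr * wd * D) is an assumption for illustration only.

def adamw_timescale(batch_tokens: float, lr: float, weight_decay: float,
                    dataset_tokens: float) -> float:
    """Fraction of the dataset over which weight updates are effectively averaged."""
    return batch_tokens / (lr * weight_decay * dataset_tokens)

def weight_decay_for_timescale(target_tau: float, batch_tokens: float, lr: float,
                               dataset_tokens: float) -> float:
    """Solve the same relation for weight decay, holding the timescale fixed."""
    return batch_tokens / (lr * target_tau * dataset_tokens)

# Suppose a small-scale run found a good timescale at these settings...
tau_star = adamw_timescale(batch_tokens=1e6, lr=3e-3, weight_decay=0.1, dataset_tokens=2e10)

# ...then, instead of re-tuning the learning rate when the batch size grows 4x,
# keep lr fixed and adjust weight decay to preserve the timescale.
new_wd = weight_decay_for_timescale(tau_star, batch_tokens=4e6, lr=3e-3, dataset_tokens=2e10)
print(f"tau* = {tau_star:.4f}, weight decay at 4x batch = {new_wd:.3f}")
```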
Key findings
- Optimal timescale follows a power law in tokens-per-parameter. As you train on more data relative to model size, the optimal timescale decreases predictably. This enables accurate prediction of optimal weight decay for any combination of model size, dataset size, and batch size.
- Optimal and critical batch sizes scale with data, not compute. Both the loss-minimizing batch size and the critical batch size (beyond which returns diminish sharply) follow power laws in dataset size, contradicting prior work suggesting they scale with FLOPs or loss. This means overtrained models can use much larger batch sizes efficiently.
- Overtrained models are Pareto-optimal for time-constrained training. When balancing training time against compute cost, smaller models trained on more data dominate—they benefit from both higher critical batch sizes and greater data parallelism.
Who this is useful for
ML engineers setting hyperparameters for large-scale pretraining. Teams making compute/time tradeoffs in model training. Researchers studying neural scaling laws and optimal training configurations.
📄 [Paper Link]
PTPP-Aware Adaptation Scaling Laws predict domain-adaptation performance at unseen pre-training budgets
When adapting a pre-trained language model to a new domain (like adding French to an English/Arabic model), you need to balance learning the new domain against forgetting the old one. Existing scaling laws for this process assume a fixed pre-training budget, but in practice you're often working with models trained to different extents. This paper introduces scaling laws that explicitly incorporate the pre-training budget (measured as tokens-per-parameter, or PTPP), enabling predictions about adaptation performance for models you haven't actually trained yet.
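As a toy illustration of the fit-then-extrapolate workflow, the sketch below fits a loss curve with a PTPP-dependent floor on cheap runs and queries it at a much larger budget. The functional form and synthetic data are assumptions for illustration, not the paper's gated+floor parameterization.

```python
# Toy sketch: fit an adaptation-loss law whose "floor" depends on the
# pre-training budget (PTPP), then extrapolate to an unseen budget.
# The functional form and synthetic data are illustrative assumptions,
# not the paper's exact gated+floor parameterization.
import numpy as np
from scipy.optimize import curve_fit

def adaptation_loss(X, a, alpha, b, beta):
    """L(D_adapt, PTPP) = a * D_adapt^-alpha + b * PTPP^-beta."""
    d_adapt, ptpp = X
    return a * d_adapt ** (-alpha) + b * ptpp ** (-beta)

# Synthetic "runs" at small pre-training budgets only (PTPP = 15 and 31).
rng = np.random.default_rng(0)
d_adapt = np.tile(np.logspace(0, 2, 12), 2)                    # adaptation tokens per parameter
ptpp = np.concatenate([np.full(12, 15.0), np.full(12, 31.0)])  # pre-training tokens per parameter
loss = adaptation_loss((d_adapt, ptpp), 2.0, 0.4, 6.0, 0.3)
loss += rng.normal(0.0, 0.02, loss.size)

# Fit on the cheap runs, then predict adaptation loss for a model pre-trained
# to PTPP = 279 that was never actually adapted.
params, _ = curve_fit(adaptation_loss, (d_adapt, ptpp), loss,
                      p0=(1.0, 0.3, 1.0, 0.3), maxfev=20000)
pred = adaptation_loss((np.array([8.9]), np.array([279.0])), *params)
print(f"predicted adaptation loss at PTPP=279, 8.9 adaptation TPP: {pred[0]:.3f}")
```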
Key findings
- Early-stage fits predict late-stage performance. Laws fitted to models trained at PTPP=15 and 31 accurately forecast French validation loss at PTPP=279 (a nearly 10× extrapolation), outperforming baselines that ignore PTPP entirely.
- Pre-training budget affects adaptation in two ways. It shifts the loss floor (better-trained models start lower) and changes the shape of the learning curve (how efficiently the model uses adaptation data). Capturing both effects via a "gated+floor" formulation yields the best predictions.
- Enables analytical optimization of replay ratios. Rather than expensive grid searches, the fitted laws let you solve for optimal adaptation parameters (8.9 tokens-per-parameter, 34% replay) that satisfy constraints on both forgetting and target performance.
Who this is useful for
Teams adapting foundation models to new languages or domains. Researchers studying continual pre-training dynamics. Anyone trying to plan adaptation compute budgets without running exhaustive experiments.
📄 [Paper Link]
Life Sciences Collaboration with Inception Institute of Artificial Intelligence
Cerebras has a deep collaboration with the Inception Institute of Artificial Intelligence on foundation models for the life sciences. Together we've published three papers introducing the Omics42 platform:
- Prot42, a family of protein language models that generates high-affinity protein binders from sequence alone, achieving binding strengths competitive with Google's AlphaProteo while handling sequences up to 8,192 amino acids.
- Chem42, a family of chemical language models for drug-like molecule generation that outperforms larger models on standard benchmarks with 6× fewer parameters and can design ligands tailored to specific protein targets.
- Gene42, a genomic foundation model that processes up to 192,000 DNA base pairs at single-nucleotide resolution (the first dense-attention model to achieve this scale), setting a new state of the art on species classification and disease variant prediction tasks.
📄 Prot42: a Novel Family of Protein Language Models for Target-aware Protein Binder Generation
📄 Chem42: a Family of Chemical Language Models for Target-aware Ligand Generation
📄 Gene42: Long-Range Genomic Foundation Model With Dense Attention