So, you need speed, intelligence, and great economics… introducing GLM 4.7, the first open model that delivers all three.
Why developers are switching
At Cerebras, we’ve seen overwhelming demand from developers for GLM 4.7. The migration to GLM 4.7 is driven by three key factors: cost, speed, and intelligence.
- Cost: GLM 4.7 is more affordable than models like Claude Sonnet 4.5, achieving high-proficiency intelligence at a fraction of the cost.
- Speed: On Cerebras, GLM 4.7 achieves output speeds of 1,500+ tokens per second, making it 20x faster than closed-source competitors like Sonnet 4.5. This significantly reduces latency in agentic workflows, allowing for rapid iteration and execution in development environments.
- Intelligence: GLM 4.7 is the strongest open-source coding model available today. It's remarkably skilled at tool use, achieving 96% on 𝜏²-Bench Telecom, which makes it suitable for use within coding harnesses and general agents. Additionally, GLM 4.7 is excellent at reasoning and knowledge tasks, scoring 86% on the challenging GPQA Diamond (Scientific Reasoning), as measured by Artificial Analysis.
What GLM 4.7 is (and why it matters)
GLM 4.7 is Z.AI's latest model, and the first open-source model to rival closed-source giants like Sonnet 4.5. Key points:
- Scale: ~358B total parameters, ~32B active per token (MoE routing)
- Built for: coding + tool use + agentic workflows
- Privacy on Cerebras: inputs/outputs are processed in memory and not persisted
But architecture is just theory until you see the results.
The Performance
If you care about coding and agentic evals, GLM 4.7 lands in the “top open model” tier. On LiveCodeBench, GLM 4.7 outperforms Anthropic and OpenAI models, trailing only Gemini 3.
GLM 4.7 extends that lead with substantial gains on GPQA and AIME, outperforming Claude Sonnet 4.5 on both (see AIME 2025 below).
And when you compare it directly to GLM 4.6, the previous model generation, the improvement in coding and general capabilities is substantial: notably +12.4 points on HLE and +16.5 points on Terminal Bench 2, among others.
Performance is one thing, but usability is another.
GLM 4.7: The opportunity to be 20x faster
The beauty of GLM 4.7 is that it's open source. This means you're no longer constrained by hardware bottlenecks. Now, your product can achieve token output speeds orders of magnitude faster than closed-source models like Sonnet 4.5 or GPT 5.2 running on GPUs. For example, see the difference in speed for GLM 4.7 across different hardware providers.
Every model has different personality quirks. When migrating to GLM 4.7, a common mistake is reusing old prompts without adjusting them for its unique behavior, which can lead to suboptimal performance. To fully leverage GLM 4.7's strengths, it's essential to refine your prompts, architecture, and sampling parameters accordingly.
Below are 10 rules to help you get the most from GLM 4.7.
Rule #1: Front-load your instructions
On Cerebras, GLM 4.7 supports up to 131K context length. However, like most other large models, GLM 4.7's output quality is highest at shorter context lengths and can degrade at extreme lengths.
GLM 4.7 in particular has been observed to have a strong bias toward the beginning of the prompt, even more so than other models. This is especially noticeable when using think tags in conversations, since the thinking reinforces earlier instructions.
Accordingly, to ensure proper instruction following, place all mandatory instructions and behavioral directives at the absolute start of your system prompt to leverage the model's beginning bias. This is more effective than placing them later in the prompt.
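For example, a front-loaded system prompt might look like the sketch below. The specific rules, file names, and project details are placeholders; substitute your own.

```python
# Minimal sketch of a front-loaded system prompt. The rules and file names
# below are placeholders -- substitute your own project's directives.
SYSTEM_PROMPT = (
    # Mandatory directives first: GLM 4.7 weights the beginning of the prompt heavily.
    "You MUST read and fully comprehend architecture.md before writing any code.\n"
    "You MUST NOT modify files outside the src/ directory.\n"
    "You MUST return unified diffs, never whole files.\n"
    "\n"
    # Supporting context and preferences come after the hard rules.
    "Project context: a Python monorepo using Poetry and pytest.\n"
)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Add retry logic to the payment client."},
]
```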
Rule #2: Provide clear and direct instructions
Different models follow instructions differently. GLM 4.7 responds best to firm, direct language that removes ambiguity. Establish rules immediately using strong, explicit directives like MUST and STRICTLY. Avoid soft, suggestive language that the model may treat as optional.
For example:
- Do write: “Before writing any code, you MUST first read and fully comprehend the architecture.md file. All code you generate must strictly conform…”
- Don’t write: “Please read and follow my architecture.md...”
Rule #3: Specify a default language
Because GLM 4.7 is multilingual, we’ve found that it can sometimes switch languages in its responses.
If you're migrating from a language model that defaults to English, it's helpful to include a directive such as "Always respond in English" (or your preferred language) in your system prompt to prevent unexpected outputs.
Occasionally, we’ve observed that the model may output reasoning traces in Chinese on the first turn. Explicit language control prevents this behavior.
Rule #4: Leverage role-play
One of GLM 4.7’s biggest strengths is its ability to effectively maintain and follow roles and personas. Its internal “thinking blocks” mirror role prompts closely, allowing precise control over tone and domain knowledge.
To take advantage of the model's ability to role-play: give the model an explicit persona, or create multi-agent systems, each with its own persona.
For example:
- Do write: “You’re acting as an analyst preparing an executive summary; make sure to review the following sources in detail and then give a structured and professional response…”
Rule #5: Break up the task
GLM 4.7 performs a single reasoning pass per prompt before acting and does not continuously re-evaluate mid-task. That continuous re-evaluation is sometimes referred to as 'interleaved thinking', and it is supported by models like Sonnet and the OpenAI models.
In interleaved thinking, the model will alternate between:
- Reasoning steps (analysis, hypothesis generation)
- Action steps (retrieval, tool use, code execution, or environment interaction)
This allows the model to pause, reflect on intermediate results, and adjust its approach dynamically throughout the task execution.
Since GLM 4.7 does not interleave thinking, you should prompt for better task completion by breaking tasks into small, well-defined sub-steps. For example:
- List dependencies.
- Propose the new structure.
- Generate and verify migrations.
This incremental approach produces cleaner results and closely matches GLM’s execution-first tendencies.
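Here is a minimal sketch of that incremental pattern, sending each sub-step as its own turn so the model acts on one well-defined task at a time. The task and step wording are illustrative.

```python
import os
from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

# Illustrative sub-steps for a database-layer migration.
steps = [
    "List every external dependency the current database layer relies on.",
    "Propose the new module structure for the migrated layer.",
    "Generate the migration scripts and a verification checklist.",
]

messages = [
    {"role": "system", "content": "You are migrating a database layer. Complete ONLY the step you are given."},
]

for step in steps:
    messages.append({"role": "user", "content": step})
    response = client.chat.completions.create(
        model="zai-glm-4.7",
        messages=messages,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})  # carry each result into the next step
```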
Rule #6: Disable or minimize reasoning when not needed
GLM 4.7 often includes internal thought/reasoning blocks in its output. For many straightforward tasks, this reasoning overhead is unnecessary and slows down responses.
To minimize reasoning:
- Disable Reasoning: Use the nonstandard parameter disable_reasoning: True in your request parameters for the Cerebras API. Note that this differs from Z.ai, which uses the thinking parameter in the Z.ai API.
- Set max_completion_tokens: GLM 4.7's verbosity can be effectively controlled by setting appropriate token limits. For focused responses, consider using lower values.
- Prompt for Less: Include instructions in the system prompt to minimize reasoning. For example, add to the system prompt: "Reason only when necessary" or "Skip reasoning for straightforward tasks."
- Set Output Constraints: Use structured output formats (JSON, lists, bullets) that naturally discourage verbose reasoning blocks.
- Set clear_thinking: true: The model clears its internal thinking state between turns in simple conversations, saving tokens.
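Putting a few of these together, a request for a straightforward task might look like the sketch below. disable_reasoning and clear_thinking are Cerebras-specific fields; passing them via extra_body is an assumption here, adjust to however your SDK version forwards nonstandard parameters.

```python
import os
from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

# Simple extraction task: skip the reasoning block and cap the output length.
response = client.chat.completions.create(
    model="zai-glm-4.7",
    messages=[
        {"role": "system", "content": "Reason only when necessary. Respond with a JSON object only."},
        {"role": "user", "content": "Extract the invoice number and total from: 'Invoice #4821, total $1,290.00'."},
    ],
    max_completion_tokens=300,        # keep focused responses short
    extra_body={
        "disable_reasoning": True,    # Cerebras-specific: skip the thinking block entirely
        "clear_thinking": True,       # drop internal thinking state between simple turns
    },
)

print(response.choices[0].message.content)
```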
Rule #7: Enable enhanced reasoning for complex tasks
While GLM 4.7's reasoning can be excessive for simple tasks, it becomes valuable for complex problem-solving that requires step-by-step thinking.
To enhance reasoning:
- Enable Reasoning: Ensure disable_reasoning is set to False (or omitted) in your API request when tackling complex problems.
- Prompt for Depth: Add explicit reasoning instructions to your system prompt: "For any given task you must think step by step" or "Break down your reasoning into clear logical steps."
- Chain-of-Thought Prompting: Include examples that demonstrate the reasoning process you want, showing the model how to work through problems methodically.
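As a sketch, the same request shape with reasoning left on (disable_reasoning omitted) and an explicit step-by-step directive might look like this; the task and token limit are illustrative.

```python
import os
from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

# Complex problem: leave reasoning enabled (the default) and prompt for explicit steps.
response = client.chat.completions.create(
    model="zai-glm-4.7",
    messages=[
        {"role": "system", "content": "For any given task you must think step by step and break down your reasoning into clear logical steps."},
        {"role": "user", "content": "Design a sharding strategy for a multi-tenant Postgres cluster with 50,000 tenants."},
    ],
    max_completion_tokens=8000,  # leave room for the reasoning trace plus the final answer
)

print(response.choices[0].message.content)
```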
Rule #8: When in doubt, use critics!
Following from rule #4, one of the most powerful patterns when working with GLM 4.7 (or any LLM) is to employ specialized critic agents to review and validate outputs before allowing the main agentic flow to advance in its plan. Rather than relying on a single agent to both generate and validate code, create dedicated sub-agents with specific expertise:
- Code Review Agent: A sub-agent configured to rigorously check for code quality, adherence to SOLID/DRY/YAGNI principles, and maintainability issues.
- QA Expert Agent: Potentially bound with agentic browser capabilities to test user flows, edge cases, and integration points.
- Security Review Agent: Specialized in identifying vulnerabilities, unsafe patterns, and compliance issues.
- Performance Audit Agent: Focused on detecting performance bottlenecks, inefficient algorithms, or resource leaks.
By splitting responsibilities across multiple agents, each with a focused persona (see Rule #4), you create a robust pipeline where generation and validation are decoupled.
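A minimal sketch of this decoupled generate-and-review pipeline, with two focused personas. The prompts, task, and "APPROVED" convention are illustrative.

```python
import os
from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

def run_agent(persona: str, content: str) -> str:
    """One GLM 4.7 call with a focused persona (see Rule #4)."""
    response = client.chat.completions.create(
        model="zai-glm-4.7",
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": content},
        ],
    )
    return response.choices[0].message.content

task = "Write a Python function that validates and normalizes user-supplied email addresses."

# Generation agent produces the first draft.
draft = run_agent("You are a senior Python engineer. You MUST produce production-ready code only.", task)

# Code Review Agent validates the draft before the pipeline advances.
review = run_agent(
    "You are a strict code reviewer. You MUST check correctness, SOLID/DRY adherence, and unsafe "
    "patterns. Respond with 'APPROVED' or a numbered list of required fixes.",
    f"Task:\n{task}\n\nCandidate code:\n{draft}",
)

# One revision pass if the critic is not satisfied.
if not review.strip().startswith("APPROVED"):
    draft = run_agent(
        "You are a senior Python engineer. You MUST apply every required fix.",
        f"{task}\n\nRequired fixes:\n{review}",
    )
```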
Rule #9: Pair GLM with a frontier model
GLM 4.7 excels at reasoning compared to other open source models. However, if your use case relies on frontier reasoning capabilities, you may find that GLM 4.7 falls short on the toughest 10% of use cases.
Here are three architectural patterns you can employ to effectively utilize GLM 4.7 in your applications:
- Route to GLM 4.7 for simpler tasks and fall back to slower models for the most complex queries.
- Use GLM 4.7 as a fast backbone agent that loops in slower, more intelligent models only when needed.
- Use Sonnet or GPT to first create a plan, then execute it rapidly using GLM 4.7—allowing you to handle high volumes of tasks at a fraction of the cost while maintaining quality on complex reasoning steps.
By leveraging GLM 4.7's 17x faster output speed and lower costs for the majority of tasks, you can realize significant speed and cost savings without sacrificing overall quality.
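Here's a rough sketch of the third pattern: a frontier model produces the plan once, and GLM 4.7 executes each step. The planner model id and prompts are placeholders; any OpenAI-compatible frontier endpoint works the same way.

```python
import os
from openai import OpenAI                 # planner: any frontier model behind an OpenAI-compatible API
from cerebras.cloud.sdk import Cerebras   # executor: GLM 4.7 on Cerebras

planner = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
executor = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

task = "Refactor the billing module to support usage-based pricing."

# 1. The slower frontier model produces the plan once ("your-frontier-model" is a placeholder id).
plan = planner.chat.completions.create(
    model="your-frontier-model",
    messages=[{"role": "user", "content": f"Produce a short numbered implementation plan for: {task}"}],
).choices[0].message.content

# 2. GLM 4.7 executes each step quickly and cheaply.
for step in plan.splitlines():
    if not step.strip():
        continue
    result = executor.chat.completions.create(
        model="zai-glm-4.7",
        messages=[
            {"role": "system", "content": "You MUST implement exactly the step you are given, nothing more."},
            {"role": "user", "content": f"Overall task: {task}\nCurrent step: {step}"},
        ],
    ).choices[0].message.content
    print(result)
```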
Rule #10: Use clear_thinking to control memory between calls
Use clear_thinking to decide how much internal “thinking state” GLM 4.7 should carry across calls. For agent loops, multi‑step plans, and coding sessions that build on prior reasoning, set clear_thinking: false so the model preserves its internal state between turns. For one‑off calls, batch jobs, or when you see unwanted drift from previous steps, set clear_thinking: true so each response is based only on the visible prompt.
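A sketch of a multi-step session that preserves the thinking state across turns. clear_thinking is a Cerebras-specific field; passing it through extra_body is an assumption here, adjust to however your client forwards nonstandard parameters.

```python
import os
from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

# Multi-step coding session: preserve GLM 4.7's internal thinking state across turns.
messages = [
    {"role": "system", "content": "You are refactoring a payments service step by step."},
]

for user_turn in [
    "Step 1: list the modules that touch the payments API.",
    "Step 2: refactor the retry logic you identified in step 1.",
]:
    messages.append({"role": "user", "content": user_turn})
    response = client.chat.completions.create(
        model="zai-glm-4.7",
        messages=messages,
        extra_body={"clear_thinking": False},  # keep reasoning state so step 2 builds on step 1
    )
    messages.append({"role": "assistant", "content": response.choices[0].message.content})
```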
And now, ready to see it in action?
One request to try (Cerebras SDK)
Run the fastest inference in the world with top-in-class intelligence and sane economics.
When you run GLM 4.7 on Cerebras, you're using our Wafer‑Scale Cluster, but the way you work with the model stays familiar. You can use our long context of around 131K tokens and allow responses of up to about 40K output tokens via max_completion_tokens, so the model has enough room to take in the prompt and produce a complete answer.
You can also decide how much internal “thinking” the model should do. If you care more about speed and straightforward output, you can turn reasoning off with disable_reasoning: true to keep responses shorter and cleaner. If you’re building agents or coding loops and want the model to keep that internal state across turns, leave clear_thinking: false so it doesn’t reset between calls.
For sampling, a good starting point on Cerebras is temperature=1 with top_p=0.95. These defaults work well for most workloads on our system. Once you see how GLM 4.7 behaves in your application, you can adjust one of these settings at a time to get the style and variability you need.
Once you’ve set these parameters, you’re in a good place to start tuning for your specific use case.
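Putting all of that together, here's what a single request can look like with the Cerebras Python SDK. This is a sketch using the defaults above; disable_reasoning and clear_thinking are Cerebras-specific fields, passed here via extra_body.

```python
import os
from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

response = client.chat.completions.create(
    model="zai-glm-4.7",
    messages=[
        {"role": "system", "content": "You MUST respond in English. Reason only when necessary."},
        {"role": "user", "content": "Write a Python function that merges two sorted lists in O(n)."},
    ],
    max_completion_tokens=40000,     # room for a complete answer (plus reasoning, when enabled)
    temperature=1,                   # recommended starting point on Cerebras
    top_p=0.95,
    extra_body={
        "disable_reasoning": False,  # flip to True for short, reasoning-free answers
        "clear_thinking": False,     # keep internal state across turns in agent/coding loops
    },
)

print(response.choices[0].message.content)
```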
If you're already standardized on the OpenAI Python SDK, Cerebras is OpenAI-compatible; just pass GLM-specific fields using extra_body.
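For example, a sketch of the same kind of request through the OpenAI Python SDK, pointed at Cerebras's OpenAI-compatible endpoint:

```python
import os
from openai import OpenAI

# Same request through the OpenAI Python SDK; GLM-specific fields go in extra_body.
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",  # Cerebras OpenAI-compatible endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],
)

response = client.chat.completions.create(
    model="zai-glm-4.7",
    messages=[{"role": "user", "content": "Summarize the tradeoffs of MoE routing in two sentences."}],
    max_completion_tokens=1000,
    temperature=1,
    top_p=0.95,
    extra_body={"disable_reasoning": True},  # Cerebras-specific field passed through unchanged
)

print(response.choices[0].message.content)
```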
Getting started
- Grab an API key: https://cloud.cerebras.ai
- Use model id: zai-glm-4.7
- Start with the defaults above, then tune only when your workload forces you to.
Credits
We want to honor our amazing community that tests and provides feedback on models early. Your benchmarks, bug reports, and enthusiasm are what make our ecosystem thrive.