Abstract
To tackle complex, real-world problems, Large Language Models (LLMs) need to learn how to reason, plan, and adapt. Our recent work on test-time scaling, CePO, demonstrated that even medium-sized models (<= 32B parameters) can outperform much larger frontier models by using adaptive planning, tool use, and self-correction [1]. We believe we can push these capabilities even further by teaching LLMs to break down challenging tasks into smaller steps, advancing when successful and backtracking when they hit a wall.
This post presents our work on teaching these skills using online Reinforcement Learning (RL). Our journey begins with an ideal proxy for this kind of challenge: Sudoku. While its rules are simple, solving difficult puzzles requires significant planning and the ability to backtrack from incorrect assumptions, making it a perfect testbed for teaching an LLM the foundational skills of long-horizon reasoning.
1. Task Definition
Sudoku is a combinatorial puzzle widely studied for its challenging nature and simple yet elegant constraints. Sudoku comes in many forms: the symbols need not be numbers and can just as well be letters, colors, or shapes. The common 9x9 puzzle is subdivided into nine 3x3 sub-grids, and the objective is to fill the cells so that each row, column, and sub-grid contains the digits 1 through 9 exactly once. A smaller 4x4 variant follows analogous rules.
Despite its straightforward rules, Sudoku presents significant combinatorial complexity, serving as an effective benchmark for constraint satisfaction and logical reasoning studies due to the exponential growth in difficulty as grid size and complexity increase. In general, the extension of the puzzle to n × n grids is known to be NP-complete [2].
1.1. Data Generation and Difficulty Level Analysis
In this section, we examine methods for generating puzzles and assessing their difficulty. Although no universally accepted classification exists, one common metric is the number of empty cells in a puzzle [3]. Alternative approaches assign difficulty ratings based on the solving techniques required, producing solver‑dependent measures that reflect the logical rules and strategies applied.
One well-known solving method is depth-first search (DFS), which alternates between two components: (1) making a guess (branching) and (2) propagating constraints to shrink the set of possible values for the remaining cells (nodes). The number of guesses required to reach the solution is a good indicator of a puzzle's difficulty. It also correlates with the number of empty cells, since more empty cells usually means more options. It is therefore critical to propagate constraints as far as possible and to guess only when needed. Although DFS is straightforward, it is a distinctly machine-oriented way of solving puzzles; humans have developed other techniques. Human-style solving relies on roughly 8-10 pre-defined rules for identifying cell values [4]. These rules can be ordered by increasing difficulty, and a puzzle can be rated by the hardest rule needed to complete it.
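To make this concrete, here is a minimal Python sketch of the idea (not our production solver): `propagate` fills cells that have a single remaining candidate, and `solve_counting_guesses` branches only when propagation stalls, counting the guesses as a difficulty signal. The function names and the naked-single-only propagation are simplifications.

```python
from copy import deepcopy

def candidates(grid, r, c, n=9, box=3):
    """Digits that can still legally be placed in cell (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(n)}
    br, bc = (r // box) * box, (c // box) * box
    used |= {grid[i][j] for i in range(br, br + box) for j in range(bc, bc + box)}
    return [d for d in range(1, n + 1) if d not in used]

def propagate(grid, n=9, box=3):
    """Fill cells that have exactly one candidate until nothing changes."""
    changed = True
    while changed:
        changed = False
        for r in range(n):
            for c in range(n):
                if grid[r][c] == 0:
                    cands = candidates(grid, r, c, n, box)
                    if not cands:
                        return False                 # contradiction reached
                    if len(cands) == 1:
                        grid[r][c] = cands[0]
                        changed = True
    return True

def solve_counting_guesses(grid, n=9, box=3):
    """Return (solved grid or None, number of guesses made along the way)."""
    if not propagate(grid, n, box):
        return None, 0
    empties = [(r, c) for r in range(n) for c in range(n) if grid[r][c] == 0]
    if not empties:
        return grid, 0
    # Branch on the most constrained empty cell (fewest candidates).
    r, c = min(empties, key=lambda rc: len(candidates(grid, rc[0], rc[1], n, box)))
    total_guesses = 0
    for d in candidates(grid, r, c, n, box):
        child = deepcopy(grid)
        child[r][c] = d
        solved, g = solve_counting_guesses(child, n, box)
        total_guesses += 1 + g
        if solved is not None:
            return solved, total_guesses
    return None, total_guesses
```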
We create our Sudoku puzzles by starting from a randomly filled grid and removing cells one by one until we reach a desired number of empty cells. At each iteration we check that the current state of the puzzle still has a unique solution. The final 9x9 puzzles contain only unique solutions, whereas we augment the 4x4 puzzles by allowing up to 2 instances per unique solution.
Next, we study various ways to assess puzzle difficulty and use clustering to determine logical ratings in combination with the number of empty cells. The distribution of the final 9x9 dataset is presented in Figure 1, and the logical difficulty levels are listed in Appendix 1.
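Below is a minimal sketch of this generation loop under the assumptions just stated; `count_solutions` is a plain DFS solution counter, and the helper names are ours for illustration rather than our exact implementation.

```python
import random
from copy import deepcopy

def count_solutions(grid, n=9, box=3, limit=2):
    """Count completions of `grid` by DFS, stopping early once `limit` is reached."""
    empties = [(r, c) for r in range(n) for c in range(n) if grid[r][c] == 0]
    if not empties:
        return 1
    r, c = empties[0]
    used = set(grid[r]) | {grid[i][c] for i in range(n)}
    br, bc = (r // box) * box, (c // box) * box
    used |= {grid[i][j] for i in range(br, br + box) for j in range(bc, bc + box)}
    total = 0
    for d in range(1, n + 1):
        if d not in used:
            grid[r][c] = d
            total += count_solutions(grid, n, box, limit - total)
            grid[r][c] = 0
            if total >= limit:
                break
    return total

def make_puzzle(solved_grid, target_empty, n=9, box=3):
    """Remove cells from a solved grid while the solution stays unique."""
    puzzle = deepcopy(solved_grid)
    cells = [(r, c) for r in range(n) for c in range(n)]
    random.shuffle(cells)
    empty = 0
    for r, c in cells:
        if empty >= target_empty:
            break
        saved, puzzle[r][c] = puzzle[r][c], 0
        if count_solutions(deepcopy(puzzle), n, box) == 1:
            empty += 1
        else:
            puzzle[r][c] = saved   # removal breaks uniqueness; put the clue back
    return puzzle
```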
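As an illustration of one way the two signals could be combined (a hedged sketch, not our exact pipeline), the snippet below clusters puzzles on their empty-cell count and the hardest logical rule required, then orders the clusters into levels; the feature choice and the value of `k` are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_levels(num_empty, hardest_rule, k=5, seed=0):
    """num_empty, hardest_rule: 1-D arrays of per-puzzle difficulty features."""
    X = np.column_stack([num_empty, hardest_rule]).astype(float)
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)       # normalize both features
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X)
    # Re-index clusters so that level 0 is the easiest (lowest mean feature value).
    order = np.argsort([X[labels == i].mean() for i in range(k)])
    remap = {old: new for new, old in enumerate(order)}
    return np.array([remap[l] for l in labels])
```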
1.2 Representation Space and Level of Guidance
Another critical consideration is the representation space and the level of task-specific inductive bias in the prompt. LLMs are known to be sensitive to tokenization, and we want our representation to be robust to this process and aligned with the model's reasoning habits [5]. Sudoku can mainly be represented as (1) a structured grid or (2) a coordinate list. We therefore manually prompt the models of interest with various representations (e.g., markdown, structured grid, row-based, alpha-numeric coordinate list) and find that models tend to transform any given representation into a format that lets them reference cells with coordinates R{row}:C{column}. We evaluate models with grid, row-based, and markdown representations (see Fig. 2).
In addition to the representation, we ablate system and user prompts that include varying levels of guidance about the Sudoku task (see Appendix 2). We design our solution parser to be lenient and to accept a solution whenever possible. We track several failure modes: whether the output format can be parsed, whether the solution satisfies the row, column, and sub-grid constraints, and whether the solution is not only valid but actually solves the given puzzle; finally, we report fully correct solutions.
We present the results in the table below. Unless otherwise specified, we use temperature 0.3, top-p 0.95, and a maximum context length of 32,768 tokens for our generations.
We observe that Qwen3-8B-Base achieves similar accuracy with the 4x4 Markdown and Row representations, whereas the Grid representation scores lower, mainly because its solutions, while often valid, do not correspond to the given initial puzzle. We hypothesize that the lack of a coordinate list is the main reason, since the model has no systematic way of referring to cells consistently. The Qwen3-8B model shows a significant boost with the 9x9 Markdown representation, most likely due to the substantial amount of markdown data involved in its post-training. We do not include 4x4 numbers for Qwen3-8B and its successors since they solve all of those puzzles.
The Qwen3-8B-Base model has not been trained with a chat template, a system prompt, or reasoning. We ablate prompting with and without a system prompt: the variant without a system prompt can be viewed as a plain chain-of-thought (CoT) setup, while the variant with a system prompt resembles a reasoning model. We observe similar accuracy with both prompts, yet they exhibit different failure modes. The reasoning prompt is better at producing valid puzzles but struggles with formatting. We hypothesize that reasoning helps with task complexity but disrupts the output format.
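For concreteness, the snippet below renders the same 4x4 puzzle in the representations we compare; the exact prompt wording used in our experiments may differ, and the helper names are illustrative.

```python
# One 4x4 puzzle (0 marks an empty cell) rendered in several representations.
puzzle = [[1, 0, 0, 4],
          [0, 4, 1, 0],
          [0, 1, 4, 0],
          [4, 0, 0, 1]]

def as_grid(p):
    return "\n".join(" ".join(str(v) if v else "." for v in row) for row in p)

def as_rows(p):
    return "\n".join(f"R{r + 1}: " + " ".join(str(v) if v else "_" for v in row)
                     for r, row in enumerate(p))

def as_markdown(p):
    header = "| " + " | ".join(f"C{c + 1}" for c in range(len(p))) + " |"
    sep = "|" + "---|" * len(p)
    body = "\n".join("| " + " | ".join(str(v) if v else " " for v in row) + " |"
                     for row in p)
    return "\n".join([header, sep, body])

def as_coordinates(p):
    return ", ".join(f"R{r + 1}:C{c + 1}={v}"
                     for r, row in enumerate(p) for c, v in enumerate(row) if v)

print(as_grid(puzzle), as_rows(puzzle), as_markdown(puzzle), as_coordinates(puzzle),
      sep="\n\n")
```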
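A simplified sketch of the checks behind these failure modes is shown below; the real parser is more forgiving, and the function names are illustrative.

```python
import re

def parse_solution(text, n=9):
    """Leniently pull the last n x n block of digits out of the model's output."""
    rows = [[int(d) for d in re.findall(r"[1-9]", line)]
            for line in text.strip().splitlines()]
    rows = [r for r in rows if len(r) == n]
    return rows[-n:] if len(rows) >= n else None        # format failure -> None

def is_valid(grid, n=9, box=3):
    """Row, column and sub-grid constraints."""
    full = set(range(1, n + 1))
    cols = [[grid[r][c] for r in range(n)] for c in range(n)]
    boxes = [[grid[r][c] for r in range(br, br + box) for c in range(bc, bc + box)]
             for br in range(0, n, box) for bc in range(0, n, box)]
    return all(set(group) == full for group in grid + cols + boxes)

def is_correct(grid, puzzle, n=9, box=3):
    """Valid AND consistent with the given clues, i.e. a solution to *this* puzzle."""
    clues_kept = all(puzzle[r][c] in (0, grid[r][c]) for r in range(n) for c in range(n))
    return is_valid(grid, n, box) and clues_kept
```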
2. The First Challenge: From Good to Great
Online RL training has produced significant gains in the math and coding domains [6]. In this study, we apply GRPO to Qwen3-8B to improve its accuracy on 9x9 grid-represented puzzles [7]. We use our aforementioned Sudoku dataset; at each training iteration we solve 32 puzzles with 16 rollouts each. We use temperature 0.65, top-p 1.0, a maximum context length of 20,480, and perform 2 weight updates per training step with a linear learning-rate schedule (1e-6 with warmup), a KL loss coefficient of 0.005, and lower/upper clip rates of 0.2/0.28. We also apply dynamic sampling to include only groups with non-zero reward standard deviation. We constantly monitor training performance across puzzle difficulty levels and interrupt training 3 times to remove levels that can already be solved with >80% avg@4 accuracy, which makes training more FLOP-efficient. We present our results in the table below. Although the model keeps improving at a slow rate, we stop training at step 375 due to our training budget.
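For reference, the hyper-parameters above are collected in the sketch below together with the dynamic-sampling filter; the config keys are illustrative and not tied to any specific RL framework.

```python
import numpy as np

grpo_config = dict(
    prompts_per_step=32,        # puzzles per training iteration
    rollouts_per_prompt=16,     # group size for GRPO advantages
    temperature=0.65,
    top_p=1.0,
    max_context_length=20480,
    updates_per_step=2,         # weight updates per batch of rollouts
    learning_rate=1e-6,         # linear schedule with warmup
    kl_coef=0.005,
    clip_low=0.2,
    clip_high=0.28,
)

def dynamic_sampling_filter(groups):
    """Keep only prompt groups whose rollout rewards have non-zero standard deviation."""
    return [g for g in groups if np.std(g["rewards"]) > 0.0]
```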
Our results demonstrate that the Qwen3-8B-RL model significantly improves upon Qwen3-8B, achieving gains of 73% in avg@4 and 34% in pass@4 while using 27% fewer tokens. This highlights the effectiveness and efficiency of our approach in enhancing the model's reasoning capabilities on complex tasks. Additionally, as shown in the table below, our model ranks second among state-of-the-art (SOTA) models, surpassed only by o4-mini and outperforming both R1 and Qwen3-235B. We observed that R1 and Qwen3-235B struggled with output formatting despite our lenient parsing criteria. This observation underscores the value of a secondary metric, isValid, for evaluating a model's intrinsic ability to solve complex reasoning tasks independent of formatting requirements; our model also demonstrates strong performance on this metric.
3. The Ultimate Test: From Novice to Expert
Inspired by our previous results, we further try an RL-only approach on the Qwen3-8B-Base model. We apply GRPO to Qwen3-8B-Base to improve its accuracy on 4x4 row-represented puzzles. We use the same setup and hyper-parameters with a shorter maximum context length of 4,096.
Due to the limited dataset size for 4x4 puzzles, we introduce a priority-based sampling method in this experiment. This method enforces an implicit curriculum in which the model prioritizes samples it struggles with. We keep priorities in logit space and update their values based on the rewards earned in a training step.
We convert the priorities to probabilities and sample from them. This reduces the number of regenerations needed for dynamic sampling, since it draws from harder questions, and improves our throughput by more than 400%, especially later in training.
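A hedged sketch of this priority sampler is given below: each puzzle keeps a logit, logits are nudged down for puzzles the model solves and up for puzzles it fails, and batches are drawn from the softmax over logits. The update rule and its scale are illustrative assumptions.

```python
import numpy as np

class PrioritySampler:
    def __init__(self, num_samples, lr=1.0, seed=0):
        self.logits = np.zeros(num_samples)     # one priority logit per puzzle
        self.lr = lr
        self.rng = np.random.default_rng(seed)

    def sample(self, batch_size):
        z = self.logits - self.logits.max()     # numerically stable softmax
        probs = np.exp(z) / np.exp(z).sum()
        return self.rng.choice(len(self.logits), size=batch_size,
                               replace=False, p=probs)

    def update(self, indices, mean_rewards):
        # High mean reward (easy puzzle) -> lower priority; low reward -> higher priority.
        for i, r in zip(indices, mean_rewards):
            self.logits[i] += self.lr * (0.5 - r)
```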
In addition to the Sudoku task, we applied a chat template augmented with a reasoning system prompt during training. Note that the base model had not previously been exposed to this template. Our ablation studies revealed that the model infrequently utilized the <think> and </think> tokens. Although lightly supervised fine-tuning (SFT) on these tokens could potentially improve their usage, we opted to defer this exploration to future work, instead employing the <reasoning> and </reasoning> tags. With these modifications, our Qwen3-8B-Base-RL model demonstrated significant improvement on 4x4 puzzles, increasing performance from an avg@32/pass@32 of 0.14/0.63 to 0.97/0.99. Notably, despite no direct training on 9x9 puzzles, the model also improved from an avg@4/pass@4 of 0.033/0.059 to 0.23/0.28, highlighting the generalization capabilities of our training approach to more complex, unseen puzzles.
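For illustration, a possible version of the reasoning-tagged format, together with a simple format check, is shown below; the exact system prompt used in our runs differs.

```python
import re

# Illustrative system prompt, not the verbatim one from our experiments.
SYSTEM_PROMPT = (
    "You are a careful Sudoku solver. First reason step by step inside "
    "<reasoning> and </reasoning> tags, then output the completed grid."
)

def follows_reasoning_format(completion):
    """True if the completion contains exactly one non-empty reasoning block."""
    blocks = re.findall(r"<reasoning>(.*?)</reasoning>", completion, flags=re.DOTALL)
    return len(blocks) == 1 and blocks[0].strip() != ""
```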
We further trained our checkpoint on a 9x9 dataset; although we were able to improve on easy/medium level puzzles, this checkpoint was not able to generalize to challenging puzzles. Upon analyzing the model's output, we observed that the model learned various strategies throughout training but ultimately overfitted to a single strategy for solving puzzles. To address this issue, we implemented the Clip-Cov method to encourage the use of rare tokens and diverse strategies [8]. While our Qwen3-8B-Base-RL-ClipCov checkpoint achieved similar results on 4x4 puzzles, its performance on 9x9 puzzles degraded. We hypothesize that combining SFT with online reinforcement learning could enable the model to solve more complex puzzles, and we leave this exploration for future research.
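For readers unfamiliar with Clip-Cov, the sketch below captures the idea as we understand it from [8]: tokens whose log-probability has high covariance with the advantage are detached from the policy gradient, which helps preserve entropy. The selection rule and clip fraction here are illustrative, not the exact values used in our runs.

```python
import torch

def clip_cov_mask(logprobs, advantages, clip_fraction=0.002):
    """Boolean mask of tokens whose policy-gradient contribution should be detached."""
    cov = (logprobs - logprobs.mean()) * (advantages - advantages.mean())
    k = max(1, int(clip_fraction * cov.numel()))
    threshold = torch.topk(cov.flatten(), k).values.min()
    return cov >= threshold
```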
4. On Generalization
4.1 Effects on Other Representations
We assess whether training on one representation generalizes to others. The table below shows that training on one representation does generalize to the others, highlighting that, regardless of the format, the model's inherent ability to solve complex problems improves.
4.2 Effects on Benchmarks
We also evaluate our checkpoints on a small set of public benchmarks. Our Qwen3-8B-Base-RL model improves on AIME-24, AIME-25, and GPQA-Diamond. In line with our previous observations, this supports the model's improvement in inherent complex-task handling. Our Qwen3-8B-RL model shows a slight regression on AIME-24 and AIME-25 and a one-point improvement on GPQA. We find these values encouraging, since this model becomes extremely good at one task, Sudoku, without showing significant degradation on others.
5. Conclusion
We present a thorough analysis of Sudoku puzzles, their generation and difficulty assessment, as well as prompting effects. We further conduct experiments showing online RL's ability to improve base and instruct models' complex-task handling on Sudoku puzzles. We introduce a chat template and reasoning tags to the base model using RL only. We also run ablations analyzing generalization across representations and the effects of Sudoku training on general benchmarks. In future work, we plan to extend our experiments to more general scientific reasoning agents using more diverse datasets, including the math and code domains.
Appendix
1. Logical Difficulty Levels
2. Level of Guidance
We ablate our experiments by including a system prompt, the Sudoku rules, and tips for solving the puzzles.
2.1 No Guidance
2.2 Guide with Rules
2.3 Guide with Tips
References
[2] Kendall, Graham, Andrew Parkes, and Kristian Spoerer. “A Survey of NP-Complete Puzzles.” ICGA Journal 31 (2008): 13-34. https://doi.org/10.3233/ICG-2008-31103.
[3] Shah, Kulin, Nishanth Dikkala, Xin Wang, and Rina Panigrahy. “Causal Language Modeling Can Elicit Search and Reasoning Capabilities on Logic Puzzles.” arXiv, September 16, 2024. https://doi.org/10.48550/arXiv.2409.10502 .
[4] https://www.sudokudragon.com/sudokustrategy.htm#
[5] Wang, Dixuan, Yanda Li, Junyuan Jiang, Zepeng Ding, Guochao Jiang, Jiaqing Liang, and Deqing Yang. “Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization.” arXiv, May 27, 2024. https://doi.org/10.48550/arXiv.2405.17067 .
[6] Chen, Yang, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. “AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning.” arXiv, June 5, 2025. https://doi.org/10.48550/arXiv.2505.16400 .
[7] Shao, Zhihong, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, et al. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv, April 27, 2024. https://doi.org/10.48550/arXiv.2402.03300 .
[8] Cui, Ganqu, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, et al. “The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models.” arXiv, May 28, 2025. https://doi.org/10.48550/arXiv.2505.22617 .