Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Learning to Better Search with Language Models via Guided Reinforced Self-Training
Authors: Seungyong Moon, Bumsoo Park, Hyun Oh Song
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method significantly enhances the search capabilities of language models on arithmetic reasoning and code self-repair tasks, including Countdown, Code Contests, and Code Forces. We release the source code at https://github.com/snu-mllab/guided-rest. |
| Researcher Affiliation | Collaboration | Seungyong Moon¹, Bumsoo Park², Hyun Oh Song¹ (¹Seoul National University, ²KRAFTON) |
| Pseudocode | Yes | Algorithm 1 AUGMENTSUBGOAL: Generating search traces via subgoal augmentation; Algorithm 2: Guided reinforced self-training; Algorithm 3: Guided reinforced self-training (episode-level) |
| Open Source Code | Yes | We release the source code at https://github.com/snu-mllab/guided-rest. |
| Open Datasets | Yes | We use Countdown as the primary benchmark, following previous studies on searching with language models [38, 4]. We build the training data using Prime Intellect's SYNTHETIC-1 [16], which was curated from APPS, Code Contests, and TACO [8, 14, 13]. |
| Dataset Splits | Yes | For evaluation, we follow the protocol of Gandhi et al. [4]. We construct 10K test examples for each of two settings: (1) seen targets, where the target numbers overlap with those in the training data but are paired with different input numbers, and (2) unseen targets, where the target numbers are entirely disjoint from the training data. We measure accuracy while varying the token budget. |
| Hardware Specification | No | The paper states that compute resources are described in the supplementary materials, which are not available in the provided text. The main text does not specify exact GPU/CPU models or other hardware details. |
| Software Dependencies | No | The paper mentions base models like Llama-3.2-1B-Instruct and Qwen2.5-7B-Instruct, and algorithms like PPO, but does not specify version numbers for any software libraries or dependencies. The checklist states details are in supplementary material, but this is not accessible. |
| Experiment Setup | Yes | We use Llama-3.2-1B-Instruct as the base model [5], with a maximum response length of 4K tokens. For SoS, we generate search traces using heuristic-guided DFS and BFS over 500K training examples and fine-tune the base model for two epochs. For Guided-ReST, we generate a single search trace for each problem using 200K training examples and fine-tune the model for two epochs, repeating this procedure for three iterations. For PPO, we fine-tune the model on 200K training examples for two epochs. We adopt an outcome reward function that assigns 1 for success and 0 otherwise. Additional details are provided in Appendix E. |
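
The seen/unseen evaluation split and the binary outcome reward quoted in the table can be illustrated with a minimal Python sketch. This is not the authors' code: the names `Example`, `outcome_reward`, and `split_by_target` are hypothetical, and treating input numbers as order-insensitive is an assumption made only for this illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation: (input numbers, target number).
Example = Tuple[Tuple[int, ...], int]


def outcome_reward(success: bool) -> int:
    """Binary outcome reward: 1 for success and 0 otherwise."""
    return 1 if success else 0


def split_by_target(
    train: Iterable[Example], candidates: Iterable[Example]
) -> Tuple[List[Example], List[Example]]:
    """Partition candidate test examples into seen-target and unseen-target sets.

    seen:   target occurs in training, but paired with different input numbers
    unseen: target never occurs in training
    Candidates that exactly repeat a training example are dropped.
    """
    train = list(train)
    train_targets = {target for _, target in train}
    # Assumption: the order of the input numbers does not matter.
    train_pairs = {(tuple(sorted(nums)), target) for nums, target in train}

    seen: List[Example] = []
    unseen: List[Example] = []
    for nums, target in candidates:
        if target not in train_targets:
            unseen.append((nums, target))
        elif (tuple(sorted(nums)), target) not in train_pairs:
            seen.append((nums, target))
    return seen, unseen


if __name__ == "__main__":
    train_set = [((1, 2, 3), 6), ((4, 5), 9)]
    test_pool = [((2, 4), 6), ((3, 2, 1), 6), ((7, 8), 15)]
    seen_set, unseen_set = split_by_target(train_set, test_pool)
    print("seen:", seen_set)
    print("unseen:", unseen_set)
```

In this toy run, the candidate that reuses a training target with new inputs lands in the seen set, the candidate with a target absent from training lands in the unseen set, and the reordered copy of a training example is discarded.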