Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Reinforcement Learning Teachers of Test Time Scaling
Authors: Edoardo Cetin, Tianyu Zhao, Yujin Tang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show how distilling the raw outputs of a 7B RLT directly outperforms training students with carefully postprocessed reasoning traces from orders of magnitude larger LMs. We demonstrate that RLTs also allow for better cold-starts for traditional RL, effective distillation to larger students, and even zero-shot transfer to new reasoning domains. |
| Researcher Affiliation | Industry | Edoardo Cetin, Tianyu Zhao, Yujin Tang Sakana AI, Japan EMAIL |
| Pseudocode | No | The paper describes methods and a framework but does not contain explicitly labeled pseudocode or algorithm blocks. It explains the process in descriptive text and mathematical equations, such as in Section 3 "Reinforcement learning teachers" and Section 3.3 "Evaluating the quality of explanations", but without structured algorithm blocks. |
| Open Source Code | Yes | We share our code and pretrained checkpoints to facilitate future research in RL reasoning and distillation. https://github.com/SakanaAI/RLT |
| Open Datasets | Yes | We train RLTs on the set of questions and solutions selected by Li et al. [12] based on their level of challenge. This dataset comprises less than 17K math and coding problems originally used for distilling filtered and post-processed reasoning traces collected from QwQ [22] and DeepSeek R1 [4]. https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k |
| Dataset Splits | Yes | We collect our distillation dataset with the learned RLTs using the same full set of 17K question-solution pairs from training. With the new reasoning traces, we then proceed to fine-tune our students either on this full data or a randomly sampled 1K subset, equating the distillation budget and following the same recipes as our baselines [6, 12]. We train and test our models on distinct datasets of 16K and 1K automatically-generated question and solution pairs. |
| Hardware Specification | Yes | Our experiments are conducted on a single compute node comprising 8 Nvidia Hopper H100 GPUs, 1.8 TB of memory, and 208 Intel Xeon Platinum 8481C CPUs. |
| Software Dependencies | No | The paper mentions several software components and libraries, such as "Qwen2.5-7B-Instruct LM [23]", "AdamW optimizer [45]", "TRL library [46]", "VLLM generation [47]", "Lighteval [29]", and "GPT4.1-mini [32]". However, it does not provide specific version numbers for these software components or libraries in the text or the tables describing hyperparameters. |
| Experiment Setup | Yes | We train our main models for 125 steps, less than a single epoch, with a batch size of 1024, a constant learning rate of $1 \times 10^{-6}$, and a group size of 64. We provide a full list of hyperparameters to ensure reproducibility in Table 3. |
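
The figures quoted in the Experiment Setup row can be checked against the "less than a single epoch" statement with a short calculation. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes that the batch size of 1024 counts sampled completions, so that each step covers 1024 / 64 = 16 distinct questions. The source does not state this explicitly. The dataset size of 17,000 is the upper bound given in the Open Datasets row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Hypothetical container for values quoted in the table above."""

    train_steps: int = 125
    batch_size: int = 1024
    learning_rate: float = 1e-6
    group_size: int = 64
    dataset_size: int = 17_000  # upper bound: the source says "less than 17K"

    def questions_per_step(self) -> int:
        # ASSUMPTION: batch_size counts completions, with group_size
        # completions sampled per question.
        return self.batch_size // self.group_size

    def questions_seen(self) -> int:
        # Total distinct questions visited if no question repeats.
        return self.train_steps * self.questions_per_step()

    def epoch_fraction(self) -> float:
        # Lower bound on the fraction of one epoch, since the dataset
        # holds fewer than dataset_size items.
        return self.questions_seen() / self.dataset_size


setup = ReportedSetup()
print(setup.questions_per_step())        # 16
print(setup.questions_seen())            # 2000
print(round(setup.epoch_fraction(), 3))  # 0.118
```

Under this reading, 125 steps cover roughly 2,000 questions, about 12% of the dataset, which would be consistent with the reported "less than a single epoch". If the batch size instead counted questions, 125 steps would cover 128,000 questions, well over one epoch.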