Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Reinforcement Learning Teachers of Test Time Scaling

Authors: Edoardo Cetin, Tianyu Zhao, Yujin Tang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We show how distilling the raw outputs of a 7B RLT directly outperforms training students with carefully postprocessed reasoning traces from orders of magnitude larger LMs. We demonstrate that RLTs also allow for better cold-starts for traditional RL, effective distillation to larger students, and even zero-shot transfer to new reasoning domains.
Researcher Affiliation	Industry	Edoardo Cetin, Tianyu Zhao, Yujin Tang; Sakana AI, Japan
Pseudocode	No	The paper describes methods and a framework but does not contain explicitly labeled pseudocode or algorithm blocks. It explains the process in descriptive text and mathematical equations, such as in Section 3 "Reinforcement learning teachers" and Section 3.3 "Evaluating the quality of explanations", but without structured algorithm blocks.
Open Source Code	Yes	We share our code and pretrained checkpoints¹ to facilitate future research in RL reasoning and distillation. ¹ https://github.com/SakanaAI/RLT
Open Datasets	Yes	We train RLTs on the set of questions and solutions selected by Li et al. [12] based on their level of challenge. This dataset comprises less than 17K math and coding problems originally used for distilling filtered and post-processed reasoning traces collected from QwQ [22] and DeepSeek R1 [4]. ² https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k
Dataset Splits	Yes	We collect our distillation dataset with the learned RLTs using the same full set of 17K question-solution pairs from training. With the new reasoning traces, we then proceed to fine-tune our students either on this full data or a randomly sampled 1K subset, equating the distillation budget and following the same recipes as our baselines [6, 12]. We train and test our models on distinct datasets of 16K and 1K automatically-generated question and solution pairs.
Hardware Specification	Yes	Our experiments are conducted on a single compute node comprising 8 Nvidia Hopper H100 GPUs, 1.8 TB of memory, and 208 Intel Xeon Platinum 8481C CPUs.
Software Dependencies	No	The paper mentions several software components and libraries, such as "Qwen2.5-7B-Instruct LM [23]", "AdamW optimizer [45]", "TRL library [46]", "VLLM generation [47]", "Lighteval [29]", and "GPT4.1-mini [32]". However, it does not provide specific version numbers for these software components or libraries in the text or the tables describing hyperparameters.
Experiment Setup	Yes	We train our main models for 125 steps, less than a single epoch, with a batch size of 1024, a constant learning rate of 1 × 10^-6, and a group size of 64. We provide a full list of hyperparameters to ensure reproducibility in Table 3.
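
The Experiment Setup row states that 125 steps at batch size 1024 amount to less than one epoch over a dataset of under 17K problems. One reading that makes these numbers consistent, offered here only as an assumption and not confirmed by the excerpt, is that the batch size counts sampled completions and each prompt contributes one group of 64 completions. The helper name `prompts_seen` and the 17,000 upper bound are illustrative choices, not values taken from the paper's code.

```python
# Values quoted in the Experiment Setup row.
TRAIN_STEPS = 125
BATCH_SIZE = 1024
GROUP_SIZE = 64
# Upper bound taken from the Open Datasets row ("less than 17K" problems).
DATASET_SIZE_UPPER_BOUND = 17_000


def prompts_seen(steps: int, batch_size: int, group_size: int) -> int:
    """Number of prompts consumed during training.

    ASSUMPTION: batch_size counts completions, and every prompt
    contributes exactly group_size completions per step.
    """
    if batch_size % group_size != 0:
        raise ValueError("batch_size must be a multiple of group_size")
    prompts_per_step = batch_size // group_size
    return steps * prompts_per_step


seen = prompts_seen(TRAIN_STEPS, BATCH_SIZE, GROUP_SIZE)
# Approximate share of one epoch, relative to the 17K upper bound.
fraction = seen / DATASET_SIZE_UPPER_BOUND
print(seen, round(fraction, 3))
```

Under this assumption the run touches 2,000 prompts, roughly an eighth of the dataset, which agrees with the "less than a single epoch" statement. If the batch size instead counted prompts, 125 steps would cover 128,000 prompts, which is several epochs.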