Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

TTRL: Test-Time Reinforcement Learning

Authors: Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, Bowen Zhou

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 211% on the AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the maj@n metric, TTRL has demonstrated performance to consistently surpass the upper limit of the initial model maj@n, and approach the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks and highlight TTRL's potential for broader tasks and domains.
Researcher Affiliation | Academia | 1 Tsinghua University, 2 Shanghai AI Lab; EMAIL, EMAIL
Pseudocode | Yes | Appendix C presents the pseudo-code of the reward function.
Open Source Code | Yes | Code: https://github.com/PRIME-RL/TTRL
Open Datasets | Yes | Benchmarks: We evaluate TTRL on GPQA-Diamond [35], a challenging and high-quality subset of the Graduate-Level Google-Proof Question Answering benchmark, and 3 mathematical reasoning benchmarks: AIME 2024 [21], AMC [21], and MATH-500 [14].
Dataset Splits | Yes | We apply TTRL to each benchmark individually and then evaluate. We set the maximum generation length to 3072 tokens, unless otherwise specified. For the main experiments, following DeepSeek-R1 [11], we adopt the pass@k evaluation protocol [3] and calculate pass@1 using non-zero temperature sampling. Specifically, we generate 16 responses (4 for 32k context) per question using a temperature of 0.6 and a top-p value of 0.95. For the analysis and additional experiments on Qwen2.5-MATH, we evaluate using greedy decoding to report pass@1, to ensure a fair comparison with previous works. Appendix E presents a set of training-time metrics we used to monitor the performance of TTRL and analyze its training dynamics in the absence of ground-truth.
Hardware Specification | Yes | All experiments were conducted on 8 × NVIDIA A100 80GB GPUs.
Software Dependencies | No | We independently apply GRPO [38] on each benchmark to implement TTRL. For hyperparameters, we use a cosine learning rate schedule with a peak value of 5 × 10^−7 and adopt the AdamW optimizer for the policy model.
Experiment Setup | Yes | For hyperparameters, we use a cosine learning rate schedule with a peak value of 5 × 10^−7 and adopt the AdamW optimizer for the policy model. For rollout, we sample 64 responses using a temperature of 0.6 (1.0 for Qwen2.5-Math and LRMs) for voting-based label estimation and downsample 32 responses per prompt for training. The maximum generation length is set to 32,768 tokens for LRMs and 3,072 tokens for all other models. We set the number of episodes to 10, 30, and 80 for MATH-500, AMC, and AIME 2024, respectively, based on the dataset size.
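
The rollout procedure quoted above (sample 64 responses, estimate a label by voting, downsample to 32 for training) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the binary 1/0 reward, and the uniform random downsampling are assumptions made for illustration, and the answers are assumed to be already extracted from the model responses as strings. The reward pseudo-code in Appendix C of the paper and the linked repository remain the authoritative reference.

```python
import random
from collections import Counter


def estimate_label(answers):
    """Return the most frequent answer among the sampled answers (majority vote)."""
    counts = Counter(answers)
    label, _ = counts.most_common(1)[0]
    return label


def majority_vote_rewards(answers):
    """Assumed binary reward: 1.0 if an answer matches the voted label, else 0.0."""
    label = estimate_label(answers)
    rewards = [1.0 if a == label else 0.0 for a in answers]
    return label, rewards


def downsample(answers, rewards, k, seed=0):
    """Keep k (answer, reward) pairs chosen uniformly at random (an assumption)."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(answers)), k)
    return [answers[i] for i in idx], [rewards[i] for i in idx]


# Hypothetical example: 64 sampled answers to one unlabeled question.
sampled = ["42"] * 40 + ["41"] * 24
voted_label, all_rewards = majority_vote_rewards(sampled)
train_answers, train_rewards = downsample(sampled, all_rewards, k=32)
```

Here the rewards are computed over all 64 samples before downsampling, so the voted label reflects the full set even though only 32 responses are kept for the policy update.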