Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling
Authors: Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, Yuxiao Dong
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that T1 with open LLMs as its base exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks. More importantly, we present a simple strategy to examine inference scaling, where increased inference budgets directly lead to T1's better performance without any additional verification. The model weights and training data are publicly available at https://github.com/THUDM/T1. ... Experiments show that the T1 models achieve superior performance across all benchmarks. For example, T1 with Qwen-32B as its base can outperform the recent Qwen QwQ-32B-Preview model on MATH500, AIME2024, and Omni-MATH-500. |
| Researcher Affiliation | Collaboration | 1Tsinghua University 2Zhipu AI. Correspondence to: Yuxiao Dong <EMAIL>. |
| Pseudocode | No | The paper describes methods and processes in narrative text, without including any clearly labeled pseudocode blocks or algorithms. |
| Open Source Code | Yes | The model weights and training data are publicly available at https://github.com/THUDM/T1. |
| Open Datasets | Yes | The prompts used in the training data all come from publicly available datasets, including MATH-train (Hendrycks et al., 2021) and NuminaMath (Li et al., 2024b). ... We evaluate the performance on widely-used math reasoning benchmarks AIME, Omni-MATH (Gao et al., 2024), MATH (Hendrycks et al., 2021), and GPQA (Rein et al., 2023). |
| Dataset Splits | Yes | We split around 12k for the SFT stage and the others for RL training. To prepare the data for reinforcement learning, we convert the original instances into (Question, Label) pairs through the following two steps: The first step is answer extraction. ... For each question, we generate 16 responses and retain only those instances whose pass rate lies in the interval (0, δ) (where δ = 0.3 in our experiments). Finally, we obtained around 30k instances usable for RL training. ... For MATH, we assess performance on a subset of the MATH test set, referred to as MATH500, following the predefined split in Lightman et al. For Omni-MATH, we sample a 500-example evaluation subset, called Omni-MATH-500, for efficient yet comprehensive evaluation. GPQA consists of graduate-level problems in biology, physics, and chemistry. For AIME, we use the official questions released for the year 2024, which consist of 30 problems. We evaluate each model 32 times on AIME to get stable results and report the average performance. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU or CPU models. |
| Software Dependencies | No | For all datasets, we use the greedy sampling strategy with SGLang (Zheng et al., 2023) as the inference engine. While SGLang is mentioned, no specific version number is provided, nor are other key software dependencies with versions. |
| Experiment Setup | Yes | For SFT, we train the models for three epochs using a learning rate of 1e-5 with cosine decay scheduling. For RL training, we sample 64 responses for each prompt and perform policy gradient descent for every 32 prompts. We train the model with a 1.5e-6 learning rate and the KL coefficient set to 2e-4. For the reward function, we use the ground truth, i.e., the correctness of the response, as the metric, assigning a reward of 1 for correct answers and 0 for incorrect ones. Although using a trained reward model is generally considered a superior approach due to its ease of optimization, we find that using the correctness of the response as the reward also performs well for reasoning tasks and helps mitigate issues such as data distribution shifts and reward hacking. If not specified, the max generation length for training and inference is set to 10,240 for GLM-4-9B and Qwen2.5-14B models and 16,384 for Qwen2.5-32B models. |
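The pass-rate filter quoted under Dataset Splits (sample 16 responses per question, keep only instances with pass rate in the open interval (0, δ), δ = 0.3) can be sketched as follows. This is an illustrative reconstruction, not the authors' released code; `sample_fn` and `check_fn` are hypothetical stand-ins for the model's sampler and the answer checker.

```python
# Sketch of the pass-rate filter described in the report: for each question,
# sample n responses and keep the instance only if 0 < pass rate < delta.
from typing import Callable, List


def filter_by_pass_rate(
    questions: List[dict],                  # each: {"question": ..., "label": ...}
    sample_fn: Callable[[str], str],        # hypothetical: draws one model response
    check_fn: Callable[[str, str], bool],   # hypothetical: is the response correct?
    n_samples: int = 16,
    delta: float = 0.3,
) -> List[dict]:
    kept = []
    for item in questions:
        passes = sum(
            check_fn(sample_fn(item["question"]), item["label"])
            for _ in range(n_samples)
        )
        rate = passes / n_samples
        # Keep only questions that are neither trivially solved (rate >= delta
        # would include too-easy items at rate == 1) nor never solved (rate == 0).
        if 0 < rate < delta:
            kept.append(item)
    return kept
```

This kind of filter concentrates RL training on questions the base model sometimes, but rarely, solves, so the binary reward signal is informative.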
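The rule-based reward quoted under Experiment Setup (1 for a correct answer, 0 otherwise, with no learned reward model) reduces to a one-line comparison. The sketch below is a minimal stand-in: exact string matching is an assumption for illustration, whereas a real checker would normalize mathematical expressions before comparing.

```python
# Minimal sketch of a binary correctness reward: 1.0 if the extracted final
# answer matches the ground-truth label, else 0.0. Exact string matching is a
# simplifying assumption; real math-answer checking normalizes expressions.
def correctness_reward(response_answer: str, ground_truth: str) -> float:
    return 1.0 if response_answer.strip() == ground_truth.strip() else 0.0
```

A binary ground-truth reward of this form avoids the reward hacking and distribution-shift issues the authors attribute to trained reward models.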