Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling
Authors: Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, Yuxiao Dong
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that T1 with open LLMs as its base exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks. More importantly, we present a simple strategy to examine inference scaling, where increased inference budgets directly lead to T1's better performance without any additional verification. The model weights and training data are publicly available at https://github.com/THUDM/T1. ... Experiments show that the T1 models achieve superior performance across all benchmarks. For example, T1 with Qwen-32B as its base can outperform the recent Qwen QwQ-32B-Preview model on MATH500, AIME2024, and Omni-MATH-500. |
| Researcher Affiliation | Collaboration | 1Tsinghua University 2Zhipu AI. Correspondence to: Yuxiao Dong <EMAIL>. |
| Pseudocode | No | The paper describes methods and processes in narrative text, without including any clearly labeled pseudocode blocks or algorithms. |
| Open Source Code | Yes | The model weights and training data are publicly available at https://github.com/THUDM/T1. |
| Open Datasets | Yes | The prompts used in the training data all come from publicly available datasets, including MATH-train (Hendrycks et al., 2021) and NuminaMath (Li et al., 2024b). ... We evaluate the performance on widely-used math reasoning benchmarks AIME, Omni-MATH (Gao et al., 2024), MATH (Hendrycks et al., 2021), and GPQA (Rein et al., 2023). |
| Dataset Splits | Yes | We split around 12k for the SFT stage and the others for RL training. To prepare the data for reinforcement learning, we convert the original instances into (Question, Label) pairs through the following two steps: The first step is answer extraction. ... For each question, we generate 16 responses and retain only those instances whose pass rate lies in the interval (0, δ) (where δ = 0.3 in our experiments). Finally, we obtained around 30k instances usable for RL training. ... For MATH, we assess performance on a subset of the MATH test set, referred to as MATH500, following the predefined split in Lightman et al. For Omni-MATH, we sample a 500-example evaluation subset, called Omni-MATH-500, for efficient yet comprehensive evaluation. GPQA consists of graduate-level problems in biology, physics, and chemistry. For AIME, we use the official questions released for the year 2024, which consist of 30 problems. We evaluate each model 32 times on AIME to get stable results and report the average performance. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU or CPU models. |
| Software Dependencies | No | For all datasets, we use the greedy sampling strategy with SGLang (Zheng et al., 2023) as the inference engine. While SGLang is mentioned, no specific version number is provided, nor are other key software dependencies with versions. |
| Experiment Setup | Yes | For SFT, we train the models for three epochs using a learning rate of 1e-5 with cosine decay scheduling. For RL training, we sample 64 responses for each prompt and perform policy gradient descent for every 32 prompts. We train the model with a 1.5e-6 learning rate and the KL coefficient set to 2e-4. For the reward function, we use the ground truth, i.e., the correctness of the response, as the metric, assigning a reward of 1 for correct answers and 0 for incorrect ones. Although using a trained reward model is generally considered a superior approach due to its ease of optimization, we find that using the correctness of the response as the reward also performs well for reasoning tasks and helps mitigate issues such as data distribution shifts and reward hacking. If not specified, the max generation length for training and inference is set to 10,240 for GLM-4-9B and Qwen2.5-14B models and 16,384 for Qwen2.5-32B models. |
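The pass-rate filter quoted under Dataset Splits (sample 16 responses per question, keep only instances with pass rate in the open interval (0, δ), δ = 0.3) can be sketched as follows. This is an illustrative reconstruction, not the authors' released code; `sample_fn` and `check_fn` are hypothetical stand-ins for the model's sampler and the answer checker.

```python
# Sketch of the pass-rate filter described in the report: for each question,
# sample n responses and keep the instance only if 0 < pass rate < delta.
from typing import Callable, List


def filter_by_pass_rate(
    questions: List[dict],                  # each: {"question": ..., "label": ...}
    sample_fn: Callable[[str], str],        # hypothetical: draws one model response
    check_fn: Callable[[str, str], bool],   # hypothetical: is the response correct?
    n_samples: int = 16,
    delta: float = 0.3,
) -> List[dict]:
    kept = []
    for item in questions:
        passes = sum(
            check_fn(sample_fn(item["question"]), item["label"])
            for _ in range(n_samples)
        )
        rate = passes / n_samples
        # Keep only questions that are neither trivially solved (rate >= delta
        # would include too-easy items at rate == 1) nor never solved (rate == 0).
        if 0 < rate < delta:
            kept.append(item)
    return kept
```

This kind of filter concentrates RL training on questions the base model sometimes, but rarely, solves, so the binary reward signal is informative.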
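The rule-based reward quoted under Experiment Setup (1 for a correct answer, 0 otherwise, with no learned reward model) reduces to a one-line comparison. The sketch below is a minimal stand-in: exact string matching is an assumption for illustration, whereas a real checker would normalize mathematical expressions before comparing.

```python
# Minimal sketch of a binary correctness reward: 1.0 if the extracted final
# answer matches the ground-truth label, else 0.0. Exact string matching is a
# simplifying assumption; real math-answer checking normalizes expressions.
def correctness_reward(response_answer: str, ground_truth: str) -> float:
    return 1.0 if response_answer.strip() == ground_truth.strip() else 0.0
```

A binary ground-truth reward of this form avoids the reward hacking and distribution-shift issues the authors attribute to trained reward models.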