Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
Authors: Andrew Zhao, Yiran Wu, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. For all experiments, we initialize the buffers as described in Section 3.1. AZR models are trained using a batch size of 64 * 6 (2 roles * 3 task types). |
| Researcher Affiliation | Academia | ¹Tsinghua University, ²BIGAI, ³Penn State University |
| Pseudocode | Yes | We showcase an illustration of our Absolute Zero Reasoner approach in Figure 2 and Algorithm 1. Algorithm 1 Self-Play Training of Absolute Zero Reasoner (AZR) Require: Pretrained base LLM πθ; batch size B; #references K; iterations T |
| Open Source Code | Yes | To foster further exploration and advancement of this emerging paradigm, we are releasing the code, models, and logs as open-source, encouraging the research community to build upon our findings. |
| Open Datasets | Yes | For coding tasks, we evaluate using Evalplus [45] on the HumanEval+ and MBPP+ benchmarks [6, 2], as well as LiveCodeBench Generation (v1-5, May 23-Feb 25) [34]. For mathematical reasoning, we utilize six standard benchmarks commonly used in recent zero reasoners: AIME 24, AIME 25, OlympiadBench [25], Minerva [40], Math500 [26], and AMC 23. |
| Dataset Splits | No | To evaluate our models, we divide the benchmarks into in-distribution (ID) and out-of-distribution (OOD) categories. |
| Hardware Specification | Yes | All experiments were conducted on clusters of A800 GPUs, each experiment lasts around 3-5 days. |
| Software Dependencies | No | We built Absolute Zero Reasoner upon the veRL codebase [65]. For code execution, we incorporated components from the QwQ Python executor. Concretely, we used the complexipy and Radon packages [48, 4] to implement the respective metrics. |
| Experiment Setup | Yes | For all experiments, we initialize the buffers as described in Section 3.1. AZR models are trained using a batch size of 64 * 6 (2 roles * 3 task types). We use constant learning rate = 1e-6 and the AdamW optimizer [49]. Complete list of hyperparameters is provided in Table 4. |
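
The batch-size arithmetic quoted in the Experiment Setup row can be made concrete with a small sketch. This is only an illustration of the quoted numbers, not the authors' configuration format: the class name `AZRSetup` and all field names are hypothetical, and the remaining hyperparameters (Table 4 of the paper) are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AZRSetup:
    """Hypothetical container for the values quoted in the Experiment Setup row."""

    # Values below come from the quoted text; the field names are assumptions.
    batch_per_role_and_task: int = 64  # the "64" in "64 * 6"
    num_roles: int = 2                 # "2 roles"
    num_task_types: int = 3            # "3 task types"
    learning_rate: float = 1e-6        # constant learning rate
    optimizer: str = "AdamW"

    @property
    def total_batch_size(self) -> int:
        # 64 * (2 roles * 3 task types) = 64 * 6
        return self.batch_per_role_and_task * self.num_roles * self.num_task_types


setup = AZRSetup()
print(setup.total_batch_size)  # 384
```

Under this reading, each training step processes 384 samples in total, split evenly across the six role and task-type combinations.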