Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Authors: Andrew Zhao, Yiran Wu, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. For all experiments, we initialize the buffers as described in Section 3.1. AZR models are trained using a batch size of 64 * 6 (2 roles * 3 task types).
Researcher Affiliation	Academia	¹Tsinghua University, ²BIGAI, ³Penn State University
Pseudocode	Yes	We showcase an illustration of our Absolute Zero Reasoner approach in Figure 2 and Algorithm 1. Algorithm 1: Self-Play Training of Absolute Zero Reasoner (AZR). Require: Pretrained base LLM π_θ; batch size B; #references K; iterations T
Open Source Code	Yes	To foster further exploration and advancement of this emerging paradigm, we are releasing the code, models, and logs as open-source, encouraging the research community to build upon our findings.
Open Datasets	Yes	For coding tasks, we evaluate using Evalplus [45] on the HumanEval+ and MBPP+ benchmarks [6, 2], as well as LiveCodeBench Generation (v1-5, May 23-Feb 25) [34]. For mathematical reasoning, we utilize six standard benchmarks commonly used in recent zero reasoners: AIME 24, AIME 25, OlympiadBench [25], Minerva [40], Math500 [26], and AMC 23.
Dataset Splits	No	To evaluate our models, we divide the benchmarks into in-distribution (ID) and out-of-distribution (OOD) categories.
Hardware Specification	Yes	All experiments were conducted on clusters of A800 GPUs, each experiment lasts around 3-5 days.
Software Dependencies	No	We built Absolute Zero Reasoner upon the veRL codebase [65]. For code execution, we incorporated components from the QwQ Python executor. Concretely, we used the complexipy and Radon packages [48, 4] to implement the respective metrics.
Experiment Setup	Yes	For all experiments, we initialize the buffers as described in Section 3.1. AZR models are trained using a batch size of 64 * 6 (2 roles * 3 task types). We use constant learning rate = 1e-6 and the AdamW optimizer [49]. Complete list of hyperparameters is provided in Table 4.