Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Bootstrapping

Authors: Pu Yang, Yunzhen Feng, Ziyuan Chen, Yuhang Wu, Zhuoyuan Li

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Building on these theoretical insights, we validate our findings with two experiments: an image-denoising task using diffusion probabilistic models (DPMs), and a math-reasoning task with large language models (LLMs). Across these experiments, exponential and linear (polynomial) growth policies outperform constant policies, with exponential policies often providing more stable performance.
Researcher Affiliation	Academia	Pu Yang (Peking University); Yunzhen Feng (New York University); Ziyuan Chen (Peking University); Yuhang Wu (UC Berkeley); Zhuoyuan Li (National University of Singapore)
Pseudocode	Yes	The algorithm framework is formalized in Algorithm 1. In the selection step, the data may be selected with noise. We present a simple form here for ease of understanding, while our results extend to the noisy case in Appendix B.5.
Open Source Code	Yes	We have included our codes for all the experiments in the supplemental materials. All the data are open-sourced. (...) We provide the code for all the experiments in the GitHub repository: https://github.com/zylipku/spend-wisely.
Open Datasets	Yes	Experiment | Setup Type | Generator | Reward | Dataset; Image Denoising | Theoretical Validation | DPM | PSNR | MNIST (Deng, 2012); Math Reasoning | Practical Scenario | LLM | Accuracy | GSM-Symbolic (Mirzadeh et al., 2025)
Dataset Splits	Yes	We fine-tune a pre-trained diffusion model for denoising on the MNIST (Deng, 2012) dataset, and readers may refer to Appendix C for more details. (...) For evaluation, the model is tested on held-out datasets from all three difficulty levels.
Hardware Specification	Yes	Each configuration requires around 100 GPU-hours on an NVIDIA A800 cluster. (...) Each configuration requires approximately 400 GPU-hours on an NVIDIA A800 cluster.
Software Dependencies	No	Our implementation builds on the public OpenRLHF framework (Hu et al., 2024). The paper does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA.
Experiment Setup	Yes	For all denoising experiments, we fix batch size B = 640 and learning rate 5e-5. (...) Hyperparameters are as follows: one epoch per iteration on the selected data; a constant learning rate of 10^-7; batch size B = 256; and roughly 1,000 total generator update steps. During generation we use temperature 0.3 for diversity, and temperature 0 at evaluation for accuracy. The maximum generation length is 512.
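
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch for anyone attempting a rerun. This is an illustration only: the class names, field names, and the `approx_samples_seen` helper are hypothetical and are not taken from the authors' repository, and the sample-count estimate assumes each update step consumes exactly one batch, which the quoted text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DenoisingSetup:
    """Values quoted for the image-denoising (DPM) experiments."""
    batch_size: int = 640
    learning_rate: float = 5e-5


@dataclass(frozen=True)
class MathReasoningSetup:
    """Values quoted for the math-reasoning (LLM) experiments."""
    epochs_per_iteration: int = 1          # one epoch per iteration on the selected data
    learning_rate: float = 1e-7            # constant learning rate
    batch_size: int = 256
    approx_total_update_steps: int = 1000  # "roughly 1,000" in the quoted text
    generation_temperature: float = 0.3    # used during generation, for diversity
    evaluation_temperature: float = 0.0    # used at evaluation, for accuracy
    max_generation_length: int = 512

    @property
    def approx_samples_seen(self) -> int:
        # Assumption (not from the paper): one batch is consumed per update step.
        return self.batch_size * self.approx_total_update_steps


# Hypothetical module-level instances for convenience.
DENOISING = DenoisingSetup()
MATH_REASONING = MathReasoningSetup()
```

Under the one-batch-per-step assumption, `MATH_REASONING.approx_samples_seen` gives a rough order of magnitude for the training samples consumed across all iterations; the exact figure should be checked against the paper and the released code.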