Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play

Authors: Ran Xu, Yuchen Zhuang, Zihan Dong, Ruiyu Wang, Yue Yu, Joyce Ho, Linjun Zhang, Haoyu Wang, Wenqi Shi, Carl Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on three reasoning-intensive tasks across 10 datasets show that AceSearcher outperforms state-of-the-art baselines, achieving an average exact match improvement of 7.6%.
Researcher Affiliation	Academia	1Emory University 2Georgia Institute of Technology 3Rutgers University 4SUNY Albany 5UT Southwestern Medical Center
Pseudocode	No	The paper describes the architecture and training process with equations but does not present a structured pseudocode block or algorithm.
Open Source Code	Yes	Code: https://github.com/ritaranx/AceSearcher/ We open-source code for reproducing our results at https://github.com/ritaranx/AceSearcher.
Open Datasets	Yes	Dataset/Model: https://huggingface.co/AceSearcher... For training data, they are all publicly available data with open access at https://huggingface.co/AceSearcher/datasets.
Dataset Splits	Yes	We conduct evaluations on all questions from StrategyQA and Bamboogle, and the first 500 questions from the development sets of the other datasets following existing studies [68, 57, 36]. For the datasets in DocMath-Eval, we use the testmini version as the evaluation set to compare the performance of AceSearcher and baselines.
Hardware Specification	Yes	All models are optimized using AdamW with β1 = 0.9 and β2 = 0.98, and experiments are conducted on 8 NVIDIA A100 GPUs.
Software Dependencies	No	The paper mentions AdamW as an optimizer, but does not provide specific version numbers for programming languages, libraries, or frameworks used.
Experiment Setup	Yes	For SFT, we set the batch size to 64 for every example and set the learning rate as in Table 7, with the maximum number of tokens set to 2560. We set the hyperparameters to m = 3, m = 4, and t = 1.0 when generating multiple rollouts. Examples with identical maximum and minimum rewards are discarded. For RFT, we use β = 0.1 and run DPO for 2 iterations by default. All models are optimized using AdamW with β1 = 0.9 and β2 = 0.98.
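
The Experiment Setup row states that examples whose rollouts have identical maximum and minimum rewards are discarded before DPO. The sketch below illustrates one way such a filter could look. It is not taken from the paper's code: the `Rollout` type, the `build_preference_pair` helper, and the choice to pair the highest-reward rollout with the lowest-reward one are all assumptions made for illustration.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Rollout(NamedTuple):
    """Hypothetical container for one sampled rollout and its scalar reward."""
    text: str
    reward: float


def build_preference_pair(
    rollouts: Sequence[Rollout],
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) texts, or None if the example is discarded.

    Assumption: the pair is formed from the highest- and lowest-reward
    rollouts. The source only states that examples with identical maximum
    and minimum rewards are dropped.
    """
    if not rollouts:
        return None
    best = max(rollouts, key=lambda r: r.reward)
    worst = min(rollouts, key=lambda r: r.reward)
    # No preference signal when every rollout scored the same.
    if best.reward == worst.reward:
        return None
    return best.text, worst.text


# Usage: one example with a usable signal, one without.
mixed = [Rollout("a", 1.0), Rollout("b", 0.0), Rollout("c", 0.5)]
flat = [Rollout("x", 1.0), Rollout("y", 1.0)]
pair_mixed = build_preference_pair(mixed)
pair_flat = build_preference_pair(flat)
```

Discarding flat-reward examples keeps only prompts where the rollouts actually disagree in quality, which is what a pairwise objective such as DPO needs to learn from.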