Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play
Authors: Ran Xu, Yuchen Zhuang, Zihan Dong, Ruiyu Wang, Yue Yu, Joyce Ho, Linjun Zhang, Haoyu Wang, Wenqi Shi, Carl Yang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three reasoning-intensive tasks across 10 datasets show that AceSearcher outperforms state-of-the-art baselines, achieving an average exact match improvement of 7.6%. |
| Researcher Affiliation | Academia | (1) Emory University, (2) Georgia Institute of Technology, (3) Rutgers University, (4) SUNY Albany, (5) UT Southwestern Medical Center |
| Pseudocode | No | The paper describes the architecture and training process with equations but does not present a structured pseudocode block or algorithm. |
| Open Source Code | Yes | Code: https://github.com/ritaranx/AceSearcher/ We open-source code for reproducing our results in https://github.com/ritaranx/AceSearcher. |
| Open Datasets | Yes | Dataset/Model: https://huggingface.co/AceSearcher... For training data, they are all public available data with open access in https://huggingface.co/AceSearcher/datasets. |
| Dataset Splits | Yes | We conduct evaluations on all questions from StrategyQA and Bamboogle, and the first 500 questions from the development sets of the other datasets following existing studies [68, 57, 36]. For dataset in DocMath-Eval, we use the testmini version as the evaluation set to compare the performance of AceSearcher and baselines. |
| Hardware Specification | Yes | All models are optimized using AdamW with β1 = 0.9 and β2 = 0.98, and experiments are conducted on 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions AdamW as an optimizer, but does not provide specific version numbers for programming languages, libraries, or frameworks used. |
| Experiment Setup | Yes | For SFT, we set the batch size to 64 for every example, and set the learning rate as Table 7. With maximum number of tokens to 2560. We set the hyperparameters to m = 3, m = 4, and t = 1.0 when generating multiple rollouts. Examples with identical maximum and minimum rewards are discarded. For RFT, we use β = 0.1 and run for the DPO for 2 iterations by default. All models are optimized using AdamW with β1 = 0.9 and β2 = 0.98. |
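
The Experiment Setup row states that examples whose rollouts share identical maximum and minimum rewards are discarded before preference optimization. A minimal sketch of that filter is shown below. The function name `select_preference_pair` and the choice to pair the highest-reward rollout with the lowest-reward one are illustrative assumptions, not details confirmed by the paper or its released code.

```python
from typing import Optional, Sequence, Tuple


def select_preference_pair(
    rollouts: Sequence[str], rewards: Sequence[float]
) -> Optional[Tuple[str, str]]:
    """Return a (chosen, rejected) pair, or None if the example is discarded.

    Hypothetical helper: pairing the best rollout with the worst one is an
    assumption made for illustration.
    """
    if not rollouts or len(rollouts) != len(rewards):
        raise ValueError("rollouts and rewards must be non-empty and aligned")

    indices = range(len(rewards))
    best = max(indices, key=lambda i: rewards[i])
    worst = min(indices, key=lambda i: rewards[i])

    # Identical maximum and minimum rewards carry no preference signal,
    # so the example is dropped.
    if rewards[best] == rewards[worst]:
        return None
    return rollouts[best], rollouts[worst]
```

Under these assumptions, an example whose rollouts all score the same yields `None` and contributes no pair to DPO training, while any example with a reward gap yields exactly one pair.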