Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers

Authors: Zhengliang Shi, Lingyong Yan, Dawei Yin, Suzan Verberne, Maarten de Rijke, Zhaochun Ren

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on four benchmarks show that EXSEARCH outperforms baselines substantially, e.g., +7.8% improvement on exact match score. Extensive experiments on a wide range of knowledge-intensive benchmarks demonstrate the improvement of EXSEARCH over strong baselines. Table 1: Comparison between our proposed EXSEARCH and baselines... Table 2: Recall@K (K=3,5) for our method... Table 3: Ablation study where we remove each component from the vanilla EXSEARCH. Figure 1: Performance on HotpotQA dataset when applying our EXSEARCH to different LLMs.
Researcher Affiliation	Collaboration	(1) Shandong University, Qingdao, China; (2) Baidu Inc., Beijing, China; (3) Leiden University, Leiden, The Netherlands; (4) University of Amsterdam, Amsterdam, The Netherlands
Pseudocode	Yes	Algorithm 1: Training process in EXSEARCH, which alternates between the E-step and M-step. Input: Initial LLM θ_0; Training data D = {(x_i, y_i)}_{i=1}^{N}; Training iteration N; Maximal step T.
Open Source Code	Yes	Code is available on EXSEARCH. The code and data in this work are well documented in open-source Github.
Open Datasets	Yes	We conduct experiments on a range of well-established benchmarks: Natural Questions (NQ) [36], HotpotQA [84], MuSiQue [70], and 2WikiMultihopQA (2WikiQA) [20]. Table 5 in Appendix D summarizes their key statistics. We use the Wikipedia passage dump from December 20, 2018, as the retrieval corpus.
Dataset Splits	Yes	Table 5: Statistics of our experimental datasets, where we provide the amount of training and evaluation dataset, the average length of input query (word) as well as the retrieval corpus. Columns: dataset; training examples; avg. training query length; evaluation examples; avg. evaluation query length; retrieval corpus. Natural Questions [36]; 58,622; 9.21; 6,489; 9.16; Wiki2018. HotpotQA [84]; 90,185; 17.85; 7,384; 15.63; Wiki2018. MuSiQue QA [70]; 19,938; 15.96; 2,417; 18.11; Wiki2018. 2WikiMultiHopQA [20]; 167,454; 12.74; 12,576; 11.97; Wiki2018.
Hardware Specification	No	The paper does not explicitly state specific hardware details such as GPU models, CPU models, or memory amounts used for running the experiments. It mentions BF16 mixed-precision training but without specifying the hardware capable of this.
Software Dependencies	Yes	We use DeepSpeed ZeRO 3 [56] with a learning rate of 2 × 10^-6. We use the Wikipedia passage dump from December 20, 2018, as the retrieval corpus and adopt ColBERTv2.0 [59] for document retrieval.
Experiment Setup	Yes	Table 6: Experimental settings for model training. Columns: model; batch size; learning rate; cutoff length; scheduler; gradient accumulation. Qwen-2.5-3B-instruct; 4; 2 × 10^-6; 8192 tokens; Cosine; 16. During the training stage, we trained the models with a learning rate of 2 × 10^-6, using DeepSpeed ZeRO 3 for efficient distributed optimization. The batch size was set to 4 for the 3B, 7B, and 8B models, and reduced to 2 for the 24B model due to memory constraints. We applied a linear warm-up (10% of total steps), followed by a cosine learning rate scheduler. All experiments used BF16 mixed-precision training with a sequence length cutoff of 8192 tokens.
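
Illustration (not from the paper): the Experiment Setup row describes a linear warm-up over 10% of the total steps followed by a cosine scheduler, with a learning rate of 2 × 10^-6. The sketch below shows one common way such a schedule is computed. It is not the authors' code: the function name `lr_at_step`, the 0-indexed step convention, the decay floor `min_lr = 0.0`, and reading 2 × 10^-6 as the peak rate are all assumptions made here, since the quoted text does not specify them.

```python
import math

PEAK_LR = 2e-6          # learning rate quoted in the Experiment Setup row (assumed to be the peak)
WARMUP_FRACTION = 0.10  # "linear warm-up (10% of total steps)"


def lr_at_step(step: int, total_steps: int, peak_lr: float = PEAK_LR,
               warmup_fraction: float = WARMUP_FRACTION,
               min_lr: float = 0.0) -> float:
    """Learning rate at a 0-indexed optimizer step (hypothetical helper).

    min_lr is an assumption: the quoted setup does not state a decay floor.
    """
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr towards min_lr over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # illustrative step count; the paper's total is not quoted here
    for s in (0, 99, 100, 550, 999):
        print(s, lr_at_step(s, total))
```

Training frameworks differ in how they index steps and whether the rate decays to zero, so the exact values in the authors' runs may not match this sketch.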