Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Scent of Knowledge: Optimizing Search-Enhanced Reasoning with Information Foraging

Authors: Hongjin Qian, Zheng Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive evaluations across general question answering, multi-hop reasoning tasks, and a newly developed real-time web QA dataset demonstrate InForage's superior performance over baseline methods. These results highlight InForage's effectiveness in building robust, adaptive, and efficient reasoning agents. We provide all codes and datasets in the supplementary materials as well as in this repository.
Researcher Affiliation	Academia	1 Beijing Academy of Artificial Intelligence; 2 Hong Kong Polytechnic University
Pseudocode	No	The paper describes the methodology, reward functions, and optimization process using mathematical formulations (e.g., Eq. 1-9) but does not present any explicit pseudocode or algorithm blocks labeled as such.
Open Source Code	Yes	We provide all codes and datasets in the supplementary materials as well as in this repository. (...) We provide all source codes, datasets and prompts in this repository.
Open Datasets	Yes	Datasets: We evaluate on the following datasets: Natural Questions [Kwiatkowski et al., 2019], TriviaQA [Joshi et al., 2017], PopQA [Mallen et al., 2022], HotpotQA [Yang et al., 2018], 2WikiMultihopQA [Ho et al., 2020], MuSiQue [Trivedi et al., 2022b], Bamboogle [Press et al., 2023], and a self-constructed real-time web QA dataset. (...) We provide all codes and datasets in the supplementary materials as well as in this repository.
Dataset Splits	Yes	The final dataset comprises a structured split of 19,500 training examples and 500 evaluation examples. The 500 evaluation samples denoted as the Self dataset comprise real-time, open-ended web tasks that demand multi-hop reasoning, offering a challenging benchmark for search-enhanced reasoning.
Hardware Specification	Yes	All training and evaluation were conducted using 8 NVIDIA A800-80G GPUs.
Software Dependencies	Yes	We use Qwen-2.5 Instruct models (3B and 7B) as our foundation LLMs for InForage. (...) During inference, we use the E5 encoder [Wang et al., 2024] and the Wikipedia dump from FlashRAG [Jin et al., 2025b] as the retrieval backend for open-domain and multi-hop QA tasks. (...) Retrieval over this corpus is performed using the BGE-M3 retriever, and we set the maximum number of reasoning steps to 6.
Experiment Setup	Yes	These model-generated trajectories are then used to fine-tune the foundation models for two epochs using a learning rate of 1 × 10⁻⁵. Following SFT, we perform RL with PPO over 300 steps, using a learning rate of 1 × 10⁻⁶ and a warm-up ratio of 0.5. (...) Structured rewards (Eq. 8) with α = 0.2 and β = 0.95 are applied only to the self-constructed dataset, as the others lack intermediate traces.
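
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustration only. The class and field names are hypothetical and not taken from the paper's repository. The learning rates are read as 1e-5 (SFT) and 1e-6 (PPO), which is an assumption about the extracted text, and the warm-up step count assumes the ratio applies to the 300 PPO steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTConfig:
    """Supervised fine-tuning settings quoted in the table (names are hypothetical)."""
    epochs: int = 2
    learning_rate: float = 1e-5  # assumed reading of the extracted value


@dataclass(frozen=True)
class PPOConfig:
    """PPO settings quoted in the table (names are hypothetical)."""
    total_steps: int = 300
    learning_rate: float = 1e-6  # assumed reading of the extracted value
    warmup_ratio: float = 0.5
    reward_alpha: float = 0.2    # alpha in the structured reward (Eq. 8)
    reward_beta: float = 0.95    # beta in the structured reward (Eq. 8)

    def warmup_steps(self) -> int:
        # Assumption: the warm-up ratio is a fraction of the total PPO steps.
        return int(self.total_steps * self.warmup_ratio)


if __name__ == "__main__":
    sft, ppo = SFTConfig(), PPOConfig()
    print(f"SFT: {sft.epochs} epochs at lr={sft.learning_rate}")
    print(f"PPO: {ppo.total_steps} steps, {ppo.warmup_steps()} warm-up steps")
```

Under that assumption, a warm-up ratio of 0.5 over 300 steps corresponds to 150 warm-up steps.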