Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Optimizing Retrieval for RAG via Reinforcement Learning

Authors: Jiawei Zhou, Lei Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments across diverse tasks demonstrate that R3 improves RAG performance by 5.2% over the original retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving comparable results to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs. It is both efficient and practical, requiring only 4 GPUs and completing training within a single day.
Researcher Affiliation	Academia	Jiawei Zhou Lei Chen The Hong Kong University of Science and Technology The Hong Kong University of Science and Technology (Guangzhou) EMAIL
Pseudocode	No	The paper describes methods and processes in narrative text and diagrams (Figure 2 and Figure 3) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	No	We plan to release the trained models and code upon acceptance. Documentation will be provided to ensure reproducibility and responsible use.
Open Datasets	Yes	We evaluate R3 on five public RAG benchmarks. For free-form generation, we utilize Natural Questions (NQ; Kwiatkowski et al., 2019), TriviaQA (TQA; Joshi et al., 2017), and HotpotQA [60], three well-established open-domain QA datasets. For closed-set generation, we employ the PubHealth [61] dataset for fact-checking tasks, and the ARC-Challenge [62] dataset for multiple-choice reasoning.
Dataset Splits	Yes	For experiments on NQ, since certain baselines are trained on its training split, we additionally train SIDRMS on NQ before applying our method. By default, we use the same English Wikipedia datastore and prompt as those open-sourced by SELF-RAG, detailed in Appendix K. For HotpotQA, we use the official datastore provided with the dataset. During training, we train the retriever for each dataset for 80 epochs, aligning with the training duration used for SIDRMS.
Hardware Specification	Yes	Our experiments are conducted with 4 NVIDIA GPUs.
Software Dependencies	No	The paper mentions using specific models like 'Llama3-8b' and tools like 'vLLM', but it does not specify version numbers for general software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or CUDA versions.
Experiment Setup	Yes	During training, we train the retriever for each dataset for 80 epochs, aligning with the training duration used for SIDRMS. We use a batch size of 128 and an AdamW optimizer [66] with a learning rate of 2 × 10^-5. The training process is divided into two phases: the first half involves a warm-up phase using initial retrieved positives and negatives, while the second half transitions to in-training retrieval, using the in-training positives and negatives. During inference, we set the maximum number of generated tokens to 100 for free-form generation and 20 for closed-set generation.
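
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is only an illustration of the reported values, not the authors' code: the class name, field names, and the `training_phase` helper are hypothetical, the learning rate is read as 2 × 10^-5, and the assumption that the two phases split exactly at the midpoint of 0-indexed epochs is ours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class R3TrainingConfig:
    """Values quoted in the Experiment Setup row; all names are hypothetical."""

    epochs: int = 80                      # per-dataset retriever training
    batch_size: int = 128
    learning_rate: float = 2e-5           # assumed reading: 2 x 10^-5, with AdamW
    max_new_tokens_free_form: int = 100   # inference, free-form generation
    max_new_tokens_closed_set: int = 20   # inference, closed-set generation


def training_phase(epoch: int, config: R3TrainingConfig) -> str:
    """Return the phase for a 0-indexed epoch.

    Assumption: the first half of the epochs is the warm-up phase (initial
    retrieved positives and negatives) and the second half uses in-training
    retrieval, with the boundary exactly at the midpoint.
    """
    if not 0 <= epoch < config.epochs:
        raise ValueError(f"epoch must be in [0, {config.epochs}), got {epoch}")
    if epoch < config.epochs // 2:
        return "warm_up"
    return "in_training_retrieval"


config = R3TrainingConfig()
phases = [training_phase(e, config) for e in range(config.epochs)]
```

With the defaults above, `phases` holds one label per epoch, so a training loop could look up which source of positives and negatives to use at each step.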