Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

SmartRAG: Jointly Learn RAG-Related Tasks From the Environment Feedback

Authors: Jingsheng Gao, Linxu Li, Ke Ji, Weiyuan Li, Yixin Lian, Yuzhuo Fu, Bin Dai

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results demonstrate that the jointly optimized SmartRAG can achieve better performance than separately optimized counterparts. We validate the effectiveness of SmartRAG on various datasets, demonstrating that our SmartRAG outperforms its counterparts with separately optimized modules.
Researcher Affiliation | Collaboration | Shanghai Jiao Tong University; Xiaobing.AI; The Chinese University of Hong Kong, Shenzhen
Pseudocode | Yes | The SmartRAG pipeline is shown in Algorithm 1.
Require: Policy network πθ, Retriever R
1: Input: input question x, retrieval quota N, observations os ← [ ], retrieval count n ← 0
2: while n ≤ N do
3:   if n = N then
4:     a ← πθ([x, os]) s.t. a0 = [ANSWER]
5:   else
6:     a ← πθ([x, os])
7:   n ← n + 1
8:   if a0 = [RETRIEVE] then
9:     o ← R(a1:), os ← [os, o]
10:  else
11:    return a1:
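The control flow of Algorithm 1 can be sketched in plain Python. This is an illustrative rendering, not the authors' code: `policy` and `retrieve` are stand-in stubs, and the `[ANSWER]`/`[RETRIEVE]` strings mirror the paper's special tokens.

```python
def smartrag_answer(question, policy, retrieve, quota=1):
    """Run the retrieve-or-answer loop of Algorithm 1 (illustrative sketch).

    `policy` maps (question, observations) to an action sequence whose first
    element is [ANSWER] or [RETRIEVE]; `retrieve` maps a query to a document.
    """
    observations = []   # "os" in Algorithm 1
    n = 0               # retrieval count
    while n <= quota:
        if n == quota:
            # Retrieval quota exhausted: constrain the policy to answer.
            action = policy(question, observations, force_answer=True)
        else:
            action = policy(question, observations)
        n += 1
        head, body = action[0], action[1:]   # a0 and a1: in the algorithm
        if head == "[RETRIEVE]":
            observations.append(retrieve(body))
        else:
            return body
```

With toy stubs for `policy` and `retrieve`, the loop performs at most `quota` retrievals before it is forced to emit an answer.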
Open Source Code | No | Additionally, we have uploaded our code in the supplementary materials and will open-source it upon acceptance of the paper.
Open Datasets | Yes | Following Ma et al. (2023), we use three open-domain QA datasets for evaluation, namely PopQA (Mallen et al., 2023a), AmbigNQ (Min et al., 2020) and HotpotQA (Yang et al., 2018). Besides, we also use other datasets like TriviaQA (Joshi et al., 2017), OpenBookQA (Mihaylov et al., 2018), MedQA-cn (Jin et al., 2021) and ARC-c (Clark et al., 2018) for more analysis and comparisons. All the details of the related datasets are shown in Appendix A.1.
Dataset Splits | Yes | The scale of the data used in the paper is presented in Table 7. To reduce the number of training iterations, we combined the three datasets for joint training. It is worth noting that this cross-dataset training presents a more challenging task, and we employed multi-dataset training during both the warm-up and PPO training phases. To mitigate the long-tail issues arising from distribution differences among the datasets, we utilized only 20k samples from the HotpotQA dataset to align with the scales of the PopQA and AmbigNQ datasets. Additionally, for the multiple-choice questions, we limited our training to 2k samples from both OpenBookQA and MedQA-en.
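The multi-dataset mixing described above (capping HotpotQA at 20k, multiple-choice sets at 2k, then training jointly) can be sketched as follows. The cap values come from the quoted text; the function name, data layout, and sampling scheme are assumptions for illustration.

```python
import random

def mix_datasets(datasets, caps, seed=0):
    """Subsample each dataset to its cap (if any), then shuffle the union.

    `datasets` maps name -> list of samples; `caps` maps name -> max count.
    Capping larger datasets mitigates long-tail imbalance between sources.
    """
    rng = random.Random(seed)
    mixed = []
    for name, samples in datasets.items():
        cap = caps.get(name)
        if cap is not None and len(samples) > cap:
            samples = rng.sample(samples, cap)  # random subsample to the cap
        mixed.extend(samples)
    rng.shuffle(mixed)  # interleave sources for joint training
    return mixed
```

In the paper's setting this would be called with caps like `{"hotpotqa": 20000, "openbookqa": 2000}`; the toy sizes below just exercise the logic.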
Hardware Specification | Yes | We use 4 Nvidia A100 with 80GB memory to train our models.
Software Dependencies | No | Since our framework employs a single model to handle all tasks within the RAG system, we employed models of varying sizes for training. To further explore the effects of different architectures and model sizes, we conducted experiments on the Flan-T5 series models. In addition, we trained a Llama2-7B model to validate the effectiveness of our framework on a larger model. It is noteworthy that, since we use PPO for training, it requires optimizing the policy model, value model, and reference model simultaneously. To reduce memory consumption and improve training speed, we applied LoRA (Hu et al., 2021) for tuning Llama2-7B.
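The LoRA idea mentioned above can be stated in a few lines: instead of updating a full weight matrix W (d_out × d_in), one trains a low-rank update B @ A with rank r ≪ min(d_out, d_in), scaled by α/r, and merges it into W. This pure-Python sketch shows the arithmetic only; an actual run would apply adapters to transformer layers via a library such as Hugging Face's peft.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W, A, B, alpha=16, r=2):
    """Return W + (alpha / r) * (B @ A), the merged LoRA weight.

    W: d_out x d_in frozen base weight.
    B: d_out x r, A: r x d_in -- the trainable low-rank factors.
    """
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

Training only A and B (plus the value head) is what makes PPO with three model copies fit on the stated hardware; the frozen W is shared.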
Experiment Setup | Yes | Training Details. In the SFT warm-up stage, we finetune the network for 1 epoch with a learning rate of 3e-4. The batch size is set to 8. In the reinforcement learning stage, we use an on-policy sampling strategy. For each iteration, we sample many trajectories such that the total length of all these trajectories is 5120. The policy network is then trained for 1 epoch on these samples with a learning rate of 2e-6 and a batch size of 32. The retrieval penalty α in (5) is set to 0.2 and the retrieval quota N is set to 1 to compare fairly with other baselines.
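The on-policy sampling budget described above (collect whole trajectories until their total length reaches 5120, then run one PPO epoch on them) can be sketched as below. `rollout` is a stand-in stub for one question's retrieve/answer trajectory; the function name and stopping rule past the quoted budget are assumptions.

```python
def collect_trajectories(rollout, budget=5120):
    """Sample whole trajectories until their combined length meets the budget.

    Trajectories are never truncated: the last one may overshoot the budget,
    so the total length is at least `budget`.
    """
    trajectories, total = [], 0
    while total < budget:
        traj = rollout()   # one on-policy rollout of the current policy
        trajectories.append(traj)
        total += len(traj)
    return trajectories
```

Each PPO iteration would then train on this batch for one epoch with the stated learning rate (2e-6) and batch size (32).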