Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Influence Guided Context Selection for Effective Retrieval-Augmented Generation

Authors: Jiale Deng, Yanyan Shen, Ziyuan Pei, Youmin Chen, Linpeng Huang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments across 8 NLP tasks and multiple LLMs demonstrate that our context selection method significantly outperforms state-of-the-art baselines, effectively filtering poor-quality contexts while preserving critical information.
Researcher Affiliation	Academia	Jiale Deng, Yanyan Shen, Ziyuan Pei, Youmin Chen, Linpeng Huang; Shanghai Jiao Tong University; EMAIL
Pseudocode	No	The paper includes architectural diagrams and loss functions but no structured pseudocode or algorithm blocks.
Open Source Code	Yes	Code is available at https://github.com/SJTU-DMTai/RAG-CSM.
Open Datasets	Yes	Tasks and Datasets. We consider the following knowledge-intensive NLP tasks: (1) Open-Domain QA, including NQ [23], TriviaQA [21] and WebQA [3]. (2) Multihop QA that requires multi-step reasoning to generate answers, including HotpotQA [52] and 2WikiMultiHopQA [15]. (3) Fact Checking dataset FEVER [43] that challenges the model to use complex reasoning to determine the factual accuracy of given claims. (4) Multiple Choice dataset TruthfulQA [25]. (5) Long-Form QA dataset ASQA [39] that requires generating long and abstract answers given the question. Following [19, 47], we report Exact Match (EM) for Open-Domain QA datasets, F1 for Multihop QA and Long-Form QA datasets, and Accuracy for Fact Checking and Multiple Choice datasets.
Dataset Splits	Yes	For datasets without a provided test set, we utilize the development set as the test set and perform a split on the training set, allocating 80% as training set and 20% as dev set.
Hardware Specification	Yes	We conducted all the experiments on a server equipped with Montage Jintide(R) C6226R CPU, 256GB Memory, and 4 Nvidia GeForce RTX 4090 GPUs.
Software Dependencies	Yes	Our model architecture consists of three main components: (1) a pretrained BERT-uncased [10] model serving as the local layer... We use Llama3-8b-instruct [1] and Qwen2.5-7b-instruct [42] as the LLM generators. All experiments are conducted with a fixed random seed of 2024 for reproducibility. ... We use E5-base-v2 [45] as the dense retriever...
Experiment Setup	Yes	Hyperparameter setting. In our experiments, we employ the following hyperparameters: for supervised training, we set τ = 1 and β = 0.1 and train CSM for 10 epochs with a batch size of 16; for end-to-end training, we set λ = 1 and train CSM for 10 epochs with a batch size of 4. All experiments are conducted with a fixed random seed of 2024 for reproducibility.
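
The Dataset Splits and Experiment Setup rows report an 80%/20% train/dev split of the training set, a fixed random seed of 2024, and the hyperparameters for both training modes. The following is a minimal sketch of how such a seeded split and configuration might be written. It is not taken from the authors' repository: the function name `split_train_dev`, the dictionary key names, and the choice to shuffle before splitting are all assumptions made for illustration.

```python
import random

SEED = 2024  # fixed random seed reported in the paper


def split_train_dev(examples, train_fraction=0.8, seed=SEED):
    """Return (train, dev) lists from a seeded shuffle of `examples`.

    Assumption: the paper does not say whether the split is shuffled;
    this sketch shuffles a copy with a local, seeded generator.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Values as reported in the Experiment Setup row; key names are hypothetical.
SUPERVISED_CONFIG = {"tau": 1.0, "beta": 0.1, "epochs": 10, "batch_size": 16}
END_TO_END_CONFIG = {"lambda": 1.0, "epochs": 10, "batch_size": 4}

train, dev = split_train_dev(range(100))
print(len(train), len(dev))  # 80 20
```

Using a local `random.Random(seed)` instance rather than the global generator keeps the split identical across runs regardless of what other code draws random numbers.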