Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search

Authors: Zeyu Shen, Basileal Imana, Tong Wu, Chong Xiang, Prateek Mittal, Aleksandra Korolova

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We present empirical results showing ReliabilityRAG provides superior robustness against adversarial attacks compared to prior methods, maintains high benign accuracy, and excels in long-form generation tasks where prior robustness-focused methods struggled. Our work is a significant step towards more effective, provably robust defenses against retrieved corpus corruption in RAG.
Researcher Affiliation	Collaboration	Zeyu Shen, Department of Computer Science, Princeton University, Princeton, New Jersey, 08540, EMAIL; Chong Xiang, NVIDIA, Santa Clara, California, 95051, EMAIL
Pseudocode	Yes	Algorithm 1: ReliabilityRAG via MIS (ordinal-reliability setting); Algorithm 2: ReliabilityRAG via sample and aggregate (cardinal-reliability setting)
Open Source Code	No	Public datasets are cited, but an anonymized code repository is not yet linked. We plan to release code after acceptance to preserve double-blind review.
Open Datasets	Yes	We evaluate on three open-domain QA datasets: RealtimeQA (RQA) [28], Natural Questions (NQ) [32], TriviaQA (TQA) [27], and a long-form Biography generation dataset (Bio) [31].
Dataset Splits	Yes	We use 100 queries from RQA dataset, randomly draw 500 queries from each of NQ dataset and TQA dataset, and 50 queries from Bio dataset.
Hardware Specification	Yes	We measure end-to-end latency of our approach on one NVIDIA A100 (80GB) using Mistral-7B or Llama3.2-3B for generation and DeBERTa-v3-large-mnli-fever-anli-ling-wanli NLI checker in Table 14.
Software Dependencies	Yes	We run experiments using three LLMs as the generators in our RAG pipelines: Mistral-7B-Instruct-v0.2 [24], Llama3.2-3B-Instruct [40], and GPT-4o-mini [50].
Experiment Setup	Yes	We set temperature to 0 for all experiments. When testing MIS, we use the top k = 10 passages. For Sampling + MIS, we use the top k = 50 documents, since one major motivation for the weighted sample and aggregate framework is scalability. We set context size m = 2 and number of sampling rounds T = 20. For the weights, we use the exponentially decaying weights and set w(x_i) = γ^{i-1}, where γ = 0.9.
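
The Experiment Setup values above can be illustrated with a minimal Python sketch. It assumes the garbled weight formula reads w(x_i) = γ^{i-1} over 1-indexed retrieval ranks, and it assumes each sampling round draws m distinct documents with probability proportional to weight. That sampling scheme and the function names (`decaying_weights`, `sample_context`, `sample_rounds`) are hypothetical, chosen only for illustration; the record does not specify how the paper samples or aggregates, and the MIS step is not shown.

```python
import random

GAMMA = 0.9  # decay factor from the reported setup
K = 50       # top-k documents used for Sampling + MIS
M = 2        # context size per sampling round
T = 20       # number of sampling rounds


def decaying_weights(k, gamma=GAMMA):
    """Return weights gamma**(i - 1) for ranks i = 1..k (assumed formula)."""
    return [gamma ** (i - 1) for i in range(1, k + 1)]


def sample_context(weights, m, rng):
    """Draw m distinct 0-based rank indices, each pick proportional to weight.

    Weighted sampling without replacement is an assumption of this sketch,
    not a detail confirmed by the source.
    """
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(m):
        pick = rng.choices(
            remaining, weights=[weights[j] for j in remaining], k=1
        )[0]
        chosen.append(pick)
        remaining.remove(pick)
    return sorted(chosen)


def sample_rounds(k=K, m=M, t=T, seed=0):
    """Run t sampling rounds and return the sampled contexts."""
    rng = random.Random(seed)
    weights = decaying_weights(k)
    return [sample_context(weights, m, rng) for _ in range(t)]


if __name__ == "__main__":
    for round_index, context in enumerate(sample_rounds(), start=1):
        print(round_index, context)
```

With γ = 0.9 the weight of the rank-50 document is 0.9^49, well under 1% of the top-ranked document's weight, so under this assumed scheme sampled contexts concentrate heavily on the highest-ranked documents.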