Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search

Authors: Zeyu Shen, Basileal Imana, Tong Wu, Chong Xiang, Prateek Mittal, Aleksandra Korolova

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present empirical results showing ReliabilityRAG provides superior robustness against adversarial attacks compared to prior methods, maintains high benign accuracy, and excels in long-form generation tasks where prior robustness-focused methods struggled. Our work is a significant step towards more effective, provably robust defenses against retrieved corpus corruption in RAG.
Researcher Affiliation | Collaboration | Zeyu Shen Department of Computer Science Princeton University Princeton, New Jersey, 08540 EMAIL; Chong Xiang NVIDIA Santa Clara, California, 95051 EMAIL
Pseudocode | Yes | Algorithm 1: ReliabilityRAG via MIS (ordinal-reliability setting); Algorithm 2: ReliabilityRAG via sample and aggregate (cardinal-reliability setting)
Open Source Code | No | Public datasets are cited, but an anonymized code repository is not yet linked. We plan to release code after acceptance to preserve double-blind review.
Open Datasets | Yes | We evaluate on three open-domain QA datasets: RealtimeQA (RQA) [28], Natural Questions (NQ) [32], TriviaQA (TQA) [27], and a long-form Biography generation dataset (Bio) [31].
Dataset Splits | Yes | We use 100 queries from RQA dataset, randomly draw 500 queries from each of NQ dataset and TQA dataset, and 50 queries from Bio dataset.
Hardware Specification | Yes | We measure end-to-end latency of our approach on one NVIDIA A100 (80GB) using Mistral-7B or Llama3.2-3B for generation and DeBERTa-v3-large-mnli-fever-anli-ling-wanli NLI checker in Table 14.
Software Dependencies | Yes | We run experiments using three LLMs as the generators in our RAG pipelines: Mistral-7B-Instruct-v0.2 [24], Llama3.2-3B-Instruct [40], and GPT-4o-mini [50].
Experiment Setup | Yes | We set temperature to 0 for all experiments. When testing MIS, we use the top k = 10 passages. For Sampling + MIS, we use the top k = 50 documents, since one major motivation for the weighted sample and aggregate framework is scalability. We set context size m = 2 and number of sampling rounds T = 20. For the weights, we use the exponentially decaying weights and set $w(x_i) = \gamma^{i-1}$, where γ = 0.9.
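
The weighting and sampling settings quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It assumes the weight of the document at rank i is γ raised to the power i − 1, and that each sampling round draws m distinct documents with probability proportional to weight. The function names (`decaying_weights`, `sample_contexts`), the without-replacement sampling scheme, and the fixed seed are assumptions made for illustration; the text above does not specify the paper's exact sampling procedure, and the MIS and aggregation steps are not shown.

```python
import random


def decaying_weights(k, gamma=0.9):
    """Exponentially decaying weights: rank i (1-based) gets gamma ** (i - 1)."""
    return [gamma ** (i - 1) for i in range(1, k + 1)]


def sample_contexts(k=50, m=2, rounds=20, gamma=0.9, seed=0):
    """Draw `rounds` contexts of `m` distinct document indices (0-based ranks).

    Assumption: within a round, documents are drawn one at a time without
    replacement, each with probability proportional to its weight.
    """
    rng = random.Random(seed)
    weights = decaying_weights(k, gamma)
    contexts = []
    for _ in range(rounds):
        remaining = list(range(k))
        chosen = []
        for _ in range(m):
            # Weighted draw over the documents not yet picked in this round.
            pick = rng.choices(
                remaining, weights=[weights[j] for j in remaining], k=1
            )[0]
            chosen.append(pick)
            remaining.remove(pick)
        contexts.append(sorted(chosen))
    return contexts
```

With the quoted defaults (k = 50, m = 2, T = 20, γ = 0.9), `sample_contexts()` returns 20 contexts of two document indices each. Because the weights decay geometrically, highly ranked documents are expected to appear in far more contexts than low-ranked ones.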