Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search
Authors: Zeyu Shen, Basileal Imana, Tong Wu, Chong Xiang, Prateek Mittal, Aleksandra Korolova
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present empirical results showing ReliabilityRAG provides superior robustness against adversarial attacks compared to prior methods, maintains high benign accuracy, and excels in long-form generation tasks where prior robustness-focused methods struggled. Our work is a significant step towards more effective, provably robust defenses against retrieved corpus corruption in RAG. |
| Researcher Affiliation | Collaboration | Zeyu Shen Department of Computer Science Princeton University Princeton, New Jersey, 08540 EMAIL; Chong Xiang NVIDIA Santa Clara, California, 95051 EMAIL |
| Pseudocode | Yes | Algorithm 1: ReliabilityRAG via MIS (ordinal-reliability setting); Algorithm 2: ReliabilityRAG via sample and aggregate (cardinal-reliability setting) |
| Open Source Code | No | Public datasets are cited, but an anonymized code repository is not yet linked. We plan to release code after acceptance to preserve double-blind review. |
| Open Datasets | Yes | We evaluate on three open-domain QA datasets: RealtimeQA (RQA) [28], Natural Questions (NQ) [32], TriviaQA (TQA) [27], and a long-form Biography generation dataset (Bio) [31]. |
| Dataset Splits | Yes | We use 100 queries from RQA dataset, randomly draw 500 queries from each of NQ dataset and TQA dataset, and 50 queries from Bio dataset. |
| Hardware Specification | Yes | We measure end-to-end latency of our approach on one NVIDIA A100 (80GB) using Mistral-7B or Llama3.2-3B for generation and DeBERTa-v3-large-mnli-fever-anli-ling-wanli NLI checker in Table 14. |
| Software Dependencies | Yes | We run experiments using three LLMs as the generators in our RAG pipelines: Mistral-7B-Instruct-v0.2 [24], Llama3.2-3B-Instruct [40], and GPT-4o-mini [50]. |
| Experiment Setup | Yes | We set temperature to 0 for all experiments. When testing MIS, we use the top k = 10 passages. For Sampling + MIS, we use the top k = 50 documents, since one major motivation for the weighted sample and aggregate framework is scalability. We set context size m = 2 and number of sampling rounds T = 20. For the weights, we use the exponentially decaying weights and set $w(x_i) = \gamma^{i-1}$, where γ = 0.9. |
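
The sampling settings in the Experiment Setup row (k = 50, m = 2, T = 20, γ = 0.9) can be illustrated with a minimal sketch. It assumes the weight of the document at 1-indexed retrieval rank i is γ^(i−1), and that each round draws m distinct documents with probability proportional to weight. The function names and the without-replacement sampling scheme are illustrative assumptions, not the paper's Algorithm 2.

```python
import random


def decaying_weights(k, gamma=0.9):
    """Return weights gamma**(i-1) for ranks i = 1..k (assumed weight form)."""
    return [gamma ** (i - 1) for i in range(1, k + 1)]


def sample_contexts(k=50, m=2, rounds=20, gamma=0.9, seed=0):
    """Draw `rounds` contexts of `m` distinct ranks each, weighted by rank.

    Hypothetical helper: sampling without replacement inside a round is an
    assumption made for illustration only.
    """
    rng = random.Random(seed)
    weights = decaying_weights(k, gamma)
    contexts = []
    for _ in range(rounds):
        remaining = list(range(1, k + 1))  # 1-indexed retrieval ranks
        w = list(weights)
        chosen = []
        for _ in range(m):
            # Pick one position with probability proportional to its weight.
            pos = rng.choices(range(len(remaining)), weights=w, k=1)[0]
            chosen.append(remaining.pop(pos))
            w.pop(pos)
        contexts.append(sorted(chosen))
    return contexts


if __name__ == "__main__":
    print(decaying_weights(3))
    print(sample_contexts()[:5])
```

Under these assumptions, higher-ranked documents dominate the sampled contexts: with γ = 0.9 the rank-50 document carries a weight of 0.9^49, under 1% of the rank-1 weight.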