Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models

Authors: Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our intensive empirical results demonstrate the soundness and tightness of our conformal generation risk guarantees across four widely-used NLP datasets on four state-of-the-art retrieval models.
Researcher Affiliation | Collaboration | ¹University of Illinois at Urbana-Champaign, USA; ²Delft University of Technology, Netherlands; ³Netflix Eyeline Studios, USA; ⁴University of California, Berkeley, USA; ⁵University of Chicago, USA.
Pseudocode | Yes | We refer to Alg. 1 in App. C.1 for the pseudocode of the protocol.
Open Source Code | Yes | The codes are publicly available at https://github.com/kangmintong/C-RAG.
Open Datasets | Yes | We evaluate C-RAG on four widely used NLP datasets, including AESLC (Zhang & Tetreault, 2019), CommonGen (Lin et al., 2019), DART (Nan et al., 2020), and E2E (Novikova et al., 2017).
Dataset Splits | Yes | We perform conformal calibration on validation sets with uncertainty δ = 0.1.
Hardware Specification | No | No specific hardware (e.g., GPU models, CPU types, or memory) used for experiments is mentioned.
Software Dependencies | No | The paper mentions using “Llama-2-7b for inference” but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup | Yes | We use our generation protocol (Alg. 1 in App. C.1) controlled by the number of retrieved examples N_rag, generation set size λ_g, and diversity threshold λ_s. We use Llama-2-7b for inference and perform conformal calibration on validation sets with uncertainty δ = 0.1. We use 1 − ROUGE-L as the risk function. See App. J.1 for more details of evaluation setup.
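
The risk function and calibration step quoted under Experiment Setup can be illustrated with a minimal sketch. It makes two loud assumptions. First, ROUGE-L is computed as an F1 score over whitespace tokens, with no stemming, which may differ from the scorer the authors used. Second, the high-probability bound shown is a plain Hoeffding bound, swapped in for illustration; it is not the conformal bound from the paper, which this summary does not reproduce. All function names are hypothetical and are not taken from the C-RAG repository.

```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 on whitespace tokens (assumption: no stemming)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def generation_risk(candidate, reference):
    """Risk in [0, 1], defined as 1 - ROUGE-L as in the quoted setup."""
    return 1.0 - rouge_l_f1(candidate, reference)


def hoeffding_risk_upper_bound(risks, delta=0.1):
    """Upper bound on the expected risk holding with probability >= 1 - delta.

    Assumption: i.i.d. calibration risks in [0, 1]. This is Hoeffding's
    inequality, used here as a stand-in, not the bound derived in the paper.
    """
    n = len(risks)
    if n == 0:
        raise ValueError("need at least one calibration sample")
    empirical = sum(risks) / n
    return min(1.0, empirical + math.sqrt(math.log(1.0 / delta) / (2 * n)))
```

In use, one would compute `generation_risk` for each example in a validation set and pass the resulting list to `hoeffding_risk_upper_bound` with `delta=0.1`, matching the uncertainty level quoted above.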