Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models
Authors: Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our intensive empirical results demonstrate the soundness and tightness of our conformal generation risk guarantees across four widely-used NLP datasets on four state-of-the-art retrieval models. |
| Researcher Affiliation | Collaboration | 1University of Illinois at Urbana-Champaign, USA 2Delft University of Technology, Netherlands 3Netflix Eyeline Studios, USA 4University of California, Berkeley, USA 5University of Chicago, USA. |
| Pseudocode | Yes | We refer to Alg. 1 in App. C.1 for the pseudocode of the protocol. |
| Open Source Code | Yes | The codes are publicly available at https://github.com/kangmintong/C-RAG. |
| Open Datasets | Yes | We evaluate C-RAG on four widely used NLP datasets, including AESLC (Zhang & Tetreault, 2019), CommonGen (Lin et al., 2019), DART (Nan et al., 2020), and E2E (Novikova et al., 2017). |
| Dataset Splits | Yes | We perform conformal calibration on validation sets with uncertainty δ = 0.1. |
| Hardware Specification | No | No specific hardware (e.g., GPU models, CPU types, or memory) used for experiments is mentioned. |
| Software Dependencies | No | The paper mentions using “Llama-2-7b for inference” but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | We use our generation protocol (Alg. 1 in App. C.1) controlled by the number of retrieved examples N_rag, generation set size λ_g, and diversity threshold λ_s. We use Llama-2-7b for inference and perform conformal calibration on validation sets with uncertainty δ = 0.1. We use 1 − ROUGE-L as the risk function. See App. J.1 for more details of evaluation setup. |
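
The risk function and uncertainty level quoted in the Experiment Setup row can be illustrated with a minimal sketch. It is not the authors' implementation: whitespace tokenization, the F1 variant of ROUGE-L, and the plain Hoeffding bound (swapped in here for the paper's conformal risk bound, which is not reproduced on this page) are all assumptions, and the function names are hypothetical.

```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on whitespace tokens (assumed variant, not from the paper)."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def risk(candidate, reference):
    """Risk in [0, 1], defined as 1 - ROUGE-L."""
    return 1.0 - rouge_l_f1(candidate, reference)


def hoeffding_upper_bound(risks, delta=0.1):
    """Upper bound on the expected risk that holds with probability >= 1 - delta.

    Plain Hoeffding inequality for i.i.d. risks in [0, 1]; an illustrative
    stand-in, not the bound used by C-RAG.
    """
    n = len(risks)
    empirical = sum(risks) / n
    return min(1.0, empirical + math.sqrt(math.log(1.0 / delta) / (2 * n)))
```

Given per-example risks computed on a validation set, `hoeffding_upper_bound(risks, delta=0.1)` returns a value that the true expected risk stays below with probability at least 0.9, under the i.i.d. assumption.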