Conformal Risk Control

Authors: Anastasios Nikolas Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, Tal Schuster

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Worked examples from computer vision and natural language processing demonstrate the usage of our algorithm to bound the false negative rate, graph distance, and token-level F1-score. ... To demonstrate the flexibility and empirical effectiveness of the proposed algorithm, we apply it to four tasks across computer vision and natural language processing. ... We report results with α = 0.1 in Figure 1. The mean and standard deviation of the risk over 1000 trials are 0.0987 and 0.0114, respectively.
Researcher Affiliation | Collaboration | Anastasios N. Angelopoulos (UC Berkeley), Stephen Bates (MIT), Adam Fisch (MIT), Lihua Lei (Stanford), Tal Schuster (Google Research)
Pseudocode | No | The paper describes the algorithm and provides mathematical formulas (e.g., '$\hat{\lambda} = \inf\{\lambda : \frac{n}{n+1}\hat{R}_n(\lambda) + \frac{B}{n+1} \le \alpha\}$'), but it does not include a distinct, labeled pseudocode or algorithm block.
Open Source Code | Yes | Code to reproduce our examples is available at https://github.com/aangelopoulos/conformal-risk.
Open Datasets | Yes | For evaluating the proposed procedure we pool data from several online open-source gut polyp segmentation datasets: Kvasir, Hyper-Kvasir, CVC-ColonDB, CVC-ClinicDB, and ETIS-Larib. ... We evaluate on the Microsoft Common Objects in Context (MS COCO) dataset (Lin et al., 2014) ... We use the ImageNet dataset (Deng et al., 2009) ... We use the Natural Questions (NQ) dataset (Kwiatkowski et al., 2019).
Dataset Splits | Yes | We used n = 1000, and evaluated risk control with the 781 remaining validation data points. ... We used n = 4000, and evaluated risk control with 1000 validation data points. ... We choose a ResNet152 (He et al., 2016) for f and n = 30000, and evaluate risk with the remaining 20000. ... We use n = 2500 calibration points, and evaluate risk control with the remaining 1110.
Hardware Specification | No | The paper mentions the models used (e.g., PraNet, TResNet, ResNet152, DPR Retriever Reader model) and datasets, but does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies | No | The paper mentions the use of certain models (e.g., PraNet, TResNet, ResNet152, DPR Retriever Reader model) but does not provide specific version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages used in the experiments.
Experiment Setup | Yes | We report results with α = 0.1 in Figure 1. ... We report results with α = 0.1 in Figure 2. ... We report results with α = 0.05 in Figure 3. ... We use α = 0.3 (chosen empirically as the lowest F1 score which reliably results in approximately correct answers by manual validation) in Figure 4. ... For evaluating the proposed procedure we pool data from several online open-source gut polyp segmentation datasets: Kvasir, Hyper-Kvasir, CVC-ColonDB, CVC-ClinicDB, and ETIS-Larib. We choose a PraNet (Fan et al., 2020) as our base model f and used n = 1000, and evaluated risk control with the 781 remaining validation data points.
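
Since the paper gives the threshold rule only as a formula, the following minimal Python sketch shows one way it could be evaluated on a finite grid of λ values. It is not the authors' implementation (see their repository linked above). The function names `calibrate_lambda` and `simulate_risk`, the grid of 101 values, and the toy loss 1{u > λ} on uniform scores are all assumptions made for illustration. The sketch relies on the loss being non-increasing in λ and bounded above by B, and it falls back to the largest grid value when no λ satisfies the inequality.

```python
import random
import statistics


def calibrate_lambda(loss_table, lambdas, alpha, B=1.0):
    """Return the smallest grid value lam with
    n/(n+1) * empirical_risk(lam) + B/(n+1) <= alpha.

    loss_table[i][j] is the loss of calibration point i at lambdas[j].
    Assumes lambdas is sorted ascending and each loss is non-increasing
    in lambda and bounded above by B.
    """
    n = len(loss_table)
    for j, lam in enumerate(lambdas):
        empirical_risk = sum(row[j] for row in loss_table) / n
        if n / (n + 1) * empirical_risk + B / (n + 1) <= alpha:
            return lam
    # No grid value is feasible: fall back to the most conservative one.
    return lambdas[-1]


def simulate_risk(n_cal=1000, n_val=781, alpha=0.1, trials=20, seed=0):
    """Toy simulation (hypothetical data, not the paper's experiments).

    Each point has a score u ~ Uniform(0, 1) and loss 1{u > lam}, which is
    binary, bounded by 1, and non-increasing in lam. Returns the mean and
    standard deviation of the validation risk across trials.
    """
    rng = random.Random(seed)
    lambdas = [k / 100 for k in range(101)]
    risks = []
    for _ in range(trials):
        cal = [rng.random() for _ in range(n_cal)]
        val = [rng.random() for _ in range(n_val)]
        loss_table = [[1.0 if u > lam else 0.0 for lam in lambdas] for u in cal]
        lam_hat = calibrate_lambda(loss_table, lambdas, alpha)
        risks.append(sum(u > lam_hat for u in val) / n_val)
    return statistics.mean(risks), statistics.pstdev(risks)
```

Calling `simulate_risk()` mirrors the shape of the first reported experiment (n = 1000 calibration points, 781 validation points, α = 0.1) on synthetic data only; its output says nothing about the paper's reported numbers. Because the grid rounds the threshold upward, the simulated risk tends to sit slightly below α.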