Non-Exchangeable Conformal Risk Control

Authors: António Farinhas, Chrysoula Zerva, Dennis Thomas Ulmer, André Martins

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments with both synthetic and real world data show the usefulness of our method." and "4 EXPERIMENTS: In this section, we turn to demonstrating the validity of our theoretical results in three different tasks using different nonincreasing losses: a multilabel classification problem using synthetic time series data, minimizing the false negative rate (§4.1), a problem involving monitoring electricity usage, minimizing the λ-insensitive absolute loss (§4.2), and an open-domain question answering (QA) task, where we control the best token-level F1-score (§4.3)."
Researcher Affiliation | Collaboration | António Farinhas (1,2), Chrysoula Zerva (1,2), Dennis Ulmer (3,4), André F. T. Martins (1,2,5). Affiliations: (1) Instituto de Telecomunicações, (2) Instituto Superior Técnico, Universidade de Lisboa (Lisbon ELLIS Unit), (3) IT University of Copenhagen, (4) Pioneer Centre for Artificial Intelligence, (5) Unbabel
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code is available at https://github.com/deep-spin/non-exchangeable-crc."
Open Datasets | Yes | "We use the ELEC2 dataset (Harries, 1999)" and "We use the Natural Questions dataset (Kwiatkowski et al., 2019; Karpukhin et al., 2020)"
Dataset Splits | Yes | "After a warmup period of 200 time points, at each time step $n = 200, \ldots, N - 1$ we assign odd indices to the training set, even indices to the calibration set, and we let $X_{n+1}$ be the test point." and "We use the Natural Questions dataset (Kwiatkowski et al., 2019; Karpukhin et al., 2020), considering n = 2500 points for calibration and 1110 for evaluation."
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper acknowledges the use of open-source software by citing Van Rossum & Drake (2009) (Python), Oliphant (2006) (NumPy), Virtanen et al. (2020) (SciPy), Walt et al. (2011) (NumPy/SciPy), Pedregosa et al. (2011) (Scikit-learn), and Paszke et al. (2019) (PyTorch). However, it does not specify the version numbers of these dependencies used in the experiments.
Experiment Setup | Yes | "We compare standard CRC with non-exchangeable (non-X) CRC, for which we use weights $w_i = 0.99^{n+1-i}$ and predict $\hat{\lambda}$ following Eq. (10). In both cases, we minimize the false negative rate (FNR)..." and "For non-X CRC, we use weights $w_i = 0.99^{n+1-i}$ and we also experiment with weighted least-squares regression, placing weights $t_i = w_i$ on each data point (non-X CRC + WLS)." and "We experiment using $\lambda \in [0, 1]$ with a step of 0.01." and "For non-X CRC, we choose weights $\{w_i\}_{i=1}^{n}$ by computing the dot product between the embedding representations of $\{X_i\}_{i=1}^{n}$ and $X_{n+1}$, obtained using a sentence-transformer model... We use $\alpha = 0.3$ and report results over 1000 trials."
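
The weighting and grid search quoted in the Experiment Setup entry can be illustrated with a minimal sketch. It is not the authors' code (see the repository linked above for that). Eq. (10) is not reproduced on this page, so the selection rule used here is an assumption: the smallest λ on the grid for which the weighted calibration loss, plus the loss upper bound B standing in for the test point, divided by the sum of weights plus one, is at most α. The function names `geometric_weights` and `non_x_crc_lambda`, the toy losses, and the fallback to the largest grid value are all hypothetical.

```python
import numpy as np


def geometric_weights(n, rho=0.99):
    """Weights w_i = rho^(n+1-i) for i = 1..n; recent points weigh more."""
    i = np.arange(1, n + 1)
    return rho ** (n + 1 - i)


def non_x_crc_lambda(losses, weights, lambdas, alpha, B=1.0):
    """Pick the smallest grid value of lambda whose weighted risk bound is <= alpha.

    losses[i, k] is the loss of calibration point i at lambdas[k]; it is
    assumed to be nonincreasing in lambda and bounded above by B.
    The normalization (sum of weights plus one) is an assumed reading of
    the selection rule, not a transcription of Eq. (10).
    """
    w = np.asarray(weights, dtype=float)
    norm = w.sum() + 1.0
    # Weighted calibration loss plus worst-case loss B for the test point.
    bound = (w @ losses + B) / norm
    feasible = np.flatnonzero(bound <= alpha)
    # Sketch-level fallback: return the most conservative lambda if none is feasible.
    return lambdas[feasible[0]] if feasible.size else lambdas[-1]


# Toy usage with synthetic scores (illustrative data only).
rng = np.random.default_rng(0)
n = 200
lambdas = np.linspace(0.0, 1.0, 101)  # grid over [0, 1] with step 0.01
scores = rng.uniform(size=n)
# Loss is 1 when the score exceeds lambda, so it is nonincreasing in lambda.
losses = (scores[:, None] > lambdas[None, :]).astype(float)
lam_hat = non_x_crc_lambda(losses, geometric_weights(n), lambdas, alpha=0.3)
print(f"selected lambda: {lam_hat:.2f}")
```

With all weights set to 1, the bound in this sketch reduces to (sum of losses + B) / (n + 1), the usual exchangeable CRC form, which makes the two settings easy to compare on the same loss matrix.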