Non-Exchangeable Conformal Risk Control
Authors: António Farinhas, Chrysoula Zerva, Dennis Thomas Ulmer, Andre Martins
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with both synthetic and real world data show the usefulness of our method. and (Section 4, Experiments) In this section, we turn to demonstrating the validity of our theoretical results in three different tasks using different nonincreasing losses: a multilabel classification problem using synthetic time series data, minimizing the false negative rate (§4.1), a problem involving monitoring electricity usage, minimizing the λ-insensitive absolute loss (§4.2), and an open-domain question answering (QA) task, where we control the best token-level F1-score (§4.3). |
| Researcher Affiliation | Collaboration | António Farinhas (1,2), Chrysoula Zerva (1,2), Dennis Ulmer (3,4), André F. T. Martins (1,2,5). Affiliations: (1) Instituto de Telecomunicações, (2) Instituto Superior Técnico, Universidade de Lisboa (Lisbon ELLIS Unit), (3) IT University of Copenhagen, (4) Pioneer Centre for Artificial Intelligence, (5) Unbabel |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/deep-spin/non-exchangeable-crc. |
| Open Datasets | Yes | We use the ELEC2 dataset (Harries, 1999) and We use the Natural Questions dataset (Kwiatkowski et al., 2019; Karpukhin et al., 2020) |
| Dataset Splits | Yes | After a warmup period of 200 time points, at each time step $n = 200, \dots, N - 1$ we assign odd indices to the training set, even indices to the calibration set, and we let $X_{n+1}$ be the test point. and We use the Natural Questions dataset (Kwiatkowski et al., 2019; Karpukhin et al., 2020), considering $n = 2500$ points for calibration and 1110 for evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper acknowledges the use of open-source software by citing works such as Van Rossum & Drake (2009) (Python), Oliphant (2006) (NumPy), Virtanen et al. (2020) (SciPy), Walt et al. (2011) (NumPy/SciPy), Pedregosa et al. (2011) (Scikit-learn), and Paszke et al. (2019) (PyTorch). However, it does not specify the exact version numbers for these software dependencies used in their experiments. |
| Experiment Setup | Yes | We compare standard CRC with non-exchangeable (non-X) CRC, for which we use weights $w_i = 0.99^{n+1-i}$ and predict $\hat{\lambda}$ following Eq. (10). In both cases, we minimize the false negative rate (FNR)... and For non-X CRC, we use weights $w_i = 0.99^{n+1-i}$ and we also experiment with weighted least-squares regression, placing weights $t_i = w_i$ on each data point (non-X CRC + WLS). and We experiment using $\lambda \in [0, 1]$ with a step of 0.01. and For non-X CRC, we choose weights $\{w_i\}_{i=1}^{n}$ by computing the dot product between the embedding representations of $\{X_i\}_{i=1}^{n}$ and $X_{n+1}$, obtained using a sentence-transformer model... We use $\alpha = 0.3$ and report results over 1000 trials. |
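
The experiment setup above combines geometrically decaying weights with a grid search over λ. The sketch below illustrates how those two pieces might fit together. It is not the authors' implementation: Eq. (10) is not reproduced in this summary, so the selection rule used here (the smallest grid value of λ whose weighted empirical risk, plus a worst-case term for the test point, stays at or below α) is an assumed form. The function names `geometric_weights` and `select_lambda`, the loss upper bound `B`, and the layout of the `losses` array are all hypothetical choices made for illustration.

```python
import numpy as np


def geometric_weights(n, rho=0.99):
    """Weights w_i = rho^(n+1-i) for i = 1..n.

    The most recent calibration point (i = n) gets the largest weight.
    """
    i = np.arange(1, n + 1)
    return rho ** (n + 1 - i)


def select_lambda(losses, weights, alpha, lambdas, B=1.0):
    """Pick lambda on a grid under an ASSUMED weighted risk-control rule.

    losses[i, k] holds L_i(lambdas[k]); each row is assumed to be
    nonincreasing in lambda and bounded above by B (an assumption).
    Returns the smallest grid value whose weighted bound is <= alpha,
    or the largest grid value if no grid point qualifies.
    """
    norm = 1.0 + weights.sum()
    # Weighted calibration risk plus a worst-case loss B for the test point.
    bound = (weights @ losses + B) / norm
    feasible = np.nonzero(bound <= alpha)[0]
    return lambdas[feasible[0]] if feasible.size else lambdas[-1]


# Toy usage: lambda in [0, 1] with a step of 0.01, alpha = 0.3.
lambdas = np.linspace(0.0, 1.0, 101)
n = 50
weights = geometric_weights(n)
# Hypothetical losses: every calibration point has L_i(lambda) = 1 - lambda.
losses = np.tile(1.0 - lambdas, (n, 1))
lam_hat = select_lambda(losses, weights, alpha=0.3, lambdas=lambdas)
print(lam_hat)
```

Setting all weights to 1 turns this rule into the usual unweighted risk-control bound, which is one way to compare the standard and non-exchangeable variants on the same loss matrix.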