Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Non-Exchangeable Conformal Risk Control
Authors: António Farinhas, Chrysoula Zerva, Dennis Thomas Ulmer, André Martins
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Experiments with both synthetic and real world data show the usefulness of our method." and, from Section 4 (Experiments): "In this section, we turn to demonstrating the validity of our theoretical results in three different tasks using different nonincreasing losses: a multilabel classification problem using synthetic time series data, minimizing the false negative rate (4.1), a problem involving monitoring electricity usage, minimizing the λ-insensitive absolute loss (4.2), and an open-domain question answering (QA) task, where we control the best token-level F1-score (4.3)." |
| Researcher Affiliation | Collaboration | António Farinhas (1, 2), Chrysoula Zerva (1, 2), Dennis Ulmer (3, 4), André F. T. Martins (1, 2, 5). Affiliations: (1) Instituto de Telecomunicações, (2) Instituto Superior Técnico, Universidade de Lisboa (Lisbon ELLIS Unit), (3) IT University of Copenhagen, (4) Pioneer Centre for Artificial Intelligence, (5) Unbabel |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/deep-spin/non-exchangeable-crc. |
| Open Datasets | Yes | "We use the ELEC2 dataset (Harries, 1999)" and "We use the Natural Questions dataset (Kwiatkowski et al., 2019; Karpukhin et al., 2020)" |
| Dataset Splits | Yes | "After a warmup period of 200 time points, at each time step $n = 200, \dots, N - 1$ we assign odd indices to the training set, even indices to the calibration set, and we let $X_{n+1}$ be the test point." and "We use the Natural Questions dataset (Kwiatkowski et al., 2019; Karpukhin et al., 2020), considering n = 2500 points for calibration and 1110 for evaluation." |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper acknowledges the use of open-source software by citing works such as Van Rossum & Drake (2009) (Python), Oliphant (2006) (NumPy), Virtanen et al. (2020) (SciPy), Walt et al. (2011) (NumPy/SciPy), Pedregosa et al. (2011) (Scikit-learn), and Paszke et al. (2019) (PyTorch). However, it does not specify the exact version numbers for these software dependencies used in their experiments. |
| Experiment Setup | Yes | "We compare standard CRC with non-exchangeable (non-X) CRC, for which we use weights $w_i = 0.99^{n+1-i}$ and predict $\hat{\lambda}$ following Eq. (10). In both cases, we minimize the false negative rate (FNR)..." and "For non-X CRC, we use weights $w_i = 0.99^{n+1-i}$ and we also experiment with weighted least-squares regression, placing weights $t_i = w_i$ on each data point (non-X CRC + WLS)." and "We experiment using $\lambda \in [0, 1]$ with a step of 0.01." and "For non-X CRC, we choose weights $\{w_i\}_{i=1}^{n}$ by computing the dot product between the embedding representations of $\{X_i\}_{i=1}^{n}$ and $X_{n+1}$, obtained using a sentence-transformer model... We use α = 0.3 and report results over 1000 trials." |
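
The weighting scheme and threshold grid quoted in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code (see the linked repository for that). The names `decay_weights` and `select_lambda`, the loss bound `B`, and the fallback to the largest grid value are hypothetical. Eq. (10) is not reproduced on this page, so the selection rule below is an assumption: it takes the smallest λ on the grid whose weighted empirical risk, plus a worst-case term for the test point, stays at or below α.

```python
import numpy as np

def decay_weights(n, rho=0.99):
    """Weights w_i = rho^(n+1-i) for i = 1..n; recent points weigh more."""
    i = np.arange(1, n + 1)
    return rho ** (n + 1 - i)

def select_lambda(losses, weights, lambdas, alpha, B=1.0):
    """Assumed rule: smallest lambda whose weighted risk bound is <= alpha.

    losses[i, k] is the loss of calibration point i at lambdas[k] and is
    expected to be nonincreasing in k. B is an assumed upper bound on the loss.
    """
    total = weights.sum() + 1.0          # the test point is given weight 1
    bound = (weights / total) @ losses + B / total
    feasible = np.nonzero(bound <= alpha)[0]
    # Hypothetical fallback: the most conservative grid value.
    return lambdas[feasible[0]] if feasible.size else lambdas[-1]

# Grid over [0, 1] with a step of 0.01, as quoted in the table.
lambda_grid = np.linspace(0.0, 1.0, 101)
```

With uniform weights the rule reduces to an unweighted average of calibration losses plus the `B` term, which is one way to sanity-check an implementation against standard CRC.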