Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Backward Conformal Prediction
Authors: Etienne Gauthier, Francis Bach, Michael I. Jordan
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate through an image classification experiment how the estimator $\hat{\alpha}^{\mathrm{LOO}}$ effectively approximates the true miscoverage $\mathbb{E}[\tilde{\alpha}]$, showcasing the effectiveness of Backward Conformal Prediction. In this section, we conduct experiments using a constant size constraint rule $T$. Additional details and experiments are provided in Appendix B, starting with a binary classification example to motivate the need for controlling prediction set sizes, followed by an image classification experiment using a more complex data-dependent size constraint rule. Our approach is evaluated on the CIFAR-10 dataset [Krizhevsky, 2009], which consists of 50,000 training images and 10,000 test images across 10 classes. |
| Researcher Affiliation | Academia | Etienne Gauthier (INRIA-ENS-PSL, Paris); Francis Bach (INRIA-ENS-PSL, Paris); Michael I. Jordan (INRIA-ENS-PSL, Paris; UC Berkeley) |
| Pseudocode | Yes | We summarize the Backward Conformal Prediction procedure in Algorithm 1. Algorithm 1: Backward Conformal Prediction |
| Open Source Code | No | Our code will be made publicly available upon acceptance. All datasets used in our experiments are publicly available, and we will also release our code publicly upon acceptance. |
| Open Datasets | Yes | Our approach is evaluated on the CIFAR-10 dataset [Krizhevsky, 2009], which consists of 50,000 training images and 10,000 test images across 10 classes. We perform binary classification experiments on the Breast Cancer Wisconsin (Diagnostic) dataset [Wolberg et al., 1993]. |
| Dataset Splits | Yes | Our approach is evaluated on the CIFAR-10 dataset [Krizhevsky, 2009], which consists of 50,000 training images and 10,000 test images across 10 classes. We randomly split the dataset into 70% training and 30% testing data. |
| Hardware Specification | Yes | All experiments were run on a machine with a 13th Gen Intel Core i7-13700H CPU and it typically takes 0.1-1.5 hours for each trial, depending on the calibration size n. |
| Software Dependencies | No | No specific versions of programming languages, libraries, or frameworks are mentioned. The paper references models like EfficientNet-B0, XGBoost, and ResNet-18, and optimizers like SGD, but does not specify the software environment or library versions used to implement them. |
| Experiment Setup | Yes | The model is trained to minimize the cross-entropy loss using stochastic gradient descent (SGD) with a learning rate of 0.1, momentum 0.9, weight decay $5 \times 10^{-4}$, and cosine annealing over 100 epochs. We use a batch size of 512 and apply standard data augmentation during training. |
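
The training schedule quoted in the Experiment Setup row can be made concrete with a short calculation. The following is a minimal sketch, not the authors' code: it assumes the common closed-form cosine annealing rule, a minimum learning rate of zero, one schedule update per epoch, and that the final partial mini-batch is kept. None of these details are stated in the quoted text, and the function names are hypothetical.

```python
import math

# Values quoted in the "Experiment Setup" and "Open Datasets" rows.
BASE_LR = 0.1          # initial SGD learning rate
EPOCHS = 100           # length of the cosine annealing schedule
BATCH_SIZE = 512
TRAIN_IMAGES = 50_000  # CIFAR-10 training set size

# Assumption (not stated in the source): the rate decays all the way to zero.
MIN_LR = 0.0


def cosine_annealed_lr(epoch, base_lr=BASE_LR, total_epochs=EPOCHS, min_lr=MIN_LR):
    """Closed-form cosine annealing: base_lr at epoch 0, min_lr at total_epochs."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def steps_per_epoch(num_examples=TRAIN_IMAGES, batch_size=BATCH_SIZE):
    """Mini-batches per epoch, assuming the final partial batch is kept."""
    return math.ceil(num_examples / batch_size)


if __name__ == "__main__":
    for epoch in (0, 25, 50, 75, 100):
        print(f"epoch {epoch:3d}: lr = {cosine_annealed_lr(epoch):.4f}")
    print("steps per epoch:", steps_per_epoch())
```

Under these assumptions the learning rate falls from 0.1 to half that value at epoch 50 and reaches zero at epoch 100, with 98 optimizer steps per epoch. A different minimum rate or per-step (rather than per-epoch) updates would change the intermediate values.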