Reproducibility Index

Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Backward Conformal Prediction

Authors: Etienne Gauthier, Francis Bach, Michael I. Jordan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate through an image classification experiment how the estimator $\hat{\alpha}^{\mathrm{LOO}}$ effectively approximates the true miscoverage $\mathbb{E}[\tilde{\alpha}]$, showcasing the effectiveness of Backward Conformal Prediction. In this section, we conduct experiments using a constant size constraint rule T. Additional details and experiments are provided in Appendix B, starting with a binary classification example to motivate the need for controlling prediction set sizes, followed by an image classification experiment using a more complex data-dependent size constraint rule. Our approach is evaluated on the CIFAR-10 dataset [Krizhevsky, 2009], which consists of 50,000 training images and 10,000 test images across 10 classes.
Researcher Affiliation	Academia	Etienne Gauthier (INRIA-ENS-PSL, Paris); Francis Bach (INRIA-ENS-PSL, Paris); Michael I. Jordan (INRIA-ENS-PSL, Paris; UC Berkeley)
Pseudocode	Yes	We summarize the Backward Conformal Prediction procedure in Algorithm 1. Algorithm 1: Backward Conformal Prediction
Open Source Code	No	Our code will be made publicly available upon acceptance. All datasets used in our experiments are publicly available, and we will also release our code publicly upon acceptance.
Open Datasets	Yes	Our approach is evaluated on the CIFAR-10 dataset [Krizhevsky, 2009], which consists of 50,000 training images and 10,000 test images across 10 classes. We perform binary classification experiments on the Breast Cancer Wisconsin (Diagnostic) dataset [Wolberg et al., 1993].
Dataset Splits	Yes	Our approach is evaluated on the CIFAR-10 dataset [Krizhevsky, 2009], which consists of 50,000 training images and 10,000 test images across 10 classes. We randomly split the dataset into 70% training and 30% testing data.
Hardware Specification	Yes	All experiments were run on a machine with a 13th Gen Intel Core i7-13700H CPU and it typically takes 0.1-1.5 hours for each trial, depending on the calibration size n.
Software Dependencies	No	No specific versions of programming languages, libraries, or frameworks are mentioned. The paper references models like EfficientNet-B0, XGBoost, and ResNet-18, and optimizers like SGD, but does not specify the software environment or library versions used to implement them.
Experiment Setup	Yes	The model is trained to minimize the cross-entropy loss using stochastic gradient descent (SGD) with a learning rate of 0.1, momentum 0.9, weight decay $5 \times 10^{-4}$, and cosine annealing over 100 epochs. We use a batch size of 512 and apply standard data augmentation during training.
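
The learning-rate schedule implied by the Experiment Setup row can be sketched in plain Python. This is an illustration only, not the authors' code (which is not yet released). The constant names and the function `cosine_annealed_lr` are hypothetical, and two details are assumptions the quoted text does not state: the rate is updated once per epoch, and it anneals to a minimum of 0.

```python
import math

# Hyperparameters quoted in the Experiment Setup row.
BASE_LR = 0.1
MOMENTUM = 0.9
WEIGHT_DECAY = 5e-4
EPOCHS = 100
BATCH_SIZE = 512


def cosine_annealed_lr(epoch, base_lr=BASE_LR, total_epochs=EPOCHS, min_lr=0.0):
    """Learning rate at the start of `epoch` (0-indexed) under cosine annealing.

    Assumptions (not stated in the source): per-epoch updates and min_lr = 0.
    """
    # Cosine factor decays smoothly from 1 (epoch 0) to 0 (epoch == total_epochs).
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine


# One learning rate per training epoch.
schedule = [cosine_annealed_lr(e) for e in range(EPOCHS)]
```

Under these assumptions the rate starts at 0.1, passes through 0.05 at the midpoint (epoch 50), and approaches 0 by the end of training. Momentum and weight decay stay fixed throughout.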