Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

One Sample is Enough to Make Conformal Prediction Robust

Authors: Soroush H. Zargarbashi, Mohammad Sadegh Akhondzadeh, Aleksandar Bojchevski

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate average set size (lower is better), and empirical coverage (exceeding $1 - \alpha$ on average). Note that in RCPs the empirical coverage conservatively exceeds $1 - \alpha$ by increasing r. Under perturbation this decreases at worst to $1 - \alpha$. As BinCP [26] outperforms other robust CP approaches [24, 27], we set it as our main comparison baseline. All recent smoothing-based RCPs return non-informative sets (C(x) = Y) for a low number of samples (e.g. 32). Note that our main contribution is to return efficient sets with one inference per input; therefore we do not expect RCP1 to outperform BinCP for a large sample-rate. Our reported results are over 100 iterations with calibration set randomly sampled from the data. Further details are in E, and the code is in our GitHub. Since we certify the coverage guarantee (instead of scores), we can use the same binary certificate for both classification and regression tasks. We discuss the classification here, and defer the regression task to E. The algorithm remains the same, only for the regression we use the absolute distance from the ground truth as the score. To the best of our knowledge, this is the first conformal regression certificate based on randomized smoothing. Classification. We compare methods for the CIFAR-10 and ImageNet datasets. We have two inference pipelines. The original pipeline from BinCP, and CAS (computationally cheap setup): we use the ResNet models trained with noise augmentation from Cohen et al. [7]. Because of the model size, large sample-rates, although inefficient, are not unrealistic. We also evaluate on an alternative more expensive pipeline outlined by Carlini et al. [6]: the input is first denoised by a diffusion model and then classified by a vision transformer. For CIFAR-10 we combine a 50M-parameter diffusion model from Dhariwal and Nichol [9], with a ViT-B/16 from Dosovitskiy et al. [10], pretrained
Researcher Affiliation	Academia	Soroush H. Zargarbashi1 Mohammad Sadegh Akhondzadeh2 Aleksandar Bojchevski2 1 CISPA Helmholtz Center for Information Security, 2 University of Cologne [zargarbashi, akhondzadeh, bojchevski]@cs.uni-koeln.de
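Pseudocode	Yes	Algorithm 1. RCP1; the colored part shows the difference with vanilla CP. Require: Calibration set $D_n = \{(x_i, y_i)\}_{i=1}^{n}$; nominal coverage $1 - \alpha \in (0, 1)$; score $s : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$; potentially perturbed test point $\tilde{x}_{n+1}$. 1: Compute $s_i \leftarrow s(x_i + \epsilon, y_i)$ for all $(x_i, y_i) \in D_n$. 2: Set $1 - \alpha' \leftarrow c[1 - \alpha, B^{-1}]$, e.g., for Gaussian smoothing with $B_r$: $\Phi_\sigma(\Phi_\sigma^{-1}(1 - \alpha) + r)$. 3: Set $q_{\alpha'} = Q(\alpha', \{s_i\}_{i=1}^{n})$. 4: For input $\tilde{x}_{n+1}$ return $C_r(\tilde{x}_{n+1}) = \{y : s(\tilde{x}_{n+1} + \epsilon, y) \geq q_{\alpha'}\}$.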
Open Source Code	Yes	Further details are in E, and the code is in our GitHub. ... Our code is uploaded as a zip file in the supplementary materials. After acceptance we also share the code in GitHub.
Open Datasets	Yes	We compare methods for the CIFAR-10 and ImageNet datasets. ... We use the model from Fischer et al. [11] on the Cityscapes dataset [8] which is a scene segmentation task.
Dataset Splits	Yes	For the CIFAR-10 dataset we evaluate the results over 2048 test samples for the ResNet model and 10000 images for the ViT models. For ImageNet, since the number of classes is 1000, we report our results over 5000 images for ViT models and 50000 images on ResNet models. Ultimately the number of samples does not influence the empirical results. The number of Monte Carlo samples is initially set to 500 for CIFAR and 300 for ImageNet. For each experiment, and for the reported sample rate, we cut the precomputed samples from the reported number. Our results are reported over 100 runs (except the conformal risk control, which is over one run). In each run we sample 10% of the points as the calibration set. For conformal risk control we report the result on 300 images, of which 100 random images are taken for the calibration. Ultimately the size of the calibration set does not affect the final performance.
Hardware Specification	Yes	We ran our experiments using NVIDIA A-100 and H-100 Tensor Core GPUs. For each experiment only one GPU was used. We use the A-100 GPU for the CIFAR-10 dataset under the ResNet setup, and the conformal risk control experiment. The rest of the results use H-100 as the compute resource.
Software Dependencies	No	PyTorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
Experiment Setup	Yes	For CIFAR-10 we combine a 50M-parameter diffusion model from Dhariwal and Nichol [9], with a ViT-B/16 from Dosovitskiy et al. [10], pretrained on ImageNet at 224 × 224 resolution and finetuned on CIFAR-10 with 97.9% accuracy for the Hugging Face implementation. For ImageNet we use a 552M-parameter class-unconditional diffusion model followed by a BEiT-L model (305M parameters) from Bao et al. [3] achieving 88.6% top-1 validation accuracy. We use the implementation provided by the timm library [23].
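
The four steps quoted in the Pseudocode row can be illustrated with a small Python sketch. This is not the authors' code: the function names are hypothetical, the scores are assumed to be conformity scores computed elsewhere on single noise-augmented inputs (higher means the label fits better), and the finite-sample rank used for the quantile is a standard conformal correction that may differ from the paper's definition of Q. Only the Gaussian smoothing bound from step 2 is taken directly from the quoted algorithm.

```python
import math
from statistics import NormalDist


def robust_nominal_coverage(one_minus_alpha, r, sigma):
    """Step 2 (Gaussian case): Phi_sigma(Phi_sigma^{-1}(1 - alpha) + r)."""
    nd = NormalDist(mu=0.0, sigma=sigma)
    return nd.cdf(nd.inv_cdf(one_minus_alpha) + r)


def rcp1_threshold(noisy_cal_scores, alpha, r, sigma):
    """Steps 2-3: conservative alpha' and the matching calibration quantile.

    noisy_cal_scores holds s(x_i + eps, y_i), one noisy sample per point.
    """
    alpha_prime = 1.0 - robust_nominal_coverage(1.0 - alpha, r, sigma)
    scores = sorted(noisy_cal_scores)
    n = len(scores)
    # Assumed finite-sample correction: take the k-th smallest score with
    # k = floor(alpha' * (n + 1)), so the miscoverage is at most alpha'.
    k = math.floor(alpha_prime * (n + 1))
    if k < 1:
        # Too few calibration points: fall back to the trivial (full) set.
        return alpha_prime, float("-inf")
    return alpha_prime, scores[k - 1]


def rcp1_prediction_set(noisy_test_scores, q):
    """Step 4: keep every label whose single-sample score reaches q."""
    return [y for y, s in enumerate(noisy_test_scores) if s >= q]


if __name__ == "__main__":
    # Toy calibration scores; real ones would come from a smoothed model.
    cal = [i / 100 for i in range(1, 100)]
    alpha_prime, q = rcp1_threshold(cal, alpha=0.105, r=0.25, sigma=1.0)
    print(alpha_prime, q, rcp1_prediction_set([0.05, 0.5, 0.2, 0.03], q))
```

With r = 0 the adjusted level equals the nominal one and the sketch reduces to vanilla conformal prediction on noisy inputs; a larger r lowers the threshold and therefore enlarges the returned sets.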