Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Split conformal classification with unsupervised calibration

Authors: Santiago Mazuelas

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	The experiments assess the performance obtained by the proposed approach for split conformal prediction with unsupervised classification using 9 common benchmark datasets. The coverage probabilities and prediction set sizes are compared with those provided by the conventional approach that uses supervised calibration samples and by the naive approach that uses label predictions for calibration samples.
Researcher Affiliation	Academia	Santiago Mazuelas BCAM-Basque Center for Applied Mathematics and IKERBASQUE-Basque Foundation for Science Bilbao, Spain EMAIL
Pseudocode	Yes	Algorithm 1: Supervised split conformal prediction; Algorithm 2: Split conformal prediction with unsupervised calibration; Algorithm 3: Kernel-based method to obtain label weights for calibration instances
Open Source Code	Yes	The code implementing the methods presented and reproducing the experiments can be found at https://github.com/MachineLearningBCAM/Unsupervised-conformal-prediction-NeurIPS2025.
Open Datasets	Yes	The experiments assess the performance obtained by the proposed approach for split conformal prediction with unsupervised classification using 9 common benchmark datasets. The code implementing the methods presented and reproducing the experiments can be found at https://github.com/MachineLearningBCAM/Unsupervised-conformal-prediction-NeurIPS2025. The supplementary materials provide implementation details and additional results in Appendix E, including running time assessments, as well as results for different target coverages and types of conformal scores. E.1 Implementation and datasets details: We utilize 9 publicly available and common benchmark datasets for classification tasks: Drybean, Forestcov, Satellite, USPS, MNIST, Fashion MNIST, CIFAR10, ImageNet10 (first 10 classes of ImageNet), and Letter. These datasets can be found in the UCI repository [23], TensorFlow datasets [24], and the Kaggle website https://www.kaggle.com.
Dataset Splits	Yes	In each random realization, the datasets are randomly partitioned into training, calibration, and test sets. The sizes of the training and test sets are 3,000 and 1,000 samples, respectively, and that of the calibration set is varied from 10 to 3,000.
Hardware Specification	No	Figure 3 shows the running time achieved by Algorithm 3 using a regular desktop machine in the datasets Drybean, CIFAR10, and Letter.
Software Dependencies	No	In the experiments carried out, this optimization problem is solved using interior point methods with the Mosek solver https://www.mosek.com.
Experiment Setup	Yes	The classification rules are obtained using random forests in the tabular datasets (Drybean, Forestcov, Satellite, and Letter) and using neural networks in the image datasets (USPS, MNIST, Fashion MNIST, CIFAR10, and ImageNet10). Specifically, the random forests are given by 200 decision trees and learning is carried out with a maximum of 20 splits and a minimum leaf size of 10. The neural networks for the USPS, MNIST, and Fashion MNIST datasets have two hidden layers of sizes 128 and 64 and learning is carried out with regularization parameter 0.001. The implementation for the CIFAR10 and ImageNet10 datasets is slightly different due to their higher complexity. In particular, the classification rule for those datasets is learned only once by fine-tuning a Resnet50 until a validation error of 4% in CIFAR10 and 11% in ImageNet10 is reached, using SGD with momentum, initial learning rate of 0.001, minibatch size of 64, and regularization parameter 0.001.
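
Illustrative sketch (not taken from the paper or its repository): the Dataset Splits row describes a random partition into 3,000 training, 1,000 test, and between 10 and 3,000 calibration samples per realization, and the Pseudocode row lists a supervised split conformal baseline. The Python below shows how such a partition and a textbook supervised split conformal threshold could look. The function names, the seed handling, and the conformal score 1 - p(y|x) are assumptions made for illustration; the authors' Algorithm 1 and their released code may differ in all of these.

```python
import numpy as np


def random_partition(n_samples, n_train=3000, n_cal=100, n_test=1000, seed=0):
    """Randomly split sample indices into disjoint train/calibration/test sets.

    Default sizes follow the Dataset Splits row; n_cal is the quantity
    that is varied (10 to 3,000 in the reported experiments).
    """
    if n_train + n_cal + n_test > n_samples:
        raise ValueError("not enough samples for the requested split sizes")
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    train = perm[:n_train]
    cal = perm[n_train:n_train + n_cal]
    test = perm[n_train + n_cal:n_train + n_cal + n_test]
    return train, cal, test


def conformal_threshold(cal_scores, alpha=0.1):
    """Standard split conformal threshold for target coverage 1 - alpha.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score,
    or +inf when the calibration set is too small for that rank to exist.
    """
    cal_scores = np.asarray(cal_scores, dtype=float)
    n = cal_scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # every label is included in the prediction set
    return np.sort(cal_scores)[k - 1]


def prediction_sets(probs, threshold):
    """Boolean mask of shape (n_instances, n_labels).

    Assumed score: 1 - p(y|x); label y is kept when its score is at most
    the threshold.
    """
    return (1.0 - np.asarray(probs, dtype=float)) <= threshold
```

In a supervised run, the calibration scores would be 1 - p(y_i|x_i) evaluated at the true calibration labels, and one realization would call random_partition with a fresh seed, fit the classifier on the training indices, compute the threshold on the calibration indices, and measure coverage and set size on the test indices.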