Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Split conformal classification with unsupervised calibration
Authors: Santiago Mazuelas
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments assess the performance obtained by the proposed approach for split conformal prediction with unsupervised classification using 9 common benchmark datasets. The coverage probabilities and prediction set sizes are compared with those provided by the conventional approach that uses supervised calibration samples and by the naive approach that uses label predictions for calibration samples. |
| Researcher Affiliation | Academia | Santiago Mazuelas BCAM-Basque Center for Applied Mathematics and IKERBASQUE-Basque Foundation for Science Bilbao, Spain EMAIL |
| Pseudocode | Yes | Algorithm 1: Supervised split conformal prediction; Algorithm 2: Split conformal prediction with unsupervised calibration; Algorithm 3: Kernel-based method to obtain label weights for calibration instances |
| Open Source Code | Yes | The code implementing the methods presented and reproducing the experiments can be found at https://github.com/MachineLearningBCAM/Unsupervised-conformal-prediction-NeurIPS2025. |
| Open Datasets | Yes | The experiments assess the performance obtained by the proposed approach for split conformal prediction with unsupervised classification using 9 common benchmark datasets. The code implementing the methods presented and reproducing the experiments can be found at https://github.com/MachineLearningBCAM/Unsupervised-conformal-prediction-NeurIPS2025. The supplementary materials provide implementation details and additional results in Appendix E, including running time assessments, as well as results for different target coverages and types of conformal scores. E.1 Implementation and datasets details: We utilize 9 publicly available and common benchmark datasets for classification tasks: Drybean, Forestcov, Satellite, USPS, MNIST, Fashion MNIST, CIFAR10, ImageNet10 (first 10 classes of ImageNet), and Letter. These datasets can be found in the UCI repository [23], TensorFlow datasets [24], and the Kaggle website (https://www.kaggle.com). |
| Dataset Splits | Yes | In each random realization, the datasets are randomly partitioned into training, calibration, and test sets. The sizes of the training and test sets are 3,000 and 1,000 samples, respectively, and that of the calibration set is varied from 10 to 3,000. |
| Hardware Specification | No | Figure 3 shows the running time achieved by Algorithm 3 using a regular desktop machine in the datasets Drybean, CIFAR10, and Letter. |
| Software Dependencies | No | In the experimental results carried out, such optimization problem is solved using interior point methods with the Mosek solver (https://www.mosek.com). |
| Experiment Setup | Yes | The classification rules are obtained using random forests in the tabular datasets (Drybean, Forestcov, Satellite, and Letter) and using neural networks in the image datasets (USPS, MNIST, Fashion MNIST, CIFAR10, and ImageNet10). Specifically, the random forests are given by 200 decision trees and learning is carried out with 20 maximum number of splits and 10 minimum leaf size. The neural networks for USPS, MNIST, and Fashion MNIST datasets have two hidden layers of sizes 128 and 64 and learning is carried out with regularization parameter 0.001. The implementation for CIFAR10 and ImageNet10 datasets is slightly different due to their higher complexity. In particular, the classification rule for those datasets is learned only once by fine-tuning a ResNet50 until a validation error of 4% in CIFAR10 and 11% in ImageNet10 datasets, using SGD with momentum, initial learning rate of 0.001, minibatch size of 64, and regularization parameter 0.001. |
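
The random partition described under Dataset Splits, together with the supervised baseline named as Algorithm 1, can be illustrated with a minimal sketch. This is not the authors' code (that is in the linked repository), and it does not reproduce the unsupervised calibration of Algorithms 2 and 3. It follows the standard textbook form of split conformal prediction. The function names, the conformal score `1 - p(true label)`, and the miscoverage level `alpha` are assumptions made for illustration; only the training and test sizes (3,000 and 1,000) come from the excerpts above, and the default calibration size is one arbitrary value from the reported 10 to 3,000 range.

```python
import numpy as np


def random_partition(n_samples, n_train=3000, n_cal=1000, n_test=1000, seed=0):
    """Randomly split sample indices into disjoint train/calibration/test sets.

    n_train and n_test follow the sizes quoted in the table; n_cal is a
    free parameter (the table reports it being varied from 10 to 3,000).
    """
    if n_train + n_cal + n_test > n_samples:
        raise ValueError("not enough samples for the requested split sizes")
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    train = perm[:n_train]
    cal = perm[n_train:n_train + n_cal]
    test = perm[n_train + n_cal:n_train + n_cal + n_test]
    return train, cal, test


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Threshold from labeled calibration data (standard split conformal).

    Assumed score: 1 - predicted probability of the true label.
    The threshold is the ceil((n + 1) * (1 - alpha))-th smallest score,
    or +inf when the calibration set is too small for that rank to exist.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf
    return np.sort(scores)[k - 1]


def prediction_sets(test_probs, threshold):
    """Boolean matrix: entry (i, y) is True if label y is in the set for sample i."""
    return (1.0 - test_probs) <= threshold
```

Here `cal_probs` and `test_probs` stand for class-probability outputs of any fitted classifier, for example a random forest or neural network as in the Experiment Setup row. An infinite threshold yields prediction sets containing every label, which is the conservative behavior for very small calibration sets.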