Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Valid Selection among Conformal Sets

Authors: Mahmoud Hegazy, Liviu Aolaritei, Michael I Jordan, Aymeric Dieuleveut

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Lastly, we validate our approaches on multiple experimental settings in Section 6. ... 6 Experiments: To illustrate our approach in practice, we present here two simple experimental setups, one on synthetic and one on real data.
Researcher Affiliation | Academia | (1) CMAP, École polytechnique, Institut Polytechnique de Paris; (2) Inria, École Normale Supérieure, PSL Research University; (3) Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
Pseudocode | Yes | Algorithm 1: Adaptive COMA (AdaCOMA). Input: $K$ conformal algorithms $\{C_i^{(t)}\}_{i=1}^{K}$, stability parameters $(\eta, \tau)$, initial weights $w^{(1)} = (1/K, \ldots, 1/K)$. For $t = 1, 2, \ldots$: compute $w^{(t)}$ using COMA; compute $p(w^{(t)}, \xi_t) \in \Delta_{K-1}$ using MinSE with $b = w^{(t)}$ and parameters $(\eta, \tau)$. Output: either of the following two options. Option 1: the combined set $C_{\mathrm{comb}}^{(t)}(X_t)$, equal to $\{\, y \in \mathcal{Y} : \sum_{i=1}^{K} p_i(w^{(t)}, \xi_t)\, \mathbb{1}\{ y \in C_i^{(t)}(X_t) \} \;[\ldots]\; 1 \,\}$. Option 2: the combined predictor leading to $C^{(t)}_{\hat{S}(\xi_t, \varepsilon_t)}(X_t)$, with $\mathbb{P}\{ \hat{S}(\xi_t, \varepsilon_t) = i \mid \xi_t \} = p_i(w^{(t)}, \xi_t)$.
Open Source Code | Yes | Moreover, the code to reproduce the experiments is available in the supplementary material. Footnote 1: Code also available at Valid-Selection-among-Conformal-Sets. ... Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: ... The code to reproduce all experiments and a guide on how to use it can be found in the supplemental material.
Open Datasets | Yes | We conduct experiments on three standard regression datasets: Abalone, California Housing, and Bike Sharing [30, 31]. ... [31] Markelle Kelly, Rachel Longjohn, and Kolby Nottingham. The UCI machine learning repository. https://archive.ics.uci.edu, 2023. Accessed: 2023-10-05. ... For the real dataset experiments (Abalone, Bike Sharing [31, CC BY 4.0], California Housing [30, BSD License]), the hyperparameters of several base regression models were optimized prior to their use in the main conformal prediction experiments. ... We complement our regression studies with a compact ImageNet-1k classification [37, Non-Commercial Use] experiment intended to emulate a heterogeneous setting.
Dataset Splits | Yes | We used 80% of the data for training, 10% for calibration, and 10% for testing.
Hardware Specification | Yes | All experiments took approximately 200 CPU hours using 16 cores of Intel Xeon CPU Gold 6230 and 32 GB of system memory. ... Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: The computational resources, including hardware type (CPU/GPU) and runtime, are described in Appendix C.
Software Dependencies | Yes | ... using the Kernel Ridge Regression model from scikit-learn [29]. ... This tuning was performed using BayesSearchCV from the scikit-optimize library [32, BSD-2 license]. ... scikit-optimize/scikit-optimize: v0.5.2, March 2018. URL https://doi.org/10.5281/zenodo.1207017.
Experiment Setup | Yes | The feature dimension is set to $d = 10$, and the training data is split into two blocks. In the first block, we train $K$ distinct regression models, $f_1, \ldots, f_K$, using the Kernel Ridge Regression model from scikit-learn [29]. For each model, we randomly sample the kernel function (either linear or radial basis function (RBF)) and the regularization parameter (uniformly chosen between 0.1 and 1). For each model $i$, we use the second block of training data to train a random forest model $g_i$ that predicts the absolute residuals $|f_i(X) - Y|$, enabling us to use the nonconformity score, defined as $s_i(X, Y) = |f_i(X) - Y| / g_i(X)$. We use 400 datapoints for the calibration dataset. ... For each dataset, leveraging scikit-optimize for hyperparameter tuning [32], we used the following scikit-learn models: AdaBoostRegressor, DecisionTreeRegressor, GradientBoostingRegressor, ElasticNet, RandomForestRegressor, and LinearRegression. The BayesSearchCV process was configured to run for 25 iterations (n_iter=25) with 3-fold cross-validation (cv_folds=3) for each model and hyperparameter setting.
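
The split and the normalized nonconformity score quoted in the table can be illustrated with a minimal sketch of standard split conformal prediction for a single model. This is not the authors' code, and it does not implement COMA, MinSE, or AdaCOMA. The function names, the toy stand-ins f and g (in place of the Kernel Ridge and random forest models), and the level alpha = 0.1 are all assumptions made for illustration.

```python
import math
import random


def split_indices(n, train_frac=0.8, cal_frac=0.1, seed=0):
    """Shuffle range(n) and cut it into train / calibration / test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(train_frac * n)
    n_cal = round(cal_frac * n)
    return idx[:n_train], idx[n_train:n_train + n_cal], idx[n_train + n_cal:]


def normalized_score(pred, y, scale):
    """Nonconformity score s(X, Y) = |f(X) - Y| / g(X)."""
    return abs(pred - y) / scale


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Returns infinity when that rank exceeds the number of scores.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(scores)[k - 1]


def interval(pred, scale, q):
    """Prediction interval {y : |pred - y| / scale <= q}."""
    return pred - q * scale, pred + q * scale


# Hypothetical stand-ins: f plays the role of a fitted regressor,
# g the role of a fitted model of the absolute residuals.
def f(x):
    return 2.0 * x


def g(x):
    return 1.0 + abs(x)


rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(1000)]
ys = [f(x) + rng.gauss(0, g(x)) for x in xs]  # noise scale grows with |x|

# The training indices are unused here because f and g are fixed by hand.
train, cal, test = split_indices(len(xs))
cal_scores = [normalized_score(f(xs[i]), ys[i], g(xs[i])) for i in cal]
q = conformal_quantile(cal_scores, alpha=0.1)

covered = 0
for i in test:
    lo, hi = interval(f(xs[i]), g(xs[i]), q)
    covered += lo <= ys[i] <= hi
print(f"q = {q:.3f}, empirical test coverage = {covered / len(test):.2f}")
```

Dividing the residual by g(X) lets the interval widen where the residual model expects larger errors, which is the usual motivation for a normalized score of this form.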