Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Sample-Conditional Coverage in Split-Conformal Prediction

Authors: John C. Duchi

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our main purpose has been to investigate conditional quantile estimation procedures, providing theoretical bounds for their performance; there is already practical experience with these methods [12]. We thus provide an exploratory experiment on the CIFAR-100 dataset [20], a 100-class image classification dataset consisting of 60,000 training examples and a 10,000 example test set, highlighting that these conditional approaches can provide better coverage than using a static threshold, i.e., $\widehat{C}_n(x) = \{y \in \mathcal{Y} \mid s(x, y) \le \widehat{\tau}_n\}$. In Appendix D we provide a few further simulations to investigate heuristic corrections to the nominal level α that may yield better realized coverage.
Researcher Affiliation	Academia	John Duchi, Departments of Statistics and Electrical Engineering, Stanford University, EMAIL
Pseudocode	No	The paper describes methods using mathematical formulations and descriptive text, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code	Yes	The code is available, along with the paper, at the public repository https://github.com/jduchi/cond-conformal.
Open Datasets	Yes	We thus provide an exploratory experiment on the CIFAR-100 dataset [20], a 100-class image classification dataset consisting of 60,000 training examples and a 10,000 example test set, highlighting
Dataset Splits	Yes	1. Uniformly randomly split the training examples into a validation set of size 10,000 and a model training set of size 50,000, on which we fit a linear classifier $s : \mathbb{R}^d \to \mathbb{R}^k$, where $s_y(x) = \langle \beta_y, x \rangle$ is the score assigned to class y, using multinomial logistic regression.
Hardware Specification	No	The paper does not specify exact hardware models like specific GPU/CPU types, memory, or processor speeds. It mentions the experiments are runnable on a laptop, but without further detail.
Software Dependencies	No	The paper discusses methods like multinomial logistic regression and the use of ResNet but does not provide specific software library names with version numbers (e.g., Python, PyTorch, TensorFlow, scikit-learn versions).
Experiment Setup	Yes	We use the output features of a 50-layer ResNet, pre-trained on ImageNet [13, 14], as d = 2048-dimensional input to a 100-class logistic regression. We repeat the following experiment 10 times: 1. Uniformly randomly split the training examples into a validation set of size 10,000 and a model training set of size 50,000, on which we fit a linear classifier $s : \mathbb{R}^d \to \mathbb{R}^k$, where $s_y(x) = \langle \beta_y, x \rangle$ is the score assigned to class y, using multinomial logistic regression. 2. Draw a random matrix $W \in \mathbb{R}^{d \times d_0}$, where $d_0 = 10$ and $W_{ij} \stackrel{\text{iid}}{\sim} \mathsf{N}(0, 1)$, and use the validation data with score function $s(x, y) = s_y(x)$ and the lower-dimensional mapping $\phi(x) = W^\top x$ to predict quantiles via $\widehat{h}(x) = \langle \widehat{\theta}, \phi(x) \rangle$.
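
The split, random projection, and static-threshold steps quoted in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from the paper's repository. The function names (`split_indices`, `random_projection`, `phi`, `static_threshold`, `prediction_set`) are hypothetical, the projection is taken to be W^T x so that the output has d0 coordinates, the threshold uses the standard split-conformal order statistic ceil((1 - alpha)(n + 1)), and scores are assumed to be oriented so that smaller values mean a more plausible label. Fitting the classifier and the quantile predictor is omitted.

```python
import math
import random


def split_indices(n_total, n_val, rng):
    """Uniformly random split of range(n_total) into (validation, training) indices."""
    idx = list(range(n_total))
    rng.shuffle(idx)
    return idx[:n_val], idx[n_val:]


def random_projection(d, d0, rng):
    """Draw a d x d0 matrix (list of d rows) with i.i.d. N(0, 1) entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d0)] for _ in range(d)]


def phi(W, x):
    """Lower-dimensional features W^T x (assumed orientation of the mapping)."""
    d0 = len(W[0])
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(d0)]


def static_threshold(scores, alpha):
    """Standard split-conformal threshold: the ceil((1 - alpha)(n + 1))-th
    smallest validation score, or +inf when that rank exceeds n."""
    n = len(scores)
    k = math.ceil((1.0 - alpha) * (n + 1))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def prediction_set(score_row, tau):
    """Static-threshold set {y : s(x, y) <= tau} (assumes smaller score = more plausible)."""
    return [y for y, s in enumerate(score_row) if s <= tau]


# Sizes taken from the setup described above; the seed is an arbitrary choice.
rng = random.Random(0)
val_idx, train_idx = split_indices(60_000, 10_000, rng)
W = random_projection(d=2048, d0=10, rng=rng)
```

In the quoted setup the split and the draw of W are repeated 10 times. In this sketch that would amount to wrapping the last three lines in a loop, with a conditional threshold replacing the constant `tau` passed to `prediction_set`.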