Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Conservative Prediction via Data-Driven Confidence Minimization
Authors: Caroline Choi, Fahim Tajwar, Yoonho Lee, Huaxiu Yao, Ananya Kumar, Chelsea Finn
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically verify our approach through experiments on several standard benchmarks for selective classification and OOD detection, which demonstrate the effectiveness of DCM. In selective classification, DCM consistently outperforms 6 representative approaches in conditions of distribution shift by 2.3% across 4 distribution-shift datasets. DCM also outperforms an ensemble of 5 models on 3 out of 4 datasets in AUROC, despite the 5× difference in computational cost. In the OOD detection setting, among other methods, we provide a comparison with Outlier Exposure (Hendrycks et al., 2018), allowing us to test our choice of uncertainty dataset. DCM consistently outperforms Outlier Exposure on a benchmark of 8 ID-OOD distribution pairs, reducing FPR (at TPR 95%) by 6.3% and 58.1% on CIFAR-10 and CIFAR-100, respectively. DCM also shows strong performance in challenging near-OOD detection settings, achieving 1.89% and 2.94% higher AUROC compared to the state-of-the-art. |
| Researcher Affiliation | Academia | Caroline Choi EMAIL Department of Computer Science Stanford University; Fahim Tajwar EMAIL Machine Learning Department Carnegie Mellon University; Yoonho Lee EMAIL Department of Computer Science Stanford University; Huaxiu Yao EMAIL Department of Computer Science University of North Carolina at Chapel Hill; Ananya Kumar EMAIL Department of Computer Science Stanford University; Chelsea Finn EMAIL Department of Computer Science Stanford University |
| Pseudocode | Yes | We outline our approach in Algorithm 1. Algorithm 1 (DCM for Selective Classification). Input: training data Dtr, validation data Dval, hyperparameter λ. Initialize weights θ ← θ0. While not converged: sample mini-batch Btr ⊂ Dtr; update θ using ∇θ Lxent(Btr, f). Get correct set D✓val = {(x, y) ∈ Dval \| fθ(x) = y} and error set D✗val = {(x, y) ∈ Dval \| fθ(x) ≠ y}. While not converged: sample mini-batches Btr ⊂ Dtr ∪ D✓val and B✗val ⊂ D✗val; update θ using ∇θ[Lxent(Btr, f) + λLconf(B✗val, f)]. Algorithm 2 (DCM for OOD Detection). Input: training data Dtr, unlabeled data Du, hyperparameter λ. Initialize weights θ ← θ0. While not converged: sample mini-batch Btr ⊂ Dtr; update θ using ∇θ Lxent(f, Btr). While not converged: sample mini-batches Btr ⊂ Dtr and Bu ⊂ Du; update θ using ∇θ[Lxent(f, Btr) + λLconf(f, Bu)]. |
| Open Source Code | No | The paper mentions OpenReview as a platform for discussion (https://openreview.net/forum?id=QPuxjsjKCP), but it does not provide any explicit statement or link to the authors' source code for the methodology described in the paper. |
| Open Datasets | Yes | Datasets. We use CIFAR-10 and CIFAR-100 as our ID datasets and TinyImageNet, LSUN, iSUN and SVHN as our OOD datasets, resulting in a total of 8 ID-OOD pairs... For comparison on large-scale image datasets, we use ImageNet-1K as ID and iNaturalist, SUN, Textures and Places as OOD datasets... We evaluate selective classification performance on CIFAR-10 (Krizhevsky et al., a) and CIFAR10-C (Hendrycks & Dietterich, 2019), Waterbirds (Sagawa et al., 2019; Wah et al., 2011), Camelyon17 (Koh et al., 2021), and FMoW (Koh et al., 2021). |
| Dataset Splits | Yes | Our uncertainty and test sets are disjoint datasets with 5,000 and 1,000 examples, respectively... We split the ID data into 40,000 examples for training and 10,000 examples for validation... For all methods except outlier exposure and energy based fine-tuning, we use 40,000 out of the 50,000 train examples for training and 10,000 train examples for validation... We use two disjoint sets of 6,000 images as the uncertainty dataset and test set. Each set contains 5,000 ID examples and 1,000 OOD examples. |
| Hardware Specification | Yes | All model training and experiments were conducted on a single NVIDIA RTX Titan or A40 GPU. |
| Software Dependencies | No | The paper mentions common deep learning frameworks like PyTorch implicitly (e.g., through references to `odin-pytorch` on GitHub), but it does not provide specific version numbers for any software dependencies used in the experiments. |
| Experiment Setup | Yes | We find that λ = 0.5 works well in practice and use this value in all experiments unless otherwise specified. Further details, such as fine-tuning duration and the number of samples in Dft and Dunc, are described in Appendix C... For MSP, ODIN, Mahalanobis and energy score, we train our networks for 110 epochs with an initial learning rate of 0.1, weight decay of 5e-4, dropout 0.3 and batch size 128... for our method, we pre-train our network for 100 epochs with the same setup, and fine-tune the network with our modified loss objective for 10 epochs using the same setting, except we use an initial learning rate of 0.001, batch size 32 for the ID train set and 64 for the uncertainty dataset. During fine-tuning, we use 27,000 images per epoch, 9,000 of which are labeled ID train examples and the rest are from the uncertainty dataset. Finally, we use λ = 0.5 for all experiments, as in Hendrycks et al. (2018), without any additional hyper-parameter tuning. |
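The fine-tuning objective extracted above (Lxent on labeled ID batches plus λ·Lconf on uncertainty batches, with λ = 0.5) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the exact form of Lconf is defined in the paper, and here it is stood in for by cross-entropy against the uniform distribution, a common instantiation of confidence minimization. All function names (`xent_loss`, `conf_loss`, `dcm_objective`) are hypothetical, and logits are plain lists rather than framework tensors.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one example's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def xent_loss(batch_logits, labels):
    """Standard cross-entropy on labeled ID examples (Lxent)."""
    return -sum(log_softmax(z)[y] for z, y in zip(batch_logits, labels)) / len(labels)

def conf_loss(batch_logits):
    """Confidence minimization on uncertainty examples (Lconf, sketched here
    as cross-entropy to the uniform distribution over the K classes, which
    pushes predictions toward maximal entropy)."""
    k = len(batch_logits[0])
    return -sum(sum(log_softmax(z)) / k for z in batch_logits) / len(batch_logits)

def dcm_objective(id_logits, id_labels, unc_logits, lam=0.5):
    """Fine-tuning objective: Lxent(Btr, f) + λ * Lconf(Bunc, f)."""
    return xent_loss(id_logits, id_labels) + lam * conf_loss(unc_logits)

# A maximally uncertain prediction (uniform logits) minimizes conf_loss at log K,
# while a confident prediction on an uncertainty example is penalized more heavily.
uniform = conf_loss([[0.0, 0.0, 0.0]])        # = log 3 ≈ 1.0986
confident = conf_loss([[10.0, 0.0, 0.0]])     # ≈ 6.67, much larger
```

In the paper's pipeline this combined loss would replace plain cross-entropy only during the 10-epoch fine-tuning phase, after standard pre-training on Dtr.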