Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Conservative Prediction via Data-Driven Confidence Minimization
Authors: Caroline Choi, Fahim Tajwar, Yoonho Lee, Huaxiu Yao, Ananya Kumar, Chelsea Finn
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically verify our approach through experiments on several standard benchmarks for selective classification and OOD detection, which demonstrate the effectiveness of DCM. In selective classification, DCM consistently outperforms 6 representative approaches in conditions of distribution shift by 2.3% across 4 distribution-shift datasets. DCM also outperforms an ensemble of 5 models on 3 out of 4 datasets in AUROC, despite the 5× difference in computational cost. In the OOD detection setting, among other methods, we provide a comparison with Outlier Exposure (Hendrycks et al., 2018), allowing us to test our choice of uncertainty dataset. DCM consistently outperforms Outlier Exposure on a benchmark of 8 ID-OOD distribution pairs, reducing FPR (at TPR 95%) by 6.3% and 58.1% on CIFAR-10 and CIFAR-100, respectively. DCM also shows strong performance in challenging near-OOD detection settings, achieving 1.89% and 2.94% higher AUROC compared to the state-of-the-art. |
| Researcher Affiliation | Academia | Caroline Choi EMAIL Department of Computer Science Stanford University; Fahim Tajwar EMAIL Machine Learning Department Carnegie Mellon University; Yoonho Lee EMAIL Department of Computer Science Stanford University; Huaxiu Yao EMAIL Department of Computer Science University of North Carolina at Chapel Hill; Ananya Kumar EMAIL Department of Computer Science Stanford University; Chelsea Finn EMAIL Department of Computer Science Stanford University |
| Pseudocode | Yes | We outline our approach in Algorithm 1. Algorithm 1 (DCM for Selective Classification). Input: training data Dtr, validation data Dval, hyperparameter λ. Initialize weights θ ← θ0. While not converged: sample mini-batch Btr ⊂ Dtr; update θ using ∇θ Lxent(Btr, f). Get correct set D✓val = {(x, y) ∈ Dval \| fθ(x) = y} and error set D✗val = {(x, y) ∈ Dval \| fθ(x) ≠ y}. While not converged: sample mini-batches Btr ⊂ Dtr ∪ D✓val and B✗val ⊂ D✗val; update θ using ∇θ[Lxent(Btr, f) + λLconf(B✗val, f)]. Algorithm 2 (DCM for OOD Detection). Input: training data Dtr, unlabeled data Du, hyperparameter λ. Initialize weights θ ← θ0. While not converged: sample mini-batch Btr ⊂ Dtr; update θ using ∇θ Lxent(f, Btr). While not converged: sample mini-batches Btr ⊂ Dtr and Bu ⊂ Du; update θ using ∇θ[Lxent(f, Btr) + λLconf(f, Bu)]. |
| Open Source Code | No | The paper mentions OpenReview as a platform for discussion (https://openreview.net/forum?id=QPuxjsjKCP), but it does not provide any explicit statement or link to the authors' source code for the methodology described in the paper. |
| Open Datasets | Yes | Datasets. We use CIFAR-10 and CIFAR-100 as our ID datasets and TinyImageNet, LSUN, iSUN and SVHN as our OOD datasets, resulting in a total of 8 ID-OOD pairs... For comparison on large-scale image datasets, we use ImageNet-1K as ID and iNaturalist, SUN, Textures and Places as OOD datasets... We evaluate selective classification performance on CIFAR-10 (Krizhevsky et al., a) and CIFAR10-C (Hendrycks & Dietterich, 2019), Waterbirds (Sagawa et al., 2019; Wah et al., 2011), Camelyon17 (Koh et al., 2021), and FMoW (Koh et al., 2021). |
| Dataset Splits | Yes | Our uncertainty and test sets are disjoint datasets with 5,000 and 1,000 examples, respectively... We split the ID data into 40,000 examples for training and 10,000 examples for validation... For all methods except outlier exposure and energy based fine-tuning, we use 40,000 out of the 50,000 train examples for training and 10,000 train examples for validation... We use two disjoint sets of 6,000 images as the uncertainty dataset and test set. Each set contains 5,000 ID examples and 1,000 OOD examples. |
| Hardware Specification | Yes | All model training and experiments were conducted on a single NVIDIA RTX Titan or A40 GPU. |
| Software Dependencies | No | The paper mentions common deep learning frameworks like PyTorch implicitly (e.g., through references to `odin-pytorch` on GitHub), but it does not provide specific version numbers for any software dependencies used in the experiments. |
| Experiment Setup | Yes | We find that λ = 0.5 works well in practice and use this value in all experiments unless otherwise specified. Further details, such as fine-tuning duration and the number of samples in Dft and Dunc, are described in Appendix C... For MSP, ODIN, Mahalanobis and energy score, we train our networks for 110 epochs with an initial learning rate of 0.1, weight decay of 5e-4, dropout 0.3 and batch size 128... for our method, we pre-train our network for 100 epochs with the same setup, and fine-tune the network with our modified loss objective for 10 epochs using the same setting, except we use an initial learning rate of 0.001, batch size 32 for the ID train set and 64 for the uncertainty dataset. During fine-tuning, we use 27,000 images per epoch, 9,000 of which are labeled ID train examples and the rest are from the uncertainty dataset. Finally, we use λ = 0.5 for all experiments, as in Hendrycks et al. (2018), without any additional hyper-parameter tuning. |
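The fine-tuning objective extracted above (Lxent on labeled ID batches plus λ·Lconf on uncertainty batches, with λ = 0.5) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the exact form of Lconf is defined in the paper, and here it is stood in for by cross-entropy against the uniform distribution, a common instantiation of confidence minimization. All function names (`xent_loss`, `conf_loss`, `dcm_objective`) are hypothetical, and logits are plain lists rather than framework tensors.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one example's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def xent_loss(batch_logits, labels):
    """Standard cross-entropy on labeled ID examples (Lxent)."""
    return -sum(log_softmax(z)[y] for z, y in zip(batch_logits, labels)) / len(labels)

def conf_loss(batch_logits):
    """Confidence minimization on uncertainty examples (Lconf, sketched here
    as cross-entropy to the uniform distribution over the K classes, which
    pushes predictions toward maximal entropy)."""
    k = len(batch_logits[0])
    return -sum(sum(log_softmax(z)) / k for z in batch_logits) / len(batch_logits)

def dcm_objective(id_logits, id_labels, unc_logits, lam=0.5):
    """Fine-tuning objective: Lxent(Btr, f) + λ * Lconf(Bunc, f)."""
    return xent_loss(id_logits, id_labels) + lam * conf_loss(unc_logits)

# A maximally uncertain prediction (uniform logits) minimizes conf_loss at log K,
# while a confident prediction on an uncertainty example is penalized more heavily.
uniform = conf_loss([[0.0, 0.0, 0.0]])        # = log 3 ≈ 1.0986
confident = conf_loss([[10.0, 0.0, 0.0]])     # ≈ 6.67, much larger
```

In the paper's pipeline this combined loss would replace plain cross-entropy only during the 10-epoch fine-tuning phase, after standard pre-training on Dtr.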