Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Quantifying Uncertainty in the Presence of Distribution Shifts
Authors: Yuli Slavutsky, David Blei
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate VIDS on both synthetic and real-world datasets, across classification and regression tasks. We compare the uncertainty estimates produced by VIDS (ours) with previous distance aware methods: SNGP [Liu et al., 2020], DUE [Van Amersfoort et al., 2021], and distance uncertainty layers (DUL) [Park and Blei, 2024] (see 1 for more details). In all experiments, the same neural network architecture is used as the prediction model. Hyper-parameters of our and competing methods were optimized via grid search to maximize average performance (accuracy for classification, RMSE for regression) on a single sample of J = 50 synthetic test environments of size m = 10, which was discarded from the analysis. |
| Researcher Affiliation | Academia | Yuli Slavutsky, Department of Statistics, Columbia University, New York, NY 10027, USA; David M. Blei, Departments of Statistics, Computer Science, Columbia University, New York, NY 10027, USA |
| Pseudocode | Yes | The full procedure is summarized in Algorithm 1 and illustrated in Figure 4. [...] In Algorithm 2 we summarize the complete procedure of variational posterior estimation with synthetic environments (see Appendix C). |
| Open Source Code | Yes | Code: The code to reproduce our results is attached to the submission, and upon acceptance a link to a permanent repository will be included in the main text. |
| Open Datasets | Yes | Corrupted CIFAR-10: We evaluate model performance under corruption-induced distribution shifts using the CIFAR-10-C dataset [Hendrycks and Dietterich, 2019]. We construct the training set of 5000 images, 90% clean from the original CIFAR-10 dataset [Krizhevsky et al., 2009] and 10% corrupted images from CIFAR-10-C, while the test set is constructed with 5000 images, 90% corrupted and 10% clean. Celeb-A: For each experiment, we choose one annotated attribute as the target, and another attribute A to induce a distribution shift in the Celeb-A dataset [Liu et al., 2015]. [...] We conduct experiments on three UCI regression datasets: Boston, Concrete, and Wine. [...] CIFAR-C dataset was obtained from Zenodo (https://zenodo.org/record/2535967/files/CIFAR-10-C.tar). |
| Dataset Splits | Yes | Corrupted CIFAR-10: ... We construct the training set of 5000 images, 90% clean from the original CIFAR-10 dataset [Krizhevsky et al., 2009] and 10% corrupted images from CIFAR-10-C, while the test set is constructed with 5000 images, 90% corrupted and 10% clean. Celeb-A: ... The training set contains 500 images with 90% having A = 1 and 10% with A = 0; the test set reverses this ratio: 90% images with A = 0 and 10% with A = 1. Regression: ... The training set contains 90% of the high-variance cluster, while the test set consists of 90% of the low-variance cluster. |
| Hardware Specification | Yes | We ran all synthetic data and UCI experiments on 2 CPUs. Each repetition of these experiments lasted less than 7 minutes. For real data classification experiments (on CIFAR-10 and Celeb-A datasets) we used a single A100 cloud GPU. Each repetition lasted less than 18 minutes. |
| Software Dependencies | Yes | All the code in this work was implemented in Python 3.11. We used NumPy 2.0, TensorFlow 2.13 and TensorFlow Addons 0.21 packages. The UCI datasets were loaded through sklearn 1.6. CIFAR-C dataset was obtained from Zenodo. CIFAR-10 and Celeb-A datasets were loaded through torchvision 0.21. All figures were generated using Matplotlib 3.10. |
| Experiment Setup | Yes | Hyper-parameters of our and competing methods were optimized via grid search to maximize average performance (accuracy for classification, RMSE for regression) on a single sample of J = 50 synthetic test environments of size m = 10, which was discarded from the analysis. For the corresponding values, and additional implementation details see Appendix E. [...] The hyperparameters used for the heteroskedastic linear regression and logistic regression with missing data are detailed in Table 2. [...] For VIDS we specify h_γ as a fully-connected neural network with 6 layers of sizes 64d, 32d, 16d, 8d, 4d, 2d and ReLU activation between the layers, for d = 8. |
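
The mixture proportions and layer sizes quoted in the table can be made concrete with a short sketch. This is only an illustration, not the authors' code: the helper names `mixed_split_indices` and `h_gamma_layer_widths`, the pool sizes, the seed, and the choice to sample without replacement are all assumptions introduced here.

```python
import random


def mixed_split_indices(n_clean_pool, n_corrupt_pool, size, clean_fraction, seed=0):
    """Pick indices for a split that mixes clean and corrupted images.

    Hypothetical helper: pool sizes, the seed, and sampling without
    replacement are assumptions, not details taken from the paper.
    """
    rng = random.Random(seed)
    n_clean = round(size * clean_fraction)
    n_corrupt = size - n_clean
    clean_idx = rng.sample(range(n_clean_pool), n_clean)
    corrupt_idx = rng.sample(range(n_corrupt_pool), n_corrupt)
    return clean_idx, corrupt_idx


def h_gamma_layer_widths(d=8):
    """Layer widths 64d, 32d, 16d, 8d, 4d, 2d quoted for the VIDS network."""
    return [m * d for m in (64, 32, 16, 8, 4, 2)]


# Placeholder pool sizes; the real pools depend on how the data is loaded.
train_clean, train_corrupt = mixed_split_indices(50_000, 10_000, 5000, 0.9, seed=0)
test_clean, test_corrupt = mixed_split_indices(50_000, 10_000, 5000, 0.1, seed=1)

print(len(train_clean), len(train_corrupt))  # 4500 500
print(len(test_clean), len(test_corrupt))    # 500 4500
print(h_gamma_layer_widths(8))               # [512, 256, 128, 64, 32, 16]
```

With the quoted settings, the training split holds 4500 clean and 500 corrupted images and the test split reverses those counts. The sketch does not prevent the same image from appearing in both splits; whether the paper enforces disjoint splits is not stated in the excerpts above.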