Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Quantifying Uncertainty in the Presence of Distribution Shifts

Authors: Yuli Slavutsky, David Blei

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate VIDS on both synthetic and real-world datasets, across classification and regression tasks. We compare the uncertainty estimates produced by VIDS (ours) with previous distance aware methods: SNGP [Liu et al., 2020], DUE [Van Amersfoort et al., 2021], and distance uncertainty layers (DUL) [Park and Blei, 2024] (see 1 for more details). In all experiments, the same neural network architecture is used as the prediction model. Hyper-parameters of our and competing methods were optimized via grid search to maximize average performance (accuracy for classification, RMSE for regression) on a single sample of J = 50 synthetic test environments of size m = 10, which was discarded from the analysis.
Researcher Affiliation	Academia	Yuli Slavutsky Department of Statistics Columbia University New York, NY 10027, USA EMAIL David M. Blei Departments of Statistics, Computer Science Columbia University New York, NY 10027, USA EMAIL
Pseudocode	Yes	The full procedure is summarized in Algorithm 1 and illustrated in Figure 4. [...] In Algorithm 2 we summarize the complete procedure of variational posterior estimation with synthetic environments (see Appendix C).
Open Source Code	Yes	Code The code to reproduce our results is attached to the submission and upon acceptance a link to a permanent repository will be included in the main text.
Open Datasets	Yes	Corrupted CIFAR-10 We evaluate model performance under corruption-induced distribution shifts using the CIFAR-10-C dataset [Hendrycks and Dietterich, 2019]. We construct the training set of 5000 images, 90% clean from the original CIFAR-10 dataset [Krizhevsky et al., 2009] and 10% corrupted images from CIFAR-10-C, while the test set is constructed with 5000 images, 90% corrupted and 10% clean. Celeb-A For each experiment, we choose one annotated attribute as the target, and another attribute A to induce a distribution shift in the CelebA dataset [Liu et al., 2015]. [...] We conduct experiments on three UCI regression datasets Boston, Concrete, and Wine. [...] CIFAR-C dataset was obtained from Zenodo (https://zenodo.org/record/2535967/files/CIFAR-10-C.tar).
Dataset Splits	Yes	Corrupted CIFAR-10 ...We construct the training set of 5000 images, 90% clean from the original CIFAR-10 dataset [Krizhevsky et al., 2009] and 10% corrupted images from CIFAR-10-C, while the test set is constructed with 5000 images, 90% corrupted and 10% clean. Celeb-A ...The training set contains 500 images with 90% having A = 1 and 10% with A = 0; the test set reverses this ratio: 90% images with A = 0 and 10% with A = 1. Regression ...The training set contains 90% of the high-variance cluster, while the test set consists of 90% of the low-variance cluster.
Hardware Specification	Yes	We ran all synthetic data and UCI experiments on 2 CPUs. Each repetition of these experiments lasted less than 7 minutes. For real data classification experiments (on CIFAR-10 and Celeb-A datasets) we used a single A100 cloud GPU. Each repetition lasted less than 18 minutes.
Software Dependencies	Yes	All the code in this work was implemented in Python 3.11. We used NumPy 2.0, TensorFlow 2.13 and TensorFlow Addons 0.21 packages. The UCI datasets were loaded through sklearn 1.6. CIFAR-C dataset was obtained from Zenodo. CIFAR-10 and Celeb-A datasets were loaded through torchvision 0.21. All figures were generated using Matplotlib 3.10.
Experiment Setup	Yes	Hyper-parameters of our and competing methods were optimized via grid search to maximize average performance (accuracy for classification, RMSE for regression) on a single sample of J = 50 synthetic test environments of size m = 10, which was discarded from the analysis. For the corresponding values, and additional implementation details see Appendix E. [...] The hyperparameters used for the heteroskedastic linear regression and logistic regression with missing data are detailed in Table 2. [...] For VIDS we specify hγ as a fully-connected neural network with 6 layers of sizes 64d, 32d, 16d, 8d, 4d, 2d and ReLU activation between the layers, for d = 8.
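
The Dataset Splits row describes training and test sets of 5000 images with reversed clean/corrupted ratios (90%/10% for training, 10%/90% for testing). The following is a minimal sketch of that mixing step, not the authors' code. The function name `make_shifted_split`, the id pools, and the seeds are hypothetical, and the sketch does not enforce that training and test items are disjoint, which the paper's procedure may do.

```python
import random


def make_shifted_split(clean_ids, corrupted_ids, size, clean_fraction, seed=0):
    """Sample `size` ids: a `clean_fraction` share clean, the rest corrupted."""
    rng = random.Random(seed)
    n_clean = round(size * clean_fraction)
    n_corrupted = size - n_clean
    # Sample without replacement from each pool, then mix the two groups.
    chosen = rng.sample(clean_ids, n_clean) + rng.sample(corrupted_ids, n_corrupted)
    rng.shuffle(chosen)
    return chosen


# Hypothetical id pools standing in for CIFAR-10 and CIFAR-10-C images.
clean_pool = [("clean", i) for i in range(10_000)]
corrupted_pool = [("corrupted", i) for i in range(10_000)]

# Training set: 90% clean, 10% corrupted; the test set reverses the ratio.
train = make_shifted_split(clean_pool, corrupted_pool, size=5000, clean_fraction=0.9, seed=0)
test = make_shifted_split(clean_pool, corrupted_pool, size=5000, clean_fraction=0.1, seed=1)
```

The same helper would cover the Celeb-A setup by treating the two pools as images with A = 1 and A = 0 and setting `size=500`.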