Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

How Reliable is Your Regression Model's Uncertainty Under Real-World Distribution Shifts?

Authors: Fredrik K. Gustafsson, Martin Danelljan, Thomas B. Schön

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Motivated by this, we set out to investigate the reliability of regression uncertainty estimation methods under various real-world distribution shifts. To that end, we propose an extensive benchmark of 8 image-based regression datasets with different types of challenging distribution shifts. We then employ our benchmark to evaluate many of the most common uncertainty estimation methods, as well as two state-of-the-art uncertainty scores from the task of out-of-distribution detection. We find that while methods are well calibrated when there is no distribution shift, they all become highly overconfident on many of the benchmark datasets.
Researcher Affiliation | Academia | Fredrik K. Gustafsson, Department of Information Technology, Uppsala University, Sweden; Martin Danelljan, Computer Vision Lab, ETH Zürich, Switzerland; Thomas B. Schön, Department of Information Technology, Uppsala University, Sweden
Pseudocode | No | The paper describes the methods and their implementation details in prose (Sections 2.1, 2.2, and 4) rather than providing structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/fregu856/regression_uncertainty. ... All experiments are implemented using PyTorch (Paszke et al., 2019), and our complete implementation is made publicly available.
Open Datasets | Yes | We propose an extensive benchmark for testing the reliability of regression uncertainty estimation methods under real-world distribution shifts. The benchmark consists of 8 publicly available image-based regression datasets, which are described in detail in Section 3.1. ... We utilize the Cell-200 dataset from Ding et al. (2021; 2020)... We utilize the RC-49 dataset (Ding et al., 2021; 2020)... We utilize the PovertyMap-Wilds dataset from (Koh et al., 2021)... We utilize the EchoNet-Dynamic dataset (Ouyang et al., 2020)... We utilize the brain tumour dataset of the medical segmentation decathlon (Simpson et al., 2019; Antonelli et al., 2022)... We utilize the HAM10000 dataset by Tschandl et al. (2018)... We utilize the CoNSeP dataset by Graham et al. (2019), along with the pre-processed versions they provide of the Kumar (Kumar et al., 2017) and TNBC (Naylor et al., 2018) datasets... We utilize the Inria aerial image labeling dataset (Maggiori et al., 2017).
Dataset Splits | Yes | Cells: We randomly draw 10 000 train images, 2 000 val images and 10 000 test images. ... Chair Angle: We randomly split their training set and obtain 17 640 train images and 4 410 val images. By sub-sampling their test set we also get 11 225 test images. ... Asset Wealth: We use the training, validation-ID and test-OOD subsets of the data, giving us 9 797 train images, 1 000 val images and 3 963 test images. ... Ventricular Volume: We utilize the provided dataset splits, giving us 7 460 train images, 1 288 val images and 1 276 test images. ... Brain Tumour Pixels: We split these scans 80%/20%/20% into train, val and test sets... This gives us 20 614 train images, 6 116 val images and 6 252 test images. ... Skin Lesion Pixels: After randomly splitting the remaining images 85%/15% into train and val sets, we obtain 6 592 train images, 1 164 val images and 2 259 test images. ... Histology Nuclei Pixels: In the end, we obtain 10 808 train images, 2 702 val images and 2 267 test images. ... Aerial Building Pixels: After preprocessing, we obtain 11 184 train images, 2 797 val images and 3 890 test images.
Hardware Specification | Yes | All models were trained on individual NVIDIA TITAN Xp GPUs. On one such GPU, training 20 models on one dataset took approximately 24 hours.
Software Dependencies | No | All experiments are implemented using PyTorch (Paszke et al., 2019), and our complete implementation is made publicly available. ... We then utilize scikit-learn (Pedregosa et al., 2011) to fit a GMM (4 components, full covariance) to these train feature vectors. ... Following Kuan & Mueller (2022), we utilize the Annoy approximate neighbors library, with cosine similarity as the distance metric.
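The GMM-based uncertainty score quoted above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random `train_features` stand in for the feature vectors extracted from the trained regression network, and the dimensionality and sample counts are arbitrary.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in for the network's train-set feature vectors;
# in the paper these are extracted from the trained regression model.
rng = np.random.default_rng(0)
train_features = rng.normal(loc=0.0, scale=1.0, size=(500, 8))

# Fit a GMM with 4 components and full covariance, as quoted above.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(train_features)

# A test input can then be scored by its log-likelihood under the fitted
# GMM: a low likelihood suggests the input is distribution-shifted.
in_dist = rng.normal(loc=0.0, scale=1.0, size=(1, 8))
shifted = rng.normal(loc=6.0, scale=1.0, size=(1, 8))
print(gmm.score_samples(in_dist)[0] > gmm.score_samples(shifted)[0])
```

An Annoy-based score would instead index the train feature vectors with cosine similarity and use the distance to the nearest neighbors as the uncertainty estimate.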
Experiment Setup | Yes | All models are trained for 75 epochs using the ADAM optimizer (Kingma & Ba, 2014). The same hyperparameters are used for all datasets, and neither the training procedure nor the models are specifically tuned for any particular dataset.
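For reference, a single ADAM update (Kingma & Ba, 2014, the optimizer quoted above) looks like the sketch below. The hyperparameter values shown are the optimizer's published defaults, used here only for illustration; the paper does not state its learning rate in this excerpt.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM parameter update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
grad = np.array([1.0, -1.0, 0.5])
theta, m, v = adam_step(theta, grad, m, v, t=1)
# After bias correction, the first step moves each parameter by
# roughly lr in the direction opposite its gradient sign.
```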