Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
How Reliable is Your Regression Model's Uncertainty Under Real-World Distribution Shifts?
Authors: Fredrik K. Gustafsson, Martin Danelljan, Thomas B. Schön
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Motivated by this, we set out to investigate the reliability of regression uncertainty estimation methods under various real-world distribution shifts. To that end, we propose an extensive benchmark of 8 image-based regression datasets with different types of challenging distribution shifts. We then employ our benchmark to evaluate many of the most common uncertainty estimation methods, as well as two state-of-the-art uncertainty scores from the task of out-of-distribution detection. We find that while methods are well calibrated when there is no distribution shift, they all become highly overconfident on many of the benchmark datasets. |
| Researcher Affiliation | Academia | Fredrik K. Gustafsson Department of Information Technology Uppsala University, Sweden Martin Danelljan Computer Vision Lab ETH Zürich, Switzerland Thomas B. Schön Department of Information Technology Uppsala University, Sweden |
| Pseudocode | No | The paper describes the methods and their implementation details in prose (Sections 2.1, 2.2, and 4) rather than providing structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/fregu856/regression_uncertainty. ... All experiments are implemented using PyTorch (Paszke et al., 2019), and our complete implementation is made publicly available. |
| Open Datasets | Yes | We propose an extensive benchmark for testing the reliability of regression uncertainty estimation methods under real-world distribution shifts. The benchmark consists of 8 publicly available image-based regression datasets, which are described in detail in Section 3.1. ... We utilize the Cell-200 dataset from Ding et al. (2021; 2020)... We utilize the RC-49 dataset (Ding et al., 2021; 2020)... We utilize the PovertyMap-Wilds dataset from (Koh et al., 2021)... We utilize the EchoNet-Dynamic dataset (Ouyang et al., 2020)... We utilize the brain tumour dataset of the medical segmentation decathlon (Simpson et al., 2019; Antonelli et al., 2022)... We utilize the HAM10000 dataset by Tschandl et al. (2018)... We utilize the CoNSeP dataset by Graham et al. (2019), along with the pre-processed versions they provide of the Kumar (Kumar et al., 2017) and TNBC (Naylor et al., 2018) datasets... We utilize the Inria aerial image labeling dataset (Maggiori et al., 2017). |
| Dataset Splits | Yes | Cells: We randomly draw 10 000 train images, 2 000 val images and 10 000 test images. ... Chair Angle: We randomly split their training set and obtain 17 640 train images and 4 410 val images. By sub-sampling their test set we also get 11 225 test images. ... Asset Wealth: We use the training, validation-ID and test-OOD subsets of the data, giving us 9 797 train images, 1 000 val images and 3 963 test images. ... Ventricular Volume: We utilize the provided dataset splits, giving us 7 460 train images, 1 288 val images and 1 276 test images. ... Brain Tumour Pixels: We split these scans 80%/20%/20% into train, val and test sets... This gives us 20 614 train images, 6 116 val images and 6 252 test images. ... Skin Lesion Pixels: After randomly splitting the remaining images 85%/15% into train and val sets, we obtain 6 592 train images, 1 164 val images and 2 259 test images. ... Histology Nuclei Pixels: In the end, we obtain 10 808 train images, 2 702 val images and 2 267 test images. ... Aerial Building Pixels: After preprocessing, we obtain 11 184 train images, 2 797 val images and 3 890 test images. |
| Hardware Specification | Yes | All models were trained on individual NVIDIA TITAN Xp GPUs. On one such GPU, training 20 models on one dataset took approximately 24 hours. |
| Software Dependencies | No | All experiments are implemented using PyTorch (Paszke et al., 2019), and our complete implementation is made publicly available. ... We then utilize scikit-learn (Pedregosa et al., 2011) to fit a GMM (4 components, full covariance) to these train feature vectors. ... Following Kuan & Mueller (2022), we utilize the Annoy approximate nearest neighbors library, with cosine similarity as the distance metric. |
| Experiment Setup | Yes | All models are trained for 75 epochs using the ADAM optimizer (Kingma & Ba, 2014). The same hyperparameters are used for all datasets, and neither the training procedure nor the models are specifically tuned for any particular dataset. |
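The Software Dependencies row above mentions fitting a 4-component, full-covariance GMM to train-set feature vectors with scikit-learn, one of the uncertainty scores the paper evaluates. A minimal sketch of that step follows; the random feature arrays are hypothetical stand-ins for features extracted from the regression model's backbone, and the variable names are illustrative, not taken from the authors' code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical stand-ins for feature vectors extracted from the trained
# regression model (the paper fits the GMM to such train features).
train_features = rng.normal(size=(500, 16))
test_features = rng.normal(size=(50, 16))

# Fit a GMM with 4 components and full covariance matrices to the
# train features, matching the configuration quoted above.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(train_features)

# Per-example log-likelihood under the fitted GMM: lower values mean the
# test features look unlike the training distribution, which can serve
# as an out-of-distribution-style uncertainty score.
scores = gmm.score_samples(test_features)
print(scores.shape)  # one score per test example
```

At evaluation time, such scores can be thresholded or rank-ordered to flag inputs on which the model's predicted uncertainty may be unreliable.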