Pitfalls of In-Domain Uncertainty Estimation and Ensembling in Deep Learning

Authors: Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, Dmitry Vetrov

ICLR 2020

Reproducibility variables, results, and supporting LLM responses:

Research Type: Experimental
"We explore the standards for [uncertainty] quantification and point out pitfalls of existing metrics. Avoiding these pitfalls, we perform a broad study of different ensembling techniques."

Researcher Affiliation: Collaboration
Arsenii Ashukha (Samsung AI Center Moscow, HSE), aashukha@bayesgroup.ru; Alexander Lyzhov (Samsung AI Center Moscow, Skoltech, HSE), alyzhov@bayesgroup.ru; Dmitry Molchanov (Samsung AI Center Moscow, HSE), dmolch@bayesgroup.ru; Dmitry Vetrov (HSE, Samsung AI Center Moscow), dvetrov@bayesgroup.ru

Pseudocode: No
The paper does not contain any explicitly labeled pseudocode or algorithm blocks.

Open Source Code: Yes
"The source code and all computed metrics are available on GitHub." Source code: https://github.com/bayesgroup/pytorch-ensembles

Open Datasets: Yes
"We use standard benchmark problems of image classification which comprise a common setting in research on learning ensembles of neural networks. There are other relevant settings where the correctness of probability estimates can be a priority, and ensembling techniques are used to improve it. These settings include, but are not limited to, regression, language modeling (Gal, 2016), image segmentation (Gustafsson et al., 2019), active learning (Settles, 2012) and reinforcement learning (Buckman et al., 2018; Chua et al., 2018). We compute the deep ensemble equivalent (DEE) of various ensembling techniques for four popular deep architectures: VGG16 (Simonyan & Zisserman, 2014), PreResNet110/164 (He et al., 2016), and WideResNet28x10 (Zagoruyko & Komodakis, 2016) on the CIFAR-10/100 datasets (Krizhevsky et al., 2009), and ResNet50 (He et al., 2016) on the ImageNet dataset (Russakovsky et al., 2015)."

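In the paper, the DEE of an ensembling technique is the minimal (possibly fractional) size of an ensemble of independently trained networks that reaches the same calibrated log-likelihood, found by piecewise-linear interpolation between measured ensemble sizes. Below is a minimal sketch of that interpolation step, assuming the mean calibrated log-likelihoods of deep ensembles of sizes 1..K have already been measured; the names are illustrative, not taken from the released code.

```python
import numpy as np

def deep_ensemble_equivalent(cll_method, cll_de):
    """DEE: the (fractional) number of independently trained networks whose
    deep ensemble matches the calibrated log-likelihood of the evaluated method.

    cll_method: calibrated log-likelihood of the evaluated ensembling technique.
    cll_de: mean calibrated log-likelihoods of deep ensembles of sizes
            1..len(cll_de), assumed non-decreasing with ensemble size.
    """
    sizes = np.arange(1, len(cll_de) + 1, dtype=float)
    if cll_method <= cll_de[0]:
        return 1.0  # DEE is lower-bounded by a single network
    if cll_method >= cll_de[-1]:
        return sizes[-1]  # saturates at the largest measured ensemble
    # Piecewise-linear interpolation between measured ensemble sizes.
    return float(np.interp(cll_method, cll_de, sizes))
```
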
Dataset Splits: Yes
"The public training set might be divided into a smaller training set and validation set, or the public test set can be split into test and validation parts (Guo et al., 2017; Nixon et al., 2019). ... In order to reduce the variance of the second approach, we perform a test-time cross-validation. We randomly divide the test set into two equal parts, then compute metrics for each half of the test set using the temperature optimized on another half. We repeat this procedure five times and average the results across different random partitions to reduce the variance of the computed metrics."

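The quoted procedure translates directly into code. Here is a minimal sketch, assuming per-example ensemble logits and labels are precomputed as tensors, and using a simple grid search in place of whatever 1-D temperature optimizer the released code uses:

```python
import torch
import torch.nn.functional as F

def optimal_temperature(logits, labels, grid=torch.linspace(0.1, 5.0, 491)):
    """Pick the temperature minimizing NLL on held-out logits (grid search)."""
    nlls = torch.stack([F.cross_entropy(logits / t, labels) for t in grid])
    return grid[nlls.argmin()]

def test_time_cross_validation(logits, labels, n_repeats=5, seed=0):
    """Split the test set into two halves, tune the temperature on one half,
    evaluate the calibrated NLL on the other, and average over random splits."""
    g = torch.Generator().manual_seed(seed)
    scores = []
    for _ in range(n_repeats):
        perm = torch.randperm(len(labels), generator=g)
        half = len(labels) // 2
        for fit, hold in [(perm[:half], perm[half:]), (perm[half:], perm[:half])]:
            t = optimal_temperature(logits[fit], labels[fit])
            scores.append(F.cross_entropy(logits[hold] / t, labels[hold]).item())
    return sum(scores) / len(scores)  # mean calibrated NLL across partitions
```
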
Hardware Specification: Yes
"Training of one model on a single NVIDIA Tesla V100 GPU takes approximately 5.5 days."

Software Dependencies: No
The paper mentions PyTorch (Paszke et al., 2017) but does not specify a version number for PyTorch or any other software dependencies crucial for replication.

Experiment Setup: Yes
"To train a single network on CIFAR-10/100, we used SGD with a batch size of 128, momentum 0.9, and model-specific parameters, i.e. the initial learning rate (lr_init), the weight decay coefficient (wd), and the number of optimization epochs (epoch). Specific hyperparameters are shown in Table 1."

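A minimal PyTorch sketch of this setup follows. The model and data are stand-ins so the snippet runs as-is, the lr_init/wd/epochs values are placeholders (the real values are model-specific, per Table 1 of the paper), and any learning-rate schedule from the released code is omitted.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder hyperparameters; the paper's actual lr_init, wd, and epoch
# values are model-specific and listed in its Table 1.
lr_init, wd, epochs, batch_size = 0.1, 3e-4, 2, 128

# Stand-ins for a real architecture and CIFAR-10/100, for a self-contained demo.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
train_set = TensorDataset(torch.randn(512, 3, 32, 32),
                          torch.randint(0, 10, (512,)))
loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)

# SGD with momentum 0.9 and weight decay, as described in the quoted setup.
optimizer = torch.optim.SGD(model.parameters(), lr=lr_init,
                            momentum=0.9, weight_decay=wd)
criterion = nn.CrossEntropyLoss()

for epoch in range(epochs):
    for x, y in loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
```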