A Scalable Laplace Approximation for Neural Networks
Authors: Hippolyt Ritter, Aleksandar Botev, David Barber
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively compare our method to using Dropout and a diagonal Laplace approximation for estimating the uncertainty of a network. We demonstrate that our Kronecker factored method leads to better uncertainty estimates on out-of-distribution data and is more robust to simple adversarial attacks. Our approach only requires calculating two square curvature factor matrices for each layer. Their size is equal to the respective square of the input and output size of the layer, making the method efficient both computationally and in terms of memory usage. We illustrate its scalability by applying it to a state-of-the-art convolutional network architecture. (A hedged numpy sketch of the per-layer factor shapes and posterior sampling appears after the table.) |
| Researcher Affiliation | Academia | Hippolyt Ritter¹, Aleksandar Botev¹, David Barber¹,² (¹University College London, ²Alan Turing Institute) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We make our fork available at: https://github.com/BB-UCL/Lasagne |
| Open Datasets | Yes | We assess the uncertainty of the predictions when classifying data from a different distribution than the training data. For this we train a network with two layers of 1024 hidden units and ReLU transfer functions to classify MNIST digits. We use a learning rate of 10⁻² and momentum of 0.9 for 250 epochs. We apply Dropout with p=0.5 after each inner layer, as our chief interest is to compare against its uncertainty estimates. We further use L2-regularisation with a factor of 10⁻² and randomly binarise the images during training according to their pixel intensities and draw 1,000 such samples per datapoint for estimating the curvature factors. We use this network to classify the images in the notMNIST dataset (footnote 4: from http://yaroslavvb.blogspot.nl/2011/09/notmnist-dataset.html), which contains 28×28 grey-scale images of the letters A to J from various computer fonts, i.e. not digits. We apply it to a state-of-the-art convolutional network architecture. Recently, deep residual networks (He et al., 2016a;b) have been the most successful ones among those. We compare our uncertainty estimates on wide residual networks (Zagoruyko & Komodakis, 2016), a recent variation that achieved competitive performance on CIFAR100 (Krizhevsky & Hinton, 2009). (A hedged PyTorch sketch of the quoted MNIST network and its training settings appears after the table.) |
| Dataset Splits | Yes | We set the hyperparameters of the Laplace approximations (see Section 3.4) using a grid search over the likelihood of 20 validation points that are sampled the same way as the training set. We use the first 5,000 images as a validation set to tune the hyperparameters of our Laplace approximation and the final 5,000 ones for evaluating the predictive uncertainty on all methods. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | All experiments are implemented using Theano (Theano Development Team, 2016) and Lasagne (Dieleman et al., 2015). The paper mentions the software names but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We use a learning rate of 10⁻² and momentum of 0.9 for 250 epochs. We apply Dropout with p=0.5 after each inner layer, as our chief interest is to compare against its uncertainty estimates. We further use L2-regularisation with a factor of 10⁻² and randomly binarise the images during training according to their pixel intensities and draw 1,000 such samples per datapoint for estimating the curvature factors. Our wide residual network has n=3 block repetitions and a width factor of k=8 on CIFAR100 with and without Dropout using hyperparameters taken from (Zagoruyko & Komodakis, 2016): the network parameters are trained on a cross-entropy loss using Nesterov momentum with an initial learning rate of 0.1 and momentum of 0.9 for 200 epochs with a minibatch size of 128. We decay the learning rate every 50 epochs by a factor of 0.2, which is slightly different to the schedule used in (Zagoruyko & Komodakis, 2016) (they decay after 60, 120 and 160 epochs). As the original authors, we use L2-regularisation with a factor of 5×10⁻⁴. (A hedged sketch of this CIFAR-100 training schedule appears after the table.) |
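
The "Research Type" row quotes the paper's claim that only two square curvature factors are needed per layer, with sizes given by the layer's input and output dimensions. Below is a minimal numpy sketch of that idea for a single dense layer. This is not the authors' code: the factor shapes, the (d_out, d_in) weight layout, and the √N / √τ damping applied before inversion are assumptions made for illustration (the paper's Section 3.4 describes its own regularisation and hyperparameter tuning).

```python
import numpy as np

def sample_kron_laplace_weights(W_map, A, G, n_data, tau, rng=None):
    """Draw one weight sample for a single dense layer from a
    Kronecker-factored Laplace posterior (illustrative sketch only).

    Assumed shapes:
      W_map : (d_out, d_in)   MAP weights of the layer
      A     : (d_in, d_in)    curvature factor over layer inputs
      G     : (d_out, d_out)  curvature factor over pre-activations
    """
    rng = np.random.default_rng() if rng is None else rng
    d_out, d_in = W_map.shape

    # Damp each small factor before inversion; the sqrt(N)/sqrt(tau)
    # scaling is an assumption standing in for the paper's Section 3.4
    # regularisation, with tau tuned on validation data.
    A_reg = np.sqrt(n_data) * A + np.sqrt(tau) * np.eye(d_in)
    G_reg = np.sqrt(n_data) * G + np.sqrt(tau) * np.eye(d_out)

    # Only these two factors are ever stored or decomposed: their sizes
    # are quadratic in the layer's input/output size, never in d_in * d_out.
    A_inv_half = np.linalg.inv(np.linalg.cholesky(A_reg)).T  # A_reg^{-1} = A_inv_half @ A_inv_half.T
    G_inv_half = np.linalg.inv(np.linalg.cholesky(G_reg)).T

    # Matrix-normal sample with row covariance G_reg^{-1} and column
    # covariance A_reg^{-1}, i.e. cov(vec(W)) = A_reg^{-1} kron G_reg^{-1}.
    E = rng.standard_normal((d_out, d_in))
    return W_map + G_inv_half @ E @ A_inv_half.T

# Hypothetical usage for a 1024 -> 10 output layer on MNIST:
# W_s = sample_kron_laplace_weights(W_map, A, G, n_data=60000, tau=1e-2)
```

The memory claim in the quote is visible directly in the shapes: only a d_in×d_in and a d_out×d_out matrix are stored and factorised per layer, never a full covariance over all d_in·d_out weights.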
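
The MNIST setup quoted in the "Open Datasets" and "Experiment Setup" rows can be written down concretely. The sketch below uses PyTorch rather than the paper's Theano/Lasagne stack; mapping the quoted L2 factor onto `weight_decay` and the exact placement of Dropout are assumptions, so this only makes the quoted hyperparameters concrete rather than reproducing the authors' training script.

```python
import torch
from torch import nn

# Two hidden layers of 1024 units with ReLU activations and Dropout
# p=0.5 after each inner layer, as quoted from the paper.
mnist_mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 1024), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(1024, 10),
)

# SGD with learning rate 1e-2 and momentum 0.9 for 250 epochs; treating
# the quoted L2 factor of 1e-2 as PyTorch's weight_decay is an assumption.
optimizer = torch.optim.SGD(
    mnist_mlp.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-2
)
num_epochs = 250

def binarise(batch):
    # Stochastic binarisation according to pixel intensities, as in the
    # quoted setup; 1,000 such samples per datapoint are drawn when
    # estimating the curvature factors.
    return torch.bernoulli(batch)
```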
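
For CIFAR-100, the "Experiment Setup" row quotes a complete training schedule for the wide residual network (n=3 block repetitions, width factor k=8). The architecture itself comes from Zagoruyko & Komodakis (2016) and is not reconstructed here; the sketch below only encodes the quoted optimisation schedule, again in PyTorch rather than Theano/Lasagne, and again treating the L2 factor of 5×10⁻⁴ as `weight_decay`.

```python
import torch
from torch import nn

def make_wrn_optimizer(model: nn.Module):
    """Optimiser and LR schedule matching the quoted CIFAR-100 setup:
    Nesterov momentum SGD, initial learning rate 0.1, momentum 0.9,
    and the learning rate multiplied by 0.2 every 50 epochs."""
    optimizer = torch.optim.SGD(
        model.parameters(), lr=0.1, momentum=0.9,
        nesterov=True, weight_decay=5e-4,
    )
    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=50, gamma=0.2
    )
    return optimizer, scheduler

batch_size, num_epochs = 128, 200
# Training loop (not shown) would call scheduler.step() once per epoch,
# giving the 50-epoch step decay the row describes.
```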