Adapting the Linearised Laplace Model Evidence for Modern Deep Learning
Authors: Javier Antorán, David Janz, James U. Allingham, Erik Daxberger, Riccardo Barbano, Eric Nalisnick, José Miguel Hernández-Lobato
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 'We provide theoretical support for our recommendations and validate them empirically on MLPs, classic CNNs, residual networks with and without normalisation layers, generative autoencoders and transformers.' Section 5 (Experiments) opens: 'We proceed to provide empirical evidence for our assumptions and recommendations.' |
| Researcher Affiliation | Academia | (1) University of Cambridge; (2) University of Alberta; (3) Max Planck Institute for Intelligent Systems, Tübingen; (4) University College London; (5) University of Amsterdam. Correspondence to: Javier Antorán <ja666@cam.ac.uk>. |
| Pseudocode | Yes | Algorithm 1: Efficient evaluation of the likelihood gradient for the linearised model |
| Open Source Code | No | The paper states 'We use the recently-released laplace library...' (a footnote gives the library's URL). This indicates the use of an external library, not the release of the authors' specific implementation of the methods described in this paper. |
| Open Datasets | Yes | The paper validates its recommendations on well-known public datasets such as MNIST, KMNIST (Clanuwat et al., 2018), and CIFAR10, across MLPs, classic CNNs, residual networks with and without normalisation layers, generative autoencoders and transformers. |
| Dataset Splits | No | The paper mentions 'val-based early stopping' and implies training, validation, and test sets are used, but it does not provide specific percentages or counts for these splits. For example, it does not state '80/10/10 split' or similar details for any dataset. |
| Hardware Specification | Yes | 'This choice avoids confounding the effects described in Section 3 with any further approximations. In Section 5.3, we show that our recommendations yield performance improvements on the 23M parameter ResNet-50 network while employing the standard KFAC approximation to the Hessian (Martens & Grosse, 2015; Daxberger et al., 2021a).' The paper also identifies 'the largest model for which we can tractably compute the Hessian on an A100 GPU.' |
| Software Dependencies | No | The paper states 'We use the recently-released laplace library' and provides a URL in a footnote, but it does not specify a version number for this library or any other software component used in the experiments (see the illustrative snippet after this table). |
| Experiment Setup | Yes | Unless specified otherwise, NN weights θ are learnt using SGD, with an initial learning rate of 0.1, momentum of 0.9, and weight decay of 1×10⁻⁴. We trained for 90 epochs, using a multi-step LR scheduler with a decay rate of 0.1 applied at epochs 40 and 70. (A minimal sketch of this training configuration follows the table.) |
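
The last two rows point at the most directly reusable reproducibility details: the quoted SGD recipe and the (unversioned) laplace dependency. As a point of reference, the first sketch below reconstructs the stated training configuration with standard PyTorch components; the model and data are synthetic placeholders, and this is not the authors' released code.

```python
import torch
from torch import nn
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data; the paper's experiments use MNIST, KMNIST and CIFAR10.
x = torch.randn(256, 1, 28, 28)
y = torch.randint(0, 10, (256,))
train_loader = DataLoader(TensorDataset(x, y), batch_size=128, shuffle=True)

# Placeholder model; the paper covers MLPs, classic CNNs, residual networks,
# generative autoencoders and transformers.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

# Hyperparameters as quoted in the paper: SGD with lr 0.1, momentum 0.9,
# weight decay 1e-4; 90 epochs; LR decayed by 0.1 at epochs 40 and 70.
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = MultiStepLR(optimizer, milestones=[40, 70], gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(90):
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    scheduler.step()
```

Likewise, a minimal sketch of fitting a Laplace approximation with a KFAC (Kronecker-factored) Hessian via the laplace library, assuming its publicly documented interface; exact argument names may vary across versions, which is precisely the versioning gap noted in the Software Dependencies row.

```python
# Hedged sketch of a KFAC Laplace fit with the `laplace` library (laplace-torch);
# keyword names follow the library's published examples, not the paper.
from laplace import Laplace

la = Laplace(model, 'classification',
             subset_of_weights='all',
             hessian_structure='kron')         # KFAC-structured Hessian
la.fit(train_loader)
la.optimize_prior_precision(method='marglik')  # tune the prior via the evidence
```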