Deep kernel processes
Authors: Laurence Aitchison, Adam Yang, Sebastian W. Ober
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We found that DIWP usually gives better predictive performance (and when it does not, the differences are very small; Table 1). We expected DIWP to be better than (or the same as) the NNGP, as the NNGP is a special case of our DIWP. We found that the DGP performs poorly in comparison to DIWP and NNGPs, and even to past baselines, on all datasets except protein (which is by far the largest). |
| Researcher Affiliation | Academia | 1Department of Computer Science, Bristol, BS8 1UB, UK; 2Department of Engineering, Cambridge, CB2 1PZ, UK. |
| Pseudocode | Yes | Algorithm 1 Computing predictions/ELBO for one batch |
| Open Source Code | Yes | 1Reference implementation at: github.com/LaurenceA/bayesfunc |
| Open Datasets | Yes | We began by comparing the performance of our deep inverse Wishart process (DIWP) against infinite Bayesian neural networks (known as the neural network Gaussian process or NNGP) and DGPs. To ensure sensible comparisons against the NNGP, we used a ReLU kernel in all models (Cho & Saul, 2009). For all models, we used three layers (two hidden layers and one output layer), with three applications of the kernel. In each case, we used a learned bias and scale for each input feature, and trained for 8000 gradient steps with the Adam optimizer with 100 inducing points, a learning rate of 10^-2 for the first 4000 steps and 10^-3 for the final 4000 steps. For evaluation, we used 100 approximate posterior samples, and for each training step we used 10 approximate posterior samples in the smaller datasets (boston, concrete, energy, wine, yacht), and 1 in the larger datasets. We found that DIWP usually gives better predictive performance (and when it does not, the differences are very small; Table 1). We expected DIWP to be better than (or the same as) the NNGP, as the NNGP is a special case of our DIWP. We found that the DGP performs poorly in comparison to DIWP and NNGPs, and even to past baselines, on all datasets except protein (which is by far the largest). This is because we used a plain feedforward architecture for all models. In contrast, Salimbeni & Deisenroth (2017) found that good performance (or even convergence) with DGPs on UCI datasets required a complex GP prior inspired by skip connections. Here, we used simple feedforward architectures, both to ensure a fair comparison to the other models, and to avoid the need for an architecture search. In addition, the inverse Wishart process is implicitly able to learn the network width, δℓ, whereas in the DGPs, the width is fixed to be equal to the number of input features, following standard practice in the literature (e.g. Salimbeni & Deisenroth, 2017). Next, we considered fully-connected networks for small image classification datasets (MNIST and CIFAR-10; Table 2). We used the same models as in the previous section, with the omission of learned bias and scaling of the inputs. Note that we do not expect these methods to perform well relative to standard methods (e.g. CNNs) on these datasets, as we are using fully-connected networks with only 100 inducing points (whereas e.g. work in the NNGP literature uses the full 60,000 × 60,000 covariance matrix). Nonetheless, as the architectures are carefully matched, it provides another opportunity to compare the performance of DIWPs, NNGPs and DGPs. (A hedged sketch of the Cho & Saul ReLU kernel is given below the table.) |
| Dataset Splits | Yes | Errors are quoted as two standard errors in the difference between that method and the best performing method, as in a paired t-test. This is to account for the shared variability that arises due to the use of different test/train splits in the data (20 splits for all but protein, where 5 splits are used; Gal & Ghahramani, 2015). (A minimal sketch of this error-bar computation appears below the table.) |
| Hardware Specification | No | The paper mentions 'University of Bristol’s Advanced Computing Research Centre (ACRC) for computational resources' but does not specify any particular hardware components like CPU or GPU models, or memory. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer' but does not provide specific version numbers for any software libraries, frameworks, or programming languages. |
| Experiment Setup | Yes | For all models, we used three layers (two hidden layers and one output layer), with three applications of the kernel. In each case, we used a learned bias and scale for each input feature, and trained for 8000 gradient steps with the Adam optimizer with 100 inducing points, a learning rate of 10^-2 for the first 4000 steps and 10^-3 for the final 4000 steps. (A generic sketch of this training schedule appears below the table.) |
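
All models in the quoted comparison use the ReLU kernel of Cho & Saul (2009), i.e. the first-order arc-cosine kernel. The snippet below is a minimal sketch of that kernel written from its published closed form; it is not taken from the authors' bayesfunc implementation, and the function name is ours.

```python
import numpy as np

def arccos_kernel_1(X, Y):
    """First-order arc-cosine ("ReLU") kernel of Cho & Saul (2009):
    k(x, y) = (1/pi) * ||x|| * ||y|| * (sin(t) + (pi - t) * cos(t)),
    where t is the angle between x and y.
    X: (n, d) array, Y: (m, d) array; returns the (n, m) kernel matrix."""
    norm_x = np.linalg.norm(X, axis=1, keepdims=True)      # shape (n, 1)
    norm_y = np.linalg.norm(Y, axis=1, keepdims=True).T    # shape (1, m)
    # Clip to guard against round-off pushing cosines outside [-1, 1].
    cos_t = np.clip(X @ Y.T / (norm_x * norm_y), -1.0, 1.0)
    t = np.arccos(cos_t)
    return (norm_x * norm_y) / np.pi * (np.sin(t) + (np.pi - t) * np.cos(t))
```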
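The Experiment Setup row specifies the optimisation schedule: 8000 Adam steps with 100 inducing points, a learning rate of 10^-2 for the first 4000 steps and 10^-3 afterwards, and a Monte Carlo ELBO estimated from 10 (small datasets) or 1 (large datasets) approximate-posterior samples per step. The PyTorch-style loop below is a generic sketch of that schedule under those assumptions; `model.negative_elbo` is a placeholder, not the bayesfunc API.

```python
import torch

def train(model, loader, n_steps=8000, n_samples=10):
    """Hypothetical sketch of the quoted schedule: Adam, lr 1e-2 -> 1e-3 at
    step 4000, with n_samples approximate-posterior samples per step."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    # Drop the learning rate by a factor of 10 halfway through training.
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[4000], gamma=0.1)
    step = 0
    while step < n_steps:
        for x, y in loader:
            if step >= n_steps:
                break
            opt.zero_grad()
            # Monte Carlo estimate of the negative ELBO for this minibatch
            # (placeholder method; not the authors' actual interface).
            loss = model.negative_elbo(x, y, n_samples=n_samples)
            loss.backward()
            opt.step()
            sched.step()
            step += 1
```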
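The Dataset Splits row quotes error bars reported as two standard errors of the per-split difference between a method and the best-performing method, as in a paired t-test. The helper below is a minimal sketch of that computation over per-split metrics (20 splits for most UCI datasets, 5 for protein); the function name and inputs are ours.

```python
import numpy as np

def paired_error_bar(method_scores, best_scores):
    """Two standard errors of the per-split difference between a method and
    the best-performing method, as in a paired t-test."""
    diff = np.asarray(method_scores) - np.asarray(best_scores)
    stderr = diff.std(ddof=1) / np.sqrt(len(diff))
    return 2.0 * stderr  # quoted as +/- two standard errors of the difference
```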