Stochastic Kernel Regularisation Improves Generalisation in Deep Kernel Machines
Authors: Edward Milsom, Ben Anson, Laurence Aitchison
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Recent work developed convolutional deep kernel machines, achieving 92.7% test accuracy on CIFAR-10 using a ResNet-inspired architecture, which is SOTA for kernel methods. However, this still lags behind neural networks, which easily achieve over 94% test accuracy with similar architectures. In this work we introduce several modifications to improve the convolutional deep kernel machine's generalisation, including stochastic kernel regularisation, which adds noise to the learned Gram matrices during training. The resulting model achieves 94.5% test accuracy on CIFAR-10. (A sketch of this noise mechanism appears after the table.) |
| Researcher Affiliation | Academia | Edward Milsom, School of Mathematics, University of Bristol (edward.milsom@bristol.ac.uk); Ben Anson, School of Mathematics, University of Bristol (ben.anson@bristol.ac.uk); Laurence Aitchison, School of Engineering Mathematics and Technology, University of Bristol (laurence.aitchison@gmail.com) |
| Pseudocode | Yes | Algorithm 1 Convolutional deep kernel machine prediction. Changes from this paper are in red. |
| Open Source Code | Yes | Code available at https://github.com/edwardmilsom/skr_cdkm |
| Open Datasets | Yes | We evaluated our method on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009), containing 60,000 RGB images (50,000 train, 10,000 test) of size 32×32 divided into 10 classes. |
| Dataset Splits | Yes | We evaluated our method on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009), containing 60,000 RGB images (50,000 train, 10,000 test) of size 32×32 divided into 10 classes. (A loading snippet using this split appears after the table.) |
| Hardware Specification | Yes | On an NVIDIA A100 with TF32 matmuls and convolutions enabled, the Adam-trained neural network takes ~45s per epoch, whilst our model takes ~260s per epoch. We estimate (very roughly) a total time, including the ablations and CIFAR-100 experiments detailed later, of around 2000 GPU hours for all experiments in this paper, and around 2-3 times that number when including preliminary and failed experiments during the entire project. |
| Software Dependencies | No | The paper mentions 'The model is implemented in PyTorch (Paszke et al., 2019)' but does not provide a specific version number for PyTorch or any other software dependency. |
| Experiment Setup | Yes | For the stochastic kernel regularisation, we used γ = P_i^ℓ/4 and a jitter size of λ = 0.1, and for the objective we used a regularisation strength of ν = 0.001. We train all parameters by optimising the sparse DKM objective function (Equation 27 with Taylor-approximated terms from Section 3.2) using Adam (Kingma & Ba, 2017), with β1 = 0.8, β2 = 0.9 and with an initial learning rate of 0.01 which is divided by 10 at epochs 800 and 1100, for a total of 1200 epochs. (A training-loop sketch with these settings appears after the table.) |
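
The abstract quoted in the Research Type row describes stochastic kernel regularisation only as adding noise to the learned Gram matrices during training. A minimal sketch of one natural reading of that idea is below: draw a Wishart sample whose mean is the Gram matrix G, with degrees of freedom γ and a diagonal jitter λ as in the Experiment Setup row. The function name and exact parameterisation here are assumptions for illustration, not the authors' implementation (see the linked repository for that).

```python
import torch

def skr_sample(G: torch.Tensor, gamma: int, jitter: float = 0.1) -> torch.Tensor:
    """Illustrative SKR-style noise: a Wishart draw centred on G.

    W ~ Wishart(df=gamma, scale=G/gamma) has E[W] = G; with gamma < P the
    draw is rank-deficient, so the diagonal jitter keeps it positive definite.
    This is a sketch of the quoted idea, not the authors' code.
    """
    P = G.shape[-1]
    L = torch.linalg.cholesky(G)                 # G = L @ L.T
    Z = torch.randn(P, gamma, dtype=G.dtype)     # i.i.d. standard normals
    X = L @ Z / gamma ** 0.5                     # columns ~ N(0, G / gamma)
    return X @ X.T + jitter * torch.eye(P, dtype=G.dtype)

# Quick usage with a synthetic positive-definite Gram matrix:
P = 32
A = torch.randn(P, P)
G = A @ A.T / P + 1e-3 * torch.eye(P)
G_noisy = skr_sample(G, gamma=P // 4)            # gamma = P/4, as in the setup row
```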
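The 50,000/10,000 train/test split quoted in the dataset rows is CIFAR-10's built-in split, so the standard torchvision loaders reproduce it directly; the snippet below is generic loading code, not taken from the authors' repository.

```python
import torchvision
import torchvision.transforms as T

# CIFAR-10 ships with a fixed 50,000-image train set and 10,000-image test set.
transform = T.ToTensor()
train_set = torchvision.datasets.CIFAR10("data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10("data", train=False, download=True, transform=transform)
assert len(train_set) == 50_000 and len(test_set) == 10_000  # 32x32 RGB, 10 classes
```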
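Finally, a sketch of the optimiser and learning rate schedule described in the Experiment Setup row. `model`, `train_loader`, and `objective` are placeholders for the authors' sparse DKM and its objective (their Equation 27); only the Adam hyperparameters, milestones, and epoch count are taken from the quoted setup.

```python
import torch

def train(model, train_loader, objective, epochs=1200):
    """Training loop with the quoted hyperparameters; arguments are placeholders."""
    optimiser = torch.optim.Adam(model.parameters(), lr=0.01, betas=(0.8, 0.9))
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimiser, milestones=[800, 1100], gamma=0.1)  # lr /= 10 at epochs 800, 1100
    for _ in range(epochs):
        for batch in train_loader:
            optimiser.zero_grad()
            loss = -objective(model, batch)  # maximise the objective by minimising its negation
            loss.backward()
            optimiser.step()
        scheduler.step()  # the schedule steps once per epoch
    return model
```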