Stochastic Kernel Regularisation Improves Generalisation in Deep Kernel Machines

Authors: Edward Milsom, Ben Anson, Laurence Aitchison

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Recent work developed convolutional deep kernel machines, achieving 92.7% test accuracy on CIFAR-10 using a ResNet-inspired architecture, which is SOTA for kernel methods. However, this still lags behind neural networks, which easily achieve over 94% test accuracy with similar architectures. In this work we introduce several modifications to improve the convolutional deep kernel machine's generalisation, including stochastic kernel regularisation, which adds noise to the learned Gram matrices during training. The resulting model achieves 94.5% test accuracy on CIFAR-10.
Researcher Affiliation | Academia | Edward Milsom, School of Mathematics, University of Bristol (edward.milsom@bristol.ac.uk); Ben Anson, School of Mathematics, University of Bristol (ben.anson@bristol.ac.uk); Laurence Aitchison, School of Engineering Mathematics and Technology, University of Bristol (laurence.aitchison@gmail.com)
Pseudocode | Yes | Algorithm 1: Convolutional deep kernel machine prediction. Changes from this paper are in red.
Open Source Code | Yes | Code available at https://github.com/edwardmilsom/skr_cdkm
Open Datasets | Yes | We evaluated our method on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009), containing 60,000 RGB images (50,000 train, 10,000 test) of size 32×32 divided into 10 classes.
Dataset Splits | Yes | We evaluated our method on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009), containing 60,000 RGB images (50,000 train, 10,000 test) of size 32×32 divided into 10 classes.
Hardware Specification | Yes | On an NVIDIA A100 with TF32 matmuls and convolutions enabled, the Adam-trained neural network takes ~45s per epoch, whilst our model takes ~260s per epoch. We estimate (very roughly) a total time, including the ablations and CIFAR-100 experiments detailed later, of around 2000 GPU hours for all experiments in this paper, and around 2-3 times that number when including preliminary and failed experiments during the entire project.
Software Dependencies | No | The paper mentions 'The model is implemented in PyTorch (Paszke et al., 2019)' but does not provide a specific version number for PyTorch or any other software dependency.
Experiment Setup | Yes | For the stochastic kernel regularisation, we used γ = P_i^ℓ/4 and a jitter size of λ = 0.1, and for the objective we used a regularisation strength of ν = 0.001. We train all parameters by optimising the sparse DKM objective function (Equation 27 with Taylor-approximated terms from Section 3.2) using Adam (Kingma & Ba, 2017), with β1 = 0.8, β2 = 0.9, and with an initial learning rate of 0.01 which is divided by 10 at epochs 800 and 1100, for a total of 1200 epochs.
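
The stochastic kernel regularisation described in the Research Type and Experiment Setup rows adds noise to the learned Gram matrices during training, with γ setting the noise level and λ a diagonal jitter. Below is a minimal, hypothetical sketch of one way such a step could look in PyTorch, assuming the noise is drawn as a Wishart sample whose mean is the learned Gram matrix; the paper's exact sampling scheme may differ.

```python
import torch

def stochastic_kernel_regularisation(G, gamma, lam=0.1):
    """Return a noisy positive semi-definite surrogate of the Gram matrix G.

    Draws W ~ Wishart(G / gamma, gamma) so that E[W] = G, then adds a
    diagonal jitter lam * I. Hypothetical sketch only; the paper's exact
    sampling scheme may differ.
    """
    n = G.shape[-1]
    eye = torch.eye(n, dtype=G.dtype, device=G.device)
    L = torch.linalg.cholesky(G + 1e-6 * eye)      # stabilised Cholesky factor of G
    k = max(int(gamma), 1)                         # degrees of freedom (rounded to an integer)
    Z = torch.randn(k, n, dtype=G.dtype, device=G.device)
    X = Z @ L.T                                    # each row is a N(0, G) sample
    W = X.T @ X / k                                # Wishart sample with mean G
    return W + lam * eye
```

In such a scheme, with γ = P_i^ℓ/4 and λ = 0.1 as in the Experiment Setup row, the noisy sample would stand in for the learned Gram matrix during training only, with the deterministic matrix used at test time.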
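The dataset rows describe the standard CIFAR-10 train/test split, which torchvision exposes directly; the transform below is an illustrative placeholder, not the paper's preprocessing pipeline.

```python
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()  # illustrative transform only
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)   # 50,000 images
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=to_tensor)   # 10,000 images
```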
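The timing figures in the Hardware Specification row assume TF32 matmuls and convolutions are enabled; in PyTorch these are toggled with backend flags such as:

```python
import torch

# Enable TensorFloat-32 on Ampere GPUs such as the A100
torch.backends.cuda.matmul.allow_tf32 = True  # TF32 matrix multiplications
torch.backends.cudnn.allow_tf32 = True        # TF32 cuDNN convolutions
```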
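The optimiser settings in the Experiment Setup row map naturally onto Adam plus a step learning-rate schedule. The sketch below uses hypothetical `model` and `train_one_epoch` stand-ins and is not the authors' training script.

```python
import torch

# `model` and `train_one_epoch` are hypothetical stand-ins.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, betas=(0.8, 0.9))
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[800, 1100], gamma=0.1  # divide the lr by 10 at epochs 800 and 1100
)

for epoch in range(1200):
    train_one_epoch(model, optimizer)
    scheduler.step()
```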