Non-Vacuous Generalisation Bounds for Shallow Neural Networks

Authors: Felix Biggs, Benjamin Guedj

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our bounds are empirically non-vacuous when the network is trained with vanilla stochastic gradient descent on MNIST, Fashion-MNIST, and binary classification versions of the above. In Section 5 we discuss our experimental setting and give our numerical results, which we discuss along with future work in Section 6.
Researcher Affiliation | Academia | Centre for Artificial Intelligence and Department of Computer Science, University College London and Inria London, UK.
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | We provide all of our results and code to reproduce them along with the figures (including with the option of using the same scaling for the bound and errors, as described in Figure 1) in the supplementary material.
Open Datasets | Yes | Through this, we obtain classification bounds for these deterministic predictors that are non-vacuous on the celebrated baselines MNIST (Le Cun et al., 1998), Fashion-MNIST (Xiao et al., 2017), and binarised versions of the above.
Dataset Splits | Yes | Formally, we divide S = S_prior ∪ S_bnd and use S_prior to learn a prior P_n where n = |S_prior|, then apply the PAC-Bayesian bound using sample S_bnd to a posterior Q learned on the entirety of S. We adopt a 60%-prefix coupling procedure for generating the prior weights U_n, V_n (rather than U_0, V_0, and similarly in the binary case) as in Dziugaite et al. (2021). Note that this also replaces m by m − n and S by S_bnd in the bounds, so we are making a trade-off between optimising the prior and the tightness of the bound (affected by m − n). We evaluated for ten different random seeds, a grid search of learning rates {0.1, 0.03, 0.01} without momentum, and additionally {0.003, 0.001} with momentum (where small learning-rate convergence was considerably faster), and widths {50, 100, 200, 400, 800, 1600} to generate the bounds in Table 1. Using the test set, we also verified that assumption ( ) holds in all cases in which it is used to provide bounds. (See the data-split sketch below the table.)
Hardware Specification | No | No specific hardware details (such as GPU/CPU models, memory, or cloud instance types) were mentioned for running experiments.
Software Dependencies | No | The paper mentions common deep learning libraries such as PyTorch (for the erf function), but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | We trained using SGD with momentum = 0.9 (as suggested by Hendrycks & Gimpel, 2016 and following Biggs & Guedj, 2022) and a batch size of 200, or without momentum and a batch size of 1000 (with this larger batch size stabilising training). We evaluated for ten different random seeds, a grid search of learning rates {0.1, 0.03, 0.01} without momentum, and additionally {0.003, 0.001} with momentum (where small learning-rate convergence was considerably faster), and widths {50, 100, 200, 400, 800, 1600} to generate the bounds in Table 1. The parameter β appears in the non-stochastic shallow network F_{U,V} and thus affects the final predictions made and the training by SGD, and can be related to data normalisation as discussed above. We therefore set it to the fixed value of β = 5 in all our experiments. (See the training-grid sketch below the table.)
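
The 60%-prefix prior/bound split quoted in the Dataset Splits row can be made concrete with a short sketch. This is a minimal illustration under our own assumptions: the helper name prior_bound_split and the use of torchvision's MNIST loader are hypothetical and are not taken from the authors' released code.

# Minimal sketch of the 60%-prefix split: S = S_prior ∪ S_bnd, with the prior
# learned on S_prior and the PAC-Bayesian bound evaluated on the remaining S_bnd.
import torch
from torch.utils.data import Subset
from torchvision import datasets, transforms

def prior_bound_split(dataset, prior_fraction=0.6):
    """Return (S_prior, S_bnd): a prefix of size n = prior_fraction * |S| used to
    learn the prior P_n, and the remaining m - n points used to compute the bound."""
    m = len(dataset)                        # |S| = m
    n = int(prior_fraction * m)             # |S_prior| = n
    s_prior = Subset(dataset, range(0, n))  # prefix used to learn the prior
    s_bnd = Subset(dataset, range(n, m))    # held out for the PAC-Bayesian bound
    return s_prior, s_bnd

# Example with MNIST as S; the posterior Q is still trained on all of S,
# so the split only affects which points enter the bound (sample size m - n).
mnist = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())
s_prior, s_bnd = prior_bound_split(mnist)
print(len(s_prior), len(s_bnd))  # 36000, 24000 for the 60,000-point MNIST training set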
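
The Experiment Setup row amounts to a small grid search over seeds, widths, learning rates, and two SGD regimes. The sketch below enumerates one plausible reading of that grid (momentum runs restricted to the smaller learning rates); ShallowErfNet is a hypothetical placeholder for the paper's shallow network F_{U,V}, whose hidden layer would use the PyTorch erf function mentioned in the Software Dependencies row.

# Minimal sketch of the quoted hyper-parameter grid; the training loop itself is stubbed out.
from itertools import product
import torch

WIDTHS = [50, 100, 200, 400, 800, 1600]
SEEDS = range(10)          # ten random seeds
BETA = 5.0                 # fixed beta in F_{U,V}, related to data normalisation

# Two SGD regimes reported in the quote: with momentum (batch size 200)
# and without momentum (batch size 1000, which stabilised training).
REGIMES = [
    {"momentum": 0.9, "batch_size": 200,  "lrs": [0.003, 0.001]},
    {"momentum": 0.0, "batch_size": 1000, "lrs": [0.1, 0.03, 0.01]},
]

def make_optimizer(model, lr, momentum):
    # Vanilla SGD as described in the quote; no weight decay is mentioned.
    return torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)

for seed, width, regime in product(SEEDS, WIDTHS, REGIMES):
    for lr in regime["lrs"]:
        torch.manual_seed(seed)
        # model = ShallowErfNet(width=width, beta=BETA)  # hypothetical; hidden layer uses torch.erf
        # opt = make_optimizer(model, lr, regime["momentum"])
        # ...train with batch size regime["batch_size"] and evaluate the bound...
        pass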