Non-Vacuous Generalisation Bounds for Shallow Neural Networks
Authors: Felix Biggs, Benjamin Guedj
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our bounds are empirically non-vacuous when the network is trained with vanilla stochastic gradient descent on MNIST, Fashion-MNIST, and binary classification versions of the above. In Section 5 we discuss our experimental setting and give our numerical results, which we discuss along with future work in Section 6. |
| Researcher Affiliation | Academia | Centre for Artificial Intelligence and Department of Computer Science, University College London and Inria London, UK. |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | We provide all of our results and code to reproduce them along with the figures (including with the option of using the same scaling for the bound and errors, as described in Figure 1) in the supplementary material. |
| Open Datasets | Yes | Through this, we obtain classification bounds for these deterministic predictors that are non-vacuous on the celebrated baselines MNIST (Le Cun et al., 1998), Fashion MNIST (Xiao et al., 2017), and binarised versions of the above. |
| Dataset Splits | Yes | Formally, we divide S = S_prior ∪ S_bnd and use S_prior to learn a prior P_n where n = \|S_prior\|, then apply the PAC-Bayesian bound using sample S_bnd to a posterior Q learned on the entirety of S. We adopt a 60%-prefix coupling procedure for generating the prior weights U_n, V_n (rather than U_0, V_0, and similarly in the binary case) as in Dziugaite et al. (2021). Note that this also replaces m by m − n and S by S_bnd in the bounds, so we are making a trade-off between optimising the prior and the tightness of the bound (affected by m − n). We evaluated for ten different random seeds, a grid search of learning rates {0.1, 0.03, 0.01} without momentum, and additionally {0.003, 0.001} with momentum (where small learning rate convergence was considerably faster), and widths {50, 100, 200, 400, 800, 1600} to generate the bounds in Table 1. Using the test set, we also verified that assumption ( ) holds in all cases in which it is used to provide bounds. (See the data-splitting sketch below the table.) |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models, memory, or cloud instance types) were mentioned for running experiments. |
| Software Dependencies | No | The paper mentions common deep learning libraries like PyTorch for the erf function, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We trained using SGD with momentum = 0.9 (as suggested by Hendrycks & Gimpel, 2016 and following Biggs & Guedj, 2022) and a batch size of 200, or without momentum and a batch size of 1000 (with this larger batch size stabilising training). We evaluated for ten different random seeds, a grid search of learning rates {0.1, 0.03, 0.01} without momentum, and additionally {0.003, 0.001} with momentum (where small learning rate convergence was considerably faster), and widths {50, 100, 200, 400, 800, 1600} to generate the bounds in Table 1. The parameter β appears in the non-stochastic shallow network FU,V and thus affects the final predictions made and the training by SGD, and can be related to data normalisation as discussed above. We therefore set it to the fixed value of β = 5 in all our experiments. |
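To make the prior/bound data split quoted in the Dataset Splits row concrete, here is a minimal PyTorch-style sketch. The helper name `split_prior_bound`, the use of `torch.utils.data.Subset`, and the prefix-based split are illustrative assumptions on our part, not the authors' released code; only the 60% prior fraction and the S_prior/S_bnd roles come from the quoted text.

```python
from torch.utils.data import Subset

def split_prior_bound(dataset, prior_fraction=0.6):
    """Prefix split of S into S_prior (used to learn the prior P_n) and
    S_bnd (used to evaluate the PAC-Bayesian bound).

    The helper name and the prefix-based split are illustrative assumptions;
    the paper couples the prior to a prefix of the data as in
    Dziugaite et al. (2021).
    """
    m = len(dataset)                 # |S| = m
    n = int(prior_fraction * m)      # |S_prior| = n
    prior_set = Subset(dataset, range(n))     # S_prior: first n examples
    bound_set = Subset(dataset, range(n, m))  # S_bnd: remaining m - n examples
    return prior_set, bound_set
```

The posterior Q is then learned on all of S, while the bound itself is evaluated only on the m − n examples of S_bnd, which is the prior-quality versus bound-tightness trade-off described in the quoted text.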
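Similarly, the erf-based shallow network F_{U,V} and the hyperparameter grid from the Software Dependencies and Experiment Setup rows can be sketched as below. The class name `ShallowErfNet`, the bias-free layers, the placement of β in the forward pass, the input/output dimensions, and the pairing of learning rates with batch sizes are assumptions for illustration; only the grid values, momentum settings, batch sizes, ten seeds, and β = 5 are taken from the quoted text.

```python
import itertools
import torch
import torch.nn as nn

BETA = 5.0  # fixed to beta = 5 in all experiments, per the quoted text

class ShallowErfNet(nn.Module):
    """Illustrative shallow erf network standing in for F_{U,V}.

    Where exactly beta enters the forward pass is an assumption for this
    sketch; see the paper for the precise definition of F_{U,V}.
    """
    def __init__(self, in_dim=784, width=100, out_dim=10, beta=BETA):
        super().__init__()
        self.U = nn.Linear(in_dim, width, bias=False)
        self.V = nn.Linear(width, out_dim, bias=False)
        self.beta = beta

    def forward(self, x):
        # erf activation via torch.erf, scaled by beta
        return self.V(torch.erf(self.U(x) / self.beta))

# Grid from the Experiment Setup row: (learning rate, momentum, batch size).
no_momentum   = [(lr, 0.0, 1000) for lr in (0.1, 0.03, 0.01)]
with_momentum = [(lr, 0.9, 200) for lr in (0.003, 0.001)]
widths = [50, 100, 200, 400, 800, 1600]
seeds = range(10)  # ten random seeds

for seed, width, (lr, momentum, batch_size) in itertools.product(
        seeds, widths, no_momentum + with_momentum):
    torch.manual_seed(seed)
    model = ShallowErfNet(width=width)
    optimiser = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    # ... train with vanilla SGD at this batch size, then evaluate the
    # PAC-Bayesian bound on S_bnd ...
```

The loop only enumerates the reported configurations; the training and bound-evaluation steps are elided since they depend on details beyond this table.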