Generalization Bounds using Lower Tail Exponents in Stochastic Optimizers

Authors: Liam Hodgkinson, Umut Simsekli, Rajiv Khanna, Michael Mahoney

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We support our theory with empirical results from a variety of neural networks, showing correlations between generalization error and lower tail exponents. [...] Finally, we supported our theory with empirical results on several simple neural network models, finding correlations between the lower tail exponent, generalization gap at the end of training, step-size/batch-size ratio, and upper tail exponents.
Researcher Affiliation | Academia | 1 ICSI and Department of Statistics, University of California, Berkeley, USA; 2 INRIA, Département d'Informatique de l'École Normale Supérieure, PSL Research University, Paris, France; 3 Department of Computer Science, Purdue University, Indiana, USA.
Pseudocode | Yes | Algorithm 1: Fernique-Talagrand functional
Open Source Code | No | The paper states 'Our code is implemented in PyTorch and executed on 5 GeForce GTX 1080 GPUs.' but does not provide any explicit statement or link for public access to the source code for the methodology described in the paper.
Open Datasets | Yes | For our experiments, we consider three architectures and two standard image classification datasets. In particular, we consider (i) a fully connected model with 5 layers (FCN5), (ii) a fully connected model with 7 layers (FCN7), and (iii) a convolutional model with 9 layers (CNN9); and two datasets (i) MNIST and (ii) CIFAR10. (A data-loading and model sketch follows the table.)
Dataset Splits | No | The paper mentions 'For measuring training and test accuracies, we use standard training-test splits.' but gives no validation split or explicit split percentages.
Hardware Specification | Yes | Our code is implemented in PyTorch and executed on 5 GeForce GTX 1080 GPUs.
Software Dependencies | No | The paper mentions 'Our code is implemented in PyTorch' and 'by using the powerlaw toolbox (Clauset et al., 2009)', but does not provide version numbers for PyTorch, the powerlaw toolbox, or any other software dependencies. (A tail-exponent fitting sketch follows the table.)
Experiment Setup | Yes | All models use the ReLU activation function and all are trained with constant stepsize SGD, without weight-decay or momentum. [...] For each architecture, we trained the networks with different step-sizes and batch-sizes, where we varied the step-size in the range [0.002, 0.35] and the batch-size in the set {50, 100}. We trained all models until training accuracy reaches exactly 100%. [...] Models are trained from the same (random) initialization for 30 epochs (before reaching 100% training accuracy) using SGD with constant step size η ∈ {0.01, 0.005, 0.001}, batch size b ∈ {20, 50, 100}, weight decay parameter λ ∈ [10⁻⁴, 5 × 10⁻⁴, 10⁻³], and added zero-mean Gaussian noise to the input data with variance σ² for σ ∈ {0, 0.05, 0.1}. (A training-loop sketch follows the table.)
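
The datasets and fully connected architectures listed under Open Datasets can be set up with standard PyTorch/torchvision loaders. The following is a minimal sketch, not the authors' code: the exact FCN5/FCN7/CNN9 layer widths are not given in the quoted excerpts, so the hidden width of 512 and the helper names (make_loaders, fcn) are illustrative placeholders.

    # Minimal sketch (not the authors' code): standard torchvision loaders for
    # MNIST/CIFAR10 and a plain fully connected ReLU network. Layer widths are
    # illustrative; only the depth (e.g. 5 or 7 linear layers) follows the paper.
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    def make_loaders(name="MNIST", batch_size=100, root="./data"):
        """Standard training-test splits, as shipped with torchvision."""
        tfm = transforms.ToTensor()
        ds = datasets.MNIST if name == "MNIST" else datasets.CIFAR10
        train = ds(root, train=True, download=True, transform=tfm)
        test = ds(root, train=False, download=True, transform=tfm)
        return (DataLoader(train, batch_size=batch_size, shuffle=True),
                DataLoader(test, batch_size=batch_size))

    def fcn(depth, in_dim=28 * 28, width=512, num_classes=10):
        """Fully connected ReLU network with `depth` linear layers (FCN5: depth=5, FCN7: depth=7)."""
        layers, d = [nn.Flatten()], in_dim
        for _ in range(depth - 1):
            layers += [nn.Linear(d, width), nn.ReLU()]
            d = width
        layers.append(nn.Linear(d, num_classes))
        return nn.Sequential(*layers)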
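
For the unversioned powerlaw dependency noted under Software Dependencies, the estimator of Clauset et al. (2009) is available in the Python powerlaw package; a reproduction would pin its version. A minimal sketch follows, with the assumption (not stated in the quoted excerpts) that the tail exponent is fit to a 1-D array of positive heavy-tailed samples; the data here is synthetic.

    # Minimal sketch, assuming the tail exponent is fit to a 1-D array of positive
    # samples; the placeholder data below is synthetic heavy-tailed noise.
    import numpy as np
    import powerlaw  # implements the Clauset et al. (2009) estimator

    samples = np.abs(np.random.standard_cauchy(10_000))  # synthetic heavy-tailed data
    fit = powerlaw.Fit(samples)  # selects xmin by minimizing the KS distance
    print("tail exponent alpha:", fit.power_law.alpha)
    print("estimated xmin:", fit.power_law.xmin)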
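
The protocol quoted under Experiment Setup (constant step-size SGD, no momentum or weight decay, training until the model reaches 100% training accuracy) corresponds to a loop of roughly the following shape. This is a minimal sketch assuming the helpers from the first block above; the on-the-fly accuracy tally is an illustrative stopping criterion, not necessarily how the authors measured it.

    # Minimal sketch of the quoted protocol: constant step-size SGD without momentum
    # or weight decay, run until every training example is classified correctly.
    # Accuracy is tallied on the fly during the epoch as a simple proxy for a full
    # evaluation pass over the training set.
    import torch
    import torch.nn.functional as F

    def train_to_interpolation(model, train_loader, step_size=0.01, device="cpu"):
        opt = torch.optim.SGD(model.parameters(), lr=step_size)  # momentum=0, weight_decay=0 by default
        model.to(device).train()
        while True:
            correct = total = 0
            for x, y in train_loader:
                x, y = x.to(device), y.to(device)
                opt.zero_grad()
                logits = model(x)
                F.cross_entropy(logits, y).backward()
                opt.step()
                correct += (logits.argmax(dim=1) == y).sum().item()
                total += y.numel()
            if correct == total:  # 100% training accuracy reached
                return model

    # Example usage with one point from the reported step-size/batch-size grid:
    # train_loader, _ = make_loaders("MNIST", batch_size=100)
    # model = train_to_interpolation(fcn(depth=5), train_loader, step_size=0.01)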