Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach

Authors: Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, Peter Orbanz

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results show that an increase in overfitting increases the number of bits required to describe a trained network. ... In particular, we provide the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem. ... We give two examples applying our generalization bounds to the models output by modern neural net compression schemes. (A generic PAC-Bayes bound behind this approach is sketched below the table.)
Researcher Affiliation | Academia | Wenda Zhou (Columbia University, New York, NY; wz2335@columbia.edu); Victor Veitch (Columbia University, New York, NY; victorveitch@gmail.com); Morgane Austern (Columbia University, New York, NY; ma3293@columbia.edu); Ryan P. Adams (Princeton University, Princeton, NJ; rpa@princeton.edu); Peter Orbanz (Columbia University, New York, NY; porbanz@stat.columbia.edu)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code to reproduce the experiments is available in the supplementary material.
Open Datasets | Yes | Our first experiment is performed on the MNIST dataset, a dataset of 60k grayscale images of handwritten digits. ... The ImageNet dataset (Russakovsky et al., 2015) is a dataset of about 1.2 million natural images. ... We consider the CIFAR-10 dataset, a collection of 40000 images categorized into 10 classes.
Dataset Splits | No | The paper reports 'validation accuracy' and 'training accuracy' but does not specify the percentages, sample counts, or methodology of the training/validation/test splits used in its experiments. For example, for MobileNet it states 'The pruned model achieves a validation accuracy of 60%' and 'the top-1 training accuracy is reduced to 65%', but not how these splits were defined.
Hardware Specification | No | The paper only generally acknowledges 'computing resources from Columbia University's Shared Research Computing Facility project' and does not specify any particular CPU models, GPU models, memory, or other hardware used to run the experiments.
Software Dependencies | No | The paper mentions optimizers (e.g., the 'ADAM optimizer', 'stochastic gradient descent') and techniques ('Dynamic Network Surgery', 'k-means' quantization) but does not specify any software names with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA 10.x) that would be necessary for reproducibility. (A sketch of k-means weight quantization appears below the table.)
Experiment Setup | Yes | For LeNet-5: 'The batch size is set to 1024, and the learning rate is decayed using an inverse time decay starting at 0.01 and decaying every 125 steps. We apply a small ℓ2 penalty of 0.005. We train a total of 20000 steps.' For MobileNet: 'stochastic gradient descent with momentum and decay the learning rate with an inverse time decay schedule, starting at 10⁻³ and decaying by 0.05 every 2000 steps. We use a minibatch size of 64 and train for a total of 300000 steps, but tune the pruning schedule so that the target sparsity is reached after 200000 steps.' (An inverse-time-decay schedule is sketched below the table.)
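
For context on the Research Type row: the paper's guarantees build on PAC-Bayesian analysis. A standard generic form of such a bound (McAllester's bound, not the paper's exact theorem) states that for a prior \pi and posterior \rho over hypotheses, with probability at least 1 - \delta over an i.i.d. sample of size m,

    \mathbb{E}_{h \sim \rho}\left[L(h)\right] \;\le\; \mathbb{E}_{h \sim \rho}\left[\hat{L}(h)\right] + \sqrt{\frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{2\sqrt{m}}{\delta}}{2m}}

Compression enters because a trained network describable in k bits admits a posterior with KL divergence of roughly k ln 2 against a suitable prior over codes, so the fewer bits needed to describe the network, the tighter the bound. This is the sense in which more overfitting, which increases the bits required, weakens the guarantee.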
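
The Software Dependencies row names k-means quantization without versions or code. As a minimal illustrative sketch (not the authors' implementation; the function name and parameters here are hypothetical), codebook quantization of a weight vector can be done with NumPy alone:

    import numpy as np

    def kmeans_quantize(weights, n_clusters=32, n_iters=20, seed=0):
        """Quantize a 1-D weight array to a small codebook via Lloyd's k-means.

        Storing the per-weight assignments at log2(n_clusters) bits each,
        plus the codebook, shortens the description length of the network,
        which is what drives a compression-based generalization bound.
        """
        rng = np.random.default_rng(seed)
        # Initialize centroids by sampling distinct weights.
        codebook = rng.choice(weights, size=n_clusters, replace=False)
        for _ in range(n_iters):
            # Assign each weight to its nearest centroid.
            assignments = np.argmin(np.abs(weights[:, None] - codebook[None, :]), axis=1)
            # Recompute each centroid as the mean of its assigned weights.
            for k in range(n_clusters):
                members = weights[assignments == k]
                if members.size > 0:
                    codebook[k] = members.mean()
        return codebook, assignments

    # Example: quantize 10,000 random weights to a 32-entry codebook.
    w = np.random.randn(10_000).astype(np.float32)
    codebook, idx = kmeans_quantize(w, n_clusters=32)
    w_quantized = codebook[idx]  # reconstructed weights at 5 bits per weight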
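
The schedules quoted in the Experiment Setup row follow a standard inverse time decay. A minimal sketch under that assumption (parameter names are ours; the paper gives only the quoted hyperparameters, not code):

    def inverse_time_decay(step, initial_lr, decay_steps, decay_rate, staircase=False):
        """Inverse time decay: lr = initial_lr / (1 + decay_rate * step / decay_steps).

        With staircase=True the step count is floored to whole decay periods,
        matching the 'decaying every N steps' phrasing quoted above.
        """
        t = step // decay_steps if staircase else step / decay_steps
        return initial_lr / (1.0 + decay_rate * t)

    # MobileNet setting from the quote: start at 1e-3, decay by 0.05 every 2000 steps.
    for step in (0, 2000, 200_000, 300_000):
        print(step, inverse_time_decay(step, 1e-3, 2000, 0.05, staircase=True))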