PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization

Authors: Sanae Lotfi, Marc Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, Andrew G. Wilson

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we develop a compression approach based on quantizing neural network parameters in a linear subspace, profoundly improving on previous results to provide state-of-the-art generalization bounds on a variety of tasks, including transfer learning. We use these tight bounds to better understand the role of model size, equivariance, and the implicit biases of optimization, for generalization in deep learning. Notably, we find large models can be compressed to a much greater extent than previously known, encapsulating Occam's razor. We also argue for data-independent bounds in explaining generalization. (A toy sketch of subspace quantization appears after the table.)
Researcher Affiliation | Academia | Sanae Lotfi, Marc Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, Andrew Gordon Wilson (New York University)
Pseudocode | Yes | Algorithm 1: Compute PAC-Bayes Bound.
Open Source Code | Yes | All code to reproduce results is available here.
Open Datasets | Yes | We present our bounds for the data-independent prior in Table 2. We derive the first non-vacuous bounds on Fashion MNIST, CIFAR-10, and CIFAR-100 without data-dependent priors.
Dataset Splits | Yes | We can consider the Hoeffding bound as the simplest data-dependent bound without any fine-tuning, so that the prior, a single pre-trained checkpoint, is directly evaluated on held-out validation data with no KL-divergence term. (A sketch of this validation-set bound appears after the table.)
Hardware Specification | No | The paper mentions general computing resources like 'GPU Hours' in Figure 2 and 'NYU IT High Performance Computing resources', but does not specify exact GPU/CPU models or other specific hardware configurations.
Software Dependencies | No | The paper implicitly indicates 'Python' and 'PyTorch' through context (e.g., references to the released code) and 'CUDA' for GPU usage, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | We additionally describe hyperparameters, architecture specifications for each experiment, and other experimental details in Appendix E. ... In summary, we use d·H(p) + 2 bits for coding the quantized weights ŵ, 16L bits for the codebook c (represented in half precision), and an additional L⌈log₂ d⌉ bits for representing the probabilities p_k, arriving at l(w) ≤ d·H(p) + L(16 + ⌈log₂ d⌉) + 2. As we show in Appendix B, we optimize over the subspace dimension d, the number of quantization levels L, and any other hyperparameters by including them in the compressed description of our model, contributing only a few extra bits. (A sketch of this bit count appears after the table.)
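The abstract quoted in the Research Type row describes quantizing network parameters in a linear subspace. The following is a minimal, illustrative sketch of that idea, not the authors' code: the random projection, the uniform codebook, and all names (quantize_in_subspace, w0, d, L) are assumptions standing in for the learned subspace and trained codebook used in the paper.

```python
# Minimal sketch (assumed names, not the paper's implementation): quantize
# network parameters in a random d-dimensional linear subspace with L levels.
import numpy as np

def quantize_in_subspace(w, w0, d, L, rng=None):
    """Project w - w0 onto a random d-dim subspace and quantize with L levels."""
    rng = np.random.default_rng(rng)
    D = w.shape[0]
    # Random projection defining the subspace (the paper learns/chooses this).
    P = rng.standard_normal((D, d)) / np.sqrt(D)
    z = P.T @ (w - w0)                      # subspace coordinates, shape (d,)
    # Simple uniform codebook over the coordinate range, as a stand-in for a
    # learned codebook (stored in half precision in the paper).
    c = np.linspace(z.min(), z.max(), L)
    idx = np.argmin(np.abs(z[:, None] - c[None, :]), axis=1)  # nearest level
    z_hat = c[idx]
    w_hat = w0 + P @ z_hat                  # quantized parameters in weight space
    return w_hat, idx, c
```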
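The Experiment Setup row's bit-count formula can be checked with a few lines of arithmetic. The sketch below assumes idx holds the quantization indices produced by a quantizer like the one above; it mirrors the stated accounting l(w) ≤ d·H(p) + L(16 + ⌈log₂ d⌉) + 2 but is not the authors' implementation.

```python
# Minimal sketch of the compressed description length in bits:
#   l(w) <= d*H(p) + L*(16 + ceil(log2 d)) + 2
# where p are the empirical frequencies of the L quantization levels.
import numpy as np

def compressed_length_bits(idx, L):
    """Approximate code length (bits) of the quantized weights plus codebook."""
    d = len(idx)
    counts = np.bincount(idx, minlength=L)
    p = counts / d
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))      # H(p), bits per coordinate
    weight_bits = d * entropy + 2                        # arithmetic-coded indices
    codebook_bits = L * (16 + int(np.ceil(np.log2(d))))  # fp16 values + probabilities
    return weight_bits + codebook_bits
```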
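The Dataset Splits row quotes the paper's use of a Hoeffding bound on held-out validation data with no KL term. Below is a minimal sketch under the standard assumption of i.i.d. validation examples with 0/1 loss; the function name and confidence level are illustrative, not taken from the paper.

```python
# Minimal sketch of a Hoeffding-style validation bound: the empirical error of
# a fixed pretrained checkpoint on held-out data, inflated by a confidence term.
import numpy as np

def hoeffding_error_bound(val_errors, delta=0.05):
    """Upper-bound the true error rate with probability at least 1 - delta."""
    n = len(val_errors)
    empirical_error = np.mean(val_errors)
    slack = np.sqrt(np.log(1.0 / delta) / (2.0 * n))  # Hoeffding confidence width
    return empirical_error + slack

# Example: 500 mistakes on 10,000 held-out points gives a bound of about 0.062.
print(hoeffding_error_bound(np.array([1] * 500 + [0] * 9500)))
```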