PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization

Authors: Sanae Lotfi, Marc Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, Andrew G. Wilson

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we develop a compression approach based on quantizing neural network parameters in a linear subspace, profoundly improving on previous results to provide state-of-the-art generalization bounds on a variety of tasks, including transfer learning. We use these tight bounds to better understand the role of model size, equivariance, and the implicit biases of optimization, for generalization in deep learning. Notably, we find large models can be compressed to a much greater extent than previously known, encapsulating Occam's razor. We also argue for data-independent bounds in explaining generalization. (A toy sketch of subspace quantization appears after the table.)
Researcher Affiliation | Academia | Sanae Lotfi, Marc Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, Andrew Gordon Wilson (New York University)
Pseudocode | Yes | Algorithm 1: Compute PAC-Bayes Bound.
Open Source Code | Yes | All code to reproduce results is available here.
Open Datasets | Yes | We present our bounds for the data-independent prior in Table 2. We derive the first non-vacuous bounds on Fashion MNIST, CIFAR-10, and CIFAR-100 without data-dependent priors.
Dataset Splits | Yes | We can consider the Hoeffding bound as the simplest data-dependent bound without any fine-tuning, so that the prior, a single pre-trained checkpoint, is directly evaluated on held-out validation data with no KL-divergence term. (A sketch of this validation-set bound appears after the table.)
Hardware Specification | No | The paper mentions general computing resources like 'GPU Hours' in Figure 2 and 'NYU IT High Performance Computing resources', but does not specify exact GPU/CPU models or other specific hardware configurations.
Software Dependencies | No | The paper implicitly indicates 'Python' and 'PyTorch' through context (e.g., references to the released code) and 'CUDA' for GPU usage, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | We additionally describe hyperparameters, architecture specifications for each experiment, and other experimental details in Appendix E. ... In summary, we use d·H(p) + 2 bits for coding the quantized weights ŵ, 16L bits for the codebook c (represented in half precision), and an additional L⌈log₂ d⌉ bits for representing the probabilities p_k, arriving at l(w) ≤ d·H(p) + L(16 + ⌈log₂ d⌉) + 2. As we show in Appendix B, we optimize over the subspace dimension d, the number of quantization levels L, and any other hyperparameters by including them in the compressed description of our model, contributing only a few extra bits. (A sketch of this bit count appears after the table.)
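The abstract quoted in the Research Type row describes quantizing network parameters in a linear subspace. The following is a minimal, illustrative sketch of that idea, not the authors' code: the random projection, the uniform codebook, and all names (quantize_in_subspace, w0, d, L) are assumptions standing in for the learned subspace and trained codebook used in the paper.

```python
# Minimal sketch (assumed names, not the paper's implementation): quantize
# network parameters in a random d-dimensional linear subspace with L levels.
import numpy as np

def quantize_in_subspace(w, w0, d, L, rng=None):
    """Project w - w0 onto a random d-dim subspace and quantize with L levels."""
    rng = np.random.default_rng(rng)
    D = w.shape[0]
    # Random projection defining the subspace (the paper learns/chooses this).
    P = rng.standard_normal((D, d)) / np.sqrt(D)
    z = P.T @ (w - w0)                      # subspace coordinates, shape (d,)
    # Simple uniform codebook over the coordinate range, as a stand-in for a
    # learned codebook (stored in half precision in the paper).
    c = np.linspace(z.min(), z.max(), L)
    idx = np.argmin(np.abs(z[:, None] - c[None, :]), axis=1)  # nearest level
    z_hat = c[idx]
    w_hat = w0 + P @ z_hat                  # quantized parameters in weight space
    return w_hat, idx, c
```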
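The Experiment Setup row's bit-count formula can be checked with a few lines of arithmetic. The sketch below assumes idx holds the quantization indices produced by a quantizer like the one above; it mirrors the stated accounting l(w) ≤ d·H(p) + L(16 + ⌈log₂ d⌉) + 2 but is not the authors' implementation.

```python
# Minimal sketch of the compressed description length in bits:
#   l(w) <= d*H(p) + L*(16 + ceil(log2 d)) + 2
# where p are the empirical frequencies of the L quantization levels.
import numpy as np

def compressed_length_bits(idx, L):
    """Approximate code length (bits) of the quantized weights plus codebook."""
    d = len(idx)
    counts = np.bincount(idx, minlength=L)
    p = counts / d
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))      # H(p), bits per coordinate
    weight_bits = d * entropy + 2                        # arithmetic-coded indices
    codebook_bits = L * (16 + int(np.ceil(np.log2(d))))  # fp16 values + probabilities
    return weight_bits + codebook_bits
```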
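The Dataset Splits row quotes the paper's use of a Hoeffding bound on held-out validation data with no KL term. Below is a minimal sketch under the standard assumption of i.i.d. validation examples with 0/1 loss; the function name and confidence level are illustrative, not taken from the paper.

```python
# Minimal sketch of a Hoeffding-style validation bound: the empirical error of
# a fixed pretrained checkpoint on held-out data, inflated by a confidence term.
import numpy as np

def hoeffding_error_bound(val_errors, delta=0.05):
    """Upper-bound the true error rate with probability at least 1 - delta."""
    n = len(val_errors)
    empirical_error = np.mean(val_errors)
    slack = np.sqrt(np.log(1.0 / delta) / (2.0 * n))  # Hoeffding confidence width
    return empirical_error + slack

# Example: 500 mistakes on 10,000 held-out points gives a bound of about 0.062.
print(hoeffding_error_bound(np.array([1] * 500 + [0] * 9500)))
```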