PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization
Authors: Sanae Lotfi, Marc Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, Andrew G. Wilson
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we develop a compression approach based on quantizing neural network parameters in a linear subspace, profoundly improving on previous results to provide state-of-the-art generalization bounds on a variety of tasks, including transfer learning. We use these tight bounds to better understand the role of model size, equivariance, and the implicit biases of optimization, for generalization in deep learning. Notably, we find large models can be compressed to a much greater extent than previously known, encapsulating Occam's razor. We also argue for data-independent bounds in explaining generalization. |
| Researcher Affiliation | Academia | Sanae Lotfi, Marc Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, Andrew Gordon Wilson (New York University) |
| Pseudocode | Yes | Algorithm 1 Compute PAC-Bayes Bound. |
| Open Source Code | Yes | All code to reproduce results is available here. |
| Open Datasets | Yes | We present our bounds for the data-independent prior in Table 2. We derive the first non-vacuous bounds on Fashion MNIST, CIFAR-10, and CIFAR-100 without data-dependent priors. |
| Dataset Splits | Yes | We can consider the Hoeffding bound as the simplest data-dependent bound without any fine-tuning so that the prior, a single pre-trained checkpoint, is directly evaluated on held-out validation data with no KL-divergence term. |
| Hardware Specification | No | The paper mentions general computing resources like 'GPU Hours' in Figure 2 and 'NYU IT High Performance Computing resources', but does not specify exact GPU/CPU models or other specific hardware configurations. |
| Software Dependencies | No | The paper implies the use of Python and PyTorch through context (e.g., the released code) and CUDA for GPU usage, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We additionally describe hyperparameters, architecture specifications for each experiment, and other experimental details in Appendix E. ... In summary, we use d·H(p) + 2 bits for coding the quantized weights ŵ, 16L bits for the codebook c (represented in half precision), and additional L⌈log₂ d⌉ bits for representing the probabilities p_k, arriving at l(w) ≤ d·H(p) + L(16 + ⌈log₂ d⌉) + 2. As we show in Appendix B, we optimize over the subspace dimension d and the number of quantization levels L and any other hyperparameters, by including them in the compressed description of our model, contributing only a few extra bits. |
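
The bit accounting quoted in the Experiment Setup row can be made concrete with a short sketch. The helper below (the name `code_length_bits` and its signature are illustrative, not from the paper's code) tallies l(w) ≤ d·H(p) + L(16 + ⌈log₂ d⌉) + 2 for d quantized subspace coordinates drawn from L quantization levels:

```python
import math
from collections import Counter

def code_length_bits(quantized_coords, num_levels_L, subspace_dim_d):
    """Sketch of the bit accounting quoted above: roughly d*H(p) + 2 bits
    for the entropy-coded quantized weights w-hat, 16 bits per codebook
    entry (half precision), and ceil(log2(d)) bits per level probability p_k."""
    n = len(quantized_coords)                                # should equal d
    counts = Counter(quantized_coords)                       # symbol frequencies
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)          # H(p) in bits/symbol
    weight_bits = subspace_dim_d * entropy + 2               # coded quantized weights
    codebook_bits = 16 * num_levels_L                        # codebook c in fp16
    prob_bits = num_levels_L * math.ceil(math.log2(subspace_dim_d))  # the p_k
    return weight_bits + codebook_bits + prob_bits

# e.g., 1000 subspace coordinates quantized to 7 levels:
# l_w = code_length_bits(coords, num_levels_L=7, subspace_dim_d=1000)
```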
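For context on "Algorithm 1 Compute PAC-Bayes Bound": one standard way to turn such a code length into a generalization bound is the Langford-Seeger/Maurer kl-inverse PAC-Bayes bound with a prefix-code prior, for which a point-mass posterior satisfies KL(Q‖P) ≤ l(w)·ln 2. The sketch below is a generic illustration of that recipe under these assumptions, not a reproduction of the paper's Algorithm 1; all function names are hypothetical.

```python
import math

def binary_kl(q, p, eps=1e-12):
    """Bernoulli KL divergence kl(q || p), in nats."""
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(emp_risk, budget, iters=100):
    """Largest p >= emp_risk with binary_kl(emp_risk, p) <= budget, by bisection."""
    lo, hi = emp_risk, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if binary_kl(emp_risk, mid) > budget:
            hi = mid
        else:
            lo = mid
    return hi

def compression_pac_bayes_bound(emp_risk, code_length_bits, n, delta=0.05):
    """Generic kl-inverse bound (assumed form, not the paper's Algorithm 1):
    with a prefix-code prior P(w) >= 2^(-l(w)) and a point-mass posterior,
    KL(Q || P) <= l(w) * ln 2, so with probability >= 1 - delta,
    kl(emp_risk || true_risk) <= (l(w) ln 2 + ln(2 sqrt(n) / delta)) / n."""
    budget = (code_length_bits * math.log(2)
              + math.log(2 * math.sqrt(n) / delta)) / n
    return kl_inverse(emp_risk, budget)
```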