Does the Data Induce Capacity Control in Deep Learning?
Authors: Rubing Yang, Jialin Mao, Pratik Chaudhari
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that the input correlation matrix of typical classification datasets has an eigenspectrum where, after a sharp initial drop, a large number of small eigenvalues are distributed uniformly over an exponentially large range. This structure is mirrored in a network trained on this data: we show that the Hessian and the Fisher Information Matrix (FIM) have eigenvalues that are spread uniformly over exponentially large ranges. (See also Section 6, Empirical Validation.) |
| Researcher Affiliation | Academia | ¹Applied Mathematics and Computational Science, University of Pennsylvania; ²Electrical and Systems Engineering, University of Pennsylvania. Correspondence to: Rubing Yang <rubingy@sas.upenn.edu>. |
| Pseudocode | No | The paper includes a code snippet (Figure S-6) in Appendix E, but it is presented as actual PyTorch code rather than structured pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | All the code for experiments in this paper is provided at https://github.com/grasp-lyrl/sloppy. |
| Open Datasets | Yes | We use the MNIST dataset for experiments on fully-connected networks and LeNet. and We use the CIFAR-10 dataset for experiments using two architectures, an All-CNN network and a wide residual network. |
| Dataset Splits | Yes | We use 55,000 samples from the training set to train the model and to optimize the PAC-Bayes bound. We set aside 5,000 samples for calculating the FIM, which is used in Method 4 of PAC-Bayes bound optimization. Strictly speaking, it is not required to do so because a prior that depends upon the FIM is an expectation-prior (as discussed in Parrado-Hernández et al. (2012)), but we set aside these samples to compare in a systematic manner to existing methods in the literature that use 55,000 samples. Test error of all models is estimated using the validation set of MNIST. We use the CIFAR-10 dataset with 50,000 samples for training and 10,000 samples for estimating the test error. (A minimal code sketch of these splits follows the table.) |
| Hardware Specification | No | The paper does not provide specific details on the hardware used, such as CPU or GPU models, for running the experiments. |
| Software Dependencies | No | The paper mentions software like 'PyTorch' and the 'BackPACK' library but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Training procedure: We train for 30 epochs on MNIST and for 100 epochs on CIFAR-10. The batch size is fixed to 500 for both datasets. For all experiments we train with Adam and reduce the learning rate using a cosine annealing schedule, starting from an initial learning rate of 10⁻³ and ending at a learning rate of 10⁻⁵. (A minimal code sketch of this procedure follows the table.) |
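
The splits quoted in the Dataset Splits row can be expressed as a short PyTorch sketch. The transforms, random seed, data directory, and use of `random_split` below are illustrative assumptions; they are not taken from the authors' repository (https://github.com/grasp-lyrl/sloppy), which should be consulted for the exact split.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# MNIST: split the 60,000-sample training set into 55,000 samples for training
# (and PAC-Bayes bound optimization) and 5,000 samples held out for the FIM.
mnist_full = datasets.MNIST("data", train=True, download=True, transform=to_tensor)
mnist_train, mnist_fim = random_split(
    mnist_full, [55_000, 5_000], generator=torch.Generator().manual_seed(0)
)
# Test error on MNIST is estimated on the standard 10,000-sample validation set.
mnist_val = datasets.MNIST("data", train=False, download=True, transform=to_tensor)

# CIFAR-10: standard split, 50,000 samples for training, 10,000 for test error.
cifar_train = datasets.CIFAR10("data", train=True, download=True, transform=to_tensor)
cifar_test = datasets.CIFAR10("data", train=False, download=True, transform=to_tensor)
```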
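
The training procedure quoted in the Experiment Setup row (Adam, cosine annealing from 10⁻³ to 10⁻⁵, batch size 500, 30 epochs on MNIST and 100 on CIFAR-10) can be sketched as follows. The paper states only those hyperparameters; the loss function, device handling, and per-epoch scheduler stepping here are assumptions, and `model` is a placeholder rather than one of the authors' architectures.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, epochs=30, batch_size=500, device="cpu"):
    """Adam with cosine annealing from 1e-3 down to 1e-5, batch size 500."""
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Anneal the learning rate once per epoch, ending at the reported 1e-5.
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs, eta_min=1e-5)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
        sched.step()
    return model

# Usage with the splits above (hypothetical model objects):
#   train(mnist_model, mnist_train, epochs=30)
#   train(cifar_model, cifar_train, epochs=100)
```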