Does the Data Induce Capacity Control in Deep Learning?

Authors: Rubing Yang, Jialin Mao, Pratik Chaudhari

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that the input correlation matrix of typical classification datasets has an eigenspectrum where, after a sharp initial drop, a large number of small eigenvalues are distributed uniformly over an exponentially large range. This structure is mirrored in a network trained on this data: we show that the Hessian and the Fisher Information Matrix (FIM) have eigenvalues that are spread uniformly over exponentially large ranges. Also supported by Section 6, Empirical Validation. (See the eigenspectrum sketch after this table.)
Researcher Affiliation | Academia | Applied Mathematics and Computational Science, University of Pennsylvania; Electrical and Systems Engineering, University of Pennsylvania. Correspondence to: Rubing Yang <rubingy@sas.upenn.edu>.
Pseudocode | No | The paper includes a code snippet (Figure S-6) in Appendix E, but it is presented as actual PyTorch code rather than structured pseudocode or a clearly labeled algorithm block.
Open Source Code | Yes | All the code for experiments in this paper is provided at https://github.com/grasp-lyrl/sloppy.
Open Datasets | Yes | We use the MNIST dataset for experiments on fully-connected networks and LeNet, and the CIFAR-10 dataset for experiments using two architectures, an All-CNN network and a wide residual network.
Dataset Splits | Yes | We use 55,000 samples from the training set to train the model and to optimize the PAC-Bayes bound. We set aside 5,000 samples for calculating the FIM, which is used in Method 4 of PAC-Bayes bound optimization. Strictly speaking, it is not required to do so because a prior that depends upon the FIM is an expectation-prior (as discussed in Parrado-Hernández et al. (2012)), but we set aside these samples to compare in a systematic manner to existing methods in the literature that use 55,000 samples. Test error of all models is estimated using the validation set of MNIST. For CIFAR-10, 50,000 samples are used for training and 10,000 samples for estimating the test error. (See the data-split sketch after this table.)
Hardware Specification | No | The paper does not provide specific details on the hardware used, such as CPU or GPU models, for running the experiments.
Software Dependencies | No | The paper mentions software such as PyTorch and the BackPACK library but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | Training procedure: We train for 30 epochs on MNIST and for 100 epochs on CIFAR-10. The batch size is fixed to 500 for both datasets. For all experiments we train with Adam and reduce the learning rate using a cosine annealing schedule, starting from an initial learning rate of 10^-3 and ending at a learning rate of 10^-5. (See the training-recipe sketch after this table.)
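
To make the central empirical claim concrete, here is a minimal sketch (not taken from the paper's repository) of how one could inspect the eigenspectrum of the input correlation matrix of MNIST using torchvision and NumPy. The variable names and the printed diagnostics are illustrative assumptions.

```python
# Minimal sketch: eigenspectrum of the input correlation matrix of MNIST.
# Assumes torchvision is available; variable names are illustrative.
import numpy as np
import torch
from torchvision import datasets, transforms

mnist = datasets.MNIST(root="./data", train=True, download=True,
                       transform=transforms.ToTensor())
# Stack images into an (N, d) matrix of flattened inputs.
X = torch.stack([img.view(-1) for img, _ in mnist]).numpy()  # N x 784

# Uncentered second-moment (input correlation) matrix, d x d.
C = (X.T @ X) / X.shape[0]

eigvals = np.linalg.eigvalsh(C)[::-1]      # descending order
eigvals = np.clip(eigvals, 1e-20, None)    # guard against tiny negative values

# The claim: after a sharp initial drop, the remaining eigenvalues are spread
# roughly uniformly on a logarithmic scale over many decades.
print("largest eigenvalue:", eigvals[0])
print("smallest eigenvalue:", eigvals[-1])
print("decades spanned:", np.log10(eigvals[0] / eigvals[-1]))
```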
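
A sketch of the reported data splits (55,000 MNIST samples for training and PAC-Bayes bound optimization, 5,000 held out for FIM estimation, CIFAR-10's standard 50,000/10,000 split), assuming standard torchvision loaders and torch.utils.data.random_split. The random seed and DataLoader options are assumptions, not details from the paper.

```python
# Sketch of the reported splits; seed and DataLoader settings are illustrative.
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

transform = transforms.ToTensor()
train_full = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_set = datasets.MNIST("./data", train=False, download=True, transform=transform)

# 55,000 samples for training / bound optimization, 5,000 for FIM estimation.
train_set, fim_set = random_split(
    train_full, [55_000, 5_000], generator=torch.Generator().manual_seed(0))

train_loader = DataLoader(train_set, batch_size=500, shuffle=True)
fim_loader = DataLoader(fim_set, batch_size=500, shuffle=False)
test_loader = DataLoader(test_set, batch_size=500, shuffle=False)  # MNIST test set

# CIFAR-10 uses the standard split: 50,000 for training, 10,000 for test error.
cifar_train = datasets.CIFAR10("./data", train=True, download=True, transform=transform)
cifar_test = datasets.CIFAR10("./data", train=False, download=True, transform=transform)
```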
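
A sketch of the quoted training recipe: Adam, batch size 500, cosine-annealed learning rate from 10^-3 down to 10^-5, 30 epochs on MNIST (100 on CIFAR-10). The placeholder fully-connected model and device handling are assumptions, and train_loader refers to the loader defined in the split sketch above.

```python
# Sketch of the reported training recipe; everything not quoted in the paper
# (architecture, loss reduction, device handling) is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                      nn.Linear(256, 10)).to(device)

epochs = 30                                  # 100 for CIFAR-10
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs, eta_min=1e-5)

for epoch in range(epochs):
    for x, y in train_loader:                # batch size 500, as reported
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()                         # anneal the learning rate once per epoch
```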