Implicit Compressibility of Overparametrized Neural Networks Trained with Heavy-Tailed SGD

Authors: Yijun Wan, Melih Barsbey, Abdellatif Zaidi, Umut Simsekli

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments suggest that the proposed approach not only achieves increased compressibility with various models and datasets, but also leads to robust test performance under pruning, even in more realistic architectures that lie beyond our theoretical setting.
Researcher Affiliation | Collaboration | (1) Paris Research Center, Huawei Technologies France; (2) Boğaziçi University, Istanbul, Turkey; (3) Université Gustave Eiffel, France; (4) Inria, CNRS, École Normale Supérieure, PSL Research University, Paris, France.
Pseudocode | No | The paper describes the SGD updates in equation (2), but does not present them as a formal pseudocode block or algorithm.
Open Source Code | Yes | All the experimentation details are given in Appendix E, in addition to the extended versions of the results presented here, and our source code includes the relevant implementation details. Code: https://github.com/mbarsbey/imp_comp
Open Datasets | Yes | For our experiments we use the ECG5000 (Baim et al., 2000), MNIST (LeCun et al., 2010), CIFAR-10, and CIFAR-100 (Krizhevsky, 2009) datasets.
Dataset Splits | Yes | After random shuffling, we use 500 sequences for the training phase and 4500 sequences for the test phase. The MNIST database (LeCun et al., 2010) of black-and-white handwritten digits consists of a training set of 60,000 examples and a test set of 10,000 examples of dimensions 28 x 28. CIFAR-10 and CIFAR-100... making up 10 and 100 classes, respectively. We use the default split of 50,000 training and 10,000 test examples.
Hardware Specification | Yes | Experiments were run on the server of an educational institution, using NVIDIA 1080 and 1080 Ti GPUs.
Software Dependencies | No | The experiments have been implemented in Python, using the deep learning framework PyTorch. While software is named, specific version numbers are not provided, preventing full reproducibility of the software environment.
Experiment Setup | Yes | For SGD, the step size is chosen to be small enough to approximate the continuous dynamics given by the McKean-Vlasov equation in order to stay close to the theory, but also not too small, so that SGD converges in a reasonable amount of time. We fix the batch size to be as large as possible within memory constraints. For all experiments, the training was continued until reaching 95% accuracy on the training set. As for the noise level σ, we try a range of values for each dataset and n, and we choose the largest σ such that the perturbed SGD converges without a dramatic performance cost to the pruned model. Learning rates, batch sizes, and σ values are provided in Table 7.
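
A minimal sketch of how the splits quoted in the Dataset Splits row could be reproduced with PyTorch/torchvision. The ECG5000 splitting helper and the fixed seed are illustrative assumptions, not taken from the authors' released code:

```python
import torch
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# MNIST and CIFAR-10 use torchvision's default train/test splits
# (60,000/10,000 and 50,000/10,000 examples, respectively).
mnist_train = datasets.MNIST("data", train=True, download=True, transform=to_tensor)
mnist_test = datasets.MNIST("data", train=False, download=True, transform=to_tensor)
cifar_train = datasets.CIFAR10("data", train=True, download=True, transform=to_tensor)
cifar_test = datasets.CIFAR10("data", train=False, download=True, transform=to_tensor)

# ECG5000: 500 training / 4500 test sequences after random shuffling.
# The caller supplies the raw (sequences, labels) tensors; this helper is a
# hypothetical stand-in, not part of the paper's code.
def split_ecg5000(sequences, labels, n_train=500, seed=0):
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(len(sequences), generator=g)
    train_idx, test_idx = perm[:n_train], perm[n_train:]
    return (sequences[train_idx], labels[train_idx]), (sequences[test_idx], labels[test_idx])
```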
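The Experiment Setup row refers to a "perturbed SGD" controlled by a noise level σ. Below is a minimal sketch of one way such a perturbation could be added to a plain SGD step in PyTorch; the additive isotropic Gaussian noise with sqrt(step-size) scaling and the default hyperparameter values are assumptions for illustration, and the authors' exact update is their equation (2) and released code:

```python
import torch

def perturbed_sgd_step(params, loss, lr=0.01, sigma=0.05):
    """One SGD step with an additive Gaussian perturbation on the iterates.

    The sqrt(lr) noise scaling mimics a discretized diffusion and is an
    illustrative choice, not necessarily the paper's exact parameterization.
    """
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(-lr * g)                                  # plain SGD step
            p.add_(sigma * lr ** 0.5 * torch.randn_like(p))  # Gaussian perturbation
```

Per the quoted setup, such a step would be iterated until the model reaches 95% training accuracy, with σ chosen as the largest value for which the perturbed dynamics still converge without a dramatic performance cost to the pruned model.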