Norm-based Generalization Bounds for Sparse Neural Networks

Authors: Tomer Galanti, Mengjia Xu, Liane Galanti, Tomaso Poggio

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, they offer relatively tight estimations of generalization for various simple classification problems. Collectively, these findings suggest that the sparsity of the underlying target function and the model's architecture plays a crucial role in the success of deep learning. ... (Section 4, Experiments) In this section, we empirically evaluate the generalization bounds derived in Section 3.
Researcher Affiliation | Academia | Tomer Galanti, Center for Brains, Minds and Machines, Massachusetts Institute of Technology (galanti@mit.edu); Mengjia Xu, Department of Data Science, New Jersey Institute of Technology (mx6@njit.edu); Liane Galanti, School of Computer Science, Tel Aviv University (lianegalanti@mail.tau.ac.il); Tomaso Poggio, Center for Brains, Minds and Machines, Massachusetts Institute of Technology (tp@csail.mit.edu)
Pseudocode | No | The paper describes mathematical proofs and theoretical derivations but does not include any explicit pseudocode blocks or algorithms.
Open Source Code | No | The paper does not contain any statements about making source code publicly available or links to a code repository.
Open Datasets | Yes | We conduct multiple experiments to evaluate our bounds for overparameterized convolutional neural networks trained on simple classification problems. These experiments show that in these settings, our bound is significantly tighter than many bounds in the literature [14, 33, 32, 47]. As a result, this research provides a better understanding of the pivotal influence of the structure of the network's architecture [30, 34, 2] on its test performance. ... we train a CONV-L-H network on MNIST with a different number of channels H.
Dataset Splits | No | The paper mentions training on MNIST and monitoring train and test errors, but does not specify a separate validation dataset split (e.g., specific percentages or counts for training, validation, and testing).
Hardware Specification | Yes | Each of the runs was done using a single GPU for at most 20 hours on a computing cluster with several available GPU types (e.g., GeForce RTX 2080, GeForce RTX 2080 Ti, Quadro RTX 6000, Tesla V100, GeForce RTX A6000, A100, and GeForce GTX 1080 Ti).
Software Dependencies | No | The paper mentions using SGD and weight normalization but does not list specific software libraries or their version numbers (e.g., PyTorch 1.x, TensorFlow 2.x).
Experiment Setup | Yes | Each model was trained using SGD for MSE-loss minimization between the logits of the network and the one-hot encodings of the training labels. We applied weight normalization [52] to all trainable layers, except for the last one, which is left un-normalized. In order to regularize the weight parameters, we used weight decay for each one of the layers of the network with the same regularization parameter λ > 0. To train each model, we used an initial learning rate of µ = 0.01 that is decayed by a factor of 0.1 at epochs 60, 100, 300, batch size 32, momentum of 0.9, and λ = 3e-3 by default.
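
The experiment-setup row above is concrete enough to sketch in code. Below is a minimal sketch of the described recipe: MSE loss between the network's logits and one-hot labels, weight normalization on every trainable layer except the last, SGD with momentum 0.9, per-layer weight decay λ = 3e-3, learning rate 0.01 decayed by 0.1 at epochs 60, 100, and 300, and batch size 32. The use of PyTorch/torchvision, the CONV-L-H layer layout, and the depth/width/epoch values shown are assumptions for illustration; the paper does not state its software stack or the exact architecture.

```python
# Hedged reconstruction of the training setup quoted above.
# PyTorch/torchvision and the CONV-L-H layer layout are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import weight_norm
from torchvision import datasets, transforms

L, H, NUM_CLASSES = 4, 64, 10  # illustrative depth L and channel count H


class ConvLH(nn.Module):
    """CONV-L-H style network: L conv layers with H channels, then a linear head.

    Weight normalization is applied to all trainable layers except the last,
    which is left un-normalized, as stated in the experiment-setup row."""

    def __init__(self, depth=L, width=H, num_classes=NUM_CLASSES):
        super().__init__()
        layers, in_ch = [], 1  # MNIST images are single-channel
        for _ in range(depth):
            layers += [weight_norm(nn.Conv2d(in_ch, width, 3, stride=2, padding=1)),
                       nn.ReLU()]
            in_ch = width
        self.features = nn.Sequential(*layers)
        self.head = nn.Linear(width, num_classes)  # last layer: no weight norm

    def forward(self, x):
        x = self.features(x)
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)
        return self.head(x)  # logits, trained against one-hot targets with MSE


def train(epochs=350, lr=0.01, momentum=0.9, weight_decay=3e-3, batch_size=32):
    # Total epoch count is not given in the excerpt; 350 is a placeholder.
    data = datasets.MNIST("./data", train=True, download=True,
                          transform=transforms.ToTensor())
    loader = torch.utils.data.DataLoader(data, batch_size=batch_size, shuffle=True)

    model = ConvLH()
    # Weight decay with the same regularization parameter λ for every layer.
    opt = torch.optim.SGD(model.parameters(), lr=lr,
                          momentum=momentum, weight_decay=weight_decay)
    # Initial learning rate 0.01, decayed by a factor of 0.1 at epochs 60, 100, 300.
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[60, 100, 300], gamma=0.1)

    for _ in range(epochs):
        for x, y in loader:
            target = F.one_hot(y, NUM_CLASSES).float()
            loss = F.mse_loss(model(x), target)  # MSE between logits and one-hot labels
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
    return model
```

The stride-2 convolutions and global average pooling are one plausible way to reduce MNIST's 28x28 inputs to a single feature vector; the paper's actual CONV-L-H blocks may differ.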