Norm-based Generalization Bounds for Sparse Neural Networks
Authors: Tomer Galanti, Mengjia Xu, Liane Galanti, Tomaso Poggio
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, they offer relatively tight estimations of generalization for various simple classification problems. Collectively, these findings suggest that the sparsity of the underlying target function and the model's architecture play a crucial role in the success of deep learning. ... In this section, we empirically evaluate the generalization bounds derived in Section 3. |
| Researcher Affiliation | Academia | Tomer Galanti, Center for Brains, Minds and Machines, Massachusetts Institute of Technology (galanti@mit.edu); Mengjia Xu, Department of Data Science, New Jersey Institute of Technology (mx6@njit.edu); Liane Galanti, School of Computer Science, Tel Aviv University (lianegalanti@mail.tau.ac.il); Tomaso Poggio, Center for Brains, Minds and Machines, Massachusetts Institute of Technology (tp@csail.mit.edu) |
| Pseudocode | No | The paper describes mathematical proofs and theoretical derivations but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not contain any statements about making source code publicly available or links to a code repository. |
| Open Datasets | Yes | We conduct multiple experiments to evaluate our bounds for overparameterized convolutional neural networks trained on simple classification problems. These experiments show that in these settings, our bound is significantly tighter than many bounds in the literature [14, 33, 32, 47]. As a result, this research provides a better understanding of the pivotal influence of the structure of the network's architecture [30, 34, 2] on its test performance. ... we train a CONV-L-H network on MNIST with a different number of channels H. (A hypothetical sketch of this architecture follows the table.) |
| Dataset Splits | No | The paper mentions training on MNIST and monitoring train and test errors, but does not specify a separate validation dataset split (e.g., specific percentages or counts for training, validation, and testing). |
| Hardware Specification | Yes | Each of the runs was done using a single GPU for at most 20 hours on a computing cluster with several available GPU types (e.g., GeForce RTX 2080, GeForce RTX 2080 Ti, Quadro RTX 6000, Tesla V100, GeForce RTX A6000, A100, and GeForce GTX 1080 Ti). |
| Software Dependencies | No | The paper mentions using SGD and weight normalization but does not list specific software libraries or their version numbers (e.g., PyTorch 1.x, TensorFlow 2.x). |
| Experiment Setup | Yes | Each model was trained using SGD for MSE-loss minimization between the logits of the network and the one-hot encodings of the training labels. We applied weight normalization [52] to all trainable layers, except for the last one, which is left un-normalized. In order to regularize the weight parameters, we used weight decay for each one of the layers of the network with the same regularization parameter λ > 0. To train each model, we used an initial learning rate of µ = 0.01 that is decayed by a factor of 0.1 at epochs 60, 100, 300, batch size 32, momentum of 0.9, and λ = 3e-3 by default. (A minimal training sketch follows the table.) |
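
The CONV-L-H architecture is only named in the excerpts above; the paper's kernel sizes, depths, and pooling are not quoted here. The sketch below is therefore a hypothetical reconstruction, not the authors' model: it assumes L 3×3 convolutional layers with H channels each and global average pooling, while the weight-normalization scheme (all trainable layers normalized except the last) is taken from the quoted Experiment Setup row.

```python
import torch.nn as nn
from torch.nn.utils import weight_norm

class ConvLH(nn.Module):
    """Hypothetical CONV-L-H: L weight-normalized conv layers with H
    channels each, plus an un-normalized linear readout (the paper
    normalizes all trainable layers except the last one)."""

    def __init__(self, L=5, H=32, num_classes=10):
        super().__init__()
        layers, in_ch = [], 1  # MNIST inputs are 1 x 28 x 28
        for _ in range(L):
            # Kernel size and padding are assumptions, not from the paper.
            layers += [weight_norm(nn.Conv2d(in_ch, H, kernel_size=3, padding=1)),
                       nn.ReLU()]
            in_ch = H
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)     # assumed global average pooling
        self.head = nn.Linear(H, num_classes)   # last layer left un-normalized

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.head(x)
```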
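
The training recipe in the Experiment Setup row maps directly onto standard PyTorch components. This is a minimal sketch under that reading, not the authors' code: the hyperparameters (SGD, MSE between logits and one-hot labels, lr µ = 0.01 decayed by 0.1 at epochs 60/100/300, batch size 32, momentum 0.9, λ = 3e-3) are quoted from the paper; the data pipeline and device handling are filled in conventionally.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train(model, epochs=300, lam=3e-3, device="cpu"):
    data = datasets.MNIST("data", train=True, download=True,
                          transform=transforms.ToTensor())
    loader = DataLoader(data, batch_size=32, shuffle=True)
    # Same weight-decay coefficient lambda applied to every layer, as quoted.
    opt = torch.optim.SGD(model.parameters(), lr=0.01,
                          momentum=0.9, weight_decay=lam)
    # Learning rate decayed by a factor of 0.1 at epochs 60, 100, 300.
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[60, 100, 300],
                                                 gamma=0.1)
    model.to(device).train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            # MSE between the logits and one-hot encodings of the labels.
            loss = F.mse_loss(model(x), F.one_hot(y, 10).float())
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()

model = ConvLH(L=5, H=32)
train(model, device="cuda" if torch.cuda.is_available() else "cpu")
```

Note that routing weight decay through the optimizer's `weight_decay` argument applies the same λ to every parameter group, which matches the quoted description of using one regularization parameter for all layers.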