Subquadratic Overparameterization for Shallow Neural Networks
Authors: ChaeHwan Song, Ali Ramezani-Kebrya, Thomas Pethick, Armin Eftekhari, Volkan Cevher
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we provide an analytical framework that allows us to adopt standard initialization strategies, possibly avoid lazy training, and train all layers simultaneously in basic shallow neural networks while attaining a desirable subquadratic scaling on the network width. We achieve the desiderata via Polyak-Łojasiewicz condition, smoothness, and standard assumptions on data, and use tools from random matrix theory. ... In Figure 1, we observe that while SGD achieves zero training error for every ω2, as suggested by Theorem 3 applicable in the full batch setting, the generalization ability increases as the ratio ω2/ω1 grows. |
| Researcher Affiliation | Academia | ¹Laboratory for Information and Inference Systems (LIONS), EPFL; ²Umeå University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See https://github.com/LIONS-EPFL/Subquadratic-Overparameterization |
| Open Datasets | Yes | To ensure that perfect generalization is possible, we adopt the teacher-student setup, where, for the teacher network, we train a two-layer fully connected neural network on MNIST [25] |
| Dataset Splits | No | The provided text mentions using MNIST but does not specify train/validation/test splits with percentages or counts. |
| Hardware Specification | No | The provided text does not contain specific hardware details like CPU/GPU models or memory amounts. Appendix G is referenced for setup details, which might include this information, but it is not available in the provided text. |
| Software Dependencies | No | The provided text does not list specific software dependencies with version numbers. |
| Experiment Setup | Yes | The student networks are trained for 300 epochs to ensure convergence. ... We use mean-square loss and a smooth activation function (GeLU [18]) for the student network to match the problem setup as closely as possible. ... Specifically, we fix the product of the weight initialization ω1ω2 and then proceed by varying ω2. (A minimal sketch of this setup follows the table.) |
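
As a reading aid for the quoted experiment setup, the following is a minimal PyTorch-style sketch of a two-layer student network with a GeLU activation, mean-square loss, SGD over 300 epochs, and Gaussian initialization scaled by ω1 and ω2 so that their product can be held fixed while ω2 is varied. The widths, input/output dimensions, learning rate, and the helper names `TwoLayerStudent`/`train` are illustrative assumptions, not taken from the paper or its repository.

```python
import torch
import torch.nn as nn

class TwoLayerStudent(nn.Module):
    """Shallow fully connected student network with a GeLU activation.

    First- and second-layer weights are drawn i.i.d. Gaussian with standard
    deviations omega1 and omega2, so the product omega1 * omega2 can be held
    fixed while the ratio omega2 / omega1 is varied, as in the quoted setup.
    (Architecture details here are assumptions, not the authors' exact code.)
    """
    def __init__(self, d_in, width, d_out, omega1, omega2):
        super().__init__()
        self.W1 = nn.Parameter(omega1 * torch.randn(width, d_in))
        self.W2 = nn.Parameter(omega2 * torch.randn(d_out, width))
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(x @ self.W1.T) @ self.W2.T


def train(student, loader, epochs=300, lr=1e-2):
    """Train all layers jointly with SGD on mean-square loss."""
    opt = torch.optim.SGD(student.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(student(x), y).backward()
            opt.step()
    return student


# Fix the product omega1 * omega2 = c and sweep omega2 (values illustrative).
c = 1.0
students = [
    TwoLayerStudent(d_in=784, width=2048, d_out=10, omega1=c / w2, omega2=w2)
    for w2 in (0.25, 0.5, 1.0, 2.0)
]
```

The sweep at the end mirrors the quoted protocol of fixing ω1ω2 and varying ω2; data loading (e.g., MNIST batches labeled by a teacher network) is omitted, and the actual hyperparameters are in the authors' repository linked above.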