Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
Authors: Zeyuan Allen-Zhu, Yuanzhi Li, Yingyu Liang
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network. On the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network, and connect it to the SGD theory of escaping saddle points. (Figure 1: Performance comparison... See Appendix 7 for our experiment setup, how we choose such target function, and more experiments.) |
| Researcher Affiliation | Collaboration | Zeyuan Allen-Zhu (Microsoft Research AI, zeyuan@csail.mit.edu); Yuanzhi Li (Carnegie Mellon University, yuanzhil@andrew.cmu.edu); Yingyu Liang (University of Wisconsin-Madison, yliang@cs.wisc.edu) |
| Pseudocode | Yes | Algorithm 1 SGD for three-layer networks (second variant (4.2)) |
| Open Source Code | No | The paper states: "Full version and future updates can be found on https://arxiv.org/abs/1811.04918." This is a link to the arXiv preprint of the paper itself, not to source code for the methodology. There is no explicit statement about releasing code or a link to a code repository. |
| Open Datasets | No | The paper describes using "synthetic data where feature vectors `x ∈ R^4` are generated as normalized random Gaussian, and label is generated by target function `F(x) = (sin(3x_1) + sin(3x_2) + sin(3x_3) − 2)^2 · cos(7x_4)`." This is custom-generated data and not a publicly available dataset with a link, DOI, or formal citation. (A data-generation sketch is given below the table.) |
| Dataset Splits | No | The paper states, "We use N training samples." It does not provide specific percentages or counts for training, validation, or test splits, nor does it refer to standard predefined splits for a named dataset. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper does not list any specific software dependencies with version numbers (e.g., programming languages, libraries, or frameworks with their versions). |
| Experiment Setup | Yes | We use N training samples, and SGD with mini-batch size 50 and best-tuned learning rates and weight decay parameters. (Section 3.1 also specifies `ε_a = ε/Θ̃(1)`, `η = Θ̃(1/(ε·k·m))`, and `T = Θ̃((C_s(φ, 1))^2 · k^3 · p^2)`.) A hedged training-loop sketch follows the table. |
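
The Open Datasets row above quotes the paper's synthetic data recipe. Below is a minimal Python/NumPy sketch of that recipe; it is not the authors' code. In particular, reading "normalized random Gaussian" as projection onto the unit sphere, the function name `generate_synthetic_data`, and the RNG handling are assumptions.

```python
import numpy as np

def generate_synthetic_data(n_samples, rng=None):
    """Sketch of the paper's synthetic data: features x in R^4 drawn as
    normalized Gaussians, labels from
    F(x) = (sin(3x1) + sin(3x2) + sin(3x3) - 2)^2 * cos(7x4)."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n_samples, 4))
    # Assumption: "normalized" means each feature vector is scaled to unit norm.
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = (np.sin(3 * X[:, 0]) + np.sin(3 * X[:, 1])
         + np.sin(3 * X[:, 2]) - 2) ** 2 * np.cos(7 * X[:, 3])
    return X, y
```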
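The Experiment Setup row states only the mini-batch size (50) and that learning rates and weight decay were best-tuned. The sketch below shows how such a run could look in PyTorch; the two-layer ReLU architecture, squared loss, network width, epoch count, and the concrete `lr`/`weight_decay` values are placeholders, not values reported in the paper.

```python
import torch

def train_overparameterized_net(X, y, width=1000, epochs=200,
                                lr=1e-3, weight_decay=1e-4):
    """Hedged sketch: SGD with mini-batch size 50 on an overparameterized
    two-layer ReLU network; hyperparameters here are illustrative only."""
    X_t = torch.as_tensor(X, dtype=torch.float32)
    y_t = torch.as_tensor(y, dtype=torch.float32).unsqueeze(1)
    model = torch.nn.Sequential(
        torch.nn.Linear(X_t.shape[1], width),
        torch.nn.ReLU(),
        torch.nn.Linear(width, 1),
    )
    opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = torch.nn.MSELoss()
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(X_t, y_t), batch_size=50, shuffle=True)
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
    return model

# Example usage with the synthetic data sketch above:
# X, y = generate_synthetic_data(10000)
# model = train_overparameterized_net(X, y)
```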