Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers

Authors: Zeyuan Allen-Zhu, Yuanzhi Li, Yingyu Liang

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network. On the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network, and connect it to the SGD theory of escaping saddle points. (Figure 1: Performance comparison... See Appendix 7 for our experiment setup, how we choose such target function, and more experiments.)
Researcher Affiliation | Collaboration | Zeyuan Allen-Zhu (Microsoft Research AI, zeyuan@csail.mit.edu); Yuanzhi Li (Carnegie Mellon University, yuanzhil@andrew.cmu.edu); Yingyu Liang (University of Wisconsin-Madison, yliang@cs.wisc.edu)
Pseudocode | Yes | Algorithm 1: SGD for three-layer networks (second variant, Section 4.2). (A generic network sketch follows the table.)
Open Source Code | No | The paper states: "Full version and future updates can be found on https://arxiv.org/abs/1811.04918." This is a link to the arXiv preprint of the paper itself, not to source code for the methodology. There is no explicit statement about releasing code or a link to a code repository.
Open Datasets | No | The paper describes using "synthetic data where feature vectors x ∈ R^4 are generated as normalized random Gaussian, and label is generated by target function F(x) = (sin(3x_1) + sin(3x_2) + sin(3x_3) - 2)^2 · cos(7x_4)." This is custom-generated data and not a publicly available dataset with a link, DOI, or formal citation. (A data-generation sketch follows the table.)
Dataset Splits | No | The paper states, "We use N training samples." It does not provide specific percentages or counts for training, validation, or test splits, nor does it refer to standard predefined splits for a named dataset.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper does not list any specific software dependencies with version numbers (e.g., programming languages, libraries, or frameworks with their versions).
Experiment Setup | Yes | We use N training samples, and SGD with mini-batch size 50 and the best-tuned learning rates and weight decay parameters. (Section 3.1 also specifies `ε_a = ε/Θ̃(1)`, `η = Θ̃(1/(ε k m))`, and `T = Θ̃((C_s(φ, 1))^2 k^3 p^2)`.) (A training-loop sketch follows the table.)
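
As a rough illustration of the synthetic data quoted in the Open Datasets row, the sketch below draws normalized Gaussian feature vectors in R^4 and labels them with the stated target function. The NumPy implementation and the function name `make_synthetic_dataset` are our own assumptions; the paper does not publish data-generation code.

```python
import numpy as np

def make_synthetic_dataset(n_samples, rng=None):
    """Sketch: x in R^4 is a normalized random Gaussian vector and the label is
    F(x) = (sin(3x_1) + sin(3x_2) + sin(3x_3) - 2)^2 * cos(7x_4)."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n_samples, 4))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # normalize each feature vector
    y = (np.sin(3 * X[:, 0]) + np.sin(3 * X[:, 1]) + np.sin(3 * X[:, 2]) - 2) ** 2 \
        * np.cos(7 * X[:, 3])
    return X, y
```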
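The paper's Algorithm 1 (SGD for three-layer networks, second variant in Section 4.2) is not reproduced in this report. As a generic stand-in only, the following PyTorch sketch defines an overparameterized three-layer ReLU network; the widths and the use of `nn.Sequential` are illustrative assumptions and do not match the paper's exact parameterization.

```python
import torch.nn as nn

class ThreeLayerNet(nn.Module):
    """Generic overparameterized three-layer ReLU network (widths are illustrative)."""
    def __init__(self, d_in=4, m1=1000, m2=1000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, m1), nn.ReLU(),
            nn.Linear(m1, m2), nn.ReLU(),
            nn.Linear(m2, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)  # scalar output per sample
```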
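For the Experiment Setup row, a minimal training-loop sketch is given below, assuming PyTorch. Only the mini-batch size of 50 comes from the paper; the learning rate, weight decay, and epoch count are placeholders that would have to be tuned as the paper describes.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train(model, X, y, epochs=100, lr=1e-3, weight_decay=1e-4):
    # lr, weight_decay, and epochs are assumed placeholders, not values from the paper.
    data = TensorDataset(torch.as_tensor(X, dtype=torch.float32),
                         torch.as_tensor(y, dtype=torch.float32))
    loader = DataLoader(data, batch_size=50, shuffle=True)  # mini-batch size 50 as stated
    opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()
    return model
```

For example, `train(ThreeLayerNet(), *make_synthetic_dataset(10000))` would put the three sketches together; the sample count 10000 is again a placeholder, since the paper only refers to "N training samples".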