Generalization of Two-layer Neural Networks: An Asymptotic Viewpoint
Authors: Jimmy Ba, Murat Erdogdu, Taiji Suzuki, Denny Wu, Tianzong Zhang
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Our theoretical and experimental results suggest that previously studied model setups that provably give rise to double descent might not translate to optimizing two-layer neural networks." (see also Appendix F, Experiment Setup) |
| Researcher Affiliation | Collaboration | University of Toronto, Vector Institute, University of Tokyo, RIKEN AIP, Tsinghua University |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the methodology is openly available. |
| Open Datasets | No | The paper uses synthetically generated data based on a student-teacher setup and Gaussian features, not a publicly available or open dataset. |
| Dataset Splits | No | The paper discusses 'training samples' but does not provide specific details on how data is split into training, validation, and test sets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. |
| Software Dependencies | No | The paper does not mention any specific software dependencies or their version numbers. |
| Experiment Setup | Yes | Optimizing the Second Layer: "We compute the minimum-norm solution by directly solving the pseudo-inverse. We set n = 1000 and vary γ1, γ2 from 0.1 to 3. The linear teacher model F(x) = xᵀβ is fixed as β = 1_d/√d. For each (γ1, γ2) we average across 50 random draws of data." Optimizing the First Layer: "For both initializations, we use gradient descent with small step size (η = 0.1) and train the model for at least 25,000 steps and until ∥∇_W f(X, W)∥²_F < 10⁻⁶. We fix n = 320 and vary γ1, γ2 from 0.1 to 3 with the same linear teacher β = 1_d/√d. The risk is averaged across 20 models trained from different initializations." |
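The second-layer procedure quoted above is simple enough to sketch. The following is a minimal NumPy reconstruction, not the authors' code: it assumes γ1 = d/n and γ2 = h/n, standard Gaussian inputs, ReLU hidden units, and 1/√d first-layer weight scaling, none of which are pinned down by this excerpt alone.

```python
import numpy as np

def min_norm_second_layer_risk(n=1000, gamma1=1.0, gamma2=1.0,
                               n_test=2000, seed=0):
    """One draw of the second-layer experiment: fix random first-layer
    weights, solve for the minimum-norm second layer via the
    pseudo-inverse, and return the test risk against the linear
    teacher F(x) = x^T beta."""
    rng = np.random.default_rng(seed)
    d = int(gamma1 * n)   # input dimension, assuming gamma1 = d/n
    h = int(gamma2 * n)   # hidden width,    assuming gamma2 = h/n
    beta = np.ones(d) / np.sqrt(d)            # linear teacher beta = 1_d / sqrt(d)

    X = rng.standard_normal((n, d))           # Gaussian inputs
    W = rng.standard_normal((h, d)) / np.sqrt(d)  # hypothetical weight scale

    Phi = np.maximum(X @ W.T, 0.0)            # ReLU hidden features (assumed)
    y = X @ beta

    # Minimum-norm interpolating second layer: a = pinv(Phi) @ y
    a = np.linalg.pinv(Phi) @ y

    # Test risk on fresh samples from the same distribution
    X_test = rng.standard_normal((n_test, d))
    y_hat = np.maximum(X_test @ W.T, 0.0) @ a
    return np.mean((y_hat - X_test @ beta) ** 2)

# Average across 50 random draws of data for one (gamma1, gamma2) pair
risks = [min_norm_second_layer_risk(gamma1=0.5, gamma2=2.0, seed=s)
         for s in range(50)]
print(np.mean(risks))
```

Sweeping both ratios over 0.1 to 3 and averaging 50 seeds per grid point would then trace out the risk surface the row describes.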
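The first-layer procedure can be sketched the same way. The fixed second layer (random signs with 1/√h scaling) and the initialization scale are assumptions here; the excerpt mentions two initializations without specifying them, and the stopping rule follows the quoted criterion ∥∇_W f(X, W)∥²_F < 10⁻⁶ with a minimum of 25,000 steps.

```python
import numpy as np

def train_first_layer(n=320, gamma1=1.0, gamma2=1.0, eta=0.1,
                      min_steps=25000, tol=1e-6, max_steps=200000, seed=0):
    """Gradient descent on the first layer W with the second layer held
    fixed; runs for at least `min_steps` and stops once the squared
    Frobenius norm of the gradient falls below `tol`."""
    rng = np.random.default_rng(seed)
    d, h = int(gamma1 * n), int(gamma2 * n)   # assumed: gamma1 = d/n, gamma2 = h/n
    beta = np.ones(d) / np.sqrt(d)            # linear teacher beta = 1_d / sqrt(d)
    X = rng.standard_normal((n, d))
    y = X @ beta

    W = rng.standard_normal((h, d)) / np.sqrt(d)      # hypothetical init scale
    a = rng.choice([-1.0, 1.0], size=h) / np.sqrt(h)  # fixed second layer (assumed)

    for step in range(1, max_steps + 1):
        Z = X @ W.T                       # pre-activations, shape (n, h)
        r = np.maximum(Z, 0.0) @ a - y    # residuals of the ReLU student
        # Gradient of L(W) = (1/2n) * sum(r_i^2) with respect to W
        G = ((Z > 0) * np.outer(r, a)).T @ X / n
        W -= eta * G                      # small step size, eta = 0.1
        if step >= min_steps and np.sum(G ** 2) < tol:
            break
    return W, np.sum(G ** 2)
```

Averaging the resulting test risk over 20 runs with different seeds would mirror the reported protocol, up to the unspecified initialization schemes.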