The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks

Authors: Wei Hu, Lechao Xiao, Ben Adlam, Jeffrey Pennington

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "we formally prove that, for a class of well-behaved input distributions, the early-time learning dynamics of a two-layer fully-connected neural network can be mimicked by training a simple linear model on the inputs. We additionally argue that this surprising simplicity can persist in networks with more layers and with convolutional architecture, which we verify empirically." (See the sketch after this table.)
Researcher Affiliation | Collaboration | Wei Hu: Princeton University (work partly performed at Google), huwei@cs.princeton.edu. Lechao Xiao: Google Research, Brain Team, xlc@google.com. Ben Adlam: Google Research, Brain Team (work done as a member of the Google AI Residency program, http://g.co/brainresidency), adlam@google.com. Jeffrey Pennington: Google Research, Brain Team, jpennin@google.com.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper.
Open Datasets | Yes | "We perform experiments on a binary classification task from CIFAR-10 (cats vs horses) using a multi-layer FC network and a CNN."
Dataset Splits | No | The paper mentions "20,000 training samples and 2,000 test samples" for synthetic data and "10,000 training and 2,000 test data" for CIFAR-10, but does not explicitly provide information on a validation split.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | No | The paper mentions network architecture details like activation function, width, and number of layers, but does not provide concrete hyperparameter values such as specific learning rates, batch sizes, or optimizer settings in the main text.
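
To make the claim in the Research Type row concrete, the sketch below trains a wide two-layer ReLU network and a plain linear model on the same data with full-batch gradient descent and reports how far apart their predictions are over the first steps. Everything here is an illustrative assumption rather than the paper's setup: the synthetic Gaussian data, the width, the learning rate, the standard (non-symmetrized) initialization, and the use of the change in the network's output from initialization in place of the paper's activation-dependent scaling constants.

```python
import numpy as np

# Minimal sketch (assumed setup, not the paper's exact parameterization):
# compare the early-time predictions of a wide two-layer ReLU network with
# those of a plain linear model on the raw inputs, both trained by
# full-batch gradient descent on the squared loss.

rng = np.random.default_rng(0)
n, d, m, lr, steps = 1000, 50, 4096, 0.1, 20   # samples, input dim, width, LR, GD steps

X = rng.standard_normal((n, d)) / np.sqrt(d)    # synthetic "well-behaved" inputs
y = np.sign(X @ rng.standard_normal(d))         # simple binary targets

W = rng.standard_normal((m, d))                 # first layer, standard init
a = rng.choice([-1.0, 1.0], size=m)             # second layer
beta, b = np.zeros(d), 0.0                      # linear model, initialized at zero

def net(X):
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

f0 = net(X)                                     # network output at initialization

for t in range(steps):
    # Gradient descent step for the network (both layers trained).
    H = np.maximum(X @ W.T, 0.0)                # (n, m) hidden activations
    r = H @ a / np.sqrt(m) - y                  # residuals
    grad_a = H.T @ r / (np.sqrt(m) * n)
    grad_W = ((r[:, None] * (H > 0)) * (a / np.sqrt(m))).T @ X / n
    a -= lr * grad_a
    W -= lr * grad_W

    # Gradient descent step for the linear model.
    r_lin = X @ beta + b - y
    beta -= lr * X.T @ r_lin / n
    b -= lr * r_lin.mean()

    # Gap between the network's change from init and the linear model's output.
    gap = np.abs((net(X) - f0) - (X @ beta + b)).mean()
    print(f"step {t + 1:2d}  mean |Δf_net - f_lin| = {gap:.4f}")
```

Reproducing the paper's quantitative agreement would require its symmetrized initialization and the specific scaling of the linear model derived there; the sketch only shows how such a comparison can be set up.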