Neural Networks Learning and Memorization with (almost) no Over-Parameterization

Authors: Amit Daniely

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical (see the memorization sketch after the table) | In this paper we make a step towards learnability results with near-optimal network size. We give a tight analysis of the rate at which the Neural Tangent Kernel (Jacot et al. [2018]), a fundamental tool in the analysis of SGD on networks, converges to its expectation. These results enable us to prove that SGD on depth-two neural networks, starting from a (non-standard) variant of Xavier initialization (Glorot and Bengio [2010]), can memorize samples, learn polynomials with bounded weights, and learn certain kernel spaces, with near-optimal network size, sample complexity, and runtime. In particular, we show that SGD on a depth-two network with O(m/d) hidden neurons (and hence O(m) parameters) can memorize m random labeled points in S^{d-1}.
Researcher Affiliation | Collaboration | Amit Daniely, The Hebrew University and Google Research Tel-Aviv, amit.daniely@mail.huji.ac.il
Pseudocode | Yes | Algorithm 1: Neural Network Training
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | No | The paper does not provide concrete access information for a publicly available or open dataset. It discusses theoretical concepts like an 'R-bounded distribution D' and 'random labeled points'.
Dataset Splits | No | The paper is theoretical and does not describe specific dataset splits (training, validation, test) for reproduction of experiments.
Hardware Specification | No | The paper describes theoretical work and does not specify hardware used for experiments.
Software Dependencies | No | The paper is theoretical and does not specify software dependencies with version numbers.
Experiment Setup | Yes (see the Algorithm 1 sketch after the table) | Algorithm 1 (Neural Network Training). Input: network parameters σ, d, q; loss ℓ; initialization parameter B > 0; learning rate η > 0; batch size b; number of steps T > 0; access to samples from a distribution D. ... there is a choice of q = O(M^2 R^2 / (d ε^2)), T = O(M^2 / ε^2), as well as B > 0 and η > 0
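
Two hedged sketches follow, tied to the rows above. The first illustrates the setting behind the Research Type row: m uniformly random points on the unit sphere S^{d-1} with random ±1 labels, to be memorized by a depth-two network whose width grows roughly like m/d. This is a minimal NumPy sketch of that setup, not code from the paper (none is released); the ReLU activation, the concrete width, and the 1/sqrt(d) and 1/sqrt(q) scalings are assumptions standing in for the paper's non-standard variant of Xavier initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 50, 2000                       # input dimension, number of points to memorize
q = max(1, m // d)                    # hidden width ~ m/d (illustrative choice)

# m random labeled points on the unit sphere S^{d-1}
X = rng.normal(size=(m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.choice([-1.0, 1.0], size=m)   # uniformly random binary labels

# Depth-two network f(x) = sum_j v_j * relu(<w_j, x>), initialized with a
# Xavier-style scaling; the paper's non-standard Xavier variant is not spelled
# out here, so these factors are stand-ins.
B = 1.0                               # initialization parameter (assumed value)
W0 = B * rng.normal(size=(q, d)) / np.sqrt(d)
v0 = B * rng.normal(size=q) / np.sqrt(q)

def forward(W, v, X):
    """Network outputs for a batch of inputs."""
    return np.maximum(X @ W.T, 0.0) @ v
```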
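
The second sketch matches the Experiment Setup row, which lists the inputs of Algorithm 1: activation σ, dimensions d and q, a loss, initialization parameter B > 0, learning rate η > 0, batch size b, number of steps T > 0, and sample access to a distribution D. The mini-batch SGD loop below follows that interface; the squared loss, the ReLU activation, and the default hyperparameter values are illustrative assumptions, since the paper only proves that suitable choices of q, T, B and η exist.

```python
import numpy as np

def train_depth_two(X, y, q, B=1.0, eta=0.1, b=32, T=5000, seed=0):
    """Mini-batch SGD on a depth-two ReLU network, shaped like Algorithm 1.

    X: (m, d) inputs; y: (m,) targets; q: hidden width; B: initialization
    scale; eta: learning rate; b: batch size; T: number of SGD steps.
    The squared loss is an illustrative stand-in for the generic loss that
    Algorithm 1 receives as input.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    W = B * rng.normal(size=(q, d)) / np.sqrt(d)  # Xavier-style scaling (stand-in)
    v = B * rng.normal(size=q) / np.sqrt(q)
    for _ in range(T):
        idx = rng.integers(0, m, size=b)          # mini-batch of samples from D
        Xb, yb = X[idx], y[idx]
        Z = Xb @ W.T                              # pre-activations, shape (b, q)
        H = np.maximum(Z, 0.0)                    # ReLU hidden layer
        pred = H @ v                              # network outputs
        g = (pred - yb) / b                       # per-example error, batch-averaged
        v -= eta * (H.T @ g)                      # gradient step on the output layer
        W -= eta * ((g[:, None] * (Z > 0.0) * v).T @ Xb)  # gradient step on the hidden layer
    return W, v
```

On inputs like the X, y from the previous sketch, W, v = train_depth_two(X, y, q=m // d) runs the loop, and comparing np.sign(forward(W, v, X)) against y measures how many of the m training points have been memorized.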