Neural Networks Learning and Memorization with (almost) no Over-Parameterization
Authors: Amit Daniely
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper we make a step towards learnability results with near-optimal network size. We give a tight analysis of the rate at which the Neural Tangent Kernel Jacot et al. [2018], a fundamental tool in the analysis of SGD on networks, converges to its expectation. These results enable us to prove that SGD on depth-two neural networks, starting from a (non-standard) variant of Xavier initialization Glorot and Bengio [2010], can memorize samples, learn polynomials with bounded weights, and learn certain kernel spaces, with near-optimal network size, sample complexity, and runtime. In particular, we show that SGD on a depth-two network with Õ(m/d) hidden neurons (and hence Õ(m) parameters) can memorize m random labeled points in S^{d-1}. |
| Researcher Affiliation | Collaboration | Amit Daniely, The Hebrew University and Google Research Tel-Aviv, amit.daniely@mail.huji.ac.il |
| Pseudocode | Yes | Algorithm 1: Neural Network Training |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | No | The paper does not provide concrete access information for a publicly available or open dataset. It discusses theoretical concepts like 'R-bounded distribution D' and 'random labeled points'. |
| Dataset Splits | No | The paper is theoretical and does not describe specific dataset splits (training, validation, test) for reproduction of experiments. |
| Hardware Specification | No | The paper describes theoretical work and does not specify hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not specify software dependencies with version numbers. |
| Experiment Setup | Yes | Algorithm 1 (Neural Network Training). Input: network parameters σ, d, q; loss ℓ; initialization parameter B > 0; learning rate η > 0; batch size b; number of steps T > 0; access to samples from a distribution D. ... there is a choice of q = Õ(M²R²/(dε²)), T = O(M²/ε²), as well as B > 0 and η > 0. (Hedged sketches of this training loop and of the memorization setting follow the table.) |
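
The following is a minimal NumPy sketch of a training loop shaped like Algorithm 1: a depth-two network trained by mini-batch SGD with width q, initialization scale B, learning rate η, batch size b, and T steps, drawing fresh batches from a distribution D. The ReLU activation, the logistic loss, the Gaussian initialization, and the name `train_depth_two` are illustrative assumptions; the paper's exact activation σ, loss ℓ, and non-standard Xavier variant may differ.

```python
import numpy as np

def train_depth_two(sample_batch, d, q, B, eta, b, T, seed=None):
    """T steps of mini-batch SGD on f(x) = sum_i u_i * relu(<w_i, x>)."""
    rng = np.random.default_rng(seed)
    # Gaussian initialization whose scale is controlled by B (an assumption
    # standing in for the paper's non-standard Xavier-style initialization).
    W = rng.normal(0.0, B / np.sqrt(d), size=(q, d))   # hidden-layer weights
    u = rng.normal(0.0, 1.0 / np.sqrt(q), size=q)      # output-layer weights
    for _ in range(T):
        X, y = sample_batch(b)             # b samples from the distribution D
        Z = X @ W.T                        # pre-activations, shape (b, q)
        H = np.maximum(Z, 0.0)             # ReLU stands in for sigma here
        pred = H @ u                       # network outputs, shape (b,)
        # Illustrative logistic loss for labels y in {-1, +1}: log(1 + exp(-y * pred)).
        g = -y / (1.0 + np.exp(y * pred))  # derivative of the loss w.r.t. pred
        grad_u = H.T @ g / b
        grad_W = ((g[:, None] * u[None, :] * (Z > 0)).T @ X) / b
        u -= eta * grad_u
        W -= eta * grad_W
    return W, u
```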
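
And a usage sketch of the headline memorization claim, reusing `train_depth_two` from the block above: m random labeled points on the unit sphere S^{d-1}, and a hidden width on the order of m/d (hence roughly m parameters). The concrete values of d, m, B, η, b, and T are untuned assumptions, so this only illustrates the setup, not the paper's guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 1000
X = rng.normal(size=(m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # m random points on the unit sphere S^{d-1}
y = rng.choice([-1.0, 1.0], size=m)            # random +/-1 labels to be memorized

def sample_batch(b):
    idx = rng.integers(0, m, size=b)           # uniform draws from the m labeled points
    return X[idx], y[idx]

q = max(1, m // d)                             # hidden width of order m/d
W, u = train_depth_two(sample_batch, d=d, q=q, B=1.0, eta=0.5, b=32, T=5000)
train_acc = np.mean(np.sign(np.maximum(X @ W.T, 0.0) @ u) == y)
print(f"fraction of the m points fit correctly: {train_acc:.3f}")
```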