Neural Networks Learning and Memorization with (almost) no Over-Parameterization
Authors: Amit Daniely
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper we make a step towards learnability results with near-optimal network size. We give a tight analysis of the rate at which the Neural Tangent Kernel Jacot et al. [2018], a fundamental tool in the analysis of SGD on networks, converges to its expectation. These results enable us to prove that SGD on depth-two neural networks, starting from a (non-standard) variant of Xavier initialization Glorot and Bengio [2010], can memorize samples, learn polynomials with bounded weights, and learn certain kernel spaces, with near-optimal network size, sample complexity, and runtime. In particular, we show that SGD on a depth-two network with Õ(m/d) hidden neurons (and hence Õ(m) parameters) can memorize m random labeled points in S^{d-1}. |
| Researcher Affiliation | Collaboration | Amit Daniely, The Hebrew University and Google Research Tel-Aviv, amit.daniely@mail.huji.ac.il |
| Pseudocode | Yes | Algorithm 1: Neural Network Training |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | No | The paper does not provide concrete access information for a publicly available or open dataset. It discusses theoretical concepts like 'R-bounded distribution D' and 'random labeled points'. |
| Dataset Splits | No | The paper is theoretical and does not describe specific dataset splits (training, validation, test) for reproduction of experiments. |
| Hardware Specification | No | The paper describes theoretical work and does not specify hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not specify software dependencies with version numbers. |
| Experiment Setup | Yes | Algorithm 1 (Neural Network Training). Input: network parameters σ, d, q; loss ℓ; initialization parameter B > 0; learning rate η > 0; batch size b; number of steps T > 0; access to samples from a distribution D. ... there is a choice of q = Õ(M²R²/(dε²)), T = O(M²/ε²), as well as B > 0 and η > 0. (Hedged sketches of this training loop and of the memorization setting follow the table.) |
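
The following is a minimal NumPy sketch of a training loop shaped like Algorithm 1: a depth-two network trained by mini-batch SGD with width q, initialization scale B, learning rate η, batch size b, and T steps, drawing fresh batches from a distribution D. The ReLU activation, the logistic loss, the Gaussian initialization, and the name `train_depth_two` are illustrative assumptions; the paper's exact activation σ, loss ℓ, and non-standard Xavier variant may differ.

```python
import numpy as np

def train_depth_two(sample_batch, d, q, B, eta, b, T, seed=None):
    """T steps of mini-batch SGD on f(x) = sum_i u_i * relu(<w_i, x>)."""
    rng = np.random.default_rng(seed)
    # Gaussian initialization whose scale is controlled by B (an assumption
    # standing in for the paper's non-standard Xavier-style initialization).
    W = rng.normal(0.0, B / np.sqrt(d), size=(q, d))   # hidden-layer weights
    u = rng.normal(0.0, 1.0 / np.sqrt(q), size=q)      # output-layer weights
    for _ in range(T):
        X, y = sample_batch(b)             # b samples from the distribution D
        Z = X @ W.T                        # pre-activations, shape (b, q)
        H = np.maximum(Z, 0.0)             # ReLU stands in for sigma here
        pred = H @ u                       # network outputs, shape (b,)
        # Illustrative logistic loss for labels y in {-1, +1}: log(1 + exp(-y * pred)).
        g = -y / (1.0 + np.exp(y * pred))  # derivative of the loss w.r.t. pred
        grad_u = H.T @ g / b
        grad_W = ((g[:, None] * u[None, :] * (Z > 0)).T @ X) / b
        u -= eta * grad_u
        W -= eta * grad_W
    return W, u
```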
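
And a usage sketch of the headline memorization claim, reusing `train_depth_two` from the block above: m random labeled points on the unit sphere S^{d-1}, and a hidden width on the order of m/d (hence roughly m parameters). The concrete values of d, m, B, η, b, and T are untuned assumptions, so this only illustrates the setup, not the paper's guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 1000
X = rng.normal(size=(m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # m random points on the unit sphere S^{d-1}
y = rng.choice([-1.0, 1.0], size=m)            # random +/-1 labels to be memorized

def sample_batch(b):
    idx = rng.integers(0, m, size=b)           # uniform draws from the m labeled points
    return X[idx], y[idx]

q = max(1, m // d)                             # hidden width of order m/d
W, u = train_depth_two(sample_batch, d=d, q=q, B=1.0, eta=0.5, b=32, T=5000)
train_acc = np.mean(np.sign(np.maximum(X @ W.T, 0.0) @ u) == y)
print(f"fraction of the m points fit correctly: {train_acc:.3f}")
```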