Agnostic Learning of General ReLU Activation Using Gradient Descent
Authors: Pranjal Awasthi, Alex Tang, Aravindan Vijayaraghavan
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | "We provide a convergence analysis of gradient descent for the problem of agnostically learning a single ReLU function under Gaussian distributions. Our main result establishes that starting from random initialization, in a polynomial number of iterations gradient descent outputs, with high probability, a ReLU function that achieves an error that is within a constant factor of the optimal, i.e., it is guaranteed to achieve an error of O(OPT)." The paper's analysis is presented in the section "Overview of the Analysis (Proof of Theorem 1.1)". |
| Researcher Affiliation | Collaboration | Pranjal Awasthi, Google Research (pranjalawasthi@google.com); Alex Tang, Northwestern University (alextang@u.northwestern.edu); Aravindan Vijayaraghavan, Northwestern University (aravindv@northwestern.edu). |
| Pseudocode | Yes | Multiscale random initialization: we are given a parameter M such that ∥v∥₂ ∈ [2^(−log M), 2^(log M)] (note that M can have large dependencies on d and other parameters; the guarantees are polynomial in log M). A random initializer w = (w̃, 0) is drawn from D_unknown(M) as follows: 1. Pick j uniformly at random from {−⌈log M⌉, −⌈log M⌉ + 1, ..., −1, 0, 1, ..., ⌈log M⌉}. 2. The scale α ∈ ℝ₊ is drawn as follows: first pick g ∼ N(0, 1) and set α = 2^j · \|g\|. 3. A uniformly random unit vector ŵ ∈ ℝ^d is drawn, and we output w̃ = α ŵ. (A hedged code sketch of this procedure appears after the table.) |
| Open Source Code | No | The paper does not contain any statement about releasing source code for the described methodology, nor does it provide any links to a code repository. |
| Open Datasets | No | The paper specifies theoretical assumptions about the data distribution, such as "Let D be a distribution over (x̃, y) ∈ ℝ^d × ℝ where the marginal over x̃ is the standard Gaussian N(0, I)". However, it does not refer to or provide access information for any concrete, publicly available dataset used for training or empirical evaluation. |
| Dataset Splits | No | As the paper is theoretical and does not conduct experiments on a specific dataset, it does not provide details regarding training, validation, or test dataset splits. |
| Hardware Specification | No | This is a theoretical paper focused on convergence analysis; no hardware specifications for running experiments are mentioned. |
| Software Dependencies | No | The paper is theoretical and does not describe any implementation that would require specific software dependencies with version numbers. |
| Experiment Setup | No | Since the paper is theoretical, it does not describe an experimental setup with specific hyperparameters or system-level training settings. It only mentions a "standard gradient descent algorithm with a fixed learning rate". |
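
For concreteness, the multiscale random initialization quoted in the Pseudocode row can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the authors' code (the paper releases none): the function names, the NumPy implementation, and the use of the empirical squared loss with a fixed learning rate for the gradient-descent step are assumptions made for the sketch.

```python
import numpy as np

def multiscale_random_init(d, M, rng=None):
    """Illustrative sketch of the multiscale random initialization.

    Assumes the target norm lies in [2^(-log M), 2^(log M)]; the
    procedure depends on M only through log M.
    """
    rng = np.random.default_rng() if rng is None else rng
    log_M = int(np.ceil(np.log2(M)))

    # Step 1: pick a scale index j uniformly from {-ceil(log M), ..., ceil(log M)}.
    j = rng.integers(-log_M, log_M + 1)

    # Step 2: draw the magnitude alpha = 2^j * |g| with g ~ N(0, 1).
    alpha = (2.0 ** j) * abs(rng.standard_normal())

    # Step 3: draw a uniformly random unit direction w_hat in R^d.
    w_hat = rng.standard_normal(d)
    w_hat /= np.linalg.norm(w_hat)

    # The initializer is w = alpha * w_hat with the bias set to 0.
    return alpha * w_hat, 0.0


def gradient_descent_relu(X, y, M, lr=0.01, steps=1000, rng=None):
    """Fixed-learning-rate gradient descent on the empirical squared loss
    of a single ReLU, started from the multiscale initialization.
    (The squared loss and this exact update rule are assumptions of the sketch.)
    """
    n, d = X.shape
    w, b = multiscale_random_init(d, M, rng)
    for _ in range(steps):
        pre = X @ w + b                      # pre-activations w^T x + b
        pred = np.maximum(pre, 0.0)          # ReLU(w^T x + b)
        residual = pred - y
        active = (pre > 0).astype(float)     # ReLU derivative (0 at the kink)
        grad_w = (X.T @ (residual * active)) / n
        grad_b = np.mean(residual * active)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

As a usage note, this sketch would be run on synthetic data with Gaussian marginals (x ∼ N(0, I)), mirroring the distributional assumption in the paper; the paper itself reports no such experiment.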