Agnostic Learning of General ReLU Activation Using Gradient Descent

Authors: Pranjal Awasthi, Alex Tang, Aravindan Vijayaraghavan

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We provide a convergence analysis of gradient descent for the problem of agnostically learning a single ReLU function under Gaussian distributions. Our main result establishes that, starting from random initialization, gradient descent outputs in a polynomial number of iterations, with high probability, a ReLU function whose error is within a constant factor of the optimal, i.e., it is guaranteed to achieve an error of O(OPT) (a schematic form of this guarantee is given after this table). The paper also includes a section titled "Overview of the Analysis (Proof of Theorem 1.1)".
Researcher Affiliation | Collaboration | Pranjal Awasthi, Google Research (pranjalawasthi@google.com); Alex Tang, Northwestern University (alextang@u.northwestern.edu); Aravindan Vijayaraghavan, Northwestern University (aravindv@northwestern.edu).
Pseudocode | Yes | Multiscale random initialization: We are given a parameter M such that ‖v‖₂ ∈ [2^(−log M), 2^(log M)] (note that M can have large dependencies on d and other parameters; our guarantees will be polynomial in log M). A random initializer w = (σŵ, 0) is drawn from D_unknown(M) as follows: 1. Pick j uniformly at random from {−⌈log M⌉, −⌈log M⌉ + 1, ..., −1, 0, 1, ..., ⌈log M⌉}. 2. σ ∈ ℝ₊ is drawn according to D_σ as follows: we first pick g ~ N(0, 1) and set σ = 2^j·|g|. 3. A uniformly random unit vector ŵ ∈ ℝ^d is drawn and we output w = σŵ. (A runnable sketch of this initialization appears after this table.)
Open Source Code | No | The paper does not contain any statement about releasing source code for the described methodology, nor does it provide any links to a code repository.
Open Datasets | No | The paper specifies theoretical assumptions about the data distribution, such as "Let D be a distribution over (x̃, y) ∈ ℝ^d × ℝ where the marginal over x̃ is the standard Gaussian N(0, I)." However, it does not refer to or provide access information for any concrete, publicly available dataset used for training or empirical evaluation.
Dataset Splits | No | As the paper is theoretical and does not conduct experiments on a specific dataset, it does not provide details regarding training, validation, or test dataset splits.
Hardware Specification | No | This is a theoretical paper focused on convergence analysis; no hardware specifications for running experiments are mentioned.
Software Dependencies | No | The paper is theoretical and does not describe any implementation that would require specific software dependencies with version numbers.
Experiment Setup | No | As a theoretical paper, it does not describe an experimental setup with specific hyperparameters or system-level training settings; it only mentions a "standard gradient descent algorithm with a fixed learning rate" (an illustrative sketch of such a loop is included after this table).
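For reference, the O(OPT) guarantee summarized in the Research Type row can be written schematically as below. This is a hedged restatement based only on the abstract quoted above: the symbols w and b for the parameters returned by gradient descent and v, b₀ for the best-fitting ReLU are assumptions of this note, and the exact constants and error definition are as in the paper's Theorem 1.1.

```latex
% Schematic O(OPT) guarantee: the ReLU returned by gradient descent is
% competitive, up to a constant factor, with the best ReLU fit under D,
% where the marginal of x under D is the standard Gaussian N(0, I).
\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[(\mathrm{ReLU}(w\cdot x + b) - y)^2\big] \le O(\mathrm{OPT}),
\quad\text{where}\quad
\mathrm{OPT} = \min_{v,\,b_0}\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[(\mathrm{ReLU}(v\cdot x + b_0) - y)^2\big].
```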
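The multiscale random initialization quoted in the Pseudocode row translates almost directly into code. The following is a minimal Python/NumPy sketch, not the authors' implementation: the function name is ours, the logarithm is taken base 2, and the bias is returned explicitly as 0, all of which are illustrative assumptions.

```python
import numpy as np

def multiscale_random_init(d, M, rng=None):
    """Multiscale random initialization sketch: pick a dyadic scale level j,
    set sigma = 2**j * |g| with g ~ N(0, 1), and return sigma times a
    uniformly random unit direction, with the bias initialized to 0."""
    rng = np.random.default_rng() if rng is None else rng
    log_M = int(np.ceil(np.log2(M)))               # assumes log means log base 2
    j = rng.integers(-log_M, log_M + 1)            # step 1: j uniform in {-ceil(log M), ..., ceil(log M)}
    sigma = 2.0 ** j * abs(rng.standard_normal())  # step 2: sigma = 2^j * |g|, g ~ N(0, 1)
    w_hat = rng.standard_normal(d)                 # step 3: uniformly random unit vector in R^d
    w_hat /= np.linalg.norm(w_hat)
    return sigma * w_hat, 0.0                      # output w = sigma * w_hat, bias 0
```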
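Since the Experiment Setup row notes only a "standard gradient descent algorithm with a fixed learning rate", here is an illustrative sketch of such a loop on the empirical squared loss of a single ReLU with bias, reusing the multiscale_random_init sketch above. The loss form, learning rate, iteration count, and sample sizes are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def gd_fixed_lr(X, y, w, b, lr=0.05, iters=1000):
    """Gradient descent with a fixed learning rate on the empirical
    squared loss (1/2n) * sum_i (ReLU(w.x_i + b) - y_i)^2."""
    n = len(y)
    for _ in range(iters):
        z = X @ w + b
        resid = np.maximum(z, 0.0) - y        # ReLU(w.x_i + b) - y_i
        grad_scale = resid * (z > 0) / n      # ReLU'(z) = 1{z > 0}
        w = w - lr * (X.T @ grad_scale)       # fixed-step update on the weights
        b = b - lr * grad_scale.sum()         # fixed-step update on the bias
    return w, b

# Illustrative run with Gaussian inputs, matching the paper's marginal
# assumption x ~ N(0, I); the target here is an arbitrary noisy ReLU.
rng = np.random.default_rng(0)
d, n = 10, 5000
X = rng.standard_normal((n, d))
v = rng.standard_normal(d)
y = np.maximum(X @ v + 0.5, 0.0) + 0.1 * rng.standard_normal(n)
w0, b0 = multiscale_random_init(d, M=2.0 ** 10, rng=rng)
w_fit, b_fit = gd_fixed_lr(X, y, w0, b0)
```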