Time/Accuracy Tradeoffs for Learning a ReLU with respect to Gaussian Marginals

Authors: Surbhi Goel, Sushrut Karmalkar, Adam Klivans

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We consider the problem of computing the best-fitting ReLU with respect to square-loss on a training set when the examples have been drawn according to a spherical Gaussian distribution (the labels can be arbitrary). Let opt < 1 be the population loss of the best-fitting ReLU. We prove: (1) Finding a ReLU with square-loss opt + ε is as hard as the problem of learning sparse parities with noise, widely thought to be computationally intractable. This is the first hardness result for learning a ReLU with respect to Gaussian marginals, and our results imply unconditionally that gradient descent cannot converge to the global minimum in polynomial time. (2) There exists an efficient approximation algorithm for finding the best-fitting ReLU that achieves error O(opt^{2/3}). The algorithm uses a novel reduction to noisy halfspace learning with respect to 0/1 loss. (A minimal sketch of the square-loss objective under Gaussian marginals appears after this table.)
Researcher Affiliation | Academia | Surbhi Goel, Department of Computer Science, University of Texas at Austin, surbhi@cs.utexas.edu; Sushrut Karmalkar, Department of Computer Science, University of Texas at Austin, sushrutk@cs.utexas.edu; Adam R. Klivans, Department of Computer Science, University of Texas at Austin, klivans@cs.utexas.edu
Pseudocode | Yes | Algorithm 1: Learning Sparse Parities with Noise using an Agnostic ReLU learner; Algorithm 2
Open Source Code | No | The paper does not include any explicit statements about providing open-source code for the methodology, nor does it provide links to code repositories.
Open Datasets | No | The paper assumes data drawn 'according to a spherical Gaussian distribution' for its theoretical analysis, but it does not specify or provide concrete access information (link, DOI, repository, or citation) for any publicly available or open dataset.
Dataset Splits | Yes | Input: Training set S of M1 samples $(x_i, y_i)_{i=1}^{M_1}$, validation set V of M2 samples $(x_i, y_i)_{i=M_1+1}^{M_1+M_2}$, error parameter $\epsilon$ ... Using standard concentration inequalities for sub-Gaussian and sub-exponential random variables [Ver], we see that using a validation set of $M_2 = 100/\epsilon^2$ samples, we have for all $j$, $|\widehat{\mathrm{err}}_{V_j}(h_j) - \mathrm{err}_{\mathcal{D}_j}(h_j)| \le \epsilon/4$. (See the validation-split sketch after this table.)
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types) used for running experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any ancillary software components or libraries (e.g., Python, PyTorch, TensorFlow, specific solvers).
Experiment Setup | No | The paper mentions theoretical parameters like 'error parameter ε', 'M1 samples', and 'M2 samples', but these are not 'concrete hyperparameter values' or 'system-level training settings' as would be expected for an empirical experimental setup (e.g., learning rate, batch size, optimizer settings).
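
To make the objective quoted in the Research Type row concrete, below is a minimal Python sketch (not from the paper) of the empirical square-loss of a candidate ReLU when the marginals are spherical Gaussian and the labels may be arbitrary. The dimension d, sample count m, noise level, and the candidate weight vector w_hat are illustrative assumptions.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def empirical_square_loss(w, X, y):
        """Average squared error of the ReLU hypothesis x -> ReLU(w . x)."""
        preds = relu(X @ w)
        return np.mean((preds - y) ** 2)

    # Illustrative setup (assumed, not from the paper): spherical Gaussian
    # marginals with labels that need not come from any ReLU.
    rng = np.random.default_rng(0)
    d, m = 10, 5000
    X = rng.standard_normal((m, d))                      # x_i ~ N(0, I_d)
    w_true = rng.standard_normal(d)
    w_true /= np.linalg.norm(w_true)
    y = relu(X @ w_true) + 0.1 * rng.standard_normal(m)  # arbitrary/noisy labels

    w_hat = rng.standard_normal(d)                       # some candidate hypothesis
    print(empirical_square_loss(w_hat, X, y))
    print(empirical_square_loss(w_true, X, y))           # near the noise level

The population loss opt referenced in the abstract is the expectation of this quantity over the Gaussian distribution rather than a finite-sample average.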
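The Dataset Splits row quotes a held-out validation set of M2 = 100/ε² samples used to pick among candidate hypotheses, with each empirical validation error concentrating to within ε/4 of its population value. The following hedged Python sketch shows that standard split-and-validate step; the candidate construction, the use of square-loss for validation, and the helper names are assumptions for illustration, not the paper's code.

    import numpy as np

    def validation_select(candidates, X_val, y_val, loss_fn):
        """Return the candidate with the smallest empirical validation error.

        With M2 = 100 / eps**2 validation samples, standard concentration
        bounds keep each empirical error within eps/4 of its population
        value with good probability, so the selected candidate is near-best.
        """
        errs = [np.mean(loss_fn(h(X_val), y_val)) for h in candidates]
        best = int(np.argmin(errs))
        return candidates[best], errs[best]

    # Illustrative usage (assumed): split M1 + M2 samples into train / validation.
    eps = 0.1
    M2 = int(np.ceil(100 / eps ** 2))        # validation size from the quoted bound
    rng = np.random.default_rng(1)
    d, M1 = 10, 4000
    X = rng.standard_normal((M1 + M2, d))
    y = np.maximum(X @ rng.standard_normal(d), 0.0)
    X_tr, y_tr = X[:M1], y[:M1]              # training split: would produce the h_j
    X_val, y_val = X[M1:], y[M1:]            # validation split: used only for selection

    # Hypothetical candidates h_j (here, ReLUs with different weight guesses).
    candidates = [(lambda X, w=rng.standard_normal(d): np.maximum(X @ w, 0.0))
                  for _ in range(5)]
    square_loss = lambda pred, y: (pred - y) ** 2
    h_best, err_best = validation_select(candidates, X_val, y_val, square_loss)
    print(err_best)

This only illustrates the split-and-validate mechanism the row cites; the paper's actual candidates h_j come from its reduction to noisy halfspace learning.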