Learning ReLUs via Gradient Descent

Authors: Mahdi Soltanolkotabi

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The paper states: "In this section we carry out a simple numerical experiment to corroborate our theoretical results."
Researcher Affiliation | Academia | Mahdi Soltanolkotabi, Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA, soltanol@usc.edu
Pseudocode | No | No structured pseudocode or algorithm blocks were found.
Open Source Code | No | The paper does not provide any explicit statements or links indicating that source code for the methodology is openly available.
Open Datasets | No | The paper states: "For this purpose we generate a unit norm sparse vector w ∈ R^d of dimension d = 1000 containing s = d/50 non-zero entries. We also generate a random feature matrix X ∈ R^{n×d} with n = 8s log(d/s) and containing i.i.d. N(0,1) entries." This indicates the data was synthetically generated for the experiment rather than drawn from a publicly available source (see the data-generation sketch after the table).
Dataset Splits | No | The paper describes generating synthetic data and running experiments, but does not specify details regarding training, validation, or testing dataset splits.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependency details, such as library names with version numbers.
Experiment Setup | Yes | The paper states: "For this purpose we generate a unit norm sparse vector w ∈ R^d of dimension d = 1000 containing s = d/50 non-zero entries. We also generate a random feature matrix X ∈ R^{n×d} with n = 8s log(d/s) and containing i.i.d. N(0,1) entries. We apply the projected gradient iterations to both observation models starting from w_0 = 0. For the ReLU observations we use the step size discussed in Theorem 3.1. For the linear model we apply projected gradient descent updates of the form w_{τ+1} = P_K(w_τ - (1/n) X^T(X w_τ - y)). In both cases we use the regularizer R(w) = ||w||_{ℓ0} so that the projection only keeps the top s entries of the vector (a.k.a. iterative hard thresholding)." A sketch of these updates follows the table.
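The synthetic setup quoted in the Open Datasets and Experiment Setup rows can be mimicked with a short numpy sketch. This is a minimal illustration under the quoted choices (d = 1000, s = d/50, n = 8s log(d/s), a unit-norm s-sparse planted vector, i.i.d. N(0,1) features); the function and variable names are ours, not the paper's, and rounding n up to an integer is an assumption.

```python
import numpy as np

def generate_synthetic_data(d=1000, sparsity_fraction=1 / 50, rng=None):
    """Generate the synthetic instance described in the paper's experiment:
    a unit-norm s-sparse planted vector w_star, a Gaussian feature matrix X,
    and observations from both the linear and the ReLU model."""
    rng = np.random.default_rng() if rng is None else rng
    s = int(d * sparsity_fraction)             # s = d/50 non-zero entries
    n = int(np.ceil(8 * s * np.log(d / s)))    # n = 8 s log(d/s) samples (rounded up)

    # Unit-norm s-sparse planted vector: random support, Gaussian values, then normalize.
    w_star = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    w_star[support] = rng.standard_normal(s)
    w_star /= np.linalg.norm(w_star)

    # Feature matrix with i.i.d. N(0, 1) entries.
    X = rng.standard_normal((n, d))

    y_linear = X @ w_star                  # linear observations y = Xw
    y_relu = np.maximum(X @ w_star, 0.0)   # ReLU observations y = max(Xw, 0)
    return w_star, X, y_linear, y_relu
```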
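The update rules quoted in the Experiment Setup row amount to projected gradient descent with a hard-thresholding projection that keeps the s largest-magnitude entries. Below is a minimal sketch assuming the data generator above; for the ReLU observations we assume the natural squared loss on ReLU outputs, and the step size mu is a placeholder for the value discussed in Theorem 3.1, which the quoted text does not restate.

```python
import numpy as np

def hard_threshold(w, s):
    """Projection induced by R(w) = ||w||_0: keep the s largest-magnitude entries."""
    out = np.zeros_like(w)
    keep = np.argpartition(np.abs(w), -s)[-s:]
    out[keep] = w[keep]
    return out

def pgd_linear(X, y, s, n_iters=500):
    """Iterative hard thresholding for the linear model, starting from w_0 = 0:
    w_{tau+1} = P_K(w_tau - (1/n) X^T (X w_tau - y))."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        w = hard_threshold(w - X.T @ (X @ w - y) / n, s)
    return w

def pgd_relu(X, y, s, mu=1.0, n_iters=500):
    """Projected gradient descent for ReLU observations, starting from w_0 = 0,
    on the (assumed) loss (1/2n) * sum_i (ReLU(x_i^T w) - y_i)^2.
    mu is a placeholder for the step size of Theorem 3.1."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        z = X @ w
        # Gradient of 0.5 * (ReLU(z) - y)^2 with respect to z is (ReLU(z) - y) * 1{z > 0}.
        residual = (np.maximum(z, 0.0) - y) * (z > 0)
        w = hard_threshold(w - mu * (X.T @ residual) / n, s)
    return w

# Example usage with the generator sketched above (names are ours):
# w_star, X, y_linear, y_relu = generate_synthetic_data()
# s = np.count_nonzero(w_star)
# err_lin = np.linalg.norm(pgd_linear(X, y_linear, s) - w_star)
# err_relu = np.linalg.norm(pgd_relu(X, y_relu, s) - w_star)
```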