Neural Networks Efficiently Learn Low-Dimensional Representations with SGD

Authors: Alireza Mousavi-Hosseini, Sejun Park, Manuela Girotti, Ioannis Mitliagkas, Murat A. Erdogdu

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Theoretical We prove that the first-layer weights of the NN converge to the k-dimensional principal subspace spanned by the vectors u_1, ..., u_k of the true model, when online SGD with weight decay is used for training. This phenomenon has several important consequences when k ≪ d. First, by employing uniform convergence on this smaller subspace, we establish a generalization error bound of O(√(kd/T)) after T iterations of SGD, which is independent of the width of the NN. We further demonstrate that SGD-trained ReLU NNs can learn a single-index target of the form y = f(⟨u, x⟩) + ϵ by recovering the principal direction, with a sample complexity linear in d (up to log factors), where f is a monotonic function with at most polynomial growth, and ϵ is the noise. This is in contrast to the known d^{Ω(p)} sample requirement to learn any degree-p polynomial in the kernel regime, and it shows that NNs trained with SGD can outperform the neural tangent kernel at initialization. Finally, we also provide compressibility guarantees for NNs using the approximate low-rank structure produced by SGD. (A minimal simulation sketch of this subspace-recovery claim appears after the table.)
Researcher Affiliation Academia Alireza Mousavi-Hosseini (1,2), Sejun Park (3), Manuela Girotti (4), Ioannis Mitliagkas (5,6), Murat A. Erdogdu (1,2) — (1) University of Toronto, (2) Vector Institute, (3) Korea University, (4) Saint Mary's University, (5) Université de Montréal, (6) Mila Institute
Pseudocode Yes Algorithm 1 (Training a two-layer ReLU network with SGD). Input: a_0, b_0 ∈ R^m, W_0 ∈ R^{m×d}, {(x^(t), y^(t))}_{t=0}^{T−1}, (η_t)_{t≥0}, (η′_t)_{t≥0}, λ, λ′. 1: for t = 0, ..., T−1 do 2: W_{t+1} = (1 − η_t λ) W_t − η_t ∇_W ℓ(ŷ(x^(t); W_t, a_0, b_0), y^(t)). 3: end for 4: Let b_j ~ Unif(·, ·) i.i.d. for 1 ≤ j ≤ m. 5: for t = 0, ..., T−1 do 6: Sample i_t ~ Unif{0, ..., T−1}. 7: a_{t+1} = (1 − η′_t λ′) a_t − η′_t ∇_a ℓ(ŷ(x^(i_t); W_T, a_t, b), y^(i_t)). 8: end for 9: return (W_T, a_T, b). (A runnable NumPy sketch of this two-phase procedure appears after the table.)
Open Source Code No The paper does not provide any concrete statement about releasing source code for the methodology, nor does it provide a link to a code repository.
Open Datasets No The paper describes a theoretical setup where 'the input x ∈ R^d is Gaussian and the target y ∈ R follows a multiple-index model, i.e., y = g(⟨u_1, x⟩, ..., ⟨u_k, x⟩) with a noisy link function g' and 'inputs satisfy x^(i) ~iid N(0, I_d)'. This describes the assumed data distribution for theoretical analysis, not a specific public or open dataset with access information.
Dataset Splits No The paper does not provide specific dataset split information (e.g., percentages, sample counts, or citations to predefined splits) for training, validation, or testing. It is a theoretical paper focusing on population risk and convergence.
Hardware Specification No The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for any computations or demonstrations. The work is theoretical.
Software Dependencies No The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers). The work is theoretical.
Experiment Setup No The paper does not contain specific experimental setup details such as concrete hyperparameter values, training configurations, or system-level settings. It is theoretical in nature and analyzes the dynamics of SGD without an empirical setup.
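
The central theoretical claim quoted in the Research Type row is that online SGD with weight decay drives the first-layer weights into the k-dimensional principal subspace span{u_1, ..., u_k}. Below is a minimal NumPy sketch of how that claim could be probed: it generates data from the multiple-index model y = g(⟨u_1, x⟩, ..., ⟨u_k, x⟩) + ε and measures what fraction of a weight matrix's Frobenius norm lies in span{u_1, ..., u_k}. The dimensions, the link function g, and the noise level are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, m = 100, 3, 256                 # ambient dimension, index dimension, width (illustrative)

# Orthonormal directions u_1, ..., u_k spanning the principal subspace (columns of U).
U = np.linalg.qr(rng.standard_normal((d, k)))[0]

def g(z):
    # Placeholder multiple-index link function g: R^k -> R (not the paper's).
    return np.tanh(z).sum(axis=-1)

def sample(n, noise=0.1):
    # Multiple-index model: x ~ N(0, I_d), y = g(<u_1, x>, ..., <u_k, x>) + eps.
    x = rng.standard_normal((n, d))
    y = g(x @ U) + noise * rng.standard_normal(n)
    return x, y

def subspace_alignment(W, U):
    # Fraction of ||W||_F^2 captured by projecting every row of W onto span(U).
    return np.sum((W @ U) ** 2) / np.sum(W ** 2)

W_init = rng.standard_normal((m, d))             # e.g. first-layer weights at initialization
W_in_span = rng.standard_normal((m, k)) @ U.T    # rows lying exactly in span(U)

print(subspace_alignment(W_init, U))     # ~ k/d, i.e. about 0.03 here
print(subspace_alignment(W_in_span, U))  # ~ 1.0, the regime the theorem predicts after SGD
```

At random initialization the alignment is roughly k/d; the paper's result predicts it should approach 1 after online SGD with weight decay, which is what makes the width-independent O(√(kd/T)) generalization bound possible.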
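Algorithm 1 quoted in the Pseudocode row is stated abstractly; the sketch below is one plausible NumPy rendering of its two phases, assuming the network is ŷ(x; W, a, b) = aᵀ ReLU(Wx + b) with squared loss and a single-index target. The initialization scales, step sizes, weight-decay values, the bias range used in step 4, and the link function are all illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def predict(x, W, a, b):
    # Assumed two-layer parameterization: y_hat = a^T ReLU(W x + b).
    return a @ relu(W @ x + b)

# Illustrative single-index instance: y = f(<u, x>) + noise with a monotone link f.
d, m, T = 30, 64, 5000
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
X = rng.standard_normal((T, d))                      # x^(t) ~ N(0, I_d)
y = np.tanh(X @ u) + 0.05 * rng.standard_normal(T)

# Initialization and hyperparameters (all illustrative, not the paper's values).
W = rng.standard_normal((m, d)) / np.sqrt(d)
a0 = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)
b0 = np.zeros(m)
eta, eta2 = 0.05, 0.05                               # step sizes eta_t, eta'_t (constant here)
lam, lam2 = 1e-2, 1e-3                               # weight-decay parameters lambda, lambda'

# Phase 1 (steps 1-3): online SGD with weight decay on W, second layer frozen at (a0, b0).
for t in range(T):
    x_t, y_t = X[t], y[t]
    pre = W @ x_t + b0
    err = predict(x_t, W, a0, b0) - y_t              # derivative of the squared loss 0.5*(y_hat - y)^2
    grad_W = np.outer(err * a0 * (pre > 0), x_t)     # gradient of the loss w.r.t. each row of W
    W = (1.0 - eta * lam) * W - eta * grad_W

# Phase 2 (steps 4-8): resample the biases, then SGD with weight decay on a only,
# drawing indices uniformly from the same T examples.
tau = 1.0                                            # Unif(-tau, tau) bias range (assumed)
b = rng.uniform(-tau, tau, size=m)
a = a0.copy()
for t in range(T):
    i = rng.integers(T)
    h = relu(W @ X[i] + b)
    err = a @ h - y[i]
    a = (1.0 - eta2 * lam2) * a - eta2 * err * h

# Diagnostic: fraction of first-layer energy along the principal direction u.
print(np.sum((W @ u) ** 2) / np.sum(W ** 2))
```

The final print reports the fraction of first-layer energy along u; under the paper's result this should be close to 1 after training, versus about 1/d at initialization.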