Neural Networks Efficiently Learn Low-Dimensional Representations with SGD

Authors: Alireza Mousavi-Hosseini, Sejun Park, Manuela Girotti, Ioannis Mitliagkas, Murat A. Erdogdu

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Theoretical We prove that the first-layer weights of the NN converge to the k-dimensional principal subspace spanned by the vectors u_1, ..., u_k of the true model, when online SGD with weight decay is used for training. This phenomenon has several important consequences when k ≪ d. First, by employing uniform convergence on this smaller subspace, we establish a generalization error bound of O(√(kd/T)) after T iterations of SGD, which is independent of the width of the NN. We further demonstrate that SGD-trained ReLU NNs can learn a single-index target of the form y = f(⟨u, x⟩) + ϵ by recovering the principal direction, with a sample complexity linear in d (up to log factors), where f is a monotonic function with at most polynomial growth, and ϵ is the noise. This is in contrast to the known d^{Ω(p)} sample requirement to learn any degree-p polynomial in the kernel regime, and it shows that NNs trained with SGD can outperform the neural tangent kernel at initialization. Finally, we also provide compressibility guarantees for NNs using the approximate low-rank structure produced by SGD. (A minimal simulation sketch of this subspace-recovery claim appears after the table.)
Researcher Affiliation Academia Alireza Mousavi-Hosseini (1,2), Sejun Park (3), Manuela Girotti (4), Ioannis Mitliagkas (5,6), Murat A. Erdogdu (1,2) — (1) University of Toronto, (2) Vector Institute, (3) Korea University, (4) Saint Mary's University, (5) Université de Montréal, (6) Mila Institute
Pseudocode Yes Algorithm 1 (Training a two-layer ReLU network with SGD). Input: a_0, b_0 ∈ R^m, W_0 ∈ R^{m×d}, {(x^(t), y^(t))}_{t=0}^{T−1}, (η_t)_{t≥0}, (η′_t)_{t≥0}, λ, λ′. 1: for t = 0, ..., T−1 do 2: W_{t+1} = (1 − η_t λ) W_t − η_t ∇_W ℓ(ŷ(x^(t); W_t, a_0, b_0), y^(t)). 3: end for 4: Let b_j ~ Unif(·, ·) i.i.d. for 1 ≤ j ≤ m. 5: for t = 0, ..., T−1 do 6: Sample i_t ~ Unif{0, ..., T−1}. 7: a_{t+1} = (1 − η′_t λ′) a_t − η′_t ∇_a ℓ(ŷ(x^(i_t); W_T, a_t, b), y^(i_t)). 8: end for 9: return (W_T, a_T, b). (A runnable NumPy sketch of this two-phase procedure appears after the table.)
Open Source Code No The paper does not provide any concrete statement about releasing source code for the methodology, nor does it provide a link to a code repository.
Open Datasets No The paper describes a theoretical setup where 'the input x ∈ R^d is Gaussian and the target y ∈ R follows a multiple-index model, i.e., y = g(⟨u_1, x⟩, ..., ⟨u_k, x⟩) with a noisy link function g' and 'inputs satisfy x^(i) ~iid N(0, I_d)'. This describes the assumed data distribution for theoretical analysis, not a specific public or open dataset with access information.
Dataset Splits No The paper does not provide specific dataset split information (e.g., percentages, sample counts, or citations to predefined splits) for training, validation, or testing. It is a theoretical paper focusing on population risk and convergence.
Hardware Specification No The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for any computations or demonstrations. The work is theoretical.
Software Dependencies No The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers). The work is theoretical.
Experiment Setup No The paper does not contain specific experimental setup details such as concrete hyperparameter values, training configurations, or system-level settings. It is theoretical in nature and analyzes the dynamics of SGD without an empirical setup.
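
The central theoretical claim quoted in the Research Type row is that online SGD with weight decay drives the first-layer weights into the k-dimensional principal subspace span{u_1, ..., u_k}. Below is a minimal NumPy sketch of how that claim could be probed: it generates data from the multiple-index model y = g(⟨u_1, x⟩, ..., ⟨u_k, x⟩) + ε and measures what fraction of a weight matrix's Frobenius norm lies in span{u_1, ..., u_k}. The dimensions, the link function g, and the noise level are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, m = 100, 3, 256                 # ambient dimension, index dimension, width (illustrative)

# Orthonormal directions u_1, ..., u_k spanning the principal subspace (columns of U).
U = np.linalg.qr(rng.standard_normal((d, k)))[0]

def g(z):
    # Placeholder multiple-index link function g: R^k -> R (not the paper's).
    return np.tanh(z).sum(axis=-1)

def sample(n, noise=0.1):
    # Multiple-index model: x ~ N(0, I_d), y = g(<u_1, x>, ..., <u_k, x>) + eps.
    x = rng.standard_normal((n, d))
    y = g(x @ U) + noise * rng.standard_normal(n)
    return x, y

def subspace_alignment(W, U):
    # Fraction of ||W||_F^2 captured by projecting every row of W onto span(U).
    return np.sum((W @ U) ** 2) / np.sum(W ** 2)

W_init = rng.standard_normal((m, d))             # e.g. first-layer weights at initialization
W_in_span = rng.standard_normal((m, k)) @ U.T    # rows lying exactly in span(U)

print(subspace_alignment(W_init, U))     # ~ k/d, i.e. about 0.03 here
print(subspace_alignment(W_in_span, U))  # ~ 1.0, the regime the theorem predicts after SGD
```

At random initialization the alignment is roughly k/d; the paper's result predicts it should approach 1 after online SGD with weight decay, which is what makes the width-independent O(√(kd/T)) generalization bound possible.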
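Algorithm 1 quoted in the Pseudocode row is stated abstractly; the sketch below is one plausible NumPy rendering of its two phases, assuming the network is ŷ(x; W, a, b) = aᵀ ReLU(Wx + b) with squared loss and a single-index target. The initialization scales, step sizes, weight-decay values, the bias range used in step 4, and the link function are all illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def predict(x, W, a, b):
    # Assumed two-layer parameterization: y_hat = a^T ReLU(W x + b).
    return a @ relu(W @ x + b)

# Illustrative single-index instance: y = f(<u, x>) + noise with a monotone link f.
d, m, T = 30, 64, 5000
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
X = rng.standard_normal((T, d))                      # x^(t) ~ N(0, I_d)
y = np.tanh(X @ u) + 0.05 * rng.standard_normal(T)

# Initialization and hyperparameters (all illustrative, not the paper's values).
W = rng.standard_normal((m, d)) / np.sqrt(d)
a0 = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)
b0 = np.zeros(m)
eta, eta2 = 0.05, 0.05                               # step sizes eta_t, eta'_t (constant here)
lam, lam2 = 1e-2, 1e-3                               # weight-decay parameters lambda, lambda'

# Phase 1 (steps 1-3): online SGD with weight decay on W, second layer frozen at (a0, b0).
for t in range(T):
    x_t, y_t = X[t], y[t]
    pre = W @ x_t + b0
    err = predict(x_t, W, a0, b0) - y_t              # derivative of the squared loss 0.5*(y_hat - y)^2
    grad_W = np.outer(err * a0 * (pre > 0), x_t)     # gradient of the loss w.r.t. each row of W
    W = (1.0 - eta * lam) * W - eta * grad_W

# Phase 2 (steps 4-8): resample the biases, then SGD with weight decay on a only,
# drawing indices uniformly from the same T examples.
tau = 1.0                                            # Unif(-tau, tau) bias range (assumed)
b = rng.uniform(-tau, tau, size=m)
a = a0.copy()
for t in range(T):
    i = rng.integers(T)
    h = relu(W @ X[i] + b)
    err = a @ h - y[i]
    a = (1.0 - eta2 * lam2) * a - eta2 * err * h

# Diagnostic: fraction of first-layer energy along the principal direction u.
print(np.sum((W @ u) ** 2) / np.sum(W ** 2))
```

The final print reports the fraction of first-layer energy along u; under the paper's result this should be close to 1 after training, versus about 1/d at initialization.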