Beyond Lazy Training for Over-parameterized Tensor Decomposition

Authors: Xiang Wang, Chenwei Wu, Jason D. Lee, Tengyu Ma, Rong Ge

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper we study a closely related tensor decomposition problem: given an l-th order tensor in (R^d)^{⊗l} of rank r (where r ≪ d), can variants of gradient descent find a rank m decomposition where m > r? We show that in a lazy training regime (similar to the NTK regime for neural networks) one needs at least m = Ω(d^{l-1}), while a variant of gradient descent can find an approximate tensor when m = O*(r^{2.5l} log d). Our results show that gradient descent on over-parametrized objective could go beyond the lazy training regime and utilize certain low-rank structure in the data.
Researcher Affiliation | Academia | Xiang Wang (Duke University, xwang@cs.duke.edu); Chenwei Wu (Duke University, cwwu@cs.duke.edu); Jason D. Lee (Princeton University, jasonlee@princeton.edu); Tengyu Ma (Stanford University, tengyuma@stanford.edu); Rong Ge (Duke University, rongge@cs.duke.edu)
Pseudocode | Yes | Algorithm 1: Variant of Gradient Descent for Tensor Decomposition (a simplified, hedged sketch of a gradient-descent decomposition loop is given after this table)
Open Source Code | No | The paper does not provide any statement or link indicating that source code for the described methodology is publicly available.
Open Datasets | No | This paper is theoretical and does not conduct experiments on datasets, so there is no mention of publicly available datasets for training.
Dataset Splits | No | As this is a theoretical paper, there are no empirical experiments with data splits for training, validation, or testing.
Hardware Specification | No | As this is a theoretical paper, no empirical experiments are conducted and therefore no hardware specifications are mentioned.
Software Dependencies | No | As this is a theoretical paper, no empirical experiments are conducted and therefore no specific software dependencies or versions are mentioned.
Experiment Setup | No | The paper describes theoretical parameters for the algorithm (e.g., K, H, δ, η), but these are part of the theoretical analysis and not concrete experimental setup details or hyperparameters for empirical reproduction.
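
To make the setting described in the Research Type and Pseudocode rows concrete, the sketch below runs plain gradient descent on an over-parameterized rank-m decomposition of a synthetic symmetric rank-r tensor. This is an illustrative assumption rather than the paper's Algorithm 1 (a modified variant of gradient descent with additional parameters such as K, H, δ, and η); the order l, dimension d, ranks r and m, step size, initialization scale, and iteration count are all hypothetical choices.

```python
# Hedged sketch: plain gradient descent on the over-parameterized symmetric
# decomposition objective 0.5 * ||sum_i a_i^{⊗l} - T||_F^2.
# This is NOT the paper's Algorithm 1; all hyperparameters are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
l, d, r, m = 3, 8, 2, 16        # tensor order, dimension, true rank, over-parameterized rank
lr, steps = 0.02, 2000          # step size and iteration count (not tuned)

def rank_one(a, order):
    """Symmetric rank-one tensor a ⊗ a ⊗ ... ⊗ a of the given order."""
    t = a
    for _ in range(order - 1):
        t = np.tensordot(t, a, axes=0)
    return t

# Ground-truth tensor: sum of r symmetric rank-one components.
true_components = [rng.standard_normal(d) for _ in range(r)]
T = sum(rank_one(a, l) for a in true_components)

# Over-parameterized model with m > r components, small random initialization.
A = 0.1 * rng.standard_normal((m, d))

def residual(A):
    return sum(rank_one(A[i], l) for i in range(m)) - T

for _ in range(steps):
    R = residual(A)
    grad = np.zeros_like(A)
    for i in range(m):
        # Gradient of the loss w.r.t. a_i: l * (R contracted with a_i on l-1 modes).
        g = R
        for _ in range(l - 1):
            g = np.tensordot(g, A[i], axes=([g.ndim - 1], [0]))
        grad[i] = l * g
    A -= lr * grad

print("final loss:", 0.5 * np.linalg.norm(residual(A)) ** 2)
```

Because m exceeds the true rank r, this objective is over-parameterized in the sense studied by the paper. The loss of this plain loop typically decreases on small instances, but the paper's guarantees (the m = Ω(d^{l-1}) lower bound in the lazy regime and the m = O*(r^{2.5l} log d) upper bound for its modified algorithm) do not apply to this simplified code.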