Implicit Regularization and Convergence for Weight Normalization
Authors: Xiaoxia Wu, Edgar Dobriban, Tongzheng Ren, Shanshan Wu, Zhiyuan Li, Suriya Gunasekar, Rachel Ward, Qiang Liu
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We show that this non-convex formulation has beneficial regularization effects compared to gradient descent on the original objective. These methods adaptively regularize the weights and converge close to the minimum ℓ2 norm solution, even for initializations far from zero. For certain stepsizes of g and w, we show that they can converge close to the minimum norm solution. This is different from the behavior of gradient descent, which converges to the minimum norm solution only when started at a point in the range space of the feature matrix, and is thus more sensitive to initialization. While normalization methods are practically popular and successful, their theoretical understanding has only started to emerge recently. Our Contributions. We consider the overparametrized least squares (LS) optimization problem... We show that WN and rPGD have the same limiting flow, the WN flow, in continuous time (Lemma 2.2). We characterize the stationary points of the loss, showing that the nonconvex reparametrization introduces some additional stationary points that are in general not global minima. However, we also show that the loss still decreases at a geometric rate if we can control the scale parameter g. |
| Researcher Affiliation | Collaboration | Xiaoxia Wu: University of Texas at Austin; Edgar Dobriban: University of Pennsylvania; Tongzheng Ren: University of Texas at Austin; Shanshan Wu: Google Research; Zhiyuan Li: Princeton University; Suriya Gunasekar: Microsoft Research; Rachel Ward: University of Texas at Austin; Qiang Liu: University of Texas at Austin |
| Pseudocode | Yes | Algorithm 1 (WN for (2)). Input: unit-norm w_0 and scalar g_0, iterations T, step-sizes {γ_t}_{t=0}^{T-1} and {η_t}_{t=0}^{T-1}. For t = 0, 1, 2, ..., T-1: w_{t+1} = w_t − η_t ∇_w h(w_t, g_t); g_{t+1} = g_t − γ_t ∇_g h(w_t, g_t). Algorithm 2 (rPGD for (3)). Input: unit-norm w_0 and g_0, number of iterations T, step-sizes {γ_t}_{t=0}^{T-1} and {η_t}_{t=0}^{T-1}. For t = 0, 1, 2, ..., T-1: v_t = w_t − η_t ∇_w f(w_t, g_t) (gradient step); w_{t+1} = v_t / ‖v_t‖ (projection); g_{t+1} = g_t − γ_t ∇_g f(w_t, g_t) (gradient step). A runnable sketch of these updates is given after the table. |
| Open Source Code | No | The paper does not include an unambiguous statement about releasing code or a link to a source code repository. |
| Open Datasets | No | The paper focuses on theoretical analysis of an optimization problem (least squares regression) and does not use or specify a publicly available or open dataset. |
| Dataset Splits | No | The paper is a theoretical work and does not describe experimental setups that would involve dataset splits for training, validation, or testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments or simulations. |
| Software Dependencies | No | The paper does not provide specific software names with version numbers required for replication. |
| Experiment Setup | No | The paper discusses theoretical parameters like stepsizes and initialization for its algorithms but does not provide concrete experimental setup details such as hyperparameter values (e.g., learning rate, batch size, epochs) or training configurations for an empirical setup. |
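
The pseudocode quoted above (Algorithms 1 and 2) can be exercised on a toy overparametrized least-squares instance. The sketch below is illustrative only: it assumes the reparametrized objective f(w, g) = ½‖X(g·w) − y‖² and the weight-normalized objective h(w, g) = ½‖X(g·w/‖w‖) − y‖² implied by the paper's setup, and the data, step sizes, and iteration counts are hypothetical rather than the ones analyzed in the paper. It compares both iterates against the minimum ℓ2-norm interpolating solution, which the paper argues WN and rPGD approach even from initializations far from zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy overparametrized least-squares instance: n samples, d > n features.
n, d = 20, 100
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum ell_2-norm interpolating solution, used only as a reference point.
x_min_norm = np.linalg.pinv(X) @ y

def grad_f(w, g):
    """Gradients of the reparametrized loss f(w, g) = 0.5 * ||X (g*w) - y||^2."""
    r = X @ (g * w) - y
    return g * (X.T @ r), w @ (X.T @ r)          # (grad wrt w, grad wrt g)

def grad_h(w, g):
    """Gradients of the WN loss h(w, g) = 0.5 * ||X (g*w/||w||) - y||^2."""
    nw = np.linalg.norm(w)
    u = w / nw
    r = X @ (g * u) - y
    Xtr = X.T @ r
    grad_w = (g / nw) * (Xtr - (u @ Xtr) * u)     # tangential component only
    grad_g = u @ Xtr
    return grad_w, grad_g

def run_wn(w0, g0, T=5000, eta=1e-3, gamma=1e-3):
    """Algorithm 1 (WN): plain gradient descent on (w, g) of h."""
    w, g = w0.copy(), g0
    for _ in range(T):
        gw, gg = grad_h(w, g)
        w = w - eta * gw
        g = g - gamma * gg
    return g * w / np.linalg.norm(w)              # predictor x = g * w / ||w||

def run_rpgd(w0, g0, T=5000, eta=1e-3, gamma=1e-3):
    """Algorithm 2 (rPGD): gradient step on w, projection to the unit sphere, gradient step on g."""
    w, g = w0.copy(), g0
    for _ in range(T):
        gw, gg = grad_f(w, g)
        v = w - eta * gw                          # gradient step
        w = v / np.linalg.norm(v)                 # projection
        g = g - gamma * gg                        # gradient step
    return g * w                                  # predictor x = g * w

# Initialization far from zero, not restricted to the row space of X.
w0 = rng.standard_normal(d)
w0 /= np.linalg.norm(w0)
g0 = 1.0

for name, x_hat in [("WN", run_wn(w0, g0)), ("rPGD", run_rpgd(w0, g0))]:
    print(f"{name}: train residual {np.linalg.norm(X @ x_hat - y):.2e}, "
          f"distance to min-norm solution {np.linalg.norm(x_hat - x_min_norm):.2e}")
```

Both runs should drive the training residual toward zero while the recovered predictor stays close to (though not exactly at) the minimum-norm solution, consistent with the "converge close to the minimum ℓ2 norm solution" claim quoted in the Research Type row; the specific step-size conditions under which this is proved are stated in the paper itself.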