Implicit Regularization and Convergence for Weight Normalization
Authors: Xiaoxia Wu, Edgar Dobriban, Tongzheng Ren, Shanshan Wu, Zhiyuan Li, Suriya Gunasekar, Rachel Ward, Qiang Liu
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We show that this non-convex formulation has beneficial regularization effects compared to gradient descent on the original objective. These methods adaptively regularize the weights and converge close to the minimum ℓ2 norm solution, even for initializations far from zero. For certain stepsizes of g and w, we show that they can converge close to the minimum norm solution. This is different from the behavior of gradient descent, which converges to the minimum norm solution only when started at a point in the range space of the feature matrix, and is thus more sensitive to initialization. While normalization methods are practically popular and successful, their theoretical understanding has only started to emerge recently. Our Contributions. We consider the overparametrized least squares (LS) optimization problem... We show that WN and rPGD have the same limiting flow, the WN flow, in continuous time (Lemma 2.2). We characterize the stationary points of the loss, showing that the nonconvex reparametrization introduces some additional stationary points that are in general not global minima. However, we also show that the loss still decreases at a geometric rate if we can control the scale parameter g. |
| Researcher Affiliation | Collaboration | Xiaoxia Wu: University of Texas at Austin; Edgar Dobriban: University of Pennsylvania; Tongzheng Ren: University of Texas at Austin; Shanshan Wu: Google Research; Zhiyuan Li: Princeton University; Suriya Gunasekar: Microsoft Research; Rachel Ward: University of Texas at Austin; Qiang Liu: University of Texas at Austin |
| Pseudocode | Yes | Algorithm 1 (WN for (2)). Input: unit-norm w_0 and scalar g_0, iterations T, step-sizes {γ_t}_{t=0}^{T-1} and {η_t}_{t=0}^{T-1}. For t = 0, 1, 2, ..., T-1: w_{t+1} = w_t − η_t ∇_w h(w_t, g_t); g_{t+1} = g_t − γ_t ∇_g h(w_t, g_t). Algorithm 2 (rPGD for (3)). Input: unit-norm w_0 and g_0, number of iterations T, step-sizes {γ_t}_{t=0}^{T-1} and {η_t}_{t=0}^{T-1}. For t = 0, 1, 2, ..., T-1: v_t = w_t − η_t ∇_w f(w_t, g_t) (gradient step); w_{t+1} = v_t / ‖v_t‖ (projection); g_{t+1} = g_t − γ_t ∇_g f(w_t, g_t) (gradient step). A runnable sketch of these updates is given after the table. |
| Open Source Code | No | The paper does not include an unambiguous statement about releasing code or a link to a source code repository. |
| Open Datasets | No | The paper focuses on theoretical analysis of an optimization problem (least squares regression) and does not use or specify a publicly available or open dataset. |
| Dataset Splits | No | The paper is a theoretical work and does not describe experimental setups that would involve dataset splits for training, validation, or testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments or simulations. |
| Software Dependencies | No | The paper does not provide specific software names with version numbers required for replication. |
| Experiment Setup | No | The paper discusses theoretical parameters like stepsizes and initialization for its algorithms but does not provide concrete experimental setup details such as hyperparameter values (e.g., learning rate, batch size, epochs) or training configurations for an empirical setup. |
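
The pseudocode quoted above (Algorithms 1 and 2) can be exercised on a toy overparametrized least-squares instance. The sketch below is illustrative only: it assumes the reparametrized objective f(w, g) = ½‖X(g·w) − y‖² and the weight-normalized objective h(w, g) = ½‖X(g·w/‖w‖) − y‖² implied by the paper's setup, and the data, step sizes, and iteration counts are hypothetical rather than the ones analyzed in the paper. It compares both iterates against the minimum ℓ2-norm interpolating solution, which the paper argues WN and rPGD approach even from initializations far from zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy overparametrized least-squares instance: n samples, d > n features.
n, d = 20, 100
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum ell_2-norm interpolating solution, used only as a reference point.
x_min_norm = np.linalg.pinv(X) @ y

def grad_f(w, g):
    """Gradients of the reparametrized loss f(w, g) = 0.5 * ||X (g*w) - y||^2."""
    r = X @ (g * w) - y
    return g * (X.T @ r), w @ (X.T @ r)          # (grad wrt w, grad wrt g)

def grad_h(w, g):
    """Gradients of the WN loss h(w, g) = 0.5 * ||X (g*w/||w||) - y||^2."""
    nw = np.linalg.norm(w)
    u = w / nw
    r = X @ (g * u) - y
    Xtr = X.T @ r
    grad_w = (g / nw) * (Xtr - (u @ Xtr) * u)     # tangential component only
    grad_g = u @ Xtr
    return grad_w, grad_g

def run_wn(w0, g0, T=5000, eta=1e-3, gamma=1e-3):
    """Algorithm 1 (WN): plain gradient descent on (w, g) of h."""
    w, g = w0.copy(), g0
    for _ in range(T):
        gw, gg = grad_h(w, g)
        w = w - eta * gw
        g = g - gamma * gg
    return g * w / np.linalg.norm(w)              # predictor x = g * w / ||w||

def run_rpgd(w0, g0, T=5000, eta=1e-3, gamma=1e-3):
    """Algorithm 2 (rPGD): gradient step on w, projection to the unit sphere, gradient step on g."""
    w, g = w0.copy(), g0
    for _ in range(T):
        gw, gg = grad_f(w, g)
        v = w - eta * gw                          # gradient step
        w = v / np.linalg.norm(v)                 # projection
        g = g - gamma * gg                        # gradient step
    return g * w                                  # predictor x = g * w

# Initialization far from zero, not restricted to the row space of X.
w0 = rng.standard_normal(d)
w0 /= np.linalg.norm(w0)
g0 = 1.0

for name, x_hat in [("WN", run_wn(w0, g0)), ("rPGD", run_rpgd(w0, g0))]:
    print(f"{name}: train residual {np.linalg.norm(X @ x_hat - y):.2e}, "
          f"distance to min-norm solution {np.linalg.norm(x_hat - x_min_norm):.2e}")
```

Both runs should drive the training residual toward zero while the recovered predictor stays close to (though not exactly at) the minimum-norm solution, consistent with the "converge close to the minimum ℓ2 norm solution" claim quoted in the Research Type row; the specific step-size conditions under which this is proved are stated in the paper itself.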