Does Momentum Change the Implicit Regularization on Separable Data?
Authors: Bohan Wang, Qi Meng, Huishuai Zhang, Ruoyu Sun, Wei Chen, Zhi-Ming Ma, Tie-Yan Liu
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments are conducted to support our theoretical results. |
| Researcher Affiliation | Collaboration | Bohan Wang, University of Science & Technology of China (bhwangfy@gmail.com); Qi Meng, Microsoft Research Asia (meq@microsoft.com); Huishuai Zhang, Microsoft Research Asia (huzhang@microsoft.com); Ruoyu Sun, The Chinese University of Hong Kong, Shenzhen (sunruoyu@cuhk.edu.cn); Wei Chen, Chinese Academy of Sciences (chenwei2022@ict.ac.cn); Zhi-Ming Ma, Chinese Academy of Sciences (mazm@amt.ac.cn); Tie-Yan Liu, Microsoft Research Asia (tyliu@microsoft.com) |
| Pseudocode | Yes | GDM's update rule is m(0) = 0, m(t) = βm(t−1) + (1−β)∇L(w(t)), w(t+1) = w(t) − ηm(t) (Eq. 1 in the paper); a runnable sketch of this update appears after the table. |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code specific to the methodology described in the paper. |
| Open Datasets | Yes | We use the synthetic dataset in [34] with learning rate 1/σ²_max. Figure (b) shows that (1) all the optimizers converge to the max-margin solution, and (2) the asymptotic behaviors with and without momentum are similar. The experimental observations support our theoretical results. Furthermore, it is worth mentioning that the experimental phenomenon that adding momentum does not change the implicit regularization has also been observed in existing literature [34, 23, 43]. ...run SGD and SGDM on neural networks to classify the MNIST dataset and compare their implicit regularization. |
| Dataset Splits | No | The paper mentions using a "synthetic dataset" and the "MNIST dataset" but does not specify the training, validation, or test splits used for its experiments. While these datasets often have standard splits, the paper itself does not explicitly state them. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies or version numbers needed to replicate the experiments. |
| Experiment Setup | Yes | We consider optimizers with a constant learning rate and constant momentum hyper-parameters, which are widely adopted in practice, e.g., the default setting in popular machine learning frameworks [27] and in experiments [46]. Our main results are summarized in Theorem 1. Theorem 1 (informal): with a linearly separable dataset S, a linear model, and an exponential-tailed loss, for GDM with a constant learning rate the parameter norm diverges to infinity, with its direction converging to the L2 max-margin solution; the same conclusion holds for SGDM with a constant learning rate, and for deterministic Adam with a constant learning rate and stochastic RMSProp (i.e., Adam without momentum) with a decaying learning rate. ...we use the synthetic dataset in [34] with learning rate 1/σ²_max (see the experiment sketch after this table). |
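
For concreteness, the GDM update rule quoted in the Pseudocode row (Eq. 1) can be written out as a short numerical routine. The sketch below is illustrative only: the function name `gdm`, its signature, and the default hyper-parameters are ours, not the paper's, and `grad_fn` stands in for the gradient of whatever loss L is being minimized.

```python
import numpy as np

def gdm(grad_fn, w0, lr=0.1, beta=0.9, steps=1000):
    """Gradient descent with momentum, following Eq. (1):
    m(0) = 0, m(t) = beta*m(t-1) + (1-beta)*grad L(w(t)), w(t+1) = w(t) - lr*m(t).
    Setting beta = 0 recovers plain gradient descent."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)                    # gradient of the loss at the current iterate
        m = beta * m + (1.0 - beta) * g   # exponential moving average of past gradients
        w = w - lr * m                    # update with a constant learning rate
    return w
```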
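The experiment-setup row can likewise be turned into a small end-to-end check. The sketch below is an assumption-laden reconstruction, not the paper's code: it generates its own linearly separable data rather than the dataset of [34], applies the Eq. (1) update with learning rate 1/σ²_max and an exponential loss on a linear model, and compares the resulting direction against the L2 max-margin direction obtained from scikit-learn's `SVC` with a large C (approximating the hard-margin solution).

```python
import numpy as np
from sklearn.svm import SVC   # used only to obtain a reference L2 max-margin direction

rng = np.random.default_rng(0)

# Synthetic linearly separable data (a stand-in for the dataset of [34]).
w_star = np.array([1.0, -1.0])
X = rng.normal(size=(400, 2))
margin = X @ w_star
keep = np.abs(margin) > 0.5                    # enforce a strict margin so the data are separable
X, y = X[keep], np.sign(margin[keep])

# Exponential-tailed loss for a linear model: L(w) = mean_i exp(-y_i <x_i, w>).
def grad(w):
    r = np.exp(-y * (X @ w))                   # per-sample exponential losses
    return -(X * (y * r)[:, None]).mean(axis=0)

lr = 1.0 / np.linalg.svd(X, compute_uv=False)[0] ** 2   # learning rate 1 / sigma_max^2

def run(beta, steps=20000):
    """Run the Eq. (1) update; beta = 0 is plain GD, beta > 0 is GDM."""
    w = np.zeros(2)
    m = np.zeros(2)
    for _ in range(steps):
        m = beta * m + (1.0 - beta) * grad(w)
        w = w - lr * m
    return w / np.linalg.norm(w)               # only the direction matters asymptotically

w_gd, w_gdm = run(beta=0.0), run(beta=0.9)

# Hard-margin (L2 max-margin) direction for comparison.
svm = SVC(kernel="linear", C=1e6).fit(X, y)
w_svm = svm.coef_.ravel() / np.linalg.norm(svm.coef_)

print("cos(GD,  max-margin) =", w_gd @ w_svm)
print("cos(GDM, max-margin) =", w_gdm @ w_svm)
```

Both cosine similarities should approach 1 as the number of steps grows, matching the paper's claim that momentum does not change the max-margin implicit bias; the exact values depend on the synthetic data and step count assumed here.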