Adaptive Methods for Nonconvex Optimization
Authors: Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, Sanjiv Kumar
NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our analysis shows that under this scenario such methods do converge to stationarity up to the statistical limit of variance in the stochastic gradients (scaled by a constant factor). In particular, our result implies that increasing minibatch sizes enables convergence, thus providing a way to circumvent the nonconvergence issues. Furthermore, we provide a new adaptive optimization algorithm, YOGI, which controls the increase in effective learning rate, leading to even better performance with similar theoretical guarantees on convergence. Extensive experiments show that YOGI with very little hyperparameter tuning outperforms methods such as ADAM in several challenging machine learning tasks. |
| Researcher Affiliation | Collaboration | Manzil Zaheer, Google Research, manzilzaheer@google.com; Sashank J. Reddi, Google Research, sashank@google.com; Devendra Sachan, Carnegie Mellon University, dsachan@andrew.cmu.edu; Satyen Kale, Google Research, satyenkale@google.com; Sanjiv Kumar, Google Research, sanjivk@google.com |
| Pseudocode | Yes | Algorithm 1 ADAM; Algorithm 2 YOGI (a minimal sketch of the YOGI update rule appears after this table) |
| Open Source Code | No | The paper mentions that "The code for Inception-Resnet-v2 is available at https://github.com/tensorflow/models/blob/master/research/slim/train_image_classifier.py." This link refers to a third-party model used in one specific experiment, not the authors' own implementation of YOGI or the general methodology described in the paper. There is no clear statement or link providing access to the source code for their proposed algorithm or its general framework. |
| Open Datasets | Yes | Deep Autoencoder. ...CURVES and MNIST... Neural Machine Translation. ...IWSLT 15 En-Vi [18] and WMT 14 En-De datasets... ResNets and DenseNets. ...CIFAR-10 dataset... Deep Sets. ...ModelNet40 dataset [38]... Named Entity Recognition (NER). ...BC5CDR biomedical data [17] |
| Dataset Splits | Yes | To this end, we chose the simple learning rate schedule of reducing the learning rate by a constant factor when performance metric plateaus on the validation/test set... We perform experiments on the IWSLT 15 En-Vi [18] and WMT 14 En-De datasets with the standard train, validation and test splits. |
| Hardware Specification | Yes | We run our experiments on a commodity machine with Intel® Xeon® CPU E5-2630 v4, 256GB RAM, and 8 Nvidia® Titan Xp GPU. |
| Software Dependencies | No | The paper mentions the use of certain frameworks implicitly (e.g., by linking to a TensorFlow model), but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or other libraries). |
| Experiment Setup | Yes | Typically, for obtaining the state-of-the-art results extensive hyperparameter tuning and carefully designed learning rate schedules are required... For YOGI, we propose to initialize v_t based on the gradient square evaluated at the initial point averaged over a (reasonably large) mini-batch. Decreasing learning rate is typically necessary for superior performance. To this end, we chose the simple learning rate schedule of reducing the learning rate by a constant factor when the performance metric plateaus on the validation/test set (commonly known as ReduceLROnPlateau). Inspired from our theoretical analysis, we set a moderate value of ϵ = 10⁻³ in YOGI for all the experiments in order to control the adaptivity. ...All our experiments were run for 5000 epochs utilizing the ReduceLROnPlateau schedule with patience of 20 epochs and decay factor of 0.5 with a batch size of 128. (A hedged configuration sketch of this setup follows the table.) |
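As referenced in the Pseudocode row, the paper's Algorithm 2 (YOGI) differs from ADAM only in how the second-moment estimate is updated. The NumPy sketch below is a minimal illustration of that update rule; the function name, signature, and default learning rate are our own choices, and only the form of the update and the ϵ = 10⁻³ recommendation come from the paper.

```python
# Illustrative sketch of the YOGI update (Algorithm 2), written in NumPy.
# yogi_step, lr, beta1, beta2 are our own names/defaults, not the paper's code.
import numpy as np

def yogi_step(theta, grad, m, v, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-3):
    """One YOGI update. It differs from ADAM only in the v update:
    v_t = v_{t-1} - (1 - beta2) * sign(v_{t-1} - g_t^2) * g_t^2,
    an additive change that limits how fast the effective learning rate
    lr / (sqrt(v) + eps) can grow when v shrinks."""
    g2 = grad * grad
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate (same as ADAM)
    v = v - (1.0 - beta2) * np.sign(v - g2) * g2  # sign-controlled second-moment update
    theta = theta - lr * m / (np.sqrt(v) + eps)   # parameter update
    return theta, m, v
```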
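The Experiment Setup row can likewise be read as a training-loop configuration. The sketch below (reusing `yogi_step` from the previous snippet) is an assumption-laden outline, not the authors' code: `grad_fn`, `val_metric`, and the data handling are hypothetical placeholders; only the reported settings (v_t initialized from squared gradients on a large mini-batch, ϵ = 10⁻³, 5000 epochs, batch size 128, ReduceLROnPlateau with patience 20 and decay factor 0.5) come from the table above.

```python
# Hypothetical training loop wiring together the reported settings; reuses
# yogi_step() from the sketch above. grad_fn(theta, batch) and val_metric(theta)
# are placeholder callables we assume, not code from the paper.
import numpy as np

def init_v(grad_fn, theta, large_batch):
    """v_0 from squared gradients at the initial point, averaged over a large mini-batch."""
    g = grad_fn(theta, large_batch)
    return g * g

def train_yogi(theta, grad_fn, val_metric, train_batches, large_batch,
               lr=1e-2, eps=1e-3, epochs=5000, patience=20, decay=0.5):
    m = np.zeros_like(theta)
    v = init_v(grad_fn, theta, large_batch)        # paper's proposed v_t initialization
    best, wait = np.inf, 0
    for _ in range(epochs):                        # 5000 epochs in the reported runs
        for batch in train_batches:                # batches of size 128 in the reported runs
            theta, m, v = yogi_step(theta, grad_fn(theta, batch), m, v, lr=lr, eps=eps)
        metric = val_metric(theta)                 # validation/test performance metric (lower = better here)
        if metric < best:
            best, wait = metric, 0
        else:
            wait += 1
            if wait >= patience:                   # ReduceLROnPlateau: halve lr after a plateau
                lr *= decay
                wait = 0
    return theta
```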