Adaptive Gradient Methods with Dynamic Bound of Learning Rate
Authors: Liangchen Luo, Yuanhao Xiong, Yan Liu, Xu Sun
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We further conduct experiments on various popular tasks and models, which is often insufficient in previous work. Experimental results show that new variants can eliminate the generalization gap between adaptive methods and SGD and maintain higher learning speed early in training at the same time. |
| Researcher Affiliation | Collaboration | MOE Key Lab of Computational Linguistics, School of EECS, Peking University; College of Information Science and Electronic Engineering, Zhejiang University; Department of Computer Science, University of Southern California; Center for Data Science, Beijing Institute of Big Data Research, Peking University. {luolc,xusun}@pku.edu.cn, xiongyh@zju.edu.cn, yanliu.cs@usc.edu. Equal contribution. This work was done when the first and second authors were on an internship at DiDi AI Labs. |
| Pseudocode | Yes | Algorithm 1 Generic framework of optimization methods |
| Open Source Code | Yes | The implementation of the algorithm can be found at https://github.com/Luolc/AdaBound. |
| Open Datasets | Yes | We focus on three tasks: the MNIST image classification task (Lecun et al., 1998), the CIFAR-10 image classification task (Krizhevsky & Hinton, 2009), and the language modeling task on Penn Treebank (Marcus et al., 1993). |
| Dataset Splits | No | Adaptive methods often display faster progress in the initial portion of the training, but their performance quickly plateaus on the unseen data (development/test set) (Wilson et al., 2017). |
| Hardware Specification | No | We focus on three tasks: the MNIST image classification task (Lecun et al., 1998), the CIFAR-10 image classification task (Krizhevsky & Hinton, 2009), and the language modeling task on Penn Treebank (Marcus et al., 1993). |
| Software Dependencies | No | The implementation of the algorithm can be found at https://github.com/Luolc/AdaBound. |
| Experiment Setup | Yes | To tune the step size, we follow the method in Wilson et al. (2017). We implement a logarithmically-spaced grid of five step sizes. If the best performing parameter is at one of the extremes of the grid, we will try new grid points so that the best performing parameters are at one of the middle points in the grid. Specifically, we tune over hyperparameters in the following way. |
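
The "Pseudocode" and "Open Source Code" rows refer to the paper's generic optimizer framework (Algorithm 1) and the released AdaBound implementation. The sketch below illustrates the core idea of the method, an Adam-style update whose per-element step size is clipped into dynamic bounds that converge to a final SGD-like learning rate. It is a minimal sketch, not the authors' reference implementation (see https://github.com/Luolc/AdaBound for that); the specific `final_lr=0.1` value and the exact bound schedule are assumptions taken from the paper's described setup.

```python
import numpy as np

def adabound_step(param, grad, m, v, t,
                  lr=1e-3, final_lr=0.1, betas=(0.9, 0.999), eps=1e-8):
    """One parameter update with a dynamically bounded learning rate (sketch)."""
    beta1, beta2 = betas

    # Adam-style exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2

    # Bias correction, as in Adam.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Dynamic bounds: wide at the start (Adam-like behaviour), shrinking
    # toward final_lr as t grows (SGD-like behaviour late in training).
    lower = final_lr * (1 - 1 / ((1 - beta2) * t + 1))
    upper = final_lr * (1 + 1 / ((1 - beta2) * t))

    # Element-wise step size clipped into [lower, upper].
    step_size = np.clip(lr / (np.sqrt(v_hat) + eps), lower, upper)

    param = param - step_size * m_hat
    return param, m, v
```

Because the bounds tighten around `final_lr` over time, early iterations retain the fast progress of adaptive methods while later iterations approach a constant step size, which is the mechanism the paper credits for closing the generalization gap with SGD.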
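The "Experiment Setup" row quotes the step-size tuning protocol borrowed from Wilson et al. (2017). The following is a hedged sketch of that protocol, not code from the paper: it evaluates a logarithmically spaced grid of five step sizes and extends the grid whenever the best value falls on an edge, stopping once the best value is an interior point. The `evaluate` callback, the grid center, and the spacing factor are hypothetical placeholders.

```python
import numpy as np

def tune_step_size(evaluate, center=1e-3, factor=10.0, grid_size=5):
    """Grid-search a step size on a log-spaced grid, extending past the edges."""
    # Initial logarithmically spaced grid centered on `center`.
    grid = [center * factor ** (i - grid_size // 2) for i in range(grid_size)]
    scores = [evaluate(lr) for lr in grid]

    while True:
        best = int(np.argmax(scores))
        if best == 0:                      # best at the small end: extend downward
            grid.insert(0, grid[0] / factor)
            scores.insert(0, evaluate(grid[0]))
        elif best == len(grid) - 1:        # best at the large end: extend upward
            grid.append(grid[-1] * factor)
            scores.append(evaluate(grid[-1]))
        else:                              # best is an interior grid point: done
            return grid[best]
```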