Adaptive Gradient Methods with Dynamic Bound of Learning Rate

Authors: Liangchen Luo, Yuanhao Xiong, Yan Liu, Xu Sun

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We further conduct experiments on various popular tasks and models, which is often insufficient in previous work. Experimental results show that new variants can eliminate the generalization gap between adaptive methods and SGD and maintain higher learning speed early in training at the same time."
Researcher Affiliation | Collaboration | "MOE Key Lab of Computational Linguistics, School of EECS, Peking University; College of Information Science and Electronic Engineering, Zhejiang University; Department of Computer Science, University of Southern California; Center for Data Science, Beijing Institute of Big Data Research, Peking University; {luolc,xusun}@pku.edu.cn; xiongyh@zju.edu.cn; yanliu.cs@usc.edu. Equal contribution. This work was done when the first and second authors were on an internship at DiDi AI Labs."
Pseudocode | Yes | "Algorithm 1: Generic framework of optimization methods" (an illustrative sketch follows the table)
Open Source Code | Yes | "The implementation of the algorithm can be found at https://github.com/Luolc/AdaBound."
Open Datasets | Yes | "We focus on three tasks: the MNIST image classification task (Lecun et al., 1998), the CIFAR-10 image classification task (Krizhevsky & Hinton, 2009), and the language modeling task on Penn Treebank (Marcus et al., 1993)."
Dataset Splits | No | "Adaptive methods often display faster progress in the initial portion of the training, but their performance quickly plateaus on the unseen data (development/test set) (Wilson et al., 2017)."
Hardware Specification | No | "We focus on three tasks: the MNIST image classification task (Lecun et al., 1998), the CIFAR-10 image classification task (Krizhevsky & Hinton, 2009), and the language modeling task on Penn Treebank (Marcus et al., 1993)."
Software Dependencies | No | "The implementation of the algorithm can be found at https://github.com/Luolc/AdaBound."
Experiment Setup | Yes | "To tune the step size, we follow the method in Wilson et al. (2017). We implement a logarithmically-spaced grid of five step sizes. If the best performing parameter is at one of the extremes of the grid, we will try new grid points so that the best performing parameters are at one of the middle points in the grid. Specifically, we tune over hyperparameters in the following way." (a sketch of this grid procedure follows the table)
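Note on the Pseudocode row: Algorithm 1 of the paper is a generic framework whose update is x_{t+1} = x_t - alpha_t * m_t / sqrt(V_t), and AdaBound is obtained by clipping the element-wise step size between bounds that converge to a constant final learning rate. The sketch below is a minimal NumPy illustration of one such bounded update, not the authors' reference code; the bound schedule and the hyperparameter names final_lr and gamma are assumptions modeled on the linked repository.

    import numpy as np

    def adabound_step(x, g, m, v, t, alpha=1e-3, final_lr=0.1,
                      beta1=0.9, beta2=0.999, eps=1e-8, gamma=1e-3):
        # Adam-style moment estimates with bias correction.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Dynamic bounds: both converge to final_lr as t grows, so the
        # update gradually behaves like SGD with that constant rate.
        lower = final_lr * (1 - 1 / (gamma * t + 1))
        upper = final_lr * (1 + 1 / (gamma * t))
        # Clip the element-wise step size, then take the step.
        step = np.clip(alpha / (np.sqrt(v_hat) + eps), lower, upper)
        return x - step * m_hat, m, v

Called in a loop with t starting at 1, the clip is rarely active early on (the bounds are very wide), so the behavior is close to Adam; as t grows the step size is pinned near final_lr, mimicking SGD.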
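Note on the Experiment Setup row: the quoted tuning protocol (following Wilson et al., 2017) searches a logarithmically spaced grid of five step sizes and extends the grid whenever the best candidate lands on an edge. A hedged sketch of that loop is below; tune_step_size, the evaluate callback, the grid span, and the round limit are illustrative assumptions rather than details taken from the paper.

    import numpy as np

    def tune_step_size(evaluate, center=1e-3, span=2.0, num=5, max_rounds=5):
        # Search a log-spaced grid of `num` step sizes around `center`.
        # If the best candidate sits at either edge of the grid,
        # re-center the grid there and repeat, so the final winner
        # is an interior point of the grid.
        for _ in range(max_rounds):
            grid = np.logspace(np.log10(center) - span,
                               np.log10(center) + span, num)
            scores = [evaluate(lr) for lr in grid]
            best = int(np.argmax(scores))
            if 0 < best < num - 1:
                return grid[best]
            center = grid[best]
        return center

    # Example with a toy validation score that peaks at lr = 1e-2.
    best_lr = tune_step_size(lambda lr: -abs(np.log10(lr) + 2))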