AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly

Authors: Yuchen Jin, Tianyi Zhou, Liangyu Zhao, Yibo Zhu, Chuanxiong Guo, Marco Canini, Arvind Krishnamurthy

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the advantages and the generality of AutoLRS through extensive experiments of training DNNs for tasks from diverse domains using different optimizers.
Researcher Affiliation | Collaboration | Yuchen Jin, Tianyi Zhou, Liangyu Zhao (University of Washington) {yuchenj, tianyizh, liangyu}@cs.washington.edu; Yibo Zhu, Chuanxiong Guo (ByteDance Inc.) {zhuyibo, guochuanxiong}@bytedance.com; Marco Canini (KAUST) marco@kaust.edu.sa; Arvind Krishnamurthy (University of Washington) arvind@cs.washington.edu
Pseudocode | Yes | Algorithm 1: AutoLRS. Input: (1) number of steps in each training stage, τ; (2) learning-rate search interval (ηmin, ηmax); (3) number of LRs to evaluate by BO in each training stage, k; (4) number of training steps to evaluate each LR in BO, τ'; (5) trade-off weight in the acquisition function of BO, κ. (A minimal sketch of this per-stage BO loop is given after the table.)
Open Source Code | Yes | The AutoLRS implementation is available at https://github.com/YuchenJin/autolrs.
Open Datasets | Yes | ResNet-50 (He et al., 2016a) on ImageNet classification (Russakovsky et al., 2015); Transformer (Vaswani et al., 2017) and BERT (Devlin et al., 2019) for NLP tasks. We train ResNet-50 on ImageNet (Russakovsky et al., 2015) using SGD with momentum on 32 NVIDIA Tesla V100 GPUs with data parallelism and a mini-batch size of 1024.
Dataset Splits | Yes | AutoLRS aims to find an LR applied to every τ steps that minimizes the resulting validation loss.
Hardware Specification | Yes | We train ResNet-50 on ImageNet (Russakovsky et al., 2015) using SGD with momentum on 32 NVIDIA Tesla V100 GPUs with data parallelism and a mini-batch size of 1024.
Software Dependencies | No | The paper mentions a 'PyTorch implementation' but does not specify a version for PyTorch or any other software dependency.
Experiment Setup | Yes | In our default setting, we set k = 10 and τ' = τ/10 so that the training steps spent on BO equal the training steps spent on updating the DNN model. We start from τ = 1000 and τ' = 100 and double τ and τ' after each stage until τ reaches τmax. We use τmax = 8000 for ResNet-50 and Transformer and τmax = 32000 for BERT. (A small helper illustrating this stage schedule is sketched after the table.)
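
For concreteness, below is a minimal Python sketch of the per-stage BO learning-rate search that Algorithm 1 describes. It is not the authors' implementation: val_loss_after_short_run is a synthetic stand-in for training τ' steps at a candidate LR and measuring validation loss, the Gaussian-process surrogate and lower-confidence-bound acquisition (with trade-off weight κ) follow a standard BO recipe rather than the paper's exact settings, and the real AutoLRS additionally forecasts the post-stage loss from the short rollout instead of using the short-run loss directly.

# Sketch of one AutoLRS training stage: pick an LR by Bayesian optimization,
# then train tau steps with it.  Names and the synthetic loss surface below
# are illustrative stand-ins, not the authors' code.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def val_loss_after_short_run(log_lr: float) -> float:
    """Stand-in for training tau' steps at exp(log_lr) from the current
    checkpoint and measuring validation loss; replace with real training."""
    return (log_lr + 5.0) ** 2 + 0.05 * np.random.randn()


def bo_pick_lr(eta_min=1e-6, eta_max=1e-1, k=10, kappa=1.0, seed=0):
    """Evaluate k candidate LRs in (eta_min, eta_max) with short rollouts and
    return the LR with the lowest observed validation loss."""
    rng = np.random.default_rng(seed)
    lo, hi = np.log(eta_min), np.log(eta_max)
    X, y = [], []
    for i in range(k):
        if i < 3:
            # A few random points to seed the GP surrogate.
            x = rng.uniform(lo, hi)
        else:
            gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                          normalize_y=True)
            gp.fit(np.array(X).reshape(-1, 1), np.array(y))
            grid = np.linspace(lo, hi, 256).reshape(-1, 1)
            mu, sigma = gp.predict(grid, return_std=True)
            # Lower-confidence-bound acquisition for loss minimization,
            # with trade-off weight kappa (Algorithm 1, input 5).
            x = float(grid[np.argmin(mu - kappa * sigma)])
        X.append(x)
        y.append(val_loss_after_short_run(x))
    best = X[int(np.argmin(y))]
    return float(np.exp(best))


if __name__ == "__main__":
    lr = bo_pick_lr()
    print(f"LR chosen for the next tau training steps: {lr:.2e}")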
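
The stage-length schedule quoted in the Experiment Setup row can also be written down in a few lines. This helper is an illustration of the described schedule (start with τ = 1000, keep τ' = τ/10 as in the default setting, double after each stage, cap τ at τmax); the function name stage_schedule and the step budget in the example are made up for illustration, not taken from the AutoLRS repository.

# Stage-length schedule implied by the setup above: tau starts at 1000,
# tau' = tau / 10, both double after each stage until tau reaches tau_max
# (8000 for ResNet-50 and Transformer, 32000 for BERT).
def stage_schedule(total_steps, tau=1000, tau_max=8000):
    """Yield (tau, tau') for each training stage until the step budget runs out."""
    done = 0
    while done < total_steps:
        yield tau, tau // 10
        done += tau
        tau = min(2 * tau, tau_max)


# Example: the first few stages for ResNet-50 (tau_max = 8000).
for i, (t, tp) in zip(range(6), stage_schedule(total_steps=40000)):
    print(f"stage {i}: tau = {t}, tau' = {tp}")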