AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly
Authors: Yuchen Jin, Tianyi Zhou, Liangyu Zhao, Yibo Zhu, Chuanxiong Guo, Marco Canini, Arvind Krishnamurthy
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the advantages and the generality of AutoLRS through extensive experiments of training DNNs for tasks from diverse domains using different optimizers. |
| Researcher Affiliation | Collaboration | Yuchen Jin, Tianyi Zhou, Liangyu Zhao, University of Washington, {yuchenj, tianyizh, liangyu}@cs.washington.edu; Yibo Zhu, Chuanxiong Guo, ByteDance Inc., {zhuyibo, guochuanxiong}@bytedance.com; Marco Canini, KAUST, marco@kaust.edu.sa; Arvind Krishnamurthy, University of Washington, arvind@cs.washington.edu |
| Pseudocode | Yes | Algorithm 1: AutoLRS. Input: (1) Number of steps in each training stage, τ; (2) Learning-rate search interval (η_min, η_max); (3) Number of LRs to evaluate by BO in each training stage, k; (4) Number of training steps to evaluate each LR in BO, τ′; (5) Trade-off weight in the acquisition function of BO, κ. (A minimal sketch of this search loop follows the table.) |
| Open Source Code | Yes | The AutoLRS implementation is available at https://github.com/YuchenJin/autolrs. |
| Open Datasets | Yes | ResNet-50 (He et al., 2016a) on ImageNet classification (Russakovsky et al., 2015); Transformer (Vaswani et al., 2017) and BERT (Devlin et al., 2019) for NLP tasks. We train ResNet-50 on ImageNet (Russakovsky et al., 2015) using SGD with momentum on 32 NVIDIA Tesla V100 GPUs with data parallelism and a mini-batch size of 1024. |
| Dataset Splits | Yes | AutoLRS aims to find an LR applied to every τ steps that minimizes the resulting validation loss. |
| Hardware Specification | Yes | We train ResNet-50 on ImageNet (Russakovsky et al., 2015) using SGD with momentum on 32 NVIDIA Tesla V100 GPUs with data parallelism and a mini-batch size of 1024. |
| Software Dependencies | No | The paper mentions using a 'PyTorch implementation' but does not give version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | In our default setting, we set k = 10 and τ′ = τ/10 so that the training steps spent on BO equal the training steps spent on updating the DNN model. We start from τ = 1000 and τ′ = 100 and double τ and τ′ after each stage until τ reaches τ_max. We use τ_max = 8000 for ResNet-50 and Transformer, and τ_max = 32000 for BERT. (A minimal sketch of this staged schedule follows the table.) |
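
The Pseudocode row above lists the inputs to Algorithm 1. As a rough illustration of how they fit together, the following Python sketch runs the BO-driven LR search inside a single training stage. It assumes a scikit-learn Gaussian-process surrogate with a lower-confidence-bound acquisition as a stand-in for the paper's BO machinery; `train_steps`, `validation_loss`, and the default `kappa` are hypothetical placeholders, and the actual AutoLRS implementation (linked above) differs in detail; for example, it predicts the validation loss after the full τ steps from each short τ′-step trial rather than using the trial's loss directly.

```python
# A minimal sketch of one AutoLRS search stage, NOT the authors' implementation.
# Assumptions: a scikit-learn GP surrogate, a lower-confidence-bound acquisition
# with trade-off weight kappa, and hypothetical `train_steps(model, lr, num_steps)`
# and `validation_loss(model)` callbacks supplied by the user.
import copy
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def bo_lr_search(model, train_steps, validation_loss,
                 lr_min=1e-4, lr_max=1e-1, k=10, tau_prime=100, kappa=1.0):
    """Pick an LR for the next training stage by BO in log-LR space."""
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    lo, hi = np.log10(lr_min), np.log10(lr_max)
    grid = np.linspace(lo, hi, 1000).reshape(-1, 1)     # candidate log-LRs
    xs, ys = [], []
    start = copy.deepcopy(model.state_dict())           # snapshot stage start

    for i in range(k):
        if i == 0:
            log_lr = np.random.uniform(lo, hi)          # first trial: random
        else:
            gp.fit(np.array(xs), np.array(ys))
            mean, std = gp.predict(grid, return_std=True)
            log_lr = float(grid[np.argmin(mean - kappa * std), 0])  # minimize LCB

        model.load_state_dict(start)                    # reset weights per trial
        train_steps(model, lr=10.0 ** log_lr, num_steps=tau_prime)
        xs.append([log_lr])
        ys.append(validation_loss(model))

    model.load_state_dict(start)                        # real stage resumes here
    return 10.0 ** xs[int(np.argmin(ys))][0]            # best LR found
```

The search is done in log-LR space, a common choice since useful learning rates span several orders of magnitude.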
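
The Experiment Setup row describes the outer schedule: τ′ = τ/10, starting at τ = 1000, with both doubling after every stage until τ reaches τ_max. A tiny sketch of that bookkeeping, assuming a hypothetical `total_steps` training budget:

```python
# A minimal sketch of the staged schedule from the Experiment Setup row.
# Assumes k = 10 (so tau' = tau / 10) and a hypothetical `total_steps` budget;
# tau_max is 8000 for ResNet-50/Transformer and 32000 for BERT per the paper.
def stage_schedule(total_steps, tau=1000, tau_max=8000, k=10):
    """Yield (tau, tau_prime) per training stage until the step budget is spent."""
    spent = 0
    while spent < total_steps:
        tau_prime = tau // k               # k = 10  =>  tau' = tau / 10
        yield tau, tau_prime
        spent += tau + k * tau_prime       # model updates + k BO trial runs
        tau = min(2 * tau, tau_max)        # double tau (and hence tau') per stage


# Example: the first few stages under the ResNet-50 / Transformer setting.
for i, (t, tp) in zip(range(5), stage_schedule(total_steps=50_000)):
    print(f"stage {i}: tau={t}, tau'={tp}")
```

With k·τ′ = τ, each stage spends as many steps on BO trials as on actual model updates, which matches the default trade-off quoted above.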