Learning to Schedule Learning Rate with Graph Neural Networks
Authors: Yuanhao Xiong, Li-Cheng Lan, Xiangning Chen, Ruochen Wang, Cho-Jui Hsieh
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our framework on benchmarking datasets, Fashion-MNIST and CIFAR10 for image classification, and GLUE for language understanding. GNS shows consistent improvement over popular baselines when training CNN and Transformer models. Moreover, GNS demonstrates great generalization to different datasets and network structures. Our code is available at https://github.com/xyh97/GNS. ... To validate the effectiveness of GNS, we evaluate our method on various tasks in image classification and language understanding, and compare it with popular learning rate scheduling rules. We further investigate the generalization of GNS on different transfer tasks. In addition, we conduct an ablation study to analyze the state representation and reward collection. |
| Researcher Affiliation | Academia | Yuanhao Xiong, Li-Cheng Lan, Xiangning Chen, Ruochen Wang, Cho-Jui Hsieh Department of Computer Science, UCLA {yhxiong, lclan, xiangning, chohsieh}@cs.ucla.edu ruocwang@ucla.edu |
| Pseudocode | Yes | Algorithm 1 Graph Network-based Scheduler. Input: Value network parameterized by φ, action network parameterized by ϕ, # updates T, decision interval K, prior learning rate distribution Dα [see the first sketch after this table] |
| Open Source Code | Yes | Our code is available at https://github.com/xyh97/GNS. |
| Open Datasets | Yes | Image classification. We consider two benchmark datasets in image classification, Fashion-MNIST (Xiao et al., 2017) and CIFAR10 (Krizhevsky et al., 2014). ... Language understanding. For language understanding, we conduct experiments on GLUE (Wang et al., 2019), a benchmark consisting of eight sentence- or sentence-pair tasks. |
| Dataset Splits | Yes | These two datasets are first split into the standard training and test sets. Then we randomly sample 10k images for each dataset from the training set to construct a validation set. ... For language understanding, we conduct experiments on GLUE (Wang et al., 2019), a benchmark consisting of eight sentence- or sentence-pair tasks. They are divided into training, validation and test sets and we have no access to ground truth labels of test sets. [see the second sketch after this table] |
| Hardware Specification | Yes | For instance, when running on MRPC for 5 epochs, we need to make 58 decisions with the number of network updates K = 10. The average time of one episode of SRLS with one NVIDIA 1080Ti GPU is 405s while GNS only takes 259s, which decreases the original cost by 30%. |
| Software Dependencies | No | All RoBERTa models in this paper are implemented by Hugging Face (Wolf et al., 2020) and pre-trained models are obtained from the corresponding model hub. ... https://github.com/huggingface/transformers. The paper mentions Hugging Face and its transformers library, but does not specify version numbers for these or other software dependencies. |
| Experiment Setup | Yes | In this section, we present our experimental settings. Further details can be found in Appendix B. ... We use Adam (Kingma & Ba, 2014) with a batch size of 128 for 200 epochs to train these two image classification tasks. ... The AdamW (Loshchilov & Hutter, 2017) optimizer is adopted to train RoBERTa models. Details of other hyperparameters like batch size and episode length for each task are provided in Appendix B. ... Table 7: Hyperparameter configuration for GLUE benchmarking datasets. [see the third sketch after this table] |
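
For concreteness, below is a minimal sketch of the per-episode decision loop implied by Algorithm 1's inputs: every K optimizer updates, an action network reads a summary of the training state and rescales the learning rate. This is a hedged illustration, not the authors' method: the paper encodes the trained network as a graph and uses a GNN, whereas here `encode_state` substitutes crude per-layer statistics; `ActionNet`, `encode_state`, and `run_episode` are hypothetical names, and the value network (used only to train the scheduler) is omitted.

```python
# Hypothetical sketch of the decision loop suggested by Algorithm 1's inputs
# (# updates T, decision interval K, an initial LR from a prior D_alpha).
import torch
import torch.nn as nn

class ActionNet(nn.Module):
    """Placeholder policy head: maps a state vector to a log-scale LR change."""
    def __init__(self, state_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                                 nn.Linear(32, 1))

    def forward(self, state):
        # Bounded log-factor so the learning rate changes smoothly.
        return torch.tanh(self.mlp(state)) * 0.1

def encode_state(model, loss):
    # Stand-in for the paper's GNN state encoding: mean absolute weight
    # per parameter tensor, plus the current loss value.
    stats = [p.detach().abs().mean() for p in model.parameters()]
    return torch.stack(stats + [loss.detach()])

def run_episode(model, opt, loss_fn, batches, action_net, T=100, K=10, lr0=1e-3):
    """Train for T updates; every K updates the scheduler adjusts the LR."""
    lr = lr0  # Algorithm 1 draws the initial LR from a prior D_alpha instead.
    for g in opt.param_groups:
        g["lr"] = lr
    for t in range(T):
        x, y = batches[t % len(batches)]
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if (t + 1) % K == 0:  # decision interval K
            with torch.no_grad():
                state = encode_state(model, loss)
                lr *= float(torch.exp(action_net(state)))
            for g in opt.param_groups:
                g["lr"] = lr
    return lr
```

A caller would construct `action_net = ActionNet(state_dim=len(list(model.parameters())) + 1)` so the state dimension matches `encode_state`.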
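The validation-split protocol quoted in the Dataset Splits row is straightforward to reproduce. Below is a minimal sketch assuming torchvision and CIFAR10 (the same procedure would apply to Fashion-MNIST); the fixed seed is our assumption, not the paper's.

```python
# Sketch of the quoted split: standard train/test split, then 10k images
# randomly held out from the training set as a validation set.
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

train_full = datasets.CIFAR10(root="data", train=True, download=True,
                              transform=transforms.ToTensor())
test_set = datasets.CIFAR10(root="data", train=False, download=True,
                            transform=transforms.ToTensor())

# Hold out 10k of the 50k training images for validation (seed assumed).
train_set, val_set = random_split(
    train_full, [len(train_full) - 10_000, 10_000],
    generator=torch.Generator().manual_seed(0))
```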
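Finally, the stated image-classification setup (Adam, batch size 128, 200 epochs) maps to a short training loop. In this sketch `lr=1e-3` is a placeholder, since under GNS the scheduler adjusts the learning rate during training, and the model is whatever CNN is being scheduled; remaining hyperparameters live in the paper's Appendix B.

```python
# Minimal sketch of the stated setup: Adam, batch size 128, 200 epochs.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, train_set, epochs=200, batch_size=128, lr=1e-3):
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            loss = loss_fn(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```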