Learning to Schedule Learning Rate with Graph Neural Networks

Authors: Yuanhao Xiong, Li-Cheng Lan, Xiangning Chen, Ruochen Wang, Cho-Jui Hsieh

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our framework on benchmarking datasets, Fashion-MNIST and CIFAR10 for image classification, and GLUE for language understanding. GNS shows consistent improvement over popular baselines when training CNN and Transformer models. Moreover, GNS demonstrates great generalization to different datasets and network structures. Our code is available at https://github.com/xyh97/GNS. ... To validate the effectiveness of GNS, we evaluate our method on various tasks in image classification and language understanding, and compare it with popular learning rate scheduling rules. We further investigate the generalization of GNS on different transfer tasks. In addition, we conduct an ablation study to analyze the state representation and reward collection.
Researcher Affiliation | Academia | Yuanhao Xiong, Li-Cheng Lan, Xiangning Chen, Ruochen Wang, Cho-Jui Hsieh; Department of Computer Science, UCLA; {yhxiong, lclan, xiangning, chohsieh}@cs.ucla.edu; ruocwang@ucla.edu
Pseudocode | Yes | Algorithm 1 Graph Network-based Scheduler. Input: value network parameterized by φ, action network parameterized by ϕ, number of updates T, decision interval K, prior learning rate distribution Dα. (A minimal sketch of this decision loop is given after the table.)
Open Source Code | Yes | Our code is available at https://github.com/xyh97/GNS.
Open Datasets | Yes | Image classification. We consider two benchmark datasets in image classification, Fashion-MNIST (Xiao et al., 2017) and CIFAR10 (Krizhevsky et al., 2014). ... Language understanding. For language understanding, we conduct experiments on GLUE (Wang et al., 2019), a benchmark consisting of eight sentence- or sentence-pair tasks.
Dataset Splits | Yes | These two datasets are first split into the standard training and test sets. Then we randomly sample 10k images for each dataset from the training set to construct a validation set. ... For language understanding, we conduct experiments on GLUE (Wang et al., 2019), a benchmark consisting of eight sentence- or sentence-pair tasks. They are divided into training, validation and test sets and we have no access to ground truth labels of test sets. (A data-loading and split sketch is given after the table.)
Hardware Specification | Yes | For instance, when running on MRPC for 5 epochs, we need to make 58 decisions with the number of network updates K = 10. On one NVIDIA 1080Ti GPU, the average time of one episode of SRLS is 405s while GNS only takes 259s, which decreases the original cost by 30%.
Software Dependencies | No | All RoBERTa models in this paper are implemented by Hugging Face (Wolf et al., 2020; https://github.com/huggingface/transformers) and pre-trained models are obtained from the corresponding model hub. The paper mentions Hugging Face and its transformers library, but does not specify version numbers for these or other software dependencies. (A dependency-pinning sketch is given after the table.)
Experiment Setup | Yes | In this section, we present our experimental settings. Further details can be found in Appendix B. ... We use Adam (Kingma & Ba, 2014) with a batch size of 128 for 200 epochs to train these two image classification tasks. ... The AdamW (Loshchilov & Hutter, 2017) optimizer is adopted to train RoBERTa models. Details of other hyperparameters like batch size and episode length for each task are provided in Appendix B. ... Table 7: Hyperparameter configuration for GLUE benchmarking datasets. ... Table 8: Hyperparameter configuration for GLUE benchmarking datasets. (An optimizer-setup sketch is given after the table.)
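The quoted pseudocode lists only Algorithm 1's inputs. As a reading aid, here is a minimal, hypothetical sketch of such a decision loop in PyTorch: every K optimizer updates, a learned action network maps a training-state summary to a multiplicative learning-rate action. The graph-network state encoder of GNS is replaced here by a flat statistics vector, and ActionNet, make_state, and the action range are assumptions, not the authors' implementation.

```python
# Minimal sketch of a GNS-style decision loop (not the authors' code).
import torch
import torch.nn as nn

class ActionNet(nn.Module):
    """Hypothetical stand-in for the paper's action network (parameterized by phi)."""
    def __init__(self, state_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, 1))

    def forward(self, state):
        # Squash to a scaling factor in (0.5, 1.5), an assumed action range.
        return 1.0 + 0.5 * torch.tanh(self.body(state))

def make_state(model, loss):
    """Toy state: per-parameter-tensor weight/gradient norms plus the loss.
    GNS instead builds a graph over the network; this flat vector is a proxy."""
    feats = []
    for p in model.parameters():
        g = p.grad.norm() if p.grad is not None else torch.tensor(0.0)
        feats += [p.detach().norm(), g]
    return torch.stack(feats + [loss.detach()])

# Toy task: fit random data with a small MLP.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # paper draws the initial lr from a prior D_alpha; fixed here
state_dim = 2 * len(list(model.parameters())) + 1
action_net = ActionNet(state_dim)

K, T = 10, 100  # decision interval and number of updates (Algorithm 1's K and T)
x, y = torch.randn(128, 8), torch.randn(128, 1)
for t in range(T):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    if (t + 1) % K == 0:  # every K updates, let the scheduler act
        with torch.no_grad():
            scale = action_net(make_state(model, loss)).item()
        for group in opt.param_groups:
            group["lr"] *= scale
        print(f"step {t+1}: loss={loss.item():.4f} lr={opt.param_groups[0]['lr']:.2e}")
```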
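The Open Datasets and Dataset Splits rows describe standard train/test sets plus a 10k validation sample drawn from the training set. A hedged sketch of that preparation, assuming torchvision loaders and an arbitrary seed (the paper names neither; only the 10k validation size is quoted):

```python
# Sketch of the described data preparation; shown for CIFAR10, with
# Fashion-MNIST handled analogously via datasets.FashionMNIST.
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

tfm = transforms.ToTensor()
cifar_train = datasets.CIFAR10("data", train=True, download=True, transform=tfm)
cifar_test = datasets.CIFAR10("data", train=False, download=True, transform=tfm)

n_val = 10_000                      # "randomly sample 10k images"
n_train = len(cifar_train) - n_val  # 50,000 - 10,000 = 40,000
train_set, val_set = random_split(
    cifar_train, [n_train, n_val],
    generator=torch.Generator().manual_seed(0),  # seed is an assumption
)
```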
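The Software Dependencies row flags missing version numbers. A sketch of how the RoBERTa setup could be reproduced and pinned, assuming the current transformers API; the checkpoint name and label count are illustrative, not taken from the paper:

```python
# pip install transformers torch  # versions are NOT specified in the paper;
# pinning exact versions in a requirements file would close this gap.
from transformers import RobertaForSequenceClassification, RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base")  # pulled from the model hub
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

batch = tok(["A GLUE-style sentence.", "And another."], padding=True, return_tensors="pt")
logits = model(**batch).logits  # shape: (2, num_labels)
```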
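Finally, the Experiment Setup row pins down two optimizer choices. A minimal sketch of those settings; the placeholder models and any unlisted hyperparameters (e.g., learning rates) are assumptions, while the optimizer names, batch size, and epoch count are quoted from the paper:

```python
import torch

# Placeholders standing in for the paper's CNN and RoBERTa models.
cnn = torch.nn.Linear(3 * 32 * 32, 10)
roberta_head = torch.nn.Linear(768, 2)

# Image classification: Adam (Kingma & Ba, 2014), batch size 128, 200 epochs.
cnn_opt = torch.optim.Adam(cnn.parameters())
BATCH_SIZE, EPOCHS = 128, 200

# GLUE: AdamW (Loshchilov & Hutter, 2017) for RoBERTa fine-tuning.
roberta_opt = torch.optim.AdamW(roberta_head.parameters())
```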