Scalable Language Model with Generalized Continual Learning
Authors: Bohao Peng, Zhuotao Tian, Shu Liu, Ming-Chang Yang, Jiaya Jia
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method demonstrates state-of-the-art performance on diverse backbones and benchmarks, achieving effective continual learning in both full-set and few-shot scenarios with minimal forgetting. Extensive experiments demonstrate remarkable efficacy and stability of our method on widely recognized benchmarks, reaching state-of-the-art performance on various models, including BERT, T5 and the latest LLaMA-2 (Devlin et al., 2018; Qin & Joty, 2021; Touvron et al., 2023). |
| Researcher Affiliation | Collaboration | Bohao Peng, Zhuotao Tian, Shu Liu, Mingchang Yang, Jiaya Jia (The Chinese University of Hong Kong; SmartMore) |
| Pseudocode | Yes | Algorithm 1 The training pipeline of Scalable Language Model |
| Open Source Code | Yes | The code is available on the project website: https://github.com/Pbihao/SLM |
| Open Datasets | Yes | We first test our method on the widely adopted continual learning benchmarks for language models following de Masson d'Autume et al. (2019), which use five text classification datasets (Zhang et al., 2015; Chen et al., 2020) including AG News (news classification), Yelp (sentiment analysis), DBPedia (Wikipedia article classification), Amazon (sentiment analysis) and Yahoo Answers (Q&A classification). We further extend our method to large generation language models with LLaMA-2 backbone (Touvron et al., 2023) and introduce a new benchmark that spans multiple domains and task types. This benchmark includes three types of tasks: question answering (medical), multiple-choice examination (mmlu), and sentiment classification (finance) (Li et al., 2023; Hendrycks et al., 2021b;a). |
| Dataset Splits | Yes | On the contrary, we conduct the few-shot continual learning setup with T5-large backbone (Raffel et al., 2020), following the approach of LFPT5 (Qin & Joty, 2021). This setup involves sampling 16 examples per class in the training and validation sets to evaluate the performance of our proposed method on limited training resources. *(A sketch of this 16-shot sampling appears below the table.)* |
| Hardware Specification | Yes | We conducted trials using the BERT and T5 backbones with 4 NVIDIA GeForce RTX 3090 GPUs. Additionally, for experiments involving the LLaMA2-7B backbone, we utilized 4 NVIDIA A100 GPUs with a batch size of 2. |
| Software Dependencies | No | The paper mentions Hugging Face Transformers, DeepSpeed, AdamW, and Sentence-BERT, but it does not provide specific version numbers for any of these software dependencies, which are needed for reproducibility. |
| Experiment Setup | Yes | We set the batch size to 8 and the maximum sequence length to 512 for these experiments. Additionally, for experiments involving the LLaMA2-7B backbone, we utilized 4 NVIDIA A100 GPUs with a batch size of 2. To enhance training efficiency, we employed DeepSpeed (Rasley et al., 2020) as a training optimization. AdamW is employed as the optimizer (Loshchilov & Hutter, 2017) for our experiments. For the preparation stage, we set the learning rate lr = 1e-3 and the random mask rate p = 20% for all scenarios. Specifically, we set the learning rate to 2e-4 for fully continual learning using the BERT and LLaMA2 backbones. For the few-shot continual learning scenario with the T5 model, we set the learning rate to 2e-2. The weight decay is set to 0.01. More configuration details can be found in Appendix A.4. Table 10: The details of the optimization hyperparameters. *(A hedged configuration sketch using these values appears below the table.)* |
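
The Dataset Splits row reports that the few-shot T5 setting samples 16 examples per class for both the training and validation sets. The snippet below is a minimal sketch of such a per-class sampling step, assuming a plain list of `{"text", "label"}` dictionaries; the function name `sample_k_per_class` and the field names are illustrative assumptions, not taken from the paper's released code.

```python
import random
from collections import defaultdict


def sample_k_per_class(examples, k=16, label_key="label", seed=0):
    """Randomly sample up to k examples per class label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ex in examples:
        by_class[ex[label_key]].append(ex)
    few_shot = []
    for items in by_class.values():
        few_shot.extend(rng.sample(items, min(k, len(items))))
    return few_shot


# Toy usage: 5-way classification data, 16 shots per class for the training split
# (the validation split would be sampled the same way).
toy_train = [{"text": f"doc {i}", "label": i % 5} for i in range(200)]
train_16shot = sample_k_per_class(toy_train, k=16)
assert len(train_16shot) == 16 * 5
```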
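
The Experiment Setup row lists the reported optimization choices. Below is a minimal sketch, assuming a PyTorch/Transformers training step, of how those values (AdamW, weight decay 0.01, stage-dependent learning rates, batch size 8, sequence length 512) could be wired together. The helper `build_optimizer`, the stage names, and the BERT backbone choice are illustrative assumptions rather than the authors' implementation; the actual training code is at https://github.com/Pbihao/SLM.

```python
# Hedged sketch: reported hyperparameters plugged into a PyTorch/Transformers setup.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Learning rates reported in the Experiment Setup row (and Table 10 of the paper).
STAGE_LR = {
    "preparation": 1e-3,         # random mask rate p = 20% also applies in this stage
    "full_continual": 2e-4,      # BERT and LLaMA2 backbones
    "few_shot_continual": 2e-2,  # T5-large backbone
}
WEIGHT_DECAY = 0.01
BATCH_SIZE = 8        # BERT/T5 experiments; LLaMA2-7B runs used batch size 2
MAX_SEQ_LEN = 512


def build_optimizer(model: torch.nn.Module, stage: str) -> AdamW:
    """AdamW over trainable parameters, with the learning rate reported for `stage`."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    return AdamW(trainable, lr=STAGE_LR[stage], weight_decay=WEIGHT_DECAY)


if __name__ == "__main__":
    # Illustrative backbone choice; the paper evaluates BERT, T5, and LLaMA-2.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=5
    )
    optimizer = build_optimizer(model, stage="full_continual")

    # One illustrative optimization step on a dummy batch.
    batch = tokenizer(
        ["example news headline"] * BATCH_SIZE,
        padding="max_length", truncation=True,
        max_length=MAX_SEQ_LEN, return_tensors="pt",
    )
    labels = torch.zeros(BATCH_SIZE, dtype=torch.long)
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```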