Scalable Language Model with Generalized Continual Learning

Authors: Bohao Peng, Zhuotao Tian, Shu Liu, Ming-Chang Yang, Jiaya Jia

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method demonstrates state-of-the-art performance on diverse backbones and benchmarks, achieving effective continual learning in both full-set and few-shot scenarios with minimal forgetting. Extensive experiments demonstrate remarkable efficacy and stability of our method on widely recognized benchmarks, reaching state-of-the-art performance on various models, including BERT, T5 and the latest LLaMA-2 (Devlin et al., 2018; Qin & Joty, 2021; Touvron et al., 2023).
Researcher Affiliation | Collaboration | Bohao Peng, Zhuotao Tian, Shu Liu, Ming-Chang Yang, Jiaya Jia; The Chinese University of Hong Kong; SmartMore
Pseudocode | Yes | Algorithm 1: The training pipeline of Scalable Language Model
Open Source Code | Yes | The code is available on the project website: https://github.com/Pbihao/SLM
Open Datasets | Yes | We first test our method on the widely adopted continual learning benchmarks for language models following de Masson d'Autume et al. (2019), which use five text classification datasets (Zhang et al., 2015; Chen et al., 2020) including AG News (news classification), Yelp (sentiment analysis), DBPedia (Wikipedia article classification), Amazon (sentiment analysis) and Yahoo Answers (Q&A classification). We further extend our method to large generative language models with the LLaMA-2 backbone (Touvron et al., 2023) and introduce a new benchmark that spans multiple domains and task types. This benchmark includes three types of tasks: question answering (medical), multiple-choice examination (MMLU), and sentiment classification (finance) (Li et al., 2023; Hendrycks et al., 2021b;a). (See the dataset-loading sketch after the table.)
Dataset Splits | Yes | We conduct the few-shot continual learning setup with the T5-large backbone (Raffel et al., 2020), following the approach of LFPT5 (Qin & Joty, 2021). This setup samples 16 examples per class for the training and validation sets to evaluate the performance of our proposed method with limited training resources. (See the few-shot sampling sketch after the table.)
Hardware Specification | Yes | We conducted trials using the BERT and T5 backbones with 4 NVIDIA GeForce RTX 3090 GPUs. Additionally, for experiments involving the LLaMA2-7B backbone, we utilized 4 NVIDIA A100 GPUs with a batch size of 2.
Software Dependencies | No | The paper mentions Hugging Face Transformers, DeepSpeed, AdamW, and Sentence-BERT, but does not provide specific version numbers for any of these software dependencies, which are required for reproducibility. (See the version-capture sketch after the table.)
Experiment Setup | Yes | We set the batch size to 8 and the maximum sequence length to 512 for these experiments. Additionally, for experiments involving the LLaMA2-7B backbone, we utilized 4 NVIDIA A100 GPUs with a batch size of 2. To enhance training efficiency, we employed DeepSpeed (Rasley et al., 2020) for training optimization. AdamW is employed as the optimizer (Loshchilov & Hutter, 2017) for our experiments. For the preparation stage, we set the learning rate lr = 1e-3 and the random mask rate p = 20% for all scenarios. For full-set continual learning with the BERT and LLaMA-2 backbones, we set the learning rate to 2e-4; for the few-shot continual learning scenario with the T5 model, we set it to 2e-2. The weight decay is set to 0.01. More configuration details can be found in Appendix A.4 and in Table 10 (optimization hyperparameters). (See the optimizer-configuration sketch after the table.)
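The Open Datasets row lists five standard text-classification corpora. Below is a minimal loading sketch using the Hugging Face datasets library; the Hub identifiers (ag_news, yelp_review_full, dbpedia_14, amazon_polarity, yahoo_answers_topics) and the task ordering are my assumptions and are not taken from the paper or its repository.

```python
# Hedged sketch: load the five continual-learning classification corpora.
# The Hub identifiers below are assumptions, not confirmed by the paper or its code.
from datasets import load_dataset

TASKS = {
    "agnews": "ag_news",              # news classification
    "yelp": "yelp_review_full",       # sentiment analysis
    "dbpedia": "dbpedia_14",          # Wikipedia article classification
    "amazon": "amazon_polarity",      # sentiment analysis
    "yahoo": "yahoo_answers_topics",  # Q&A classification
}

def task_stream(order=("agnews", "yelp", "dbpedia", "amazon", "yahoo")):
    """Yield (task_name, DatasetDict) pairs in one possible continual-learning order."""
    for name in order:
        yield name, load_dataset(TASKS[name])
```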
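The Dataset Splits row describes a few-shot setup that keeps 16 examples per class for training and validation. A minimal per-class subsampling sketch is shown below; the label_key field name and the seed handling are illustrative assumptions, not details from the paper.

```python
import random
from collections import defaultdict

def sample_per_class(examples, k=16, label_key="label", seed=0):
    """Randomly keep k examples per class (e.g., k=16 for the few-shot setting).

    `examples` is a list of dicts; `label_key` is an assumed field name.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example in examples:
        by_class[example[label_key]].append(example)
    subset = []
    for items in by_class.values():
        rng.shuffle(items)
        subset.extend(items[:k])
    return subset
```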
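The Software Dependencies row flags missing version numbers. One way for a reproduction attempt to close that gap is to record the installed versions at run time; the sketch below uses importlib.metadata and covers only the packages named in the row (the package names, e.g. sentence-transformers for Sentence-BERT, are assumptions).

```python
# Hedged sketch: print the versions of the packages the report names,
# since the paper itself does not pin them.
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["transformers", "deepspeed", "torch", "sentence-transformers"]

def report_versions(packages=PACKAGES):
    for pkg in packages:
        try:
            print(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            print(f"{pkg}: not installed")

if __name__ == "__main__":
    report_versions()
```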
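The Experiment Setup row quotes the optimizer and learning-rate choices. The PyTorch sketch below wires up AdamW with those quoted values; the stage names, and the omission of the model definition and DeepSpeed plumbing, are my simplifications rather than the paper's implementation.

```python
import torch
from torch.optim import AdamW

# Learning rates quoted in the Experiment Setup row; weight decay is 0.01 throughout.
LEARNING_RATES = {
    "preparation": 1e-3,   # preparation stage, with random mask rate p = 20%
    "bert_full": 2e-4,     # full-set continual learning, BERT backbone
    "llama2_full": 2e-4,   # full-set continual learning, LLaMA-2 backbone
    "t5_fewshot": 2e-2,    # few-shot continual learning, T5-large backbone
}

def build_optimizer(model: torch.nn.Module, stage: str) -> AdamW:
    """Return an AdamW optimizer for the given training stage (sketch only)."""
    return AdamW(model.parameters(), lr=LEARNING_RATES[stage], weight_decay=0.01)
```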