Progressive Prompts: Continual Learning for Language Models

Authors: Anastasia Razdaibiedina, Yuning Mao, Rui Hou, Madian Khabsa, Mike Lewis, Amjad Almahairi

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on standard continual learning benchmarks show that our approach outperforms state-of-the-art methods, with an improvement >20% in average test accuracy over the previous best-performing method on the T5 model. We also explore a more challenging continual learning setup with longer sequences of tasks and show that Progressive Prompts significantly outperforms prior methods.
Researcher Affiliation | Collaboration | University of Toronto & Vector Institute; Meta AI. anastasia.razdaibiedina@mail.utoronto.ca, {yuningm,rayhou,mkhabsa,mikelewis,aalmah}@meta.com
Pseudocode | No | The paper describes the method verbally and with a diagram (Figure 1) but does not include any pseudocode or algorithm blocks. (A hedged sketch of the method appears after this table.)
Open Source Code | No | The paper does not provide an explicit statement about, or link to, open-source code for the methodology it describes.
Open Datasets | Yes | We first evaluate our approach on the widely adopted CL benchmark for language models, which includes five text classification datasets by Zhang et al. (2015): AG News (4 classes, news classification), Amazon reviews (5 classes, sentiment analysis), Yelp reviews (5 classes, sentiment analysis), DBpedia (14 classes, Wikipedia text classification) and Yahoo Answers (10 classes, Q&A classification). We provide the task details in Appendix A.1 and sequences of tasks used in our experiments in Appendix A.2. ... For the standard CL benchmark, we use official datasets provided by Zhang et al. (2015) available at http://goo.gl/JyCnZq, following de Masson d'Autume et al. (2019); Zhang et al. (2015). We use Hugging Face datasets (https://github.com/huggingface/datasets) to download data for GLUE tasks (Wang et al., 2018), SuperGLUE tasks (Wang et al., 2019), and the IMDB movie reviews dataset (Maas et al., 2011), which we use for long-sequence CL experiments and/or ablation studies. (A dataset-loading sketch appears after this table.)
Dataset Splits | Yes | We use the same train and test sets as IDBR (Huang et al., 2021) and MbPA++ (de Masson d'Autume et al., 2019), consisting of 115,000 training and 7,600 test examples for each task. Following Huang et al. (2021), for every task we randomly hold out 500 samples per class from the training set for validation, and use early stopping according to the validation accuracy on all seen tasks. (A per-class holdout sketch appears after this table.)
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU models, memory) used to run its experiments.
Software Dependencies | No | The paper mentions using PyTorch and the Hugging Face Transformers library, citing their respective papers (Paszke et al., 2019; Wolf et al., 2019), but does not provide specific version numbers for these software dependencies (e.g., 'PyTorch 1.9' or 'Transformers 4.x').
Experiment Setup | Yes | We use the Adam optimizer (Kingma & Ba, 2014) and set the batch size to 8 for all experiments, except for MTL runs with a batch size of 2 (due to memory limitations). We train each prompt between 10 and 300 epochs, depending on the number of data points. We use the prompt checkpoints with the best validation set score as our final prompts. Prompts are initialized from randomly sampled tokens as in Lester et al. (2021); hyperparameters are shown in Table 7: BERT: epochs 40/40/150/300, learning rate 1e-4, prompt length 20; T5: epochs 10/10/150/300, learning rate 0.3, prompt length 50/10/10/10. (A prompt-tuning setup sketch appears after this table.)
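
Since the paper provides no pseudocode (Pseudocode row above), the following is a minimal PyTorch sketch of the core idea as described verbally and in Figure 1: a new soft prompt is trained for each task while the language model and all previously learned prompts stay frozen, and all prompts learned so far are concatenated in front of the input embeddings. The class name, method names, and initialization scale are my own; the exact prompt ordering is a presentation detail not taken from the paper.

```python
import torch
import torch.nn as nn
from typing import Optional


class ProgressivePrompts(nn.Module):
    """Sketch of progressive prompt concatenation: one soft prompt per task,
    prepended to the input embeddings together with all earlier (frozen) prompts."""

    def __init__(self, embed_dim: int, prompt_length: int = 20):
        super().__init__()
        self.embed_dim = embed_dim
        self.prompt_length = prompt_length
        self.prompts = nn.ParameterList()  # one entry per task, in task order

    def start_new_task(self, init_embeds: Optional[torch.Tensor] = None) -> nn.Parameter:
        # Freeze every previously learned prompt; only the new one is trained.
        for p in self.prompts:
            p.requires_grad_(False)
        if init_embeds is None:
            init_embeds = torch.randn(self.prompt_length, self.embed_dim) * 0.5
        new_prompt = nn.Parameter(init_embeds.clone())
        self.prompts.append(new_prompt)
        return new_prompt

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) token embeddings of the input x.
        batch_size = input_embeds.size(0)
        # Concatenate all prompts learned so far and prepend them to the input.
        all_prompts = torch.cat(list(self.prompts), dim=0)               # (k * L, d)
        all_prompts = all_prompts.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([all_prompts, input_embeds], dim=1)
```

When training task k, the optimizer would be built over only the prompt returned by `start_new_task`, e.g. `torch.optim.Adam([new_prompt], lr=0.3)`, which keeps both the backbone and all earlier prompts fixed.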
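
For the Open Datasets row: the GLUE, SuperGLUE, and IMDB data referenced above can be pulled with the Hugging Face datasets library the paper cites. The specific task names below (mnli, rte) are illustrative examples, not a claim about the exact subsets used in the long-sequence experiments; those are listed in the paper's appendix.

```python
from datasets import load_dataset

# Illustrative examples only; the exact task list is in the paper's appendix.
mnli = load_dataset("glue", "mnli")        # a GLUE task
rte = load_dataset("super_glue", "rte")    # a SuperGLUE task
imdb = load_dataset("imdb")                # IMDB movie reviews

print(mnli["train"][0])                    # inspect one training example
```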
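
For the Dataset Splits row: a minimal sketch of the described protocol of holding out 500 training samples per class for validation. The function name and the assumed list-of-dicts format with a 'label' key are hypothetical; the paper does not specify an implementation.

```python
import random
from collections import defaultdict


def per_class_holdout(examples, num_per_class=500, seed=0):
    """Split `examples` (list of dicts with a 'label' key) into
    (train, validation), holding out `num_per_class` examples per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    train, validation = [], []
    for items in by_label.values():
        rng.shuffle(items)
        validation.extend(items[:num_per_class])
        train.extend(items[num_per_class:])
    return train, validation
```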
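
For the Experiment Setup row: a minimal sketch of prompt tuning with a frozen T5 backbone, Adam, and prompt initialization from embeddings of randomly sampled vocabulary tokens (Lester et al., 2021), using the T5 learning rate and prompt length quoted from Table 7. The "t5-large" checkpoint is an assumption (the row only says "T5 model"); the training loop itself is omitted.

```python
import torch
from transformers import AutoModelForSeq2SeqLM

model_name = "t5-large"                      # assumed checkpoint; the paper says "T5 model"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Freeze the backbone: only the soft prompt receives gradients.
for param in model.parameters():
    param.requires_grad = False

# Initialize the prompt from embeddings of randomly sampled vocabulary tokens,
# as in Lester et al. (2021); prompt length 50 and lr 0.3 follow Table 7 (T5).
prompt_length = 50
embeddings = model.get_input_embeddings()                      # nn.Embedding
token_ids = torch.randint(0, embeddings.num_embeddings, (prompt_length,))
prompt = torch.nn.Parameter(embeddings.weight[token_ids].detach().clone())

optimizer = torch.optim.Adam([prompt], lr=0.3)                 # Adam, per the paper
batch_size = 8                                                 # per the paper
```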