Pretrained Language Model in Continual Learning: A Comparative Study

Authors: Tongtong Wu, Massimo Caccia, Zhuang Li, Yuan-Fang Li, Guilin Qi, Gholamreza Haffari

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experimental analyses reveal interesting performance differences across PLMs and across CL methods. We conduct experiments over (1) two primary continual learning settings, including task-incremental learning and class-incremental learning; (2) three benchmark datasets with different data distributions and task definitions, including relation extraction, event classification, and intent detection; (3) four CL approaches with six baseline methods implemented for systematic comparison; and (4) five pretrained language models. (See the experiment-grid sketch below the table.)
Researcher Affiliation | Academia | Tongtong Wu1,2, Massimo Caccia3, Zhuang Li2, Yuan-Fang Li2, Guilin Qi1, Gholamreza Haffari2; 1Southeast University, 2Monash University, 3MILA
Pseudocode | Yes | Algorithm 1: Function of Layer Evaluation, EvaluateLayer()
Open Source Code | Yes | To encourage more research on continual learning in NLP, we release the code and dataset as an open-access resource on https://github.com/wutong8023/PLM4CL.git
Open Datasets | Yes | We evaluate our methods on 3 datasets with distinct label distributions, covering the following domains. CLINC150 (Larson et al., 2019) is an intent classification dataset... MAVEN (Wang et al., 2020) is a long-tailed event detection dataset... WebRED (Ormandi et al., 2021) is a severely long-tailed relation classification dataset... To encourage more research on continual learning in NLP, we release the code and dataset as an open-access resource on https://github.com/wutong8023/PLM4CL.git
Dataset Splits | Yes | For each class, we randomly split the dataset into train, validation and test sets by 10:2:3. (See the per-class split sketch below the table.)
Hardware Specification | No | The computational resources for this work were supported by the Multi-modal Australian ScienceS Imaging and Visualisation Environment (MASSIVE) (www.massive.org.au).
Software Dependencies | No | To provide a fair comparison among CL methods, we train all the networks using the AdamW (Mosbach et al., 2021) optimizer, and select 10e-5 as the learning rate for all pretrained backbone models.
Experiment Setup | Yes | To provide a fair comparison among CL methods, we train all the networks using the AdamW (Mosbach et al., 2021) optimizer, and select 10e-5 as the learning rate for all pretrained backbone models. (Table 2 lists method-specific hyper-parameters: EWC λ = 1,000,000, γ = 0.2; ER buffer size = 200; HAT smax = 400; DERPP α = 0.5, β = 1.) (See the training-configuration sketch below the table.)
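
Experiment-grid sketch. To make the scope quoted in the Research Type row concrete, the snippet below simply enumerates the combinations described in the paper: two CL settings, the three datasets named in the Open Datasets row, and the CL methods named in the hyper-parameter table. The PLM identifiers and the `run_experiment` driver are illustrative assumptions, not the authors' code.

```python
# Illustrative enumeration of the experimental grid described in the paper.
# Settings, datasets, and CL methods are taken from the report above;
# the PLM identifiers are placeholders (the paper compares five PLMs).
from itertools import product

settings = ["task-incremental", "class-incremental"]
datasets = ["CLINC150", "MAVEN", "WebRED"]
cl_methods = ["EWC", "ER", "HAT", "DERPP"]
plms = ["bert-base-uncased", "roberta-base"]  # hypothetical examples


def run_experiment(setting: str, dataset: str, method: str, plm: str) -> None:
    """Placeholder for a single training/evaluation run."""
    print(f"[{setting}] {dataset} | {method} | {plm}")


if __name__ == "__main__":
    for setting, dataset, method, plm in product(settings, datasets, cl_methods, plms):
        run_experiment(setting, dataset, method, plm)
```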
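Per-class split sketch. The Dataset Splits row quotes a per-class 10:2:3 train/validation/test split. The function below is a minimal sketch of one way to implement such a split; the data layout (a list of `(text, label)` pairs) and the random seed are assumptions, not details taken from the released code.

```python
# Minimal sketch of a per-class 10:2:3 split (train : validation : test).
import random
from collections import defaultdict


def split_per_class(examples, ratios=(10, 2, 3), seed=0):
    """Split (text, label) examples into train/val/test with a per-class ratio."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)
    train, val, test = [], [], []
    total = sum(ratios)
    for label, items in by_label.items():
        rng.shuffle(items)
        n = len(items)
        n_train = n * ratios[0] // total
        n_val = n * ratios[1] // total
        train.extend(items[:n_train])
        val.extend(items[n_train:n_train + n_val])
        test.extend(items[n_train + n_val:])
    return train, val, test
```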
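Training-configuration sketch. The Experiment Setup row reports AdamW with a learning rate of 10e-5 for all backbones, plus per-method hyper-parameters. The snippet below collects those reported values into a configuration and builds the optimizer; the dictionary layout and the placeholder backbone are assumptions made so the sketch runs without downloading a real PLM.

```python
# Reported training values collected into a config; structure is an assumption.
import torch
from torch import nn

training_config = {
    "optimizer": "AdamW",
    "learning_rate": 10e-5,  # learning rate as quoted in the paper
}

method_hparams = {
    "EWC":   {"lambda": 1_000_000, "gamma": 0.2},
    "ER":    {"buffer_size": 200},
    "HAT":   {"smax": 400},
    "DERPP": {"alpha": 0.5, "beta": 1.0},
}

# Placeholder backbone (hidden size and class count are assumptions);
# in practice this would be the chosen pretrained language model.
backbone = nn.Linear(768, 150)
optimizer = torch.optim.AdamW(backbone.parameters(), lr=training_config["learning_rate"])
```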