Pretrained Language Model in Continual Learning: A Comparative Study
Authors: Tongtong Wu, Massimo Caccia, Zhuang Li, Yuan-Fang Li, Guilin Qi, Gholamreza Haffari
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experimental analyses reveal interesting performance differences across PLMs and across CL methods. We conduct experiments over (1) two primary continual learning settings, including task-incremental learning and class-incremental learning; (2) three benchmark datasets with different data distributions and task definitions, including relation extraction, event classification, and intent detection; (3) four CL approaches with six baseline methods implemented for systematic comparison; and (4) five pretrained language models. |
| Researcher Affiliation | Academia | Tongtong Wu¹,², Massimo Caccia³, Zhuang Li², Yuan-Fang Li², Guilin Qi¹, Gholamreza Haffari²; ¹Southeast University, ²Monash University, ³MILA |
| Pseudocode | Yes | Algorithm 1: Function of Layer Evaluation, EvaluateLayer() |
| Open Source Code | Yes | To encourage more research on continual learning in NLP, we release the code and dataset as an open-access resource on https://github.com/wutong8023/PLM4CL.git. |
| Open Datasets | Yes | We evaluate our methods on 3 datasets with distinct label distributions, covering the following domains. CLINC150 (Larson et al., 2019) is an intent classification dataset... Maven (Wang et al., 2020) is a long-tailed event detection dataset... WebRED (Ormandi et al., 2021) is a severely long-tailed relation classification dataset... To encourage more research on continual learning in NLP, we release the code and dataset as an open-access resource on https://github.com/wutong8023/PLM4CL.git. |
| Dataset Splits | Yes | For each class, we randomly split the data into train, validation and test sets with a 10:2:3 ratio. (A per-class split sketch is given after this table.) |
| Hardware Specification | No | The computational resources for this work were supported by the Multi-modal Australian ScienceS Imaging and Visualisation Environment (MASSIVE) (www.massive.org.au). |
| Software Dependencies | No | To provide a fair comparison among CL methods, we train all the networks using the AdamW (Mosbach et al., 2021) optimizer, and select 10e-5 as the learning rate for all pretrained backbone models. |
| Experiment Setup | Yes | To provide a fair comparison among CL methods, we train all the networks using the AdamW (Mosbach et al., 2021) optimizer, and select 10e-5 as the learning rate for all pretrained backbone models. (Table 2 lists method-specific hyper-parameters: EWC: λ = 1,000,000, γ = 0.2; ER: buffer size 200; HAT: smax = 400; DERPP: α = 0.5, β = 1.) (An optimizer and hyper-parameter sketch is given after this table.) |
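
The per-class 10:2:3 train/validation/test split quoted above can be reproduced along the following lines. This is a minimal sketch, not the released code; the function name, the `label_key` field, and the random seed are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code): per-class 10:2:3 split into
# train / validation / test, as described in the paper.
import random
from collections import defaultdict

def split_per_class(examples, label_key="label", ratios=(10, 2, 3), seed=0):
    """Split a list of example dicts per class with a 10:2:3 ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ex in examples:
        by_class[ex[label_key]].append(ex)

    train, val, test = [], [], []
    total = sum(ratios)
    for _, items in by_class.items():
        rng.shuffle(items)
        n = len(items)
        n_train = n * ratios[0] // total
        n_val = n * ratios[1] // total
        train.extend(items[:n_train])
        val.extend(items[n_train:n_train + n_val])
        test.extend(items[n_train + n_val:])
    return train, val, test
```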
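
The reported training setup (AdamW for every pretrained backbone, the quoted "10e-5" learning rate, and the Table 2 hyper-parameters for EWC, ER, HAT and DERPP) can be gathered into a small configuration such as the sketch below. This is a hedged illustration, not the authors' released code: the helper name, the `transformers` backbone string, and the number of labels are assumptions.

```python
# Minimal sketch (assumed names, not the released code): optimizer setup and
# the CL-method hyper-parameters quoted from the paper's Table 2.
import torch
from transformers import AutoModelForSequenceClassification

CL_HYPERPARAMS = {
    "EWC":   {"lambda": 1_000_000, "gamma": 0.2},
    "ER":    {"buffer_size": 200},
    "HAT":   {"smax": 400},
    "DERPP": {"alpha": 0.5, "beta": 1.0},
}

def build_model_and_optimizer(backbone="bert-base-uncased",  # assumed backbone
                              num_labels=150,                 # illustrative (CLINC150)
                              lr=10e-5):                      # the "10e-5" rate quoted above
    """AdamW optimizer for a pretrained classification backbone, as in the paper."""
    model = AutoModelForSequenceClassification.from_pretrained(
        backbone, num_labels=num_labels
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    return model, optimizer
```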