Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study
Authors: Mingxu Tao, Yansong Feng, Dongyan Zhao
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments reveal that BERT can actually generate high quality representations for previously learned tasks in the long term, under extremely sparse replay or even no replay. We further introduce a series of novel methods to interpret the mechanism of forgetting and how memory rehearsal plays a significant role in task incremental learning, which bridges the gap between our new discovery and previous studies about catastrophic forgetting. |
| Researcher Affiliation | Academia | Mingxu Tao (1,2), Yansong Feng (1,3), Dongyan Zhao (1,2); 1 Wangxuan Institute of Computer Technology, Peking University, China; 2 Center for Data Science, Peking University, China; 3 The MOE Key Laboratory of Computational Linguistics, Peking University, China. {thomastao, fengyansong, zhaody}@pku.edu.cn |
| Pseudocode | Yes | Algorithm 1: Calculating the Representation Cone |
| Open Source Code | No | Code will be released at https://github.com/kobayashikanna01/plms_are_lifelong_learners |
| Open Datasets | Yes | Its text classification part is rearranged from five datasets used by Zhang et al. (2015), consisting of 4 text classification tasks: news classification (AGNews, 4 classes), ontology prediction (DBPedia, 14 classes), sentiment analysis (Amazon and Yelp, 5 shared classes), topic classification (Yahoo, 10 classes). ... As for question answering, this benchmark contains 3 datasets: SQuAD 1.1 (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017), and QuAC (Choi et al., 2018). |
| Dataset Splits | No | The paper mentions training and testing examples but does not explicitly describe a separate validation set or its split. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments, such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments. |
| Experiment Setup | Yes | To compare with prior works (d'Autume et al., 2019; Wang et al., 2020b), we retain consistent experimental setups with them, where the maximum length of tokens and batch size are set to 128 and 32, respectively. ... We employ Adam (Kingma & Ba, 2015) as the optimizer. ... On each task, the model is finetuned for 15K steps... We set batch size as 16 and learning rate as 3 × 10⁻⁵ without decay. |
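
The quoted setup (maximum sequence length 128, batch size 32 for the classification comparison, Adam with a constant learning rate of 3 × 10⁻⁵, and 15K fine-tuning steps per task) is concrete enough to sketch the corresponding training loop. The snippet below is a minimal illustration only, not the authors' implementation (their code had not been released at the time of this report); it assumes the Hugging Face `transformers` and PyTorch APIs, and `load_task_dataloader` is a hypothetical helper standing in for the benchmark-specific data pipeline.

```python
# Sketch of sequential (task-incremental) fine-tuning of BERT on the
# text-classification tasks, using the hyperparameters quoted above.
# `load_task_dataloader` is hypothetical; num_labels=33 assumes a single
# head over the union of the 4 tasks' label sets (4 + 14 + 5 + 10).
import torch
from torch.optim import Adam
from transformers import BertForSequenceClassification, BertTokenizerFast

MAX_LENGTH = 128         # maximum number of tokens per example
BATCH_SIZE = 32          # the quote uses 16 for the QA experiments
LEARNING_RATE = 3e-5     # constant, no decay
STEPS_PER_TASK = 15_000  # fine-tuning steps on each task

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=33
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = Adam(model.parameters(), lr=LEARNING_RATE)

# One possible task order; the paper evaluates several orderings.
tasks = ["agnews", "dbpedia", "amazon", "yelp", "yahoo"]

for task in tasks:
    loader = load_task_dataloader(task, tokenizer, MAX_LENGTH, BATCH_SIZE)
    batches = iter(loader)
    model.train()
    for step in range(STEPS_PER_TASK):
        try:
            batch = next(batches)
        except StopIteration:        # restart the loader on small tasks
            batches = iter(loader)
            batch = next(batches)
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss   # cross-entropy over the shared label space
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Details the quoted text leaves open, such as whether optimizer state is reset between tasks or how sparse replay examples are interleaved, are deliberately omitted here; the sketch simply continues optimization from one task to the next.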