Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study

Authors: Mingxu Tao, Yansong Feng, Dongyan Zhao

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments reveal that BERT can actually generate high-quality representations for previously learned tasks over the long term, under extremely sparse replay or even no replay. We further introduce a series of novel methods to interpret the mechanism of forgetting and how memory rehearsal plays a significant role in task-incremental learning, which bridges the gap between our new discovery and previous studies about catastrophic forgetting.
Researcher Affiliation | Academia | Mingxu Tao (1,2), Yansong Feng (1,3), Dongyan Zhao (1,2); 1 Wangxuan Institute of Computer Technology, Peking University, China; 2 Center for Data Science, Peking University, China; 3 The MOE Key Laboratory of Computational Linguistics, Peking University, China; {thomastao, fengyansong, zhaody}@pku.edu.cn
Pseudocode | Yes | Algorithm 1: Calculating the Representation Cone
Open Source Code | No | Code will be released at https://github.com/kobayashikanna01/plms_are_lifelong_learners
Open Datasets | Yes | Its text classification part is rearranged from five datasets used by Zhang et al. (2015), consisting of 4 text classification tasks: news classification (AGNews, 4 classes), ontology prediction (DBPedia, 14 classes), sentiment analysis (Amazon and Yelp, 5 shared classes), topic classification (Yahoo, 10 classes). ... As for question answering, this benchmark contains 3 datasets: SQuAD 1.1 (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017), and QuAC (Choi et al., 2018).
Dataset Splits | No | The paper mentions training and testing examples but does not explicitly describe a separate validation set or its split.
Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup | Yes | To compare with prior works (d'Autume et al., 2019; Wang et al., 2020b), we retain consistent experimental setups with them, where the maximum length of tokens and batch size are set to 128 and 32, respectively. ... We employ Adam (Kingma & Ba, 2015) as the optimizer. ... On each task, the model is finetuned for 15K steps... We set batch size as 16 and learning rate as 3e-5 without decay.
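
The quoted setup is concrete enough to sketch in code. Below is a minimal, hypothetical reconstruction of the per-task fine-tuning configuration; the framework (PyTorch with HuggingFace Transformers), the bert-base-uncased checkpoint, the 4-class label count, and the training_step helper are assumptions not stated in the paper, while the maximum token length (128), Adam optimizer, batch size of 16, learning rate of 3e-5 without decay, and 15K steps per task come from the quoted excerpt.

```python
# Sketch of the reported fine-tuning configuration (assumptions: PyTorch +
# HuggingFace Transformers, bert-base-uncased, a sequence-classification head;
# none of these frameworks or checkpoints are confirmed by the paper).
import torch
from torch.optim import Adam
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"   # assumed checkpoint
MAX_LENGTH = 128                   # maximum token length (stated)
BATCH_SIZE = 16                    # batch size for per-task fine-tuning (stated)
LEARNING_RATE = 3e-5               # stated, used without decay
TRAIN_STEPS = 15_000               # fine-tuning steps on each task (stated)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels=4 is illustrative (e.g., AGNews); it varies by task.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4)
optimizer = Adam(model.parameters(), lr=LEARNING_RATE)  # Adam with constant LR

def training_step(batch_texts, batch_labels):
    """Run one optimization step on a batch of (text, label) pairs."""
    enc = tokenizer(batch_texts, truncation=True, max_length=MAX_LENGTH,
                    padding=True, return_tensors="pt")
    out = model(**enc, labels=torch.tensor(batch_labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```

In a task-incremental run, this loop would be repeated for TRAIN_STEPS on each task in sequence, reusing the same encoder so that forgetting (or its absence) can be probed on earlier tasks.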