Effective Continual Learning for Text Classification with Lightweight Snapshots

Authors: Jue Wang, Dajie Dong, Lidan Shou, Ke Chen, Gang Chen

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments over various task sequences show that our approach effectively mitigates catastrophic forgetting and outperforms all baselines. We conduct extensive experiments over several task sequences, including cross-dataset and cross-language task sequences. The results show that our approach can effectively mitigate catastrophic forgetting under different task settings without using any old training data.
Researcher Affiliation | Academia | (1) Key Lab of Intelligent Computing Based Big Data of Zhejiang Province, Zhejiang University; (2) College of Computer Science and Technology, Zhejiang University
Pseudocode | Yes | Algorithm 1: Overall training procedure.
Open Source Code | Yes | Code available at: https://github.com/LorrinWWW/Snapshot.
Open Datasets | Yes | THUCNews dataset (Sun et al. 2016), AG’s news corpus (Zhang, Zhao, and LeCun 2015), Yelp reviews (Asghar 2016), Amazon reviews (McAuley and Leskovec 2013), and DBPedia dataset (Zhang, Zhao, and LeCun 2015). ... SNIPS benchmark (Coucke et al. 2018)
Dataset Splits | No | The paper uses standard datasets but does not explicitly provide the specific percentages or counts for the training, validation, and test splits used in its experiments. It mentions 'tuning hyperparameters by performing a grid search', which implies a validation set, but the split method is not detailed.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory specifications) used to run the experiments.
Software Dependencies | No | The paper mentions using the Adam optimizer and fine-tuning a BERT model, but it does not specify version numbers for any key software components or libraries (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | We use the Adam optimizer, and set the learning rate to 2e-5 for the global model and 1e-4 for the adapter-based snapshot. We set the training batch size to 32. We tune the hyperparameters by performing a grid search over d ∈ [12, 96], T ∈ [1, 6], and M ∈ [1, 4]. Our approach works well with a bottleneck size d = 48, temperature T = 3, and maximum number of snapshots for each training step M = 3. We train each task for one epoch and report the average results over 3 runs with different random seeds.
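To make the quoted experiment setup concrete, the sketch below collects the reported hyperparameters (learning rates of 2e-5 and 1e-4, batch size 32, d = 48, T = 3, M = 3, one epoch per task, 3 seeds) into a single configuration and shows one way the two learning rates could be attached to separate Adam parameter groups. The toy model, its module names, and the single-optimizer arrangement are illustrative assumptions, not the authors' implementation; the actual code is in the linked repository.

```python
# Illustrative sketch only: collects the hyperparameters quoted in the
# "Experiment Setup" row above. The model structure, parameter naming, and
# the adapter/global split are hypothetical placeholders, not the authors' code.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class ExperimentConfig:
    lr_global: float = 2e-5     # learning rate for the global model
    lr_snapshot: float = 1e-4   # learning rate for the adapter-based snapshot
    batch_size: int = 32
    bottleneck_d: int = 48      # adapter bottleneck size d (grid search over [12, 96])
    temperature: float = 3.0    # distillation temperature T (grid search over [1, 6])
    max_snapshots: int = 3      # max snapshots per training step M (grid search over [1, 4])
    epochs_per_task: int = 1
    num_seeds: int = 3


class ToyAdapterClassifier(nn.Module):
    """Placeholder encoder plus bottleneck adapter; stands in for BERT + adapter."""

    def __init__(self, hidden: int = 768, num_labels: int = 5, d: int = 48):
        super().__init__()
        self.encoder = nn.Linear(hidden, hidden)    # hypothetical "global" part
        self.adapter = nn.Sequential(               # hypothetical snapshot adapter
            nn.Linear(hidden, d), nn.ReLU(), nn.Linear(d, hidden)
        )
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, x):
        h = self.encoder(x)
        h = h + self.adapter(h)                     # residual bottleneck adapter
        return self.classifier(h)


cfg = ExperimentConfig()
model = ToyAdapterClassifier(d=cfg.bottleneck_d)

# Two Adam parameter groups mirror the quoted 2e-5 / 1e-4 learning rates.
adapter_params = list(model.adapter.parameters())
global_params = [p for n, p in model.named_parameters() if not n.startswith("adapter")]
optimizer = torch.optim.Adam(
    [
        {"params": global_params, "lr": cfg.lr_global},
        {"params": adapter_params, "lr": cfg.lr_snapshot},
    ]
)

# Smoke test on random data with the quoted batch size.
logits = model(torch.randn(cfg.batch_size, 768))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 5, (cfg.batch_size,)))
loss.backward()
optimizer.step()
```

Since the paper trains the global model and the adapter-based snapshot with their own learning rates, two separate optimizers would work just as well; the parameter-group form above is simply the most compact way to show both rates in one place.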