Achieving Forgetting Prevention and Knowledge Transfer in Continual Learning
Authors: Zixuan Ke, Bing Liu, Nianzu Ma, Hu Xu, Lei Shu
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate the effectiveness of CTR. Empirical evaluations show that CTR outperforms strong baselines. Ablation experiments have also been conducted to study where to insert the CL-plugin module in BERT in order to achieve the best performance (see Sec. 5.4). |
| Researcher Affiliation | Collaboration | ¹Department of Computer Science, University of Illinois at Chicago; ²Facebook AI Research; ³Amazon AWS AI. ¹{zke4,liub,nma4}@uic.edu, ²huxu@fb.com, ³shulindt@gmail.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The code of CTR can be found at https://github.com/ZixuanKe/PyContinual |
| Open Datasets | Yes | We employ a set of 10 DSC datasets (reviews of 10 products) to produce sequences of tasks. The products are Sports, Toys, Tools, Video, Pet, Musical, Movies, Garden, Offices, and Kindle [22]. We employ a set of 19 ASC datasets (review sentences of 19 products) to produce sequences of tasks. Each dataset represents a task. The datasets are from 4 sources: (1) HL5Domains [17] with reviews of 5 products; (2) Liu3Domains [32] with reviews of 3 products; (3) Ding9Domains [9] with reviews of 9 products; and (4) SemEval14 with reviews of 2 products (SemEval 2014 Task 4, laptop and restaurant). Text classification uses the 20News data. This dataset [28] has 20 classes and each class has about 1000 documents. |
| Dataset Splits | Yes | In training each task, we use its validation set to decide when to stop training. The same validation reviews (250 positive and 250 negative) and the same test reviews (250 positive and 250 negative) are used in both experiments. For sources (1), (2), and (3), we use about 10% of the original data as the validation data and another roughly 10% as the test data (a minimal split sketch follows the table). For (4), we use 150 examples from the training set for validation. The data split for train/validation/test is 1600/200/200. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU model, CPU type, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using BERT-base (uncased) and the Adam optimizer but does not give version numbers for the software dependencies or libraries (e.g., Python, PyTorch, TensorFlow) that would be needed for reproducibility. |
| Experiment Setup | Yes | Unless otherwise stated, the same hyper-parameters are used in the experiments on the ASC, DSC, and 20News datasets. For the knowledge sharing module (KSM), we employ a 2-layer fully connected network with dimension 768 in the TK-Layer, and 3 transfer capsules. For the task specific module (TSM), the final and hidden layers have 2000 dimensions, and the task ID embeddings have 2000 dimensions. A fully connected layer with softmax output serves as the classification head on the last layer of BERT, trained with the categorical cross-entropy loss; a dropout of 0.5 is applied between fully connected layers. We adopt BERT-base (uncased) with a maximum input length of 128. We use the Adam optimizer with a learning rate of 3e-5. Training runs for 10 epochs on the SemEval datasets and 30 epochs on the other ASC datasets; for DSC, 20 epochs; for 20News, 10 epochs. The batch size is 32 in all cases (the configuration sketch after this table collects these values). |
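To make the split procedure in the Dataset Splits row concrete, here is a minimal sketch of an approximately 80/10/10 split as described for the HL5Domains, Liu3Domains, and Ding9Domains sources. The paper does not publish its split script, so the function name, seed, and exact proportions below are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the ~80/10/10 split described for sources (1)-(3).
# The helper name and seed are illustrative assumptions.
import random

def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle a list of labeled examples and split it into train/validation/test."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test
```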
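The Experiment Setup row lists enough hyper-parameters to reconstruct a training configuration. The sketch below collects them in one place as a PyTorch / Hugging Face Transformers setup; only the numeric values come from the row above, while the variable names and the use of the stock `transformers` BERT classes are assumptions (the authors' CL-plugin modules, which would add their own parameters to the optimizer, are not reproduced here).

```python
# Hyper-parameters quoted from the Experiment Setup row, collected into one config.
# The surrounding setup code is an illustrative sketch, not the authors' implementation.
import torch
from transformers import BertModel, BertTokenizer

CONFIG = {
    "backbone": "bert-base-uncased",   # BERT-base (uncased)
    "max_seq_length": 128,
    "batch_size": 32,
    "learning_rate": 3e-5,             # Adam optimizer
    "dropout": 0.5,                    # between fully connected layers
    "tsm_hidden_dim": 2000,            # TSM final and hidden layers
    "task_id_embedding_dim": 2000,
    "ksm_tk_layer_dims": [768, 768],   # 2 fully connected layers in the TK-Layer
    "num_transfer_capsules": 3,
    # Epochs are dataset-dependent, as the row notes:
    "epochs": {"asc_semeval": 10, "asc_other": 30, "dsc": 20, "20news": 10},
}

tokenizer = BertTokenizer.from_pretrained(CONFIG["backbone"])
backbone = BertModel.from_pretrained(CONFIG["backbone"])
# In the full system the optimizer would also cover the CL-plugin parameters.
optimizer = torch.optim.Adam(backbone.parameters(), lr=CONFIG["learning_rate"])
```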