Towards Continual Knowledge Learning of Language Models

Authors: Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Stanley Jungkyu Choi, Minjoon Seo

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We construct a new benchmark and metric to quantify the retention of time-invariant world knowledge, the update of outdated knowledge, and the acquisition of new knowledge. We adopt applicable recent methods from literature to create several strong baselines. Through extensive experiments, we find that CKL exhibits unique challenges that are not addressed in previous CL setups, where parameter expansion is necessary to reliably retain and learn knowledge simultaneously.
Researcher Affiliation | Collaboration | Joel Jang¹, Seonghyeon Ye¹, Sohee Yang¹, Joongbo Shin², Janghoon Han², Gyeonghun Kim², Stanley Jungkyu Choi², Minjoon Seo¹ (¹KAIST AI, ²LG AI Research); {joeljang,vano1205,sohee.yang,minjoon}@kaist.ac.kr, {jb.shin,janghoon.han,ghkayne.kim,stanleyjk.choi}@lgresearch.ai
Pseudocode | No | No structured pseudocode or algorithm blocks (e.g., sections labeled 'Pseudocode' or 'Algorithm') were found in the paper. The methodology is described in narrative text and flow diagrams.
Open Source Code | Yes | The benchmark datasets, model checkpoints, and code to reproduce our results are available at this https URL.
Open Datasets | Yes | The benchmark datasets, model checkpoints, and code to reproduce our results are available at this https URL. ... We first construct CC-RECENTNEWS, a novel text corpus containing relatively new knowledge as D1. ... We create INVARIANTLAMA, a subset of the LAMA (Petroni et al., 2019) task... ... We construct UPDATEDLAMA and NEWLAMA for measuring the update of outdated knowledge and acquisition of new knowledge during CKL. ... We use news-please (Hamborg et al., 2017), similar to the CCNEWS (Liu et al., 2019) and REALNEWS (Zellers et al., 2019) datasets, to crawl 221,779 news articles published from May 2020 to April 2021. ... T5 was initially pretrained on the C4 dataset (about 750 GB), which is a cleansed dump of Common Crawl extracted from the web in April 2019. ... LAMA (Petroni et al., 2019) ... T-REx (Elsahar et al., 2018)
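To make the quoted corpus-construction step concrete, below is a minimal sketch that uses news-please's Python API to keep only articles published between May 2020 and April 2021, as described for CC-RECENTNEWS. The URL list, cleaning, and deduplication logic are placeholders, not the authors' pipeline; their exact crawl setup is in the released code.

```python
# Sketch of collecting recent news text for a CC-RECENTNEWS-style corpus,
# assuming news-please's Python API (NewsPlease.from_url).
from datetime import datetime
from newsplease import NewsPlease

START, END = datetime(2020, 5, 1), datetime(2021, 4, 30)

def collect_recent_news(urls):
    corpus = []
    for url in urls:
        try:
            article = NewsPlease.from_url(url)   # fetch and parse a single article
        except Exception:
            continue                             # skip pages that fail to download/parse
        if article is None or article.date_publish is None or not article.maintext:
            continue
        if START <= article.date_publish <= END:  # keep May 2020 - April 2021 only
            corpus.append({"title": article.title, "text": article.maintext})
    return corpus
```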
Dataset Splits | Yes | We search for the hyperparameters such as training epochs, batch size, input size, output size, and learning rate of each individual KILT task to match the T5-base dev performance reported by Petroni et al. (2021). ... Table 9: Hyperparameters and dataset details for all tasks of KILT. ... Train Size ... Dev Size
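The per-task search quoted above amounts to a small grid search whose target is the published T5-base dev score. The sketch below illustrates that procedure under stated assumptions: the grid values and the train_and_eval() helper are hypothetical placeholders, not the authors' code.

```python
# Hedged sketch: for each KILT task, pick the configuration whose dev score is
# closest to the number reported by Petroni et al. (2021).
import itertools

GRID = {"epochs": [2, 3, 5], "batch_size": [32, 64], "lr": [1e-4, 3e-4, 1e-3]}

def match_reported_dev(task, reported_dev_score, train_and_eval):
    best_cfg, best_gap = None, float("inf")
    for values in itertools.product(*GRID.values()):
        cfg = dict(zip(GRID, values))
        dev_score = train_and_eval(task, **cfg)   # fine-tune, then evaluate on the dev split
        gap = abs(dev_score - reported_dev_score)
        if gap < best_gap:
            best_cfg, best_gap = cfg, gap
    return best_cfg
```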
Hardware Specification | Yes | For all of the experiments, we use 4 32GB V100 GPUs for training with each method except Mix-Review, where we use 16 32GB V100 GPUs. ... For both tuning processes, 4 V100 32GB GPUs are used.
Software Dependencies | No | The paper mentions specific models (T5, GPT-2, BERT, RoBERTa) and references tools such as news-please (Hamborg et al., 2017) and the Transformers library (Wolf et al., 2020), but it does not provide explicit version numbers for the programming languages, libraries, or frameworks used for implementation (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | The input and output sequence length is fixed to 350. We use gradient accumulation for cases where the same number of training batches could not be loaded on the GPUs due to the varying memory consumption required for different methods, and set the global batch size to 60. We use the Adafactor optimizer with an initial learning rate of 1e-3. We use learning rate warm-up for the first 10% of training and linearly decay the learning rate to half of the initial learning rate towards the end of training. ... The details of the configurations used for evaluation on each individual CKL task are provided in Appendix C. ... RecAdam (Chen et al., 2020): we use the same hyperparameter setting for the optimizer as in Chen et al. (2020), setting the coefficient of the quadratic penalty γ to 5,000 and selecting the best t0 from {100, 250, 500, 1,000} and k from {0.05, 0.1, 0.2, 0.5, 1} for the annealing coefficient λ(t). ... Mix-Review (He et al., 2021): the mix-decay and mix-ratio are set to 4 and 0.7, respectively... LoRA (Hu et al., 2021): we use the optimal rank r of 4 and adapt both Wq and Wv in the self-attention module... K-Adapter (Wang et al., 2021b): we implement k = 2, 3 for both T5 and GPT-2... For INVARIANTLAMA, the input and output length is fixed to 25 and 4, respectively. For UPDATEDLAMA and NEWLAMA, the input and output length is 50 and 10, respectively. Lastly, the input and output length is 150 and 10, respectively, for NEWLAMA-EASY.
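For orientation, the following is a hedged sketch of the continued-pretraining configuration quoted above: Adafactor at learning rate 1e-3, warm-up over the first 10% of steps, then linear decay to half of the initial rate, with sequence length 350 and a global batch of 60 reached via gradient accumulation. The total step count and model size are placeholders; the authors' actual training loop is in their released code.

```python
# Sketch of the optimizer and learning-rate schedule described in the paper.
import torch
from transformers import T5ForConditionalGeneration
from transformers.optimization import Adafactor

model = T5ForConditionalGeneration.from_pretrained("t5-large")
optimizer = Adafactor(model.parameters(), lr=1e-3,
                      relative_step=False, scale_parameter=False, warmup_init=False)

total_steps = 10_000                       # placeholder; depends on corpus size and epochs
warmup_steps = int(0.10 * total_steps)     # warm-up for the first 10% of training

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                  # linear warm-up to 1e-3
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 1.0 - 0.5 * progress                             # linear decay to 0.5e-3

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

MAX_SEQ_LEN = 350     # input and output sequence length during continued pretraining
GLOBAL_BATCH = 60     # effective batch size, reached via gradient accumulation if needed
```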