Episodic Memory in Lifelong Language Learning

Authors: Cyprien de Masson d'Autume, Sebastian Ruder, Lingpeng Kong, Dani Yogatama

Venue: NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on text classification and question answering demonstrate the complementary benefits of sparse experience replay and local adaptation to allow the model to continuously learn from new datasets. We evaluate our proposed model against several baselines on text classification and question answering tasks. Table 1 provides a summary of our main results.
Researcher Affiliation | Industry | Cyprien de Masson d'Autume, Sebastian Ruder, Lingpeng Kong, Dani Yogatama. DeepMind, London, United Kingdom. {cyprien,ruder,lingpenk,dyogatama}@google.com
Pseudocode | Yes | Algorithm 1 Training. Algorithm 2 Inference. (Hedged sketches of both algorithms follow the table.)
Open Source Code | No | The paper mentions 'https://github.com/google-research/bert', but this refers to the BERT model that the authors used, not the source code for their specific methodology described in the paper.
Open Datasets | Yes | We use publicly available text classification datasets from Zhang et al. (2015) to evaluate our models (http://goo.gl/JyCnZq). We use three question answering datasets: SQuAD 1.1 (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017), and QuAC (Choi et al., 2018).
Dataset Splits | Yes | We create a balanced version of all datasets used in our experiments by randomly sampling 115,000 training examples and 7,600 test examples from all datasets (i.e., the size of the smallest training and test sets). In total, we have 575,000 training examples and 38,000 test examples. SQuAD... It includes almost 90,000 training examples and 10,000 validation examples. TriviaQA... 76,000 training examples and 10,000 (unverified) validation examples, whereas the Wikipedia section has about 60,000 training examples and 8,000 validation examples. QuAC... 80,000 training examples and approximately 7,000 validation examples. (The balanced-split construction is sketched after the table.)
Hardware Specification | Yes | For each experiment, we use 4 Intel Skylake x86-64 CPUs at 2 GHz, 1 Nvidia Tesla V100 GPU, and 20 GB of RAM.
Software Dependencies | No | The paper mentions using a 'pretrained BERTBASE model' and 'Adam' as an optimizer, but it does not specify version numbers for any software libraries or frameworks (e.g., Python, TensorFlow, PyTorch) that would be needed for replication.
Experiment Setup | Yes | We use Adam (Kingma & Ba, 2015) as our optimizer. We set dropout (Srivastava et al., 2014) to 0.1 and λ in Eq. 1 to 0.001. We set the base learning rate to 3e-5 (based on preliminary experiments, in line with the suggested learning rate for using BERT). For text classification, we use a training batch of size 32. For question answering, the batch size is 8. The only hyperparameter that we tune is the local adaptation learning rate {5e-3, 1e-3}. We set the number of neighbors K = 32 and the number of local adaptation steps L = 30. (These settings are used in the sketches after the table.)
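
As an illustration of the balanced splits quoted in the Dataset Splits row, the following is a minimal Python sketch of subsampling every dataset to 115,000 training and 7,600 test examples. The load_dataset helper and its return format are hypothetical placeholders, not the authors' code.

    import random

    TRAIN_PER_DATASET = 115_000  # size of the smallest training set
    TEST_PER_DATASET = 7_600     # size of the smallest test set

    def load_dataset(name):
        # Placeholder for loading one of the Zhang et al. (2015) datasets.
        raise NotImplementedError

    def build_balanced_splits(dataset_names, seed=0):
        """Randomly subsample every dataset to the same train/test size."""
        rng = random.Random(seed)
        train, test = [], []
        for name in dataset_names:
            train_examples, test_examples = load_dataset(name)
            train.extend(rng.sample(train_examples, TRAIN_PER_DATASET))
            test.extend(rng.sample(test_examples, TEST_PER_DATASET))
        # With five datasets this yields 575,000 training and 38,000 test examples.
        return train, test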
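
Algorithm 1 (Training) combines ordinary gradient updates with writes to an episodic memory and sparse experience replay. The sketch below is a schematic reconstruction under the quoted hyperparameters (Adam, base learning rate 3e-5); the replay interval, replay batch size, the memory write/sample interface, and the model returning a .loss attribute are assumptions for illustration, not the paper's released implementation.

    import torch

    def train(model, memory, data_stream, lr=3e-5,
              replay_interval=100, replay_batch_size=32):
        """Schematic lifelong training loop with sparse experience replay."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for step, batch in enumerate(data_stream):
            # Ordinary update on the incoming batch.
            loss = model(batch).loss          # assumed: forward pass returns .loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Store the new examples in episodic memory (assumed interface).
            memory.write(batch)

            # Sparse experience replay: occasionally revisit stored examples.
            if step > 0 and step % replay_interval == 0:
                replay_loss = model(memory.sample(replay_batch_size)).loss
                optimizer.zero_grad()
                replay_loss.backward()
                optimizer.step()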
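
Algorithm 2 (Inference) performs local adaptation: for each test example it retrieves K = 32 neighbors from memory, takes L = 30 gradient steps on them starting from the trained weights while penalizing drift from those weights with weight λ = 0.001, predicts, and discards the adapted copy. The sketch below assumes a nearest-neighbor retrieval interface on the memory and uses plain SGD with one of the quoted local adaptation learning rates (1e-3); these choices are illustrative, not taken from the authors' code.

    import copy
    import torch

    def locally_adapted_predict(model, memory, example,
                                K=32, L=30, adapt_lr=1e-3, lam=1e-3):
        """Schematic local adaptation at inference time."""
        neighbors = memory.nearest_neighbors(example, k=K)   # assumed retrieval interface
        base_params = [p.detach().clone() for p in model.parameters()]

        adapted = copy.deepcopy(model)                       # adapt a throwaway copy
        optimizer = torch.optim.SGD(adapted.parameters(), lr=adapt_lr)
        for _ in range(L):
            loss = adapted(neighbors).loss                   # assumed: forward pass returns .loss
            # L2 penalty keeping the adapted weights close to the base weights.
            reg = sum(((p - p0) ** 2).sum()
                      for p, p0 in zip(adapted.parameters(), base_params))
            total = loss + lam * reg
            optimizer.zero_grad()
            total.backward()
            optimizer.step()

        with torch.no_grad():
            return adapted(example)                          # prediction with adapted weights

Because the adapted copy is thrown away after each prediction, the base parameters remain unchanged across test examples, which is what makes the adaptation "local".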