Memory-based Parameter Adaptation

Authors: Pablo Sprechmann, Siddhant M. Jayakumar, Jack W. Rae, Alexander Pritzel, Adrià Puigdomènech Badia, Benigno Uria, Oriol Vinyals, Demis Hassabis, Razvan Pascanu, Charles Blundell

ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our work generalises these approaches and we present experimental results where we apply our model to both continual or incremental learning tasks, as well as language modelling.
Researcher Affiliation | Industry | Pablo Sprechmann*, Siddhant M. Jayakumar*, Jack W. Rae, Alexander Pritzel, Adrià Puigdomènech Badia, Benigno Uria, Oriol Vinyals, Demis Hassabis, Razvan Pascanu, Charles Blundell. DeepMind, London, UK. {psprechmann, sidmj, jwrae, apritzel, adriap, buria, vinyals, dhcontact, razp, cblundell}@google.com
Pseudocode | Yes | Algorithm 1: Memory-based Parameter Adaptation (a hedged code sketch is given below the table)
Open Source Code | No | Not found. The paper does not provide any explicit statements about releasing source code or links to a code repository.
Open Datasets | Yes | We considered the permuted MNIST setup (Goodfellow et al., 2013). ... Specifically we considered the problem of image classification on the ImageNet dataset (Russakovsky et al., 2015). ... We considered two datasets with established performance benchmarks, Penn Treebank (PTB; Marcus et al., 1993) and WikiText-2 (Merity et al., 2016).
Dataset Splits | Yes | We trained all models using 10,000 examples per task, comparing to elastic weight consolidation (EWC; Kirkpatrick et al., 2017) and regular gradient descent training. ... Hyperparameters were tuned for all models using the first split and the validation set, and we report the average performance on the remaining splits evaluated on the test set. ... Penn Treebank is a small text corpus containing 887,521 train tokens, 70,390 validation tokens, and 78,669 test tokens; with a vocabulary size of 10,000. ... WikiText-2 is a larger text corpus than PTB, derived from Wikipedia articles. It contains 2,088,628 train tokens, 217,646 validation tokens, and 245,569 test tokens, with a vocabulary of 33,278.
Hardware Specification | No | Not found. The paper does not mention any specific hardware details such as GPU or CPU models used for experiments.
Software Dependencies | No | In all cases we rely on a two-layer MLP and use Adam (Kingma & Ba, 2014) as the optimiser. ... For MbPA, we used ... RMSprop with a local learning rate α_M ... For both datasets we used a single-layer LSTM baseline trained with Adam (Kingma & Ba, 2014) using the regularisation techniques described in Melis et al. (2017).
Experiment Setup | Yes | The EWC penalty cost was chosen using a grid search, as was the local MbPA learning rate (between 0.0 and 1.0) and the number of optimisation steps for MbPA (between 1 and 20). ... MbPA was applied at test time, using RMSprop with a local learning rate α_M and the number of optimisation steps (as in Algorithm 1) tuned as hyper-parameters. ... We swept over the following hyper-parameters: memory size N ∈ {500, 1000, 5000}; nearest neighbours K ∈ {256, 512}; cache interpolation λ_cache ∈ {0, 0.05, 0.1, 0.15}; MbPA interpolation λ_mbpa ∈ {0, 0.05, 0.1, 0.15}; number of MbPA optimisation steps T ∈ {1, 5, 10}; MbPA optimisation learning rate α ∈ {0.01, 0.1, 0.15, 0.2, 0.5, 1}. ... The optimal parameters were: N = 5000, K = 256, λ_cache = 0.15, λ_mbpa = 0.1, T = 1, α = 0.15. (A sweep-configuration sketch is given below the table.)
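
The Pseudocode row above points to Algorithm 1 (memory-based parameter adaptation). Since no source code is released, the following is a minimal sketch of the test-time adaptation step, assuming a fixed embedding network `embed`, an adaptable output network `output_net`, and an episodic memory of (embedding key, class label) pairs stored as the tensors `memory_keys` and `memory_targets`. The names, the PyTorch framework, the inverse-squared-distance kernel weighting, and the L2 pull-back weight are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of memory-based parameter adaptation (MbPA) at test time.
# A copy of the output network is adapted on the K nearest memories for the
# current query, used once for prediction, then discarded.
import copy
import torch
import torch.nn.functional as F

def mbpa_predict(embed, output_net, memory_keys, memory_targets, x,
                 K=256, steps=1, local_lr=0.15, l2_scale=0.1):
    """Adapt a per-query copy of the output network on retrieved memories and predict."""
    with torch.no_grad():
        q = embed(x)                                          # query embedding, shape [D]
        dists = torch.cdist(q.unsqueeze(0), memory_keys)[0]   # distances to all stored keys
        idx = dists.topk(K, largest=False).indices            # K nearest neighbours
        w = 1.0 / (1e-4 + dists[idx] ** 2)                    # closer neighbours weigh more
        w = w / w.sum()

    adapted = copy.deepcopy(output_net)                       # temporary, per-query copy
    original = [p.detach().clone() for p in output_net.parameters()]
    opt = torch.optim.RMSprop(adapted.parameters(), lr=local_lr)

    for _ in range(steps):
        opt.zero_grad()
        logits = adapted(memory_keys[idx])                    # output net consumes embeddings
        nll = F.cross_entropy(logits, memory_targets[idx], reduction='none')
        # Weighted NLL plus an L2 term pulling parameters back towards the
        # unadapted values; l2_scale is an illustrative choice.
        reg = sum(((p - p0) ** 2).sum()
                  for p, p0 in zip(adapted.parameters(), original))
        loss = (w * nll).sum() + l2_scale * reg
        loss.backward()
        opt.step()

    with torch.no_grad():
        return adapted(q.unsqueeze(0))                        # prediction with adapted params
```

The defaults K = 256, steps = 1, and local_lr = 0.15 mirror the best language-modelling setting reported in the Experiment Setup row (K = 256, T = 1, α = 0.15).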
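
The Experiment Setup row lists the language-modelling hyper-parameter sweep and its reported optimum. The sketch below restates those values as a plain Python dictionary with exhaustive grid enumeration; the dictionary keys and the grid helper are illustrative assumptions, since the paper does not describe the sweep tooling.

```python
# Hedged sketch of the language-modelling hyper-parameter sweep quoted above.
from itertools import product

SWEEP = {
    "memory_size_N":        [500, 1000, 5000],
    "nearest_neighbours_K": [256, 512],
    "lambda_cache":         [0.0, 0.05, 0.1, 0.15],
    "lambda_mbpa":          [0.0, 0.05, 0.1, 0.15],
    "mbpa_steps_T":         [1, 5, 10],
    "mbpa_lr_alpha":        [0.01, 0.1, 0.15, 0.2, 0.5, 1.0],
}

# Optimal configuration as reported in the paper.
BEST = {"memory_size_N": 5000, "nearest_neighbours_K": 256,
        "lambda_cache": 0.15, "lambda_mbpa": 0.1,
        "mbpa_steps_T": 1, "mbpa_lr_alpha": 0.15}

def grid(sweep):
    """Yield every configuration in the full Cartesian grid."""
    keys = list(sweep)
    for values in product(*(sweep[k] for k in keys)):
        yield dict(zip(keys, values))
```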