Memory-based Parameter Adaptation
Authors: Pablo Sprechmann, Siddhant M. Jayakumar, Jack W. Rae, Alexander Pritzel, Adrià Puigdomènech Badia, Benigno Uria, Oriol Vinyals, Demis Hassabis, Razvan Pascanu, Charles Blundell
ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our work generalises these approaches and we present experimental results where we apply our model to both continual or incremental learning tasks, as well as language modelling. |
| Researcher Affiliation | Industry | Pablo Sprechmann*, Siddhant M. Jayakumar*, Jack W. Rae, Alexander Pritzel, Adrià Puigdomènech Badia, Benigno Uria, Oriol Vinyals, Demis Hassabis, Razvan Pascanu, Charles Blundell. DeepMind, London, UK. {psprechmann, sidmj, jwrae, apritzel, adriap, buria, vinyals, dhcontact, razp, cblundell}@google.com |
| Pseudocode | Yes | Algorithm 1 Memory-based Parameter Adaptation |
| Open Source Code | No | Not found. The paper does not provide any explicit statements about releasing source code or links to a code repository. |
| Open Datasets | Yes | We considered the permuted MNIST setup (Goodfellow et al., 2013). ... Specifically we considered the problem of image classification on the ImageNet dataset (Russakovsky et al., 2015). ... We considered two datasets with established performance benchmarks, Penn Treebank (PTB; Marcus et al., 1993) and WikiText-2 (Merity et al., 2016). |
| Dataset Splits | Yes | We trained all models using 10,000 examples per task, comparing to elastic weight consolidation (EWC; Kirkpatrick et al., 2017) and regular gradient descent training. ... Hyperparameters were tuned for all models using the first split and the validation set, and we report the average performance on the remaining splits evaluated on the test set. ... Penn Treebank is a small text corpus containing 887,521 train tokens, 70,390 validation tokens, and 78,669 test tokens; with a vocabulary size of 10,000. ... WikiText-2 is a larger text corpus than PTB, derived from Wikipedia articles. It contains 2,088,628 train tokens, 217,646 validation tokens, and 245,569 test tokens, with a vocabulary of 33,278. |
| Hardware Specification | No | Not found. The paper does not mention any specific hardware details such as GPU or CPU models used for experiments. |
| Software Dependencies | No | In all cases we rely on a two-layer MLP and use Adam (Kingma & Ba, 2014) as the optimiser. ... For MbPA, we used... RMSprop with a local learning rate αM... ... For both datasets we used a single-layer LSTM baseline trained with Adam (Kingma & Ba, 2014) using the regularisation techniques described in Melis et al. (2017). |
| Experiment Setup | Yes | The EWC penalty cost was chosen using a grid search, as was the local MbPA learning rate (between 0.0 and 1.0) and the number of optimisation steps for MbPA (between 1 and 20). ... MbPA was applied at test time, using RMSprop with a local learning rate αM and the number of optimisation steps (as in Algorithm 1) tuned as hyper-parameters. ... We swept over the following hyper-parameters: memory size N ∈ {500, 1000, 5000}; nearest neighbours K ∈ {256, 512}; cache interpolation λcache ∈ {0, 0.05, 0.1, 0.15}; MbPA interpolation λmbpa ∈ {0, 0.05, 0.1, 0.15}; number of MbPA optimisation steps T ∈ {1, 5, 10}; MbPA optimisation learning rate α ∈ {0.01, 0.1, 0.15, 0.2, 0.5, 1}. ... The optimal parameters were: N = 5000, K = 256, λcache = 0.15, λmbpa = 0.1, T = 1, α = 0.15. |
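
The Pseudocode row refers to Algorithm 1 (MbPA), which adapts the output network's parameters locally at test time from an episodic memory of (embedding, label) pairs. Since no source code is released, the sketch below is only a rough illustration under stated assumptions: the function and variable names, the inverse-distance kernel, and the in-place parameter handling are choices of this sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mbpa_adapt_and_predict(output_net, embedding, memory_keys, memory_values,
                           num_neighbours=256, steps=1, lr=0.15, lam=0.1):
    """Sketch of MbPA-style local adaptation for a single query.

    output_net    : output network g(.; theta) applied to embeddings
    embedding     : [d] query embedding from the (frozen) embedding network
    memory_keys   : [N, d] stored embeddings, memory_values: [N] stored labels
    """
    # 1. Retrieve the K nearest neighbours of the query in embedding space.
    dists = torch.cdist(embedding.unsqueeze(0), memory_keys).squeeze(0)      # [N]
    k = min(num_neighbours, memory_keys.shape[0])
    dists, idx = torch.topk(dists, k=k, largest=False)
    neigh_keys, neigh_vals = memory_keys[idx], memory_values[idx]

    # 2. Kernel-based weights: closer neighbours contribute more (assumed kernel).
    w = 1.0 / (1e-4 + dists)
    w = w / w.sum()

    # 3. Adapt a local copy of the output parameters with a weighted NLL plus
    #    an L2 term pulling back towards the unadapted parameters.
    theta0 = [p.detach().clone() for p in output_net.parameters()]
    optimiser = torch.optim.RMSprop(output_net.parameters(), lr=lr)
    for _ in range(steps):
        optimiser.zero_grad()
        logits = output_net(neigh_keys)                                      # [K, C]
        nll = F.cross_entropy(logits, neigh_vals, reduction="none")          # [K]
        prox = sum(((p - p0) ** 2).sum()
                   for p, p0 in zip(output_net.parameters(), theta0))
        loss = (w * nll).sum() + 0.5 * lam * prox
        loss.backward()
        optimiser.step()

    # 4. Predict with the adapted parameters, then restore the originals so the
    #    adaptation stays local to this query.
    prediction = output_net(embedding.unsqueeze(0))
    with torch.no_grad():
        for p, p0 in zip(output_net.parameters(), theta0):
            p.copy_(p0)
    return prediction
```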
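The continual-learning experiments use permuted MNIST (Goodfellow et al., 2013), where each task applies one fixed random permutation to the pixels of every image. A minimal way to generate such tasks, assuming pre-flattened images (the loader and array shapes are assumptions, not the authors' pipeline):

```python
import numpy as np

def make_permuted_tasks(images, num_tasks=10, seed=0):
    """images: [N, 784] flattened MNIST digits; returns one permuted copy per task."""
    rng = np.random.RandomState(seed)
    tasks = []
    for _ in range(num_tasks):
        perm = rng.permutation(images.shape[1])   # one fixed pixel permutation per task
        tasks.append(images[:, perm])
    return tasks
```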
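The Experiment Setup row lists the grid swept for the language-modelling experiments. One straightforward way to enumerate that grid is shown below; the value lists are copied from the quoted sweep, while `evaluate_config` is a hypothetical stand-in for training and validating MbPA under one configuration.

```python
from itertools import product

# Hyper-parameter grid reported for the PTB / WikiText-2 experiments.
grid = {
    "memory_size":    [500, 1000, 5000],                   # N
    "num_neighbours": [256, 512],                          # K
    "lambda_cache":   [0.0, 0.05, 0.1, 0.15],
    "lambda_mbpa":    [0.0, 0.05, 0.1, 0.15],
    "mbpa_steps":     [1, 5, 10],                          # T
    "mbpa_lr":        [0.01, 0.1, 0.15, 0.2, 0.5, 1.0],    # alpha
}

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
# best = min(configs, key=evaluate_config)  # reported optimum: N=5000, K=256,
#                                           # lambda_cache=0.15, lambda_mbpa=0.1,
#                                           # T=1, alpha=0.15
```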