Fast Parametric Learning with Activation Memorization
Authors: Jack Rae, Chris Dyer, Peter Dayan, Timothy Lillicrap
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate this model adapts quickly to novel classes in a simple image classification task using handwritten characters from Omniglot (Lake et al., 2015). We then show it improves overall test perplexity for two medium-scale language modelling corpora, WikiText-103 (Wikipedia articles) from Merity et al. (2016) and Project Gutenberg (books), alongside a large-scale corpus GigaWord v5 (news articles) from Parker et al. (2011). By splitting accuracy over word frequency buckets, we see improved perplexity for less frequent words. |
| Researcher Affiliation | Collaboration | 1DeepMind, London, UK 2CoMPLEX, Computer Science, University College London, London, UK 3Gatsby Computational Neuroscience Unit, University College London, UK. |
| Pseudocode | Yes | Algorithm 1 Hebbian Softmax batched update (a hedged sketch of this update is given after the table) |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the methodology described. |
| Open Datasets | Yes | Omniglot data (Lake et al., 2015), WikiText-103 (Wikipedia articles) from Merity et al. (2016) and Project Gutenberg (books), alongside a large-scale corpus GigaWord v5 (news articles) from Parker et al. (2011). |
| Dataset Splits | Yes | "We partition the first 5 examples per class to a test set, and assign the rest for training." and "2017 training books (175,181,505 tokens), 12 validation books (609,545 tokens), and 13 test books (526,646 tokens)" |
| Hardware Specification | Yes | 6 days of training with 8 P100s training synchronously. |
| Software Dependencies | No | The paper mentions optimizers (RMSProp, Adam, AdaGrad) but does not provide specific version numbers for software dependencies or libraries used. |
| Experiment Setup | Yes | "Models were trained with 20% dropout on the final layer and a small amount of data augmentation was applied to training examples (rotation ∈ [−30, 30], translation) to avoid overfitting." and "Hyper-parameters and further training details are described in Appendix A.1." (Appendix A.1 mentions: "The LSTM language models are trained with a learning rate of 0.2 using Adam optimizer with β1 = 0, β2 = 0.999." "We used a batch size of 64.") A configuration sketch collecting these reported values follows the table. |
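
Regarding the "Pseudocode" row: the paper's Algorithm 1 is a batched Hebbian Softmax update, in which the softmax weight row for every class that occurs in a batch is interpolated between its ordinary gradient-updated value and the (class-averaged) final-layer activation, with a mixing rate that decays as the class is seen more often. The snippet below is a minimal sketch of that idea, not a reproduction of Algorithm 1; the function name, the NumPy formulation, and the default values of `gamma` and `T` are assumptions, and the exact mixing schedule used in the paper may differ.

```python
import numpy as np

def hebbian_softmax_update(theta_sgd, hidden, labels, counters,
                           gamma=0.01, T=500):
    """Hedged sketch of a Hebbian-style softmax update (not the paper's exact Algorithm 1).

    theta_sgd : (num_classes, dim) softmax weights *after* the usual gradient step.
    hidden    : (batch, dim) final-layer activations for the batch.
    labels    : (batch,) target class indices.
    counters  : (num_classes,) int array counting how often each class has been
                updated; modified in place.
    """
    theta = theta_sgd.copy()
    for c in np.unique(labels):
        # Average the activations observed for this class in the batch.
        h_bar = hidden[labels == c].mean(axis=0)
        counters[c] += 1
        # Mixing coefficient: near 1 for novel classes, annealing towards a small
        # floor gamma, and switching off (pure SGD) once the class is well learned.
        lam = max(1.0 / counters[c], gamma) if counters[c] < T else 0.0
        # Interpolate memorized activation with the gradient-updated weight row.
        theta[c] = lam * h_bar + (1.0 - lam) * theta[c]
    return theta
```

The intent of this kind of update is that a class seen for the first time essentially copies its activation into the corresponding softmax weights (lam close to 1), giving fast binding for rare words or novel Omniglot classes, while frequently seen classes fall back to ordinary parametric learning.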
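
Regarding the "Experiment Setup" row: the sketch below collects the reported hyper-parameters into a plain Python configuration. Only the quoted values (Adam betas, learning rate, batch size, dropout rate, rotation range) come from the paper; the dictionary names and structure are illustrative.

```python
# Hedged summary of the reported training configuration; field names are assumed.
LM_TRAINING_CONFIG = {
    "model": "LSTM language model",
    "optimizer": "Adam",
    "learning_rate": 0.2,   # "trained with a learning rate of 0.2 using Adam optimizer"
    "adam_beta1": 0.0,      # "β1 = 0"
    "adam_beta2": 0.999,    # "β2 = 0.999"
    "batch_size": 64,       # "We used a batch size of 64."
}

OMNIGLOT_TRAINING_CONFIG = {
    "final_layer_dropout": 0.2,         # "20% dropout on the final layer"
    "augmentation": {
        "rotation_range": (-30, 30),    # "rotation ∈ [−30, 30]"
        "translation": True,            # small translations of training examples
    },
}
```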