Memory Architectures in Recurrent Neural Network Language Models
Authors: Dani Yogatama, Yishu Miao, Gábor Melis, Wang Ling, Adhiguna Kuncoro, Chris Dyer, Phil Blunsom
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on the Penn Treebank and Wikitext-2 datasets show that stack-based memory architectures consistently achieve the best performance in terms of held out perplexity. We also propose a generalization to existing continuous stack models (Joulin & Mikolov, 2015; Grefenstette et al., 2015) to allow a variable number of pop operations more naturally that further improves performance. We further evaluate these language models in terms of their ability to capture non-local syntactic dependencies on a subject-verb agreement dataset (Linzen et al., 2016) and establish new state of the art results using memory augmented language models. (A hedged sketch of a single-pop continuous stack follows the table.) |
| Researcher Affiliation | Collaboration | DeepMind and University of Oxford; dyogatama@google.com, yishu.miao@cs.ox.ac.uk, {melisgl,lingwang,akuncoro,cdyer,pblunsom}@google.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks, only mathematical formulations and textual descriptions of the models. |
| Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the methodology is openly available. |
| Open Datasets | Yes | We use standard language modeling datasets, the Penn Treebank (PTB) and Wikitext-2 (Wik-2) corpora to evaluate perplexity. We evaluate these memory models for learning syntax-sensitive dependencies on the number prediction dataset from Linzen et al. (2016). |
| Dataset Splits | No | The paper uses the Penn Treebank and Wikitext-2 corpora, which have standard splits, and mentions a development set for tuning on the Linzen dataset. However, it does not explicitly report the percentages or sample counts of the training, validation, and test splits for any dataset, nor does it cite the exact predefined splits used. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions techniques such as recurrent dropout and RMSprop, but does not list software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions, or other libraries). |
| Experiment Setup | Yes | Following Inan et al. (2017), we tie word embedding and word classifier layers and apply dropout to these layers with probability 0.6 (value chosen based on preliminary experiment results). We also use recurrent dropout (Semeniuta et al., 2016) and set it to 0.1. We perform non-episodic training with batch size 32 using RMSprop (Hinton, 2012) as our optimization method. We tune the RMSprop learning rate and ℓ2 regularization parameter for all models on a development set by random search from [0.004, 0.009] and [0.0001, 0.0005] respectively, and use perplexity on the development set to choose the best model. (A hedged configuration sketch follows the table.) |
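
The abstract quoted in the Research Type row builds on continuous stack models with differentiable push and pop operations. As context for that row, below is a minimal sketch of a single-pop continuous stack update in the style of Joulin & Mikolov (2015); it is not the paper's multi-pop generalization, and the function name, tensor shapes, and vector-valued stack cells are assumptions made for illustration (PyTorch is used only as convenient notation).

```python
import torch
import torch.nn.functional as F

def continuous_stack_step(stack, h_t, W_act, W_push):
    """One update of a single-pop continuous stack (after Joulin & Mikolov, 2015).

    stack  : (depth, m) tensor of stack cells, row 0 is the top.
    h_t    : (d,) current RNN hidden state.
    W_act  : (3, d) projection producing push/pop/no-op logits.
    W_push : (m, d) projection producing the value written on a push.
    """
    a = F.softmax(W_act @ h_t, dim=0)        # soft weights for (push, pop, no-op)
    push_val = torch.sigmoid(W_push @ h_t)   # candidate value for the top cell

    # Stack shifted as if one element were popped (cells move toward the top).
    popped = torch.cat([stack[1:], torch.zeros_like(stack[:1])])
    # Stack shifted as if one element were pushed (cells move away from the top).
    pushed = torch.cat([torch.zeros_like(stack[:1]), stack[:-1]])

    new_top = a[0] * push_val + a[1] * popped[0] + a[2] * stack[0]
    new_rest = a[0] * pushed[1:] + a[1] * popped[1:] + a[2] * stack[1:]
    return torch.cat([new_top.unsqueeze(0), new_rest])
```

Iterating this step inside a recurrent cell and feeding the stack top back into the next hidden-state computation gives the stack-augmented recurrence of the cited work; the paper's contribution generalizes the single pop weight to a variable number of pop operations per step, which is not implemented here.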
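
The Experiment Setup row reports tied embedding and classifier layers with dropout 0.6, recurrent dropout 0.1, non-episodic training with batch size 32, and RMSprop with the learning rate and ℓ2 penalty drawn by random search from [0.004, 0.009] and [0.0001, 0.0005]. Since the paper releases no code, the following is only a sketch of that configuration under stated assumptions: the class, layer sizes, and helper names are hypothetical, and the recurrent dropout term is omitted because the stock PyTorch LSTM does not expose it.

```python
import random

import torch
import torch.nn as nn

class TiedLSTMLanguageModel(nn.Module):
    """Hypothetical baseline LM illustrating the reported setup: tied embedding
    and softmax weights (Inan et al., 2017) with dropout 0.6 on those layers.
    Recurrent dropout of 0.1 (Semeniuta et al., 2016) is omitted here."""

    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.drop = nn.Dropout(0.6)          # dropout on the tied layers, as reported
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)
        self.out.weight = self.embed.weight  # weight tying

    def forward(self, tokens, state=None):
        x = self.drop(self.embed(tokens))
        h, state = self.lstm(x, state)
        return self.out(self.drop(h)), state


def sample_rmsprop(model, rng):
    """Draw one random-search configuration from the reported ranges."""
    lr = rng.uniform(0.004, 0.009)       # RMSprop learning rate range
    l2 = rng.uniform(0.0001, 0.0005)     # l2 regularization (weight decay) range
    return torch.optim.RMSprop(model.parameters(), lr=lr, weight_decay=l2)


model = TiedLSTMLanguageModel(vocab_size=10000, dim=512)  # sizes are assumptions
optimizer = sample_rmsprop(model, random.Random(0))
# Non-episodic training with batch size 32 would then iterate over contiguous
# text batches, carrying the LSTM state across steps and selecting the model
# by development-set perplexity.
```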