Frustratingly Short Attention Spans in Neural Language Modeling
Authors: Michał Daniluk, Tim Rocktäschel, Johannes Welbl, Sebastian Riedel
ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate models on two different corpora for language modeling. The first is a subset of the Wikipedia corpus. It consists of 7500 English Wikipedia articles... In addition to this Wikipedia corpus, we also run experiments on the Children's Book Test (CBT; Hill et al., 2016). |
| Researcher Affiliation | Academia | Michał Daniluk, Tim Rocktäschel, Johannes Welbl & Sebastian Riedel Department of Computer Science University College London michal.daniluk.15@ucl.ac.uk, {t.rocktaschel,j.welbl,s.riedel}@cs.ucl.ac.uk |
| Pseudocode | No | The paper describes methods using mathematical equations but does not include any explicit pseudocode blocks or algorithms labeled as such. |
| Open Source Code | No | The paper only provides a link to the Wikipedia corpus dataset ('The wikipedia corpus is available at https://goo.gl/s8cyYa.') and does not mention providing access to the source code for the methodology described. |
| Open Datasets | Yes | We evaluate models on two different corpora for language modeling. The first is a subset of the Wikipedia corpus. It consists of 7500 English Wikipedia articles (dump from 6 Feb 2015)... The wikipedia corpus is available at https://goo.gl/s8cyYa. In addition to this Wikipedia corpus, we also run experiments on the Children's Book Test (CBT; Hill et al., 2016). |
| Dataset Splits | Yes | Subsequently, we split this corpus into a train, development, and test part, resulting in corpora of 22.5M, 1.2M, and 1.2M words, respectively. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, memory, or detailed computer specifications used for running experiments. |
| Software Dependencies | No | The paper mentions ADAM for optimization and LSTM as a model component, but does not provide specific software dependencies with version numbers (e.g., Python version, library versions like PyTorch, TensorFlow, etc.). |
| Experiment Setup | Yes | We use ADAM (Kingma & Ba, 2015) with an initial learning rate of 0.001 and a mini-batch size of 64 for optimization. Furthermore, we apply gradient clipping at a gradient norm of 5 (Pascanu et al., 2013). The bias of the LSTM's forget gate is initialized to 1 (Jozefowicz et al., 2016), while other parameters are initialized uniformly from the range (−0.1, 0.1). Backpropagation Through Time (Rumelhart et al., 1985; Werbos, 1990) was used to train the network with 20 steps of unrolling. |
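
The quoted experiment setup maps directly onto standard framework calls, which is useful when attempting a re-implementation. The sketch below is a minimal illustration in PyTorch (an assumption; the paper does not name a framework or release code). The model sizes (`VOCAB_SIZE`, `EMBED_DIM`, `HIDDEN_DIM`) and the `train_step` helper are hypothetical placeholders; only the optimizer, learning rate, batch size, unroll length, clipping norm, and initialization scheme come from the quoted setup.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted in the Experiment Setup row.
BATCH_SIZE = 64     # mini-batch size
BPTT_STEPS = 20     # truncated BPTT unroll length
CLIP_NORM = 5.0     # gradient clipping at norm 5
LR = 0.001          # ADAM initial learning rate

# Placeholder sizes -- the paper's exact values are not reproduced here.
VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 10000, 300, 512

class LSTMLanguageModel(nn.Module):
    """A plain LSTM language model used only to illustrate the settings."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        hidden, state = self.lstm(self.embed(tokens), state)
        return self.out(hidden), state

model = LSTMLanguageModel(VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM)

# Uniform initialization in (-0.1, 0.1) for all parameters ...
for param in model.parameters():
    nn.init.uniform_(param, -0.1, 0.1)

# ... then set the forget-gate bias to 1. PyTorch splits the LSTM bias into
# bias_ih/bias_hh, each laid out as (input, forget, cell, output) gates.
for name, param in model.lstm.named_parameters():
    if name.startswith("bias_ih"):
        hidden = model.lstm.hidden_size
        param.data[hidden:2 * hidden].fill_(1.0)  # forget-gate slice

optimizer = torch.optim.Adam(model.parameters(), lr=LR)
criterion = nn.CrossEntropyLoss()

def train_step(inputs, targets, state=None):
    """One truncated-BPTT step over a (BATCH_SIZE, BPTT_STEPS) token window."""
    optimizer.zero_grad()
    logits, state = model(inputs, state)
    loss = criterion(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
    optimizer.step()
    # Detach the recurrent state so gradients stop at the 20-step window.
    state = tuple(s.detach() for s in state)
    return loss.item(), state
```

This is a sketch under the stated assumptions rather than the authors' implementation; in particular, their attention-augmented models and any learning-rate schedule or early-stopping criteria are not reflected here.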