Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Frustratingly Short Attention Spans in Neural Language Modeling
Authors: Michał Daniluk, Tim Rocktäschel, Johannes Welbl, Sebastian Riedel
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate models on two different corpora for language modeling. The first is a subset of the Wikipedia corpus. It consists of 7500 English Wikipedia articles... In addition to this Wikipedia corpus, we also run experiments on the Children's Book Test (CBT; Hill et al., 2016). |
| Researcher Affiliation | Academia | Michał Daniluk, Tim Rocktäschel, Johannes Welbl & Sebastian Riedel, Department of Computer Science, University College London, EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods using mathematical equations but does not include any explicit pseudocode blocks or algorithms labeled as such. |
| Open Source Code | No | The paper only provides a link to the Wikipedia corpus dataset ('The wikipedia corpus is available at https://goo.gl/s8cyYa.') and does not mention providing access to the source code for the methodology described. |
| Open Datasets | Yes | We evaluate models on two different corpora for language modeling. The first is a subset of the Wikipedia corpus. It consists of 7500 English Wikipedia articles (dump from 6 Feb 2015)... The wikipedia corpus is available at https://goo.gl/s8cyYa. In addition to this Wikipedia corpus, we also run experiments on the Children's Book Test (CBT; Hill et al., 2016). |
| Dataset Splits | Yes | Subsequently, we split this corpus into a train, development, and test part, resulting in corpora of 22.5M words, 1.2M and 1.2M words, respectively. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, memory, or detailed computer specifications used for running experiments. |
| Software Dependencies | No | The paper mentions ADAM for optimization and LSTM as a model component, but does not provide specific software dependencies with version numbers (e.g., Python version, library versions like PyTorch, TensorFlow, etc.). |
| Experiment Setup | Yes | We use ADAM (Kingma & Ba, 2015) with an initial learning rate of 0.001 and a mini-batch size of 64 for optimization. Furthermore, we apply gradient clipping at a gradient norm of 5 (Pascanu et al., 2013). The bias of the LSTM's forget gate is initialized to 1 (Jozefowicz et al., 2016), while other parameters are initialized uniformly from the range (−0.1, 0.1). Backpropagation Through Time (Rumelhart et al., 1985; Werbos, 1990) was used to train the network with 20 steps of unrolling. |
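
The Experiment Setup row can be read as a small training configuration. The sketch below is illustrative only and is not the authors' code, which the paper does not release. It collects the quoted hyperparameters and shows one plausible reading of the initialization and gradient-clipping rules in plain Python. The names `TrainConfig`, `init_uniform`, `init_lstm_bias`, and `clip_by_global_norm` are hypothetical, and the gate ordering (input, forget, cell, output) and the choice to clip by the norm over all parameters together are assumptions the quoted text does not settle.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values quoted in the Experiment Setup row.
    learning_rate: float = 0.001  # initial ADAM learning rate
    batch_size: int = 64          # mini-batch size
    clip_norm: float = 5.0        # gradient-norm clipping threshold
    init_range: float = 0.1       # uniform init in (-0.1, 0.1)
    forget_bias: float = 1.0      # LSTM forget-gate bias
    bptt_steps: int = 20          # unrolling steps for BPTT


def init_uniform(n, cfg, rng):
    """Draw n parameters uniformly from (-init_range, init_range)."""
    return [rng.uniform(-cfg.init_range, cfg.init_range) for _ in range(n)]


def init_lstm_bias(hidden, cfg, rng):
    """Bias for the four LSTM gates, stored as one flat vector.

    Assumption: gates are laid out as (input, forget, cell, output).
    Only the forget-gate slice is set to forget_bias; the rest is uniform.
    """
    bias = init_uniform(4 * hidden, cfg, rng)
    bias[hidden:2 * hidden] = [cfg.forget_bias] * hidden
    return bias


def global_norm(grads):
    """L2 norm over all gradient values in a list of flat gradient lists."""
    return math.sqrt(sum(g * g for group in grads for g in group))


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients jointly if their global norm exceeds max_norm."""
    norm = global_norm(grads)
    if norm <= max_norm:
        return [list(group) for group in grads]
    scale = max_norm / norm
    return [[g * scale for g in group] for group in grads]


cfg = TrainConfig()
rng = random.Random(0)
bias = init_lstm_bias(hidden=3, cfg=cfg, rng=rng)
clipped = clip_by_global_norm([[3.0, 4.0], [12.0]], cfg.clip_norm)
```

In this example the gradients `[[3.0, 4.0], [12.0]]` have a global norm of 13, so they are scaled by 5/13 and the clipped norm equals the threshold of 5. Gradients whose norm is already at or below 5 pass through unchanged.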