Frustratingly Short Attention Spans in Neural Language Modeling

Authors: Michał Daniluk, Tim Rocktäschel, Johannes Welbl, Sebastian Riedel

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate models on two different corpora for language modeling. The first is a subset of the Wikipedia corpus. It consists of 7500 English Wikipedia articles... In addition to this Wikipedia corpus, we also run experiments on the Children's Book Test (CBT; Hill et al., 2016).
Researcher Affiliation | Academia | Michał Daniluk, Tim Rocktäschel, Johannes Welbl & Sebastian Riedel, Department of Computer Science, University College London. michal.daniluk.15@ucl.ac.uk, {t.rocktaschel,j.welbl,s.riedel}@cs.ucl.ac.uk
Pseudocode | No | The paper describes methods using mathematical equations but does not include any explicit pseudocode blocks or algorithms labeled as such.
Open Source Code | No | The paper only provides a link to the Wikipedia corpus dataset ('The wikipedia corpus is available at https://goo.gl/s8cyYa.') and does not mention providing access to the source code for the described methodology.
Open Datasets | Yes | We evaluate models on two different corpora for language modeling. The first is a subset of the Wikipedia corpus. It consists of 7500 English Wikipedia articles (dump from 6 Feb 2015)... The wikipedia corpus is available at https://goo.gl/s8cyYa. In addition to this Wikipedia corpus, we also run experiments on the Children's Book Test (CBT; Hill et al., 2016).
Dataset Splits | Yes | Subsequently, we split this corpus into a train, development, and test part, resulting in corpora of 22.5M, 1.2M, and 1.2M words, respectively.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, memory, or other machine specifications used to run the experiments.
Software Dependencies | No | The paper mentions ADAM for optimization and an LSTM as a model component, but does not provide specific software dependencies with version numbers (e.g., the Python version or library versions such as PyTorch or TensorFlow).
Experiment Setup | Yes | We use ADAM (Kingma & Ba, 2015) with an initial learning rate of 0.001 and a mini-batch size of 64 for optimization. Furthermore, we apply gradient clipping at a gradient norm of 5 (Pascanu et al., 2013). The bias of the LSTM's forget gate is initialized to 1 (Jozefowicz et al., 2016), while other parameters are initialized uniformly from the range (-0.1, 0.1). Backpropagation Through Time (Rumelhart et al., 1985; Werbos, 1990) was used to train the network with 20 steps of unrolling.
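
As a rough illustration of the Experiment Setup row, the sketch below wires the quoted hyperparameters into a PyTorch training step: ADAM with learning rate 0.001, mini-batches of 64, gradient clipping at norm 5, forget-gate bias initialized to 1, remaining parameters drawn uniformly from (-0.1, 0.1), and 20 steps of truncated backpropagation through time. The LMModel class, vocabulary size, layer dimensions, and train_step wrapper are hypothetical placeholders, not the authors' implementation.

    # Minimal sketch of the quoted training configuration (not the authors' code).
    # Model sizes, the LMModel class, and the train_step wrapper are hypothetical.
    import torch
    import torch.nn as nn

    vocab_size, embed_dim, hidden_dim = 10000, 100, 100  # illustrative sizes

    class LMModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, x, state=None):
            h, state = self.lstm(self.embed(x), state)
            return self.out(h), state

    model = LMModel()

    # Initialize all parameters uniformly in (-0.1, 0.1) ...
    for p in model.parameters():
        nn.init.uniform_(p, -0.1, 0.1)
    # ... then set the LSTM forget-gate bias to 1 (PyTorch orders the bias
    # as [input, forget, cell, output], so the forget gate is the second quarter).
    for name, p in model.lstm.named_parameters():
        if "bias" in name:
            n = p.size(0)
            p.data[n // 4 : n // 2].fill_(1.0)

    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # ADAM, lr = 0.001
    criterion = nn.CrossEntropyLoss()
    bptt_steps, batch_size = 20, 64  # 20-step unrolling, mini-batches of 64

    def train_step(inputs, targets, state=None):
        """One truncated-BPTT step on (batch_size, bptt_steps) token tensors."""
        optimizer.zero_grad()
        logits, state = model(inputs, state)
        loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
        loss.backward()
        # Clip the global gradient norm at 5 before the parameter update.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
        optimizer.step()
        # Detach the hidden state so the next window does not backpropagate further.
        return loss.item(), tuple(s.detach() for s in state)

A caller would iterate over consecutive 20-token windows of the corpus in batches of 64, passing the detached state returned by one call into the next.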