Frustratingly Short Attention Spans in Neural Language Modeling

Authors: Michał Daniluk, Tim Rocktäschel, Johannes Welbl, Sebastian Riedel

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate models on two different corpora for language modeling. The first is a subset of the Wikipedia corpus. It consists of 7500 English Wikipedia articles... In addition to this Wikipedia corpus, we also run experiments on the Children's Book Test (CBT; Hill et al., 2016).
Researcher Affiliation | Academia | Michał Daniluk, Tim Rocktäschel, Johannes Welbl & Sebastian Riedel, Department of Computer Science, University College London. michal.daniluk.15@ucl.ac.uk, {t.rocktaschel,j.welbl,s.riedel}@cs.ucl.ac.uk
Pseudocode | No | The paper describes methods using mathematical equations but does not include any explicit pseudocode blocks or algorithms labeled as such.
Open Source Code | No | The paper only provides a link to the Wikipedia corpus dataset ('The wikipedia corpus is available at https://goo.gl/s8cyYa.') and does not mention providing access to the source code for the described methodology.
Open Datasets | Yes | We evaluate models on two different corpora for language modeling. The first is a subset of the Wikipedia corpus. It consists of 7500 English Wikipedia articles (dump from 6 Feb 2015)... The wikipedia corpus is available at https://goo.gl/s8cyYa. In addition to this Wikipedia corpus, we also run experiments on the Children's Book Test (CBT; Hill et al., 2016).
Dataset Splits | Yes | Subsequently, we split this corpus into a train, development, and test part, resulting in corpora of 22.5M, 1.2M, and 1.2M words, respectively.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, memory, or other machine specifications used to run the experiments.
Software Dependencies | No | The paper mentions ADAM for optimization and an LSTM as a model component, but does not provide specific software dependencies with version numbers (e.g., the Python version or library versions such as PyTorch or TensorFlow).
Experiment Setup | Yes | We use ADAM (Kingma & Ba, 2015) with an initial learning rate of 0.001 and a mini-batch size of 64 for optimization. Furthermore, we apply gradient clipping at a gradient norm of 5 (Pascanu et al., 2013). The bias of the LSTM's forget gate is initialized to 1 (Jozefowicz et al., 2016), while other parameters are initialized uniformly from the range (-0.1, 0.1). Backpropagation Through Time (Rumelhart et al., 1985; Werbos, 1990) was used to train the network with 20 steps of unrolling.
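
As a rough illustration of the Experiment Setup row, the sketch below wires the quoted hyperparameters into a PyTorch training step: ADAM with learning rate 0.001, mini-batches of 64, gradient clipping at norm 5, forget-gate bias initialized to 1, remaining parameters drawn uniformly from (-0.1, 0.1), and 20 steps of truncated backpropagation through time. The LMModel class, vocabulary size, layer dimensions, and train_step wrapper are hypothetical placeholders, not the authors' implementation.

    # Minimal sketch of the quoted training configuration (not the authors' code).
    # Model sizes, the LMModel class, and the train_step wrapper are hypothetical.
    import torch
    import torch.nn as nn

    vocab_size, embed_dim, hidden_dim = 10000, 100, 100  # illustrative sizes

    class LMModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, x, state=None):
            h, state = self.lstm(self.embed(x), state)
            return self.out(h), state

    model = LMModel()

    # Initialize all parameters uniformly in (-0.1, 0.1) ...
    for p in model.parameters():
        nn.init.uniform_(p, -0.1, 0.1)
    # ... then set the LSTM forget-gate bias to 1 (PyTorch orders the bias
    # as [input, forget, cell, output], so the forget gate is the second quarter).
    for name, p in model.lstm.named_parameters():
        if "bias" in name:
            n = p.size(0)
            p.data[n // 4 : n // 2].fill_(1.0)

    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # ADAM, lr = 0.001
    criterion = nn.CrossEntropyLoss()
    bptt_steps, batch_size = 20, 64  # 20-step unrolling, mini-batches of 64

    def train_step(inputs, targets, state=None):
        """One truncated-BPTT step on (batch_size, bptt_steps) token tensors."""
        optimizer.zero_grad()
        logits, state = model(inputs, state)
        loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
        loss.backward()
        # Clip the global gradient norm at 5 before the parameter update.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
        optimizer.step()
        # Detach the hidden state so the next window does not backpropagate further.
        return loss.item(), tuple(s.detach() for s in state)

A caller would iterate over consecutive 20-token windows of the corpus in batches of 64, passing the detached state returned by one call into the next.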