Character-Level Language Modeling with Deeper Self-Attention

Authors: Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, Llion Jones

AAAI 2019, pp. 3159-3166 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For evaluation we focus mainly on text8 (Mahoney 2009). This dataset consists of English Wikipedia articles... To aid in comparison with other recent approaches, we also evaluate our model on enwik8 (Mahoney 2009)... We report the performance of our best model (T64) on the validation and test sets. Table 1 compares our models against several recent results. On the test set, we achieve a new state of the art, 1.13 bpc... Ablation Experiments: To better understand the relative importance of the several modifications we proposed, we run an ablation analysis. (A short sketch converting cross-entropy loss to the bpc metric appears below the table.)
Researcher Affiliation | Industry | Rami Al-Rfou,* Dokook Choe,* Noah Constant,* Mandy Guo,* Llion Jones* Google AI, 1600 Amphitheatre Parkway, Mountain View, California 94043, {rmyeid, choed, nconstant, xyguo, llion}@google.com
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | No explicit statement providing access to the source code for the methodology described in this paper was found. The paper references the 'tensor2tensor' library but does not provide its own code.
Open Datasets | Yes | For evaluation we focus mainly on text8 (Mahoney 2009)... To aid in comparison with other recent approaches, we also evaluate our model on enwik8 (Mahoney 2009)...
Dataset Splits | Yes | Following Mikolov et al. (2012) and Zhang et al. (2016), we split the data into 90M characters for train, 5M characters for dev, and 5M characters for test. (A minimal split sketch appears below the table.)
Hardware Specification | Yes | Our best model is achieved after around 2.5 million steps of training, which takes 175 hours on a single Google Cloud TPU v2.
Software Dependencies | No | The paper mentions the 'tensor2tensor' library but does not specify its version or provide other software dependencies with version numbers.
Experiment Setup | Yes | Each transformer layer has a hidden size of 512 and a filter size of 2048. We feed our model sequences of length 512... The model has approximately 235 million parameters... To regularize the model, we apply dropout in the attention and ReLU layers with a probability of 0.55. We use the momentum optimizer with 0.99 momentum. The learning rate is fixed during training to 0.003. We train our model for 4 million steps, with each step processing a batch of 16 randomly selected sequences. We drop the intermediate layer losses consecutively, as described in the Intermediate Layer Losses section above. Starting from the first layer, after every 62.5K (= 4M × 1/2 × 1/64) steps, we drop the losses introduced by the next layer. (A hyperparameter and loss-schedule sketch appears below the table.)
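
The Research Type row above cites the paper's headline result of 1.13 bpc on the text8 test set. Bits per character (bpc) is the average per-character cross-entropy expressed in base 2; the helper below (a hypothetical name, not from the paper or its codebase) converts a loss reported in nats into bpc.

```python
import math

def nats_to_bpc(avg_loss_nats: float) -> float:
    """Convert an average per-character cross-entropy (in nats) to bits per character."""
    return avg_loss_nats / math.log(2)

# Example: a per-character loss of about 0.783 nats corresponds to roughly 1.13 bpc.
print(round(nats_to_bpc(0.783), 2))  # -> 1.13
```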
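
The Dataset Splits row reproduces the 90M/5M/5M character split of text8. Below is a minimal sketch of that split, assuming the 100M-character text8 file has already been downloaded locally; the path and function name are illustrative, not taken from the paper.

```python
def load_text8_splits(path: str = "text8"):
    """Split the 100M-character text8 corpus into train/dev/test as described in the paper."""
    with open(path, "r", encoding="utf-8") as f:
        data = f.read()
    train = data[:90_000_000]             # 90M characters for train
    dev = data[90_000_000:95_000_000]     # 5M characters for dev
    test = data[95_000_000:100_000_000]   # 5M characters for test
    return train, dev, test
```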
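
The Experiment Setup row quotes the hyperparameters and the schedule for consecutively dropping intermediate-layer losses. The sketch below is not the authors' tensor2tensor implementation; it only collects the quoted values in one place and shows one plausible way to schedule the dropping of per-layer losses. num_layers=64 is an assumption based on the paper's T64 model.

```python
# Hyperparameters quoted in the Experiment Setup row (text8 model).
HPARAMS = {
    "hidden_size": 512,
    "filter_size": 2048,
    "sequence_length": 512,
    "dropout": 0.55,          # applied in the attention and ReLU layers
    "optimizer": "momentum",
    "momentum": 0.99,
    "learning_rate": 0.003,   # fixed for the whole run
    "train_steps": 4_000_000,
    "batch_size": 16,
}

def active_intermediate_losses(step: int, num_layers: int = 64, drop_every: int = 62_500):
    """Return indices of layers whose auxiliary prediction losses are still active at `step`.

    Starting from the first layer, one more intermediate loss is dropped every
    `drop_every` steps, so that later in training only the deeper layers (and
    eventually just the final layer) contribute to the training loss.
    """
    num_dropped = min(step // drop_every, num_layers - 1)
    return list(range(num_dropped, num_layers))

# Example: all 64 losses are active at step 0; two have been dropped by step 125,000.
assert active_intermediate_losses(0) == list(range(0, 64))
assert active_intermediate_losses(125_000) == list(range(2, 64))
```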