Character-Level Language Modeling with Deeper Self-Attention

Authors: Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, Llion Jones

AAAI 2019, pp. 3159-3166 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For evaluation we focus mainly on text8 (Mahoney 2009). This dataset consists of English Wikipedia articles... To aid in comparison with other recent approaches, we also evaluate our model on enwik8 (Mahoney 2009)... We report the performance of our best model (T64) on the validation and test sets. Table 1 compares our models against several recent results. On the test set, we achieve a new state of the art, 1.13 bpc... Ablation Experiments: To better understand the relative importance of the several modifications we proposed, we run an ablation analysis. (A short sketch converting cross-entropy loss to the bpc metric appears below the table.)
Researcher Affiliation | Industry | Rami Al-Rfou,* Dokook Choe,* Noah Constant,* Mandy Guo,* Llion Jones* Google AI, 1600 Amphitheatre Parkway, Mountain View, California 94043, {rmyeid, choed, nconstant, xyguo, llion}@google.com
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | No explicit statement providing access to the source code for the methodology described in this paper was found. The paper references the 'tensor2tensor' library but does not provide its own code.
Open Datasets | Yes | For evaluation we focus mainly on text8 (Mahoney 2009)... To aid in comparison with other recent approaches, we also evaluate our model on enwik8 (Mahoney 2009)...
Dataset Splits | Yes | Following Mikolov et al. (2012) and Zhang et al. (2016), we split the data into 90M characters for train, 5M characters for dev, and 5M characters for test. (A minimal split sketch appears below the table.)
Hardware Specification | Yes | Our best model is achieved after around 2.5 million steps of training, which takes 175 hours on a single Google Cloud TPU v2.
Software Dependencies | No | The paper mentions the 'tensor2tensor' library but does not specify its version or provide other software dependencies with version numbers.
Experiment Setup | Yes | Each transformer layer has a hidden size of 512 and a filter size of 2048. We feed our model sequences of length 512... The model has approximately 235 million parameters... To regularize the model, we apply dropout in the attention and ReLU layers with a probability of 0.55. We use the momentum optimizer with 0.99 momentum. The learning rate is fixed during training to 0.003. We train our model for 4 million steps, with each step processing a batch of 16 randomly selected sequences. We drop the intermediate layer losses consecutively, as described in the Intermediate Layer Losses section above. Starting from the first layer, after every 62.5K (= 4M × 1/2 × 1/64) steps, we drop the losses introduced by the next layer. (A hyperparameter and loss-schedule sketch appears below the table.)
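
The Research Type row above cites the paper's headline result of 1.13 bpc on the text8 test set. Bits per character (bpc) is the average per-character cross-entropy expressed in base 2; the helper below (a hypothetical name, not from the paper or its codebase) converts a loss reported in nats into bpc.

```python
import math

def nats_to_bpc(avg_loss_nats: float) -> float:
    """Convert an average per-character cross-entropy (in nats) to bits per character."""
    return avg_loss_nats / math.log(2)

# Example: a per-character loss of about 0.783 nats corresponds to roughly 1.13 bpc.
print(round(nats_to_bpc(0.783), 2))  # -> 1.13
```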
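
The Dataset Splits row reproduces the 90M/5M/5M character split of text8. Below is a minimal sketch of that split, assuming the 100M-character text8 file has already been downloaded locally; the path and function name are illustrative, not taken from the paper.

```python
def load_text8_splits(path: str = "text8"):
    """Split the 100M-character text8 corpus into train/dev/test as described in the paper."""
    with open(path, "r", encoding="utf-8") as f:
        data = f.read()
    train = data[:90_000_000]             # 90M characters for train
    dev = data[90_000_000:95_000_000]     # 5M characters for dev
    test = data[95_000_000:100_000_000]   # 5M characters for test
    return train, dev, test
```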
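
The Experiment Setup row quotes the hyperparameters and the schedule for consecutively dropping intermediate-layer losses. The sketch below is not the authors' tensor2tensor implementation; it only collects the quoted values in one place and shows one plausible way to schedule the dropping of per-layer losses. num_layers=64 is an assumption based on the paper's T64 model.

```python
# Hyperparameters quoted in the Experiment Setup row (text8 model).
HPARAMS = {
    "hidden_size": 512,
    "filter_size": 2048,
    "sequence_length": 512,
    "dropout": 0.55,          # applied in the attention and ReLU layers
    "optimizer": "momentum",
    "momentum": 0.99,
    "learning_rate": 0.003,   # fixed for the whole run
    "train_steps": 4_000_000,
    "batch_size": 16,
}

def active_intermediate_losses(step: int, num_layers: int = 64, drop_every: int = 62_500):
    """Return indices of layers whose auxiliary prediction losses are still active at `step`.

    Starting from the first layer, one more intermediate loss is dropped every
    `drop_every` steps, so that later in training only the deeper layers (and
    eventually just the final layer) contribute to the training loss.
    """
    num_dropped = min(step // drop_every, num_layers - 1)
    return list(range(num_dropped, num_layers))

# Example: all 64 losses are active at step 0; two have been dropped by step 125,000.
assert active_intermediate_losses(0) == list(range(0, 64))
assert active_intermediate_losses(125_000) == list(range(2, 64))
```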