Character-Level Language Modeling with Deeper Self-Attention
Authors: Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, Llion Jones
AAAI 2019, pp. 3159-3166
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For evaluation we focus mainly on text8 (Mahoney 2009). This dataset consists of English Wikipedia articles... To aid in comparison with other recent approaches, we also evaluate our model on enwik8 (Mahoney 2009)... We report the performance of our best model (T64) on the validation and test sets. Table 1 compares our models against several recent results. On the test set, we achieve a new state of the art, 1.13 bpc... Ablation Experiments To better understand the relative importance of the several modifications we proposed, we run an ablation analysis. |
| Researcher Affiliation | Industry | Rami Al-Rfou,* Dokook Choe,* Noah Constant,* Mandy Guo,* Llion Jones* Google AI 1600 Amphitheatre Parkway Mountain View, California 94043 {rmyeid, choed, nconstant, xyguo, llion}@google.com |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | No explicit statement providing access to the source code for the methodology described in this paper was found. The paper references the 'tensor2tensor library' but does not provide its own code. |
| Open Datasets | Yes | For evaluation we focus mainly on text8 (Mahoney 2009)... To aid in comparison with other recent approaches, we also evaluate our model on enwik8 (Mahoney 2009)... |
| Dataset Splits | Yes | Following Mikolov et al. (2012) and Zhang et al. (2016), we split the data into 90M characters for train, 5M characters for dev, and 5M characters for test. |
| Hardware Specification | Yes | Our best model is achieved after around 2.5 million steps of training, which takes 175 hours on a single Google Cloud TPU v2. |
| Software Dependencies | No | The paper mentions the 'tensor2tensor library' but does not specify its version or provide other software dependencies with version numbers. |
| Experiment Setup | Yes | Each transformer layer has a hidden size of 512 and a filter size of 2048. We feed our model sequences of length 512... The model has approximately 235 million parameters... To regularize the model, we apply dropout in the attention and ReLU layers with a probability of 0.55. We use the momentum optimizer with 0.99 momentum. The learning rate is fixed during training to 0.003. We train our model for 4 million steps, with each step processing a batch of 16 randomly selected sequences. We drop the intermediate layer losses consecutively, as described in the Intermediate Layer Losses section above. Starting from the first layer, after every 62.5K (= 4M · 1/2 · 1/64) steps, we drop the losses introduced by the next layer. |
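
The results quoted in the Research Type row are reported in bits per character (bpc), the standard metric on text8 and enwik8: the average negative base-2 log-likelihood the model assigns to each character. The snippet below is a minimal sketch of that conversion, not code from the paper; the function name and the example numbers are purely illustrative.

```python
import math

def bits_per_character(nll_nats, num_chars):
    """Convert a summed negative log-likelihood (in nats) over a character
    sequence into bits per character (bpc)."""
    return nll_nats / (num_chars * math.log(2))

# Illustrative numbers only: a total NLL of 3.9e6 nats over a 5M-character
# test set works out to roughly 1.13 bpc.
print(bits_per_character(3.9e6, 5_000_000))  # ~1.13
```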
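
The 90M/5M/5M split quoted in the Dataset Splits row is straightforward to reproduce, since text8 is a single 100M-character file. The sketch below is one plausible way to slice it; the file path and function name are assumptions for illustration and do not come from the paper or its code.

```python
def load_text8_splits(path="text8"):
    """Split the 100M-character text8 corpus into the 90M/5M/5M
    train/dev/test partition described in the paper."""
    with open(path, "r", encoding="utf-8") as f:
        data = f.read()
    assert len(data) >= 100_000_000, "text8 should contain 100M characters"
    train = data[:90_000_000]            # first 90M characters
    dev = data[90_000_000:95_000_000]    # next 5M characters
    test = data[95_000_000:100_000_000]  # final 5M characters
    return train, dev, test
```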
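
The Experiment Setup row collects the quoted hyperparameters for the best model (T64). The Python sketch below gathers them into a single config and implements one plausible reading of the intermediate-loss schedule (one layer's auxiliary loss dropped every 62.5K steps, with the final layer's loss always kept). The names `CONFIG` and `active_intermediate_losses` are illustrative assumptions, not identifiers from the paper or the tensor2tensor library.

```python
# Hyperparameters as quoted in the Experiment Setup row above.
CONFIG = dict(
    num_layers=64,              # T64, the deepest model reported
    hidden_size=512,
    filter_size=2048,
    sequence_length=512,
    batch_size=16,
    dropout=0.55,               # applied in attention and ReLU layers
    optimizer="momentum",
    momentum=0.99,
    learning_rate=0.003,        # fixed throughout training
    train_steps=4_000_000,
    loss_drop_interval=62_500,  # drop one layer's auxiliary loss every 62.5K steps
)

def active_intermediate_losses(step, config=CONFIG):
    """Return the indices of layers whose intermediate losses are still active.

    Starting from the first layer, one layer's auxiliary loss is dropped every
    `loss_drop_interval` steps; the final layer's loss is never dropped.
    """
    dropped = min(step // config["loss_drop_interval"], config["num_layers"] - 1)
    return list(range(dropped, config["num_layers"]))

# Example: after 1M steps, the losses of the first 16 layers have been dropped.
print(len(active_intermediate_losses(1_000_000)))  # 48 layers still contribute
```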