Adaptive Input Representations for Neural Language Modeling
Authors: Alexei Baevski, Michael Auli
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that models equipped with adaptive embeddings are more than twice as fast to train than the popular character input CNN while having a lower number of parameters. On the WIKITEXT-103 benchmark we achieve 18.7 perplexity, an improvement of 10.5 perplexity compared to the previously best published result and on the BILLION WORD benchmark, we achieve 23.02 perplexity. (A minimal sketch of adaptive input embeddings follows the table.) |
| Researcher Affiliation | Industry | Alexei Baevski & Michael Auli, Facebook AI Research, Menlo Park, CA, USA |
| Pseudocode | No | The paper describes the architecture and processes in text and with diagrams, but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and pre-trained models available at http://github.com/pytorch/fairseq |
| Open Datasets | Yes | We experiment on the BILLION WORD benchmark and WIKITEXT-103. BILLION WORD contains 768M word tokens and has a vocabulary of about 800K word types, which corresponds to words with more than 3 occurrences in the training set (Chelba et al., 2013). The training data of WIKITEXT-103 comprises about 100M tokens and a vocabulary of around 260K, corresponding to types with more than 3 occurrences in the training data (Merity et al., 2016). (A sketch of this occurrence-threshold vocabulary construction follows the table.) |
| Dataset Splits | Yes | We tuned this choice on the validation set (Appendix A). We take care to score all tokens in the test and validation sets. |
| Hardware Specification | Yes | We run experiments on DGX-1 machines with 8 NVIDIA V100 GPUs and machines are interconnected by InfiniBand. |
| Software Dependencies | No | The paper mentions 'NCCL2 library' and 'torch.distributed package' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We use a dropout rate of 0.1 and attention dropout of 0.1 for BILLION WORD models, and increase regularization for WIKITEXT-103 by using dropout 0.3, ReLU dropout 0.1, and attention dropout 0.1. We use Nesterov's accelerated gradient method (Sutskever et al., 2013) with a momentum of 0.99 and we renormalize gradients if their norm exceeds 0.1 (Pascanu et al., 2013). The learning rate is linearly warmed up from 10⁻⁷ to 1 for 16K steps and then annealed using a cosine learning rate schedule with C cycles. (A sketch of this optimizer and schedule follows the table.) |
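The abstract quoted in the Research Type row hinges on adaptive input embeddings: the vocabulary is split into frequency-ordered bands, rarer bands get smaller embedding dimensions, and each band is projected up to the model dimension so the transformer sees a uniform input size. Below is a minimal PyTorch sketch of that idea, assuming illustrative cumulative band cutoffs and a capacity-reduction factor of 4; the class name, cutoffs, and masking loop are ours, not the paper's reference implementation in fairseq.

```python
import torch
import torch.nn as nn

class AdaptiveInput(nn.Module):
    """Sketch of adaptive input embeddings: frequency-ordered vocabulary bands
    get progressively smaller embedding dimensions, and each band is projected
    back up to the model dimension."""

    def __init__(self, vocab_size, model_dim, cutoffs=(20000, 60000), factor=4):
        super().__init__()
        # Cumulative band boundaries: [0, c1), [c1, c2), ..., [c_last, vocab_size).
        self.cutoffs = list(cutoffs) + [vocab_size]
        self.embeddings = nn.ModuleList()
        self.projections = nn.ModuleList()
        prev = 0
        for i, cutoff in enumerate(self.cutoffs):
            dim = model_dim // (factor ** i)  # embedding capacity shrinks per band
            self.embeddings.append(nn.Embedding(cutoff - prev, dim))
            self.projections.append(nn.Linear(dim, model_dim, bias=False))
            prev = cutoff

    def forward(self, tokens):
        out = tokens.new_zeros(*tokens.shape, self.projections[0].out_features,
                               dtype=torch.float)
        prev = 0
        for emb, proj, cutoff in zip(self.embeddings, self.projections, self.cutoffs):
            mask = (tokens >= prev) & (tokens < cutoff)
            if mask.any():
                # Shift indices into the band, embed at the band's width, project up.
                out[mask] = proj(emb(tokens[mask] - prev))
            prev = cutoff
        return out
```

With a vocabulary of around 260K and `model_dim=1024`, this yields band dimensions of 1024/256/64, consistent with the factor-of-4 capacity reduction the paper describes for WIKITEXT-103. The paper also shares these band embeddings and projections with the adaptive softmax output layer; the sketch omits that weight tying.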
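Both vocabularies in the Open Datasets row correspond to word types with more than 3 occurrences in the training data. A minimal sketch of that thresholding, assuming a plain token iterator and a single `<unk>` type (both illustrative details, not taken from the paper):

```python
from collections import Counter

def build_vocab(train_tokens, min_occurrences=3):
    """Keep word types with more than `min_occurrences` training occurrences,
    mapping everything else to a single <unk> type (illustrative)."""
    counts = Counter(train_tokens)
    kept = [w for w, c in counts.most_common() if c > min_occurrences]
    # Index 0 reserved for <unk>; frequency order matters because the
    # adaptive input/output bands are defined over frequency-sorted indices.
    return {"<unk>": 0, **{w: i + 1 for i, w in enumerate(kept)}}
```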
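The Experiment Setup row combines Nesterov momentum SGD, gradient renormalization at norm 0.1, and a linear warmup from 10⁻⁷ to 1 over 16K steps followed by cosine annealing. A minimal PyTorch sketch under those stated hyperparameters; the cycle length, function names, and per-step loop are illustrative assumptions, and the paper's actual schedule runs over C cycles with further details in the text.

```python
import math
import torch

def make_optimizer(model):
    # Nesterov accelerated gradient with momentum 0.99, as stated in the paper.
    return torch.optim.SGD(model.parameters(), lr=1.0, momentum=0.99, nesterov=True)

def learning_rate(step, warmup_steps=16000, peak_lr=1.0, min_lr=1e-7,
                  cycle_steps=270000):
    """Linear warmup from 1e-7 to 1, then cosine annealing.
    A single cycle length is shown; cycle_steps is an illustrative value."""
    if step < warmup_steps:
        return min_lr + (peak_lr - min_lr) * step / warmup_steps
    t = (step - warmup_steps) % cycle_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t / cycle_steps))

def train_step(model, optimizer, loss, step):
    optimizer.zero_grad()
    loss.backward()
    # Renormalize gradients if their norm exceeds 0.1 (Pascanu et al., 2013).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
    for group in optimizer.param_groups:
        group["lr"] = learning_rate(step)
    optimizer.step()
```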