Adaptive Input Representations for Neural Language Modeling

Authors: Alexei Baevski, Michael Auli

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that models equipped with adaptive embeddings are more than twice as fast to train as the popular character input CNN while having a lower number of parameters. On the WIKITEXT-103 benchmark we achieve 18.7 perplexity, an improvement of 10.5 perplexity compared to the previously best published result, and on the BILLION WORD benchmark we achieve 23.02 perplexity. (A sketch of the adaptive input embedding idea follows the table.)
Researcher Affiliation | Industry | Alexei Baevski & Michael Auli, Facebook AI Research, Menlo Park, CA, USA
Pseudocode | No | The paper describes the architecture and processes in text and with diagrams, but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | Code and pre-trained models available at http://github.com/pytorch/fairseq
Open Datasets | Yes | We experiment on the BILLION WORD benchmark and WIKITEXT-103. BILLION WORD contains 768M word tokens and has a vocabulary of about 800K word types, which corresponds to words with more than 3 occurrences in the training set (Chelba et al., 2013). The training data of WIKITEXT-103 comprises about 100M tokens and a vocabulary of around 260K, corresponding to types with more than 3 occurrences in the training data (Merity et al., 2016). (A vocabulary-cutoff sketch follows the table.)
Dataset Splits | Yes | We tuned this choice on the validation set (Appendix A). We take care to score all tokens in the test and validation sets.
Hardware Specification | Yes | We run experiments on DGX-1 machines with 8 NVIDIA V100 GPUs and machines are interconnected by InfiniBand.
Software Dependencies | No | The paper mentions the 'NCCL2 library' and 'torch.distributed package' but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | We use a dropout rate of 0.1 and attention dropout of 0.1 for BILLION WORD models, and increase regularization for WIKITEXT-103 by using dropout 0.3, 0.1 ReLU dropout, as well as attention dropout 0.1. We use Nesterov's accelerated gradient method (Sutskever et al., 2013) with a momentum of 0.99 and we renormalize gradients if their norm exceeds 0.1 (Pascanu et al., 2013). The learning rate is linearly warmed up from 10^-7 to 1 for 16K steps and then annealed using a cosine learning rate schedule with C cycles. (A sketch of this optimizer and schedule follows the table.)
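
The "adaptive embeddings" referenced in the Research Type row are the paper's adaptive input representations: the frequency-sorted vocabulary is partitioned into bands, rarer bands get smaller embedding dimensions, and each band is linearly projected back up to the model dimension. The following is a minimal PyTorch sketch of that idea, not the authors' fairseq implementation; the band cutoffs (20K/60K) and the reduction factor of 4 are illustrative defaults.

```python
import torch
import torch.nn as nn


class AdaptiveInput(nn.Module):
    """Minimal sketch of adaptive input embeddings.

    Assumes token ids are assigned in order of decreasing frequency. Each
    frequency band has its own embedding table whose dimension shrinks by
    `factor` per band, plus a projection back to the shared `d_model`.
    """

    def __init__(self, vocab_size, d_model, cutoffs=(20000, 60000), factor=4):
        super().__init__()
        # Band boundaries, e.g. [0, 20000), [20000, 60000), [60000, vocab_size)
        self.edges = [0] + list(cutoffs) + [vocab_size]
        self.embeddings = nn.ModuleList()
        self.projections = nn.ModuleList()
        for i in range(len(self.edges) - 1):
            band_size = self.edges[i + 1] - self.edges[i]
            dim = d_model // (factor ** i)  # smaller dimensions for rarer bands
            self.embeddings.append(nn.Embedding(band_size, dim))
            self.projections.append(nn.Linear(dim, d_model, bias=False))

    def forward(self, tokens):  # tokens: LongTensor of any shape
        d_model = self.projections[0].out_features
        out = tokens.new_zeros(*tokens.shape, d_model, dtype=torch.float)
        for i in range(len(self.edges) - 1):
            lo, hi = self.edges[i], self.edges[i + 1]
            mask = (tokens >= lo) & (tokens < hi)
            if mask.any():
                # Re-index into the band-local table, then project up to d_model.
                local = self.embeddings[i](tokens[mask] - lo)
                out[mask] = self.projections[i](local)
        return out
```

The sketch covers only the input side; in the paper the output layer is an adaptive softmax with the same band structure, and input and output weights can be shared.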
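
Both benchmark vocabularies in the Open Datasets row are built with a frequency cutoff, keeping word types that occur more than 3 times in the training data. A minimal preprocessing sketch, with a hypothetical `train_tokens` iterable standing in for the actual corpus reader, might look like this:

```python
from collections import Counter


def build_vocab(train_tokens, min_occurrences=3):
    """Keep word types with more than `min_occurrences` training occurrences,
    sorted by decreasing frequency (the ordering assumed by adaptive embeddings)."""
    counts = Counter(train_tokens)
    kept = [w for w, c in counts.most_common() if c > min_occurrences]
    # Map each kept type to an id; everything else falls back to <unk>.
    word2id = {"<unk>": 0}
    for w in kept:
        word2id[w] = len(word2id)
    return word2id
```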
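
The Experiment Setup row translates fairly directly into an optimizer and learning-rate schedule: Nesterov momentum 0.99, gradient norm clipping at 0.1, linear warmup from 10^-7 to 1 over 16K steps, then cosine annealing. Below is a hedged PyTorch sketch of that recipe; the model, the batch interface, and the cycle length `cycle_steps` are placeholders, and the authors' actual fairseq training loop differs in detail.

```python
import math
import torch


def make_optimizer(model):
    # Nesterov accelerated gradient with momentum 0.99; the lr is overwritten
    # every step by the schedule below, so its initial value is a placeholder.
    return torch.optim.SGD(model.parameters(), lr=1.0, momentum=0.99, nesterov=True)


def lr_at_step(step, warmup_steps=16000, max_lr=1.0, min_lr=1e-7, cycle_steps=100000):
    """Linear warmup from 1e-7 to 1.0 over 16K steps, then cosine annealing.

    `cycle_steps` is an assumed cycle length; the paper anneals over C cycles
    whose exact lengths are not hard-coded here.
    """
    if step < warmup_steps:
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    t = (step - warmup_steps) % cycle_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t / cycle_steps))


def train_step(model, batch, optimizer, step):
    optimizer.zero_grad()
    loss = model(batch)  # assumes the model returns its training loss
    loss.backward()
    # Renormalize gradients if their norm exceeds 0.1.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
    for group in optimizer.param_groups:  # apply the scheduled learning rate
        group["lr"] = lr_at_step(step)
    optimizer.step()
    return loss.item()
```

The dropout settings from the same row (0.1 for BILLION WORD; 0.3 dropout, 0.1 ReLU dropout, and 0.1 attention dropout for WIKITEXT-103) would be passed to the model itself rather than to the optimizer.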