Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Adaptive Input Representations for Neural Language Modeling
Authors: Alexei Baevski, Michael Auli
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that models equipped with adaptive embeddings are more than twice as fast to train than the popular character input CNN while having a lower number of parameters. On the WIKITEXT-103 benchmark we achieve 18.7 perplexity, an improvement of 10.5 perplexity compared to the previously best published result and on the BILLION WORD benchmark, we achieve 23.02 perplexity. |
| Researcher Affiliation | Industry | Alexei Baevski & Michael Auli Facebook AI Research, Menlo Park, CA, USA |
| Pseudocode | No | The paper describes the architecture and processes in text and with diagrams, but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and pre-trained models available at http://github.com/pytorch/fairseq |
| Open Datasets | Yes | We experiment on the BILLION WORD benchmark and WIKITEXT-103. BILLION WORD contains 768M word tokens and has a vocabulary of about 800K word types, which corresponds to words with more than 3 occurrences in the training set (Chelba et al., 2013). The training data of WIKITEXT-103 comprises about 100M tokens and a vocabulary of around 260K, corresponding to types with more than 3 occurrences in the training data (Merity et al., 2016). |
| Dataset Splits | Yes | We tuned this choice on the validation set (Appendix A). We take care to score all tokens in the test and validation sets. |
| Hardware Specification | Yes | We run experiments on DGX-1 machines with 8 NVIDIA V100 GPUs and machines are interconnected by InfiniBand. |
| Software Dependencies | No | The paper mentions 'NCCL2 library' and 'torch.distributed package' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We use a dropout rate of 0.1 and attention dropout of 0.1 for BILLION WORD models, and increase regularization for WIKITEXT-103 by using dropout 0.3, and 0.1 ReLU dropout as well as attention dropout 0.1. We use Nesterov's accelerated gradient method (Sutskever et al., 2013) with a momentum of 0.99 and we renormalize gradients if their norm exceeds 0.1 (Pascanu et al., 2013). The learning rate is linearly warmed up from 10⁻⁷ to 1 for 16K steps and then annealed using a cosine learning rate schedule with C cycles. |
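
The optimization settings quoted in the Experiment Setup row can be illustrated with a minimal sketch. It is not the authors' fairseq implementation. The warmup endpoints (10⁻⁷ to 1 over 16K steps) and the gradient-norm threshold (0.1) come from the quoted text. The function names, the cycle length `cycle_steps`, the floor `min_lr`, and the choice to restart each cosine cycle at the peak rate are assumptions made for illustration.

```python
import math


def lr_at_step(step, warmup_steps=16_000, init_lr=1e-7, peak_lr=1.0,
               min_lr=0.0, cycle_steps=100_000):
    """Linear warmup followed by cosine annealing.

    warmup_steps, init_lr and peak_lr follow the quoted setup.
    min_lr and cycle_steps are hypothetical values; the quoted text
    only says the schedule uses "C cycles".
    """
    if step < warmup_steps:
        # Linear interpolation from init_lr to peak_lr.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # Position inside the current cosine cycle (assumed to restart at peak_lr).
    t = (step - warmup_steps) % cycle_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / cycle_steps))
    return min_lr + (peak_lr - min_lr) * cosine


def renormalize(grads, max_norm=0.1):
    """Rescale a flat list of gradient values if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)


if __name__ == "__main__":
    for s in (0, 8_000, 16_000, 66_000):
        print(s, lr_at_step(s))
    print(renormalize([0.3, 0.4]))
```

In a PyTorch training loop the same renormalization is usually done with `torch.nn.utils.clip_grad_norm_`; the pure-Python version above is only meant to make the arithmetic visible.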