Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Ouroboros: On Accelerating Training of Transformer-Based Language Models

Authors: Qian Yang, Zhouyuan Huo, Wenlin Wang, Lawrence Carin

NeurIPS 2019 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on Transformer and Transformer-XL language models demonstrate that the proposed algorithm obtains a much faster speedup beyond data parallelism, with comparable or better accuracy.
Researcher Affiliation | Academia | 1 Duke University; 2 University of Pittsburgh
Pseudocode | Yes | Algorithm 1 Ouroboros + SGD
Open Source Code | Yes | Code to reproduce experiments is to be found at https://github.com/LaraQianYang/Ouroboros.
Open Datasets | Yes | (i) enwiki8, containing 100M bytes of unprocessed Wikipedia text [33]; (ii) text8, containing 100M processed lower-case Wikipedia characters and removing any character other than the 26 letters a through z, and space [33]; and (iii) WikiText-103, the largest available word-level language modeling benchmark with long-term dependency [34].
Dataset Splits | No | The paper mentions using training and test datasets but does not provide specific details on validation splits (percentages or counts) or explicitly describe a validation set setup.
Hardware Specification | Yes | All experiments are performed on a machine with 4 TESLA V100 GPUs.
Software Dependencies | No | The paper mentions 'PyTorch' and 'Python3' as software used but does not specify their version numbers.
Experiment Setup | Yes | According to [16], we use the Adam optimizer, where β1 = 0.9, β2 = 0.999 and ε = 1e-8 [28]. For comparison, we use Ouroboros+Adam (see Appendix) in the experiments. The learning rate is set to be 0.00025 and it decreases following a cosine learning rate schedule [35].
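
To make the quoted Experiment Setup concrete, the following is a minimal sketch of a cosine learning rate schedule and a single scalar Adam update using the stated values (β1 = 0.9, β2 = 0.999, ε = 1e-8, initial learning rate 0.00025). It is not the authors' code and does not implement Ouroboros+Adam. The function names `cosine_lr` and `adam_step`, the `total_steps` argument, and the floor `min_lr = 0.0` are assumptions made for illustration; the summary above does not state the schedule length or a minimum learning rate.

```python
import math

# Hyperparameters quoted in the Experiment Setup row.
ADAM_BETA1 = 0.9
ADAM_BETA2 = 0.999
ADAM_EPS = 1e-8
BASE_LR = 0.00025


def cosine_lr(step, total_steps, base_lr=BASE_LR, min_lr=0.0):
    """Cosine-annealed learning rate at a 0-indexed step.

    total_steps and min_lr are assumptions; the summary does not state them.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so the rate stays at min_lr once the schedule ends.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def adam_step(param, grad, m, v, t, lr):
    """One plain Adam update for a scalar parameter (t is the 1-indexed step)."""
    m = ADAM_BETA1 * m + (1.0 - ADAM_BETA1) * grad
    v = ADAM_BETA2 * v + (1.0 - ADAM_BETA2) * grad * grad
    # Bias correction for the zero-initialised moment estimates.
    m_hat = m / (1.0 - ADAM_BETA1 ** t)
    v_hat = v / (1.0 - ADAM_BETA2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + ADAM_EPS)
    return param, m, v
```

A training loop would call `cosine_lr(step, total_steps)` once per iteration and pass the result as `lr` to the optimizer step. In PyTorch, which the paper reports using, the same configuration would presumably be expressed with the built-in Adam optimizer and a cosine annealing scheduler rather than hand-written updates.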