Ouroboros: On Accelerating Training of Transformer-Based Language Models

Authors: Qian Yang, Zhouyuan Huo, Wenlin Wang, Lawrence Carin

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on Transformer and Transformer-XL language models demonstrate that the proposed algorithm obtains a much faster speedup beyond data parallelism, with comparable or better accuracy.
Researcher Affiliation | Academia | 1 Duke University; 2 University of Pittsburgh
Pseudocode | Yes | Algorithm 1 Ouroboros + SGD
Open Source Code | Yes | Code to reproduce experiments is to be found at https://github.com/LaraQianYang/Ouroboros.
Open Datasets | Yes | (i) enwik8, containing 100M bytes of unprocessed Wikipedia text [33]; (ii) text8, containing 100M processed lower-case Wikipedia characters, with any character other than the 26 letters a through z and space removed [33]; and (iii) WikiText-103, the largest available word-level language modeling benchmark with long-term dependency [34].
Dataset Splits | No | The paper mentions using training and test datasets but does not provide specific details on validation splits (percentages or counts) or explicitly describe a validation set setup.
Hardware Specification | Yes | All experiments are performed on a machine with 4 TESLA V100 GPUs.
Software Dependencies | No | The paper mentions 'PyTorch' and 'Python3' as software used but does not specify their version numbers.
Experiment Setup | Yes | According to [16], we use the Adam optimizer, where β1 = 0.9, β2 = 0.999 and ε = 1e-8 [28]. For comparison, we use Ouroboros+Adam (see Appendix) in the experiments. The learning rate is set to 0.00025 and it decreases following a cosine learning rate schedule [35].
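To make the quoted experiment setup concrete, below is a minimal PyTorch sketch of standard Adam with the stated hyperparameters (β1 = 0.9, β2 = 0.999, ε = 1e-8, learning rate 0.00025) paired with a cosine learning-rate schedule. This is not the authors' code: the paper actually uses its Ouroboros+Adam variant (see its Appendix), and the model, step count, and dummy data here are placeholders rather than values from the paper.

```python
# Minimal sketch (not the authors' implementation): standard PyTorch Adam with
# the hyperparameters quoted in the Experiment Setup row, plus a cosine
# learning-rate schedule. `model`, `total_steps`, and the dummy batches are
# placeholders, not taken from the paper.
import torch

model = torch.nn.Linear(512, 512)   # stand-in for the Transformer(-XL) model
total_steps = 1000                  # assumption: schedule length is not given in this section

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.00025,             # learning rate from the quoted setup
    betas=(0.9, 0.999),     # beta1, beta2 from the quoted setup
    eps=1e-8,               # epsilon from the quoted setup
)

# Cosine schedule [35]: decays the learning rate from 0.00025 toward 0 over total_steps.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    x = torch.randn(32, 512)        # dummy batch in place of real language-model data
    loss = model(x).pow(2).mean()   # dummy loss in place of the LM training objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

In the paper's setting the same optimizer and schedule settings would sit on top of the Ouroboros model-splitting scheme and data parallelism across the 4 V100 GPUs listed above; that machinery is omitted here for brevity.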