Ouroboros: On Accelerating Training of Transformer-Based Language Models
Authors: Qian Yang, Zhouyuan Huo, Wenlin Wang, Lawrence Carin
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on Transformer and Transformer-XL language models demonstrate that the proposed algorithm obtains a much faster speedup beyond data parallelism, with comparable or better accuracy. |
| Researcher Affiliation | Academia | Duke University; University of Pittsburgh |
| Pseudocode | Yes | Algorithm 1 Ouroboros + SGD (a toy delayed-gradient sketch follows the table) |
| Open Source Code | Yes | Code to reproduce experiments is to be found at https://github.com/LaraQianYang/Ouroboros. |
| Open Datasets | Yes | (i) enwik8, containing 100M bytes of unprocessed Wikipedia text [33]; (ii) text8, containing 100M processed lower-case Wikipedia characters, with any character other than the 26 letters a through z and space removed [33]; and (iii) WikiText-103, the largest available word-level language modeling benchmark with long-term dependency [34]. |
| Dataset Splits | No | The paper mentions using training and test datasets but does not provide specific details on validation splits (percentages or counts) or explicitly describe a validation set setup. |
| Hardware Specification | Yes | All experiments are performed on a machine with 4 TESLA V100 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch' and 'Python3' as software used but does not specify their version numbers. |
| Experiment Setup | Yes | According to [16], we use the Adam optimizer, where β1 = 0.9, β2 = 0.999 and ε = 1e-8 [28]. For comparison, we use Ouroboros+Adam (see Appendix) in the experiments. The learning rate is set to 0.00025 and decays following a cosine learning rate schedule [35] (a hedged configuration sketch follows the table). |
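The Pseudocode row refers to the paper's Algorithm 1 (Ouroboros + SGD), which is not reproduced here. As a rough, hedged illustration of the delayed-gradient style of update that staleness-tolerant model-parallel training relies on, the toy loop below applies SGD steps using gradients that are `delay` iterations stale. The quadratic objective, the `delay` value, and all names are illustrative assumptions, not details taken from the paper.

```python
# Toy sketch (not the paper's Algorithm 1): SGD on a 1-D quadratic
# f(w) = 0.5 * (w - 3)^2, where each update uses a gradient computed
# at a parameter value that is `delay` iterations old, mimicking the
# staleness introduced when model partitions update asynchronously.
from collections import deque

def grad(w):
    # Gradient of f(w) = 0.5 * (w - 3)^2
    return w - 3.0

def delayed_sgd(w0=0.0, lr=0.1, delay=3, steps=50):
    w = w0
    stale = deque([grad(w0)] * delay, maxlen=delay)  # queue of stale gradients
    for _ in range(steps):
        g = stale.popleft()      # use the oldest (delayed) gradient
        w = w - lr * g           # plain SGD step
        stale.append(grad(w))    # enqueue the gradient at the current point
    return w

if __name__ == "__main__":
    print(delayed_sgd())  # approaches the minimizer w* = 3 despite the staleness
```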
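For the Experiment Setup row, the quoted hyperparameters map directly onto a standard PyTorch optimizer/scheduler configuration. The sketch below is a minimal illustration using only the stated values (β1 = 0.9, β2 = 0.999, ε = 1e-8, learning rate 0.00025, cosine decay); the model, the number of decay steps `T_max`, and the dummy loss are placeholders, since the excerpt does not specify them.

```python
# Minimal sketch of the reported optimizer settings in PyTorch.
# The model and T_max are placeholders; only the Adam hyperparameters
# and the cosine schedule come from the paper's description.
import torch
import torch.nn as nn

model = nn.Linear(512, 512)  # stand-in for the Transformer / Transformer-XL model

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.00025,            # learning rate reported in the paper
    betas=(0.9, 0.999),    # beta1, beta2 as reported
    eps=1e-8,              # epsilon as reported
)

# Cosine learning-rate schedule; T_max (total decay steps) is an assumption.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200_000)

for step in range(10):  # training-loop skeleton with a dummy loss
    optimizer.zero_grad()
    loss = model(torch.randn(8, 512)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
```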