Alternating Updates for Efficient Transformers
Authors: Cenk Baykal, Dylan Cutler, Nishanth Dikkala, Nikhil Ghosh, Rina Panigrahy, Xin Wang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on benchmark transformer models and language tasks demonstrate the consistent effectiveness of AltUp on a diverse set of scenarios. |
| Researcher Affiliation | Collaboration | Cenk Baykal (Google Research), Dylan Cutler (Google Research), Nishanth Dikkala (Google Research), Nikhil Ghosh (UC Berkeley), Rina Panigrahy (Google Research), Xin Wang (Google Research) |
| Pseudocode | Yes | Algorithm 1: Alternating Updates (AltUp) Layer (a sketch of this layer follows the table) |
| Open Source Code | No | The paper does not explicitly state that its source code is open-source or provide a link to a code repository. |
| Open Datasets | Yes | We performed all of our experiments using T5-model architectures [37] of varying sizes (small, base, large, and 3B) which we pretrained on the C4 dataset for 500,000 steps with a batch size of 256. The pretrained models were then finetuned on either the GLUE [50], SuperGLUE (SG) [49], SQuAD [39] or Trivia-QA (closed-book) [23, 42] benchmark tasks |
| Dataset Splits | Yes | We report both pretraining and finetuning metrics: for pretraining, we report span prediction accuracy on a hold-out validation set, and for finetuning, we follow the same recipe as the T5 models, see [37] for more details. |
| Hardware Specification | Yes | Latency is measured on TPUv3 with 8 cores. |
| Software Dependencies | No | The paper mentions software components like T5X and Adafactor optimizer, but does not provide specific version numbers for any software or libraries. |
| Experiment Setup | Yes | We performed all of our experiments using T5-model architectures...pretrained on the C4 dataset for 500,000 steps with a batch size of 256. The pretrained models were then finetuned...for a further 50,000 steps with a batch size of 256. During pretraining, we use 256 batch size, Adafactor optimizer [46] with base learning rate 1.0 and reciprocal square-root decay with 10000 warmup steps, and zero dropout. During finetuning, we use 256 batch size, Adafactor optimizer with constant learning rate of 0.001 and 0.1 dropout. (The pretraining learning-rate schedule is sketched after the table.) |
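
The Pseudocode row above points to the paper's Algorithm 1, the AltUp layer. The snippet below is a minimal sketch of that predict-compute-correct structure, not the authors' implementation: it assumes round-robin selection of the activated sub-block and learned scalar prediction/correction coefficients, and all names (`altup_layer`, `transformer_layer`, `p`, `g`) are illustrative rather than taken from the paper or from T5X.

```python
import numpy as np

def altup_layer(blocks, transformer_layer, p, g, layer_idx):
    """One AltUp layer (sketch): predict every sub-block, run the real
    transformer layer on only one of them, then correct all predictions."""
    K = len(blocks)
    i = layer_idx % K  # activated sub-block; round-robin is an assumption here

    # Predict: each new sub-block as a learned linear mix of the old sub-blocks.
    predicted = [sum(p[j][k] * blocks[k] for k in range(K)) for j in range(K)]

    # Compute: the expensive transformer layer runs on the activated block only.
    computed = transformer_layer(blocks[i])

    # Correct: push every prediction toward the computed output, scaled by g.
    return [predicted[j] + g[j] * (computed - predicted[i]) for j in range(K)]

# Toy usage: K = 2 sub-blocks of width d = 4, identity stand-in for the layer.
blocks = [np.ones((3, 4)), np.zeros((3, 4))]
p = [[0.9, 0.1], [0.1, 0.9]]   # prediction coefficients (illustrative values)
g = [1.0, 0.5]                 # correction coefficients (illustrative values)
new_blocks = altup_layer(blocks, lambda x: x, p, g, layer_idx=0)
```

With K = 2 sub-blocks the representation is twice as wide, yet each layer still processes a single d-dimensional block, which is where the paper's efficiency claim comes from.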
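
The Experiment Setup row quotes Adafactor with base learning rate 1.0, reciprocal square-root decay, and 10000 warmup steps for pretraining, which matches the standard T5 recipe. The helper below sketches that schedule under the usual T5 form lr(step) = base_lr / sqrt(max(step, warmup_steps)); the exact warmup behaviour is an assumption and the function name is illustrative. Finetuning, per the quote, uses a constant learning rate of 0.001, so no schedule is needed there.

```python
import math

def inverse_sqrt_schedule(step, base_lr=1.0, warmup_steps=10_000):
    """Reciprocal square-root decay as in the T5 recipe (assumed form):
    constant at base_lr / sqrt(warmup_steps) while step < warmup_steps,
    then decaying as base_lr / sqrt(step)."""
    return base_lr / math.sqrt(max(step, warmup_steps))

# With the quoted settings (base_lr = 1.0, 10000 warmup steps):
#   step 1       -> 0.01
#   step 10_000  -> 0.01
#   step 500_000 -> ~0.0014   (end of the 500,000-step pretraining run)
```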